Querying Text Databases for
Efficient Information Extraction
Eugene Agichtein
Luis Gravano
Columbia University
Extracting Structured Information "Buried" in Text Documents

Example documents:
• "Microsoft's central headquarters in Redmond is home to almost every product group and division."
• "Brent Barlow, 27, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for 'thinking a little too different.'"
• "Apple's programmers 'think different' on a 'campus' in Cupertino, Cal. Nike employees 'just do it' at what the company refers to as its 'World Campus,' near Portland, Ore."

Extracted relation:
Organization     Location
Microsoft        Redmond
Apple Computer   Cupertino
Nike             Portland
Information Extraction Applications
• Over a corporation’s customer report or
email complaint database: enabling
sophisticated querying and analysis
• Over biomedical literature: identifying
drug/condition interactions
• Over newspaper archives: tracking disease
outbreaks, terrorist attacks; intelligence
Significant progress over the last decade [MUC]
Information Extraction Example: Organizations' Headquarters

Input: documents, e.g., doc4:
  "Brent Barlow, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for 'thinking a little too different.'"

Named-entity tagging (doc4):
  <PERSON>Brent Barlow</PERSON>, a software analyst and beta-tester at <ORGANIZATION>Apple Computer</ORGANIZATION>'s headquarters in <LOCATION>Cupertino</LOCATION>, was fired Monday for "thinking a little too different."

Pattern matching, with extraction patterns such as:
  p1: <ORGANIZATION>'s headquarters in <LOCATION>
  p2: <ORGANIZATION>, based in <LOCATION>

Output: tuples, e.g., <ORGANIZATION> = Apple Computer, <LOCATION> = Cupertino, matched by pattern p1 in doc4.

Resulting relation:
tid  Organization    Location   W
1    Eastman Kodak   Rochester  0.9
2    Apple Computer  Cupertino  0.8

Documents that contribute tuples (e.g., doc2, doc4) are "useful."
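The pattern-matching step can be illustrated with a short sketch; the regular expressions below are hypothetical stand-ins for patterns p1 and p2 (not the patterns actually used by the extraction systems in this talk), applied to already entity-tagged text:

    import re

    # Hypothetical stand-ins for patterns p1 and p2; they operate on text that
    # has already been tagged with <ORGANIZATION> and <LOCATION> entities.
    PATTERNS = {
        "p1": re.compile(r"<ORGANIZATION>(.+?)</ORGANIZATION>'s\s+headquarters\s+in\s+"
                         r"<LOCATION>(.+?)</LOCATION>"),
        "p2": re.compile(r"<ORGANIZATION>(.+?)</ORGANIZATION>,\s+based\s+in\s+"
                         r"<LOCATION>(.+?)</LOCATION>"),
    }

    def extract_headquarters(tagged_text):
        """Return (organization, location, pattern_id) tuples found in the text."""
        tuples = []
        for pid, pattern in PATTERNS.items():
            for org, loc in pattern.findall(tagged_text):
                tuples.append((org.strip(), loc.strip(), pid))
        return tuples

    doc4 = ("<PERSON>Brent Barlow</PERSON>, a software analyst and beta-tester at "
            "<ORGANIZATION>Apple Computer</ORGANIZATION>'s headquarters in "
            "<LOCATION>Cupertino</LOCATION>, was fired Monday ...")
    print(extract_headquarters(doc4))   # [('Apple Computer', 'Cupertino', 'p1')]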
Goal: Extract All Tuples of a Relation from a Document Database

[Diagram: the entire document database is fed to the information extraction system, which outputs the extracted tuples.]

• One approach: feed every document to the information extraction system
• Problem: efficiency!
Information Extraction is Expensive
• Efficiency is a problem even after the information extraction system has been trained
  – Example: NYU's Proteus extraction system takes around 9 seconds per document
  – Over 15 days to process 135,000 news articles
• "Filtering" documents before further processing might help
• Can't afford to "scan the web" to process each page!
• "Hidden-Web" databases don't allow crawling
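A rough back-of-the-envelope check of this figure: at around 9-10 seconds per document, 135,000 documents take roughly 135,000 × 9-10 s ≈ 1.2-1.35 million seconds, i.e., about 14-16 days of sequential processing, consistent with the 15.5-day full-scan time reported later in the talk.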
Information Extraction Without
Processing All Documents
• Observation: Often only small fraction of
database is relevant for an extraction task
• Our approach: Exploit database search
engine to retrieve and process only
“promising” documents
Architecture of our QXtract System

[Diagram of the QXtract pipeline:
 user-provided seed tuples (e.g., <Microsoft, Redmond>, <Apple, Cupertino>)
 → Query Generation → queries
 → Search Engine over the text database → promising documents
 → Information Extraction → extracted relation
   (e.g., <Microsoft, Redmond>, <Apple, Cupertino>, <Exxon, Irving>, <IBM, Armonk>, <Intel, Santa Clara>)]

Key problem: learn queries to retrieve "promising" documents.
Generating Queries to Retrieve Promising Documents

Starting from the user-provided seed tuples:
1. Get a document sample with "likely negative" and "likely positive" examples (seed sampling against the search engine over the text database).
2. Label the sample documents using the information extraction system as an "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier models/rules.
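A minimal end-to-end sketch of these four steps in Python; every interface used here (search_engine, extractor, and the helper functions) is a hypothetical placeholder rather than the actual QXtract implementation, and the helpers are sketched under the corresponding slides below:

    def qxtract(seed_tuples, search_engine, extractor, max_fraction=0.10):
        """Sketch of the QXtract pipeline; documents are plain text strings."""
        # 1. Sample likely-positive / likely-negative documents via queries.
        sample = get_training_sample(seed_tuples, search_engine)
        # 2. Label the sample using the extraction system as an "oracle".
        labeled = label_sample(sample, extractor)
        # 3. Train a word-based classifier and turn it into keyword queries.
        vectorizer, clf = train_word_classifier(labeled)
        queries = generate_queries(vectorizer, clf)
        # 4. Retrieve at most MaxFractionRetrieved documents and extract from them.
        budget = int(max_fraction * search_engine.num_documents())
        relation, seen = set(), set()
        for query in queries:
            for doc in search_engine.search(query):
                if len(seen) >= budget:
                    return relation
                if doc not in seen:
                    seen.add(doc)
                    relation.update(extractor.extract(doc))
        return relation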
Getting a Training Document Sample

Get a document sample with "likely positive" and "likely negative" examples:
• Queries built from the user-provided seed tuples (e.g., Microsoft AND Redmond, Apple AND Cupertino) retrieve likely-positive documents.
• "Random" queries retrieve a sample of mostly likely-negative documents.
Both kinds of queries are issued to the search engine over the text database; the retrieved documents are as yet unlabeled.
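A sketch of this sampling step, continuing the hypothetical interfaces from the pipeline sketch above (in particular, the vocabulary() call used to build the "random" queries is an assumption):

    import random

    def get_training_sample(seed_tuples, search_engine, per_query=100, n_random=5):
        """Retrieve likely-positive documents via seed-tuple queries and
        likely-negative documents via random single-word queries."""
        sample = []
        # Likely positive: documents matching all attributes of a seed tuple,
        # e.g., "Microsoft AND Redmond".
        for t in seed_tuples:
            sample.extend(search_engine.search(" AND ".join(t))[:per_query])
        # Likely negative: documents matching random vocabulary words; since
        # the relation is sparse, most of these contain no tuples.
        for word in random.sample(search_engine.vocabulary(), n_random):
            sample.extend(search_engine.search(word)[:per_query])
        return sample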
Labeling the Training Document Sample

Use the information extraction system as an "oracle" to label the sampled documents: a document from which tuples are extracted (e.g., <Microsoft, Redmond>, <Apple, Cupertino>, <IBM, Armonk>) is labeled a "true positive"; a document that yields no tuples is labeled a "true negative."
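The oracle-labeling step, as a sketch under the same hypothetical extractor interface:

    def label_sample(sample_docs, extractor):
        """Run the (expensive) extraction system on each sampled document and
        label it positive if it produces at least one tuple."""
        labeled = []
        for doc in sample_docs:
            tuples = extractor.extract(doc)            # oracle call
            labeled.append((doc, len(tuples) > 0))     # True = useful document
        return labeled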
Training Classifiers to Recognize "Useful" Documents

Document features: words, e.g., is, based, in, near, city, spokesperson, reported, news, earnings, release, products, made, used, exported, far, past, old, homerun, sponsored, event.

Three classifiers are trained on the labeled sample:
• Ripper: learns rules such as "based AND near => Useful."
• SVM: learns a weighted term vector (over terms such as based, near, earnings, spokesperson, sponsored, homerun, event, far, is) separating the "+" documents from the "-" ones.
• Okapi (IR): ranks terms by how well they indicate useful documents (e.g., based: 3, spokesperson: 2, sponsored: -1).
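As an illustration of this step only, here is a simplified stand-in that trains a single linear SVM over bag-of-words features with scikit-learn (the talk's configuration actually combines Ripper, SVM, and Okapi):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    def train_word_classifier(labeled_docs):
        """labeled_docs: list of (document_text, is_useful) pairs from the
        oracle step. Returns a fitted vectorizer and linear classifier whose
        per-term weights can later be turned into keyword queries."""
        texts = [text for text, _ in labeled_docs]
        labels = [int(useful) for _, useful in labeled_docs]
        vectorizer = CountVectorizer(binary=True, stop_words="english")
        clf = LinearSVC().fit(vectorizer.fit_transform(texts), labels)
        return vectorizer, clf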
Generating Queries from Classifiers

• Ripper: each learned rule becomes a conjunctive query (the rule "based AND near => Useful" yields the query based AND near).
• SVM: the most positively weighted terms become queries (e.g., spokesperson, earnings).
• Okapi (IR): the top-ranked terms become queries (e.g., based, spokesperson).
• QCombined: the union of the queries generated from the individual classifiers (e.g., based AND near, spokesperson, based).
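Continuing the scikit-learn stand-in, single-keyword queries can be read off a linear model by taking its most positively weighted terms (the actual per-classifier mappings in QXtract differ, e.g., Ripper rules become conjunctive queries):

    def generate_queries(vectorizer, clf, n_queries=10):
        """Use the terms with the largest positive weights as keyword queries."""
        terms = vectorizer.get_feature_names_out()
        weights = clf.coef_[0]
        ranked = sorted(zip(weights, terms), reverse=True)
        return [term for weight, term in ranked[:n_queries] if weight > 0]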
Architecture of our QXtract System (recap)

[Same pipeline as before: user-provided seed tuples → query generation → queries → search engine over the text database → promising documents → information extraction → extracted relation.]
Experimental Evaluation: Data
• Training Set:
– 1996 New York Times archive of 137,000
newspaper articles
– Used to tune QXtract parameters
• Test Set:
– 1995 New York Times archive of 135,000
newspaper articles
Final Configuration of QXtract, from Training
Experimental Evaluation:
Information Extraction Systems and
Associated Relations
• DIPRE [Brin 1998]
– Headquarters(Organization, Location)
• Snowball [Agichtein and Gravano 2000]
– Headquarters(Organization, Location)
• Proteus [Grishman et al. 2002]
– DiseaseOutbreaks(DiseaseName, Location,
Country, Date, …)
Experimental Evaluation: Seed Tuples

Headquarters                   DiseaseOutbreaks
Organization   Location        DiseaseName        Location
Microsoft      Redmond         Malaria            Ethiopia
Exxon          Irving          Typhus             Bergen-Belsen
Boeing         Seattle         Flu                The Midwest
IBM            Armonk          Mad Cow Disease    The U.K.
Intel          Santa Clara     Pneumonia          The U.S.
Experimental Evaluation: Metrics
• Gold standard: relation R_all, obtained by running the information extraction system over every document in the database D_all
• Recall: % of R_all captured in the approximation extracted from the retrieved documents
• Precision: % of retrieved documents that are "useful" (i.e., produced tuples)
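Both metrics can be computed as in the following sketch (the gold-standard relation R_all and the extractor interface are the same placeholders as before):

    def recall_and_precision(r_all, retrieved_docs, extractor):
        """Recall: fraction of the gold-standard relation R_all recovered from
        the retrieved documents. Precision: fraction of retrieved documents
        that are useful, i.e., produce at least one tuple."""
        extracted, useful_docs = set(), 0
        for doc in retrieved_docs:
            tuples = set(extractor.extract(doc))
            extracted |= tuples
            useful_docs += bool(tuples)
        recall = 100.0 * len(extracted & set(r_all)) / len(r_all)
        precision = 100.0 * useful_docs / len(retrieved_docs)
        return recall, precision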
Experimental Evaluation: Relation Statistics

Relation and Extraction System   |D_all|    % Useful   |R_all|
Headquarters: Snowball           135,000    23         24,536
Headquarters: DIPRE              135,000    22         20,952
DiseaseOutbreaks: Proteus        135,000    4          8,859
Alternative Query Generation Strategies
• QXtract, with final configuration from training
• Tuples: Keep deriving queries from extracted tuples (see the sketch after this list)
– Problem: “disconnected” databases
• Patterns: Derive queries from extraction patterns from
information extraction system
– “<ORGANIZATION>, based in <LOCATION>” => “based in”
– Problems: pattern features often not suitable for querying,
or not visible from “black-box” extraction system
• Manual: Construct queries manually [MUC]
– Obtained for Proteus from developers
– Not available for DIPRE and Snowball
Plus simple additional “baseline”: retrieve a random document
sample of appropriate size
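A minimal sketch of the Tuples strategy referenced above, assuming a binary relation such as Headquarters and the same hypothetical interfaces as before (the real strategy also caps the number of documents retrieved per query):

    def tuples_strategy(seed_tuples, search_engine, extractor, max_docs):
        """Iteratively query with known tuples, extract from the results, and
        query again with the newly extracted tuples. In a "disconnected"
        database, tuples that never co-occur with known ones are never reached."""
        known, frontier, seen = set(seed_tuples), list(seed_tuples), set()
        while frontier and len(seen) < max_docs:
            org, loc = frontier.pop()
            for doc in search_engine.search(f"{org} AND {loc}"):
                if doc in seen or len(seen) >= max_docs:
                    continue
                seen.add(doc)
                for t in extractor.extract(doc):
                    if t not in known:
                        known.add(t)
                        frontier.append(t)
        return known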
Recall and Precision
Headquarters Relation; Snowball Extraction System

[Two plots comparing QXtract, Patterns, Tuples, and Baseline:
 (a) Recall: recall (%) vs. MaxFractionRetrieved (% of |D_all|), 5%-25%.
 (b) Precision: precision (%) vs. MaxFractionRetrieved (% of |D_all|), 5%-25%.]
Recall and Precision
Headquarters Relation; DIPRE Extraction System

[Two plots comparing QXtract, Patterns, Tuples, and Baseline:
 (a) Recall: recall (%) vs. MaxFractionRetrieved (% of |D_all|), 5%-25%.
 (b) Precision: precision (%) vs. MaxFractionRetrieved (% of |D_all|), 5%-25%.]
Extraction Efficiency and Recall
DiseaseOutbreaks Relation; Proteus Extraction System

[Two plots:
 (a) Recall: recall (%) vs. MaxFractionRetrieved for QXtract, Manual, Tuples, and Baseline.
 (b) Running time (days): full Scan over 100% of the documents ≈ 15.5 days vs. QXtract at MaxFractionRetrieved = 10% ≈ 1.4 days.]

60% of the relation extracted from just 10% of the documents of the 135,000-article newspaper database.
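Reading the two panels together: at MaxFractionRetrieved = 10%, QXtract recovers about 60% of the DiseaseOutbreaks relation while the end-to-end running time drops from roughly 15.5 days for a full scan to roughly 1.4 days, i.e., about an 11x reduction.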
Snowball/Headquarters Queries
DIPRE/Headquarters Queries
Proteus/DiseaseOutbreaks Queries
Current Work: Characterizing Databases for an Extraction Task

[Decision diagram for choosing a strategy:
 Sparse?
   no  → Scan
   yes → Connected?
           no  → QXtract
           yes → Tuples (or QXtract)]
Related Work
• Information Extraction: focus on quality of extracted relations
[MUC]; most relevant sub-task: text filtering
– Filters derived from extraction patterns, or consisting of words
(manually created or from supervised learning)
– Grishman et al.’s manual pattern-based filters for disease outbreaks
– Related to Manual and Patterns strategies in our experiments
– Focus not on querying using simple search interface
• Information Retrieval: focus on relevant documents for queries
– In our scenario, relevance determined by “extraction task” and associated
information extraction system
• Automatic Query Generation: several efforts for different
tasks:
– Minority language corpora construction [Ghani et al. 2001]
– Topic-specific document search (e.g., [Cohen & Singer 1996])
Contributions: An Unsupervised
Query-Based Technique for
Efficient Information Extraction
• Adapts to “arbitrary” underlying information
extraction system and document database
• Can work over non-crawlable “Hidden Web”
databases
• Minimal user input required
– Handful of example tuples
• Can trade off relation completeness and extraction
efficiency
• Particularly interesting in conjunction with
unsupervised/bootstrapping-based information
extraction systems (e.g., DIPRE, Snowball)
Questions?
Overflow Slides
Related Work (II)
• Focused Crawling (e.g., [Chakrabarti et al. 2002]): uses
link and page classification to crawl pages on a topic
• Hidden-Web Crawling [Raghavan & Garcia-Molina
2001]: retrieves pages from non-crawlable Hidden-Web
databases
– Need rich query interface, with distinguishable attributes
– Related to Tuples strategy, but the "tuples" are derived from pull-down menus and other elements of the search interfaces as found
– Our goal: retrieve as few documents as possible from one
database to extract relation
• Question-Answering Systems
Related Work (III)
• [Mitchell, Riloff, et al. 1998] use “linguistic
phrases” derived from information extraction
patterns as features for text categorization
Related to Patterns strategy; requires document parsing,
so can’t directly generate simple queries
• [Gaizauskas & Robertson 1997] use 9 manually
generated keywords to search for documents
relevant to a MUC extraction task
Recall and Precision
DiseaseOutbreaks Relation; Proteus Extraction System

[Two plots comparing QXtract, Manual, Manual+QXtract, Tuples, and Baseline:
 (a) Recall: recall (%) vs. MaxFractionRetrieved (% of |D_all|), 5%-25%.
 (b) Precision: precision (%) vs. MaxFractionRetrieved (% of |D_all|), 5%-25%.]
Running Times

[Three bar charts of running time vs. MaxFractionRetrieved (5%, 10%, 100% of |D_all|):
 – Snowball and DIPRE: running time in minutes (legend: FullScan, QuickScan, QXtract, Extraction, Training).
 – Proteus: running time in days (legend: FullScan, QXtract).]
Extracting Relations from Text: Snowball [ACM DL'00]

• Exploit redundancy on the web to focus on "easy" instances
• Require only minimal training (a handful of seed tuples)

[Bootstrapping loop: initial seed tuples (ORGANIZATION/LOCATION: MICROSOFT/REDMOND, IBM/ARMONK, BOEING/SEATTLE, INTEL/SANTA CLARA) → find occurrences of seed tuples → tag entities → generate extraction patterns → generate new seed tuples → augment table → repeat.]
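A highly simplified sketch of this bootstrapping loop (the helper functions are hypothetical placeholders; the real Snowball additionally assigns confidence weights to patterns and tuples):

    def snowball(seed_tuples, documents, n_iterations=3):
        """Bootstrapping sketch: alternately induce extraction patterns from
        the contexts of known tuples and apply them to extract new tuples."""
        table = set(seed_tuples)
        for _ in range(n_iterations):
            # Find occurrences of the known tuples and the contexts around them.
            contexts = find_occurrences(table, documents)         # hypothetical helper
            # Generalize those contexts into extraction patterns.
            patterns = generate_patterns(contexts)                # hypothetical helper
            # Apply the patterns to entity-tagged documents to get new tuples.
            for doc in documents:
                table.update(apply_patterns(patterns, tag_entities(doc)))
        return table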