Graph-Based Methods for “Open Domain” Information Extraction
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science, Carnegie Mellon University
Traditional IE vs. Open Domain IE

Traditional IE:
• Goal: recognize people, places, companies, times, dates, … in NL text
• Supervised learning from a corpus completely annotated with the target entity class (e.g. “people”)
• Linear-chain CRFs
• Language- and genre-specific extractors

Open Domain IE:
• Goal: recognize arbitrary entity sets in text
  – from very large corpora (WWW)
  – with minimal info about the entity class
  – Example 1: “ICML, NIPS”
  – Example 2: “Machine learning conferences”
• Semi-supervised learning
• Graph-based learning methods
• Techniques are largely language-independent (!)
  – the graph abstraction fits many languages
Outline
• History
  – Open-domain IE by pattern-matching
  – The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Approach 1: a “natural” graph derived from a smaller corpus + learned similarity
  – Approach 2: a carefully-engineered graph derived from a huge corpus
History: Open-domain IE by pattern-matching (Hearst, 92)
• Start with seeds: “NIPS”, “ICML”
• Look through a corpus for certain patterns:
  – … “at NIPS, AISTATS, KDD and other learning conferences…”
  – … “on PC of KDD, SIGIR, … and…”
• Expand from seeds to new instances
• Repeat… until ___ (the loop is sketched below)
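To make the loop concrete, here is a minimal Python sketch of this style of bootstrapping. It assumes the corpus is any iterable of sentences, and the single regex stands in for a whole library of Hearst-style patterns; this is an illustration, not the original implementation.

import re

def bootstrap(seeds, corpus, rounds=3):
    """Grow an entity set from seeds using one Hearst-style pattern:
    "X, Y, Z and other <class words>"."""
    known = set(seeds)
    list_item = re.compile(r'[A-Z][\w-]+')
    pattern = re.compile(r'(?:[A-Z][\w-]+, )+[A-Z][\w-]+ and other \w+')
    for _ in range(rounds):
        new = set()
        for sentence in corpus:
            m = pattern.search(sentence)
            if not m:
                continue
            items = set(list_item.findall(m.group()))
            if items & known:          # a known seed anchors the list...
                new |= items - known   # ...so accept its other members
        if not new:                    # "Repeat... until" nothing new appears
            break
        known |= new
    return known

For example, bootstrap({"NIPS"}, ["at NIPS, AISTATS, KDD and other learning conferences"]) would add AISTATS and KDD.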
Bootstrapping as graph proximity

[Figure: a bipartite graph linking conference names (NIPS, AISTATS, SNOWBIRD, SIGIR, KDD) to the contexts that mention them: “…at NIPS, AISTATS, KDD and other learning conferences…”, “For skiers, NIPS, SNOWBIRD,… and…”, “on PC of KDD, SIGIR, … and…”, “… AISTATS, KDD,…”]

• shorter paths ~ earlier iterations
• many paths ~ additional evidence
Outline
• Open-domain IE as finding nodes “near” seeds on a graph
  – Approach 1: a “natural” graph derived from a smaller corpus + learned similarity
    “with” Einat Minkov (Nokia)
  – Approach 2: a carefully-engineered graph derived from a huge corpus (the examples above)
    “with” Richard Wang (CMU → ?)
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)

[Figure: dependency parse of “boys like playing with all kinds of cars”: like –nsubj→ boys, like –partmod→ playing, playing –prep.with→ kinds, kinds –det→ all, kinds –prep.of→ cars; POS tags NN, VB, VB, DT, NN, NN]

A dependency-parsed sentence is naturally represented as a tree.
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)

A dependency-parsed corpus is “naturally” represented as a graph.
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)

Open IE Goal:
• Find “coordinate terms” (e.g., girl/boy, dolls/cars) in the graph, or
• Find a similarity measure S so that S(girl, boy) is high

What about off-the-shelf similarity measures:
• PPR/RWR
• Hitting time
• Commute time
• …?
Personalized PR/RWR

A query language: Q = {seed nodes, target node type}

The graph: nodes have a node type; edges have a label and a weight. Graph walk parameters: edge weights Θ, walk length K, and reset probability γ.

M[x,y] = probability of reaching y from x in one step: the edge weight from x to y, over the total outgoing weight from x.

“Personalized PageRank”: the reset probability is biased towards the initial distribution. Returns a list of nodes (of the target type) ranked by the graph walk probabilities. Approximate with power iteration, cut off after a fixed number of iterations K (see the sketch below).
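A direct transcription of that recipe; this is a sketch under the slide's definitions (row-stochastic M, reset probability gamma, cutoff K), with the edge-weight parameters Θ assumed to be already folded into M:

import numpy as np

def personalized_pagerank(M, v0, gamma=0.5, K=6):
    """Power iteration for PPR/RWR, cut off after K steps.

    M[x, y]: edge weight from x to y over the total outgoing weight of x.
    v0:      initial (seed) distribution; gamma: reset probability.
    """
    v = v0.copy()
    for _ in range(K):
        v = gamma * v0 + (1 - gamma) * (M.T @ v)
    return v  # restrict to nodes of the target type, then rank by v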
[Figure: walk snapshots from the query word “girls”: girls –mention→ girls1 –nsubj→ like1 –mention⁻¹→ like –mention→ like2 –nsubj⁻¹→ boys2 –mention⁻¹→ boys; a parallel path through partmod reaches playing1/playing, and continuing via prep.with reaches dolls1 –mention⁻¹→ dolls]

Useful, but not our goal here…
Learning a better similarity metric

Task T (query class): Query a + relevant answers a; Query b + relevant answers b; …; Query q + relevant answers q.

[Figure: each query, i.e. a set of seed words (“girl”, “boy”, …), is fed to a GRAPH WALK, which returns a ranked list of nodes (rank 1, 2, 3, … 50): potential new instances of the target concept (“doll”, “child”, “toddler”, …)]
Learning methods
• Weight tuning: weights learned per edge type [Diligenti et al, 2005]
• Reranking: re-order the retrieved list using global features of all paths from source to destination [Minkov et al, 2006]

FEATURES (see the sketch below):
• Edge label sequences, e.g. for boys: nsubj.nsubj-inv, or nsubj → partmod → partmod-inv → nsubj-inv; for dolls: nsubj → partmod → prep.in
• Lexical unigrams, e.g. “like”, “playing”
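The sketch below shows one plausible reading of those two feature families: each candidate node is described by the full edge-label sequence of every walk path that reached it, plus the words at intermediate nodes. Names like node.kind and node.text are illustrative, not the paper's API.

def path_features(paths):
    """paths: for one candidate node, a list of walk paths,
    each path a list of (edge_label, node) steps from the query."""
    feats = set()
    for path in paths:
        labels = [lab for lab, _ in path]
        feats.add('SEQ:' + '.'.join(labels))      # e.g. SEQ:nsubj.nsubj-inv
        for _, node in path[:-1]:                 # intermediate nodes only
            if node.kind == 'word':
                feats.add('LEX:' + node.text)     # e.g. LEX:like, LEX:playing
    return feats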
Learning methods: Path-Constrained Graph Walk

PCW (summary): for each node x, learn
  P(x → z : relevant(z) | history(Vq, x))
where history(Vq, x) is the sequence of edge labels leading from Vq to x, with all histories stored in a tree (sketched below).

[Figure: history tree rooted at Vq = “girls”: the path nsubj → nsubj-inv reaches boys; nsubj → partmod → partmod-inv → nsubj-inv also reaches boys; nsubj → partmod → prep.in reaches dolls]
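A compact sketch of the idea, as an illustrative data structure rather than the paper's exact estimator: histories live in a tree whose nodes carry an estimate of the probability that walks with that history end at relevant nodes, and the walk only follows edges whose extended history clears a threshold.

class PathTree:
    def __init__(self):
        self.children = {}      # edge label -> PathTree
        self.p_relevant = 0.0   # estimated from labeled example paths

def follow(tree_node, outgoing, threshold=0.5):
    """Keep only the outgoing edges whose extended history looks promising."""
    keep = []
    for label, target in outgoing:        # outgoing: [(edge_label, node_id)]
        child = tree_node.children.get(label)
        if child is not None and child.p_relevant >= threshold:
            keep.append((child, target))  # walk continues under this history
    return keep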
City and person name extraction

Corpora:
  Corpus    words     nodes     edges     NEs
  MUC       140K      82K       244K      3K (true)
  MUC+AP    2,440K    1,030K    3,550K    36K (auto)

• City names: Vq = {sydney, stamford, greenville, los_angeles}
• Person names: Vq = {carter, dave_kingman, pedro_ramos, florio}
• 10 queries (×4 seeds) for each task; train on queries q1–q5, test on q6–q10
• Extract nodes of type NE
• GW: 6 steps, uniform/learned weights
• Reranking: top 200 nodes (using learned weights)
• Path trees: 20 correct / 20 incorrect; threshold 0.5
MUC results

[Figure: precision vs. rank (1–100) for city names and person names on MUC, graph walk baseline]
[Figure: precision vs. rank on MUC, graph walk vs. weight tuning. Highest-weight edge types after tuning: conj-and, prep-in, nn, appos, … (city names); subj, obj, poss, nn, … (person names)]
[Figure: precision vs. rank on MUC, graph walk vs. weight tuning vs. PCW. High-probability paths in the path trees include prep-in-inv → conj-and and nn-inv → nn (city names); nsubj → nsubj-inv and appos → nn-inv (person names)]
[Figure: precision vs. rank on MUC, graph walk vs. weight tuning vs. PCW vs. reranking. Reranking adds lexical features such as LEX.“based”, LEX.“downtown” (city names) and LEX.“mr”, LEX.“president” (person names)]
Vector-space models
• Co-occurrence vectors (counts; window: +/- 2)
• Dependency vectors [Padó & Lapata, Comp Ling 07]
  – A path value function:
    • Length-based value: 1 / length(path)
    • Relation-based value: subj = 5, obj = 4, obl = 3, gen = 2, else = 1
  – Context selection function:
    • Minimal: verbal predicate-argument (length 1)
    • Medium: coordination, genitive construction, noun compounds (<= 3)
    • Maximal: combinations of the above (<= 4)
  – Similarity function:
    • Cosine
    • Lin
• Only score the top nodes retrieved with reranking (~1000 overall); a sketch of the representation follows.
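For concreteness, a small sketch of the dependency-vector representation with the length-based value function and cosine similarity; this is a simplified reading of Padó & Lapata, not their code, and the context strings are hypothetical.

from collections import Counter
import math

def add_context(vec, context, path_length):
    """Each dependency path contributes 1/length(path) to its context."""
    vec[context] += 1.0 / path_length

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

girl, boy = Counter(), Counter()
add_context(girl, 'like:nsubj', 1)   # hypothetical context features
add_context(boy, 'like:nsubj', 1)
print(cosine(girl, boy))             # 1.0 for identical vectors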
GWs vs. vector models (MUC)

[Figure: precision vs. rank (1–100) for city names and person names on MUC: PCW and reranking vs. co-occurrence vectors (CO) and dependency vectors (DV)]

→ The graph-based methods are best (syntactic + learning)
GWs vs. vector models (MUC + AP)

[Figure: precision vs. rank (1–100) for city names and person names on MUC+AP: reranking and PCW vs. DV and CO]

→ The advantage of the graph-based models diminishes with the amount of data.
→ This is hard to evaluate at high ranks (manual labeling).
Outline
• Open-domain IE as finding nodes “near” seeds on a graph
  – Approach 1: a “natural” graph derived from a smaller corpus + learned similarity
    “with” Einat Minkov (CMU → Nokia)
  – Approach 2: a carefully-engineered graph derived from a huge corpus
    “with” Richard Wang (CMU → ?)
Set Expansion for Any Language (SEAL) (Wang & Cohen, ICDM 07)
• Basic ideas
  – Dynamically build the graph using queries to the web
  – Constrain the graph to be as useful as possible
    • Be smart about queries
    • Be smart about “patterns”: use clever methods for finding meaningful structure on web pages

System Architecture

[Figure: given seeds 1. Canon, 2. Nikon, 3. Olympus, SEAL returns a ranked list: 4. Pentax, 5. Sony, 6. Kodak, 7. Minolta, 8. Panasonic, 9. Casio, 10. Leica, 11. Fuji, 12. Samsung, 13. …]

• Fetcher: download web pages from the Web that contain all the seeds
• Extractor: learn wrappers from web pages
• Ranker: rank entities extracted by wrappers
The Extractor
• Learn wrappers from web documents and seeds on the fly
  – Utilize semi-structured documents
  – Wrappers defined at the character level
    • Very fast
    • No tokenization required; thus language-independent
  – Wrappers derived from doc d applied to d only
  – See ICDM 2007 paper for details

.. Generally <a ref=“finance/ford”>Ford</a> sales … compared to <a ref=“finance/honda”>Honda</a> while <a href=“finance/gm”>General Motors</a> and <a href=“finance/bentley”>Bentley</a> ….
1. Find the prefix of each seed occurrence and put it in reverse order:
   • ford1: /ecnanif”=fer a> yllareneG …
   • Ford2: >”drof/ /ecnanif”=fer a> yllareneG …
   • honda1: /ecnanif”=fer a> ot derapmoc …
   • Honda2: >”adnoh/ /ecnanif”=fer a> ot …
2. Organize these into a trie, tagging each node with a set of seeds:

[Figure: trie over the reversed prefixes. The root is tagged {f1,f2,h1,h2}; its child ‘/ecnanif”=fer a>’ is tagged {f1,h1} and splits into ‘yllareneG …’ {f1} and ‘ot derapmoc …’ {h1}; its child ‘>”’ is tagged {f2,h2} and splits into ‘drof/ /ecnanif”=fer a> yllareneG..’ {f2} and ‘adnoh/ /ecnanif”=fer a> ot ..’ {h2}]
3. A left context for a valid wrapper is a node tagged with one instance of each seed.
4. The corresponding right context is the longest common suffix of the corresponding seed instances.
[Figure: the trie annotated with right contexts. The node tagged {f1,h1} (left context ‘<a ref=“finance/’) gets right context ‘”>’, matching ‘”>Ford</a> sales …’ and ‘”>Honda</a> while …’; the node tagged {f2,h2} (left context ‘”>’) gets right context ‘</a>’. A sketch of this construction follows.]
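Below is a character-level sketch of the two-context trick. It is simplified: it uses only the first occurrence of each seed instead of the full trie over all occurrences, so it corresponds to one branch of the construction above.

def common_prefix(strings):
    first, rest = strings[0], strings[1:]
    i = 0
    while i < len(first) and all(i < len(s) and s[i] == first[i] for s in rest):
        i += 1
    return first[:i]

def learn_wrapper(doc, seeds):
    """One wrapper from one occurrence of each seed (character level,
    so no tokenization and no language dependence)."""
    lefts, rights = [], []
    for seed in seeds:
        i = doc.find(seed)                 # simplification: first occurrence
        if i < 0:
            return None
        lefts.append(doc[:i][::-1])        # reversed prefix, as on the slides
        rights.append(doc[i + len(seed):])
    left = common_prefix(lefts)[::-1]      # longest common left context
    right = common_prefix(rights)          # longest common right context
    return left, right

On the snippet above, seeds (“ford”, “honda”) recover a left context ending in ‘<a ref=“finance/’ and the right context ‘”>’, i.e. the first wrapper in the figure.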
Example web page for ranking (the slide marks dealer names such as “Curry Ford” below as noise, i.e. extractions that are not car makes):

<li class="ford"><a href="http://www.curryauto.com/">
<img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
<ul><li class="last"><a href="http://www.curryauto.com/">
<span class="dName">Curry Ford</span>...</li></ul>
</li>
<li class="honda"><a href="http://www.curryauto.com/">
<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
<ul><li><a href="http://www.curryhonda-ga.com/">
<span class="dName">Curry Honda Atlanta</span>...</li>
<li><a href="http://www.curryhondamass.com/">
<span class="dName">Curry Honda</span>...</li>
<li class="last"><a href="http://www.curryhondany.com/">
<span class="dName">Curry Honda Yorktown</span>...</li></ul>
</li>
<li class="acura"><a href="http://www.curryauto.com/">
<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
<ul><li class="last"><a href="http://www.curryacura.com/">
<span class="dName">Curry Acura</span>...</li></ul>
</li>
<li class="nissan"><a href="http://www.curryauto.com/">
<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
<ul><li class="last"><a href="http://www.geisauto.com/">
<span class="dName">Curry Nissan</span>...</li></ul>
</li>
<li class="toyota"><a href="http://www.curryauto.com/">
<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
<ul><li class="last"><a href="http://www.geisauto.com/toyota/">
<span class="dName">Curry Toyota</span>...</li></ul>
</li>
The Ranker
• Rank candidate entity mentions based on “similarity” to seeds
  – Noisy mentions should be ranked lower
• Random Walk with Restart (GW)
  – …as before…
  – What's the graph?
Building a Graph

[Figure: seeds “ford”, “nissan”, “toyota” –find→ documents (northpointcars.com, curryauto.com) –derive→ wrappers #1–#4 –extract→ mentions, with walk scores “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%]

• A graph consists of a fixed set of…
  – Node types: {seeds, document, wrapper, mention}
  – Labeled directed edges: {find, derive, extract}
    • Each edge asserts that a binary relation r holds
    • Each edge has an inverse relation r-1 (graph is cyclic)
  – Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions (see the sketch below).
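A minimal sketch of that graph structure: typed nodes, labeled edges, and an automatic inverse for each edge so the walk can move in both directions. The node and edge values are taken from the figure; the class itself is illustrative.

from collections import defaultdict

class SealGraph:
    """Nodes are (type, name) pairs; types: seed, document, wrapper, mention."""
    def __init__(self):
        self.out = defaultdict(list)               # node -> [(label, node)]

    def add(self, src, label, dst):
        self.out[src].append((label, dst))
        self.out[dst].append((label + '-1', src))  # inverse relation r^-1

g = SealGraph()
g.add(('seed', 'ford'), 'find', ('document', 'curryauto.com'))
g.add(('document', 'curryauto.com'), 'derive', ('wrapper', 'w1'))
g.add(('wrapper', 'w1'), 'extract', ('mention', 'honda'))

Running the earlier personalized_pagerank over a matrix form of this graph scores mentions by how strongly seeds, documents, and wrappers jointly reinforce them.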
Evaluation Datasets: closed sets
Evaluation Method
• Mean Average Precision (MAP)
  – Commonly used for evaluating ranked lists in IR
  – Contains recall- and precision-oriented aspects
  – Sensitive to the entire ranking
  – Mean of the average precisions for each ranked list L of extracted mentions:

    AP(L) = (1 / #TrueEntities) × Σ_r Prec(r) × NewEntity(r)

    where r = rank, Prec(r) = precision at rank r, and NewEntity(r) = 1 if (a) the extracted mention at rank r matches a true mention and (b) no other extracted mention at a rank less than r is of the same entity as the one at r; otherwise NewEntity(r) = 0. #TrueEntities = total number of true entities in the dataset. (The computation is sketched below.)

• Evaluation procedure (per dataset):
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP for the five ranked lists
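The definition above translates almost line for line. In this sketch, mention_to_entity is assumed to map each true mention string to its entity, so conditions (a) and (b) both fall out of tracking which entities have already been seen.

def average_precision(ranked_mentions, mention_to_entity, n_true_entities):
    matched, seen_entities, ap = 0, set(), 0.0
    for r, mention in enumerate(ranked_mentions, start=1):
        entity = mention_to_entity.get(mention)   # None: not a true mention
        if entity is None:
            continue
        matched += 1                      # counts toward Prec(r)
        if entity not in seen_entities:   # NewEntity(r) = 1: (a) and (b) hold
            seen_entities.add(entity)
            ap += matched / r             # Prec(r) * NewEntity(r)
    return ap / n_true_entities           # normalize by # true entities

def mean_average_precision(ranked_lists, mention_to_entity, n_true_entities):
    return sum(average_precision(l, mention_to_entity, n_true_entities)
               for l in ranked_lists) / len(ranked_lists)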
Experimental Results: 3 seeds

[Figure: bar charts of overall MAP across SEAL variants and Google Sets (G.Sets, G.Sets (Eng)); reported values include 94.18%, 94.03%, 93.13%, 87.61%, 82.39%, 43.76%, and 14.59%]

Vary: [Extractor] + [Ranker] + [Top N URLs]
• Extractor:
  – E1: baseline extractor (longest common context for all seed occurrences)
  – E2: smarter extractor (longest common context for one occurrence of each seed)
• Ranker: { EF: baseline (most frequent), GW: graph walk }
• N URLs: { 100, 200, 300 }
Side by side comparisons
• Talukdar, Brants, Liberman, Pereira, CoNLL 06

Side by side comparisons
• EachMovie vs. WWW and NIPS vs. WWW (Ghahramani & Heller, NIPS 2005)
A limitation of the original SEAL

[Figure: “Preliminary Study on Seed Sizes”: MAP (75%–85%) vs. number of seeds (2–6) for rankers RW, PR, BS, WL]
Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
• Makes several calls to SEAL; each call…
  – Expands a couple of seeds
  – Aggregates statistics
• Evaluate iSEAL using…
  – Two iterative processes
    • Supervised vs. unsupervised (bootstrapping)
  – Two seeding strategies
    • Fixed seed size vs. increasing seed size
  – Five ranking methods
iSEAL (Fixed Seed Size, Supervised)

[Figure: the initial seeds are expanded a few at a time, each expansion adding to the graph]

• Finally rank nodes by proximity to the seeds in the full graph (see the sketch below)
• Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, …
• Variant (Bootstrap): use high-confidence extractions when seeds run out
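In outline, the control loop looks like this; expand, merge, and rank are placeholders for the SEAL call, graph aggregation, and the rankers on the next slide.

def iseal(seeds, expand, merge, rank, iterations=10, batch_size=2):
    """Fixed seed size, supervised: every batch comes from the true seeds."""
    graph, queue = None, list(seeds)
    for _ in range(iterations):
        if len(queue) < batch_size:
            queue = list(seeds)          # supervised: recycle true seeds only
        batch, queue = queue[:batch_size], queue[batch_size:]
        graph = merge(graph, expand(batch))  # one SEAL call, stats aggregated
    return rank(graph, seeds)  # proximity to all seeds in the full graph

The Bootstrap variant instead refills the queue with high-confidence extractions, and the ISS refinement grows batch_size over time (2, 3, 4, 4, …).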
Ranking Methods
• Random Graph Walk with Restart
  – H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006.
• PageRank
  – L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
• Bayesian Sets (over flattened graph)
  – Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
• Wrapper Length
  – Weights each item based on the length of the common contextual string of that item and the seeds
• Wrapper Frequency
  – Weights each item based on the number of wrappers that extract the item
  – (the last two baselines are sketched below)
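The two baselines are simple enough to state directly. This sketch assumes each candidate carries the set of wrappers that extracted it; summing left and right context lengths is my reading of "wrapper length", not a confirmed detail.

def wrapper_frequency(candidate):
    """Score = number of distinct wrappers that extract the candidate."""
    return len(candidate.wrappers)

def wrapper_length(candidate):
    """Score = total length of the contextual strings (left + right)
    shared between the candidate and the seeds."""
    return sum(len(w.left) + len(w.right) for w in candidate.wrappers)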
Fixed Seed Size (Supervised)

[Figure: MAP (89%–98%) vs. number of iterations / cumulative expansions (1–10) for RW, PR, BS, WL, WF]
Fixed Seed Size (Bootstrap)

[Figure: MAP (86%–92%) vs. number of iterations / cumulative expansions (1–10) for RW, PR, BS, WL, WF]
Increasing Seed Size (Supervised)

[Figure: MAP (90%–97%) vs. number of iterations / cumulative expansions (1–10) for RW, PR, BS, WL, WF]
Increasing Seed Size (Bootstrapping)

[Figure: MAP (89%–94%) vs. number of iterations / cumulative expansions (1–10) for RW, PR, BS, WL, WF]
[Figure: the four MAP-vs-iterations plots above, shown side by side]

• Little difference between ranking methods for the supervised case (all seeds correct); large differences when bootstrapping
• Increasing seed size {2, 3, 4, 4, …} makes all ranking methods improve steadily in the bootstrapping case
Current work
• Start with the name of a concept (e.g., “NFL teams”)
• Look for (language-dependent) patterns:
  – “… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)”
• Take the most frequent answers as seeds (see the sketch below)
• Run bootstrapping iSEAL with seed sizes 2, 3, 4, 4, …
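A rough sketch of the seed-harvesting step; the single regex is illustrative, standing in for the system's set of language-dependent patterns.

import re
from collections import Counter

def harvest_seeds(concept, texts, k=2):
    """Find "<concept> (e.g., X, Y, ...)" and keep the most frequent fillers."""
    pattern = re.compile(
        re.escape(concept) + r'\s*\(\s*e\.g\.,?\s*([^)]*)\)', re.IGNORECASE)
    counts = Counter()
    for text in texts:
        for match in pattern.finditer(text):
            for item in match.group(1).split(','):
                item = item.strip(' .…')
                if item:
                    counts[item] += 1
    return [seed for seed, _ in counts.most_common(k)]

# harvest_seeds("NFL teams", docs) might return
# ["Pittsburgh Steelers", "New York Giants"]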
Datasets with concept names

Experimental results: direct use of text patterns
• Comparison to Kozareva, Riloff & Hovy, which uses the concept name plus a single instance as seed; here no seed is used.
• Comparison to Pasca (using web search queries, CIKM 07)
Comparison to WordNet + 30k
• Snow et al: series of experiments learning hyper/hyponyms
  – Bootstrap from WordNet examples
  – Use dependency-parsed free text
  – E.g., added 30k new instances with fairly high precision
  – Many are concepts + named-entity instances
• Experiments with ASIA on concepts from WordNet show a fairly common problem:
  – E.g., “movies” gives as “instances”: “comedy”, “action/adventure”, “family”, “drama”, …
  – I.e., ASIA finds a lower level in a hierarchy, maybe not the one you want
Comparison to WordNet + 30k
• Filter: a simulated sanity check:
  – Consider only concepts expanded in WordNet + 30k that seem to have named entities as instances and have at least ___ instances
  – Run ASIA on each concept
  – Discard the result if less than 50% of the WordNet instances are in ASIA's output
Two More Systems to Compare to
• Van Durme & Pasca, 2008
  – Requires an English part-of-speech tagger
  – Analyzed 100 million cached Web documents in English (for many classes)
• Talukdar et al, 2008
  – Requires 5 seed instances as input (for each class)
  – Utilizes output from Van Durme's system and 154 million tables from the WebTables database (for many classes)
• ASIA
  – Does not require any part-of-speech tagger (nearly language-independent)
  – Supports multiple languages such as English, Chinese, and Japanese
  – Analyzes around 200–400 Web documents (for each class)
  – Requires only the class name as input
  – Given a class name, extraction usually finishes within a minute (including network latency of fetching web pages)
Comparisons of Instance Extraction Performance

[Figure: precision @ 100 (30–100) for Talukdar et al., 2008, Van Durme & Pasca, 2008, and ASIA on five classes: Book Publishers, Federal Agencies, NFL Players, Scientific Journals, Mammals]

• Precisions of Talukdar's and Van Durme's systems were obtained from Figure 2 in Talukdar et al, 2008.
Top 10 Instances from ASIA

Book Publishers: mcgraw hill, prentice hall, cambridge university press, random house, mit press, academic press, crc press, oxford university press, columbia university press, harvard university press

Federal Agencies: usda, dot, hud, epa, nasa, ssa, nsf, dol, va, sec

NFL Players: tom brady, ladainian tomlinson, peyton manning, brian westbrook, ben roethlisberger, donovan mcnabb, reggie bush, brett favre, terrell owens, anquan boldin

Scientific Journals: scientific american, science, nature, new scientist, cell, journal of biological chemistry, biophysical journal, nature medicine, evolution, new england journal of medicine

Mammals: bats, dogs, rodents, cats, bears, rabbits, elephants, primates, whales, marsupials
Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph

[Figure: the conference/context graph from the introduction: NIPS, AISTATS, SNOWBIRD, SIGIR, KDD linked through shared contexts]

• shorter paths ~ earlier iterations
• many paths ~ additional evidence
Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph, approach 1:
  – Minkov & Cohen, EMNLP 08
  – Graph ~ dependency-parsed corpus
  – Off-the-shelf distance metrics not great
  – With a little bit of learning:
    • Results significantly better than state-of-the-art on small corpora (e.g. a personal email corpus)
    • Results competitive on 2M+ word corpora
Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph, approach 2:
  – Wang & Cohen, ICDM 07, 08
  – Graph built on the fly with web queries
    • A good graph matters!
  – Off-the-shelf distance metrics work
    • Differences are minimal for clean seeds
    • Modest improvements from learning with clean seeds, e.g. reranking (not described here)
  – Bigger differences in similarity measures with noisy seeds
Thanks to
• DARPA PAL program
  – Minkov, Cohen, Wang
• Yahoo! Research Labs
  – Minkov
• Google Research Grant program
  – Wang

Sponsored links: http://boowa.com (Richard's demo)