Graph-Based Methods for “Open Domain” Information Extraction
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Joint work with Richard Wang
Traditional IE vs. Open Domain IE
Traditional IE:
• Goal: recognize people, places, companies, times, dates, … in NL text
• Supervised learning from a corpus completely annotated with the target entity class (e.g., “people”)
• Linear-chain CRFs
• Language- and genre-specific extractors

Open Domain IE:
• Goal: recognize arbitrary entity sets in text
  – from very large corpora (WWW)
  – with minimal info about the entity class
  – Example 1: “ICML, NIPS”
  – Example 2: “machine learning conferences”
• Semi-supervised learning
• Graph-based learning methods
• Techniques are largely language-independent (!)
  – the graph abstraction fits many languages
Outline
• History
– Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
– Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone
History: Open-domain IE by pattern-matching (Hearst, 92)
• Start with seeds: “NIPS”, “ICML”
• Look thru a corpus for certain patterns:
  – … “at NIPS, AISTATS, KDD and other learning conferences…”
  – … “on PC of KDD, SIGIR, … and…”
• Expand from seeds to new instances
• Repeat… until ___
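As a concrete illustration, here is a minimal sketch of that bootstrapping loop in Python. The tiny in-memory corpus and the single coordination-style regex are stand-ins of mine for a web-scale corpus and Hearst's full pattern set; the loop stops when no new instances appear.

import re

# Toy corpus standing in for the web (assumption: a few snippets suffice
# to show the loop; real systems scan very large corpora).
CORPUS = [
    "... at NIPS, AISTATS, KDD and other learning conferences ...",
    "... on PC of KDD, SIGIR, ... and ...",
    "For skiers, NIPS, SNOWBIRD, ... and ...",
]

# One illustrative "pattern": a comma/'and'-coordinated run of
# capitalized tokens, e.g. "NIPS, AISTATS, KDD".
RUN = re.compile(r"(?:[A-Z]{2,}\w*(?:,\s*|\s+and\s+))+[A-Z]{2,}\w*")

def bootstrap(seeds, corpus, max_rounds=5):
    known = set(seeds)
    for _ in range(max_rounds):            # "Repeat ... until" nothing new
        new = set()
        for text in corpus:
            for run in RUN.findall(text):
                items = set(re.split(r",\s*|\s+and\s+", run))
                if items & known:          # the run mentions a known instance
                    new |= items - known
        if not new:
            break
        known |= new
    return known

print(sorted(bootstrap({"NIPS", "ICML"}, CORPUS)))
# -> ['AISTATS', 'ICML', 'KDD', 'NIPS', 'SIGIR', 'SNOWBIRD']

Note how SIGIR is only found in a later round, once KDD has been accepted: exactly the iteration depth that becomes path length in the graph view below.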
Bootstrapping as graph proximity

[Figure: a bipartite graph linking candidate entities (NIPS, AISTATS, SNOWBIRD, SIGIR, KDD) to the contexts that mention them:
  “…at NIPS, AISTATS, KDD and other learning conferences…”
  “For skiers, NIPS, SNOWBIRD,… and…”
  “…on PC of KDD, SIGIR, … and…”
  “… AISTATS, KDD,…”]

• shorter paths ~ earlier iterations
• many paths ~ additional evidence
Set Expansion for Any Language (SEAL) –
(Wang & Cohen, ICDM 07)
• Basic ideas
  – Dynamically build the graph using queries to the web
  – Constrain the graph to be as useful as possible
    • Be smart about queries
    • Be smart about “patterns”: use clever methods for finding meaningful structure on web pages
System Architecture

[Example ranked expansion from seeds 1. Canon, 2. Nikon, 3. Olympus: 4. Pentax, 5. Sony, 6. Kodak, 7. Minolta, 8. Panasonic, 9. Casio, 10. Leica, 11. Fuji, 12. Samsung, 13. …]

• Fetcher: download web pages from the Web that contain all the seeds
• Extractor: learn wrappers from web pages
• Ranker: rank entities extracted by wrappers
The Extractor
• Learn wrappers from web documents and seeds on the fly
  – Utilize semi-structured documents
  – Wrappers defined at the character level
    • Very fast
    • No tokenization required; thus language-independent
    • Wrappers derived from doc d applied to d only
  – See the ICDM 2007 paper for details
.. Generally <a ref=“finance/ford”>Ford</a> sales … compared to <a ref=“finance/honda”>Honda</a> while <a href=“finance/gm”>General Motors</a> and <a href=“finance/bentley”>Bentley</a> ….

1. Find the prefix of each seed occurrence and put it in reverse order:
   • ford1: /ecnanif”=fer a> yllareneG …
   • Ford2: >”drof/ /ecnanif”=fer a> yllareneG …
   • honda1: /ecnanif”=fer a> ot derapmoc …
   • Honda2: >”adnoh/ /ecnanif”=fer a> ot …
2. Organize these into a trie, tagging each node with a set of seeds:

   (root)  {f1,f2,h1,h2}
   ├─ /ecnanif”=fer a>   {f1,h1}
   │   ├─ yllareneG …    {f1}
   │   └─ ot derapmoc …  {h1}
   └─ >”   {f2,h2}
       ├─ drof/ /ecnanif”=fer a> yllareneG..   {f2}
       └─ adnoh/ /ecnanif”=fer a> ot ..        {h2}
3. A left context for a valid wrapper is a node tagged with one instance of each seed.
4. The corresponding right context is the longest common suffix of the corresponding seed instances:

   (root)  {f1,f2,h1,h2}
   ├─ /ecnanif”=fer a>   {f1,h1}  → right context: ”>
   │   ├─ yllareneG …    {f1}     → ”>Ford</a> sales …
   │   └─ ot derapmoc …  {h1}     → ”>Honda</a> while …
   └─ >”   {f2,h2}                → right context: </a>
       ├─ drof/ /ecnanif”=fer a> yllareneG..   {f2}
       └─ adnoh/ /ecnanif”=fer a> ot ..        {h2}
Nice properties:
• There are relatively few nodes in the trie: O((#seeds) × (document length))
• You can tag every node with the complete set of seeds that it covers
• You can rank or filter nodes by any predicate over this set of seeds you want, e.g.:
  – covers all seed instances that appear on the page?
  – covers at least one instance of each seed?
  – covers at least k instances, instances with weight > w, …
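To make the construction concrete, here is a rough Python sketch of character-level wrapper learning on a simplified version of the Ford/Honda page. Rather than an explicit trie, it enumerates shared prefixes of the reversed left contexts, which correspond exactly to trie nodes; the function names, the max_context bound, and the choice of the "covers one instance of each seed" predicate are mine.

def _lcp(strings):
    """Longest common prefix, character by character."""
    if not strings:
        return ""
    lo, hi = min(strings), max(strings)
    i = 0
    while i < len(lo) and lo[i] == hi[i]:
        i += 1
    return lo[:i]

def learn_wrappers(doc, seeds, max_context=40):
    """Sketch of SEAL-style character-level wrapper learning.

    Collect the reversed text preceding every seed occurrence; every shared
    prefix of these reversed strings is a trie node.  A node covering one
    instance of each seed yields a wrapper whose right context is the
    longest common prefix of the text following the covered occurrences.
    """
    low = doc.lower()
    occurrences = []                     # (reversed prefix, seed, suffix)
    for seed in seeds:
        start = 0
        while (i := low.find(seed.lower(), start)) != -1:
            occurrences.append((doc[max(0, i - max_context):i][::-1],
                                seed,
                                doc[i + len(seed):i + len(seed) + max_context]))
            start = i + 1
    wrappers = []
    for length in range(1, max_context + 1):   # trie nodes = shared prefixes
        groups = {}
        for rev, seed, suffix in occurrences:
            if len(rev) >= length:
                groups.setdefault(rev[:length], []).append((seed, suffix))
        for rev_ctx, members in groups.items():
            if {s for s, _ in members} == set(seeds):  # covers every seed
                right = _lcp([suf for _, suf in members])
                if right:
                    wrappers.append((rev_ctx[::-1], right))
    return wrappers

doc = ('Generally <a ref="finance/ford">Ford</a> sales compared to '
       '<a ref="finance/honda">Honda</a> while ...')
for left, right in learn_wrappers(doc, ["ford", "honda"])[:4]:
    print(repr(left), "...", repr(right))

Reversing the prefixes means common left contexts become shared prefixes, which is what makes a trie (or the prefix grouping above) the natural index.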
<li class="ford"><a href="http://www.curryauto.com/">
<img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
<ul><li class="last"><a
href="http://www.curryauto.com/">
I am
noise
<span class="dName">Curry Ford</span>...</li></ul>
</li>
<li class="honda"><a href="http://www.curryauto.com/">
<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
<ul><li><a href="http://www.curryhonda-ga.com/">
<span class="dName">Curry Honda Atlanta</span>...</li>
Me too!
<li><a href="http://www.curryhondamass.com/">
<span class="dName">Curry Honda</span>...</li>
<li class="last"><a href="http://www.curryhondany.com/">
<span class="dName">Curry Honda Yorktown</span>...</li></ul>
</li>
<li class="acura"><a href="http://www.curryauto.com/">
<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
<ul><li class="last"><a href="http://www.curryacura.com/">
<span class="dName">Curry Acura</span>...</li></ul>
</li>
<li class="nissan"><a href="http://www.curryauto.com/">
<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
<ul><li class="last"><a href="http://www.geisauto.com/">
<span class="dName">Curry Nissan</span>...</li></ul>
</li>
<li class="toyota"><a href="http://www.curryauto.com/">
<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
<ul><li class="last"><a href="http://www.geisauto.com/toyota/">
<span class="dName">Curry Toyota</span>...</li></ul>
</li>
Differences from prior work
• Fast character-level wrapper learning
  – Language-independent
  – Trie structure allows flexibility in goals
    • Cover one copy of each seed, cover all instances of seeds, …
  – Works well for semi-structured pages
    • Lists and tables, pull-down menus, javascript data structures, word documents, …
• High-precision, low-recall data integration vs. high-precision, low-recall information extraction
The Ranker
• Rank candidate entity mentions based on “similarity” to seeds
  – Noisy mentions should be ranked lower
• Random Walk with Restart (GW)
• …?
Google’s PageRank
(Brin & Page, http://www-db.stanford.edu/~backrub/google.html)

[Figure: a small web graph of sites linking to one another]

• Inlinks are “good” (recommendations)
• Inlinks from a “good” site are better than inlinks from a “bad” site, but inlinks from sites with many outlinks are not as “good”…
• “Good” and “bad” are relative.
• Imagine a “pagehopper” that always either
  – follows a random link, or
  – jumps to a random page
• PageRank ranks pages by the amount of time the pagehopper spends on a page
  – or, if there were many pagehoppers, PageRank is the expected “crowd size”
Personalized PageRank
(aka Random Walk with Restart)

[Figure: the same small web graph]

• Imagine a “pagehopper” that always either
  – follows a random link, or
  – jumps to a particular page P0
• This ranks pages by the total number of paths connecting them to P0
  – … with each path downweighted exponentially with length
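A minimal random-walk-with-restart sketch, assuming a simple adjacency-list graph and uniform restart over the seed nodes; the damping value and iteration count are illustrative, not tuned.

def rwr(edges, restart_nodes, alpha=0.15, iters=50):
    """Score every node by proximity to restart_nodes.

    edges: dict node -> list of neighbor nodes (outlinks).
    With probability alpha the walker jumps back to a restart node,
    otherwise it follows a random outlink.
    """
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    restart = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0)
               for n in nodes}
    score = dict(restart)                     # start at the restart nodes
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            vs = edges.get(u, [])
            if vs:
                share = (1 - alpha) * score[u] / len(vs)
                for v in vs:
                    nxt[v] += share
            else:                             # dangling node: one common
                for n in nodes:               # convention is to jump home
                    nxt[n] += (1 - alpha) * score[u] * restart[n]
        score = nxt
    return score

# Toy bipartite graph: entities <-> contexts (cf. the NIPS/AISTATS example).
edges = {
    "NIPS": ["ctx1", "ctx2"], "AISTATS": ["ctx1"], "SNOWBIRD": ["ctx2"],
    "ctx1": ["NIPS", "AISTATS"], "ctx2": ["NIPS", "SNOWBIRD"],
}
print(sorted(rwr(edges, ["NIPS"]).items(), key=lambda kv: -kv[1]))

Because each step multiplies the walker's mass by (1 - alpha), long paths contribute exponentially less, matching the path-counting intuition above.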
The Ranker
• Rank candidate entity mentions based on “similarity” to seeds
  – Noisy mentions should be ranked lower
• Random Walk with Restart (GW)
• On what graph?
Building a Graph

[Figure: seeds “ford”, “nissan”, “toyota” find documents (northpointcars.com, curryauto.com); documents derive wrappers (#1-#4); wrappers extract mentions: “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%]

• A graph consists of a fixed set of…
  – Node types: {seeds, document, wrapper, mention}
  – Labeled directed edges: {find, derive, extract}
• Each edge asserts that a binary relation r holds
• Each edge has an inverse relation r⁻¹ (graph is cyclic)
  – Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions.
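A sketch of that typed graph as a data structure. The node types, edge labels, and automatic inverse edges follow the slide; the ExtractionGraph helper and the (type, name) node naming are mine.

from collections import defaultdict

class ExtractionGraph:
    def __init__(self):
        # adjacency: node -> list of (relation, neighbor)
        self.adj = defaultdict(list)

    def add_edge(self, src, rel, dst):
        """Assert relation rel(src, dst) and its inverse rel^-1(dst, src)."""
        self.adj[src].append((rel, dst))
        self.adj[dst].append((rel + "^-1", src))

g = ExtractionGraph()
g.add_edge(("seeds", "ford nissan toyota"), "find", ("document", "curryauto.com"))
g.add_edge(("document", "curryauto.com"), "derive", ("wrapper", "#1"))
g.add_edge(("wrapper", "#1"), "extract", ("mention", "acura"))
g.add_edge(("wrapper", "#1"), "extract", ("mention", "honda"))
print(g.adj[("wrapper", "#1")])

Adding the inverse edge for every assertion is what lets the random walk flow both ways (seed → wrapper → mention and back), so mutual reinforcement between good wrappers and good extractions falls out of plain graph proximity.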
Differences from prior work
• Graph-based distances vs. bootstrapping
  – Graph constructed on-the-fly
    • So it’s not different?
  – But there is a clear principle about how to combine results from earlier/later rounds of bootstrapping
    • i.e., graph proximity
    • Fewer parameters to consider
    • Robust to “bad wrappers”
Evaluation Datasets: closed sets
Evaluation Method
• Mean Average Precision
  – Commonly used for evaluating ranked lists in IR
  – Contains recall- and precision-oriented aspects
  – Sensitive to the entire ranking
  – Mean of average precisions for each ranked list:

    AvgPrec(L) = (1 / #TrueEntities) · Σ_r Prec(r) · NewEntity(r)

    where L = ranked list of extracted mentions, r = rank,
    #TrueEntities = total number of true entities in this dataset,
    Prec(r) = precision at rank r, and
    NewEntity(r) = 1 if (a) the extracted mention at r matches any true mention, and (b) there exists no other extracted mention at rank less than r that is of the same entity as the one at r; 0 otherwise.

• Evaluation Procedure (per dataset):
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP for the five ranked lists
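Under the definitions above, average precision can be computed as in this small sketch; representing each ranked mention by the true entity it matches (or None for a spurious mention) is my encoding, not the paper's.

def average_precision(ranked, num_true_entities):
    """ranked[i]: the true entity matched by the mention at rank i+1,
    or None if that mention matches no true mention."""
    seen, correct, total = set(), 0, 0.0
    for r, entity in enumerate(ranked, start=1):
        if entity is None:
            continue
        correct += 1                       # mention matches a true mention
        if entity not in seen:             # NewEntity(r) = 1
            seen.add(entity)
            total += correct / r           # Prec(r) = correct-so-far / r
    return total / num_true_entities

def mean_average_precision(ranked_lists, num_true_entities):
    return (sum(average_precision(l, num_true_entities) for l in ranked_lists)
            / len(ranked_lists))

# One trial on a dataset with 4 true entities:
print(average_precision(["canon", None, "nikon", "nikon", "sony"], 4))

Note that duplicate mentions of an already-found entity still count toward Prec(r) but contribute no new term of their own, so the score rewards finding each distinct entity early without double-counting repeats.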
Experimental Results: 3 seeds

[Chart: overall MAP vs. method (E1/E2 extractor × EF/GW ranker × 100-300 URLs, plus Google Sets); reported values range from 14.59% to 94.18%]

Vary: [Extractor] + [Ranker] + [Top N URLs]
• Extractor:
  – E1: Baseline extractor (longest common context for all seed occurrences)
  – E2: Smarter extractor (longest common context for 1 occurrence of each seed)
• Ranker: { EF: Baseline (most frequent), GW: graph walk }
• N URLs: { 100, 200, 300 }
Side by side comparisons
• Talukdar, Brants, Liberman, Pereira, CoNLL 06

Side by side comparisons
• EachMovie vs. WWW (Ghahramani & Heller, NIPS 2005)
• NIPS vs. WWW
Why does SEAL do so well?
• Hypotheses:
  – More information appears in semi-structured documents than in free text
  – More semi-structured documents can be (partially) understood with character-level wrappers than with HTML-level wrappers
• Free-text wrappers are only 10-15% of all wrappers learned:
  – “Used [...] Van Pricing”
  – “Used [...] Engines”
  – “Bell Road [...] ”
  – “Alaska [...] dealership”
  – “www.sunnyking[...].com”
  – “engine [...] used engines”
  – “accessories, [...] parts”
  – “is better [...] or”
Comparing character tries to HTML-based structures
Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Iterative set expansion – from a concept name alone
  – Multilingual set expansion
  – Relational set expansion
A limitation of the original SEAL

[Chart: Preliminary Study on Seed Sizes: Mean Average Precision (75%-85%) vs. number of seeds (2-6) for rankers RW, PR, BS, WL]
Proposed Solution: Iterative SEAL (iSEAL)
(Wang & Cohen, ICDM 2008)
• Makes several calls to SEAL; each call…
  – Expands a couple of seeds
  – Aggregates statistics
• Evaluate iSEAL using…
  – Two iterative processes
    • Supervised vs. Unsupervised (Bootstrapping)
  – Two seeding strategies
    • Fixed Seed Size vs. Increasing Seed Size
  – Five ranking methods
iSEAL (Fixed Seed Size, Supervised)
• Start from the initial seeds; each call to SEAL expands a small sample of them
• … finally rank nodes by proximity to seeds in the full graph
• Refinement (ISS): increase the size of the seed set for each expansion over time: 2,3,4,4,…
• Variant (Bootstrap): use high-confidence extractions when seeds run out
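Schematically, the control loop looks like the sketch below; seal_expand and rank_by_proximity are placeholder hooks of mine for the wrapper-learning and graph-walk machinery described earlier, and the demo stubs exist only to make the loop runnable.

import random

def iseal(initial_seeds, seal_expand, rank_by_proximity,
          iterations=10, seed_sizes=None, bootstrap=False):
    """Sketch of the iSEAL control loop: each call to SEAL expands a small
    seed sample, statistics accumulate in one shared graph, and the final
    ranking uses proximity to the original seeds in the full graph."""
    seeds = list(initial_seeds)
    graph = {}                                   # shared, growing graph
    for it in range(iterations):
        k = seed_sizes[it] if seed_sizes and it < len(seed_sizes) else 2
        if bootstrap and len(seeds) < k:         # seeds ran out: recruit
            for cand, _ in rank_by_proximity(graph, seeds):
                if cand not in seeds:            # high-confidence
                    seeds.append(cand)           # extractions as seeds
                if len(seeds) >= k:
                    break
        sample = random.sample(seeds, min(k, len(seeds)))
        seal_expand(sample, graph)               # adds nodes/edges to graph
    return rank_by_proximity(graph, initial_seeds)

# Trivial stubs, for illustration only:
def demo_expand(sample, graph):
    graph.setdefault("expanded", []).append(sample)

def demo_rank(graph, seeds):
    return [(s, 1.0) for s in seeds]

print(iseal(["canon", "nikon", "olympus"], demo_expand, demo_rank,
            iterations=3, seed_sizes=[2, 3, 4]))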
Ranking Methods
• Random Graph Walk with Restart
  – H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006.
• PageRank
  – L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
• Bayesian Sets (over flattened graph)
  – Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
• Wrapper Length
  – Weights each item based on the length of the common contextual string of that item and the seeds
• Wrapper Frequency
  – Weights each item based on the number of wrappers that extract the item
[Chart: Fixed Seed Size (Supervised): MAP (89%-98%) vs. # iterations (cumulative expansions, 1-10) for RW, PR, BS, WL, WF]
[Chart: Fixed Seed Size (Bootstrap): MAP (86%-92%) vs. # iterations (1-10) for RW, PR, BS, WL, WF]
[Chart: Increasing Seed Size (Supervised): MAP (90%-97%) vs. # iterations (1-10) for RW, PR, BS, WL, WF]
[Chart: Increasing Seed Size (Bootstrapping): MAP (89%-94%) vs. # iterations (1-10) for RW, PR, BS, WL, WF]
[The four charts above, shown side by side: Fixed vs. Increasing Seed Size, Supervised vs. Bootstrap]

• Little difference between ranking methods for the supervised case (all seeds correct); large differences when bootstrapping
• Increasing seed size {2,3,4,4,…} makes all ranking methods improve steadily in the bootstrapping case
Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone
Relational Set Expansion
[Wang & Cohen, EMNLP 2009]
• Seed examples are pairs:
  – e.g., audi::germany, acura::japan, …
• Extension: find wrappers in which pairs of seeds occur
  – with specific left & right contexts
  – in a specific order (audi before germany, …)
  – with a specific string between them
• Variant of the trie-based algorithm
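A rough sketch of the pair-wrapper idea: a wrapper becomes a (left, middle, right) triple such that every seed pair appears as left + item1 + middle + item2 + right. Regex search stands in here for the trie machinery, and the example page is invented.

import re

def _common_prefix(xs):
    i = 0
    while xs and all(len(x) > i and x[i] == xs[0][i] for x in xs):
        i += 1
    return xs[0][:i] if xs else ""

def _common_suffix(xs):
    return _common_prefix([x[::-1] for x in xs])[::-1]

def learn_pair_wrapper(doc, seed_pairs):
    """Find one (left, middle, right) context shared by all seed pairs,
    each occurring in order as: left + a + middle + b + right."""
    spans = []
    for a, b in seed_pairs:
        m = re.search(re.escape(a) + r"(.*?)" + re.escape(b), doc, re.S)
        if not m:
            return None                  # every pair must occur, in order
        spans.append((doc[:m.start()], m.group(1), doc[m.end():]))
    middles = {s[1] for s in spans}
    if len(middles) != 1:                # demand one specific middle string
        return None
    return (_common_suffix([s[0] for s in spans]),
            middles.pop(),
            _common_prefix([s[2] for s in spans]))

page = '<li>audi :: germany</li> <li>acura :: japan</li> <li>fiat :: italy</li>'
print(learn_pair_wrapper(page, [("audi", "germany"), ("acura", "japan")]))
# -> ('<li>', ' :: ', '</li> <li>')

Applied to the page it was learned from, such a wrapper extracts the remaining pair (fiat, italy), mirroring how single-item wrappers extract new set members.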
Results
[Tables: ranked relational expansions after the first and tenth iterations]
Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone
Multilingual Set Expansion
• Basic idea:
  – Expand in language 1 (English) with seeds s1,s2 to S1
  – Expand in language 2 (Spanish) with seeds t1,t2 to T1
  – Find the first seed s3 in S1 that has a translation t3 in T1
  – Expand in language 1 (English) with seeds s1,s2,s3 to S2
  – Find the first seed t4 in T1 that has a translation s4 in S2
  – Expand in language 2 (Spanish) with seeds t1,t2,t4 to T2
  – Continue…
Multilingual Set Expansion
• What’s needed:
– Set expansion in two languages
– A way to decide if s is a translation of t
Multilingual Set Expansion
• Finding translations:
  1. Submit s as a query and ask for results in language T
  2. Find chunks in language T in the snippets that frequently co-occur with s
     – bounded by a change in character set (e.g., English to Chinese) or punctuation
  3. Rank chunks by a combination of proximity & frequency
  4. Consider the top 3 chunks t1, t2, t3 as likely translations of s
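A small sketch of that heuristic, assuming English-to-Chinese: chunks are maximal runs of CJK characters in the returned snippets, scored by frequency weighted by proximity to the query term. The snippets and the scoring constant are invented.

import re
from collections import Counter

def translation_candidates(s, snippets, top_k=3):
    """Rank target-language chunks co-occurring with query term s."""
    scores = Counter()
    for snippet in snippets:
        pos = snippet.lower().find(s.lower())
        if pos == -1:
            continue
        # chunk = maximal run of CJK characters; a character-set change
        # bounds the chunk, as does any non-CJK punctuation
        for m in re.finditer(r"[\u4e00-\u9fff]+", snippet):
            scores[m.group()] += 1.0 / (1 + abs(m.start() - pos))
    return [chunk for chunk, _ in scores.most_common(top_k)]

snippets = ["audi 奥迪 germany", "the audi 奥迪 a4 ...", "奥迪 (audi) news"]
print(translation_candidates("audi", snippets))   # -> ['奥迪']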
[Screenshots: multilingual set expansion examples]
Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone
ASIA: Automatic set instance acquisition
[Wang & Cohen, ACL 2009]
• Start with the name of a concept (e.g., “NFL teams”)
• Look for instances using (language-dependent) patterns:
  – “… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)”
• Take the most frequent answers as seeds
• Run bootstrapping iSEAL
  – with seed sizes 2,3,4,4,…
  – and extended for noise-resistance:
    • wrappers should cover as many distinct seeds as possible (not all seeds) …
    • … subject to a limit on size
    • modified trie method
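The seeding step can be sketched as pattern mining over text mentioning the concept name. The two patterns and the tiny text collection below are illustrative stand-ins of mine; ASIA's actual pattern set and its use of search-engine results are richer.

import re
from collections import Counter

def harvest_seeds(concept, texts, num_seeds=4):
    """Mine candidate instances of a named concept from 'e.g.'/'such as'
    patterns and keep the most frequent answers as seeds."""
    patterns = [
        re.escape(concept) + r"\s*\(e\.g\.,?\s*([^)]+)\)",
        re.escape(concept) + r"\s+such as\s+([^.;]+)",
    ]
    counts = Counter()
    for text in texts:
        for pattern in patterns:
            for m in re.finditer(pattern, text, re.I):
                for item in re.split(r",\s*|\s+and\s+", m.group(1)):
                    item = item.strip(" .…")
                    if item:
                        counts[item] += 1
    return [item for item, _ in counts.most_common(num_seeds)]

texts = [
    "... for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, ...)",
    "NFL teams such as Pittsburgh Steelers and Dallas Cowboys.",
]
print(harvest_seeds("NFL teams", texts))
# -> ['Pittsburgh Steelers', 'New York Giants', 'Dallas Cowboys']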
Datasets with concept names

Experimental results
• Direct use of text patterns
• Comparison to Kozareva, Riloff & Hovy (which uses the concept name plus a single instance as seed) … no seed used here
• Comparison to Pasca (using web search queries, CIKM 07)
Comparison to WordNet + 30k
• Snow et al, ACL 2005: series of experiments learning hyper/hyponyms
  – Bootstrap from WordNet examples
  – Use dependency-parsed free text
  – E.g., added 30k new instances with fairly high precision
  – Many are concepts + named-entity instances
• Experiments with ASIA on concepts from WordNet show a fairly common problem:
  – E.g., “movies” gives as “instances”: “comedy”, “action/adventure”, “family”, “drama”, …
  – I.e., ASIA finds a lower level in a hierarchy, maybe not the one you want
Comparison to WordNet + 30k
• Filter: a simulated sanity check:
  – Consider only concepts expanded in WordNet + 30k that seem to have named entities as instances and have at least ___ instances
  – Run ASIA on each concept
  – Discard the result if less than 50% of the WordNet instances are in ASIA’s output
• Summary:
  – Some are good
  – Some of Snow’s concepts are low-precision relative to ASIA (4.7% → 100%)
  – For the rest, ASIA has 2x to 100x the coverage (in number of instances)
Two More Systems to Compare to
• Van Durme & Pasca, 2008
  – Requires an English part-of-speech tagger
  – Analyzed 100 million cached Web documents in English (for many classes)
• Talukdar et al, 2008
  – Requires 5 seed instances as input (for each class)
  – Utilizes output from Van Durme’s system and 154 million tables from the WebTables database (for many classes)
• ASIA
  – Does not require any part-of-speech tagger (nearly language-independent)
  – Supports multiple languages such as English, Chinese, and Japanese
  – Analyzes around 200-400 Web documents (for each class)
  – Requires only the class name as input
  – Given a class name, extraction usually finishes within a minute (including network latency of fetching web pages)
Comparisons of Instance Extraction Performance

[Chart: Precision@100 (30-100) for Talukdar et al. 2008, Van Durme & Pasca 2008, and ASIA on five classes: Book Publishers, Federal Agencies, NFL Players, Scientific Journals, Mammals]

• Precisions of Talukdar’s and Van Durme’s systems were obtained from Figure 2 in Talukdar et al, 2008.
Top 10 Instances from ASIA

Book Publishers             Federal Agencies   NFL Players           Scientific Journals               Mammals
mcgraw hill                 usda               tom brady             scientific american               bats
prentice hall               dot                ladainian tomlinson   science                           dogs
cambridge university press  hud                peyton manning        nature                            rodents
random house                epa                brian westbrook       new scientist                     cats
mit press                   nasa               ben roethlisberger    cell                              bears
academic press              ssa                donovan mcnabb        journal of biological chemistry   rabbits
crc press                   nsf                reggie bush           biophysical journal               elephants
oxford university press     dol                brett favre           nature medicine                   primates
columbia university press   va                 terrell owens         evolution                         whales
harvard university press    sec                anquan boldin         new england journal of medicine   marsupials
Joint work with Tom Mitchell, Weam AbuZaki, Justin Betteridge, Andrew Carlson, Estevam R. Hruschka Jr., Bryan Kisiel, Burr Settles

Learn a large number of concepts at once

[Figure: an ontology coupling categories (person, athlete, coach, team, sport) and relations (playsSport(a,s), playsForTeam(a,t), coachesTeam(c,t), teamPlaysSport(t,s)) to noun-phrase pairs (NP1, NP2) in text such as “Krzyzewski coaches the Blue Devils.”]
Coupled learning of text and HTML patterns

[Figure: CBL learns free-text extraction patterns and SEAL learns HTML extraction patterns, both from the Web; evidence integration combines them into an ontology and populated KB]
Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph

[Figure: the entity/context graph from the NIPS/AISTATS/SNOWBIRD/SIGIR/KDD example]

• RWR as a robust proximity measure
• Character tries as a flexible pattern language
  – high-coverage
  – modifiable to handle expectations of noise
Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph:
  – Graph built on-the-fly with web queries
    • A good graph matters! A big graph matters!
    • character-level tries >> HTML heuristics
  – Rank the whole graph
    • Don’t confuse iteratively building the graph with ranking!
  – Off-the-shelf distance metrics work
    • Differences are minimal for clean seeds
    • Much bigger differences with noisy seeds
    • Bootstrapping (especially from free-text patterns) is noisy
Thanks to
• DARPA PAL program
– Cohen, Wang
• Google Research Grant program
– Wang
Sponsored links: http://boowa.com (Richard’s demo)