Language-Independent Class Instance Extraction Using the Web Richard C. Wang

Thesis Committee:




William W. Cohen (Chair)
Robert E. Frederking
Tom M. Mitchell
Fernando Pereira (Google Research)
Language Technologies Institute, Carnegie Mellon University
Introduction to Class Instance Extraction
Challenge
Discover class instances of any semantic class
with minimum input from users
 x is an instance of class y if x is a (kind of) y
[Figure: example classes “Bags”, “Hair Styles”, and “Failed Banks”, each shown with its extracted instances]

These are real inputs and outputs from a system called ASIA described in this thesis
Applications

Concept and relation learning
  (Cohen, 2000)(Etzioni et al., 2005)(Cafarella et al., 2005)
Co-reference resolution
  (McCarthy & Lehnert, 1995)
Improvements for Question Answering
  (Pantel & Ravichandran, 2004)(Wang et al., 2008)
Query refinement in Web search
  (Pasca, 2004)
Weakly-supervised learning for NER
  (Nadeau et al., 2006)(Talukdar et al., 2008)
Extensions to WordNet
  (Snow et al., 2006)(Wang & Cohen, 2009)
Thesis Statement
The World Wide Web is a vast and readily-available
repository of factual information, such as semantic classes
(e.g., fruits), their instances (e.g., orange, banana), and relations
between them. There are many semi-structured
documents on the Web that provide evidence about these
facts. The thesis of this work is that many of these facts can
be revealed using tools built on set expansion. More
generally, we believe that statistics, aggregation, and
simple analysis of the documents are enough to discover
frequent common classes in not only English, but other
languages as well.
What is Set Expansion?
For example,
  Given a query (the seeds): { survivor, amazing race }
  Answer is: { american idol, big brother, ... }
More formally,
  Given a small number of seed instances:
    x1, x2, …, xk where each xi ∈ S
  Answer is a listing of other probable instances:
    e1, e2, …, en where each ei ∈ S
A well-known system is Google Sets™

http://labs.google.com/sets
Outline
How to…
1. expand a set of instances?
2. expand noisy instances from QA systems?
3. bootstrap set expansion?
4. extract instances given only the class name?
5. improve accuracy by using two languages?
6. expand class-instance relations in pairs?
How to expand a set of instances?
Our Set Expander – SEAL

Set Expander for Any Language


(Wang & Cohen, ICDM 2007)
Features


Independent of human & markup language

Support seeds in English, Chinese, Japanese, Korean, ...

Accept documents in HTML, XML, SGML, TeX, WikiML, …
Does not require pre-annotated training data


Utilize readily-available corpus: World Wide Web
Research contributions

Auto-construct wrappers for extracting candidate instances

Rank candidates using random walk
SEAL’s Pipeline
Seeds: Canon, Nikon, Olympus
Expanded instances: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …
Fetcher: Download web pages containing all seeds
Extractor: Construct wrappers for extracting candidate items
Ranker: Rank candidate items using random walk
The Fetcher
Procedure:
1. Compose a search query by concatenating all seeds
2. Query Google to request top 100 URLs
3. Fetch web pages and send to the Extractor
Seeds: Boston, Seattle, Carnegie-Mellon
Query: “Boston Seattle Carnegie-Mellon”
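Step 1 of the procedure can be sketched in a few lines of Python. This is only an illustration: `compose_query` is a hypothetical helper name, and the search-engine request for the top 100 URLs is left out because the slides do not describe an API.

```python
def compose_query(seeds):
    """Concatenate all seeds into a single search query (step 1 of the Fetcher)."""
    return " ".join(seeds)


seeds = ["Boston", "Seattle", "Carnegie-Mellon"]
print(compose_query(seeds))
```

Because every seed goes into one query, the returned pages are expected to mention all seeds, which is what the Extractor relies on.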
The Extractor


Learns character-level wrappers from semi-structured documents
  No tokenization required (language-independent)
A wrapper is a pair of left (L) and right (R) context strings
  Maximally-long contextual strings that bracket at least one instance of every seed
  Extracts all strings between L and R
  A wrapper derived from page p is only applied to p
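The wrapper definition above can be illustrated with a brute-force sketch. It is not necessarily how SEAL is implemented: the function names are hypothetical, and the search simply tries one occurrence per seed and keeps the longest context shared by that combination.

```python
import itertools
import os
import re


def occurrences(page, seed):
    """Return (start, end) spans of every occurrence of seed in page."""
    return [(m.start(), m.end()) for m in re.finditer(re.escape(seed), page)]


def common_suffix(strings):
    # Longest common suffix = reversed common prefix of the reversed strings.
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def learn_wrappers(page, seeds):
    """Character-level wrappers that bracket at least one instance of every seed."""
    spans_per_seed = [occurrences(page, s) for s in seeds]
    wrappers = set()
    for combo in itertools.product(*spans_per_seed):
        left = common_suffix([page[:start] for start, _ in combo])
        right = os.path.commonprefix([page[end:] for _, end in combo])
        if left and right:  # keep only wrappers with context on both sides
            wrappers.add((left, right))
    return wrappers


def extract(page, wrapper):
    """Extract all strings between the left and right context of a wrapper."""
    left, right = wrapper
    return set(re.findall(re.escape(left) + r"(.+?)" + re.escape(right), page))


page = ('<li class="ford"><a>x</a></li>'
        '<li class="honda"><a>y</a></li>'
        '<li class="toyota"><a>z</a></li>')
for w in learn_wrappers(page, ["ford", "honda"]):
    print(w, extract(page, w))
```

The wrapper is learned from and applied to the same page, and no tokenizer is involved, so the same sketch works on text in any language or markup.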
<li class="ford"><a href="http://www.curryauto.com/">
…
<li class="honda"><a href="http://www.curryauto.com/">
…
<li class="toyota"><a href="http://www.geisauto.com/">
…
<li class="acura"><a href="http://www.curryauto.com/">
…
<li class="nissan"><a href="http://www.curryauto.com/">
…
<li class="toyota"><a href="http://www.curryauto.com/">
…
Simple Extractor finds maximally-long contexts that bracket all instances of every seed
It seems to be working… but what if I add one more instance of “toyota”?
It seems to be working too… but how about a real example?
<li class="ford"><a href="http://www.curryauto.com/">
<img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
<ul><li class="last"><a href="http://www.curryauto.com/">
<span class="dName">Curry Ford</span>...</li></ul>
</li>
<li class="honda"><a href="http://www.curryauto.com/">
<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
<ul><li><a href="http://www.curryhonda-ga.com/">
<span class="dName">Curry Honda Atlanta</span>...</li>
<li><a href="http://www.curryhondamass.com/">
<span class="dName">Curry Honda</span>...</li>
<li class="last"><a href="http://www.curryhondany.com/">
<span class="dName">Curry Honda Yorktown</span>...</li></ul>
</li>
<li class="acura"><a href="http://www.curryauto.com/">
<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
<ul><li class="last"><a href="http://www.curryacura.com/">
<span class="dName">Curry Acura</span>...</li></ul>
</li>
<li class="nissan"><a href="http://www.curryauto.com/">
<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
<ul><li class="last"><a href="http://www.geisauto.com/">
<span class="dName">Curry Nissan</span>...</li></ul>
</li>
<li class="toyota"><a href="http://www.curryauto.com/">
<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
<ul><li class="last"><a href="http://www.geisauto.com/toyota/">
<span class="dName">Curry Toyota</span>...</li></ul>
</li>
Can you find common contexts that bracket all instances of every seed? I guess not!
Let’s try our Proposed Extractor (PE) and see if it works…
PE finds maximally-long contexts that bracket at least one instance of every seed
Hooray! It seems like PE works! But how do we get rid of those noisy instances?
  Noisy-instance callouts on the slide: “I am a noisy instance”, “Me too!”
The Ranker
[Figure: example graph in which documents (northpointcars.com, curryauto.com) contain Wrappers #1 to #4, which extract instances; instance weights: “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo” 8.4%, “bmw” 8.4%]
Build a graph that consists of a fixed set of…
  Node Types: { document, wrapper, instance }
  Labeled Directed Edges: { contain, extract }
    Each edge asserts that a binary relation r holds
    Each edge has an inverse relation r^-1 (so graph is cyclic)
Perform Random Walk (RW) with restart (Tong et al., 2006)
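A minimal sketch of random walk with restart on such a graph, written as plain power iteration. The restart probability, the iteration count, the choice of the seeds as restart nodes, and the toy graph are all assumptions made for illustration; the sketch also assumes every restart node appears in at least one edge.

```python
from collections import defaultdict


def random_walk_with_restart(edges, start_nodes, restart=0.15, iterations=50):
    """Return a probability for every reachable node after a fixed number of steps."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)  # relation r
        neighbors[v].add(u)  # inverse relation r^-1
    restart_mass = {n: 1.0 / len(start_nodes) for n in start_nodes}
    prob = dict(restart_mass)
    for _ in range(iterations):
        nxt = defaultdict(float)
        for node, p in prob.items():
            share = (1.0 - restart) * p / len(neighbors[node])
            for nb in neighbors[node]:
                nxt[nb] += share  # walk to a uniformly chosen neighbor
        for node, p in restart_mass.items():
            nxt[node] += restart * p  # jump back to a restart node
        prob = dict(nxt)
    return prob


# Toy graph: documents contain wrappers, wrappers extract instances.
edges = [
    ("curryauto.com", "wrapper1"), ("northpointcars.com", "wrapper2"),
    ("wrapper1", "honda"), ("wrapper1", "acura"), ("wrapper1", "bmw"),
    ("wrapper2", "honda"), ("wrapper2", "acura"), ("wrapper2", "bmw"),
    ("wrapper2", "volvo"),
]
scores = random_walk_with_restart(edges, start_nodes=["honda", "acura"])
for name in ("honda", "acura", "bmw", "volvo"):
    print(name, round(scores[name], 3))
```

In this toy graph “bmw” is extracted by both wrappers and “volvo” by only one, so “bmw” ends up with the larger weight.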
36 Evaluation Datasets
Initial Experiments

Compare our proposed extractor (PE) to a simple extractor (SE)
  SE finds maximally-long contextual strings that bracket all seed occurrences
Compare random walk (RW) to a simple ranker based on wrapper frequency (WF)
  WF scores instance i by the number of wrappers that extract i
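The WF baseline is simple enough to sketch directly. The input format (a mapping from a wrapper id to the instances it extracts) and the function name are assumptions for illustration.

```python
from collections import Counter


def wrapper_frequency(extractions):
    """Score each instance by the number of wrappers that extract it."""
    counts = Counter()
    for instances in extractions.values():
        counts.update(set(instances))  # each wrapper counts once per instance
    return counts.most_common()


extractions = {
    "wrapper1": {"honda", "acura"},
    "wrapper2": {"honda"},
    "wrapper3": {"honda", "acura", "last"},
}
print(wrapper_frequency(extractions))
```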
Initial Experiments (Wang & Cohen, ICDM 2007)
[Figure: MAP vs. Various Configurations (with 3 seeds); y-axis: Mean Average Precision (%); bars: Google Sets, SE+WF, PE+WF, PE+RW; value labels recovered from the chart: 35.9, 82.4, 87.6, 93.1]
Alternative Rankers
Compare RW to the following four rankers:
1. PR – PageRank (Page et al., 1998)
     Graph-based approach designed to rank web pages
2. BS – Bayesian Sets (Ghahramani and Heller, 2005)
     Formulates set expansion as a Bayesian inference problem
3. WL – Wrapper Length
     Scores instance i by the length of wrappers that extract i
4. WF – Wrapper Frequency
     Scores instance i by the number of wrappers that extract i
HTML Wrappers


PE is the character-level wrapper for SEAL
Compare PE to 4 types of HTML wrappers


H1 is least strict, but more strict than PE
H4 is most strict, but less strict than any HTML wrapper
HTML Wrappers (Wang & Cohen, EMNLP 2009)
[Figure: Performance vs. Wrapper Types (with 2 seeds); y-axis: Mean Average Precision (%), 70 to 95; x-axis: Wrapper Types (PE+RW, H1+RW, H2+RW, H3+RW, H4+RW)]
How to expand noisy instances from QA systems?
Task


Automatically expand (and improve) answers generated by Question Answering systems for list questions
An example of a list question:
  Name cities that have Starbucks
QA Answers: Boston, Seattle, Carnegie-Mellon, Aquafina, Google, Logitech
Expanded Answers: Seattle, Boston, Chicago, Pittsburgh, Carnegie-Mellon, Google
Challenge

SEAL requires correct seeds, but answers
produced by QA systems are often noisy

To integrate them together, we propose
Noise-Resistant SEAL (Wang et al., EMNLP 2008)

Three extensions to SEAL
1. Aggressive Fetcher (AF)
2. Lenient Extractor (LE)
3. Hinted Expander (HE)
Aggressive Fetcher


Sends a two-seed query for every possible
pair of seeds to the search engines
More likely to compose queries containing
only relevant seeds
Seeds
Boston
Seattle
Carnegie-Mellon
Queries
“Boston Seattle”
“Boston Carnegie-Mellon”
“Seattle Carnegie-Mellon”
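The pairwise queries above can be generated with the standard library. This is a sketch with a hypothetical function name; sending the queries to a search engine is not shown.

```python
from itertools import combinations


def pair_queries(seeds):
    """Build one two-seed query for every possible pair of seeds."""
    return [" ".join(pair) for pair in combinations(seeds, 2)]


print(pair_queries(["Boston", "Seattle", "Carnegie-Mellon"]))
```

With n seeds this yields n(n-1)/2 queries, so a single noisy seed such as “Carnegie-Mellon” contaminates only the queries it appears in.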
Lenient Extractor


Maximally-long contextual strings that bracket at
least one instance of a minimum of two seeds
More likely to find useful contexts that bracket
only relevant seeds
Text:
  ... in Boston City Hall ...
  ... in Seattle City Hall ...
  ... at Boston University ...
  ... at Seattle University ...
  ... at Carnegie-Mellon University ...
Learned Wrapper (w/o LE):
  at <blah> University
Learned Wrappers (w/ LE):
  at <blah> University
  in <blah> City Hall
Hinted Expander


Utilizes contexts in the question to constrain the search space of SEAL on the Web
  Extracts up to three keywords from the question
  Appends the keywords to the search queries
For example,
  Question: Name cities that have Starbucks
  Query: “Boston Seattle cities Starbucks”
More likely to find documents containing the desired set of answers
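A rough sketch of the hinted query from the example. The slides do not say how keywords are chosen, so the tiny stopword list below is purely an assumption standing in for real keyword extraction; both function names are hypothetical.

```python
import re

# Assumption: a small stopword list stands in for the real keyword extractor.
STOPWORDS = {"name", "list", "what", "which", "who", "that", "have", "has",
             "the", "a", "an", "of", "in", "are", "is"}


def extract_keywords(question, max_keywords=3):
    """Keep up to max_keywords non-stopword tokens, in question order."""
    words = re.findall(r"\w+", question)
    return [w for w in words if w.lower() not in STOPWORDS][:max_keywords]


def hinted_query(seeds, question):
    """Append the question keywords to the seed query."""
    return " ".join(list(seeds) + extract_keywords(question))


print(hinted_query(["Boston", "Seattle"], "Name cities that have Starbucks"))
```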
Experiment #1: Ephyra

QA System: Ephyra (Schlaefer et al., TREC 2007)
  Outputs a list of answers ranked by confidence scores
Evaluate on TREC 13, 14, and 15 datasets
  55, 93, and 89 list questions respectively
Use SEAL to expand top four answers from Ephyra
For each dataset, we report:
  Mean Average Precision (MAP)
  Average F1 with Optimal Per-Question Threshold
    For each question, cut off the list at a threshold which maximizes the F1 score for that particular question
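Both per-question measures can be sketched as follows; averaging them over all questions in a dataset gives MAP and average F1. The function names and the toy answer list are illustrative assumptions, the ranked list is assumed to contain no duplicates, and average precision is normalized by the number of relevant answers.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant answer appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, answer in enumerate(ranked, start=1):
        if answer in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def best_f1(ranked, relevant):
    """F1 at the cutoff that maximizes F1 for this particular question."""
    relevant = set(relevant)
    best, hits = 0.0, 0
    for k, answer in enumerate(ranked, start=1):
        if answer in relevant:
            hits += 1
        if hits:
            precision, recall = hits / k, hits / len(relevant)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


ranked = ["Seattle", "Boston", "Google", "Chicago"]
relevant = {"Seattle", "Boston", "Chicago", "Pittsburgh"}
print(average_precision(ranked, relevant), best_f1(ranked, relevant))
```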
Experiment #1: Ephyra (Wang et al., EMNLP 2008)
[Figure: two bar charts over the TREC 13, 14, and 15 datasets, one for Mean Average Precision (%) and one for Avg. F1 with Optimal Per-Question Threshold (%); series: Ephyra, Ephyra's Top 4, SEAL, SEAL+LE, SEAL+LE+AF, SEAL+LE+AF+HE]
Experiment #2: Ephyra


In practice, thresholds are unknown
For each dataset, do 5-fold cross validation:
  Train: Find one optimal threshold for all four folds
  Test: Use the threshold to evaluate the fifth fold
Introduce a fourth dataset: All
  Union of TREC 13, 14, and 15
Introduce another system: Hybrid
  Intersection of original answers from Ephyra and expanded answers from SEAL
Experiment #2: Ephyra (Wang et al., EMNLP 2008)
[Figure: F1 with Trained Threshold; y-axis: Avg. F1 (%), 12% to 32%; x-axis: TREC Dataset (Trec 13, Trec 14, Trec 15, All); series: Ephyra, SEAL+LE+AF+HE, Hybrid]
Experiment: Top QA Systems

Top five QA systems that perform the best on list questions in TREC 15 evaluation
1. Language Computer Corporation (lccPA06)
2. The Chinese University of Hong Kong (cuhkqaepisto)
3. National University of Singapore (NUSCHUAQA1)
4. Fudan University (FDUQAT15A)
5. National Security Agency (QACTIS06C)
For each QA system, train thresholds for SEAL and Hybrid on the union of TREC 13 and 14
Expand top four answers from the QA systems on TREC 15, and apply the trained threshold
Experiment: Top QA Systems (Wang et al., EMNLP 2008)
[Figure: F1 with Trained Threshold; y-axis: Average F1 (%); groups: lccPA06, cuhkqaepisto, NUSCHUAQA1, FDUQAT15A, QACTIS06C; series: Baseline, Top 4 Ans., Google Sets, SEAL+LE+AF+HE, Hybrid]
How to bootstrap set expansion?
Limitation of SEAL
[Figure: Preliminary Study on Seed Sizes; y-axis: Mean Average Precision, 75% to 85%; x-axis: # Seeds (Seed Size), 2 to 6; series: RW, PR, BS, WL]
Evaluated using Mean Average Precision on 36 datasets
For each dataset, we randomly pick n seeds (and repeat 3 times)
Performance drops significantly when given more than 5 seeds
  The Fetcher downloads web pages that contain all seeds
  However, not many pages have more than 5 seeds
Proposed Solution – iSEAL

iterative SEAL (Wang & Cohen, ICDM 2008)
  makes several calls to SEAL
  in each call (or iteration):
    Expands a few seeds
    Aggregates statistics
We evaluated iSEAL using…
  Two iterative processes
  Two seeding strategies
  Five ranking methods
Iterative Process & Seeding Strategy
Iterative Processes
1. Supervised
     At every iteration, seeds are obtained from a reliable source (e.g. human)
2. Bootstrapping
     At every iteration, seeds are selected from candidate items (except the 1st iteration)
Seeding Strategies
1. Fixed Seed Size
     Uses 2 seeds at every iteration
2. Increasing Seed Size
     Starts with 2 seeds, then 3 seeds for next iteration, and fixed at 4 seeds afterwards
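The two seeding strategies and a bootstrapping loop can be sketched as below. Only the seed-size schedule comes from the slide. The `expand` callback (standing in for one call to SEAL), the score aggregation by summation, and the policy of reusing the top-ranked candidates as seeds are assumptions made for illustration.

```python
def seed_size(iteration, strategy):
    """Number of seeds for a 1-based iteration under each seeding strategy."""
    if strategy == "fixed":
        return 2
    if strategy == "increasing":
        return min(iteration + 1, 4)  # 2, then 3, then 4 afterwards
    raise ValueError(f"unknown strategy: {strategy}")


def bootstrap(initial_seeds, expand, iterations, strategy="increasing"):
    """Call expand() repeatedly and aggregate candidate scores across calls."""
    scores = {}
    seeds = list(initial_seeds)
    for it in range(1, iterations + 1):
        n = seed_size(it, strategy)
        if it > 1:
            # Assumption: seeds are the top-n candidates aggregated so far.
            seeds = sorted(scores, key=scores.get, reverse=True)[:n]
        for candidate, score in expand(seeds[:n]).items():
            scores[candidate] = scores.get(candidate, 0.0) + score
    return scores


def fake_expand(seeds):
    """Stand-in for one SEAL call: echoes the seeds and proposes one new item."""
    out = {s: 1.0 for s in seeds}
    out.setdefault("x", 0.5)
    return out


print([seed_size(i, "increasing") for i in range(1, 6)])
print(bootstrap(["a", "b"], fake_expand, iterations=3))
```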
Fixed Seed Size (Supervised) (Wang & Cohen, ICDM 2008)
[Figure: seeding diagram; label: Initial Seeds]
[Figure: Mean Average Precision (89% to 98%) vs. # Iterations (Cumulative Expansions), 1 to 10; series: RW, PR, BS, WL, WF]
Fixed Seed Size (Bootstrap) (Wang & Cohen, ICDM 2008)
[Figure: seeding diagram; label: Initial Seeds]
[Figure: Mean Average Precision (86% to 92%) vs. # Iterations (Cumulative Expansions), 1 to 10; series: RW, PR, BS, WL, WF]
Increasing Seed Size (Bootstrap) (Wang & Cohen, ICDM 2008)
[Figure: seeding diagram; labels: Initial Seeds, Used Seeds]
[Figure: Mean Average Precision (89% to 94%) vs. # Iterations (Cumulative Expansions), 1 to 10; series: RW, PR, BS, WL, WF]
How to extract instances given only the class name?
Proposed Approach – ASIA (Wang & Cohen, ACL 2009)
Automatic Set Instance Acquirer (ASIA)
[Figure: ASIA pipeline with components Noisy Instance Provider, Noisy Instance Expander, and Bootstrapper; data labels: Semantic Class Name, Noisy Instances, Some Instances, More Instances]
Noisy Instance Provider (NP)

Manually constructed hyponym patterns
  based on Marti Hearst’s work in 1992
Query search engines for each hyponym pattern + a class name
  e.g. “car makers such as”
Extract all candidates I from returned web snippets
  A snippet often contains multiple excerpts
Rank each candidate i in I based on
  # of patterns, snippets, and excerpts containing i (more = better)
  # of characters between i and the class name C in every excerpt (fewer = better)
Noisy Instance Expander (NE)


The Extractor in NE is a variation of that used in SEAL
  SEAL’s Extractor: Requires the longest common contexts to bracket at least one instance of every seed per web page
  NE’s Extractor: Requires the common contexts that bracket the largest number of unique seeds to be as long as possible per web page
Performs set expansion on web pages queried by a class name + some list words
  List words are words that often appear on list-containing pages
  Example query: “car makers” (list OR names OR famous OR common)
Bootstrapper (BS)

Utilizes iSEAL (Wang & Cohen, ICDM 2008)
  an iterative version of SEAL
iSEAL makes several calls to SEAL, where in each call, iSEAL…
  expands a few seeds, and
  aggregates statistics
Configured to bootstrap with increasing seed size
Evaluation Datasets

36 datasets and each of their class names
used as input to ASIA
Evaluation Results (Wang & Cohen, ACL 2009)
[Figure: MAP Performance vs. System Configurations; y-axis: Mean Average Precision (MAP), 0.00 to 1.00; configurations NP, NP+BS, NP+NE, NP+NE+BS shown for each of English, Chinese, and Japanese]
Comparison to:
Kozareva, Riloff, and Hovy, ACL 2008

Input to Kozareva: a class name + a seed
Comparison to:
Talukdar et al., EMNLP 2008
Van Durme & Pasca, KI 2008
[Figure: Comparisons of Instance Extraction Performance; y-axis: Precision @ 100, 30 to 100; classes: Book Publishers, Federal Agencies, NFL Players, Scientific Journals, Mammals; series: Talukdar et al., 2008, Van Durme & Pasca, 2008, ASIA]
Comparison to:
Snow et al., ACL 2006

Definition:
  Original WN – WordNet 2.1
  Extended WN – Snow’s (+30K) extension of WN 2.1
Selecting semantic classes for evaluation:
  In Extended WN hierarchy, focus on leaf semantic classes extended by Snow that have ≥ 3 instances
  Filter out those classes if the instances from ASIA do not overlap with more than half of the instances in the Original WN
  Randomly select a dozen remaining classes
How to improve accuracy by using two languages?
Proposed Solution – Bilingual SEAL

Utilizes redundant information to minimize the chance
of choosing incorrect seeds when bootstrapping

Expands two sets of instances alternately by using two
separate iSEAL, where both sets represent the same
class but each in a different language


e.g., Disney movies in English and Chinese
Verifies the correctness of a candidate instance by
using ANET (Automatic Named Entity Translator)
Picking a good seed

We select an instance from the (i-2)th iteration, whose translation exists in the (i-1)th iteration, to be used as a seed for the ith iteration
Use translations of instances to select high-quality seeds
Expansions are cumulative for each language
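The seed-selection rule can be sketched as follows. The translation lookup here is a hypothetical dictionary standing in for ANET, and the instance lists are made-up examples.

```python
def pick_seeds(two_back, one_back, translate):
    """Instances from iteration i-2 whose translation appears in iteration i-1."""
    confirmed = set(one_back)
    return [x for x in two_back if translate(x) in confirmed]


# Hypothetical translation table standing in for ANET.
translations = {"Mulan": "花木兰", "Bambi": "小鹿斑比"}
english_candidates = ["Mulan", "Bambi", "Shrek"]  # from iteration i-2
chinese_candidates = ["花木兰", "狮子王"]          # from iteration i-1
print(pick_seeds(english_candidates, chinese_candidates, translations.get))
```

Because the two languages are expanded alternately, iterations i-2 and i share a language, so the selected seeds can be used directly in iteration i.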
Translating instances


Uses bilingual snippets as a resource
Ranks chunks in the target language based on how
frequently and closely they co-occur with the input string

A chunk is any sequence of characters surrounded by punctuation or foreign characters
Input:
Experiments


Evaluate bilingual bootstrapping using…
1. Chinese & English
2. Japanese & English
Present MAP performance of: (e.g., Chinese & English)

CBB – Chinese results of the bilingual bootstrapping

EBB – English results of the bilingual bootstrapping

CMB – (Monolingual) bootstrapping in only Chinese

EMB – (Monolingual) bootstrapping in only English
Experimental Results
How to expand class-instance relations in pairs?
Proposed Solution – Binary SEAL


SEAL was designed to extract unary relations (e.g., x is a CEO)
Binary SEAL extracts binary relations (e.g., x is the CEO of company y)
  Discovers instance pairs having the same relation as the seed pairs
Real example (output shown at the right):
  Seed #1: Bill Gates <-> Microsoft
  Seed #2: Larry Page <-> Google
Binary Extractor

Original Extractor learns unary wrappers
  A unary wrapper consists of left and right context strings
  Extracts all instances that have the same left and right context as the seeds
Binary Extractor learns binary wrappers
  A binary wrapper has an additional middle context string
  Extracts all instance-pairs that have the same left, middle, and right context as the seed-pairs
[left context] Bill Gates [middle context] Microsoft [right context]
[left context] Sergey Brin [middle context] Google [right context]
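Applying a binary wrapper can be sketched with a regular expression. The page and the three context strings below are made up for illustration, and learning the wrapper from seed pairs is not shown.

```python
import re


def extract_pairs(page, left, middle, right):
    """Extract every (x, y) pair bracketed by the left, middle, and right contexts."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)" + re.escape(right))
    return re.findall(pattern, page)


page = ("<tr><td>CIA</td><td>Central Intelligence Agency</td></tr>"
        "<tr><td>USPS</td><td>United States Postal Service</td></tr>"
        "<tr><td>FBI</td><td>Federal Bureau of Investigation</td></tr>")
print(extract_pairs(page, "<tr><td>", "</td><td>", "</td></tr>"))
```

A wrapper that fits the two seed pairs (CIA and USPS) also picks up the third row, which is how new pairs with the same relation are discovered.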
Real Binary Wrappers

Acronym vs. Full Name of Federal Agencies
  Seed #1: CIA <-> Central Intelligence Agency
  Seed #2: USPS <-> United States Postal Service
[Table: learned binary wrappers with columns Left Context, Middle Context, Right Context]
Experiments

Manually constructed five datasets:

Bootstrap results ten times using iSEAL
  the iterative version of SEAL
Experiments


RE is the character-level wrapper for Binary SEAL
Compare RE to 4 types of HTML wrappers


R1 is least strict, but more strict than RE
R4 is most strict, but less strict than any HTML wrapper
Experimental Results (Wang & Cohen, EMNLP 2009)
[Figure: Performance vs. Wrapper Types; y-axis: Mean Average Precision (%), 50 to 95; x-axis: Wrapper Types (1 is least strict): RE, R1, R2, R3, R4; series: - Bootstrap, + Bootstrap]
Conclusion and Future Work
Conclusion






Semi-structured documents provide substantial evidence for discovering class instances
Set expansion at the character-level performs better than at the HTML-level on semi-structured documents
Set expansion can be used as a tool for improving the accuracy of QA systems and for extending WordNet
Random walk is an effective ranker for set expansion
Expansion performance can be improved by exploiting redundant information of classes in different languages
Like unary relations, binary relations can be expanded using similar techniques
Future Work
Develop techniques to automatically…
  verify correctness of candidate instances using distributional similarity in free text
  classify candidate instances as either subclass or instance names
  partition expanded instances into subclasses
  identify concept names given example instances
The End – Thank You!!!
Thank You, William, for your guidance since the SLIF project in the Summer of 2002
Thank You, Bob, for your guidance since the RADD project in the Spring of 2003
Thank You, Tom and Fernando, for all the comments and support during my thesis
