WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction

Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute,
School of Computer Science
Carnegie Mellon University
Outline
• Motivation
• Previous Work
• Our Approach
• Evaluation
• Applications
• Conclusion
Motivation: Acquiring concept-instance pairs
• NLP Tasks
  - Summarization
  - Co-reference resolution
  - Named entity extraction
• Knowledge Bases
  - E.g. NELL, FreeBase
  - Limited coverage
Can we use relational data in tables on the web?
Previous Work
Approach | Co-ordinate terms | Hyponym patterns
(definition) | x, y ∈ same-concept, e.g. (Pittsburgh, New York), (USA, India) | X "such as" Y, e.g. "City such as Pittsburgh"
Van Durme and Pasca, AAAI'08 | Distributional similarity | Hearst patterns
Shinzato and Torisawa, NAACL'04 | Uses itemizations in webpages | Queries the Web to find candidate hypernyms for entities
WebSets | Table columns | Hearst patterns and Clueweb09 corpus
• Open-domain
• Unsupervised
• Corpus based
Candidate class-instance pairs
Our Approach
• Hypothesis 1: Entities appearing in a table column probably belong to the same concept.
  → Relational table identification
• Hypothesis 2: If a set of entities co-occurs frequently in multiple table columns coming from multiple URL domains, then with high probability they represent some meaningful concept.
  → Entity clustering
Relational Table Identification
• Before filtering: 20% of tables are relational
• Heuristic-based table classifier, using:
  - #rows
  - #columns
  - #non-link columns
  - length(cells)
  - whether the table is recursive
• After filtering: 65% of tables are relational
Example of a relational table:
Country | Capital City
India | Delhi
China | Beijing
Canada | Ottawa
France | Paris
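The slide names the classifier's features but not its rules or thresholds. The following is a minimal sketch of what such a heuristic filter could look like. The `Cell` type, the function names, and every threshold value (`min_rows`, `min_cols`, `min_non_link_cols`, `max_cell_len`) are assumptions made for illustration, not values taken from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    text: str
    is_link: bool = False           # cell holds only a hyperlink
    has_nested_table: bool = False  # cell contains another table


Table = List[List[Cell]]


def table_features(table: Table) -> dict:
    """Compute the five features listed on the slide."""
    n_rows = len(table)
    n_cols = min((len(row) for row in table), default=0)
    # A column counts as non-link if at least one of its cells is not a bare link.
    non_link_cols = sum(
        any(not row[c].is_link for row in table) for c in range(n_cols)
    )
    max_cell_len = max((len(cell.text) for row in table for cell in row), default=0)
    recursive = any(cell.has_nested_table for row in table for cell in row)
    return {
        "rows": n_rows,
        "cols": n_cols,
        "non_link_cols": non_link_cols,
        "max_cell_len": max_cell_len,
        "recursive": recursive,
    }


def is_relational(table: Table, min_rows: int = 3, min_cols: int = 2,
                  min_non_link_cols: int = 2, max_cell_len: int = 50) -> bool:
    """Hypothetical rule: all thresholds are illustrative guesses."""
    f = table_features(table)
    return (
        f["rows"] >= min_rows
        and f["cols"] >= min_cols
        and f["non_link_cols"] >= min_non_link_cols
        and f["max_cell_len"] <= max_cell_len
        and not f["recursive"]
    )


country_table = [
    [Cell("Country"), Cell("Capital City")],
    [Cell("India"), Cell("Delhi")],
    [Cell("China"), Cell("Beijing")],
    [Cell("Canada"), Cell("Ottawa")],
    [Cell("France"), Cell("Paris")],
]
nav_menu = [[Cell("Home", is_link=True), Cell("About", is_link=True)]]

print(is_relational(country_table), is_relational(nav_menu))
```

Under these assumed thresholds the country table passes and the single-row navigation menu is rejected. A real filter would tune the thresholds against a labeled sample of tables.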
Data Representation
• Every table cell is a potential entity
• Every table column is a potential entity set
• No labeled data - only co-occurrence information
• Clustering the table columns is important
  - Some columns are only relevant to the page they appear in
  - There are many overlapping columns
Triplet Store Representation
HTML tables → entity-feature file (triplet store)

TableId=21, domain="dom1.com"
Country | Capital City
India | Delhi
China | Beijing
Canada | Ottawa
France | Paris

TableId=34, domain="dom2.com"
Country | Capital City
Canada | Ottawa
China | Beijing
France | Paris
England | London

Triplet store
Entities | Table:Column | Domains
Canada, China, India | 21:1 | dom1.com
Canada, China, France | 21:1, 34:1 | dom1.com, dom2.com
Beijing, Delhi, Ottawa | 21:2 | dom1.com
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com

Size(table corpus) = O(N) ⇒ size(triplet store) = O(N)
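The slide does not spell out how triplets are formed. One reading that reproduces the six triplets shown is a sliding window over three consecutive cells of a column, sorted alphabetically. The sketch below assumes that rule. The function name and input layout are hypothetical, header cells are assumed to be stripped already, and the row order of table 34 is chosen (China before Canada) so that the windows match the slide.

```python
from collections import defaultdict


def build_triplet_store(tables):
    """tables: iterable of (table_id, domain, columns), where columns is a
    list of columns and each column is a list of cell strings."""
    store = defaultdict(lambda: {"columns": set(), "domains": set()})
    for table_id, domain, columns in tables:
        for col_idx, cells in enumerate(columns, start=1):
            # Assumed rule: every window of three consecutive cells is a triplet.
            for i in range(len(cells) - 2):
                key = tuple(sorted(cells[i:i + 3]))
                store[key]["columns"].add(f"{table_id}:{col_idx}")
                store[key]["domains"].add(domain)
    return dict(store)


tables = [
    (21, "dom1.com", [["India", "China", "Canada", "France"],
                      ["Delhi", "Beijing", "Ottawa", "Paris"]]),
    # Row order assumed so that the windows reproduce the slide's triplets.
    (34, "dom2.com", [["China", "Canada", "France", "England"],
                      ["Beijing", "Ottawa", "Paris", "London"]]),
]

store = build_triplet_store(tables)
for entities, info in sorted(store.items()):
    print(", ".join(entities), sorted(info["columns"]), sorted(info["domains"]))
```

A column with n cells yields n - 2 triplets under this rule, which is consistent with the slide's claim that the triplet store stays linear in the size of the table corpus.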
Clustering Algorithm
• Number of clusters is unknown
  - Parametric algorithms not useful
• Scalability is important for a web-scale corpus
  - Naïve single-link clustering algorithm creates the whole similarity matrix: O(N^2 * log N)
• Single-pass algorithm
  - Similar to Chinese Restaurant Process
• Complexity: O(N * log N)
Bottom-up Entity Clustering
#clusters = 0
Triplets sorted by #domains:
Entities | Table:Column | Domains
Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa | 21:2 | dom1.com
Canada, England, France | 34:1 | dom2.com
London, Ottawa, Paris | 34:2 | dom2.com
India, China, Canada | 21:1 | dom1.com
Bottom-up Entity Clustering
#clusters = 1
Cluster-1: Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Bottom-up Entity Clustering
#clusters = 2
Cluster-1: Beijing, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Cluster-2: China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Bottom-up Entity Clustering
#clusters = 2
Cluster-1: Beijing, Delhi, Ottawa, Paris | 21:2, 34:2 | dom1.com, dom2.com
Cluster-2: China, Canada, France | 21:1, 34:1 | dom1.com, dom2.com
Bottom-up Entity Clustering
#clusters = 2
Cluster-1: Beijing, Delhi, Ottawa, Paris, London | 21:2, 34:2 | dom1.com, dom2.com
Cluster-2: China, Canada, France, England, India | 21:1, 34:1 | dom1.com, dom2.com
• Algorithm is order dependent.
• Finding the optimal ordering is NP-hard.
• Sub-optimal solution: order triplets in descending order of #domains.
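The walk-through above can be sketched as a single pass over the sorted triplets. The slides do not state the merge criterion. This sketch assumes a triplet joins the first cluster with which it shares at least two entities (`min_overlap=2`), which happens to reproduce the two clusters shown. The function name and dictionary layout are hypothetical.

```python
def cluster_triplets(triplets, min_overlap=2):
    """Single-pass bottom-up clustering of triplet records.

    min_overlap is an assumed threshold, not a value given on the slides.
    """
    # Process triplets in descending order of #domains. Python's sort is
    # stable, so ties keep their input order.
    ordered = sorted(triplets, key=lambda t: len(t["domains"]), reverse=True)
    clusters = []
    for t in ordered:
        entities = set(t["entities"])
        for c in clusters:
            if len(entities & c["entities"]) >= min_overlap:
                # Merge the triplet into the first sufficiently similar cluster.
                c["entities"] |= entities
                c["columns"] |= set(t["columns"])
                c["domains"] |= set(t["domains"])
                break
        else:
            # No cluster matched: the triplet starts a new cluster.
            clusters.append({
                "entities": entities,
                "columns": set(t["columns"]),
                "domains": set(t["domains"]),
            })
    return clusters


triplets = [
    {"entities": ["Beijing", "Ottawa", "Paris"], "columns": ["21:2", "34:2"],
     "domains": ["dom1.com", "dom2.com"]},
    {"entities": ["China", "Canada", "France"], "columns": ["21:1", "34:1"],
     "domains": ["dom1.com", "dom2.com"]},
    {"entities": ["Delhi", "Beijing", "Ottawa"], "columns": ["21:2"],
     "domains": ["dom1.com"]},
    {"entities": ["Canada", "England", "France"], "columns": ["34:1"],
     "domains": ["dom2.com"]},
    {"entities": ["London", "Ottawa", "Paris"], "columns": ["34:2"],
     "domains": ["dom2.com"]},
    {"entities": ["India", "China", "Canada"], "columns": ["21:1"],
     "domains": ["dom1.com"]},
]

clusters = cluster_triplets(triplets)
for i, c in enumerate(clusters, start=1):
    print(f"Cluster-{i}:", sorted(c["entities"]), sorted(c["columns"]))
```

This version scans every existing cluster for each triplet, so it is simpler than the O(N * log N) algorithm claimed in the talk. An index from entity to cluster would be one way to avoid the linear scan.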
Hypernym Recommendation
Hyponym pattern data:
Entity | Hypernym : count
China | Country : 1000
Canada | Country : 300
Delhi | City : 450, Destination : 10
London | City : 500

Entity clusters:
Entities | Table:Column | Domains
India, China, Canada, France, England | 21:1, 34:1 | dom1.com, dom2.com
Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2 | dom1.com, dom2.com

Score(Hypernym | cluster) = f(co-occurrence counts of hypernym with entities in the cluster)

Recommended hypernyms:
Hypernym | Entities | Table:Column | Domains
Country : 2 | India, China, Canada, France, England | 21:1, 34:1 | dom1.com, dom2.com
City : 2, Destination : 1 | Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2 | dom1.com, dom2.com
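The scores in the example (Country : 2, City : 2, Destination : 1) are consistent with taking f to be the number of cluster entities that co-occur with the hypernym at least once. The sketch below implements that one choice of f. The function name and the dictionary layout of the hyponym pattern data are assumptions.

```python
from collections import Counter


def recommend_hypernyms(cluster_entities, hyponym_data):
    """Rank hypernyms by how many entities of the cluster they match."""
    matches = Counter()
    for entity in cluster_entities:
        for hypernym in hyponym_data.get(entity, {}):
            matches[hypernym] += 1
    return matches.most_common()


# Hyponym pattern data from the slide: entity -> {hypernym: pattern count}.
hyponym_data = {
    "China": {"Country": 1000},
    "Canada": {"Country": 300},
    "Delhi": {"City": 450, "Destination": 10},
    "London": {"City": 500},
}

countries = ["India", "China", "Canada", "France", "England"]
cities = ["Delhi", "Beijing", "Ottawa", "London", "Paris"]

print(recommend_hypernyms(countries, hyponym_data))  # [('Country', 2)]
print(recommend_hypernyms(cities, hyponym_data))     # [('City', 2), ('Destination', 1)]
```

This f ignores the raw pattern counts (1000, 300, and so on). A variant could threshold or weight by them, which would trade recall for precision.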
WebSets Framework
HTML documents → Relational Table Identification → Triplet store → Bottom-up Clustering → Entity Clusters
Entity Clusters + Hyponym Pattern Data → Hypernym Recommendation → Labeled entity sets <Entities, hypernym>
Datasets
Dataset | Description | # HTML pages | # Tables
Toy_Apple | Fruits + companies | 574 | 2.6K
Delicious_Sports | Links from Delicious w/ tag=sports | 21K | 146.3K
Delicious_Music | Links from Delicious w/ tag=music | 183K | 643.3K
CSEAL_Useful | Pages SEAL found NELL entities on | 30K | 322.8K
ASIA_NELL | ASIA run on NELL categories | 112K | 676.9K
ASIA_INT | ASIA run on intelligence domain | 121K | 621.3K
Clueweb_HPR | High pagerank sample of Clueweb | 100K | 586.9K
Evaluation: Clustering
• WebSets vs. K-means
Ideal # clusters: Toy_Apple: 27, Delicious_Sports: 29

Dataset | Method | K | Purity | NMI | RI
Toy_Apple | K-Means | 40 | 0.96 | 0.71 | 0.98
Toy_Apple | WebSets | 25 | 0.99 | 0.99 | 1.00
Delicious_Sports | K-Means | 50 | 0.72 | 0.68 | 0.98
Delicious_Sports | WebSets | 32 | 0.83 | 0.64 | 1.00

• Triplet record vs. entity record (Toy_Apple)
  - triplets have entity disambiguation effect
• Precision-recall trade-off in hypernym recommendation
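For readers unfamiliar with the Purity column, here is a small sketch of the standard definition: each cluster is credited with its most frequent gold label, and the credited items are divided by the total number of items. The toy labels are made up for illustration and are not from the Toy_Apple dataset.

```python
from collections import Counter


def purity(cluster_ids, gold_labels):
    """Fraction of items that carry the majority gold label of their cluster."""
    by_cluster = {}
    for cluster_id, label in zip(cluster_ids, gold_labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(gold_labels)


# Hypothetical toy data: six items in two clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
gold_labels = ["fruit", "fruit", "company", "company", "company", "fruit"]

print(round(purity(cluster_ids, gold_labels), 3))  # 0.667
```

Purity rises trivially as the number of clusters grows, which is one reason the table reports NMI and RI alongside it.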
Evaluation of Entity Sets
Dataset | #Triplets | #Clusters
CSEAL_Useful | 165.2K | 1090
ASIA_NELL | 11.4K | 448
ASIA_INT | 15.1K | 395
Clueweb_HPR | 516.0 | 47
Evaluation of Entity Sets
Dataset | #Triplets | #Clusters | #Clusters with hypernyms
CSEAL_Useful | 165.2K | 1090 | 312
ASIA_NELL | 11.4K | 448 | 266
ASIA_INT | 15.1K | 395 | 218
Clueweb_HPR | 516.0 | 47 | 34
• Uniformly sample 100 clusters per dataset
• Manual evaluation: cluster, ranked list of hypernyms
Evaluation of Entity Sets
Dataset | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful
CSEAL_Useful | 165.2K | 1090 | 312 | 69.0
ASIA_NELL | 11.4K | 448 | 266 | 73.0
ASIA_INT | 15.1K | 395 | 218 | 63.0
Clueweb_HPR | 516.0 | 47 | 34 | 70.5
• Decision: whether a cluster is meaningful
  - Most specific hypernym from the list
• Noisy clusters: e.g. navigational / menu items
Evaluation of Entity Sets
Dataset | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful | MRR of hypernym
CSEAL_Useful | 165.2K | 1090 | 312 | 69.0 | 0.56
ASIA_NELL | 11.4K | 448 | 266 | 73.0 | 0.59
ASIA_INT | 15.1K | 395 | 218 | 63.0 | 0.58
Clueweb_HPR | 516.0 | 47 | 34 | 70.5 | 0.56
• Chose the most specific label from the ranked list of hypernyms.
• MRR: computed against the ranked list of hypernyms per cluster
Evaluation of Entity Sets
Dataset | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful | MRR of hypernym | Precision
CSEAL_Useful | 165.2K | 1090 | 312 | 69.0 | 0.56 | 98.6%
ASIA_NELL | 11.4K | 448 | 266 | 73.0 | 0.59 | 98.5%
ASIA_INT | 15.1K | 395 | 218 | 63.0 | 0.58 | 97.4%
Clueweb_HPR | 516.0 | 47 | 34 | 70.5 | 0.56 | 99.0%
• Precision(cluster, hypernym) = #entities belonging to hypernym / size(cluster)
• Amazon Mechanical Turk evaluation to compute precision
• X: entity, Y: hypernym → "Is X of type Y?"
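The two measures in the last columns can be sketched as follows. MRR averages the reciprocal rank at which the evaluator's chosen label appears in each cluster's ranked hypernym list, and precision follows the formula on the slide. The function names and the example inputs (including the label "Place") are hypothetical.

```python
def reciprocal_rank(ranked_hypernyms, chosen_label):
    """1 / rank of the chosen label, or 0.0 if it is not in the list."""
    for rank, hypernym in enumerate(ranked_hypernyms, start=1):
        if hypernym == chosen_label:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(examples):
    """examples: list of (ranked hypernym list, label chosen by the evaluator)."""
    return sum(reciprocal_rank(r, label) for r, label in examples) / len(examples)


def cluster_precision(judgments):
    """judgments: one boolean per entity, True if it belongs to the hypernym."""
    return sum(judgments) / len(judgments)


# Hypothetical evaluation data for two clusters.
examples = [
    (["City", "Destination"], "City"),   # chosen label at rank 1
    (["Place", "Country"], "Country"),   # chosen label at rank 2
]
print(mean_reciprocal_rank(examples))                   # 0.75
print(cluster_precision([True, True, True, False]))     # 0.75
```

In the talk the per-entity judgments come from Mechanical Turk answers to "Is X of type Y?". How multiple annotators were aggregated is not stated on the slide, so the sketch takes one boolean per entity.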
Application: Corpus Summary
Music Domain
Instruments: Flute, Tuba, String Orchestra, Chimes, Harmonium, Bassoon, Woodwinds, Glockenspiel, French horn, Timpani, Piano
Intervals: Whole tone, Major sixth, Fifth, Perfect fifth, Seventh, Third, Diminished fifth, Whole step, Fourth, Minor seventh, Major third, Minor third
Genres: Smooth jazz, Gothic, Metal rock, Rock, Pop, Hip hop, Rock n roll, Country, Folk, Punk rock
Audio Equipments: Audio editor, General midi synthesizer, Audio recorder, Multichannel digital audio workstation, Drum sequencer, Mixers, Music engraving system, Audio server, Mastering software, Soundfont sample player
• Concepts very specific to domain.
• Hypernyms are generated automatically.
Conclusions & Future Work
• We proposed an unsupervised information extraction method to extract sets of entities from HTML tables on the web.
• Efficient clustering algorithm: O(N * log N)
  - high cluster purity (0.83-0.99)
• Unsupervised hypernym recommendation
  - intra-cluster precision (97-99%)
• Future work: extend this technique to unsupervised extraction and naming of relations.
Thank You!
Evaluation: Hypernym Recommendation
WebSets (WS) vs. Van Durme and Pasca method (DPM), AAAI'08
Method | K | J | %Accuracy | Yield (#pairs produced) | #Correct pairs (predicted)
DPM | Inf | 0.0 | 34.6 | 88.6K | 30.7K
DPM | 5 | 0.2 | 50.0 | 0.8K | 0.4K
DPMExt | Inf | 0.0 | 21.9 | 100,828.0K | 22,081.3K
DPMExt | 5 | 0.2 | 44.0 | 2.8K | 1.2K
WS | - | - | 67.7 | 73.7K | 45.8K
WSExt | - | - | 78.8 | 64.8K | 51.1K
• DPM: outputs a concept-instance pair if it has appeared in the Hyponym Concept Dataset (HCD) and
  - #clusters the concept is assigned to < K
  - fraction of entities in the cluster that match the concept > J
• DPMExt: outputs a concept-instance pair even if absent in HCD.
• WS: ranks the hypernyms by the #entities they match with
• WSExt: filters the Hyponym Concept Dataset by a threshold on count
More stats about various datasets
Dataset | #Tables | %Relational | #Filtered tables | %Relational filtered | #Triplets
Toy_Apple | 2.6K | 50 | 762 | 75 | 15K
Delicious_Sports | 146.3K | 15 | 57.0K | 55 | 63K
Delicious_Music | 643.3K | 20 | 201.8K | 75 | 93K
CSEAL_Useful | 322.8K | 30 | 116.2K | 80 | 1148K
ASIA_NELL | 676.9K | 20 | 233.0K | 55 | 421K
ASIA_INT | 621.3K | 15 | 216.0K | 60 | 374K
Clueweb_HPR | 586.9K | 10 | 176.0K | 35 | 78K
Triplet record vs. entity record (on Toy_Apple dataset)
Method | K | FM w/ Entity records | FM w/ Triplet records
WebSets | - | 0.11 (K=25) | 0.85 (K=34)
K-Means | 30 | 0.09 | 0.35
K-Means | 25 | 0.08 | 0.38
Triplet records can separate "Apple as fruit" and "Apple as company"
Application: Contributions to existing NELL KB
• Entity sets produced by WebSets are mapped to NELL categories
  - entities in each cluster → contexts in np-context data from Clueweb
  - context → NELL categories (using useful patterns for each NELL category)
• score(NELL concept) = sum(TFIDF scores of contexts)
• TFIDF score(context) depends on
  - #entities in the cluster it matches with (generality) and
  - #concepts it is associated with (specificity)

Shard | #classes | #pairs | %acc (1-20) | %acc (1-100) | %acc (1-500) | %acc (1-1000) | %acc (1-10000)
100K | 13 | 944 | 100.00 | 80.56 | 81.82 | 72.97 | 65.93
200K | 3 | 542 | 100.00 | 86.49 | 77.78 | 75.00 | 76.40
300K | 24 | 2236 | 100.00 | 89.74 | 79.31 | 72.72 | 67.71
400K | 44 | 4456 | 93.75 | 80.65 | 80.85 | 74.63 | 67.86
500K | 34 | 9194 | 100.00 | 83.78 | 85.96 | 77.03 | 73.40