WebSets: Extracting Sets of Entities from the Web Using Unsupervised Information Extraction
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Outline
- Motivation
- Previous Work
- Our Approach
- Evaluation
- Applications
- Conclusion

Motivation: Acquiring concept-instance pairs
- NLP tasks such as summarization, co-reference resolution, and named entity extraction need concept-instance pairs.
- Knowledge bases (e.g. NELL, Freebase) provide such pairs, but with limited coverage.
- Can we use relational data in tables on the web?

Previous Work
- Co-ordinate terms (x, y ∈ same-concept), e.g. (Pittsburgh, New York), (USA, India): obtained via distributional similarity, via itemizations in webpages (Shinzato and Torisawa, NAACL'04), or, as here, via table columns.
- Hyponym patterns (X "such as" Y, e.g. "City such as Pittsburgh"): Hearst patterns; Van Durme and Pasca (AAAI'08) query the Web to find candidate hypernyms for entities; here, Hearst patterns over the Clueweb09 corpus.
- Shared goal: open-domain, unsupervised, corpus-based acquisition of candidate class-instance pairs.

Our Approach
- Hypothesis 1: Entities appearing in a table column probably belong to the same concept. → Relational table identification.
- Hypothesis 2: If a set of entities co-occurs frequently in multiple table columns coming from multiple URL domains, then with high probability they represent some meaningful concept. → Entity clustering.

Relational Table Identification
- Only about 20% of HTML tables are relational; after heuristic filtering, about 65% of the retained tables are relational (per-dataset statistics appear at the end).
- Example of a relational table:

| Country | Capital City |
|---------|--------------|
| India   | Delhi        |
| China   | Beijing      |
| Canada  | Ottawa       |
| France  | Paris        |

- Heuristic-based table classifier, using features such as #rows, #columns, #non-link columns, length of cells, and whether the table is recursive (a sketch of such a filter appears after the triplet-store slide below).

Data Representation
- Every table cell is a potential entity; every table column is a potential entity set.
- There is no labeled data, only co-occurrence information.
- Clustering the table columns is important: some columns are only relevant to the page they appear in, and there are many overlapping columns.

Triplet Store Representation
Two example HTML tables:

TableId = 21, domain = "dom1.com":

| Country | Capital City |
|---------|--------------|
| India   | Delhi        |
| China   | Beijing      |
| Canada  | Ottawa       |
| France  | Paris        |

TableId = 34, domain = "dom2.com":

| Country | Capital City |
|---------|--------------|
| Canada  | Ottawa       |
| China   | Beijing      |
| France  | Paris        |
| England | London       |

Entity-feature file (triplet store):

| Entities                | Table:Column | Domains            |
|-------------------------|--------------|--------------------|
| Canada, China, India    | 21:1         | dom1.com           |
| Canada, China, France   | 21:1, 34:1   | dom1.com, dom2.com |
| Beijing, Delhi, Ottawa  | 21:2         | dom1.com           |
| Beijing, Ottawa, Paris  | 21:2, 34:2   | dom1.com, dom2.com |
| Canada, England, France | 34:1         | dom2.com           |
| London, Ottawa, Paris   | 34:2         | dom2.com           |

- Size(table corpus) = O(N) and size(triplet store) = O(N), so the representation stays linear in the corpus size. A sketch of the triplet-store construction also follows below.
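The slides only name the features of the heuristic relational-table classifier, not its decision rule. The following is a minimal sketch of such a filter; the cell representation, the function name `looks_relational`, and all thresholds are illustrative assumptions, not the values used by WebSets.

```python
# Hedged sketch of a heuristic relational-table filter using the features named
# on the "Relational Table Identification" slide (#rows, #columns, #non-link
# columns, cell length, recursive tables). Thresholds are illustrative only.

def looks_relational(table):
    """table: list of rows, each row a list of cell dicts of the (assumed) form
    {"text": str, "is_link": bool, "has_nested_table": bool}."""
    n_rows = len(table)
    n_cols = len(table[0]) if n_rows else 0

    # Tiny tables are usually layout artifacts, not relations.
    if n_rows < 3 or n_cols < 2:
        return False

    # Recursive tables (tables nested inside cells) are usually page layout.
    if any(cell["has_nested_table"] for row in table for cell in row):
        return False

    # Require at least one column that is not made up entirely of links
    # (pure-link columns are often navigation menus).
    non_link_cols = 0
    for c in range(n_cols):
        col = [row[c] for row in table if c < len(row)]
        if not all(cell["is_link"] for cell in col):
            non_link_cols += 1
    if non_link_cols == 0:
        return False

    # Relational cells tend to be short strings, not paragraphs.
    cells = [cell["text"] for row in table for cell in row]
    avg_len = sum(len(t) for t in cells) / max(len(cells), 1)
    return avg_len <= 50
```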
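The next sketch builds the triplet store shown above from parsed tables. It assumes each column is split into overlapping triplets of adjacent cells, which reproduces most of the example rows; the exact grouping used by WebSets may differ, and `build_triplet_store` and its input format are hypothetical.

```python
# Hedged sketch of building the triplet store: each record maps a set of three
# entities to the table columns and URL domains it appears in.

from collections import defaultdict

def build_triplet_store(tables):
    """tables: iterable of (table_id, domain, columns), where columns is a
    list of lists of cell strings (one inner list per column)."""
    store = defaultdict(lambda: {"columns": set(), "domains": set()})
    for table_id, domain, columns in tables:
        for col_idx, cells in enumerate(columns, start=1):
            entities = [c.strip() for c in cells if c.strip()]
            for i in range(len(entities) - 2):
                key = tuple(sorted(entities[i:i + 3]))   # canonical entity triplet
                store[key]["columns"].add(f"{table_id}:{col_idx}")
                store[key]["domains"].add(domain)
    return store

# Reproducing the slide's example tables:
tables = [
    (21, "dom1.com", [["India", "China", "Canada", "France"],
                      ["Delhi", "Beijing", "Ottawa", "Paris"]]),
    (34, "dom2.com", [["Canada", "China", "France", "England"],
                      ["Ottawa", "Beijing", "Paris", "London"]]),
]
store = build_triplet_store(tables)
# e.g. store[("Beijing", "Ottawa", "Paris")] ->
#      {"columns": {"21:2", "34:2"}, "domains": {"dom1.com", "dom2.com"}}
```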
Clustering Algorithm
- The number of clusters is unknown, so algorithms that require it as a parameter (e.g. K-means) are not directly applicable.
- Scalability matters for a web-scale corpus: naïve single-link clustering builds the full similarity matrix, costing O(N^2 * log N).
- WebSets therefore uses a single-pass algorithm, similar in spirit to the Chinese Restaurant Process, with complexity O(N * log N).

Bottom-up Entity Clustering (walkthrough)
Start with #clusters = 0 and the triplets sorted by #domains:

| Entities                | Table:Column | Domains            |
|-------------------------|--------------|--------------------|
| Beijing, Ottawa, Paris  | 21:2, 34:2   | dom1.com, dom2.com |
| China, Canada, France   | 21:1, 34:1   | dom1.com, dom2.com |
| Delhi, Beijing, Ottawa  | 21:2         | dom1.com           |
| Canada, England, France | 34:1         | dom2.com           |
| London, Ottawa, Paris   | 34:2         | dom2.com           |
| India, China, Canada    | 21:1         | dom1.com           |

- Step 1: {Beijing, Ottawa, Paris} starts Cluster-1 (21:2, 34:2; dom1.com, dom2.com). #clusters = 1.
- Step 2: {China, Canada, France} does not overlap with Cluster-1, so it starts Cluster-2 (21:1, 34:1; dom1.com, dom2.com). #clusters = 2.
- Step 3: {Delhi, Beijing, Ottawa} overlaps with Cluster-1, which grows to {Beijing, Delhi, Ottawa, Paris}.
- Step 4: {Canada, England, France} and {India, China, Canada} merge into Cluster-2; {London, Ottawa, Paris} merges into Cluster-1.
- Final clusters:
  - Cluster-1: Beijing, Delhi, Ottawa, Paris, London (21:2, 34:2; dom1.com, dom2.com)
  - Cluster-2: China, Canada, France, England, India (21:1, 34:1; dom1.com, dom2.com)
- The algorithm is order dependent, and finding the optimal ordering is NP-hard. Sub-optimal solution: order triplets in descending order of #domains. A code sketch of this single-pass procedure follows.
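A minimal sketch of the single-pass clustering illustrated above. It assumes a triplet joins the first existing cluster with which it shares at least two entities; this rule reproduces the walkthrough, but the exact merge criterion used by WebSets may differ, and `cluster_triplets` is a hypothetical name.

```python
# Hedged sketch of single-pass bottom-up clustering over triplet records.
# Triplets are processed in descending order of #domains; a triplet joins the
# first cluster sharing >= 2 entities with it, otherwise it starts a new cluster.

def cluster_triplets(store):
    """store: dict mapping entity-triplet tuples to
    {"columns": set, "domains": set} (same schema as the triplet-store sketch)."""
    # Sub-optimal but cheap ordering: most widely-attested triplets first.
    records = sorted(store.items(),
                     key=lambda kv: len(kv[1]["domains"]),
                     reverse=True)
    clusters = []   # each: {"entities": set, "columns": set, "domains": set}
    for entities, feats in records:
        entities = set(entities)
        for cl in clusters:
            if len(entities & cl["entities"]) >= 2:    # assumed merge criterion
                cl["entities"] |= entities
                cl["columns"] |= feats["columns"]
                cl["domains"] |= feats["domains"]
                break
        else:
            clusters.append({"entities": set(entities),
                             "columns": set(feats["columns"]),
                             "domains": set(feats["domains"])})
    return clusters

# Tiny self-contained example (same record schema as the triplet-store sketch):
toy_store = {
    ("Beijing", "Ottawa", "Paris"): {"columns": {"21:2", "34:2"},
                                     "domains": {"dom1.com", "dom2.com"}},
    ("Beijing", "Delhi", "Ottawa"): {"columns": {"21:2"}, "domains": {"dom1.com"}},
}
print(cluster_triplets(toy_store))
# -> one cluster containing Beijing, Delhi, Ottawa, Paris
```

Run on the full example store, this yields the two walkthrough clusters, {Beijing, Delhi, Ottawa, Paris, London} and {China, Canada, France, England, India}.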
Hypernym Recommendation
Hyponym-pattern data provides per-entity hypernym counts:

| Entity | Hypernym : count             |
|--------|------------------------------|
| China  | Country : 1000               |
| Canada | Country : 300                |
| Delhi  | City : 450, Destination : 10 |
| London | City : 500                   |

Entity clusters from the previous step:

| Entities                              | Table:Column | Domains            |
|---------------------------------------|--------------|--------------------|
| India, China, Canada, France, England | 21:1, 34:1   | dom1.com, dom2.com |
| Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2   | dom1.com, dom2.com |

Score(hypernym | cluster) = f(co-occurrence counts of the hypernym with entities in the cluster), which yields labeled clusters (a sketch of this scoring follows):

| Hypernym                  | Entities                              | Table:Column | Domains            |
|---------------------------|---------------------------------------|--------------|--------------------|
| Country : 2               | India, China, Canada, France, England | 21:1, 34:1   | dom1.com, dom2.com |
| City : 2, Destination : 1 | Delhi, Beijing, Ottawa, London, Paris | 21:2, 34:2   | dom1.com, dom2.com |
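A minimal sketch of this scoring, ranking hypernyms by the number of cluster entities that co-occur with them in the hyponym-pattern data. This reproduces the "Country : 2" and "City : 2, Destination : 1" scores above, but the exact function f is abstracted to a simple entity count, and `recommend_hypernyms` is a hypothetical name.

```python
# Hedged sketch of hypernym recommendation: rank candidate hypernyms for a
# cluster by how many of the cluster's entities co-occur with them in the
# hyponym-pattern ("entity -> hypernym: count") data.

from collections import Counter

def recommend_hypernyms(cluster_entities, hyponym_data):
    """hyponym_data: dict entity -> {hypernym: co-occurrence count}."""
    scores = Counter()
    for entity in cluster_entities:
        for hypernym in hyponym_data.get(entity, {}):
            scores[hypernym] += 1          # count entities matching the hypernym
    return scores.most_common()            # ranked list of (hypernym, score)

hyponym_data = {
    "China":  {"Country": 1000},
    "Canada": {"Country": 300},
    "Delhi":  {"City": 450, "Destination": 10},
    "London": {"City": 500},
}
print(recommend_hypernyms({"India", "China", "Canada", "France", "England"}, hyponym_data))
# -> [("Country", 2)]
print(recommend_hypernyms({"Delhi", "Beijing", "Ottawa", "London", "Paris"}, hyponym_data))
# -> [("City", 2), ("Destination", 1)]
```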
WebSets Framework
HTML documents → Relational Table Identification → Triplet Store → Bottom-up Clustering → Entity Clusters; Entity Clusters + Hyponym Pattern Data → Hypernym Recommendation → Labeled entity sets <Entities, hypernym>.

Datasets

| Dataset          | Description                        | #HTML pages | #Tables |
|------------------|------------------------------------|-------------|---------|
| Toy_Apple        | Fruits + companies                 | 574         | 2.6K    |
| Delicious_Sports | Links from Delicious w/ tag=sports | 21K         | 146.3K  |
| Delicious_Music  | Links from Delicious w/ tag=music  | 183K        | 643.3K  |
| CSEAL_Useful     | Pages SEAL found NELL entities on  | 30K         | 322.8K  |
| ASIA_NELL        | ASIA run on NELL categories        | 112K        | 676.9K  |
| ASIA_INT         | ASIA run on intelligence domain    | 121K        | 621.3K  |
| Clueweb_HPR      | High-PageRank sample of Clueweb    | 100K        | 586.9K  |

Evaluation: Clustering
WebSets vs. K-means. Ideal #clusters: Toy_Apple: 27, Delicious_Sports: 29.

| Dataset          | Method  | K  | Purity | NMI  | RI   |
|------------------|---------|----|--------|------|------|
| Toy_Apple        | K-Means | 40 | 0.96   | 0.71 | 0.98 |
| Toy_Apple        | WebSets | 25 | 0.99   | 0.99 | 1.00 |
| Delicious_Sports | K-Means | 50 | 0.72   | 0.68 | 0.98 |
| Delicious_Sports | WebSets | 32 | 0.83   | 0.64 | 1.00 |

- Triplet record vs. entity record (Toy_Apple): triplets have an entity-disambiguation effect (details at the end).
- There is a precision-recall tradeoff in hypernym recommendation.

Evaluation of Entity Sets

| Dataset      | #Triplets | #Clusters | #Clusters with hypernyms | %Meaningful | MRR of hypernym | Precision |
|--------------|-----------|-----------|--------------------------|-------------|-----------------|-----------|
| CSEAL_Useful | 165.2K    | 1090      | 312                      | 69.0        | 0.56            | 98.6%     |
| ASIA_NELL    | 11.4K     | 448       | 266                      | 73.0        | 0.59            | 98.5%     |
| ASIA_INT     | 15.1K     | 395       | 218                      | 63.0        | 0.58            | 97.4%     |
| Clueweb_HPR  | 516.0     | 47        | 34                       | 70.5        | 0.56            | 99.0%     |

- 100 clusters were uniformly sampled per dataset; each cluster was manually evaluated together with its ranked list of hypernyms.
- The judgments: whether a cluster is meaningful (noisy clusters are, e.g., navigational/menu items) and the most specific correct hypernym from the list.
- MRR is computed against the ranked list of hypernyms per cluster, using the most specific label chosen from that list.
- Precision(cluster, hypernym) = (#entities in the cluster belonging to the hypernym) / size(cluster), computed via an Amazon Mechanical Turk evaluation: for entity X and hypernym Y, workers answer "Is X of type Y?". A sketch of these two metrics follows.
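A hedged sketch of how the two manual-evaluation numbers above could be computed. It assumes each evaluated cluster comes with the rank of the hypernym judged correct (or none) and with per-entity "Is X of type Y?" judgments; the exact bookkeeping used in the WebSets evaluation may differ, and both function names are hypothetical.

```python
# Hedged sketch of the MRR and per-cluster precision measures reported above.

def mean_reciprocal_rank(correct_ranks):
    """correct_ranks: 1-based rank of the accepted hypernym per cluster,
    or None when no hypernym in the ranked list was judged correct."""
    rr = [1.0 / r for r in correct_ranks if r is not None]
    return sum(rr) / len(correct_ranks) if correct_ranks else 0.0

def cluster_precision(entity_judgments):
    """entity_judgments: list of booleans, one per entity in the cluster,
    answering "Is X of type Y?" for the chosen hypernym Y."""
    return sum(entity_judgments) / len(entity_judgments)

print(mean_reciprocal_rank([1, 2, None, 1]))          # -> 0.625
print(cluster_precision([True, True, True, False]))   # -> 0.75
```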
Application: Corpus Summary (Music domain)
The concepts are very specific to the domain, and the hypernyms are generated automatically.
- Instruments: Flute, Tuba, String orchestra, Chimes, Harmonium, Bassoon, Woodwinds, Glockenspiel, French horn, Timpani, Piano
- Intervals: Whole tone, Major sixth, Fifth, Perfect fifth, Seventh, Third, Diminished fifth, Whole step, Fourth, Minor seventh, Major third, Minor third
- Genres: Smooth jazz, Gothic, Metal rock, Rock, Pop, Hip hop, Rock n roll, Country, Folk, Punk rock
- Audio equipment: Audio editor, General midi synthesizer, Audio recorder, Multichannel digital audio workstation, Drum sequencer, Mixers, Music engraving system, Audio server, Mastering software, Soundfont sample player

Conclusions & Future Work
- We proposed an unsupervised information extraction method that extracts sets of entities from HTML tables on the web.
- Efficient clustering algorithm, O(N * log N), with high cluster purity (0.83-0.99).
- Unsupervised hypernym recommendation with intra-cluster precision of 97-99%.
- Future work: extend this technique to unsupervised extraction and naming of relations.

Thank You!

Evaluation: Hypernym Recommendation
WebSets (WS) vs. the Van Durme and Pasca method (DPM, AAAI'08).

| Method | K   | J   | %Accuracy | Yield (#pairs produced) | #Correct pairs (predicted) |
|--------|-----|-----|-----------|-------------------------|----------------------------|
| DPM    | Inf | 0.0 | 34.6      | 88.6K                   | 30.7K                      |
| DPM    | 5   | 0.2 | 50.0      | 0.8K                    | 0.4K                       |
| DPMExt | Inf | 0.0 | 21.9      | 100,828.0K              | 22,081.3K                  |
| DPMExt | 5   | 0.2 | 44.0      | 2.8K                    | 1.2K                       |
| WS     | -   | -   | 67.7      | 73.7K                   | 45.8K                      |
| WSExt  | -   | -   | 78.8      | 64.8K                   | 51.1K                      |

- DPM: outputs a concept-instance pair only if it appears in the Hyponym Concept Dataset (HCD), the number of clusters the concept is assigned to is < K, and the fraction of entities in the cluster that match the concept is > J.
- DPMExt: outputs a concept-instance pair even if it is absent from the HCD.
- WS: ranks the hypernyms by the number of entities each one matches.
- WSExt: filters the Hyponym Concept Dataset by a threshold on count.

More Statistics about the Datasets

| Dataset          | #Tables | %Relational | #Filtered tables | %Relational (filtered) | #Triplets |
|------------------|---------|-------------|------------------|------------------------|-----------|
| Toy_Apple        | 2.6K    | 50          | 762              | 75                     | 15K       |
| Delicious_Sports | 146.3K  | 15          | 57.0K            | 55                     | 63K       |
| Delicious_Music  | 643.3K  | 20          | 201.8K           | 75                     | 93K       |
| CSEAL_Useful     | 322.8K  | 30          | 116.2K           | 80                     | 1148K     |
| ASIA_NELL        | 676.9K  | 20          | 233.0K           | 55                     | 421K      |
| ASIA_INT         | 621.3K  | 15          | 216.0K           | 60                     | 374K      |
| Clueweb_HPR      | 586.9K  | 10          | 176.0K           | 35                     | 78K       |

Triplet record vs. entity record (Toy_Apple dataset)

| Method  | K  | FM w/ Entity records | FM w/ Triplet records |
|---------|----|----------------------|-----------------------|
| WebSets | -  | 0.11 (K=25)          | 0.85 (K=34)           |
| K-Means | 30 | 0.09                 | 0.35                  |
| K-Means | 25 | 0.08                 | 0.38                  |

- Triplet records can separate "Apple as a fruit" from "Apple as a company".

Application: Contributions to the Existing NELL KB
- Entity sets produced by WebSets are mapped to NELL categories: entities in each cluster → contexts in np-context data from Clueweb → contexts mapped to NELL categories (using useful patterns for each NELL category).
- score(NELL concept) = sum of the TFIDF scores of the matching contexts, where a context's TFIDF score depends on the number of entities in the cluster it matches (generality) and the number of concepts it is associated with (specificity). A sketch of this scoring appears after the table below.

| Shard | #classes | #pairs | %acc (1-20) | %acc (1-100) | %acc (1-500) | %acc (1-1000) | %acc (1-10000) |
|-------|----------|--------|-------------|--------------|--------------|---------------|----------------|
| 100K  | 13       | 944    | 100.00      | 80.56        | 81.82        | 72.97         | 65.93          |
| 200K  | 3        | 542    | 100.00      | 86.49        | 77.78        | 75.00         | 76.40          |
| 300K  | 24       | 2236   | 100.00      | 89.74        | 79.31        | 72.72         | 67.71          |
| 400K  | 44       | 4456   | 93.75       | 80.65        | 80.85        | 74.63         | 67.86          |
| 500K  | 34       | 9194   | 100.00      | 83.78        | 85.96        | 77.03         | 73.40          |
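A hedged sketch of the cluster-to-NELL-category scoring described above. It assumes the TF part of a context's weight is the number of cluster entities it matches (generality) and the IDF part penalizes contexts associated with many NELL concepts (specificity); the exact weighting used in the experiment may differ, and `score_nell_concepts` and its input format are hypothetical.

```python
# Hedged sketch of mapping a WebSets cluster to NELL categories by summing
# TFIDF-style weights of the contexts its entities occur in.

import math
from collections import defaultdict

def score_nell_concepts(cluster_entities, entity_contexts, context_concepts,
                        total_concepts):
    """entity_contexts: entity -> set of contexts (from np-context data);
    context_concepts: context -> set of NELL categories it indicates;
    total_concepts: total number of NELL categories considered."""
    # How many cluster entities each context matches (generality, "TF").
    tf = defaultdict(int)
    for entity in cluster_entities:
        for ctx in entity_contexts.get(entity, ()):
            tf[ctx] += 1
    scores = defaultdict(float)
    for ctx, count in tf.items():
        concepts = context_concepts.get(ctx, set())
        if not concepts:
            continue
        idf = math.log(total_concepts / len(concepts))   # specificity ("IDF")
        for concept in concepts:
            scores[concept] += count * idf               # sum of TFIDF scores
    # Ranked list of (NELL concept, score), highest-scoring concept first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```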