Language-Independent Class Instance Extraction Using the Web
Richard C. Wang
Thesis Committee: William W. Cohen (Chair), Robert E. Frederking, Tom M. Mitchell, Fernando Pereira (Google Research)
Language Technologies Institute, Carnegie Mellon University

Introduction to Class Instance Extraction

Challenge
- Discover class instances of any semantic class with minimum input from users
- x is an instance of class y if x is a (kind of) y
- Figure: example classes “Bags”, “Hair Styles”, and “Failed Banks”, each shown with its extracted instances
- These are real inputs and outputs from a system called ASIA, described in this thesis

Applications
- Concept and relation learning (Cohen, 2000) (Etzioni et al., 2005) (Cafarella et al., 2005)
- Co-reference resolution (McCarthy & Lehnert, 1995)
- Improvements for Question Answering (Pantel & Ravichandran, 2004) (Wang et al., 2008)
- Query refinement in Web search (Pasca, 2004)
- Weakly-supervised learning for NER (Nadeau et al., 2006) (Talukdar et al., 2008)
- Extensions to WordNet (Snow et al., 2006) (Wang & Cohen, 2009)

Thesis Statement
The World Wide Web is a vast and readily available repository of factual information, such as semantic classes (e.g., fruits), their instances (e.g., orange, banana), and relations between them. There are many semi-structured documents on the Web that provide evidence about these facts. The thesis of this work is that many of these facts can be revealed using tools built on set expansion. More generally, we believe that statistics, aggregation, and simple analysis of the documents are enough to discover frequent common classes not only in English, but in other languages as well.
What is Set Expansion?
- For example, given a query (the seeds): { survivor, amazing race }
  the answer is: { american idol, big brother, ... }
- More formally, given a small number of seed instances x_1, x_2, …, x_k where each x_i ∈ S,
  the answer is a listing of other probable instances e_1, e_2, …, e_n where each e_i ∈ S
- A well-known system is Google Sets™: http://labs.google.com/sets

Outline
How to…
1. expand a set of instances?
2. expand noisy instances from QA systems?
3. bootstrap set expansion?
4. extract instances given only the class name?
5. improve accuracy by using two languages?
6. expand class-instance relations in pairs?

How to expand a set of instances?

Our Set Expander – SEAL
- Set Expander for Any Language (Wang & Cohen, ICDM 2007)
- Features
  - Independent of human & markup language
    - Supports seeds in English, Chinese, Japanese, Korean, ...
    - Accepts documents in HTML, XML, SGML, TeX, WikiML, …
  - Does not require pre-annotated training data
  - Utilizes a readily available corpus: the World Wide Web
- Research contributions
  - Auto-construct wrappers for extracting candidate instances
  - Rank candidates using random walk

SEAL’s Pipeline
- Figure: the seeds Canon, Nikon, Olympus are expanded to Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …
- Fetcher: Download web pages containing all seeds
- Extractor: Construct wrappers for extracting candidate items
- Ranker: Rank candidate items using random walk

The Fetcher
Procedure:
1. Compose a search query by concatenating all seeds
2. Query Google to request top 100 URLs
3. Fetch web pages and send to the Extractor
Example:
- Seeds: Boston, Seattle, Carnegie-Mellon
- Query: “Boston Seattle Carnegie-Mellon”

The Extractor
- A wrapper is a pair of left (L) and right (R) context strings
  - Maximally-long contextual strings that bracket at least one instance of every seed
  - Extracts all strings between L and R
  - A wrapper derived from page p is only applied to p
- Learns character-level wrappers from semi-structured documents
  - No tokenization required (language-independent)

Simple Extractor
- Simple Extractor finds maximally-long contexts that bracket all instances of every seed
- Page excerpt:
    <li class="ford"><a href="http://www.curryauto.com/"> …
    <li class="honda"><a href="http://www.curryauto.com/"> …
    <li class="toyota"><a href="http://www.geisauto.com/"> …
    <li class="acura"><a href="http://www.curryauto.com/"> …
    <li class="nissan"><a href="http://www.curryauto.com/"> …
    <li class="toyota"><a href="http://www.curryauto.com/"> …
- Callout: “It seems to be working… but what if I add one more instance of ‘toyota’?”
- Callout: “It seems to be working too… but how about a real example?”
Proposed Extractor (PE)
- Callout: “Can you find common contexts that bracket all instances of every seed?” “I guess not! Let’s try our Proposed Extractor and see if it works…”
- PE finds maximally-long contexts that bracket at least one instance of every seed
- Page excerpt:
    <li class="ford"><a href="http://www.curryauto.com/">
    <img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
    <ul><li class="last"><a href="http://www.curryauto.com/">
    <span class="dName">Curry Ford</span>...</li></ul>
    </li>
    <li class="honda"><a href="http://www.curryauto.com/">
    <img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
    <ul><li><a href="http://www.curryhonda-ga.com/">
    <span class="dName">Curry Honda Atlanta</span>...</li>
    <li><a href="http://www.curryhondamass.com/">
    <span class="dName">Curry Honda</span>...</li>
    <li class="last"><a href="http://www.curryhondany.com/">
    <span class="dName">Curry Honda Yorktown</span>...</li></ul>
    </li>
    <li class="acura"><a href="http://www.curryauto.com/">
    <img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
    <ul><li class="last"><a href="http://www.curryacura.com/">
    <span class="dName">Curry Acura</span>...</li></ul>
    </li>
    <li class="nissan"><a href="http://www.curryauto.com/">
    <img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
    <ul><li class="last"><a href="http://www.geisauto.com/">
    <span class="dName">Curry Nissan</span>...</li></ul>
    </li>
    <li class="toyota"><a href="http://www.curryauto.com/">
    <img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
    <ul><li class="last"><a href="http://www.geisauto.com/toyota/">
    <span class="dName">Curry Toyota</span>...</li></ul>
    </li>
- Callouts on extracted strings that are not car makers: “I am a noisy instance”, “Me too!”
- Callout: “Hooray! It seems like PE works! But how do we get rid of those noisy instances?”
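The PE rule above (maximally long left and right contexts that bracket at least one occurrence of every seed on a page) can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not SEAL's actual implementation: the function names, the `window` limit on context length, and the use of a regular expression to apply a wrapper are all choices made for this example.

```python
import re
from itertools import product
from os.path import commonprefix


def find_all(text, needle):
    """Return every start offset of needle in text."""
    offsets, start = [], text.find(needle)
    while start != -1:
        offsets.append(start)
        start = text.find(needle, start + 1)
    return offsets


def learn_wrappers(page, seeds, window=80):
    """Try one occurrence per seed and keep the shared (left, right) contexts.

    Hypothetical helper: enumerating every combination is exponential in the
    number of seeds, so this is only suitable for small examples.
    """
    occurrences = [find_all(page, seed) for seed in seeds]
    if not all(occurrences):
        return set()  # a wrapper needs at least one occurrence of every seed
    wrappers = set()
    for combo in product(*occurrences):
        lefts = [page[max(0, pos - window):pos] for pos in combo]
        rights = [page[pos + len(seed):pos + len(seed) + window]
                  for pos, seed in zip(combo, seeds)]
        # Longest common suffix of the left contexts (via reversed strings).
        left = commonprefix([s[::-1] for s in lefts])[::-1]
        # Longest common prefix of the right contexts.
        right = commonprefix(rights)
        if left and right:
            wrappers.add((left, right))
    return wrappers


def apply_wrapper(page, left, right):
    """Extract every string that sits between left and right on the same page."""
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + r")"
    return set(re.findall(pattern, page))
```

Because the contexts are compared character by character, no tokenizer is involved, which is what makes the approach independent of the page's human language. Note that `.` does not match newlines here, so extracted strings never span lines.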
The Ranker
- Build a graph that consists of a fixed set of…
  - Node types: { document, wrapper, instance }
  - Labeled directed edges: { contain, extract }
    - Each edge asserts that a binary relation r holds
    - Each edge has an inverse relation r⁻¹ (so the graph is cyclic)
- Perform Random Walk (RW) with restart (Tong et al., 2006)
- Figure: example graph with documents curryauto.com and northpointcars.com, Wrappers #1 to #4, and instances with their scores: “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo” 8.4%, “bmw” 8.4%. Edges are labeled contain and extract.

36 Evaluation Datasets

Initial Experiments
- Compare our proposed extractor (PE) to a simple extractor (SE)
  - SE finds maximally-long contextual strings that bracket all seed occurrences
- Compare random walk (RW) to a simple ranker based on wrapper frequency (WF)
  - WF scores instance i by the number of wrappers that extract i

Initial Experiments (Wang & Cohen, ICDM 2007)
MAP vs. Various Configurations (with 3 seeds)
  Configuration    Mean Average Precision (%)
  Google Sets      35.9
  SE+WF            82.4
  PE+WF            87.6
  PE+RW            93.1

Alternative Rankers
Compare RW to the following four rankers:
1. PR – PageRank (Page et al., 1998)
   Graph-based approach designed to rank web pages
2. BS – Bayesian Sets (Ghahramani and Heller, 2005)
   Formulates set expansion as a Bayesian inference problem
3. WL – Wrapper Length
   Scores instance i by the length of wrappers that extract i
4. WF – Wrapper Frequency
   Scores instance i by the number of wrappers that extract i

Alternative Rankers
- Figure: results for the alternative rankers

HTML Wrappers
- PE is the character-level wrapper for SEAL
- Compare PE to 4 types of HTML wrappers (H1 to H4)
  - H1 is the least strict HTML wrapper, but still more strict than PE
  - H4 is the most strict
  - PE is therefore less strict than any HTML wrapper

HTML Wrappers (Wang & Cohen, EMNLP 2009)
- Figure: Performance vs. Wrapper Types (with 2 seeds). Y-axis: Mean Average Precision (%), 70 to 95. X-axis: wrapper types PE+RW, H1+RW, H2+RW, H3+RW, H4+RW.

How to expand noisy instances from QA systems?

Task
- Automatically expand (and improve) answers generated by Question Answering systems for list questions
- An example of a list question: Name cities that have Starbucks
  - QA answers: Boston, Seattle, Carnegie-Mellon, Aquafina, Google, Logitech
  - Expanded answers: Seattle, Boston, Chicago, Pittsburgh, Carnegie-Mellon, Google

Challenge
- SEAL requires correct seeds, but answers produced by QA systems are often noisy
- To integrate the two, we propose Noise-Resistant SEAL (Wang et al., EMNLP 2008)
- Three extensions to SEAL:
1. Aggressive Fetcher (AF)
2. Lenient Extractor (LE)
3. Hinted Expander (HE)

Aggressive Fetcher
- Sends a two-seed query for every possible pair of seeds to the search engines
- More likely to compose queries containing only relevant seeds
- Example:
  - Seeds: Boston, Seattle, Carnegie-Mellon
  - Queries: “Boston Seattle”, “Boston Carnegie-Mellon”, “Seattle Carnegie-Mellon”

Lenient Extractor
- Maximally-long contextual strings that bracket at least one instance of a minimum of two seeds
- More likely to find useful contexts that bracket only relevant seeds
- Example text:
    ... in Boston City Hall ...
    ... in Seattle City Hall ...
    ... at Boston University ...
    ... at Seattle University ...
    ... at Carnegie-Mellon University ...
- Learned wrapper (w/o LE): at <blah> University
- Learned wrappers (w/ LE): at <blah> University; in <blah> City Hall

Hinted Expander
- Utilizes contexts in the question to constrain the search space of SEAL on the Web
  - Extracts up to three keywords from the question
  - Appends the keywords to the search queries
- For example,
  - Question: Name cities that have Starbucks
  - Query: “Boston Seattle cities Starbucks”
- More likely to find documents containing the desired set of answers
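The query composition used by the basic Fetcher, the Aggressive Fetcher, and the Hinted Expander can be sketched as follows. The function names are hypothetical, and the keyword extraction step is not modeled: the keywords are assumed to have been picked from the question already. Only query strings are built; no search engine is contacted.

```python
from itertools import combinations


def basic_query(seeds):
    """Basic Fetcher: one query that concatenates all seeds."""
    return " ".join(seeds)


def aggressive_queries(seeds):
    """Aggressive Fetcher: one two-seed query for every possible pair of seeds."""
    return [" ".join(pair) for pair in combinations(seeds, 2)]


def hinted_queries(seeds, keywords, max_keywords=3):
    """Hinted Expander: append up to three question keywords to each pair query."""
    hint = " ".join(keywords[:max_keywords])
    return [f"{query} {hint}" for query in aggressive_queries(seeds)]
```

With the noisy seed set from the slides, one of the three pair queries ("Boston Seattle") contains only relevant seeds, whereas the single concatenated query always carries the noisy seed along.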
Experiment #1: Ephyra
- QA system: Ephyra (Schlaefer et al., TREC 2007)
  - Outputs a list of answers ranked by confidence scores
- Evaluate on TREC 13, 14, and 15 datasets
  - 55, 93, and 89 list questions respectively
- Use SEAL to expand top four answers from Ephyra
- For each dataset, we report:
  - Mean Average Precision (MAP)
  - Average F1 with Optimal Per-Question Threshold
    - For each question, cut off the list at a threshold which maximizes the F1 score for that particular question

Experiment #1: Ephyra (Wang et al., EMNLP 2008)
- Figure: two panels, Mean Average Precision (%) and Avg. F1 (%) with Optimal Per-Question Threshold, for each TREC dataset (TREC 13, TREC 14, TREC 15). Systems: Ephyra, Ephyra's Top 4, SEAL, SEAL+LE, SEAL+LE+AF, SEAL+LE+AF+HE.

Experiment #2: Ephyra
- In practice, thresholds are unknown
- For each dataset, do 5-fold cross validation:
  - Train: Find one optimal threshold over four folds
  - Test: Use the threshold to evaluate the fifth fold
- Introduce a fourth dataset: All
  - Union of TREC 13, 14, and 15
- Introduce another system: Hybrid
  - Intersection of original answers from Ephyra and expanded answers from SEAL

Experiment #2: Ephyra (Wang et al., EMNLP 2008)
- Figure: F1 with Trained Threshold. Y-axis: Avg. F1 (%), 12% to 32%. X-axis: TREC dataset (TREC 13, TREC 14, TREC 15, All). Systems: Ephyra, SEAL+LE+AF+HE, Hybrid.
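The two per-question measures reported above can be sketched as follows. This is a minimal illustration, assuming the standard definition of average precision (normalized by the number of correct answers) and assuming that a confidence threshold can be modeled as a rank cutoff on the sorted answer list; the exact definitions used in the experiments may differ in detail, and the function names are made up for this example.

```python
def average_precision(ranked, correct):
    """Average of precision@rank over the ranks where a new correct item appears."""
    hits, total, seen = 0, 0.0, set()
    for rank, item in enumerate(ranked, start=1):
        if item in correct and item not in seen:
            seen.add(item)
            hits += 1
            total += hits / rank
    return total / len(correct) if correct else 0.0


def f1_at_cutoff(ranked, correct, k):
    """F1 of the top-k answers against the set of correct answers."""
    predicted = set(ranked[:k])
    true_positives = len(predicted & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)


def optimal_f1(ranked, correct):
    """Best F1 over all cutoffs: the 'optimal per-question threshold' setting."""
    return max((f1_at_cutoff(ranked, correct, k)
                for k in range(1, len(ranked) + 1)), default=0.0)


def mean(values):
    """Mean over questions, e.g. MAP = mean of per-question average precision."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

The trained-threshold setting of Experiment #2 would replace `optimal_f1` with a single cutoff chosen on the training folds and applied unchanged to the held-out fold.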
Experiment: Top QA Systems
- Top five QA systems that perform the best on list questions in TREC 15 evaluation:
1. Language Computer Corporation (lccPA06)
2. The Chinese University of Hong Kong (cuhkqaepisto)
3. National University of Singapore (NUSCHUAQA1)
4. Fudan University (FDUQAT15A)
5. National Security Agency (QACTIS06C)
- For each QA system, train thresholds for SEAL and Hybrid on the union of TREC 13 and 14
- Expand top four answers from the QA systems on TREC 15, and apply the trained threshold

Experiment: Top QA Systems (Wang et al., EMNLP 2008)
- Figure: F1 with Trained Threshold. Y-axis: Average F1 (%). X-axis: QA system (lccPA06, cuhkqaepisto, NUSCHUAQA1, FDUQAT15A, QACTIS06C). Systems: Baseline Top 4 Ans., Google Sets, SEAL+LE+AF+HE, Hybrid.

How to bootstrap set expansion?

Limitation of SEAL
- Figure: Preliminary Study on Seed Sizes. Y-axis: Mean Average Precision, 75% to 85%. X-axis: # Seeds (Seed Size), 2 to 6. Rankers: RW, PR, BS, WL.
  - Evaluated using Mean Average Precision on 36 datasets
  - For each dataset, we randomly pick n seeds (and repeat 3 times)
- Performance drops significantly when given more than 5 seeds
  - The Fetcher downloads web pages that contain all seeds
  - However, not many pages have more than 5 seeds
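For reference, the random walk with restart (RW) ranker that appears in the comparison above can be sketched on the document/wrapper/instance graph. This is a simplified illustration, not the thesis implementation: it assumes the walker moves to a uniformly chosen neighbor, restarts at the seed instances, uses a restart probability of 0.15, and runs a fixed number of power-iteration steps. All names are hypothetical.

```python
from collections import defaultdict


def build_graph(extractions):
    """Build an undirected view of the graph from (document, wrapper, instance) triples.

    Every 'contain' and 'extract' edge is stored together with its inverse,
    so the walker can move in both directions.
    """
    neighbors = defaultdict(set)
    for document, wrapper, instance in extractions:
        doc_node = ("document", document)
        wrap_node = ("wrapper", wrapper)
        inst_node = ("instance", instance)
        for a, b in ((doc_node, wrap_node), (wrap_node, inst_node)):
            neighbors[a].add(b)  # relation r (contain / extract)
            neighbors[b].add(a)  # inverse relation r^-1
    return neighbors


def random_walk_with_restart(neighbors, seeds, restart=0.15, steps=50):
    """Approximate the visiting probabilities by power iteration."""
    start_nodes = [("instance", s) for s in seeds if ("instance", s) in neighbors]
    if not start_nodes:
        raise ValueError("none of the seeds occurs in the graph")
    start = {node: 1.0 / len(start_nodes) for node in start_nodes}
    prob = dict(start)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in prob.items():
            share = (1.0 - restart) * mass / len(neighbors[node])
            for neighbor in neighbors[node]:
                nxt[neighbor] += share
        for node, mass in start.items():
            nxt[node] += restart * mass  # jump back to the seeds
        prob = nxt
    return prob


def rank_instances(prob):
    """Sort instance nodes by their probability, highest first."""
    scores = {name: p for (kind, name), p in prob.items() if kind == "instance"}
    return sorted(scores, key=scores.get, reverse=True)
```

An instance reached through several wrappers accumulates more probability mass than one extracted by a single wrapper, which is the intuition behind preferring RW over plain wrapper frequency.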
Proposed Solution – iSEAL
- iterative SEAL (Wang & Cohen, ICDM 2008)
- iSEAL makes several calls to SEAL; in each call (or iteration), it:
  - Expands a few seeds
  - Aggregates statistics
- We evaluated iSEAL using…
  - Two iterative processes
  - Two seeding strategies
  - Five ranking methods

Iterative Process & Seeding Strategy
Iterative Processes
1. Supervised
   At every iteration, seeds are obtained from a reliable source (e.g., a human)
2. Bootstrapping
   At every iteration, seeds are selected from candidate items (except the 1st iteration)
Seeding Strategies
1. Fixed Seed Size
   Uses 2 seeds at every iteration
2. Increasing Seed Size
   Starts with 2 seeds, then 3 seeds for the next iteration, and fixed at 4 seeds afterwards

Fixed Seed Size (Supervised)
- Figure: diagram of the iterative process, starting from the Initial Seeds

Fixed Seed Size (Supervised) (Wang & Cohen, ICDM 2008)
- Figure: Mean Average Precision (89% to 98%) vs. # Iterations (Cumulative Expansions), 1 to 10. Rankers: RW, PR, BS, WL, WF.

Fixed Seed Size (Bootstrap)
- Figure: diagram of the iterative process, starting from the Initial Seeds

Fixed Seed Size (Bootstrap) (Wang & Cohen, ICDM 2008)
- Figure: Mean Average Precision (86% to 92%) vs. # Iterations (Cumulative Expansions), 1 to 10. Rankers: RW, PR, BS, WL, WF.
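The bootstrapping process and the two seeding strategies can be sketched as a loop around a single-call expander. Several details here are assumptions made for illustration: `expand` is a hypothetical stand-in for one SEAL call that returns candidate scores, "aggregating statistics" is modeled as summing scores across iterations, and new seeds are taken to be the highest-ranked candidates that have not been used as seeds before.

```python
def seed_size(iteration, strategy):
    """Number of seeds to use at a given iteration (0-based)."""
    if strategy == "fixed":
        return 2
    # Increasing seed size: 2 seeds, then 3, then fixed at 4 afterwards.
    return min(2 + iteration, 4)


def iseal_bootstrap(initial_seeds, expand, iterations=10, strategy="increasing"):
    """Call `expand` repeatedly, feeding top-ranked candidates back in as seeds.

    `expand(seeds)` must return a dict mapping candidate -> score.
    """
    totals = {}   # statistics aggregated over all iterations
    used = []     # seeds already consumed
    seeds = list(initial_seeds[:seed_size(0, strategy)])
    for iteration in range(iterations):
        used.extend(s for s in seeds if s not in used)
        for item, score in expand(seeds).items():
            totals[item] = totals.get(item, 0.0) + score
        ranked = sorted(totals, key=totals.get, reverse=True)
        fresh = [candidate for candidate in ranked if candidate not in used]
        seeds = fresh[:seed_size(iteration + 1, strategy)]
        if not seeds:
            break  # nothing new to expand
    return sorted(totals, key=totals.get, reverse=True)
```

The supervised process would differ only in where the next seeds come from: instead of reading them off the ranked candidates, each iteration would draw them from a trusted source.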
Increasing Seed Size (Bootstrap)
- Figure: diagram of the iterative process, showing Initial Seeds and Used Seeds

Increasing Seed Size (Bootstrapping) (Wang & Cohen, ICDM 2008)
- Figure: Mean Average Precision (89% to 94%) vs. # Iterations (Cumulative Expansions), 1 to 10. Rankers: RW, PR, BS, WL, WF.

How to extract instances given only the class name?

Proposed Approach – ASIA (Wang & Cohen, ACL 2009)
- Automatic Set Instance Acquirer (ASIA)
- Figure: pipeline. Semantic Class Name → Noisy Instance Provider → Noisy Instances → Noisy Instance Expander → Some Instances → Bootstrapper → More Instances

Noisy Instance Provider (NP)
- Manually constructed hyponym patterns based on Marti Hearst’s work in 1992
- Query search engines for each hyponym pattern + a class name
  - e.g., “car makers such as”
- Extract all candidates I from returned web snippets
  - A snippet often contains multiple excerpts
- Rank each candidate i in I based on
  - # of patterns, snippets, and excerpts containing i (more = better)
  - # of characters between i and the class name C in every excerpt (fewer = better)
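A rough sketch of the Noisy Instance Provider's steps follows. The three patterns are illustrative examples in the style of Hearst (1992), not the list used in ASIA, and the scoring formula (each occurrence contributes 1 / (1 + distance to the class name)) is an assumption that merely respects the two ranking criteria above. Excerpts are passed in directly; no search engine is queried.

```python
import re
from collections import defaultdict

# Illustrative hyponym patterns; {c} is replaced by the class name.
PATTERNS = ["{c} such as", "{c} including", "{c} especially"]
# A candidate runs until the next comma, "and", "or", or the end of the clause.
SEPARATOR = re.compile(r"(.+?)(?:,|\band\b|\bor\b|$)")


def build_queries(class_name):
    """One quoted phrase query per hyponym pattern."""
    return ['"' + pattern.format(c=class_name) + '"' for pattern in PATTERNS]


def extract_candidates(excerpt, class_name):
    """Return (candidate, distance) pairs found after a pattern in one excerpt.

    distance = number of characters between the class name and the candidate.
    """
    found = []
    lowered = excerpt.lower()
    for pattern in PATTERNS:
        phrase = pattern.format(c=class_name.lower())
        start = lowered.find(phrase)
        if start == -1:
            continue
        tail_begin = start + len(phrase)
        # Keep only the rest of the clause that follows the pattern.
        tail = re.split(r"[.;!?]", excerpt[tail_begin:], maxsplit=1)[0]
        gap = len(phrase) - len(class_name)
        for match in SEPARATOR.finditer(tail):
            piece = match.group(1)
            name = piece.strip()
            if name:
                lead = len(piece) - len(piece.lstrip())
                found.append((name.lower(), gap + match.start(1) + lead))
    return found


def rank_candidates(excerpts, class_name):
    """More occurrences and shorter distances both raise a candidate's score."""
    scores = defaultdict(float)
    for excerpt in excerpts:
        for candidate, distance in extract_candidates(excerpt, class_name):
            scores[candidate] += 1.0 / (1 + distance)
    return sorted(scores, key=scores.get, reverse=True)
```

The output is deliberately noisy: a clause such as "Ford were listed" is returned as a candidate alongside real instances, which is why the provider's output is passed on to an expander rather than used directly.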
Noisy Instance Expander (NE)
- The Extractor in NE is a variation of that used in SEAL
  - SEAL’s Extractor: Requires the longest common contexts to bracket at least one instance of every seed per web page
  - NE’s Extractor: Requires the common contexts that bracket the largest number of unique seeds to be as long as possible per web page
- Performs set expansion on web pages queried by a class name + some list words
  - List words are words that often appear on list-containing pages
  - Example query: “car makers” (list OR names OR famous OR common)

Bootstrapper (BS)
- Utilizes iSEAL (Wang & Cohen, ICDM 2008), an iterative version of SEAL
- iSEAL makes several calls to SEAL, where in each call, iSEAL…
  - expands a few seeds, and
  - aggregates statistics
- Configured to bootstrap with increasing seed size

Evaluation Datasets
- 36 datasets, with each of their class names used as input to ASIA

Evaluation Results (Wang & Cohen, ACL 2009)
- Figure: MAP Performance vs. System Configurations. Y-axis: Mean Average Precision (MAP), 0.00 to 1.00. Groups: English, Chinese, Japanese. Configurations per group: NP, NP+BS, NP+NE, NP+NE+BS.

Comparison to: Kozareva, Riloff, and Hovy, ACL 2008
- Input to Kozareva: a class name + a seed
Comparison to: Talukdar et al., EMNLP 2008; Van Durme & Pasca, KI 2008
- Figure: Comparisons of Instance Extraction Performance. Y-axis: Precision @ 100, 30 to 100. Classes: Book Publishers, Federal Agencies, NFL Players, Scientific Journals, Mammals. Systems: Talukdar et al., 2008; Van Durme & Pasca, 2008; ASIA.

Comparison to: Snow et al., ACL 2006
- Definition:
  - Original WN – WordNet 2.1
  - Extended WN – Snow’s (+30K) extension of WordNet 2.1
- Selecting semantic classes for evaluation:
  - In the Extended WN hierarchy, focus on leaf semantic classes extended by Snow that have ≥ 3 instances
  - Filter out a class if the instances from ASIA do not overlap with more than half of its instances in the Original WN
  - Randomly select a dozen of the remaining classes

Comparison to: Snow et al., ACL 2006
- Figure: comparison results
How to improve accuracy by using two languages?

Proposed Solution – Bilingual SEAL
- Utilizes redundant information to minimize the chance of choosing incorrect seeds when bootstrapping
- Expands two sets of instances alternately by using two separate iSEAL instances, where both sets represent the same class but each in a different language
  - e.g., Disney movies in English and Chinese
- Verifies the correctness of a candidate instance by using ANET (Automatic Named Entity Translator)

Picking a good seed
- Use translations of instances to select high-quality seeds
- We select an instance from the (i-2)th iteration whose translation exists in the (i-1)th iteration, to be used as a seed for the ith iteration
- Expansions are cumulative for each language

Translating instances
- Uses bilingual snippets as a resource
- Ranks chunks in the target language based on how frequently and closely they co-occur with the input string
  - A chunk is any sequence of characters surrounded by punctuation marks or foreign characters

Experiments
- Evaluate bilingual bootstrapping using…
1. Chinese & English
2. Japanese & English
- Present MAP performance of (e.g., for Chinese & English):
  - CBB – Chinese results of the bilingual bootstrapping
  - EBB – English results of the bilingual bootstrapping
  - CMB – (Monolingual) bootstrapping in only Chinese
  - EMB – (Monolingual) bootstrapping in only English

Experimental Results
- Figure: experimental results of bilingual bootstrapping

How to expand class-instance relations in pairs?

Proposed Solution – Binary SEAL
- SEAL was designed to extract unary relations (e.g., x is a CEO)
- Binary SEAL extracts binary relations (e.g., x is the CEO of company y)
  - Discovers instance pairs having the same relation as the seed pairs
- Real example (output shown at the right):
  - Seed #1: Bill Gates <-> Microsoft
  - Seed #2: Larry Page <-> Google

Binary Extractor
- Original Extractor learns unary wrappers
  - A unary wrapper consists of left and right context strings
  - Extracts all instances that have the same left and right context as the seeds
- Binary Extractor learns binary wrappers
  - A binary wrapper has an additional middle context string
  - Extracts all instance-pairs that have the same left, middle, and right context as the seed-pairs
    [left context] Bill Gates [middle context] Microsoft [right context]
    [left context] Sergey Brin [middle context] Google [right context]

Real Binary Wrappers
- Acronym vs. Full Name of Federal Agencies
  - Seed #1: CIA <-> Central Intelligence Agency
  - Seed #2: USPS <-> United States Postal Service
- Figure: learned wrappers, shown in columns Left Context, Middle Context, Right Context

Experiments
- Manually constructed five datasets
- Bootstrap results ten times using iSEAL, the iterative version of SEAL
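A binary wrapper with left, middle, and right contexts can be sketched as follows. This is a simplified illustration rather than Binary SEAL itself: it only looks at the first occurrence of each seed pair, requires the middle context to be identical for all pairs, and uses a regular expression to apply the wrapper. The function names, the `window` limit, and the toy page in the usage note are assumptions.

```python
import re
from os.path import commonprefix


def learn_binary_wrapper(page, seed_pairs, window=60):
    """Return (left, middle, right) shared by all seed pairs, or None."""
    lefts, middles, rights = [], [], []
    for first, second in seed_pairs:
        a = page.find(first)
        if a == -1:
            return None
        b = page.find(second, a + len(first))
        if b == -1:
            return None
        lefts.append(page[max(0, a - window):a])
        middles.append(page[a + len(first):b])
        rights.append(page[b + len(second):b + len(second) + window])
    if len(set(middles)) != 1 or not middles[0]:
        return None  # the middle context must be the same for every pair
    left = commonprefix([s[::-1] for s in lefts])[::-1]  # longest common suffix
    right = commonprefix(rights)                         # longest common prefix
    if not left or not right:
        return None
    return left, middles[0], right


def apply_binary_wrapper(page, left, middle, right):
    """Extract every (x, y) pair bracketed by the three contexts."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)(?=" + re.escape(right) + r")")
    return re.findall(pattern, page)
```

On a small list page that contains the two federal-agency seed pairs plus one more row, the learned wrapper also picks up the unseen pair. The right context is matched with a lookahead so that it can overlap with the left context of the next row.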
Experiments
- RE is the character-level wrapper for Binary SEAL
- Compare RE to 4 types of HTML wrappers (R1 to R4)
  - R1 is the least strict HTML wrapper, but still more strict than RE
  - R4 is the most strict
  - RE is therefore less strict than any HTML wrapper

Experimental Results (Wang & Cohen, EMNLP 2009)
- Figure: Performance vs. Wrapper Types. Y-axis: Mean Average Precision (%), 50 to 95. X-axis: wrapper types 1 to 5 (RE, R1, R2, R3, R4; 1 is least strict). Series: - Bootstrap, + Bootstrap.

Conclusion and Future Work

Conclusion
- Semi-structured documents provide substantial evidence for discovering class instances
- Set expansion at the character level performs better than at the HTML level on semi-structured documents
- Set expansion can be used as a tool for improving the accuracy of QA systems and for extending WordNet
- Random walk is an effective ranker for set expansion
- Expansion performance can be improved by exploiting redundant information of classes in different languages
- Like unary relations, binary relations can be expanded using similar techniques

Future Work
Develop techniques to automatically…
- verify correctness of candidate instances using distributional similarity in free text
- classify candidate instances as either subclass or instance names
- partition expanded instances into subclasses
- identify concept names given example instances

The End – Thank You!!!
- Thank You, William, for your guidance since the SLIF project in the Summer of 2002
- Thank You, Bob, for your guidance since the RADD project in the Spring of 2003
- Thank You, Tom and Fernando, for all the comments and support during my thesis