Graph-Based Methods for “Open Domain” Information Extraction

William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science, Carnegie Mellon University
Joint work with Richard Wang

Traditional IE vs. Open-Domain IE

Traditional IE:
• Goal: recognize people, places, companies, times, dates, … in NL text
• Supervised learning from a corpus completely annotated with the target entity class (e.g., “people”)
• Linear-chain CRFs
• Language- and genre-specific extractors

Open-Domain IE:
• Goal: recognize arbitrary entity sets in text
  – from very large corpora (WWW)
  – with minimal info about the entity class
  – Example 1: “ICML, NIPS”
  – Example 2: “machine learning conferences”
• Semi-supervised learning
• Graph-based learning methods
• Techniques are largely language-independent (!)
  – the graph abstraction fits many languages

Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone

History: Open-Domain IE by Pattern-Matching (Hearst, 92)
• Start with seeds: “NIPS”, “ICML”
• Look through a corpus for certain patterns:
  – “… at NIPS, AISTATS, KDD and other learning conferences …”
  – “… on PC of KDD, SIGIR, … and …”
• Expand from seeds to new instances
• Repeat … until ___

Bootstrapping as Graph Proximity
[Diagram: seed nodes NIPS and ICML connect through context nodes – “…at NIPS, AISTATS, KDD and other learning conferences…”, “For skiers, NIPS, SNOWBIRD,… and…”, “on PC of KDD, SIGIR, … and…”, “… AISTATS, KDD, …” – to new instances AISTATS, KDD, SNOWBIRD, SIGIR, …]
• shorter paths ~ earlier iterations
• many paths ~ additional evidence

Set Expansion for Any Language (SEAL) – (Wang & Cohen, ICDM 07)
• Basic ideas:
  – Dynamically build the graph using queries to the web
  – Constrain the graph to be as useful as possible
    • Be smart about queries
    • Be smart about “patterns”: use clever methods for finding meaningful structure on web pages

System Architecture
• Example expansion: 1. Canon, 2. Nikon, 3. Olympus, 4. Pentax, 5. Sony, 6. Kodak, 7. Minolta, 8. Panasonic, 9. Casio, 10. Leica, 11. Fuji, 12. Samsung, 13. …
• Fetcher: downloads web pages from the Web that contain all the seeds
• Extractor: learns wrappers from the web pages
• Ranker: ranks the entities extracted by the wrappers

The Extractor
• Learns wrappers from web documents and seeds on the fly
  – Utilizes semi-structured documents
  – Wrappers are defined at the character level
    • Very fast
    • No tokenization required; thus language-independent
  – Wrappers derived from document d are applied to d only
    • See the ICDM 2007 paper for details

Running example:
  .. Generally <a ref=“finance/ford”>Ford</a> sales … compared to <a ref=“finance/honda”>Honda</a> while <a href=“finance/gm”>General Motors</a> and <a href=“finance/bentley”>Bentley</a> ….

1. Find the prefix of each seed occurrence and put it in reverse order:
   • ford1: /ecnanif”=fer a> yllareneG …
   • Ford2: >”drof/ /ecnanif”=fer a> yllareneG …
   • honda1: /ecnanif”=fer a> ot derapmoc …
   • Honda2: >”adnoh/ /ecnanif”=fer a> ot …
2. Organize these into a trie, tagging each node with the set of seed occurrences it covers:

   (root)  {f1, f2, h1, h2}
     ├─ /ecnanif”=fer a>   {f1, h1}
     │    ├─ yllareneG …   {f1}
     │    └─ ot derapmoc …   {h1}
     └─ >”   {f2, h2}
          ├─ drof/ /ecnanif”=fer a> yllareneG ..   {f2}
          └─ adnoh/ /ecnanif”=fer a> ot ..   {h2}
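Steps 1–2 are easy to state concretely. Below is a minimal Python sketch of the idea, not Wang & Cohen’s implementation; the names (`reversed_contexts`, `build_trie`) and the depth cap are illustrative assumptions:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.covers = set()  # ids of seed occurrences passing through this node

def reversed_contexts(document, seeds):
    """Return (occurrence-id, reversed left context) for each seed occurrence.
    Matching is case-sensitive here; a real system would be more forgiving."""
    out = []
    for seed in seeds:
        start, n = 0, 0
        while (i := document.find(seed, start)) >= 0:
            n += 1
            out.append((f"{seed}{n}", document[:i][::-1]))  # e.g. ("ford1", "/ecnanif...")
            start = i + len(seed)
    return out

def build_trie(contexts, max_depth=60):
    """Merge the reversed contexts into a character trie, tagging each node
    with the occurrences it covers: O(#occurrences * max_depth) nodes."""
    root = TrieNode()
    for occ_id, ctx in contexts:
        node = root
        for ch in ctx[:max_depth]:
            node = node.children.setdefault(ch, TrieNode())
            node.covers.add(occ_id)
    return root
```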
3. A left context for a valid wrapper is a node tagged with one instance of each seed.
4. The corresponding right context is the longest common suffix of the corresponding seed instances:

   (root)  {f1, f2, h1, h2}
     ├─ /ecnanif”=fer a>   {f1, h1}  → right context ”>
     │    ├─ yllareneG …   {f1}   (f1 is followed by ”>Ford</a> sales …)
     │    └─ ot derapmoc …   {h1}   (h1 is followed by ”>Honda</a> while …)
     └─ >”   {f2, h2}  → right context </a>
          ├─ drof/ /ecnanif”=fer a> yllareneG ..   {f2}
          └─ adnoh/ /ecnanif”=fer a> ot ..   {h2}

Nice properties:
• There are relatively few nodes in the trie: O(#seeds × document length)
• You can tag every node with the complete set of seed occurrences it covers
• You can rank or filter nodes by any predicate over this set of seeds you want, e.g.:
  – covers all seed instances that appear on the page?
  – covers at least one instance of each seed?
  – covers at least k instances, instances with weight > w, …
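Reading valid wrappers off that trie can be sketched as follows; again this is illustrative rather than the paper’s exact algorithm, and it assumes the occurrence ids and trie of the previous sketch:

```python
def longest_common_prefix(strings):
    """Longest string that begins every string in the list."""
    if not strings:
        return ""
    lo, hi = min(strings), max(strings)  # lexicographic extremes suffice
    for i, ch in enumerate(lo):
        if ch != hi[i]:
            return lo[:i]
    return lo

def valid_wrappers(root, document, seeds, occ_end):
    """occ_end maps an occurrence id like 'ford1' to the offset just past it.
    Emits (left context, right context, covered occurrences) triples; every
    node on a qualifying path is emitted, so a real system would filter or
    rank candidates by predicates like the ones on the slide above."""
    found = []
    def walk(node, rev_path):
        covered = {occ.rstrip("0123456789") for occ in node.covers}  # 'ford1' -> 'ford'
        if covered == set(seeds):                 # one instance of each seed
            left = rev_path[::-1]                 # un-reverse the trie path
            right = longest_common_prefix(
                [document[occ_end[occ]:] for occ in node.covers])
            found.append((left, right, frozenset(node.covers)))
        for ch, child in node.children.items():
            walk(child, rev_path + ch)
    walk(root, "")
    return found
```

On the running example, the node tagged {f1, h1} should yield roughly the wrapper with left context <a ref=“finance/ and right context ”>.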
Even semi-structured pages contain noise:

  <li class="ford"><a href="http://www.curryauto.com/">   ← “I am noise”
   <img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
   <ul><li class="last"><a href="http://www.curryauto.com/">
    <span class="dName">Curry Ford</span>...</li></ul></li>
  <li class="honda"><a href="http://www.curryauto.com/">
   <img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
   <ul><li><a href="http://www.curryhonda-ga.com/">
    <span class="dName">Curry Honda Atlanta</span>...</li>   ← “Me too!”
   <li><a href="http://www.curryhondamass.com/">
    <span class="dName">Curry Honda</span>...</li>
   <li class="last"><a href="http://www.curryhondany.com/">
    <span class="dName">Curry Honda Yorktown</span>...</li></ul></li>
  <li class="acura"><a href="http://www.curryauto.com/">
   <img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
   <ul><li class="last"><a href="http://www.curryacura.com/">
    <span class="dName">Curry Acura</span>...</li></ul></li>
  <li class="nissan"><a href="http://www.curryauto.com/">
   <img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
   <ul><li class="last"><a href="http://www.geisauto.com/">
    <span class="dName">Curry Nissan</span>...</li></ul></li>
  <li class="toyota"><a href="http://www.curryauto.com/">
   <img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
   <ul><li class="last"><a href="http://www.geisauto.com/toyota/">
    <span class="dName">Curry Toyota</span>...</li></ul></li>

Differences from prior work
• Fast character-level wrapper learning
  – Language-independent
  – The trie structure allows flexibility in goals
    • cover one copy of each seed, cover all instances of the seeds, …
  – Works well for semi-structured pages
    • lists and tables, pull-down menus, JavaScript data structures, Word documents, …
• High-precision, low-recall data integration vs. high-precision, low-recall information extraction

The Ranker
• Ranks candidate entity mentions based on “similarity” to the seeds
  – Noisy mentions should be ranked lower
• Random Walk with Restart (GW) … ?

Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
[Diagram: a small graph of web sites linking to one another]
• Inlinks are “good” (recommendations)
• Inlinks from a “good” site are better than inlinks from a “bad” site
  – but inlinks from sites with many outlinks are not as “good” …
  – “good” and “bad” are relative
• Imagine a “pagehopper” that always either
  – follows a random link, or
  – jumps to a random page
• PageRank ranks pages by the amount of time the pagehopper spends on a page
  – or, if there were many pagehoppers, PageRank is the expected “crowd size”

Personalized PageRank (aka Random Walk with Restart)
• Imagine a “pagehopper” that always either
  – follows a random link, or
  – jumps to a particular page P0
• This ranks pages by the total number of paths connecting them to P0
  – … with each path downweighted exponentially with length

The Ranker (continued)
• Rank candidate entity mentions by Random Walk with Restart … but on what graph?

Building a Graph
[Diagram: seeds “ford”, “nissan”, “toyota” –find→ documents northpointcars.com and curryauto.com –derive→ Wrappers #1–#4 –extract→ mentions ranked “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%]
• The graph consists of a fixed set of …
  – node types: {seeds, document, wrapper, mention}
  – labeled directed edges: {find, derive, extract}
    • each edge asserts that a binary relation r holds
    • each edge has an inverse relation r⁻¹ (the graph is cyclic)
  – Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions.

Differences from prior work
• Graph-based distances vs. bootstrapping
  – The graph is constructed on-the-fly
    • so it’s not different?
  – But there is a clear principle for how to combine results from earlier/later rounds of bootstrapping
    • i.e., graph proximity
  – Fewer parameters to consider
  – Robust to “bad wrappers”

Evaluation Datasets: closed sets

Evaluation Method
• Mean Average Precision (MAP)
  – commonly used for evaluating ranked lists in IR
  – contains recall- and precision-oriented aspects
  – sensitive to the entire ranking
  – the mean of the average precisions for each ranked list L:

      AvgPrec(L) = (1 / #TrueEntities) × Σ_{r = 1..|L|} Prec(r) × NewEntity(r)

    where L = ranked list of extracted mentions, r = rank,
    Prec(r) = precision at rank r,
    NewEntity(r) = 1 if (a) and (b) are true, otherwise 0:
      (a) the extracted mention at rank r matches some true mention, and
      (b) no other extracted mention at rank less than r is of the same entity as the one at r,
    and #TrueEntities = the total number of true entities in this dataset.
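For concreteness, here is a small sketch of the MAP computation just defined; the `entity_of` oracle, which maps an extracted mention to its true entity (or `None` for a wrong extraction), is an assumption of the sketch:

```python
def average_precision(ranked_mentions, entity_of, num_true_entities):
    """AvgPrec(L) = (1/#TrueEntities) * sum over r of Prec(r) * NewEntity(r)."""
    seen, correct, total = set(), 0, 0.0
    for r, mention in enumerate(ranked_mentions, start=1):
        ent = entity_of(mention)
        if ent is None:
            continue               # (a) fails: matches no true mention
        correct += 1               # Prec(r) numerator counts all matching mentions
        if ent not in seen:        # (b) holds: first mention of this entity
            seen.add(ent)
            total += correct / r   # Prec(r), added only when NewEntity(r) = 1
    return total / num_true_entities

def mean_average_precision(runs):
    """Mean over several ranked lists, e.g. the five trials in the procedure below."""
    return sum(average_precision(*run) for run in runs) / len(runs)
```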
Evaluation Procedure (per dataset):
1. Randomly select three true entities and use their first listed mentions as seeds
2. Expand the three seeds obtained from step 1
3. Repeat steps 1 and 2 five times
4. Compute MAP for the five ranked lists

Experimental Results: 3 seeds
[Bar chart: overall MAP for methods of the form [Extractor] + [Ranker] + [Top N URLs], plus the Bayesian Sets baselines G.Sets and G.Sets (Eng); reported values include 94.18%, 94.03%, 93.13%, 87.61%, 82.39%, 43.76%, and 14.59%]
• Methods vary: [Extractor] + [Ranker] + [Top N URLs]
• Extractor:
  – E1: baseline extractor (longest common context for all seed occurrences)
  – E2: smarter extractor (longest common context for one occurrence of each seed)
• Ranker: { EF: baseline (most frequent), GW: graph walk }
• N URLs: { 100, 200, 300 }

Side-by-side comparisons
[Comparison with Talukdar, Brants, Liberman & Pereira, CoNLL 06]
[Comparison with Ghahramani & Heller, NIPS 2005: EachMovie vs. WWW, and NIPS vs. WWW]

Why does SEAL do so well?
• Hypotheses:
  – More information appears in semi-structured documents than in free text
  – More semi-structured documents can be (partially) understood with character-level wrappers than with HTML-level wrappers
• Free-text wrappers are only 10–15% of all wrappers learned:
  “Used [...] Van Pricing”, “Used [...] Engines”, “Bell Road [...]”, “Alaska [...] dealership”, “www.sunnyking[...].com”, “engine [...] used engines”, “accessories, [...] parts”, “is better [...] or”

Comparing character tries to HTML-based structures
[Comparison figure]

Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone

A limitation of the original SEAL
[Chart: preliminary study on seed sizes – MAP (roughly 75–85%) vs. number of seeds (2–6) for the RW, PR, BS, and WL rankers]

Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
• Makes several calls to SEAL; each call …
  – expands a couple of seeds
  – aggregates statistics
• Evaluate iSEAL using …
  – two iterative processes
    • supervised vs. unsupervised (bootstrapping)
  – two seeding strategies
    • fixed seed size vs. increasing seed size
  – five ranking methods

iSEAL (Fixed Seed Size, Supervised)
[Diagram: the initial seeds are expanded a few at a time, and the per-call graphs are merged]
• … finally, rank nodes by proximity to the seeds in the full graph
• Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, …
• Variant (Bootstrap): use high-confidence extractions when the seeds run out

Ranking Methods
• Random Walk with Restart (RW)
  – H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006.
• PageRank (PR)
  – L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
• Bayesian Sets (BS; over a flattened graph)
  – Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
• Wrapper Length (WL)
  – weights each item by the length of the common contextual string of that item and the seeds
• Wrapper Frequency (WF)
  – weights each item by the number of wrappers that extract it
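As a concrete reference point for the first two rankers, here is a minimal random-walk-with-restart scorer via power iteration over the seeds/document/wrapper/mention graph; the adjacency-list representation and uniform edge weights are simplifying assumptions, not SEAL’s actual data structures:

```python
def rwr_scores(adjacency, seeds, restart=0.15, iters=50):
    """adjacency: node -> list of neighbors; include the inverse edges so the
    walk can move both ways along find/derive/extract, as in the slides."""
    p0 = {n: 0.0 for n in adjacency}
    for s in seeds:
        p0[s] = 1.0 / len(seeds)            # restart at a random seed
    p = dict(p0)
    for _ in range(iters):
        nxt = {n: restart * p0[n] for n in adjacency}
        for node, mass in p.items():
            nbrs = adjacency[node]
            if mass == 0.0 or not nbrs:
                continue                    # dangling nodes simply drop mass here
            share = (1.0 - restart) * mass / len(nbrs)
            for nbr in nbrs:
                nxt[nbr] += share           # spread mass along the edges
        p = nxt
    return p                                # rank mention nodes by p[mention]
```

With a uniform restart vector p0 this reduces to plain PageRank; mention nodes score highly when many short paths connect them to the seeds, matching the “shorter paths / many paths” intuition from the bootstrapping slide.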
Fixed Seed Size (Supervised)
[Chart: MAP (≈89–98%) vs. # iterations (cumulative expansions, 1–10) for RW, PR, BS, WL, WF]

Fixed Seed Size (Bootstrap)
[Chart: MAP (≈86–92%) vs. # iterations (1–10) for RW, PR, BS, WL, WF]

Increasing Seed Size (Supervised)
[Chart: MAP (≈90–97%) vs. # iterations (1–10) for RW, PR, BS, WL, WF]

Increasing Seed Size (Bootstrapping)
[Chart: MAP (≈89–94%) vs. # iterations (1–10) for RW, PR, BS, WL, WF]

Takeaways from the four charts:
• There is little difference between the ranking methods in the supervised case (all seeds correct); the differences are large when bootstrapping.
• Increasing the seed size {2, 3, 4, 4, …} makes all ranking methods improve steadily in the bootstrapping case.

Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone

Relational Set Expansion [Wang & Cohen, EMNLP 2009]
• Seed examples are pairs:
  – e.g., audi::germany, acura::japan, …
• Extension: find wrappers in which pairs of seeds occur
  – with specific left & right contexts
  – in a specific order (audi before germany, …)
  – with a specific string between them
• A variant of the trie-based algorithm (a toy illustration of such a wrapper follows the results below)

Results
[Tables: ranked relational expansions after the first and the tenth iteration]
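To make the pair-oriented wrappers concrete, here is a toy regex rendering of one; the paper’s actual method stays character-trie-based, so `apply_relational_wrapper`, the contexts, and the length cap are all illustrative:

```python
import re

def apply_relational_wrapper(document, left, middle, right, max_len=50):
    """Extract (item1, item2) pairs; the non-greedy gaps make each item stop
    at the first occurrence of the next context string."""
    pattern = (re.escape(left) + r"(.{1,%d}?)" % max_len +
               re.escape(middle) + r"(.{1,%d}?)" % max_len +
               re.escape(right))
    return re.findall(pattern, document, flags=re.DOTALL)

doc = 'made by <b>Audi</b> (<i>Germany</i>), and <b>Acura</b> (<i>Japan</i>),'
print(apply_relational_wrapper(doc, "<b>", "</b> (<i>", "</i>),"))
# -> [('Audi', 'Germany'), ('Acura', 'Japan')]
```

Because the pattern is directional, the required seed order (audi before germany) is enforced for free.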
Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone

Multilingual Set Expansion
• Basic idea:
  – Expand in language 1 (English) with seeds s1, s2 to get S1
  – Expand in language 2 (Spanish) with seeds t1, t2 to get T1
  – Find the first seed s3 in S1 that has a translation t3 in T1
  – Expand in language 1 (English) with seeds s1, s2, s3 to get S2
  – Find the first seed t4 in T1 that has a translation s4 in S2
  – Expand in language 2 (Spanish) with seeds t1, t2, t4 to get T2
  – Continue …
• What’s needed:
  – set expansion in two languages
  – a way to decide if s is a translation of t

Finding likely translations:
• Submit s as a query and ask for results in language T
• Find chunks in language T in the snippets that frequently co-occur with s
  – chunk boundaries are given by a change in character set (e.g., English to Chinese) or by punctuation
• Rank the chunks by a combination of proximity and frequency
• Consider the top 3 chunks t1, t2, t3 as likely translations of s

[Screenshots: multilingual set expansion examples]
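Under the reading above, where each language alternately adopts one translation-verified item from the other’s expansion, the loop might be sketched like this; `expand` and `translation` are stand-ins for SEAL and the snippet-based translator, and are assumptions of the sketch:

```python
def bilingual_expand(expand, translation, s_seeds, t_seeds, rounds=4):
    """expand(lang, seeds) -> ranked expansion list (a stand-in for SEAL);
    translation(term, lang) -> a likely translation or None."""
    S = expand("en", s_seeds)               # S1 in the slides
    T = expand("es", t_seeds)               # T1 in the slides
    for _ in range(rounds):
        # first new English item whose translation appears in the Spanish list
        s3 = next((s for s in S
                   if s not in s_seeds and translation(s, "es") in T), None)
        if s3:
            s_seeds, S = s_seeds + [s3], expand("en", s_seeds + [s3])
        # symmetrically, grow the Spanish seed set from the English list
        t4 = next((t for t in T
                   if t not in t_seeds and translation(t, "en") in S), None)
        if t4:
            t_seeds, T = t_seeds + [t4], expand("es", t_seeds + [t4])
    return S, T
```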
Outline
• History
  – Open-domain IE by pattern-matching
• The bootstrapping-with-noise problem
  – Bootstrapping as a graph walk
• Open-domain IE as finding nodes “near” seeds on a graph
  – Set expansion – from a few clean seeds
  – Iterative set expansion – from many noisy seeds
  – Relational set expansion
  – Multilingual set expansion
  – Iterative set expansion – from a concept name alone

ASIA: Automatic Set Instance Acquisition [Wang & Cohen, ACL 2009]
• Start with the name of a concept (e.g., “NFL teams”)
• Look for instances using (language-dependent) patterns:
  – “… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)”
• Take the most frequent answers as seeds
• Run bootstrapping iSEAL
  – with seed sizes 2, 3, 4, 4, …
  – extended for noise-resistance:
    • wrappers should cover as many distinct seeds as possible (not all seeds) …
    • … subject to a limit on size
    • a modified trie method

Datasets with concept names
[Table of datasets]

Experimental results
[Charts: ASIA vs. direct use of text patterns]
• Comparison to Kozareva, Riloff & Hovy (which uses the concept name plus a single instance as a seed); ASIA uses no seed.
• Comparison to Pasca (using web search queries, CIKM 07).

Comparison to WordNet + 30k
• Snow et al., ACL 2005: a series of experiments learning hyper/hyponyms
  – Bootstrap from WordNet examples
  – Use dependency-parsed free text
  – e.g., added 30k new instances with fairly high precision
  – Many are concepts + named-entity instances
• Experiments with ASIA on concepts from WordNet show a fairly common problem:
  – e.g., “movies” gives as “instances”: “comedy”, “action/adventure”, “family”, “drama”, …
  – i.e., ASIA finds a lower level in a hierarchy, maybe not the one you want

Comparison to WordNet + 30k
• Filter: a simulated sanity check:
  – Consider only concepts expanded in WordNet + 30k that seem to have named entities as instances and have a minimum number of instances
  – Run ASIA on each concept
  – Discard the result if less than 50% of the WordNet instances are in ASIA’s output
• Summary:
  – Some are good
  – Some of Snow’s concepts are low-precision relative to ASIA (4.7% vs. 100%)
  – For the rest, ASIA has 2x–100x the coverage (in number of instances)

Two More Systems to Compare to
• Van Durme & Pasca, 2008
  – Requires an English part-of-speech tagger
  – Analyzed 100 million cached Web documents in English (for many classes)
• Talukdar et al., 2008
  – Requires 5 seed instances as input (for each class)
  – Utilizes output from Van Durme’s system and 154 million tables from the WebTables database (for many classes)
• ASIA
  – Does not require any part-of-speech tagger (nearly language-independent)
  – Supports multiple languages such as English, Chinese, and Japanese
  – Analyzes around 200–400 Web documents (for each class)
  – Requires only the class name as input
  – Given a class name, extraction usually finishes within a minute (including the network latency of fetching web pages)

Comparisons of Instance Extraction Performance
[Chart: Precision@100 for Talukdar et al. (2008), Van Durme & Pasca (2008), and ASIA on Book Publishers, Federal Agencies, NFL Players, Scientific Journals, and Mammals]
• Precisions of Talukdar’s and Van Durme’s systems were obtained from Figure 2 in Talukdar et al., 2008.

Top 10 Instances from ASIA
• Book Publishers: mcgraw hill, prentice hall, cambridge university press, random house, mit press, academic press, crc press, oxford university press, columbia university press, harvard university press
• Federal Agencies: usda, dot, hud, epa, nasa, ssa, nsf, dol, va, sec
• NFL Players: tom brady, ladainian tomlinson, peyton manning, brian westbrook, ben roethlisberger, donovan mcnabb, reggie bush, brett favre, terrell owens, anquan boldin
• Scientific Journals: scientific american, science, nature, new scientist, cell, journal of biological chemistry, biophysical journal, nature medicine, evolution, new england journal of medicine
• Mammals: bats, dogs, rodents, cats, bears, rabbits, elephants, primates, whales, marsupials

Learn a large number of concepts at once
(Joint work with Tom Mitchell, Weam AbuZaki, Justin Betteridge, Andrew Carlson, Estevam R. Hruschka Jr., Bryan Kisiel, Burr Settles)
[Diagram: an ontology with categories – person, sport, athlete, coach, team – and relations – playsSport(a,s), playsForTeam(a,t), teamPlaysSport(t,s), coachesTeam(c,t) – grounded in noun phrases (NP1, NP2) from text such as “Krzyzewski coaches the Blue Devils.”]
• Coupled learning of text and HTML patterns
  – CBL: free-text extraction patterns; SEAL: HTML extraction patterns
  – evidence integration over the Web yields an ontology and populated KB

Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph
  [Diagram: the bootstrapping graph from earlier – NIPS, AISTATS, KDD, SNOWBIRD, SIGIR linked through contexts such as “…at NIPS, AISTATS, KDD and other learning conferences…”, “For skiers, NIPS, SNOWBIRD,… and…”, “on PC of KDD, SIGIR, … and…”]
• RWR as a robust proximity measure
• Character tries as a flexible pattern language
  – high-coverage
  – modifiable to handle expectations of noise

Summary/Conclusions
• Open-domain IE as finding nodes “near” seeds on a graph:
  – The graph is built on-the-fly with web queries
    • A good graph matters!
    • A big graph matters!
      – character-level tries >> HTML heuristics
  – Rank the whole graph
    • Don’t confuse iteratively building the graph with ranking!
  – Off-the-shelf distance metrics work
    • Differences are minimal for clean seeds
    • Much bigger differences with noisy seeds
    • Bootstrapping (especially from free-text patterns) is noisy

Thanks to
• DARPA PAL program – Cohen, Wang
• Google Research Grant program – Wang

Sponsored links: http://boowa.com (Richard’s demo)