Graph-Based Methods for "Open Domain" Information Extraction
William W. Cohen
Machine Learning Dept. and Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Traditional IE vs. Open-Domain IE
• Traditional IE
– Goal: recognize people, places, companies, times, dates, … in NL text
– Supervised learning from a corpus completely annotated with the target entity class (e.g. "people")
– Linear-chain CRFs
– Language- and genre-specific extractors
• Open-domain IE
– Goal: recognize arbitrary entity sets in text, from very large corpora (WWW)
– Minimal info about the entity class
– Example 1: "ICML, NIPS"
– Example 2: "Machine learning conferences"
– Semi-supervised learning; graph-based learning methods
– Techniques are largely language-independent (!) – the graph abstraction fits many languages

Outline
• History
– Open-domain IE by pattern-matching
– The bootstrapping-with-noise problem
– Bootstrapping as a graph walk
• Open-domain IE as finding nodes "near" seeds on a graph
– Approach 1: a "natural" graph derived from a smaller corpus + learned similarity
– Approach 2: a carefully-engineered graph derived from a huge corpus

History: Open-domain IE by pattern-matching (Hearst, 92)
• Start with seeds: "NIPS", "ICML"
• Look through a corpus for certain patterns:
– "… at NIPS, AISTATS, KDD and other learning conferences …"
– "on PC of KDD, SIGIR, … and …"
• Expand from seeds to new instances
• Repeat … until ___

Bootstrapping as graph proximity
[Figure: seeds NIPS, ICML linked through context strings ("…at NIPS, AISTATS, KDD and other learning conferences…", "For skiers, NIPS, SNOWBIRD,… and…", "on PC of KDD, SIGIR, … and…", "…AISTATS, KDD,…") to new instances AISTATS, KDD, SIGIR, SNOWBIRD, …]
• Shorter paths ~ earlier iterations
• Many paths ~ additional evidence

Outline
• Open-domain IE as finding nodes "near" seeds on a graph
– Approach 1: a "natural" graph derived from a smaller corpus + learned similarity – "with" Einat Minkov (Nokia)
– Approach 2: a carefully-engineered graph derived from a huge corpus (the examples above) – "with" Richard Wang (CMU)

Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
[Figure: dependency parse of "boys like playing with all kinds of cars", with edges nsubj, partmod, prep.with, det, prep.of and POS tags NN, VB, DT]
• A dependency-parsed sentence is naturally represented as a tree
• A dependency-parsed corpus is "naturally" represented as a graph: word nodes link to their mentions, and mentions are linked by dependency edges

Open IE goal
• Find "coordinate terms" (e.g., girl/boy, dolls/cars) in the graph, or equivalently
• Find a similarity measure S so that S(girl, boy) is high
• What about off-the-shelf similarity measures: PPR/RWR, hitting time, commute time, …?

Personalized PageRank / RWR
• A query language – a query Q gives an initial set of nodes and a target node type
• The graph: nodes have a node type; edges have an edge label and an edge weight
• Graph-walk parameters: edge weights Θ, walk length K, and reset probability γ
• M[x,y] = probability of reaching y from x in one step: the edge weight from x to y, divided by the total outgoing weight from x
• "Personalized PageRank": the reset probability is biased towards the initial distribution
• Returns a list of nodes (of the target type) ranked by the graph-walk probabilities
• Approximate with power iteration, cut off after a fixed number of iterations K
[Figure: example walk on the parsed-text graph from the query term "girls", through mention and dependency edges (nsubj, partmod, prep.with and their inverses), to terms such as "boys", "playing", and "dolls"]
• Useful, but not our goal here…
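For reference, the power-iteration approximation described above fits in a few lines. The following is a minimal sketch (not the authors' code); the graph representation, function name, and parameter defaults are illustrative assumptions:

from collections import defaultdict

def personalized_pagerank(edges, seeds, gamma=0.5, K=6):
    """Minimal RWR sketch: `edges` maps x -> list of (y, weight); `seeds` is the
    initial (query) distribution; gamma is the reset probability; K the number
    of power-iteration steps. Returns node -> probability."""
    # Row-normalize outgoing weights: M[x][y] = w(x,y) / total outgoing weight of x.
    M = {}
    for x, outs in edges.items():
        total = sum(w for _, w in outs)
        M[x] = [(y, w / total) for y, w in outs]

    v0 = {s: 1.0 / len(seeds) for s in seeds}   # uniform over the seed nodes
    v = dict(v0)
    for _ in range(K):
        nxt = defaultdict(float)
        for x, px in v.items():
            for y, pxy in M.get(x, []):
                nxt[y] += (1 - gamma) * px * pxy   # take one walk step
        for s, ps in v0.items():
            nxt[s] += gamma * ps                   # reset towards the seeds
        v = dict(nxt)
    return v

# Toy usage on a fragment of the bootstrapping graph above (weights are made up):
edges = {
    "NIPS": [("ctx1", 1.0)], "ctx1": [("AISTATS", 1.0), ("KDD", 1.0)],
    "ICML": [("ctx2", 1.0)], "ctx2": [("KDD", 1.0)],
}
print(sorted(personalized_pagerank(edges, ["NIPS", "ICML"]).items(),
             key=lambda kv: -kv[1]))

Nodes near both seeds (here KDD) accumulate probability from multiple short paths, which is exactly the "many paths ~ additional evidence" intuition.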
Learning a better similarity metric
• Task T (query class): query a + relevant answers a; query b + relevant answers b; …; query q + relevant answers q
• For each query, the graph walk returns a ranked list of nodes (rank 1, 2, 3, …, 50, …)
• Seed words ("girl", "boy", …) → potential new instances of the target concept ("doll", "child", "toddler", …)

Learning methods
• Weight tuning – weights learned per edge type [Diligenti et al., 2005]
• Reranking – re-order the retrieved list using global features of all paths from source to destination [Minkov et al., 2006]
– Features: edge-label sequences (e.g., for "boys": nsubj.nsubj-inv and nsubj.partmod.partmod-inv.nsubj-inv; for "dolls": nsubj.partmod.prep.in) and lexical unigrams along the paths (e.g., "like", "playing")

Learning methods: Path-Constrained Graph Walk (PCW)
• PCW (summary): for each node x, learn P(x→z : relevant(z) | history(Vq, x))
• History(Vq, x) = the sequence of edge labels leading from the query nodes Vq to x, with all histories stored in a tree
• Example: from Vq = "girls", the history nsubj.nsubj-inv leads to "boys", and the history nsubj.partmod.prep.in leads to "dolls"
• A small sketch of this idea follows below.
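To make the path-tree idea concrete, here is a toy sketch (my illustration, not the EMNLP 2008 implementation): histories are tuples of edge labels, the "path tree" is reduced to a table of per-history relevance estimates built from labeled example paths, and candidate nodes are scored by those estimates, dropping histories below a threshold (0.5 in the experiments):

from collections import defaultdict

def train_path_tree(labeled_paths):
    """labeled_paths: list of (edge_label_sequence, is_relevant) pairs.
    Returns history -> estimated P(relevant | history)."""
    pos, tot = defaultdict(int), defaultdict(int)
    for path, relevant in labeled_paths:
        hist = tuple(path)
        tot[hist] += 1
        if relevant:
            pos[hist] += 1
    return {h: pos[h] / tot[h] for h in tot}

def pcw_score(candidate_histories, path_tree, threshold=0.5):
    """Score a candidate node by the histories that reach it, pruning
    low-probability histories."""
    probs = [path_tree.get(tuple(h), 0.0) for h in candidate_histories]
    probs = [p for p in probs if p >= threshold]
    return max(probs, default=0.0)

# Toy example in the spirit of the slides:
train = [(('nsubj', 'nsubj-inv'), True),             # girls -> ... -> boys
         (('nsubj', 'partmod', 'prep.in'), True),    # girls -> ... -> dolls
         (('nsubj', 'partmod'), False)]              # an irrelevant endpoint
tree = train_path_tree(train)
print(pcw_score([('nsubj', 'nsubj-inv')], tree))     # 1.0
print(pcw_score([('nsubj', 'partmod')], tree))       # 0.0 (pruned)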
City and person name extraction
• Corpora (dependency-parsed):
– MUC: 140K words, 82K nodes, 244K edges, 3K named entities (true labels)
– MUC+AP: 2,440K words, 1,030K nodes, 3,550K edges, 36K named entities (auto-labeled)
• City names: Vq = {sydney, stamford, greenville, los_angeles}
• Person names: Vq = {carter, dave_kingman, pedro_ramos, florio}
• 10 queries (×4 seeds) for each task; train on queries q1–q5, test on q6–q10
• Extract nodes of type NE
• GW: 6 steps, uniform/learned weights
• Reranking: top 200 nodes (using learned weights)
• Path trees: 20 correct / 20 incorrect paths; threshold 0.5

[Figure: MUC precision-at-rank curves (ranks 1–91) for city names and person names, adding one method at a time: graph walk, weight tuning, PCW, reranking; learning improves over the plain graph walk. The plots are annotated with informative edge types (cities: conj-and, prep-in, nn, appos, …; persons: subj, obj, poss, nn, …), PCW paths (cities: prep-in-inv.conj-and, nn-inv.nn; persons: nsubj.nsubj-inv, appos.nn-inv), and reranking features (cities: LEX."based", LEX."downtown"; persons: LEX."mr", LEX."president").]

Vector-space models (baselines)
• Co-occurrence vectors (counts; window: +/- 2)
• Dependency vectors [Padó & Lapata, Comp Ling 07]
– A path value function:
   • Length-based value: 1 / length(path)
   • Relation-based value: subj=5, obj=4, obl=3, gen=2, else=1
– Context selection function:
   • Minimal: verbal predicate-argument (length 1)
   • Medium: coordination, genitive construction, noun compounds (length <= 3)
   • Maximal: combinations of the above (length <= 4)
– Similarity function: cosine or Lin
• Only score the top nodes retrieved with reranking (~1000 overall)
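As a concrete reference point for the first baseline, here is a minimal sketch (an assumed standard setup, not the paper's code) of building ±2-window co-occurrence vectors and comparing words by cosine similarity:

import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Count co-occurrences of each word with context words within +/- `window`."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[w][sent[j]] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

sents = [["boys", "like", "playing", "with", "cars"],
         ["girls", "like", "playing", "with", "dolls"]]
vecs = cooccurrence_vectors(sents)
print(cosine(vecs["boys"], vecs["girls"]))   # high: shared contexts "like", "playing"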
Graph walks vs. vector-space models
[Figure: precision-at-rank curves on MUC for city and person names, comparing PCW, reranking, co-occurrence vectors (CO), and dependency vectors (DV). The graph-based methods (syntactic structure + learning) are best.]
[Figure: the same comparison on MUC+AP. The advantage of the graph-based models diminishes with the amount of data; this is hard to evaluate at high ranks (manual labeling).]

Outline
• Open-domain IE as finding nodes "near" seeds on a graph
– Approach 1: a "natural" graph derived from a smaller corpus + learned similarity – "with" Einat Minkov (CMU → Nokia)
– Approach 2: a carefully-engineered graph derived from a huge corpus – "with" Richard Wang (CMU)

Set Expansion for Any Language (SEAL) – (Wang & Cohen, ICDM 07)
• Basic ideas
– Dynamically build the graph using queries to the web
– Constrain the graph to be as useful as possible
   • Be smart about queries
   • Be smart about "patterns": use clever methods for finding meaningful structure on web pages

System Architecture
• Example: seeds 1–3 (Canon, Nikon, Olympus) expand to 4. Pentax, 5. Sony, 6. Kodak, 7. Minolta, 8. Panasonic, 9. Casio, 10. Leica, 11. Fuji, 12. Samsung, …
• Fetcher: download web pages from the Web that contain all the seeds
• Extractor: learn wrappers from web pages
• Ranker: rank entities extracted by wrappers

The Extractor
• Learn wrappers from web documents and seeds on the fly
– Utilize semi-structured documents
– Wrappers defined at the character level
   • Very fast
   • No tokenization required; thus language-independent
– Wrappers derived from a document d are applied to d only
– See the ICDM 2007 paper for details

Wrapper-learning example. Document:
… Generally <a ref="finance/ford">Ford</a> sales … compared to <a ref="finance/honda">Honda</a> while <a href="finance/gm">General Motors</a> and <a href="finance/bentley">Bentley</a> ….
1. Find the prefix of each seed occurrence and put it in reverse order:
– ford1: /ecnanif"=fer a> yllareneG …
– Ford2: >"drof/ecnanif"=fer a> yllareneG …
– honda1: /ecnanif"=fer a> ot derapmoc …
– Honda2: >"adnoh/ecnanif"=fer a> ot derapmoc …
2. Organize these reversed prefixes into a trie, tagging each node with the set of seed occurrences below it (here the URL occurrences {f1, h1} share one branch and the anchor-text occurrences {f2, h2} another).
3. A left context for a valid wrapper is a trie node tagged with at least one instance of each seed.
4. The corresponding right context is the longest common suffix of the corresponding seed instances (here the URL occurrences give left context <a ref="finance/ and right context ">, while the anchor-text occurrences give left context "> and right context </a>). A code sketch of this procedure follows.
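Below is a toy sketch of this character-level wrapper learning, under my own simplifying assumptions (it brute-forces one occurrence per seed instead of building the trie, and is not the ICDM 2007 implementation; the example document is a cleaned version of the one above):

import re
from itertools import product
from os.path import commonprefix

def learn_wrapper(doc, seeds):
    """Pick one occurrence of each seed and find the longest left context
    (shared string ending just before every chosen occurrence) plus the
    corresponding right context (shared string starting just after them)."""
    occs = []
    for seed in seeds:
        hits = [(m.start(), m.end()) for m in
                re.finditer(re.escape(seed), doc, re.IGNORECASE)]
        if not hits:
            return None            # a wrapper must cover every seed
        occs.append(hits)

    best = None
    for combo in product(*occs):   # one occurrence per seed
        left = commonprefix([doc[:s][::-1] for s, _ in combo])[::-1]
        right = commonprefix([doc[e:] for _, e in combo])
        if best is None or len(left) + len(right) > len(best[0]) + len(best[1]):
            best = (left, right)
    return best

def apply_wrapper(doc, wrapper):
    """Apply a learned wrapper to the same document it was derived from."""
    left, right = wrapper
    return re.findall(re.escape(left) + "(.+?)" + re.escape(right), doc)

doc = ('Generally <a ref="finance/ford">Ford</a> sales compared to '
       '<a ref="finance/honda">Honda</a> while <a ref="finance/gm">General '
       'Motors</a> and <a ref="finance/bentley">Bentley</a> fell.')
w = learn_wrapper(doc, ["ford", "honda"])
print(w)                      # (' <a ref="finance/', '">')
print(apply_wrapper(doc, w))  # ['ford', 'honda', 'gm', 'bentley']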
A harder, real page (excerpt): the same seeds appear inside boilerplate markup, so some candidate extractions are noise (marked on the slide with "I am noise" / "Me too!" callouts):
<li class="ford"><a href="http://www.curryauto.com/"> <img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a>
  <ul><li class="last"><a href="http://www.curryauto.com/"> <span class="dName">Curry Ford</span>...</li></ul></li>
<li class="honda"><a href="http://www.curryauto.com/"> <img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a>
  <ul><li><a href="http://www.curryhonda-ga.com/"> <span class="dName">Curry Honda Atlanta</span>...</li>
  <li><a href="http://www.curryhondamass.com/"> <span class="dName">Curry Honda</span>...</li>
  <li class="last"><a href="http://www.curryhondany.com/"> <span class="dName">Curry Honda Yorktown</span>...</li></ul></li>
<li class="acura"><a href="http://www.curryauto.com/"> <img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a>
  <ul><li class="last"><a href="http://www.curryacura.com/"> <span class="dName">Curry Acura</span>...</li></ul></li>
<li class="nissan"><a href="http://www.curryauto.com/"> <img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a>
  <ul><li class="last"><a href="http://www.geisauto.com/"> <span class="dName">Curry Nissan</span>...</li></ul></li>
<li class="toyota"><a href="http://www.curryauto.com/"> <img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a>
  <ul><li class="last"><a href="http://www.geisauto.com/toyota/"> <span class="dName">Curry Toyota</span>...</li></ul></li>

The Ranker
• Rank candidate entity mentions based on "similarity" to the seeds
– Noisy mentions should be ranked lower
• Random walk with restart, as before – but what's the graph?

Building a Graph
[Figure: seeds "ford", "nissan", "toyota" find documents (e.g., northpointcars.com, curryauto.com); documents derive wrappers #1–#4; wrappers extract mentions such as "acura" (34.6%), "honda" (26.1%), "chevrolet" (22.5%), "volvo chicago" (8.4%), "bmw pittsburgh" (8.4%).]
• The graph consists of a fixed set of…
– Node types: {seed, document, wrapper, mention}
– Labeled directed edges: {find, derive, extract}
• Each edge asserts that a binary relation r holds
• Each edge has an inverse relation r-1 (the graph is cyclic)
• Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions

Evaluation datasets: closed sets

Evaluation Method
• Mean Average Precision (MAP)
– Commonly used for evaluating ranked lists in IR
– Contains recall- and precision-oriented aspects
– Sensitive to the entire ranking
– Mean of the average precisions for each ranked list L of extracted mentions, where
   AvgPrec(L) = (1 / #TrueEntities) × Σ_r Prec(r) × NewEntity(r)
   • Prec(r) = precision at rank r
   • NewEntity(r) = 1 if (a) the extracted mention at rank r matches some true mention and (b) no extracted mention at a rank less than r is of the same entity as the one at r; otherwise 0
   • #TrueEntities = total number of true entities in this dataset
• Evaluation procedure
1. (Per dataset) Randomly select three true entities and use their first listed mentions as seeds
2. Expand the three seeds obtained from step 1
3. Repeat steps 1 and 2 five times
4. Compute MAP over the five ranked lists
(A small sketch of this metric follows.)
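A minimal sketch of the average-precision computation as defined above (my own illustration; matching a mention to an entity is simplified to exact-string lookup):

def average_precision(ranked_mentions, true_entities):
    """ranked_mentions: extracted mentions in rank order; true_entities: dict
    mapping each true mention string to its entity id (a 'closed set')."""
    seen_entities = set()
    hits, score = 0, 0.0
    for r, mention in enumerate(ranked_mentions, start=1):
        entity = true_entities.get(mention)
        if entity is not None:
            hits += 1                   # counts toward Prec(r)
        if entity is not None and entity not in seen_entities:
            seen_entities.add(entity)   # NewEntity(r) = 1
            score += hits / r           # Prec(r) * NewEntity(r)
    n = len(set(true_entities.values()))
    return score / n if n else 0.0

def mean_average_precision(runs, true_entities):
    return sum(average_precision(r, true_entities) for r in runs) / len(runs)

# Toy usage: two true entities, one noisy extraction at rank 2.
truth = {"ford": "FORD", "honda": "HONDA"}
print(average_precision(["ford", "curry ford", "honda"], truth))  # (1/1 + 2/3) / 2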
Experimental Results: 3 seeds – overall MAP for various methods
[Figure: bar chart of overall MAP for different configurations (reported values include 93.13%, 94.03%, 94.18%, 87.61%, 82.39%, 43.76%, and 14.59%), covering combinations such as E1+EF+100, E2+EF+100, E2+GW+100, E2+GW+200, E2+GW+300, and the Ghahramani & Heller "G.Sets" datasets.]
• Methods vary by [Extractor] + [Ranker] + [Top N URLs]
– Extractor E1: baseline (longest common context for all seed occurrences)
– Extractor E2: smarter (longest common context for one occurrence of each seed)
– Ranker: EF (baseline: most frequent) or GW (graph walk)
– N URLs: 100, 200, or 300

Side-by-side comparisons
[Figure: comparison with Talukdar, Brants, Liberman & Pereira (CoNLL 06).]
[Figure: comparison with Ghahramani & Heller (NIPS 2005): EachMovie vs. WWW, and NIPS vs. WWW.]

A limitation of the original SEAL
[Figure: preliminary study on seed sizes – MAP (roughly 75–85%) for rankers RW, PR, BS, WL as the number of seeds varies from 2 to 6.]

Proposed solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
• Makes several calls to SEAL; each call…
– Expands a couple of seeds
– Aggregates statistics
• Evaluate iSEAL using…
– Two iterative processes: supervised vs. unsupervised (bootstrapping)
– Two seeding strategies: fixed seed size vs. increasing seed size
– Five ranking methods

iSEAL (fixed seed size, supervised)
• Start from the initial seeds; each call to SEAL expands a small batch of seeds, and its statistics are aggregated into a growing graph
• Finally rank nodes by proximity to the seeds in the full graph
• Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, …
• Variant (Bootstrap): use high-confidence extractions when the seeds run out
• (A sketch of this loop appears after the ranking methods below.)

Ranking methods
• Random graph walk with restart – H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. ICDM, 2006.
• PageRank – L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
• Bayesian Sets (over the flattened graph) – Z. Ghahramani and K. A. Heller. Bayesian sets. NIPS, 2005.
• Wrapper Length – weights each item by the length of the common contextual string of that item and the seeds
• Wrapper Frequency – weights each item by the number of wrappers that extract the item
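The iterative loop can be sketched as follows (my paraphrase of the description above, not the ICDM 2008 code; seal_expand and rank_by_proximity are hypothetical stand-ins for one SEAL call and for whichever of the five ranking methods is chosen):

def iseal(initial_seeds, iterations, seed_sizes, seal_expand, rank_by_proximity,
          bootstrap=False):
    """Iterative SEAL sketch.
    seed_sizes: e.g. [2]*10 for fixed seed size, or [2, 3, 4, 4, ...] for ISS.
    seal_expand(seeds) -> graph fragment (documents, wrappers, mentions).
    rank_by_proximity(graph, seeds) -> mentions ranked by RW/PR/BS/WL/WF."""
    unused = list(initial_seeds)
    used, graph = [], []
    for it in range(iterations):
        k = seed_sizes[it]
        if len(unused) < k and bootstrap:
            # Bootstrapping variant: fall back to high-confidence extractions.
            ranked = rank_by_proximity(graph, used)
            unused += [m for m in ranked if m not in used][: k - len(unused)]
        batch, unused = unused[:k], unused[k:]
        used += batch
        graph.append(seal_expand(batch))      # aggregate statistics across calls
    return rank_by_proximity(graph, used)     # final ranking over the full graph

# Dummy usage with trivial stand-ins (fixed seed size 2, supervised):
demo = iseal(["canon", "nikon", "olympus", "pentax"], 2, [2, 2],
             seal_expand=lambda seeds: {"seeds": seeds},
             rank_by_proximity=lambda g, s: s)
print(demo)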
Results: seeding strategies × ranking methods
[Figure: MAP vs. number of iterations (cumulative expansions, 1–10) for fixed seed size, supervised (roughly 89–98%) and bootstrapped (roughly 86–92%), with rankers RW, PR, BS, WL, WF. There is little difference between ranking methods in the supervised case (all seeds correct); the differences are large when bootstrapping.]
[Figure: the same plots for increasing seed size, supervised (roughly 90–97%) and bootstrapped (roughly 89–94%). Increasing the seed size {2, 3, 4, 4, …} makes all ranking methods improve steadily in the bootstrapping case.]

Current work
• Start with the name of a concept (e.g., "NFL teams")
• Look for (language-dependent) patterns, e.g.:
– "… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)"
• Take the most frequent answers as seeds
• Run bootstrapping iSEAL with seed sizes 2, 3, 4, 4, … (a pattern-matching sketch follows below)
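A minimal sketch of harvesting candidate seeds from such patterns (my illustration; the regular expression and the toy corpus are assumptions, and real patterns would be tuned per language):

import re
from collections import Counter

def seeds_from_pattern(concept, texts, top_k=3):
    """Find 'CONCEPT (e.g., A, B, C)'-style patterns and return the most
    frequent candidate instances as seeds."""
    pat = re.compile(re.escape(concept) + r"\s*\(e\.g\.,\s*([^)]*)\)", re.IGNORECASE)
    counts = Counter()
    for text in texts:
        for group in pat.findall(text):
            for item in group.split(","):
                item = item.strip(" .…")
                if item:
                    counts[item] += 1
    return [seed for seed, _ in counts.most_common(top_k)]

texts = ["... for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, ...)",
         "famous NFL teams (e.g., Pittsburgh Steelers, Green Bay Packers)"]
print(seeds_from_pattern("NFL teams", texts))
# ['Pittsburgh Steelers', 'New York Giants', 'Green Bay Packers']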
Datasets with concept names

Experimental results
• Direct use of text patterns
• Comparison to Kozareva, Riloff & Hovy (which uses the concept name plus a single instance as seed) – here no seed instance is used
• Comparison to Pasca (using web search queries, CIKM 07)

Comparison to WordNet + 30k
• Snow et al.: a series of experiments learning hypernyms/hyponyms
– Bootstrap from WordNet examples
– Use dependency-parsed free text
– E.g., added 30k new instances with fairly high precision
– Many are concepts + named-entity instances
• Experiments with ASIA on concepts from WordNet show a fairly common problem:
– E.g., "movies" gives as "instances": "comedy", "action/adventure", "family", "drama", …
– I.e., ASIA finds a lower level in a hierarchy, maybe not the one you want

Comparison to WordNet + 30k
• Filter: a simulated sanity check
– Consider only concepts expanded in WordNet + 30k that seem to have named entities as instances and have at least ___ instances
– Run ASIA on each concept
– Discard the result if less than 50% of the WordNet instances are in ASIA's output

Two more systems to compare to
• Van Durme & Pasca, 2008
– Requires an English part-of-speech tagger
– Analyzed 100 million cached Web documents in English (for many classes)
• Talukdar et al., 2008
– Requires 5 seed instances as input (for each class)
– Utilizes output from Van Durme's system and 154 million tables from the WebTables database (for many classes)
• ASIA
– Does not require any part-of-speech tagger (nearly language-independent)
– Supports multiple languages such as English, Chinese, and Japanese
– Analyzes around 200–400 Web documents (for each class)
– Requires only the class name as input
– Given a class name, extraction usually finishes within a minute (including the network latency of fetching web pages)

Comparisons of instance extraction performance
[Figure: precision at 100 for Talukdar et al. 2008, Van Durme & Pasca 2008, and ASIA on Book Publishers, Federal Agencies, NFL Players, Scientific Journals, and Mammals. Precisions of Talukdar's and Van Durme's systems were obtained from Figure 2 in Talukdar et al., 2008.]

Top 10 instances from ASIA
• Book Publishers: mcgraw hill, prentice hall, cambridge university press, random house, mit press, academic press, crc press, oxford university press, columbia university press, harvard university press
• Federal Agencies: usda, dot, hud, epa, nasa, ssa, nsf, dol, va, sec
• NFL Players: tom brady, ladainian tomlinson, peyton manning, brian westbrook, ben roethlisberger, donovan mcnabb, reggie bush, brett favre, terrell owens, anquan boldin
• Scientific Journals: scientific american, science, nature, new scientist, cell, journal of biological chemistry, biophysical journal, nature medicine, evolution, new england journal of medicine
• Mammals: bats, dogs, rodents, cats, bears, rabbits, elephants, primates, whales, marsupials

Summary/Conclusions
• Open-domain IE as finding nodes "near" seeds on a graph
[Figure: the bootstrapping-as-graph-proximity picture from the introduction – seeds NIPS, ICML linked through context strings to AISTATS, KDD, SIGIR, SNOWBIRD; shorter paths ~ earlier iterations, many paths ~ additional evidence.]
• Approach 1 (Minkov & Cohen, EMNLP 08):
– Graph ~ a dependency-parsed corpus
– Off-the-shelf distance metrics are not great
– With a little bit of learning:
   • Results significantly better than state-of-the-art on small corpora (e.g., a personal email corpus)
   • Results competitive on 2M+ word corpora
• Approach 2 (Wang & Cohen, ICDM 07, 08):
– Graph built on-the-fly with web queries
– A good graph matters!
– Off-the-shelf distance metrics work
   • Differences are minimal for clean seeds
   • Modest improvements from learning with clean seeds – e.g., reranking (not described here)
   • Bigger differences between similarity measures with noisy seeds

Thanks to
• DARPA PAL program – Minkov, Cohen, Wang
• Yahoo! Research Labs – Minkov
• Google Research Grant program – Wang
Sponsored link: http://boowa.com (Richard's demo)