Knowledge Harvesting from Text and Web Sources Part 3: Knowledge Linking Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/ Quiz Time How many days do you need to visit all Shangri-La places on this planet? Source: geonames.org Answer: 365 3-2 Quiz Time How many days do you need to visit all Shangri-La places on this planet? 3-3 3-3 Linkied Data: RDF Triples on the Web 30 Bio. triples 500 Mio. links http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png 3-4 Linked RDF Triples on the Web yago/wordnet: Artist109812338 yago/wordnet:Actor109765278 yago/wikicategory:ItalianComposer imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone imdb.com/title/tt0361748/ dbpedia.org/resource/Rome rdf.freebase.com/ns/en.rome data.nytimes.com/51688803696189142301 geonames.org/5134301/city_of_rome N 43° 12' 46'' W 75° 27' 20'' 3-5 Linked RDF Triples on the Web yago/wordnet: Artist109812338 yago/wordnet:Actor109765278 yago/wikicategory:ItalianComposer imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone imdb.com/title/tt0361748/ dbpedia.org/resource/Rome rdf.freebase.com/ns/en.rome_ny ? ? data.nytimes.com/51688803696189142301 Referential data quality? Hand-crafted sameAs links? generated sameAs links? ? geonames.org/5134301/city_of_rome N 43° 12' 46'' W 75° 27' 20'' 3-6 RDF Entities on the Web http://sig.ma 3-7 RDF Entities on the Web http://sig.ma 3-8 Entity-Name Ambiguity http://sameas.org 3-9 Entities in HTML http://sindice.com 3-10 http://schema.org/ Entity Markup in HTML: Towards Standardized Microformats 3-11 http://schema.org/ Entity Markup in HTML: Towards Standardized Microformats 3-12 Web Page in Standard HTML http://schema.org/ Jane Doe <img src="janedoe.jpg" /> Professor 20341 Whitworth Institute 405 Whitworth Seattle WA 98052 (425) 123-4567 <a href="mailto:jane-doe@xyz.edu">jane-doe@illinois.edu</a> Jane's home page: <a href="http://www.janedoe.com">janedoe.com</a> Graduate students: <a href="http://www.xyz.edu/students/alicejones.html">Alice Jones</a> <a href="http://www.xyz.edu/students/bobsmith.html">Bob Smith</a> 3-13 Web Page in HTML with Microdata <div itemscope itemtype="http://schema.org/Person"> <span itemprop="name">Jane Doe</span> <img src="janedoe.jpg" itemprop="image" /> http://schema.org/ <span itemprop="jobTitle">Professor</span> <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"> <span itemprop="streetAddress"> 20341 Whitworth Institute 405 N. Whitworth </span> <span itemprop="addressLocality">Seattle</span>, <span itemprop="addressRegion">WA</span> <span itemprop="postalCode">98052</span> </div> <span itemprop="telephone">(425) 123-4567</span> <a href="mailto:jane-doe@xyz.edu" itemprop="email"> jane-doe@xyz.edu</a> Jane's home page: <a href="http://www.janedoe.com" itemprop="url">janedoe.com</a> Graduate students: <a href="http://www.xyz.edu/students/alicejones.html" itemprop="colleague"> Alice Jones</a> <a href="http://www.xyz.edu/students/bobsmith.html" itemprop="colleague"> Bob Smith</a> </div> 3-14 Web-of-Data vs. Web-of-Contents Critical for knowledge linkage: entity name ambiguity more structured data combined with text boosted by knowledge harvesting methods 3-15 Embedding RDFa in Web Contents <html … May 2, 2011 May 2, 2011 <div typeof=event:music> Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th. <span id="Maestro_Morricone"> Maestro Morricone <a rel="sameAs" resource="dbpedia…/Ennio_Morricone "/> </span> … <span property = "event:location" > Smetana Hall </span> … <span property="rdf:type" resource="yago:performance"> The concert </span> will feature … RDF data and Web contents need to be interconnected <span property="event:date" RDFa & microformatscontent="14-07-2011"></span> provide the mechanism July 1 Need ways of creating more embedded RDF triples! </div> 3-16 Outline Motivation Entity-Name Disambiguation Mapping Questions into Queries Entity Linkage Wrap-up ... 3-17 Named-Entity Disambiguation Harry fought with you know who. He defeats the dark lord. Dirty Harry Harry Potter Prince Harry of England The Who (band) Lord Voldemort Three NLP tasks: 1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger) 2) co-reference resolution: link to preceding NP (trained classifier over linguistic features) 3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB) 3-18 Named Entity Disambiguation Eli (bible) Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. Mentions (surface names) Eli Wallach ? Benny EcstasyGoodman (drug) EcstasyAndersson of Gold Benny Star Wars Trilogy KB Sergio means Sergio_Leone Sergio means Serge_Gainsbourg Ennio means Ennio_Antonelli Ennio means Ennio_Morricone Eli means Eli_(bible) Eli means ExtremeLightInfrastructure Eli means Eli_Wallach Ecstasy means Ecstasy_(drug) Ecstasy means Ecstasy_of_Gold trilogy means Star_Wars_Trilogy trilogy means Lord_of_the_Rings trilogy means Dollars_Trilogy Lord of the Rings Dollars Trilogy Entities (meanings) 3-19 Mention-Entity Graph weighted undirected graph with two types of nodes bag-of-words or Sergio talked to language model: Eli (bible) words, bigrams, Ennio about Eli Wallach phrases Eli‘s role in the Ecstasy (drug) Ecstasy scene. This sequence on Ecstasy of Gold the graveyard Star Wars was a highlight in Sergio‘s trilogy Lord of the Rings of western films. Dollars Trilogy Popularity (m,e): Similarity (m,e): • freq(e|m) • length(e) • #links(e) • cos/Dice/KL (context(m), context(e)) KB+Stats 3-20 Mention-Entity Graph weighted undirected graph with two types of nodes Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. Eli (bible) Eli Wallach joint mapping Popularity (m,e): Similarity (m,e): • freq(e|m) • length(e) • #links(e) • cos/Dice/KL (context(m), context(e)) Ecstasy (drug) Ecstasy of Gold Star Wars Lord of the Rings Dollars Trilogy KB+Stats 3-21 Mention-Entity Graph weighted undirected graph with two types of nodes Sergio talked to Ennio about Eli‘s role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio‘s trilogy of western films. Eli (bible) Eli Wallach Ecstasy(drug) Ecstasy of Gold Star Wars Lord of the Rings Popularity (m,e): Similarity (m,e): • freq(m,e|m) • length(e) • #links(e) • cos/Dice/KL (context(m), context(e)) Dollars Trilogy KB+Stats Coherence (e,e‘): • dist(types) • overlap(links) • overlap 22 / 20 3-22 (anchor words) Mention-Entity Graph weighted undirected graph with two types of nodes American Jews Eli (bible) film actors Sergio talked to artists Ennio about Academy Award winners Eli Wallach Eli‘s role in the Metallica songs Ecstasy (drug) Ecstasy scene. Ennio Morricone songs This sequence on artifacts Ecstasy of Gold soundtrack music the graveyard Star Wars was a highlight in spaghetti westerns Sergio‘s trilogy Lord of the Rings film trilogies movies of western films. artifacts Dollars Trilogy Popularity (m,e): Similarity (m,e): • freq(m,e|m) • length(e) • #links(e) • cos/Dice/KL (context(m), context(e)) KB+Stats Coherence (e,e‘): • dist(types) • overlap(links) • overlap 23 / 20 3-23 (anchor words) Mention-Entity Graph weighted undirected graph with two types of nodes http://.../wiki/Dollars_Trilogy Eli (bible) http://.../wiki/The_Good,_the_Bad, _ Sergio talked to http://.../wiki/Clint_Eastwood Ennio about http://.../wiki/Honorary_Academy_A Eli Wallach Eli‘s role in the Ecstasy (drug) http://.../wiki/The_Good,_the_Bad,_t Ecstasy scene. http://.../wiki/Metallica This sequence on Ecstasy of Gold http://.../wiki/Bellagio_(casino) http://.../wiki/Ennio_Morricone the graveyard Star Wars was a highlight in http://.../wiki/Sergio_Leone Sergio‘s trilogy Lord of the Rings http://.../wiki/The_Good,_the_Bad,_ http://.../wiki/For_a_Few_Dollars_M of western films. http://.../wiki/Ennio_Morricone Dollars Trilogy Popularity (m,e): Similarity (m,e): • freq(m,e|m) • length(e) • #links(e) • cos/Dice/KL (context(m), context(e)) KB+Stats Coherence (e,e‘): • dist(types) • overlap(links) • overlap 24 / 20 3-24 (anchor words) Mention-Entity Graph weighted undirected graph with two types of nodes The Magnificent Seven Eli (bible) The Good, the Bad, and the Ugly Sergio talked to Clint Eastwood Ennio about University of Texas at Austin Eli Wallach Eli‘s role in the Metallica on Morricone tribute Ecstasy (drug) Ecstasy scene. Bellagio water fountain show This sequence on Yo-Yo Ma Ecstasy of Gold Ennio Morricone composition the graveyard Star Wars was a highlight in For a Few Dollars More Sergio‘s trilogy Lord of the Rings The Good, the Bad, and the Ugly Man with No Name trilogy of western films. soundtrack by Ennio Morricone Dollars Trilogy Popularity (m,e): Similarity (m,e): • freq(m,e|m) • length(e) • #links(e) • cos/Dice/KL (context(m), context(e)) KB+Stats Coherence (e,e‘): • dist(types) • overlap(links) • overlap 25 / 20 3-25 (anchor words) Different Approaches Combine Popularity, Similarity, and Coherence Features (Cucerzan: EMNLP‘07, Milne/Witten: CIKM‘08): • for sim (context(m), context(e)): consider surrounding mentions and their candidate entities • use their types, links, anchors as features of context(m) • set m-e edge weights accordingly • use greedy methods for solution Collective Learning with Prob. Factor Graphs (Chakrabarti et al.: KDD‘09): • model P[m|e] by similarity and P[e1|e2] by coherence • consider likelihood of P[m1 … mk | e1 … ek] • factorize by all m-e pairs and e1-e2 pairs • use hill-climbing, LP, etc. for solution 3-26 Joint Mapping 50 30 20 30 50 10 10 90 100 30 90 100 5 20 80 30 90 • Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB • Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e) 3-27 Mention-Entity Popularity Weights [Milne/Witten 2008, Spitkovsky/Chang 2012] Need dictionary with entities‘ names: • full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corporation • short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, … • nicknames & aliases: Terminator, City of Angels, Evil Empire, … • acronyms: LA, UCLA, MS, MSFT • role names: the Austrian action hero, Californian governor, the CEO of MS, … … plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her. Collect hyperlink anchor-text / link-target pairs from • Wikipedia redirects • Wikipedia links between articles • Interwiki links between Wikipedia editions • Web links pointing to Wikipedia articles … Build statistics to estimate P[entity | name] 3-28 Mention-Entity Similarity Edges Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in e page with high PMI: freq(q, e) „Metallica tribute to Ennio Morricone“ weight (q, e) log freq(q) freq(e) Match keyphrase q of candidate e in context of mention m 1 # matching words wcover(q)weight ( w | e) score (q | e) ~ length of cover(q) weight(w | e) wq Extent of partial matches Weight of matched words The Ecstasy piece was covered by Metallica on the Morricone tribute album. Compute overall similarity of context(m) and candidate e score (e | m) ~ score ( q ) dist ( cover(q) , m ) qkeyphrases(e) in context( m) 3-29 Entity-Entity Coherence Edges Precompute overlap of incoming links for entities e1 and e2 log max( in (e1, e2)) log( in (e1) in (e2)) mw - coh(e1, e2) ~ 1 log | E | log min( in (e1), in (e2)) Alternatively compute overlap of anchor texts for e1 and e2 ngrams(e1) ngrams(e2) ngram - coh(e1, e2) ~ ngrams(e1) ngrams(e2) or overlap of keyphrases, or similarity of bag-of-words, or … Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances) For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance 3-30 Coherence Graph Algorithm [J. Hoffart et al.: EMNLP‘11] 50 30 20 30 100 30 90 100 5 140 180 50 50 10 10 90 470 20 80 145 30 90 230 • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search 3-31 Coherence Graph Algorithm [J. Hoffart et al.: EMNLP‘11] 50 30 100 30 90 100 5 140 30 180 170 50 10 90 50 470 80 145 30 90 230 210 • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search 3-32 Coherence Graph Algorithm [J. Hoffart et al.: EMNLP‘11] 140 30 170 120 90 100 30 90 100 5 460 80 145 30 90 210 • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search 3-33 Coherence Graph Algorithm [J. Hoffart et al.: EMNLP‘11] 30 120 90 100 380 90 100 145 90 210 • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search 3-34 Alternative: Random Walks 0.5 50 0.3 30 0.2 20 0.23 30 0.83 50 0.1 10 0.17 10 0.7 90 0.77 100 0.25 30 0.75 90 0.96 100 0.04 5 0.2 20 0.4 80 0.15 30 0.75 90 • for each mention run random walks with restart (like personalized PR with jumps to start mention(s)) • rank candidate entities by stationary visiting probability • very efficient, decent accuracy 3-35 AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-36 AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-37 AIDA: Very Difficult Example http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-38 AIDA: Very Difficult Example http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-39 AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-40 AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-41 AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-42 AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-43 Some NED Online Tools for J. Hoffart et al.: EMNLP 2011, VLDB 2011 https://d5gate.ag5.mpi-sb.mpg.de/webaida/ P. Ferragina, U. Scaella: CIKM 2010 http://tagme.di.unipi.it/ R. Isele, C. Bizer: VLDB 2012 http://spotlight.dbpedia.org/demo/index.html Reuters Open Calais http://viewer.opencalais.com/ S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009 http://www.cse.iitb.ac.in/soumen/doc/CSAW/ D. Milne, I. Witten: CIKM 2008 http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/ perhaps more some use Stanford NER tagger for detecting mentions http://nlp.stanford.edu/software/CRF-NER.shtml 3-44 NED: Experimental Evaluation Benchmark: • Extended CoNLL 2003 dataset: 1400 newswire articles • originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase • difficult texts: … Australia beats India … … White House talks to Kreml … … EDS made a contract with … Australian_Cricket_Team President_of_the_USA HP_Enterprise_Services Results: Best: AIDA method with prior+sim+coh + robustness test 82% precision @100% recall, 87% mean average precision Comparison to other methods, see paper J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011 http://www.mpi-inf.mpg.de/yago-naga/aida/ 3-45 Ongoing Research & Remaining Challenges • More efficient graph algorithms (multicore, etc.) • Allow mentions of unknown entities, mapped to null • Leverage deep-parsing structures, leverage semantic types Example: Page played Kashmir on his Gibson subj obj mod • Short and difficult texts: • tweets, headlines, etc. • fictional texts: novels, song lyrics, etc. • incoherent texts • Structured Web data: tables and lists • Disambiguation beyond entity names: • coreferences: pronouns, paraphrases, etc. • common nouns, verbal phrases (general WSD) 3-46 General Word Sense Disambiguation {songwriter, composer} {cover, perform} Which song writers {cover, report, treat} {cover, help out} covered ballads written by the Stones ? 3-47 Handling Out-of-Wikipedia Entities wikipedia.org/Good_Luck_Cave Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song. wikipedia.org/Nick_Cave wikipedia/Hallelujah_Chorus wikipedia/Hallelujah_(L_Cohen) last.fm/Nick_Cave/Hallelujah wikipedia/Children_(2011 film) last.fm/Nick_Cave/O_Children wikipedia.org/Weeping_(song) last.fm/Nick_Cave/Weeping_Song 3-48 Handling Out-of-Wikipedia Entities Gunung Mulu National Park Sarawak Chamber largest underground chamber wikipedia.org/Good_Luck_Cave Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song. wikipedia.org/Nick_Cave wikipedia/Hallelujah_Chorus wikipedia/Hallelujah_(L_Cohen) Bad Seeds No More Shall We Part Murder Songs Messiah oratorio George Frideric Handel Leonard Cohen Rufus Wainwright Shrek and Fiona last.fm/Nick_Cave/Hallelujah eerie violin Bad Seeds No More Shall We Part wikipedia/Children_(2011 film) South Korean film last.fm/Nick_Cave/O_Children Nick Cave & Bad Seeds Harry Potter 7 movie haunting choir wikipedia.org/Weeping_(song) last.fm/Nick_Cave/Weeping_Song Dan Heymann apartheid system Nick Cave Murder Songs P.J. Harvey Nick and Blixa duet 3-49 Handling Out-of-Wikipedia Entities [J. Hoffart et al.: CIKM‘12] • Characterize all entities (and mentions) by sets of keyphrases • Entity coherence then becomes: keyphrases overlap, no need for href link data • For each mention add a „self“ candidate: out-of-KB entity with keyphrases computed by Web search PO(p,q) = phrases p,q KORE (e,f) = entities e,f wpq wpq min(p(w), q(w)) max(p(w), q(w)) pe,qf PO(p,q)2 min(e(p), f(q)) pe e(p) + qf f(q) with word weights with phrase weights Efficient comparison of two keyphrase-sets two-stage hashing, using min-hash sketches and LSH 3-50 Variants of NED at Web Scale Tools can map short text onto entities in a few seconds • How to run this on big batch of 1 Mio. input texts? partition inputs across distributed machines, organize dictionary appropriately, … exploit cross-document contexts • How to deal with inputs from different time epochs? consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history) • How to handle Web-scale inputs (100 Mio. pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies) 3-51 Outline Motivation Entity-Name Disambiguation Mapping Questions into Queries Entity Linkage Wrap-up ... 3-52 Word Sense Disambiguation for Question-to-Query Translation QA system DEANNA [M. Yahya et al.: EMNLP‘12] Question Translation with WSD SPARQL KB Answer “Who played in Casablanca and was married to a writer born in Rome?” Select ?p Where { ?p type person. ?p actedIn Casablanca_(film). ?p isMarriedTo ?w. ?w type writer . ?w bornIn Rome . } ?p ?w www.mpi-inf.mpg.de/ yago-naga/deanna/3-53 DEANNA in a Nutshell Phrase detection Question Phrase mapping DEANNA Dependency detection SPARQL Joint Disambig. Query Generation KB Answers 3-54 DEANNA in a Nutshell Phrase detection Question Phrase mapping DEANNA Dependency detection SPARQL Joint Disambig. Query Generation KB Answers 3-55 DEANNA in a Nutshell Phrase detection Question Phrase mapping DEANNA Dependency detection SPARQL Joint Disambig. Query Generation KB Answers 3-56 DEANNA in a Nutshell Phrase detection Question Phrase mapping DEANNA Dependency detection SPARQL Joint Disambig. Query Generation KB Answers 3-57 DEANNA Components 1 Phrase detection Question 2 Phrase mapping DEANNA 3 Dependency detection SPARQL 4 Joint Disambig. Query Generation KB Answers 3-58 Phrase Detection Concepts: entities & classes: dictionary-based Concept Phrase Casablanca Casablanca Casablanca Casablanca, Morocco played Casablanca_(film) Casablanca the film played in Casablanca_(film) Casablanca Who … a writer Casablanca married married to was married to … Relations: mainly use Reverb [Fader et al: EMNLP’11]: V | VP | VW*P … was/VBD married/VBN to/TO a/DT… 3-59 DEANNA Components Question 1 Phrase detection 2 Phrase mapping DEANNA 3 Dependency detection SPARQL 4 Joint Disambig. KB Query Generation 3-60 Phrase Mapping Concepts: entities & classes: dictionary-based Concept Phrase Casablanca Casablanca Casablanca Casablanca, Morocco e:Casablanca_(film) Casablanca_(film) Casablanca the film e:Played_(film) Casablanca_(film) Casablanca Played_(film) Played Casablanca e:White_House played e:Casablanca played in r:actedIn r:hasMusicalRole Relations: Relation DictionaryPhrase -based actedIn acted in actedIn played in hasMusicalRole plays hasMusicalRole mastered 3-61 DEANNA Components Question 1 Phrase detection 2 Phrase mapping DEANNA 3 Dependency detection SPARQL 4 Joint Disambig. KB Query Generation 3-62 Dependency Detection e:Rome Rome e:Sydne_Rome born e:Born_(film) Look for specific patterns in dependency graph [de Marneffe et al. LREC’06] e:Max_Born q1 was born a writer r:bornOnDate r:bornInPlace writer partmod c:writer born prep in pobj Rome 3-63 Disambiguation Graph Phrase-nodes Semantic nodes e:Rome Rome e:Sydne_Rome q-nodes born e:Born_(film) q1 was born e:Max_Born a writer q2 r:bornInPlace c:writer Casablanca e:White_House played e:Casablanca played in Who married q3 r:bornOnDate married to was married to e:Casablanca_(film) e:Played_(film) r:actedIn r:hasMusicalRole c:person e:Married_(series) c: married_person r:isMarriedTo 3-64 DEANNA Components Question 1 Phrase detection 2 Phrase mapping DEANNA 3 Dependency detection SPARQL 4 Joint Disambig. KB Query Generation 3-65 Joint Disambiguation - ILP • ILP: Integer Linear Programming • maximize α Σi,j wi,jYi,j + β Σk,l vk,l Zk,l + … • Subject to: No token in multiple phrases, Triples observe type constraints, … 3-66 Joint Disambiguation – Objective Phrase nodes Rome q-nodes q1 born was born a writer Similarity Edges Semantic nodes e:Rome e:Sydne_Rome Coherence Edges e:Born_(film) e:Max_Born r:bornOnDate r:bornInPlace c:writer α Σi,j wi,jYi,j + β Σk,l vk,l Zk,l Prior 3-67 Joint Disambiguation – Objective Phrase nodes Rome q-nodes q1 born was born a writer Similarity Edges Semantic nodes e:Rome e:Sydne_Rome Coherence Edges e:Born_(film) e:Max_Born r:bornOnDate r:bornInPlace c:writer α Σi,j wi,jYi,j + β Σk,l vk,l Zk,l Coherence 3-68 Joint Disambiguation – Constraints A phrase node can be assigned to only one semantic node: Semantic nodes Phrase nodes a Casablanca Ya,1 Ya,2 Ya,3 e:White_House e:Casablanca e:Casablanca_(film) 1 2 3 α Σi,j wi,jYi,j + β Σk,l vk,l Zk,l 3-69 Joint Disambiguation – Constraints Classes translate to type-constrained variables Every semantic triple should have a class to join & project! person actedIn Casablanca_(film) ▼ ?x type person . ?x actedIn Casablanca_(film) Phrase nodes Semantic nodes e:Rome Rome q-nodes e:Sydne_Rome r:bornOnDate q1 was born r:bornInPlace e:The_Writer (magazine) a writer c:writer 3-70 DEANNA Components Question 1 Phrase detection 2 Phrase mapping DEANNA 3 Dependency detection SPARQL 4 Joint Disambig. KB Query Generation 3-71 Structured Query Generation q1 q2 q3 SELECT ?w ?w ?p ?p ?p Rome e:Rome was born r:bornIn a writer c:writer Casablanca e:Casablanca_(film) played in r:actedIn Who c:person was married to r:isMarriedTo ?p WHERE { type writer . bornIn Rome . type person. actedIn Casablanca_(film). isMarriedTo ?w } 3-72 Outline Motivation Entity-Name Disambiguation Mapping Questions into Queries Entity Linkage Wrap-up ... 3-73 Entity Linkage for the Web of Data 30 Bio. triples 500 Mio. links yago/wordnet:Actor109765278 yago/wordnet: Artist109812338 yago/wikicategory:ItalianComposer imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone imdb.com/title/tt0361748/ dbpedia.org/resource/Rome rdf.freebase.com/ns/en.rome_ny ? ? data.nytimes.com/51688803696189142301 sameAs links ? Where? How? ? geonames.org/5134301/city_of_rome N 43° 12' 46'' W 75° 27' 20'' 3-74 Record Linkage (Entity Resolution) … record 1 record 2 record 3 record N Susan B. Davidson Peter Buneman Yi Chen O.P. Buneman S. Davison Y. Chen P. Baumann S. Davidson Cheng Y. Y. Davidson Sean Penn S. Chen University of Pennsylvania Issues in … U Penn Penn State Penn Station Issues in … Issues in … Issues in … Int. Conf. on Very Large Data Bases VLDB Conf. PVLDB XLDB Conference Find equivalence classes of entities, and records, based on: • similarity of values (edit distance, n-gram overlap, etc.) • joint agreement of linkage similarity joins, grouping/clustering, collective learning, etc. often domain-specific customization (similarity measures etc.) Halbert L. Dunn: Record Linkage. American Journal of Public Health. 1946 H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, 1959. I.P. Fellegi, A.B. Sunter: A Theory of Record Linkage. J. of American Statistical Soc., 1969. 3-75 Entity Linkage via Markov Logic record 1 record 2 record 3 Susan B. Davidson Peter Buneman Yi Chen O.P. Buneman S. Davison Y. Chen P. Baumann S. Davidson Cheng Y. … record N Y. Davidson Sean Penn S. Chen University of U Penn Penn State Penn Station Pennsylvania prob. / uncertain rules: Issues in … sameTitle(x,y) Issues in … Issues in … in … sameAuths(x,y) sameVenue(x,y) Issues sameAs(x,y) sameTitle(x,y) sameAuths(x,y) sameAffil(x,y) sameAs(x,y) Int. Conf. on Very VLDB Conf. PVLDB XLDB overlapAuths(x,y) sameAffil(x,y) sameAuths(x,y) Conference Large Data Bases sameAs(rec1.auth1, rec2.auth1) [0.2] sameAs(rec1.auth1, [0.9] records, based on: Find equivalence classes ofrec2.auth2) entities, and … of values (edit distance, n-gram overlap, etc.) • similarity • specify in of Markov Logic or as factor graph • joint agreement linkage MRF (or …) and solvecollective by MCMClearning, (or …) etc. similarity• generate joins, grouping/clustering, (Singla/Domingos: ICDM’06, Hall/Sutton/McCallum:KDD’08) Halbert L. Dunn: Record Linkage. American Journal of Public Health. 1946 H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science, 1959. I.P. Fellegi, A.B. Sunter: A Theory of Record Linkage. J. of American Statistical Soc., 1969. 3-76 sameAs-Link Test across Sources LOD source 1 ei LOD source 2 sameAs ? ej ? ? ? ? similarity: sim (ei, ej) neighborhoods: N(ei), N(ej) coherence: coh (xN(ei), yN(ej)) record linkage problem sameAs (ei, ej) sim (ei, ej) ≥ … x,y coh(x,y) ≥ … 3-77 sameAs-Link Generation across Sources LOD source 1 ei LOD source 2 sameAs ? sameAs ? LOD source 3 sameAs ? ej ek … … … 3-78 sameAs-Link Generation across Sources LOD source 1 ei LOD source 2 sameAs ? sameAs ? ej LOD source 3 sameAs ? • Joint Mapping • ILP model or prob. factor graph or … • Use your favorite solver • How? ek at Web scale ??? sim(ei, ej): likelihood of being equivalent, mapped to [-1,1] coh(x, y): likelihood of being mentioned together, mapped to [-1,1] 0-1 decision variables: Xij … Xjk … Xik … objective function: ij (Xij sim(ei,ej) + Xij xNi, yNj coh(x,y)) + jk (…) + ik (…) = max! constraints: j Xij 1 for all i … (1Xij ) + (1Xjk ) (1Xik) for all i, j, k … 3-79 Similarity Flooding Graph with record / entity pairs as nodes (sameAs candidates) and edges connecting related pairs: R(x,y) and S(u,w) and sameAs candidates (x,u), (y,w) edge between (x,u) and (y,w) • Node weights: belief strength in sameAs(x,u) • Edge weights: degree of relatedness Iterate until convergence: • propagate node weights to neighbors • new node weight is linear combination of inputs Related to belief propagation algorithms, label propagation, etc. 3-80 Blocking of Match Candidates Avoid computing O(n2) similarities between records / entities • Group potentially matching records • Run more accurate & more expensive method per group at risk of missing some matches 1 2 3 4 5 Name John Doe John Doe Jon Foe Jane Foe Jane Fog Zip 49305 94305 94305 12345 12345 Email jdoe@yahoo jdoe@gmail jdoe@yahoo jane@msn jane@msn Group by zip code: {1,4,5} and {2,3} sameAs(4,5), sameAs(2,3) Group by 1st char of lastname: {1,2} and {3,4,5} sameAs(1,2), sameAs(4,5) • Iterative Blocking: distribute found matches to other blocks, then repeat per-block runs • Multi-type Joint Resolution blocks of different record types (author, venue, etc.) propagate matches to other types, then repeat runs 3-81 Iterative Blocking for Joint Resolution [Whang et al. 2012] with Multiple Entity Types Publications Authors Venues after round 1 after round 2 heuristics for constructing efficient execution plans exploiting „influence graph“ 3-82 RiMOM Method [Juanzi Li et al.;TKDE‘09] Risk Minimization Based Ontology Matching Method for joint matching of concepts (entities, classes) & properties (relations) Strategies using variety of matching criteria: • Linguistic-based: • edit distance • context vector distance … • Structure-based: • similarity flooding … keg.cs.tsinghua.edu.cn/project/RiMOM/ 3-83 COMA++ Framework [E. Rahm et al.] • Joint schema alignment and entity matching • Comprehensive architecture with many plug-ins for customizing to specific application • Blocked matchers parallelizable on Map-Reduce platform dbs.uni-leipzig.de/Research/coma.html/ 3-84 PARIS Method [F. Suchanek et al. 2012] Probabilistic Alignment of Relations, Instances, and Schema: joint reasoning on sameEntity, sameRelation, sameClass with direct probabilistic assessment P[literal1 literal2] = … same constant value P[r1 r2] = … sub-relation P[e1 e2] = … same entity P[c1 c2] = … sub-class Iterate through probabilistic equations Empirically converges to fixpoint Matching entities of DBpedia with YAGO: 90% precision, 73% recall, after 4 iterations, 5 h run-time webdam.inria.fr/paris/ 3-85 PARIS Method P[literal1 literal2] = … P[x y] = [F. Suchanek et al. 2012] based on similarity and co-occurrence P[Shanri-La Zhongdian] = … fun(bornIn-1) P[Jet Li Li Lianjie] (1 r(x,u),r(y,w) (1 fun(r1)P[uw])) r(x,u) (1 fun(r) r(y,w) (1 P[uw]))) same entity if relations were already aligned considering negative evidence where fun(r) = #x: y: r(x,y) #x,y: r(x,y)) degree to which r Is a function webdam.inria.fr/paris/ 3-86 PARIS Method P[s r]: P[s r] = P[s r] = [F. Suchanek et al. 2012] sub-relation #x,u: s(x,u) r(x,u) #x,u: s(x,u) if entities were already resolved s(x,u) (1 r(y,w) (1 P[xy]P[uw])) s(x,u) (1 y,w (1 P[xy]P[uw])) with same-entity probabilities webdam.inria.fr/paris/ 3-87 PARIS Method [F. Suchanek et al. 2012] P[x y] = same entity revisited (1 s(x,u),r(y,w) (1 P[s r]fun(s1)P[uw]) (1 P[s r]fun(r1)P[uw])) with sub-relation probabily s(x,u),r(y,w) (1 P[s r] fun(s) r(y,w) (1 P[uw])) (1 P[s r] fun(r) r(y,w) (1 P[uw])) considering negative evidence webdam.inria.fr/paris/ 3-88 PARIS Method P[c d]: P[c d] = P[c d] = [F. Suchanek et al. 2012] sub-class #x type(x,c)) type(x,d) #x: type(x,c) if entities were already resolved x:type(x,c) (1 y:type(y,d) (1 P[xy])) #x: type(x,c) with same-entity probabilities webdam.inria.fr/paris/ 3-89 Partitioned MLN Method V. Rastogi et al. 2011] • Use Markov Logic Network for entity resolution • Partition MLN with replication of nodes so that: • Each node has its neighborhood in the same partition Repeat • local computation: run MLN inference via MCMC on each partition (in parallel) • message passing: exchange beliefs (on sameAs) among partitions with overlapping node sets Until convergence R1: sim(x,y) sameAuthor(x,y) R2: sim(x,y) coAuthor(x,a) coAuthor(y,b) s ameAuthor(a,b) sameAuthor(x,y) 3-90 LINDA: Linked Data Alignment at Scale [C. Böhm et al. 2012] • uses context sim and joint inference to process sameAs matrix with transitivity and other constraints • alternates between setting sameAs and recomputing sim • puts promising candidate pairs in priority queue • queue is partitioned and processing parallelized ek Node n Q-part 1 (2) notify ei ej y (3) update … … Q-part n ek el y … … G-part 1 e e G-part n e e i j k ej Result Matrix X e1 … em em… e1 read el l … read … Input Queue Q distribute ei Node 1 distribue distribute Input Entity Graph G Queue Updates ei ej y‘ ei ek y‘ … ei ej y ei ek y ek el y … … … … … Experiment with BTC+ dataset: • 3 Bio. quads • 345 Mio. triples • 95 Mio. URIs Result after 30 h run-time: • 12.3 Mio. sameAs • 66% precision • > 80% for Dbpedia-Yago 3-91 Cross-Lingual Linking + simpler than monolingual: natural equivalences, interwiki links harder than monolingual: different terminologies & structures en.wikipedia.org: 3.5 Mio. articles baike.baidu.com: 4 Mio. articles Source: Z. Wang et al.: WWW‘12 Z. Wang et al. WWW‘12: factor-graph learning 200,000 sameAs T. Nguyen et al. VLDB‘12: sim features & LSI infobox mappings 3-92 Challenges Remaining Entity linkage is at the heart of semantic data integration ! More than 50 years of research, still some way to go! • Highly related entities with ambiguous names George W. Bush (jun.) vs. George H.W. Bush (sen.) • Out-of-Wikipedia entities with sparse context • Enterprise data (perhaps combined with Web2.0 data) • Records with complex DB / XML / OWL schemas • Entities with very noisy context (in social media) Benchmarks: • • • OAEI Ontology Alignment & Instance Matching: oaei.ontologymatching.org TAC KBP Entity Linking: www.nist.gov/tac/2012/KBP/ TREC Knowledge Base Acceleration: trec-kba.org 3-93 TREC Task: Knowledge Base Acceleration Goal: assist Wikipedia / KB editors • recommend key citations as evidence of truth • recommend infobox structure and categories • recommend entity links and external links http://trec-kba.org 3-94 TREC Task: Knowledge Base Acceleration + http://trec-kba.org 3-95 Outline Motivation Entity-Name Disambiguation Mapping Questions into Queries Entity Linkage Wrap-up ... 3-96 Take-Home Lessons Web of Linked Data is great 100‘s of KB‘s with 30 Bio. triples and 500 Mio. links mostly reference data, dynamic maintenance is bottleneck connection with Web of Contents needs improvement Entity detection and disambiguation is key for creating sameAs links in text (RDFa, microformats) for machine reading, semantic authoring, knowledge base acceleration, … NED methods come close to human quality combine popularity, similarity, and coherence extend towards general WSD (e.g. for QA) Linking entities across KB‘s is advancing Integrated methods for aligning entities, classes and relations 3-97 Open Problems and Grand Challenges Entity name disambiguation in difficult situations Short and noisy texts about long-tail entities in social media Robust disambiguation of entities, relations and classes Relevant for question answering & question-to-query translation Key building block for KB building and maintenance Combine algorithms and crowdsourcing for NED & ER with active learning, minimizing human effort or cost/accuracy Automatic and continuously maintained sameAs links for Web of Linked Data with high accuracy & coverage 3-98 End of Part 3 Questions? 3-99 Recommended Readings: Disambiguation • J. Hoffart, M. A. Yosef, I. Bordino, et al.: Robust Disambiguation of Named Entities in Text. EMNLP 2011 • J. Hoffart et al.: KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. CIKM 2012 • R.C. Bunescu, M. Pasca: Using Encyclopedic Knowledge for Named entity Disambiguation. EACL 2006 • S. Cucerzan: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. EMNLP 2007 • D.N. Milne, I.H. Witten: Learning to link with wikipedia. CIKM 2008 • S. Kulkarni et al.: Collective annotation of Wikipedia entities in web text. KDD 2009 • G.Limaye et al: Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB 2010 • A. Rahman, V. Ng: Coreference Resolution with World Knowledge. ACL 2011 • L. Ratinov et al.: Local and Global Algorithms for Disambiguation to Wikipedia. ACL 2011 • M. Dredze et al.: Entity Disambiguation for Knowledge Base Population. COLING 2010 • P. Ferragina, U. Scaiella: TAGME: on-the-fly annotation of short text fragments. CIKM 2010 • X. Han, L. Sun, J. Zhao: Collective entity linking in web text: a graph-based method. SIGIR 2011 • M. Tsagkias, M. de Rijke, W. Weerkamp.: Linking Online News and Social Media. WSDM 2011 • J. Du et al.: Towards High-Quality Semantic Entity Detection over Online Forums. SocInfo 2011 • V.I. Spitkovsky, A.X. Chang: A Cross-Lingual Dictionary for English Wikipedia Concepts, LREC 2012 • J.R. Finkel, T. Grenager, C. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL 2005 • V. Ng: Supervised Noun Phrase Coreference Research: The First Fifteen Years. ACL 2010 • S. Singh, A. Subramanya, F.C.N. Pereira, A. McCallum: Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models. ACL 2011 • T . Lin et al.: No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities. EMNLP 2012 • A. Rahman, V. Ng: Inducing Fine-Grained Semantic Classes via Hierarchical Classification. COLING 2010 • X. Ling, D.S. Weld: Fine-Grained Entity Recognition. AAAI 2012 • R. Navigli: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2), 2009 • M. Yahya et al.: Natural Language Questions for the Web of Data. EMNLP 2012 • S. Shekarpour: Automatically Transforming Keyword Queries to SPARQL on Large-Scale KBs. ISWC 2011 3-100 Recommended Readings: Linked Data and Entity Linkage • T. Heath, C. Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan&Claypool, 2011 • A. Hogan, et al.: An empirical survey of Linked Data conformance. J. Web Sem. 14, 2012 • H. Glaser, A. Jaffri, I.C. Millard: Managing Co-Reference on the Semantic Web. LDOW 2009 • J. Volz, C.Bizer, M.Gaedke, G.Kobilarov : Discovering and Maintaining Links on the Web of Data. ISWC 2009 • F. Naumann, M. Herschel: An Introduction to Duplicate Detection. Morgan&Claypool, 2010 • H.Köpcke et al: Learning-Based Approaches for Matching Web Data Entities. IEEE Internet Computing 2010 • H. Köpcke et al.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 2010 • S. Melnik, H. Garcia-Molina, E. Rahm: Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. ICDE 2002 • S. Chaudhuri, V. Ganti, R. Motwani: Robust Identification of Fuzzy Duplicates. ICDE 2005 • S.E. Whang et al.: Entity Resolution with Iterative Blocking. SIGMOD 2009 • S.E. Whang, H. Garcia-Molina: Joint Entity Resolution. ICDE 2012 • L. Kolb, A. Thor, E. Rahm: Load Balancing for MapReduce-based Entity Resolution. ICDE 2012 • J.Li, J.Tang, Y.Li, Q.Luo: RiMOM: A dynamic multistrategy ontology alignment framework. TKDE 21(8), 2009 • P. Singla, P. Domingos: Entity Resolution with Markov Logic. ICDM 2006 • I.Bhattacharya, L. Getoor: Collective Entity Resolution in Relational Data. TKDD 1(1), 2007 • R. Hall, C.A. Sutton, A. McCallum: Unsupervised deduplication using cross-field dependencies. KDD 2008 • V. Rastogi, N. Dalvi, M. Garofalakis: Large-Scale Collective Entity Matching. PVLDB 2011 • F. Suchanek et al.: PARIS: Probabilistic Alignment of Relations, Instances, and Schema. PVLDB 2012 • Z. Wang, J. Li, Z. Wang, J. Tang: Cross-lingual knowledge linking across wiki knowledge bases. WWW 2012 • T. Nguyen et al.: Multilingual Schema Matching for Wikipedia Infoboxes. PVLDB 2012 • A.Hogan et al.: Scalable and distributed methods for entity matching. J. Web Sem. 10, 2012 • C. Böhm et al.: LINDA: Distributed Web-of-Data-Scale Entity Matching. CIKM 2012 • J. Wang, T. Kraska, M. Franklin, J. Feng: CrowdER: Crowdsourcing Entity Resolution. PVLDB 2012 3-101 Knowledge Harvesting: Overall Take-Home Lessons KB‘s are great opportunity in the big-data era: revive old AI vision, make it real & large-scale ! challenging, but high pay-off Strong success story on entities and classes Good progress on relational facts Methods for open-domain relation discovery Many opportunities remaining: temporal knowledge, spatial, visual, commonsense vertical domains: health, music, travel, … Search and ranking: Combine facts (SPO triples) with witness text Extend SPARQL, LM‘s for ranking, UI unclear Entity linking: From names in text to entities in KB sameAs between entities in different KB‘s / DB‘s Knowledge Harvesting: Research Opportunities & Challenges Explore & exploit synergies between semantic, statistical, & social Web methods: statistical evidence + logical consistency + wisdom of the crowd ! For DB / AI / IR / NLP / Web researchers: • efficiency & scalability • consistency constraints & reasoning • search and ranking • deep linguistic patterns & statistics • text (& speech) disambiguation • killer app for uncertain data management • knowledge-base life-cycle 1-103 and more cmn: 非常谢谢你 yue: 唔該 wu: 谢谢侬 dai: ขอบคุณ tib: ཐུགས་རྗེ་ཆྗེ་། en: thank you expression of gratitude de: vielen Dank fr: Merci beaucoup es: muchas gracias ru: Большое спасибо 3-104