For a Few Triples More
Gerhard Weikum
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/

Acknowledgements

LOD: RDF Triples on the Web
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

LOD: Linked RDF Triples on the Web
yago/wordnet:Artist109812338
yago/wordnet:Actor109765278
yago/wikicategory:ItalianComposer
imdb.com/name/nm0910607/
dbpedia.org/resource/Ennio_Morricone
imdb.com/title/tt0361748/
dbpedia.org/resource/Rome
rdf.freebase.com/ns/en.rome
data.nytimes.com/51688803696189142301
geonames.org/3169070/roma
N 41° 54' 10'' E 12° 29' 2''

LOD: Linked RDF Triples on the Web
• Size: 30 billion triples
• Linkage: 500 million links
• Dynamics: encyclopedic reference data
The Good, the Bad, and the Ugly

For a Few Triples More
30 billion triples: still not enough? No! Consider:
1. Dynamics
2. Linkage
3. Ubiquity

Outline
Explain Title
Why More Triples: Dynamics, Linkage, Ubiquity
Linkage & Ubiquity: Named-Entity Disambiguation
Web-Scale Linkage
Wrap-up

1. Dynamics: in a Fast Paced World
Anecdotal examples:
• <… rdf:about="http://dbpedia.org/resource/Steve_Jobs">
  <dbpprop:occupation …> Chairman and CEO, Apple Inc.
  [still there]
• <… "http://... Ellen_Johnson_Sirleaf">
  <dcterms:subject rdf:resource="http://... Category:Nobel_Peace_Prize_laureates"/>
  [not there]
• <… "http://... Scarlett_Johansson">
  <dbpprop:spouse rdf:resource="http://... Ryan_Reynolds"/>
  [never there]
• <… "http://... Clint_Eastwood">
  <dbpprop:spouse …>Dina Ruiz 1 child
  <… rdf:about="http://... Clint_Eastwood">
  <dbpprop:spouse …>Maggie Johnson 2 children
  [both there]
• <… rdf:about="http://... Paul_McCartney">
  <dbpprop:spouse rdf:resource="http://… Linda_McCartney"/>
  <… rdf:about="http://... Paul_McCartney">
  <dbpprop:spouse rdf:resource="http://… Heather_Mills"/>
  <… rdf:about="http://... Paul_McCartney">
  <dbpprop:spouse …>Nancy Shevell
  [none there]

1. Dynamics: As Fresh As Possible
http://data.gov.uk/openspending

1. Dynamics: Updates in the Web of Data
http://sindice.com

1. Dynamics: Closer to the Sources
RDF Data on the Web produced by:
• Maintained, but mostly "static" reference collections (e.g. geo): mostly fresh
• Periodic exports from curated databases (e.g. gov, bio, music): often stale
• Periodic extraction from Web sources (e.g. encyclopedia, news): often stale
• Tags in social streams and advertisements: very noisy
Get closer to the data origin:
• RDF engines (SPARQL APIs) for production DBs
• View maintenance by pub-sub push (feeds)
• Deep-Web crawl/query for surfacing of RDF data

1. Dynamics: Nothing Lasts Forever
Even old and "static" data often needs temporal scope (timepoint, timespan) for proper interpretation:
1: PaulMcCartney hasSpouse HeatherMills   [11-Jun-2002, 2008]
2: PaulMcCartney hasSpouse NancyShevell   [Oct-2011, now]
3: PaulMcCartney gotHonor SirPaul         [1999]
With reification, or use quads (quints, pints, etc.):
1 validFrom 11-Jun-2002
1 validUntil 2008
2 validFrom Oct-2011
3 happenedOn 1999
Need to add temporal properties to RDF and SPARQL:
Select ?w Where {
  ?id1: PM gotHonor SirPaul . ?id1 happenedOn ?t .
  ?id2: PM hasSpouse ?w . ?id2 validFrom ?b . ?id2 validUntil ?e .
  ?t containedIn [?b,?e] . }
but: principled, expressive, easy-to-use

1. Dynamics: Nothing Lasts Forever
http://www.mpi-inf.mpg.de/yago-naga/yago/

2. Linkage: sameAs Links
dbpedia.org/resource/Linda_Louise_Eastman owl:sameAs yago-knowledge.org/resource/Linda_McCartney
www.freebase.com/view/en/man_with_no_name owl:sameAs dbpedia.org/page/Clint_Eastwood
data.linkedmdb.org/page/film/38166 owl:sameAs de.dbpedia.org/page/Zwei_glorreiche_Halunken
LOD statistics: 30 billion triples, 500 million links
• 330 million links trivial (ID-based): within pub, within bio
• 10's of millions of links near-trivial: DBpedia, Freebase, Yago, GeoNames
• sameas.org: 17 million bundles for 50 million URIs
• data.nytimes.com: 5000 people, 2000 locations
Way too few for a world with: 1 million people, 10 million locations, 10's of millions of species, 6 million books, 2 million
movies, 10 million songs, etc.

2. Linkage: sameAs Coverage

2. Linkage: sameAs Accuracy
http://sameas.org

3. Ubiquity: Web-of-Data & Web-of-Contents

3. Ubiquity: Web of Data & Other Contents
RDF data and Web contents need to be interconnected
RDFa & microformats provide the mechanism
How do we get the Web RDF-annotated (at large scale)?
Largely automated, but allow humans in the loop

3. Ubiquity: Web of Data & Other Contents
<html …
May 2, 2011
Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.

<div typeof=event:music>
  <span id="Maestro_Morricone"> Maestro Morricone
    <a rel="sameAs" resource="dbpedia…/Ennio_Morricone"/> </span> …
  <span property="event:location"> Smetana Hall </span> …
  <span property="rdf:type" resource="yago:performance"> The concert </span> will feature …
  <span property="event:date" content="14-07-2011"></span> July 1
</div>

Why a Few Triples More?
Linked Data is great! But still in its infancy
Need to add triples to capture further issues:
• Dynamics: Where is the live data?
• Linkage: Where are the links in Linked Data?
• Ubiquity: Where are the paths between the Web-of-Data and the Web?

Outline
Explain Title
Why More Triples: Dynamics, Linkage, Ubiquity
Linkage & Ubiquity: Named-Entity Disambiguation
Web-Scale Linkage
Wrap-up

Entities on the Web
http://sig.ma

Named-Entity Disambiguation (NED)
Harry fought with you know who. He defeats the dark lord.
Candidate entities: Dirty Harry, Harry Potter, Prince Harry of England, The Who (band), Lord Voldemort
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF (e.g.
Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)

Mentions, Meanings, Mappings
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mentions (surface names): Sergio, Ennio, Eli, Ecstasy, trilogy
Entities (meanings): Eli (bible), Eli Wallach, ?, Benny Goodman, Benny Andersson, Ecstasy (drug), Ecstasy of Gold, Star Wars Trilogy, Lord of the Rings, Dollars Trilogy
KB:
Sergio means Sergio_Leone
Sergio means Serge_Gainsbourg
Ennio means Ennio_Antonelli
Ennio means Ennio_Morricone
Eli means Eli_(bible)
Eli means ExtremeLightInfrastructure
Eli means Eli_Wallach
Ecstasy means Ecstasy_(drug)
Ecstasy means Ecstasy_of_Gold
trilogy means Star_Wars_Trilogy
trilogy means Lord_of_the_Rings
trilogy means Dollars_Trilogy

Mention-Entity Graph
Weighted undirected graph with two types of nodes (mentions and entities), built from KB+Stats; the goal is a joint mapping of all mentions.
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Popularity (m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity (m,e):
• cos/Dice/KL (context(m), context(e)), with contexts as bag-of-words or language model: words, bigrams, phrases
Coherence (e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
Types shown next to the entities:
• Eli Wallach: American Jews, film actors, artists, Academy Award winners
• Ecstasy of Gold: Metallica songs, Ennio Morricone songs, artifacts, soundtrack music
• Dollars Trilogy: spaghetti westerns, film trilogies, movies, artifacts
Links shown next to the entities:
• Eli Wallach: http://.../wiki/Dollars_Trilogy, http://.../wiki/The_Good,_the_Bad, _, http://.../wiki/Clint_Eastwood, http://.../wiki/Honorary_Academy_A
• Ecstasy of Gold: http://.../wiki/The_Good,_the_Bad,_t, http://.../wiki/Metallica, http://.../wiki/Bellagio_(casino), http://.../wiki/Ennio_Morricone
• Dollars Trilogy: http://.../wiki/Sergio_Leone, http://.../wiki/The_Good,_the_Bad,_, http://.../wiki/For_a_Few_Dollars_M, http://.../wiki/Ennio_Morricone
Anchor words shown next to the entities:
• Eli Wallach: The Magnificent Seven; The Good, the Bad, and the Ugly; Clint Eastwood; University of Texas at Austin
• Ecstasy of Gold: Metallica on Morricone tribute; Bellagio water fountain show; Yo-Yo Ma; Ennio Morricone composition
• Dollars Trilogy: For a Few Dollars More; The Good, the Bad, and the Ugly; Man with No Name trilogy; soundtrack by Ennio Morricone

Joint Mapping
[Figure: mention-entity graph with weighted edges]
• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e)

Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11]
[Figure: the same graph with edge weights and weighted degrees of entity nodes, shown over four animation steps with fewer nodes and edges in each step]
• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

Named-Entity Disambiguation: State-of-the-Art
Literature:
• Razvan Bunescu, Marius Pasca: EACL 2006
• Silviu Cucerzan: EMNLP 2007
• David Milne, Ian Witten: CIKM 2008
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
• G. Limaye, S. Sarawagi, S. Chakrabarti: VLDB 2010
• Paolo Ferragina, Ugo Scaiella: CIKM 2010
• Mark Dredze et al.: COLING 2010
• Johannes Hoffart et al.: EMNLP 2011
etc.
Online tools:
https://d5gate.ag5.mpi-sb.mpg.de/webaida/
http://tagme.di.unipi.it/
http://spotlight.dbpedia.org/demo/index.html
http://viewer.opencalais.com/
http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
etc.
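
The greedy approximation named on the Coherence Graph Algorithm slide (iteratively remove the weakest entity and its edges, while every mention keeps at least one candidate) can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the AIDA implementation: the function name greedy_disambiguate, the dictionary layout, and all edge weights are hypothetical, and the sketch leaves out the tracking of the best intermediate subgraph (max min weighted degree) as well as the local/randomized search over alternative solutions.

```python
from collections import defaultdict


def greedy_disambiguate(mention_entity, entity_entity):
    """Drop the weakest removable entity until no entity can be removed.

    mention_entity: {mention: {entity: weight}}     popularity/similarity edges
    entity_entity:  {(entity_a, entity_b): weight}  coherence edges (undirected)
    All weights are illustrative assumptions, not values from the slides.
    """
    # Candidate entities that are still alive for each mention.
    candidates = {m: set(es) for m, es in mention_entity.items()}

    # Symmetric adjacency structure for the coherence edges.
    coherence = defaultdict(dict)
    for (a, b), w in entity_entity.items():
        coherence[a][b] = w
        coherence[b][a] = w

    def weighted_degree(entity, alive):
        # Mention-entity edges plus coherence edges to entities still alive.
        total = sum(mention_entity[m][entity]
                    for m, es in candidates.items() if entity in es)
        total += sum(w for other, w in coherence[entity].items() if other in alive)
        return total

    while True:
        alive = set().union(*candidates.values())
        # Removable: every mention that lists this entity has another candidate.
        removable = [e for e in alive
                     if all(len(es) > 1 for es in candidates.values() if e in es)]
        if not removable:
            break
        # Weakest entity by weighted degree; the name breaks ties deterministically.
        weakest = min(removable, key=lambda e: (weighted_degree(e, alive), e))
        for es in candidates.values():
            es.discard(weakest)

    # If a mention still has several candidates, fall back to its heaviest edge.
    return {m: max(es, key=lambda e: (mention_entity[m][e], e))
            for m, es in candidates.items() if es}


# Candidates follow the "means" entries of the KB slide; weights are made up.
mention_entity = {
    "Sergio": {"Sergio_Leone": 0.5, "Serge_Gainsbourg": 0.5},
    "Ennio": {"Ennio_Morricone": 0.6, "Ennio_Antonelli": 0.4},
    "Eli": {"Eli_(bible)": 0.5, "Eli_Wallach": 0.3,
            "ExtremeLightInfrastructure": 0.2},
    "Ecstasy": {"Ecstasy_(drug)": 0.7, "Ecstasy_of_Gold": 0.3},
    "trilogy": {"Star_Wars_Trilogy": 0.5, "Lord_of_the_Rings": 0.3,
                "Dollars_Trilogy": 0.2},
}
entity_entity = {
    ("Sergio_Leone", "Ennio_Morricone"): 0.9,
    ("Sergio_Leone", "Dollars_Trilogy"): 0.9,
    ("Sergio_Leone", "Eli_Wallach"): 0.6,
    ("Ennio_Morricone", "Ecstasy_of_Gold"): 0.9,
    ("Ennio_Morricone", "Dollars_Trilogy"): 0.8,
    ("Eli_Wallach", "Dollars_Trilogy"): 0.7,
    ("Ecstasy_of_Gold", "Dollars_Trilogy"): 0.6,
    ("Eli_Wallach", "Ecstasy_of_Gold"): 0.4,
}

mapping = greedy_disambiguate(mention_entity, entity_entity)
print(mapping)
```

With these made-up weights, popularity alone would favor Eli_(bible), Ecstasy_(drug) and Star_Wars_Trilogy; the coherence edges raise the weighted degree of the mutually related film entities, so the popular but isolated candidates are removed first.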
NED: Experimental Evaluation
Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
• difficult texts:
  … Australia beats India … → Australian_Cricket_Team
  … White House talks to Kremlin … → President_of_the_USA
  … EDS made a contract with … → HP_Enterprise_Services
Results:
Best: AIDA method with prior+sim+coh + robustness test
82% precision @100% recall, 87% mean average precision
Comparison to other methods, see paper:
J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/

AIDA: Accurate Online Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/aida/

Interesting Research Issues
• More efficient graph algorithms (multicore, etc.)
• Allow mentions of unknown entities, mapped to null
• Leverage deep-parsing structures, leverage semantic types
• Short and difficult texts:
  • tweets, headlines, etc.
  • fictional texts: novels, song lyrics, etc.
  • incoherent texts
• Disambiguation beyond entity names:
  • coreferences: pronouns, paraphrases, etc.
  • common nouns, verbal phrases (general WSD)

Why Named Entity Disambiguation is Key
• Linked data is best if it has many good links
• New & rich contents mostly in traditional Web
• Create sameAs links in (X)HTML contents, via RDFa
• Links for named entities give best mileage/effort
• Methods & tools greatly advanced & gradually maturing
• Keep human in the loop, embed NED in authoring tools

Outline
Explain Title
Why More Triples: Dynamics, Linkage, Ubiquity
Linkage & Ubiquity: Named-Entity Disambiguation
Web-Scale Linkage
Wrap-up

Variants of NED at Web Scale
Tools can map short text onto entities in a few seconds
• How to run this on a big batch of 1 million input texts?
  partition inputs across distributed machines, organize dictionary appropriately, … exploit cross-document contexts
• How to deal with inputs from different time epochs?
  consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history)
• How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities?
  (e.g. tracking politicians and companies)

Linked RDF Triples on the Web
yago/wordnet:Artist109812338
yago/wordnet:Actor109765278
yago/wikicategory:ItalianComposer
imdb.com/name/nm0910607/
dbpedia.org/resource/Ennio_Morricone
imdb.com/title/tt0361748/
dbpedia.org/resource/Rome
? rdf.freebase.com/ns/en.rome_ny
? data.nytimes.com/51688803696189142301
? geonames.org/5134301/city_of_rome
N 43° 12' 46'' W 75° 27' 20''
Referential data quality: automatic, dynamic, high coverage!

Outline
Explain Title
Why More Triples: Dynamics, Linkage, Ubiquity
Linkage & Ubiquity: Named-Entity Disambiguation
Web-Scale Linkage
Wrap-up

Summary
Linked Data is great! But it needs more triples to capture:
• Dynamics: (Deep-Web) sources; feeds, pub-sub, … ?; fresh & versioned triples
• Linkage: LOD; entity mapping; user community
• Ubiquity: RDFa; entity disambiguation; authoring

Outlook
For a Few Triples More
Challenge 1: generate high-quality sameAs links in RDFa & across all LOD sources
For a Few Triples Less
Challenge 2: add efficient top-k ranking to queries over RDF-in-context

Thank You!