Semantic Search: from Names and Phrases to Entities and Relations
Gerhard Weikum
Max Planck Institute for Informatics & Saarland University
http://www.mpi-inf.mpg.de/~weikum/

Acknowledgements

Big Picture: Opportunities Now!
[Diagram labels: Very Large Knowledge Bases; Web of Data; Web of Users & Contents; Semantic Docs; KB Population; Entity Linkage; Disambiguation; Info Extraction; Semantic Authoring]
This talk: how do we search this world of knowledge, data, and text (and cope with ambiguity)?
For knowledge harvesting, see the talks at Collège de France and at the VLDB School in Kunming.

Web of Data: RDF, Tables, Microdata
30 billion SPO triples (RDF) and growing
[Figure: Linked Open Data cloud; labeled knowledge bases: SUMO, ReadTheWeb, Cyc, YAGO, TextRunner/ReVerb, ConceptNet 5, BabelNet, WikiTaxonomy/WikiNet]
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

• 4M entities in 250 classes
• 500M facts for 6000 properties
• live updates

YAGO:
• 10M entities in 350K classes
• 120M facts for 100 relations
• 100 languages
• 95% accuracy

• 25M entities in 2000 topics
• 100M facts for 4000 properties
• powers Google knowledge graph

Example triples:
Ennio_Morricone type composer
Ennio_Morricone type GrammyAwardWinner
composer subclassOf musician
Ennio_Morricone bornIn Rome
Rome locatedIn Italy
Ennio_Morricone created Ecstasy_of_Gold
Ennio_Morricone wroteMusicFor The_Good,_the_Bad_,and_the_Ugly
Sergio_Leone directed The_Good,_the_Bad_,and_the_Ugly

Linked RDF Triples on the Web
yago/wordnet:Artist109812338
yago/wordnet:Actor109765278
yago/wikicategory:ItalianComposer
imdb.com/name/nm0910607/
dbpedia.org/resource/Ennio_Morricone
imdb.com/title/tt0361748/
dbpedia.org/resource/Rome
rdf.freebase.com/ns/en.rome
data.nytimes.com/51688803696189142301
geonames.org/3169070/roma (N 41° 54' 10'' E 12° 29' 2'')
500 million links

Embedding (RDF) Microdata in HTML Pages
Supported by RDFa and microformats like schema.org
<html …
May 2, 2011
Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.
<div typeof=event:music>
  <span id="Maestro_Morricone"> Maestro Morricone
    <a rel="sameAs" resource="dbpedia/Ennio_Morricone "/>
  </span>
  …
  <span property = "event:location" > Smetana Hall </span>
  …
  <span property="rdf:type" resource="yago:performance"> The concert </span> will feature …
  <span property="event:date" content="14-07-2011"></span> July 1
</div>

Outline
• Opportunities Now
• Semantic Search Today
• Entity Name Disambiguation
• Question Answering
• Disambiguation Reloaded
• Wrap-Up

Semantic Search Today (1)

Semantic Search Today (2)
Select ?x Where {
  ?x type composer [western movie] .
  ?x wasBornIn ?y .
  ?y locatedIn Europe .
}

Select ?x Where {
  ?x type composer .
  ?x participatedIn ?y .
  ?y type western_film .
}

Semantic Search Today (3)

Semantic Search Today (4)
Key problem in semantic search: diversity and ambiguity of names and phrases!

Three Different NLP Problems
Example: "Harry fought with you know who. He defeats the dark lord."
Candidate entities: Dirty Harry, Harry Potter, Prince Harry of England, The Who (band), Lord Voldemort
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)

Named Entity Disambiguation
Input text: "Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films."
Mentions (surface names): Sergio, Ennio, Eli, Ecstasy, trilogy
KB:
Sergio means Sergio_Leone
Sergio means Serge_Gainsbourg
Ennio means Ennio_Antonelli
Ennio means Ennio_Morricone
Eli means Eli_(bible)
Eli means ExtremeLightInfrastructure
Eli means Eli_Wallach
Ecstasy means Ecstasy_(drug)
Ecstasy means Ecstasy_of_Gold
trilogy means Star_Wars_Trilogy
trilogy means Lord_of_the_Rings
trilogy means Dollars_Trilogy
Entities (meanings): Eli (bible), Eli Wallach?, Benny Goodman, Benny Andersson, Ecstasy (drug), Ecstasy of Gold, Star Wars Trilogy, Lord of the Rings, Dollars Trilogy

Mention-Entity Graph
Weighted undirected graph with two types of nodes: mentions (from the input text above) and candidate entities (Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy). Weights come from KB+Stats; the goal is a joint mapping of all mentions.
Popularity (m,e):
• freq(e|m)
• length(e)
• #links(e)
Similarity (m,e):
• cos/Dice/KL (context(m), context(e))
• context as bag-of-words or language model: words, bigrams, phrases
Coherence (e,e'):
• dist(types)
• overlap(links)
• overlap(anchor words)
Example types shown for candidate entities: American Jews, film actors, artists, Academy Award winners; Metallica songs, Ennio Morricone songs, artifacts, soundtrack music; spaghetti westerns, film trilogies, movies, artifacts
Example links shown for candidate entities: http://.../wiki/Dollars_Trilogy, http://.../wiki/The_Good,_the_Bad, _, http://.../wiki/Clint_Eastwood, http://.../wiki/Honorary_Academy_A; http://.../wiki/The_Good,_the_Bad,_t, http://.../wiki/Metallica, http://.../wiki/Bellagio_(casino), http://.../wiki/Ennio_Morricone; http://.../wiki/Sergio_Leone, http://.../wiki/The_Good,_the_Bad,_, http://.../wiki/For_a_Few_Dollars_M, http://.../wiki/Ennio_Morricone
Example anchor words shown for candidate entities: The Magnificent Seven, The Good, the Bad, and the Ugly, Clint Eastwood, University of Texas at Austin; Metallica on Morricone tribute, Bellagio water fountain show, Yo-Yo Ma, Ennio Morricone composition; For a Few Dollars More, The Good, the Bad, and the Ugly, Man with No Name trilogy, soundtrack by Ennio Morricone

Joint Mapping
[Figure: mention-entity graph annotated with edge weights]
• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e)

Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11]
[Figure: mention-entity graph annotated with edge weights and weighted degrees of entity nodes]
• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

Mention-Entity Popularity Weights [Milne/Witten 2008, Spitkovsky/Chang 2012]
• Need dictionary with entities' names:
  • full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corp.
  • short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …
  • nicknames & aliases: Terminator, City of Angels, Evil Empire, …
  • acronyms: LA, UCLA, MS, MSFT
  • role names: the Austrian action hero, Californian governor, CEO of MS, …
  … plus gender info (useful for resolving pronouns in context): "Bill and Melinda met at MS. They fell in love and he kissed her."
• Collect hyperlink anchor-text / link-target pairs from
  • Wikipedia redirects
  • Wikipedia links between articles
  • Interwiki links between Wikipedia editions
  • Web links pointing to Wikipedia articles
  …
• Build statistics to estimate P[entity | name]

Mention-Entity Similarity Edges
Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in the page of e with high PMI, for example "Metallica tribute to Ennio Morricone":
$\mathrm{weight}(q, e) \sim \log \frac{\mathrm{freq}(q, e)}{\mathrm{freq}(q)\,\mathrm{freq}(e)}$
Match keyphrase q of candidate e in context of mention m. The first factor captures the extent of partial matches, the second the weight of matched words:
$\mathrm{score}(q \mid e) \sim \frac{\#\,\text{matching words}}{\text{length of } \mathrm{cover}(q)} \cdot \left( \frac{\sum_{w \in \mathrm{cover}(q)} \mathrm{weight}(w \mid e)}{\sum_{w \in q} \mathrm{weight}(w \mid e)} \right)^{1+\gamma}$
Example context: "The Ecstasy piece was covered by Metallica on the Morricone tribute album."
Compute overall similarity of context(m) and candidate e:
$\mathrm{score}(e \mid m) \sim \sum_{q \in \mathrm{keyphrases}(e) \text{ in } \mathrm{context}(m)} \mathrm{score}(q \mid e) \cdot \mathrm{dist}(\mathrm{cover}(q), m)$

Entity-Entity Coherence Edges
Precompute overlap of incoming links for entities e1 and e2:
$\text{mw-coh}(e_1, e_2) \sim 1 - \frac{\log \max(|\mathrm{in}(e_1)|, |\mathrm{in}(e_2)|) - \log |\mathrm{in}(e_1) \cap \mathrm{in}(e_2)|}{\log |E| - \log \min(|\mathrm{in}(e_1)|, |\mathrm{in}(e_2)|)}$
Alternatively compute overlap of anchor texts for e1 and e2:
$\text{ngram-coh}(e_1, e_2) \sim \frac{|\mathrm{ngrams}(e_1) \cap \mathrm{ngrams}(e_2)|}{|\mathrm{ngrams}(e_1) \cup \mathrm{ngrams}(e_2)|}$
or overlap of keyphrases, or similarity of bag-of-words, or …
Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances).
For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance.

AIDA: Accurate Online Disambiguation
http://www.mpi-inf.mpg.de/yago-naga/aida/

AIDA: Very Difficult Example
http://www.mpi-inf.mpg.de/yago-naga/aida/

Some NED Online Tools
• J. Hoffart et al.: EMNLP 2011, VLDB 2011: https://d5gate.ag5.mpi-sb.mpg.de/webaida/
• P. Ferragina, U. Scaiella: CIKM 2010: http://tagme.di.unipi.it/
• R. Isele, C. Bizer: VLDB 2012: http://spotlight.dbpedia.org/demo/index.html
• Reuters Open Calais: http://viewer.opencalais.com/
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009: http://www.cse.iitb.ac.in/soumen/doc/CSAW/
• D. Milne, I. Witten: CIKM 2008: http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
• perhaps more
Some use the Stanford NER tagger for detecting mentions: http://nlp.stanford.edu/software/CRF-NER.shtml

NED: Experimental Evaluation
Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
• difficult texts:
  • "… Australia beats India …": Australian_Cricket_Team
  • "… White House talks to Kremlin …": President_of_the_USA
  • "… EDS made a contract with …": HP_Enterprise_Services
Results:
Best: AIDA method with prior+sim+coh + robustness test: 82% precision @100% recall, 87% mean average precision
For a comparison to other methods, see the paper: J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/

Ongoing Research & Remaining Challenges
• More efficient graph algorithms (multicore, etc.)
• Allow mentions of unknown entities, mapped to null
• Leverage deep-parsing structures, leverage semantic types
  Example: "Page played Kashmir on his Gibson" (dependency labels: subj, obj, mod)
• Short and difficult texts:
  • tweets, headlines, etc.
  • fictional texts: novels, song lyrics, etc.
  • incoherent texts
• Structured Web data: tables and lists
• Disambiguation beyond entity names:
  • coreferences: pronouns, paraphrases, etc.
  • common nouns, verbal phrases (general WSD)

Variants of NED at Web Scale
Tools can map short text onto entities in a few seconds.
• How to run this on a big batch of 1 million input texts? Partition inputs across distributed machines, organize dictionary appropriately, …; exploit cross-document contexts
• How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)

Deep Question Answering
Example clues:
• William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
• This town is known as "Sin City" & its downtown is "Glitter Gulch"
• As of 2010, this is the only former Yugoslav republic in the EU
• 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
Pipeline: question classification & decomposition; knowledge back-ends (e.g. YAGO)
D. Ferrucci et al.: Building Watson. AI Magazine, Fall 2010.
IBM Journal of R&D 56(3/4), 2012: This is Watson.

Semantic Keyword Search [Ilyas et al. Sigmod'10]
Need to map (groups of) keywords onto entities & relationships based on name-entity similarities/probabilities
q: composer Rome scores westerns
Candidate meanings shown:
• composer: composer (creator of music), Media Composer video editor
• Rome: Rome (Italy), Rome (NY), Lazio Roma, AS Roma
• scores: film music, goal in football
• westerns: western movies, western world, Western Digital, Western (airline), Western (NY)
• relationships: … born in …, … plays for …, … used in …, … recorded at …

Natural Language Questions are Natural
Question: Who composed scores for westerns and is from Rome?
Translate question into SPARQL query:
• dependency parsing to decompose question
• mapping of question units onto entities, classes, relations
Map results into tabular or visual presentation or speech

From Questions to Queries
Dependency parsing exposes structure of question: "triploids" (sub-cues)
NL question: Who composed scores for westerns and is from Rome?
Triploids:
• Who composed scores
• scores for westerns
• Who is from Rome

From Triploids to Triples
Who composed scores for westerns and is from Rome?
• "Who composed scores": ?x composed ?s . ?x type composer . ?s type music
• "scores for westerns": ?s contributesTo ?y . ?y type westernMovie
• "Who is from Rome": ?x bornIn Rome

Pattern Dictionary for Relations [N. Nakashole et al.: EMNLP 2012]
Problem: cope with language diversity & ambiguity
Example: composed …, wrote …, created …, …
WordNet-style dictionary/taxonomy for relational phrases based on SOL patterns (syntactic-lexical-ontological)
• Relational phrases can be synonymous:
  "graduated from" ⇔ "obtained degree in * from"
  "and $PRP ADJ advisor" ⇔ "under the supervision of"
• One relational phrase can subsume another:
  "wife of" ⇒ "spouse of"
• Relational phrases are typed:
  <person> graduated from <university>
  <singer> released <album>
  <singer> covered <song>
  <book> covered <event>

PATTY: Pattern Taxonomy for Relations [N. Nakashole et al.: EMNLP 2012, demo at VLDB 2012]
350 000 SOL patterns with 4 million instances
Derived from large data (Wikipedia, NYT, ClueWeb) by scalable sequence mining
Accessible at: www.mpi-inf.mpg.de/yago-naga/patty

Disambiguation Mapping for Triploids
Who composed scores for westerns and is from Rome?
Triploids: q1, q2, q3, q4
Phrase nodes: Who, composed, composed scores, scores, scores for, westerns, is from, Rome
Candidate semantic nodes (c: class, e: entity, r: relation): c:person, c:musician, e:WHO, r:created, r:wroteComposition, r:wroteSoftware, c:soundtrack, r:soundtrackFor, r:shootsGoalFor, c:western movie, e:Western Digital, r:actedIn, r:bornIn, e:Rome (Italy), e:Lazio Roma
Weighted edges (coherence, similarity, etc.)
Combinatorial optimization by ILP (with type constraints etc.)

Relaxing Overconstrained Queries
Select ?p Where {
  ?p composed ?s . ?s type music . ?s for ?m . ?m type movie . ?p bornIn Rome . }

Select ?p Where {
  ?p composed ?s . ?s type music . ?s for ?m . ?m type movie [western] . ?p bornIn Rome . }

Select ?p Where {
  ?p ?rel1 ?s [composed] . ?s type music . ?s ?rel2 ?m . ?m type movie [western] . ?p bornIn Rome . }

with extended SPARQL-FullText: SPOX quad patterns (S. Elbassuoni et al.: CIKM'10, ESWC'11, SIGIR'12)

Preliminary Results (M. Yahya et al.: WWW'12, EMNLP'12)
http://www.mpi-inf.mpg.de/yago-naga/deanna/

Disambiguation Mapping [M. Yahya et al.: EMNLP'12]
Who composed scores for westerns and is from Rome?
Same graph as above (triploids q1 to q4, phrase nodes, candidate semantic nodes, weighted edges), now with decision variables:
• Selection: $X_i$ (phrase node i is selected)
• Assignment: $Y_{ij}$ (phrase node i is mapped to semantic node j), with weight $w_{ij}$
• Joint Mapping: $Z_{kl}$ (semantic nodes k and l are both chosen), with weight $v_{kl}$

Disambig. Mapping: Objective Function
maximize $\sum_{i,j} w_{ij} Y_{ij} + \sum_{k,l} v_{kl} Z_{kl} + \dots$
subject to:
1) $Y_{ij} \le X_i$ for all i,j
2) $\sum_j Y_{ij} \le 1$ for all i
3) $Z_{kl} \le \sum_i Y_{ik}$ and $Z_{kl} \le \sum_j Y_{jl}$ for all k,l
4) $X_i, Y_{ij}, Z_{kl} \in \{0,1\}$

Disambig. Mapping: Constraints
• Selection: $Q_{hi}$ (phrase node i is used in triploid h)
5) $Q_{hi} = 1 \Rightarrow \sum_g Q_{hg} = 3$ for all h,i
6) $X_i + X_g \le 1$ for all mutually exclusive i,g
7) $Q_{hi} = 1 \Rightarrow \sum_{g,j} Q_{hg} Y_{gj} = 1$ for relation nodes j

Disambig. Mapping: Type Constraints
8) $Y_{ij} = 1$ and j is relation node and $Z_{kj} = 1$ and $Z_{jl} = 1$ $\Rightarrow$ $\mathrm{domain}(j) \cap \mathrm{types}(k) \ne \emptyset$ and $\mathrm{range}(j) \cap \mathrm{types}(l) \ne \emptyset$
ILP optimizers like Gurobi solve this in 1 or 2 seconds
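The objective function with constraints 1) to 4) can be illustrated on a toy instance. The sketch below is not the ILP formulation itself: it replaces the ILP solver with brute-force enumeration, covers only three phrase nodes, ignores constraints 5) to 8), and uses weights that are invented purely for illustration. The names `W`, `V`, `objective`, and `best_mapping` are hypothetical.

```python
from itertools import product

# Hypothetical similarity weights w_ij: phrase node -> candidate semantic node.
# All numbers are invented for illustration; they are not from the talk.
W = {
    "composed": {"r:created": 0.5, "r:wroteComposition": 0.6, "r:wroteSoftware": 0.3},
    "westerns": {"c:western movie": 0.6, "e:Western Digital": 0.7},
    "Rome": {"e:Rome (Italy)": 0.7, "e:Lazio Roma": 0.4},
}

# Hypothetical coherence weights v_kl between pairs of semantic nodes.
V = {
    frozenset({"r:wroteComposition", "c:western movie"}): 0.8,
    frozenset({"r:wroteComposition", "e:Rome (Italy)"}): 0.5,
    frozenset({"c:western movie", "e:Rome (Italy)"}): 0.4,
    frozenset({"r:wroteSoftware", "e:Western Digital"}): 0.6,
    frozenset({"r:created", "c:western movie"}): 0.3,
}


def objective(assignment):
    """Return sum of w_ij * Y_ij plus sum of v_kl * Z_kl for one assignment."""
    chosen = [(p, c) for p, c in assignment.items() if c is not None]
    score = sum(W[p][c] for p, c in chosen)
    nodes = [c for _, c in chosen]
    # Constraint 3: Z_kl may be 1 only if both k and l are chosen.
    # With non-negative v_kl, the optimum sets every allowed Z_kl to 1.
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            score += V.get(frozenset({nodes[a], nodes[b]}), 0.0)
    return score


def best_mapping():
    """Enumerate all assignments with at most one candidate per phrase (constraint 2)."""
    phrases = list(W)
    options = [[None] + list(W[p]) for p in phrases]
    best, best_score = None, float("-inf")
    for combo in product(*options):
        assignment = dict(zip(phrases, combo))
        score = objective(assignment)
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


if __name__ == "__main__":
    mapping, score = best_mapping()
    for phrase, node in mapping.items():
        print(f"{phrase!r} -> {node}")
    print(f"objective = {score:.2f}")
```

With these made-up weights, "westerns" is mapped to c:western movie even though e:Western Digital has the higher similarity weight, because the coherence terms favor the consistent joint mapping. Enumeration grows exponentially with the number of phrases, which is why a real system would hand the same objective to an ILP optimizer.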
Summary
• Web of Data & Knowledge & Text (RDF + Phrases) calls for semantic search by entities, classes & relations
• Diversity & ambiguity of names and phrases calls for disambiguation mapping
• Strong story for entity name disambiguation
• Ongoing work on relation phrase disambiguation
• Cornerstone of question answering with natural language or advanced keywords
Great opportunity towards next-generation search
Challenging problems: robustness, scale, dynamics & transfer

Take-Home Message
Solve "Who composed the Ecstasy and other pieces for westerns?"
We can solve semantic search with natural-language disambiguation.
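To make the take-home question concrete, here is a minimal sketch of what the disambiguated question looks like once it has been mapped to triple patterns and evaluated. The facts are the Ennio Morricone example triples from the Web of Data slide, plus one hypothetical `type western_film` triple added so that the query has something to match. The helpers `unify` and `query` are illustrative names, and the nested-loop join is a stand-in for a real SPARQL engine, not a description of any system in the talk.

```python
# Example facts from the talk, as (subject, predicate, object) triples.
TRIPLES = [
    ("Ennio_Morricone", "type", "composer"),
    ("Ennio_Morricone", "type", "GrammyAwardWinner"),
    ("composer", "subclassOf", "musician"),
    ("Ennio_Morricone", "bornIn", "Rome"),
    ("Rome", "locatedIn", "Italy"),
    ("Ennio_Morricone", "created", "Ecstasy_of_Gold"),
    ("Ennio_Morricone", "wroteMusicFor", "The_Good,_the_Bad_,and_the_Ugly"),
    ("Sergio_Leone", "directed", "The_Good,_the_Bad_,and_the_Ugly"),
    # Assumption: this triple is NOT in the talk; added for illustration only.
    ("The_Good,_the_Bad_,and_the_Ugly", "type", "western_film"),
]


def unify(pattern, triple, binding):
    """Extend binding so that pattern matches triple; return None on conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns, triples=TRIPLES):
    """Evaluate a conjunction of triple patterns with a nested-loop join."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = unify(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


if __name__ == "__main__":
    # "Who composed the Ecstasy and other pieces for westerns?"
    patterns = [
        ("?x", "created", "Ecstasy_of_Gold"),
        ("?x", "wroteMusicFor", "?m"),
        ("?m", "type", "western_film"),
    ]
    for row in query(patterns):
        print(row["?x"], "wrote music for", row["?m"])
```

Evaluating the patterns is the easy part. The hard part, and the subject of the talk, is getting from the phrases "the Ecstasy" and "pieces for westerns" to Ecstasy_of_Gold, wroteMusicFor, and western_film in the first place.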