YAGO-QA: Answering Questions by Structured Knowledge Queries

Peter Adolphs, Martin Theobald, Ulrich Schäfer, Hans Uszkoreit, Gerhard Weikum

ICSC, Stanford University, September 19, 2011


Jeopardy!

  A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?

YAGO-QA: Answering Questions by Structured Knowledge Queries 08.04.2015


Deep-QA in NL

  William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
  This town is known as "Sin City" & its downtown is "Glitter Gulch"
  As of 2010, this is the only former Yugoslav republic in the EU
  99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain

  [Figure labels: question classification & decomposition; knowledge backends; YAGO]

  D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, 2010.
  www.ibm.com/innovation/us/watson/index.htm


Structured Knowledge Queries

  A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?

  Select Distinct ?c Where {
    ?c type City .
    ?c locatedIn USA .
    ?a1 type Airport .
    ?a2 type Airport .
    ?a1 locatedIn ?c .
    ?a2 locatedIn ?c .
    ?a1 namedAfter ?p .
    ?p type WarHero .
    ?a2 namedAfter ?b .
    ?b type BattleField .
  }

  In this work: focus on factoid and list questions


Agenda

  YAGO Server & API
    Wikipedia-based information extraction
    Searching & ranking in large RDF graphs
  Names, Surface Patterns & Paraphrases
    Named entity disambiguation
    Mapping surface patterns onto semantic relations
    Crowdsourcing for question paraphrases
  YAGO-QA Architecture
    Template-based mapping of NL questions onto SPARQL
  Conclusions & Future Work


Information Extraction from Wikipedia

  Subj.                Pred.          Obj.
  Stanford University  type           Private University
                       hasPresident   J.L.Hennessy
                       hasStudents    15,319
                       foundedBy      L.Stanford
                       foundedIn      1891
  …                    …              …


YAGO Knowledge Base

  Combine knowledge from WordNet & Wikipedia
  Additional Gazetteers (geonames.org)
  Part of the LinkedData cloud


YAGO-2 Numbers

                             Just Wikipedia   Incl. Gazetteer Data
  #Relations                 104              114
  #Classes                   364,740          364,740
  #Entities                  2,641,040        9,804,102
  #Facts                     120,056,073      461,893,127
   - types & classes         8,649,652        15,716,697
   - base relations          25,471,211       196,713,637
   - space, time & proven.   85,935,210       249,462,793
  Size (CSV format)          3.4 GB           8.7 GB

  Estimated precision > 95% (for base relations excl. space, time & provenance)
  www.mpi-inf.mpg.de/yago-naga/


Searching & Ranking RDF Graphs in NAGA

  Ranking based on confidence, compactness and relevance

  Discovery queries:
    [Figure: query graph; labels: Kiel, bornIn, $x, type, $a, scientist, diedOn, $x > $b]
  Connectedness queries:
    [Figure: query graph; labels: German novelist, type, hasWon, hasSon, diedOn, $y, *, Thomas Mann, Goethe]
  Queries with regular expressions:
    [Figure: query graph; labels: Ling, hasFirstName | hasLastName, (coAuthor | advisor)*, Beng Chin Ooi, $x, type, scientist, worksFor, $y, locatedIn*, Nobel prize, Zhejiang]


YAGO Server: UI & API

  YAGO-UI
    Interactive online demo
    RDF with time, space & provenance annotations
    SPARQL + keywords
  YAGO-API
    Two basic WebServices:
      processQuery (String query)
      getYagoEntitiesByNames (String[] names)
      …

  www.mpi-inf.mpg.de/yago-naga/demo.html


Names, Surface Patterns & Paraphrases

  Which chemist was born in London?
  POS tags: chemist (NN), was born in (VBD VBN IN), London (NNP/LOC)

  (I) Named entity disambiguation
    chemist → wordnet_chemist, wordnet_pharmacist
    born → Bertran_de_Born, Born_Identity_(Movie), Born_(Album)
    London → London_UK, London_Arkansas, Antonio_London

  (II) Mapping surface patterns onto semantic relations
    <person> was_born_in <location> → bornIn(<person>, <location>)
    <person> was_born_in <date> → bornOn(<person>, <date>)

  (III) Paraphrases of questions
    <person> [was] born in <location>
    <location>-born <person>
      both → bornIn(<person>, <location>)


(I) Named Entity Disambiguation

  Wikipedia link structure
    65,872,435 intra-wiki links
    2,782,297 disambiguation pages & 328,372 redirects
    2,886,027 distinct link anchor texts
  YAGO “means” relation
    18,470,099 mappings of names to entities
    6.2 distinct names per entity (on avg.)
  Individual name disambiguation vs. joint disambiguation
  AIDA tool for graph-based disambiguation in YAGO-2:
    “Robust Disambiguation of Named Entities in Text”, J. Hoffart et al.
    In EMNLP, Edinburgh, Scotland, 2011

  #inlinks with anchor “Paris”

    Paris                             32,362
    Paris, France                     570
    Paris Masters                     134
    Paris (mythology)                 118
    University of Paris               79
    Paris, Texas                      56
    Paris, Ontario                    45
    Paris (rapper)                    29
    Open Gaz de France                26
    Paris, Kentucky                   20
    Paris (2008 film)                 19
    Gare Saint-Lazare                 18
    Paris, Tennessee                  17
    BNP Paribas Masters               16
    Paris, Maine                      14
    Paris Hilton                      12
    Paris, Arkansas                   11
    Paris (Supertramp album)          10
    Gare du Nord                      9
    Paris (1979 TV series)            8
    Count Paris                       7
    Palais Omnisports de Paris-Bercy  6
    Paris, Virginia                   5
    Paris 2012 Olympic bid            4
    Paris (2003 film)                 3

  www.mpi-inf.mpg.de/yago-naga/aida/


(II) From Patterns to Semantic Relations

  PROSPERA: statistical pattern mining from free text
  Domain-oriented extraction of patterns for known relations (POS-enhanced n-grams)
    X carried out his doctoral research in math under the supervision of Y
    X { carried out PRP doctoral research [IN NP] [DET] supervision [IN] } Y
  Confidence & support based on seeds & counter seeds
  Pattern/fact duality & consistency reasoning
    pattern-fact duality:     occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
                              occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)
    type constraints:         Spouse ⊆ Person × Person
    inclusion dependencies:   capitalOfCountry ⊆ cityOfCountry
    functional dependencies:  Spouse(x,y): x → y, y → x
  10s to 100s of typed patterns per relation


PROSPERA Architecture

  Gathering:
    Enhanced Hearst patterns
    POS-enhanced n-grams
    Pattern-fact duality & constraints
  Analysis:
    Refined pattern weights
    Carefully chosen seeds and counter seeds
    Thresholds for pattern confidence & support
  Reasoning:
    Scalable extraction & consistency reasoning
    MapReduce functions for pattern extraction & statistics gathering
    Distributed MaxSat solver (MAP inference)


(III) Crowdsourcing for Question Paraphrases

  Pattern acquisition from the crowd
    Annotators paraphrase natural-language seed questions
    Seed questions are associated with their semantic arguments and functions
    Gold resource for pattern acquisition and system evaluation
  Preliminary results
    4,620 paraphrases for 254 seed questions with 7 annotators
    Total annotation time: ~49 hours, ~1 work-day per annotator


YAGO-QA Architecture

  Input analysis
    SProUT for tokenization, stemming & NER (http://sprout.dfki.de/)
    NE gazetteer extended by YAGO entities
  Input interpretation
    Named-entity disambiguation based on YAGO statistics
    Vague matching against the gathered question paraphrases


YAGO-QA Architecture (cont’d)

  Input interpretation / Answer retrieval
    An actor whose place of birth is Chicago.
    Which actor was born in Chicago?
    Which <actor> was_born_in <Chicago>?
    ?x type ARG1 . ?x bornIn ARG2 .
  Template-based answer generation
    Who/what is/are <?x>?


YAGO-QA Example

  Multiple named entity annotations: all names are annotated
  Interpretation picks suitable NE readings
  Vague matching against surface templates


Conclusions & Future Work

  QA based on structured knowledge queries (beyond IR-style retrieval of matching sentences/paragraphs)
  Wikipedia as rich knowledge backend
    Entities, semantic classes & typed relations
    Large-scale statistics for entity disambiguation & surface patterns
  Crowdsourcing for question paraphrases
  Predefined question templates translated into join queries
  Future work
    “Open-QA” via open-domain information extraction
    Dynamic learning of template structures from grammars
    More modular template structures
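

Illustration: from question template to join query

  The slides map a question such as "Which actor was born in Chicago?" onto the pattern "?x type ARG1 . ?x bornIn ARG2 ." and answer it with a join query. The following is a minimal sketch of that idea, not the YAGO-QA implementation. It assumes an exact regular expression in place of the vague matching against paraphrases, skips named-entity disambiguation, and uses a tiny in-memory set of made-up triples instead of the YAGO server. The names TEMPLATES, FACTS, question_to_patterns, patterns_to_sparql and evaluate are hypothetical.

```python
import re

# Hypothetical surface template: a regex over the question plus the triple
# patterns it stands for. ARG1/ARG2 are the slot names used on the slide.
TEMPLATES = [
    (re.compile(r"^which (?P<ARG1>\w+) was born in (?P<ARG2>[\w ]+?)\s*\?$", re.I),
     [("?x", "type", "ARG1"), ("?x", "bornIn", "ARG2")]),
]

# Made-up toy facts standing in for the knowledge base (not YAGO data).
FACTS = {
    ("Actor_A", "type", "actor"),
    ("Actor_A", "bornIn", "Chicago"),
    ("Actor_B", "type", "actor"),
    ("Actor_B", "bornIn", "London"),
    ("Chemist_C", "type", "chemist"),
    ("Chemist_C", "bornIn", "London"),
}


def question_to_patterns(question):
    """Match the question against the templates and fill ARG1/ARG2."""
    for regex, patterns in TEMPLATES:
        match = regex.match(question.strip())
        if match:
            slots = match.groupdict()
            return [tuple(slots.get(term, term) for term in p) for p in patterns]
    return None  # no template applies


def patterns_to_sparql(patterns):
    """Render the filled triple patterns in the query notation of the slides."""
    body = " ".join(f"{s} {p} {o} ." for s, p, o in patterns)
    return f"SELECT DISTINCT ?x WHERE {{ {body} }}"


def evaluate(patterns, facts):
    """Naive join: extend the variable bindings one triple pattern at a time."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for fact in facts:
                candidate = dict(binding)
                consistent = True
                for term, value in zip(pattern, fact):
                    if term.startswith("?"):
                        # Bind the variable, or check an earlier binding.
                        if candidate.setdefault(term, value) != value:
                            consistent = False
                            break
                    elif term != value:
                        consistent = False
                        break
                if consistent:
                    extended.append(candidate)
        bindings = extended
    return sorted({b["?x"] for b in bindings})


if __name__ == "__main__":
    filled = question_to_patterns("Which actor was born in Chicago ?")
    print(patterns_to_sparql(filled))
    print(evaluate(filled, FACTS))
```

  In a real setting the filled query string would be sent to a query service such as the processQuery call listed on the YAGO-API slide, and the slot values would first be resolved to knowledge-base entities.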