Harvesting Knowledge from Web Data and Text
CIKM 2010 Tutorial (1/2 Day)
Hady W. Lauw1, Ralf Schenkel2, Fabian Suchanek3, Martin Theobald4, and Gerhard Weikum4
1Institute for Infocomm Research, Singapore 2Saarland University, Saarbruecken 3INRIA Saclay, Paris 4Max Planck Institute for Informatics, Saarbruecken
All slides for download: http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Harvesting Knowledge from Web Data
Outline
• Part I – What and Why – Available Knowledge Bases
• Part II – Extracting Knowledge
• Part III – Ranking and Searching
• Part IV – Conclusion and Outlook
Motivation
Elvis Presley 1935 - 1977. Will there ever be someone like him again?
Another Elvis: "Elvis Presley: The Early Years. Elvis spent more weeks at the top of the charts than any other artist." www.fiftiesweb.com/elvis.htm
Another singer called Elvis, young: "Personal relationships of Elvis Presley – Wikipedia ...when Elvis was a young teen.... another girl whom the singer's mother hoped Presley would .... The writer called Elvis 'a hillbilly cat'" en.wikipedia.org/.../Personal_relationships_of_Elvis_Presley
Dear Mr. Page, you don't understand me. I just... "Elvis Presley - Official page for Elvis Presley. Welcome to the Official Elvis Presley Web Site, home of the undisputed King of Rock 'n' Roll and his beloved Graceland ..." www.elvis.com/
Other (more serious?) queries:
• when is Madonna's next concert in Europe?
• which protein inhibits atherosclerosis?
• who was king of England when Napoleon I was emperor of France? King George III
• has any scientist ever won the Nobel Prize in Literature? Bertrand Russell
• which countries have a HDI comparable to Sweden's?
• which scientific papers have led to patents?
• is there another famous singer named "Elvis"?
This Tutorial
Mr. Page, let's try this again. Is there another singer named Elvis?
In this tutorial, we will explain
• how knowledge is organized
• what knowledge bases exist already
• how we can construct knowledge bases
• how we can query knowledge bases
Ontologies
(figure: layered ontology with Classes: entity, person, location, scientists, singer, city; Relations: subclassOf, type, bornIn; Instances: Elvis, Tupelo; Labels/words: "Elvis", "The King")
The same label for two entities: homonymy. The same entity has two labels: synonymy.
Classes
Transitivity: type(x,y) /\ subclassOf(y,z) => type(x,z)
Relations
Domain and range constraints:
domain(r,c) /\ r(x,y) => type(x,c)
range(r,c) /\ r(x,y) => type(y,c)
Looks like higher order, but is not. Consider introducing a predicate fact(r,x,y).
Event Entities
An event entity is an artificial entity introduced to represent an n-ary relationship, e.g. a "winner" event linking Elvis, the prize Grammy Award, and the year 1967. Event entities allow representing arbitrary relational data as binary graphs:
(table: Row42: Winner = Elvis Presley, Prize = Grammy Award; Row43: Year = 1967)
Reification
Reification is the method of creating an entity that represents a fact.
(figure: fact #42 = won(Elvis, Grammy Award), annotated with year 1967 and source Wikipedia; fact #43 = bornIn(Elvis, Tupelo))
There are different ways to reify a fact; this is the one used in this talk.
RDF
The Resource Description Framework (RDF) is a W3C standard that provides a standard vocabulary to model ontologies.
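The rule schemas above (transitivity, domain and range constraints) can be applied by simple forward chaining. A minimal sketch with toy data, not part of the tutorial itself:

```python
# Forward chaining over a toy fact set, applying the three rule schemas:
#   type(x,y) /\ subclassOf(y,z) => type(x,z)   (transitivity)
#   domain(r,c) /\ r(x,y)        => type(x,c)   (domain constraint)
#   range(r,c)  /\ r(x,y)        => type(y,c)   (range constraint)

subclass_of = {"singer": "person", "person": "entity", "city": "location"}
domains = {"bornIn": "person"}   # domain(bornIn, person)
ranges = {"bornIn": "city"}      # range(bornIn, city)

def saturate(facts):
    """Apply all three rules until a fixpoint is reached."""
    facts = set(facts)
    while True:
        new = set()
        for (s, p, o) in facts:
            if p == "type" and o in subclass_of:   # transitivity
                new.add((s, "type", subclass_of[o]))
            if p in domains:                       # domain constraint
                new.add((s, "type", domains[p]))
            if p in ranges:                        # range constraint
                new.add((o, "type", ranges[p]))
        if new <= facts:
            return facts
        facts |= new

inferred = saturate({("Elvis", "type", "singer"), ("Elvis", "bornIn", "Tupelo")})
```

Starting from only two facts, the rules derive that Elvis is a person and an entity, and that Tupelo is a city and a location.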
An RDF ontology can be seen as a directed labeled multi-graph where
• the nodes are entities
• the edges are labeled with relations
Edges (facts) are commonly written
• as triples <Elvis, bornIn, Tupelo>
• as literals bornIn(Elvis, Tupelo)
(figure: graph fragment with subclassOf edges to location and city, and a bornIn edge to Tupelo)
[W3C recommendation: RDF, 2004]
Outline • Part I – What and Why ✔ – Available Knowledge Bases • Part II – Extracting Knowledge • Part III – Ranking and Searching • Part IV – Conclusion and Outlook
Cyc
What if we could make all common sense knowledge computer-processable?
Cyc project, Douglas Lenat
• started in 1984
• driven by a staff of 20
• goal: formalize knowledge manually
[Lenat, Comm. ACM, 1995]
Cyc: Language
CycL is the formal language that Cyc uses to represent knowledge. (Semantics based on First Order Logic, syntax based on LISP)
(#$forall ?A (#$implies (#$isa ?A #$Animal) (#$thereExists ?M (#$mother ?A ?M))))
(#$arity #$GovernmentFn 1)
(#$arg1Isa #$GovernmentFn #$GeopoliticalEntity)
(#$resultIsa #$GovernmentFn #$RegionalGovernment)
(#$governs (#$GovernmentFn #$Canada) #$Canada)
+ a logical reasoner
http://cyc.com/cycdoc/ref/cycl-syntax.html
Cyc: Knowledge
#$Love: Strong affection for another agent arising out of kinship or personal ties. Love may be felt towards things, too: warm attachment, enthusiasm, or devotion. #$Love is a collection, as further explained under #$Happiness. Specialized forms of #$Love are #$Love-Romantic, platonic love, maternal love, infatuation, agape, etc.
guid: bd589433-9c29-11b1-9dad-c379636f7270
direct instance of: #$FeelingType
direct specialization of: #$Affection
direct generalization of: #$Love-Romantic
http://cyc.com/cycdoc/vocab/emotion-vocab.html#Love
Facts and axioms about: transportation, ecology, everyday living, chemistry, healthcare, animals, law, computer science...
"If a computer network implements IEEE 802.11 Wireless LAN Protocol and some computer is a node in that computer network, then that computer is vulnerable to decryption."
http://cyc.com/cyc/technology/whatiscyc_dir/maptest
Cyc: Summary (Cyc | SUMO)
License: proprietary, free for research | GNU GPL
Entities: 500k | 20k
Assertions: 5m | 70k
Relations: 15k | –
Tools: reasoner, NL understanding tool | reasoner
URL: http://cyc.com | http://ontologyportal.org
References: [Lenat, Comm. ACM 1995] | [Niles, FOIS 2001]
http://cyc.com/cyc/technology/whatiscyc_dir/whatsincyc http://ontologyportal.org
SUMO (the Suggested Upper Merged Ontology) is a research project in a similar spirit, driven by Adam Pease of Articulate Software.
WordNet
What if we could make the English language computer-processable?
George Miller
• started in 1985
• Cognitive Science Laboratory, Princeton University
• written by lexicographers
• goal: support automatic text analysis and AI applications
[Miller, CACM 1995]
WordNet: Lexical Database
Synonymous words share a sense; polysemous words have several senses: "camera" and "photographic camera" share sense1, while "camera" also has sense2 ("television camera").
WordNet: Semantic Relations
(figure: examples of hypernymy (kitchen appliances – toaster), meronymy (camera – optical lens), and is-value-of (speed – slow/fast))
Relation | Meaning | Examples
Synonymy (N, V, Adj, Adv) | same sense | (camera, photographic camera), (mountain climbing, mountaineering), (fast, speedy)
Antonymy (Adj, Adv) | opposite | (fast, slow), (buy, sell)
Hypernymy (N) | is-a | (camera, photographic equipment), (mountain climbing, climb)
Meronymy (N) | part | (camera, optical lens), (camera, view finder)
Troponymy (V) | manner | (buy, subscribe), (sell, retail)
Entailment (V) | doing X must mean doing Y | (buy, pay), (sell, give)
WordNet: Hierarchy
Hypernymy (is-a) relations, e.g. photographic equipment is-a equipment is-a instrumentation; flash is-a lamp is-a device is-a instrumentation.
WordNet: Size
#words: 155k
#senses: 117k
#word-sense pairs: 207k
%words that are polysemous: 17%
License: proprietary, free for research
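The hypernym chains above can be traversed programmatically. A hand-coded miniature of the hierarchy shown (the edge set below is illustrative, not the real WordNet database):

```python
# A toy is-a graph mirroring the WordNet hierarchy slide; each word maps
# to its (single, for simplicity) hypernym.
hypernym = {
    "camera": "photographic equipment",
    "photographic equipment": "equipment",
    "equipment": "instrumentation",
    "flash": "lamp",
    "lamp": "device",
    "device": "instrumentation",
}

def hypernym_path(word):
    """Follow is-a pointers from a word up to the hierarchy root."""
    path = [word]
    while path[-1] in hypernym:
        path.append(hypernym[path[-1]])
    return path
```

For example, `hypernym_path("camera")` walks camera, photographic equipment, equipment, instrumentation; the real database is queried the same way by following hypernym pointers from a sense.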
http://wordnet.princeton.edu/wordnet/man2.1/wnstats.7WN.html
Downloadable at http://wordnet.princeton.edu
Wikipedia
If a small number of people can create a knowledge base, how about a LARGE number of people?
Jimmy Wales
• started in 2001
• driven by Wikimedia Foundation, and a large number of volunteers
• goal: build world's largest encyclopedia
Wikipedia: Entities and Attributes (entities, attributes)
Wikipedia: Synonymy and Polysemy (redirection for synonyms, disambiguation pages for polysemy)
Wikipedia: Classes/Categories (class hierarchy different from WordNet)
Wikipedia: Others (inter-lingual links, navigation/topic boxes)
Wikipedia: Numbers
English: 1B words, 2.8M articles, 152K contributors
All (250 languages): 1.74B words, 9.25M articles, 283K contributors
vs. Britannica: 25X as many words, ½ avg article length
License: Creative Commons Attribution-ShareAlike (CC-BY-SA)
Growth 2001 - 2008
Downloadable at http://download.wikimedia.org/
Automatically Constructed Knowledge Bases
• Manual approaches (Cyc, WordNet, Wikipedia) – produce high quality knowledge bases – labor-intensive and limited in scope
Can we construct the knowledge bases automatically? YAGO, …, etc.
YAGO
Can we exploit Wikipedia and WordNet to build an ontology?
• started as PhD thesis in 2007
• now major project at the Max Planck Institute for Informatics in Germany
• goal: extract ontology from Wikipedia with high accuracy and consistency
[Suchanek et al., WWW 2007]
YAGO: Construction
WordNet: Person subclassOf Singer; Wikipedia article (mock text): Blah blah blub fasel (do not read this, better listen to the talk) blah blah Elvis blub (you are still reading this) blah Elvis blah blub later became astronaut blah ~Infobox~ Born: 1935 ...
Categories: Rock singer born
Exploit infoboxes; exploit conceptual categories; add WordNet. 1935
YAGO: Consistency Checks
(figure: Person subclassOf Singer / Guitarist / Rock Singer; type Physics; born 1935)
Check uniqueness of entities and functional arguments. Check domains and ranges of relations. Check type coherence.
YAGO: Relations
About People: actedIn, bornIn / on date, diedIn / on date, created / on date, discovered, hasChild, hasSpouse, family name, graduatedFrom, ...
About Locations: establishedOnDate, established from / until, hasCapital, hasPopulation, locatedIn, hasCurrency, hasInflation, hasPolitician, ...
About Other Things: happenedIn, isCalled, foundIn, produced, hasProductionLanguage, hasISBN, hasPredecessor, ...
ca. 100 relations with range and domain
YAGO: Numbers (YAGO | YAGO+Geonames)
Entities: 2.6m | 10m
organizations: 0.5m | 0.5m
people: 0.8m | 0.8m
classes: 0.5m | 0.5m
Facts: 30m | 240m
Relations: 86 | 92
Precision: 95% | 95%
License: Creative Commons Attribution-NonCommercial (CC-BY-NC)
Downloadable at http://mpii.de/yago incl. converters for RDF, XML, databases
DBpedia
Can we harvest facts more exhaustively with community effort?
• community effort started in 2007
• driven by Free U. Berlin, U. Leipzig, OpenLink
• goal: "extract structured information from Wikipedia and to make this information available on the Web"
[Bizer et al., Journal of Web Semantics 2009]
DBpedia: Ontology
In YAGO, the taxonomy is based on WordNet classes. DBpedia:
• places entities extracted from Wikipedia into its own ontology.
• hand-crafted: 259 classes, 6 levels, 1200 properties
• emphasizes recall
• only half of extracted entities are currently placed in its own ontology
• alternative classifications: Wikipedia, YAGO, UMBEL (OpenCyc)
DBpedia: Mapping Rules
DBpedia mapping rules:
• map Wikipedia infoboxes and tables to its ontology
• target datatypes (normalize units, ignore deviant values)
Community effort:
• hand-craft mapping rules
• expand ontology
< http://en.wikipedia.org/wiki/Elvis_Presley >
{{Infobox musical artist |Name = Elvis Presley |Background = solo_singer |Birth_name = Elvis Aaron Presley }}
< http://dbpedia.org/page/Elvis_Presley >
foaf:name "Elvis Presley"; background "solo_singer"; foaf:givenName "Elvis Aaron Presley";
Note that the values do not change.
DBpedia: Numbers
Facts: English: 257 m (YAGO: 240 m); all languages: 1 b
Entities: 3.4 m overall (YAGO: 10 m); 1.5 m in the DBpedia ontology
People: 312 k
Locations: 413 k
Organizations: 140 k
License: Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0)
plus • 5.5 m links to external Web pages • 1.5 m links to images • 5 m links to other RDF data sets
Downloadable at http://dbpedia.org
Freebase
What if we could harvest both automatic extraction and user contribution?
• started in 2000
• driven by Metaweb, part of Google since Jul 2010
• goals: "an open shared database of the world's knowledge"; "a massive, collaboratively-edited database of cross-linked data"
Like DBpedia and YAGO, Freebase imports data from Wikipedia. Differently, it:
• also imports from other sources (e.g., ChefMoz, NNDB, and MusicBrainz)
• includes individually contributed data
• lets users collaboratively edit its data (without having to edit Wikipedia).
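The infobox-to-ontology mapping shown above can be sketched in a few lines. The mapping table mirrors the Elvis example; it is illustrative, not DBpedia's actual rule set:

```python
# Hand-crafted mapping rules: infobox attribute -> ontology property
# (illustrative names modeled on the slide's Elvis example).
mapping = {
    "Name": "foaf:name",
    "Background": "background",
    "Birth_name": "foaf:givenName",
}

def map_infobox(entity, infobox):
    """Emit (subject, property, value) triples; values pass through unchanged."""
    return [(entity, mapping[attr], value)
            for attr, value in infobox.items() if attr in mapping]

triples = map_infobox("dbpedia:Elvis_Presley",
                      {"Name": "Elvis Presley",
                       "Background": "solo_singer",
                       "Birth_name": "Elvis Aaron Presley"})
```

As on the slide, the attribute names are translated while the attribute values stay as they are; unmapped attributes are simply skipped.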
Freebase: User Contribution
Edit entities: • create new entities • assign a new type/class to an entity • add/change attributes • connect to other entities • upload/edit images
Review: • flag vandalism • flag entities to be merged/deleted • vote on flagged content (3 unanimous votes, or an expert has to be the tie-breaker)
Edit schema: • define new class, specifying the attributes of the class • class definition can only be changed by creator/admin • class not part of commons until peer-reviewed & promoted by staff/admin
Data game: • finding aliases in Wikipedia redirects • extracts dates of events from Wikipedia articles • uses the Yahoo image search API to find candidates
Freebase: Community
Experts: • tie breaker in reviews • split entities • "rewind" changes. New experts inducted by current experts.
Admins: • create new classes and attributes • respond to community suggestions. Promoted by staff or other admins.
Members: • contribute (edit, review, vote). Anyone can be a member.
Freebase: Numbers
Facts: 41 m
Entities: 13 m (YAGO: 10 m)
People: 2 m
Locations: 946 k
Businesses: 567 k
Film: 397 k
License: Creative Commons Attribution (CC-BY)
Downloadable at http://download.freebase.com
Question Answering Systems
Objective is to answer user queries from an underlying knowledge base.
• data from Wikipedia and user edits
• natural language translation of queries
• 9 m entities, 300 m facts
• computes answers from an internal knowledge base of curated, structured data
• stores not just facts, but also algorithms and models
Application: Semantic Similarity
• Task: determine similarity between two words – topological distance of two words in the graph – taxonomic distance: hierarchical is-a relations
• Example application: correct real-word spelling errors
(figure: two taxonomy fragments, physical entity – legume – bean – soy and garment – trouser – jean) Tofu is made from soy jeans.
[Hirst et al., Natural Language Engineering 2001]
Application: Sentiment Orientation
• Task: determine an adjective's polarity (positive or negative) – same polarity connected by synonymic relations – opposite polarity by antonymic relations
• Example application: overall sentiment of customer reviews
(figure: GOOD: suitable, appropriate, proper, right; BAD: spoiled, defective, forged, risky)
[Hu et al., KDD 2004]
Application: Annotation of Web Data
• Task: given a data source in the form of a Web table – annotate column with entity type – annotate pair of columns with relationship type – annotate table cell with entity ID
[Limaye et al., VLDB 2010]
Application: Map Annotation
Idea: • determine geographical entities in the vicinity (by GPS coordinates) • show information about these entities (from DBpedia)
Possible applications: • map search on the Internet • Enhanced Reality applications
[Becker et al., Linking Open Data Workshop 2008]
Application: Faceted Search (DBpedia Browser)
• attributes and values based on frequency (?)
• search is "full text search within results"
• constraints are listed for possible deletion
• suggestions based on current consideration set
Summary
• Part I covers what knowledge bases are
– Knowledge representation model (RDF)
– Manual knowledge bases: WordNet (expert-driven, English words); Wikipedia (community-driven, entities/attributes)
– Automatically extracted knowledge bases: YAGO (Wikipedia + WordNet, automated, high precision); DBpedia (Wikipedia + community-crafted mapping rules, high recall); Freebase (Wikipedia + other databases + user edits)
• Part II will cover how to extract information included in the knowledge bases
References for Part I
C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann: DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, Pages 154–165, 2009.
C. Becker, C.
Bizer: DBpedia Mobile: A Location-Enabled Linked Data Browser. Linking Open Data Workshop, 2008.
G. Hirst and A. Budanitsky: Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering 11 (1): 87–111, 2001.
M. Hu and B. Liu: Mining and Summarizing Customer Reviews. KDD, 2004.
J. Kamps, M. Marx, R. J. Mokken, and M. de Rijke: Using WordNet to Measure Semantic Orientations of Adjectives. LREC, 2004.
D. Lenat: CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 1995.
G. Limaye, S. Sarawagi, and S. Chakrabarti: Annotating and Searching Web Tables Using Entities, Types and Relationships. VLDB, 2010.
G. A. Miller: WordNet: A Lexical Database for English. Communications of the ACM Vol. 38, No. 11: 39-41, 1995.
F. M. Suchanek, G. Kasneci and G. Weikum: Yago - A Core of Semantic Knowledge. WWW, 2007.
I. Niles and A. Pease: Towards a Standard Upper Ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Chris Welty and Barry Smith, eds, Ogunquit, Maine, October 17-19, 2001.
World Wide Web Consortium: RDF Primer. W3C Recommendation, 2004. http://www.w3.org/TR/rdf-primer/
Outline • Part I – What and Why ✔ – Available Knowledge Bases ✔ • Part II – Extracting Knowledge • Part III – Ranking and Searching • Part IV – Other topics
Entities & Classes
Which entity types (classes, unary predicates) are there? scientists, doctoral students, computer scientists, …; female humans, male humans, married humans, …
Which subsumptions should hold (subclass/superclass, hyponym/hypernym, inclusion dependencies)? subclassOf (computer scientists, scientists), subclassOf (scientists, humans), …
Which individual entities belong to which classes? instanceOf (Surajit Chaudhuri, computer scientists), instanceOf (Barbara Liskov, computer scientists), instanceOf (Barbara Liskov, female humans), …
Which names denote which entities?
means ("Lady Di", Diana Spencer), means ("Diana Frances Mountbatten-Windsor", Diana Spencer), …
means ("Madonna", Madonna Louise Ciccone), means ("Madonna", Madonna (painting by Edvard Munch)), …
...
Binary Relations
Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
hasAdvisor (JimGray, MikeHarrison)
hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
hasAdvisor (Susan Davidson, Hector Garcia-Molina)
graduatedAt (JimGray, Berkeley)
graduatedAt (HectorGarcia-Molina, Stanford)
hasWonPrize (JimGray, TuringAward)
bornOn (JohnLennon, 9-Oct-1940)
diedOn (JohnLennon, 8-Dec-1980)
marriedTo (JohnLennon, YokoOno)
Which additional & interesting relation types are there between given classes of entities?
competedWith(x,y), nominatedForPrize(x,y), …
divorcedFrom(x,y), affairWith(x,y), …
assassinated(x,y), rescued(x,y), admired(x,y), …
Higher-arity Relations & Reasoning
• Time, location & provenance annotations
• Knowledge representation – how do we model & store these?
• Consistency reasoning – how do we filter out inconsistent facts that the extractor produced?
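The annotation scheme above can be stored with the reification idea from Part I: every fact gets an identifier, and time, location, and provenance annotations are themselves facts whose subject is that identifier. A minimal sketch with toy data:

```python
# Reified fact store: fact id -> (subject, predicate, object). Annotation
# facts use another fact's id as their subject.
facts = {
    1: ("JimGray", "hasAdvisor", "MikeHarrison"),
    5: ("ManchesterU", "wonCup", "ChampionsLeague"),
    11: (5, "inYear", 1999),            # a fact about fact 5
    12: (5, "location", "CampNou"),
    13: (5, "source", "Wikipedia"),
}

def annotations(fact_id):
    """Collect all annotation facts attached to a given fact id."""
    return {p: o for (s, p, o) in facts.values() if s == fact_id}
```

Looking up `annotations(5)` recovers the temporal, spatial, and provenance annotations of the ManchesterU fact, which is how binary triples can carry what is effectively higher-arity information.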
Facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
5: (ManchesterU, wonCup, ChampionsLeague)
Facts about facts:
6: (1, inYear, 1968)
7: (2, inYear, 2006)
8: (3, validFrom, 22-Dec-2000)
9: (3, validUntil, Nov-2008)
10: (4, validFrom, 2-Feb-2008)
11: (2, source, SigmodRecord)
12: (5, inYear, 1999)
13: (5, location, CampNou)
14: (5, source, Wikipedia)
Outline • Part I – What and Why ✔ – Available Knowledge Bases ✔ • Part II – Extracting Knowledge • Part III – Ranking and Searching • Part IV – Conclusion and Outlook
Outline • Part II – Extracting Knowledge • Pattern-based Extraction • Consistency Reasoning • Higher-arity Relations: Space & Time
Framework: Information Extraction (IE)
"Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …"
Source-centric IE (one source): 1) recall! 2) precision
instanceOf (Surajit, scientist), inField (Surajit, computer science), hasAdvisor (Surajit, Jeff Ullman), almaMater (Surajit, Stanford U), workedFor (Surajit, HP), friendOf (Surajit, Umesh Dayal), …
Yield-centric harvesting (many sources): 1) precision! 2) recall; near-human quality!
hasAdvisor: (Surajit Chaudhuri, Jeffrey Ullman), (Alon Halevy, Jeffrey Ullman), (Jim Gray, Mike Harrison), …
almaMater: (Surajit Chaudhuri, Stanford U), (Alon Halevy, Stanford U), (Jim Gray, UC Berkeley), …
Framework: Knowledge Representation
• RDF (Resource Description Framework, W3C): subject-property-object (SPO) triples / binary relations; highly structured, but no (prescriptive) schema; first-order logical reasoning over binary predicates. This tutorial!
• Frames, F-Logic, description logics: OWL/DL/lite
• Also: higher-order logics, epistemic logics
Facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
Reification: facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)
Temporal, spatial, & provenance annotations can refer to reified facts via fact identifiers (approx. equivalent to higher-arity RDF: Sub Prop Obj Time Location Source) ...
Picking Low-Hanging Fruit (First)
Deterministic pattern matching [Kushmerick 97; Califf & Mooney 99; Gottlob 01, …] ...
Wrapper Induction [Gottlob et al: VLDB'01, PODS'04, …]
• Hierarchical document structure, XHTML, XML
• Pattern learning for restricted regular languages (ELog, combining concepts of XPath & FOL)
• Visual interfaces, see e.g. http://www.lixto.com/, http://w4f.sourceforge.net/
Tapping on Web Tables [Cafarella et al: PVLDB'08; Sarawagi et al: PVLDB'09]
Problem: discover interesting relations (wonAward: Person Award; nominatedForAward: Person Award; …) from many table headers and co-occurring cells ...
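The table-harvesting idea above can be sketched as a header lookup: if a table's header matches a known relation signature, its rows become relation instances. The signature table and data below are illustrative:

```python
# Known relation signatures: normalized header tuple -> relation name
# (illustrative; a real system learns these from many tables).
signatures = {("person", "award"): "wonAward"}

def table_to_facts(header, rows):
    """Turn the rows of a recognized table into relation tuples."""
    rel = signatures.get(tuple(h.strip().lower() for h in header))
    if rel is None:
        return []                       # header not recognized
    return [(rel, *row) for row in rows]

harvested = table_to_facts(["Person", "Award"],
                           [("Jim Gray", "Turing Award"),
                            ("Barbara Liskov", "Turing Award")])
```

Real systems such as those cited above additionally exploit co-occurrence statistics across millions of tables rather than a fixed signature list.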
Relational Fact Extraction From Plain Text
• Hearst patterns [Hearst: COLING'92] – POS-enhanced regular expression matching in natural-language text
NP0 {,} such as {NP1, NP2, … (and|or) }{,} NPn
NP0 {,}{NP1, NP2, … NPn-1}{,} or other NPn
…
"The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." yields isA("Bambara ndang", "bow lute")
• Noun classification from predicate-argument structures [Hindle: ACL'90] – clustering of nouns by similar verbal phrases – similarity based on co-occurrence frequencies (mutual information)
(example mutual-information scores: drink: beer 9.34, wine 10.20; sell: beer 4.21, wine 3.75; have: beer 0.84, wine 1.38)
DIPRE [Brin: WebDB'98]
• DIPRE: "Dual Iterative Pattern Relation Extraction"
– (Almost) unsupervised, iterative gathering of facts and patterns
– Positive & negative examples as seeds for target relation, e.g. +(Hillary, Bill) +(Carla, Nicolas) –(Larry, Google)
– Specificity threshold for new patterns based on occurrence frequency
(example iteration: (Hillary, Bill), (Carla, Nicolas) yield patterns "X and her husband Y", "X and Y on their honeymoon", which yield (Angelina, Brad), (Victoria, David); further patterns "X and Y and their children", "X has been dating with Y", "X loves Y" also yield noise such as (Larry, Google) …)
DIPRE/Snowball/QXtract [Brin: WebDB'98; Agichtein, Gravano: SIGMOD'01+'03]
• Snowball/QXtract [Agichtein, Gravano: DL'00, SIGMOD'01+'03]
– Refined patterns and statistical measures
– >80% recall at >85% precision over a large news corpus
– QXtract demo additionally allowed user feedback in the iteration loop
Help from NLP: Dependency Parsing!
• Analyze lexico-syntactic structure of sentences
– Part-Of-Speech (POS) tagging & dependency parsing
– Prefer shorter dependency paths for fact candidates
"Carla has been seen dating with Ben." (NNP VBZ VBN VBN VBG IN NNP) yields dating(Carla, Ben)
Software tools: CMU Link Parser: http://www.link.cs.cmu.edu/link/ | Stanford Lex Parser: http://nlp.stanford.edu/software/lex-parser.shtml | Open NLP Tools: http://opennlp.sourceforge.net/ | ANNIE Open-Source Information Extraction: http://www.aktors.org/technologies/annie/ | LingPipe: http://alias-i.com/lingpipe/ (commercial license)
Open-Domain Gathering of Facts (Open IE) [Etzioni, Cafarella et al: WWW'04, IJCAI'07; Weld, Hoffman, Wu: SIGMOD-Rec'08]
Analyze verbal phrases between entities for new relation types:
• unsupervised bootstrapping with short dependency paths: "Carla has been seen dating with Ben." "Rumors about Carla indicate there is something between her and Ben."
• self-supervised classifier for (noun, verb-phrase, noun) triples: "… seen dating with …" (Carla, Ben), (Carla, Sofie), …; "… partying with …" (Carla, Ben), (Paris, Heidi), …
• build statistics & prune sparse candidates
• group/cluster candidates for new relation types and their facts: {romanticRelation}, {datesWith, partiesWith}, {affairWith, flirtsWith}, ...
But: the result is often noisy, relations are not canonicalized, and quality is far from near-human.
Learning More Mappings [Wu & Weld: CIKM'07, WWW'08]
Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data
• advanced ML methods (MLN's, SVM's)
• rich features from various sources
• Category/class name similarity measures
• Category instances and their infobox templates: template names, attribute names (e.g.
knownFor)
• Wikipedia edit history: refinement of categories
• Hearst patterns: C such as X, X and Y and other C's, …
• Other search-engine statistics: co-occurrence frequencies
(Wikipedia scale: > 3 Mio. entities, > 1 Mio. w/ infoboxes, > 500 000 categories)
Entity Disambiguation
Names vs. entities: "Penn" / "U Penn": Sean Penn? University of Pennsylvania? "Penn State": Pennsylvania State University; "PSU": Pennsylvania State University? Pennsylvania (US State)? Passenger Service Unit?
• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links
Individual Entity Disambiguation
(example: "Penn" near "Into the Wild" maps to Sean Penn; "Penn" near "XML Treebank" or "Univ. Park" maps to University of Pennsylvania or Penn State University)
Typical approaches:
• name similarity: edit distances, n-gram overlap, …
• context similarity: record level; words/phrases level; text around names, classes & facts around entities
Challenge: efficiency & scalability
Collective Entity Disambiguation [Doan et al: AAAI'05; Singla, Domingos: ICDM'07; Chakrabarti et al: KDD'09, …]
• Consider a set of names {n1, n2, …} in the same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define a joint objective function (e.g. likelihood for a probabilistic model) that rewards coherence of mappings (n1) = x1 in E1, (n2) = x2 in E2, …
• Solve the optimization problem
(example: "Stuart Russell" maps to Stuart Russell (computer scientist) rather than Stuart Russell (DJ), and "Michael Jordan" to Michael Jordan (computer scientist) rather than Michael Jordan (NBA), because the two choices cohere)
Declarative Extraction Frameworks
• IBM's SystemT [Krishnamurthy et al: SIGMOD Rec.'08, ICDE'08]
– Fully declarative extraction framework
– SQL-style operators, cost models, full optimizer support
• DBLife/Cimple [DeRose, Doan et al: CIDR'07, VLDB'07]
– Online community portal centered around the DB domain (regular crawls of DBLP, conferences, homepages, etc.)
• More commercial endeavors:
– FreeBase.com, WolframAlpha.com, Sig.ma, TrueKnowledge.com, Google.com/squared
(DBLife sources: Google Images, DBLP, homepages, DBWorld, Google Scholar)
Probabilistic Extraction Models
• Hidden Markov Models (HMMs) [Rabiner: Proc. IEEE'89; Sutton, McCallum: MIT Press'06]
– Markov chain (directed graphical model) with "hidden" states Y, observations X, and transition probabilities
– Factorizes the joint distribution P(Y,X)
– Assumes independence among observations
• Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira: ML'01; Sarawagi, Cohen: NIPS'04]
– Markov random field (undirected graphical model)
– Models the conditional distribution P(Y|X) (less strict independence assumptions)
"I went skiing with Fernando Pereira in British Columbia."
• Joint segmentation and disambiguation of input strings onto entities and classes: NER, POS tagging, etc.
• Trained, e.g., on bibliographic entries, no manual labeling required
Pattern-Based Harvesting [Hearst 92; Brin 98; Agichtein 00; Etzioni 04; …]
Facts & fact candidates (Hillary, Bill), (Carla, Nicolas) yield patterns "X and her husband Y", "X and Y on their honeymoon", "X and Y and their children", "X has been dating with Y", "X loves Y", which yield (Angelina, Brad), (Victoria, David), (Hillary, Bill), (Carla, Nicolas), (Yoko, John), (Kate, Pete), (Carla, Benjamin), (Larry, Google), …
• good for recall
• noisy, drifting
• not robust enough for high precision
Outline • Part II – Extracting Knowledge • Pattern-based Extraction ✔ • Consistency Reasoning • Higher-arity Relations: Space & Time
French Marriage Problem
isMarriedTo: person person
isMarriedTo: frenchPolitician person ...
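The pattern-based harvesting loop summarized above (seed pairs yield textual patterns, patterns yield new pairs) can be sketched in a DIPRE-style toy implementation. The corpus and seeds are invented for illustration:

```python
import re

# Toy corpus; the last two sentences contain pairs not in the seed set.
corpus = [
    "Hillary and her husband Bill appeared together.",
    "Carla and her husband Nicolas visited Berlin.",
    "Victoria and her husband David smiled.",
    "Angelina and Brad on their honeymoon.",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}

def learn_patterns(pairs, corpus):
    """Collect the text occurring between the two members of a seed pair."""
    patterns = set()
    for x, y in pairs:
        for sent in corpus:
            m = re.search(re.escape(x) + "(.+?)" + re.escape(y), sent)
            if m:
                patterns.add(m.group(1))
    return patterns

def apply_patterns(patterns, corpus):
    """Match learned patterns against the corpus to harvest new pairs."""
    pairs = set()
    for middle in patterns:
        for sent in corpus:
            m = re.search(r"(\w+)" + re.escape(middle) + r"(\w+)", sent)
            if m:
                pairs.add((m.group(1), m.group(2)))
    return pairs

patterns = learn_patterns(seeds, corpus)
harvested = apply_patterns(patterns, corpus)
```

One round of the loop learns the pattern " and her husband " and harvests the unseen pair (Victoria, David); a real system iterates, keeps negative seeds, and applies the specificity threshold mentioned above to suppress drifting patterns.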
French Marriage Problem
Facts in KB: married (Hillary, Bill), married (Carla, Nicolas), married (Angelina, Brad)
New facts or fact candidates: married (Cecilia, Nicolas), married (Carla, Benjamin), married (Carla, Mick), married (Michelle, Barack), married (Yoko, John), married (Kate, Leonardo), married (Carla, Sofie), married (Larry, Google)
1) for recall: pattern-based harvesting
2) for precision: consistency reasoning
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates!
First-order-logic rules (restricted):
spouse(x,y) /\ diff(y,z) => ¬spouse(x,z)
spouse(x,y) /\ diff(w,x) => ¬spouse(w,y)
spouse(x,y) => f(x)
spouse(x,y) => m(y)
spouse(x,y) => (f(x) /\ m(y)) \/ (m(x) /\ f(y))
Rules reveal inconsistencies; find consistent subset(s) of atoms ("possible world(s)", "the truth").
Ground atoms: spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie), f(Hillary), f(Carla), f(Cecilia), f(Sofie), m(Bill), m(Nicolas), m(Ben), m(Mick)
Rules can be weighted (e.g. by the fraction of ground atoms that satisfy a rule), giving uncertain / probabilistic data: compute prob. distr.
over (a subset of) ground atoms being "true"
Markov Logic Networks (MLN's) [Richardson, Domingos: ML 2006]
Map logical constraints & fact candidates into a probabilistic graphical model: Markov Random Field (MRF)
FOL rules:
s(x,y) /\ diff(y,z) => ¬s(x,z)
s(x,y) /\ diff(w,x) => ¬s(w,y)
s(x,y) => f(x); s(x,y) => m(y)
f(x) => ¬m(x); m(x) => ¬f(x)
Grounding: literal becomes Boolean var; reasoning: literal becomes binary RV.
Grounded clauses (example): ¬s(Ca,Nic) \/ ¬s(Ce,Nic); ¬s(Ca,Nic) \/ ¬s(Ca,Ben); ¬s(Ca,Nic) \/ m(Nic); ¬s(Ca,Nic) \/ ¬s(Ca,So); ¬s(Ce,Nic) \/ m(Nic); ¬s(Ca,Ben) \/ ¬s(Ca,So); ¬s(Ca,Ben) \/ m(Ben); ¬s(Ca,So) \/ m(So)
Base facts w/ entities: s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
RVs are coupled by an MRF edge if they appear in the same clause; MRF assumption: P[Xi|X1..Xn] = P[Xi|MB(Xi)]; the joint distribution has product form over all cliques.
Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
(figure: grounded MRF with per-atom marginal probabilities, e.g. s(Ce,Nic) 0.8, m(Nic) 0.1, s(Ca,Nic) 0.5, s(Ca,Ben) 0.2, s(Ca,So) 0.7, m(Ben) 0.6, m(So) 0.7)
Consistency reasoning: prune low-confidence facts!
StatSnowball [Zhu et al: WWW'09], BioSnowball [Liu et al: KDD'10]
EntityCube, MSR Asia: http://entitycube.research.microsoft.com/
Related Alternative Probabilistic Models
Constrained Conditional Models [Roth et al.
2007]: log-linear classifiers with a constraint-violation penalty, mapped into Integer Linear Programs.
Factor Graphs with Imperative Variable Coordination [McCallum et al. 2008]: RVs share "factors" (joint feature functions); generalizes MRF, BN, CRF, …; inference via advanced MCMC; flexible coupling & constraining of RVs.
Software tools:
alchemy.cs.washington.edu
code.google.com/p/factorie/
research.microsoft.com/en-us/um/cambridge/projects/infernet/

Reasoning for KB Growth: the Direct Route [Suchanek, Sozio, Weikum: WWW'09]
Facts in KB: married(Hillary, Bill), married(Carla, Nicolas), married(Angelina, Brad)
+ New fact candidates: married(Cecilia, Nicolas), married(Carla, Benjamin), married(Carla, Mick), married(Carla, Sofie), married(Larry, Google)?
Patterns: "X and her husband Y", "X and Y and their children", "X has been dating with Y", "X loves Y"
Direct approach:
• KB facts are true; fact candidates & patterns are hypotheses
• grounded constraints become clauses with hypotheses as variables
• cast into Weighted Max-Sat with weights from pattern statistics
• customized approximation algorithm
• unifies: fact/candidate consistency, pattern goodness, entity disambiguation
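The weighted-grounding idea behind MLN-style reasoning can be explored with a tiny brute-force computation over all possible worlds. This is only a sketch: the atoms and clause weights below are invented for illustration, and real engines (e.g. the tools listed above) use MCMC or lifted inference instead of enumeration.

```python
import math
from itertools import product

# Three candidate "spouse" atoms from the French Marriage example.
atoms = ["s_Ca_Nic", "s_Ce_Nic", "s_Ca_Ben"]

# Weighted ground clauses as (weight, evaluator); weights are illustrative.
clauses = [
    (2.0, lambda w: w["s_Ca_Nic"]),   # pattern evidence for spouse(Carla, Nicolas)
    (1.0, lambda w: w["s_Ce_Nic"]),   # weaker evidence for spouse(Cecilia, Nicolas)
    (1.5, lambda w: w["s_Ca_Ben"]),   # evidence for spouse(Carla, Benjamin)
    (3.0, lambda w: not (w["s_Ca_Nic"] and w["s_Ce_Nic"])),  # Nicolas: one spouse
    (3.0, lambda w: not (w["s_Ca_Nic"] and w["s_Ca_Ben"])),  # Carla: one spouse
]

def world_weight(world):
    # MLN semantics: P(world) is proportional to exp(sum of satisfied clause weights)
    return math.exp(sum(wt for wt, c in clauses if c(world)))

worlds = [dict(zip(atoms, v)) for v in product([False, True], repeat=len(atoms))]
z = sum(world_weight(w) for w in worlds)          # partition function

def marginal(atom):
    return sum(world_weight(w) for w in worlds if w[atom]) / z

for a in atoms:
    print(a, round(marginal(a), 3))
```

Note how the mutual-exclusion clauses let the candidates compete: an atom's marginal reflects not only its own evidence but also how many other candidates it contradicts.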
www.mpi-inf.mpg.de/yago-naga/sofie/

SOFIE: Facts & Patterns Consistency [Suchanek, Sozio, Weikum: WWW'09]
Constraints to connect facts, fact candidates & patterns:
pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)
• Grounded into a large propositional Boolean formula in CNF
• Max-Sat solver for joint inference (complete truth assignment to all candidate patterns & facts)
name(-in-context)-to-entity mapping: means(n,e1) ⊕ means(n,e2) ⊕ … (mutually exclusive)
functional dependencies: spouse(x,y): x → y, y → x
relation properties: asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies: spouse ⊆ Person × Person; capitalOfCountry ⊆ cityOfCountry
domain-specific constraints:
bornInYear(x) + 10 years ≤ graduatedInYear(x)
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
www.mpi-inf.mpg.de/yago-naga/sofie/

SOFIE Example
Spouse(HillaryClinton, BillClinton)
Spouse(CarlaBruni, NicolasSarkozy)
occurs(X and her husband Y, Hillary, Bill)
occurs(X Y and their children, Hillary, Bill)
occurs(X and her husband Y, Victoria, David)
occurs(X dating with Y, Rebecca, David)
occurs(X dating with Y, Victoria, Tom)
Spouse(Victoria, David) [1]
Spouse(Rebecca, David) [1]
Spouse(Victoria, Tom) [1]
expresses(X and her husband Y, Spouse)
expresses(X Y and their children, Spouse)
expresses(X dating
with Y, Spouse)

Grounded constraints (clause weights, shown in brackets on the slide, come from pattern statistics: [100], [40], [60], [20], [10], [1], …):
∀x,y,z: R(x,y) ∧ R(x,z) ⇒ y=z   grounds to, e.g., ¬Spouse(Victoria, David) ∨ ¬Spouse(Victoria, Tom)
∀x,y,w: R(x,y) ∧ R(w,y) ⇒ x=w   grounds to, e.g., ¬Spouse(Victoria, David) ∨ ¬Spouse(Rebecca, David)
∀x,y: R(x,y) ⇒ ¬R(y,x)   (asymmetry)
∀p,x,y: occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)   grounds to, e.g.:
occurs(husband, Victoria, David) ∧ expresses(husband, Spouse) ⇒ Spouse(Victoria, David)
occurs(dating, Rebecca, David) ∧ expresses(dating, Spouse) ⇒ Spouse(Rebecca, David)
∀p,x,y: occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)   grounds to, e.g.:
occurs(husband, Victoria, David) ∧ Spouse(Victoria, David) ⇒ expresses(husband, Spouse)

Soft Rules vs. Hard Constraints
Enforce FDs (mutual exclusion) as hard constraints:
hasAdvisor(x,y) ∧ diff(y,z) ⇒ ¬hasAdvisor(x,z)
Combining these with weighted constraints is no longer regular MaxSat → constrained (weighted) MaxSat instead.
Generalize to other forms of constraints:
Hard constraint: hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
Soft constraint: firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q) + 5 years ⇒ hasAdvisor(x,y) [0.6]
Open issue for arbitrary constraints: Datalog-style grounding (deductive & potentially recursive) → rethink reasoning!

Pattern Harvesting, Revisited [Suchanek et al.: KDD'06; Nakashole et al.: WebDB'10, WSDM'11]
Narrow / nasty / noisy patterns:
"X and his famous advisor Y"
"X carried out his doctoral research in math under the supervision of Y"
"X jointly developed the method with Y"
Using only narrow patterns & dropping nasty ones loses recall!
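The weighted MaxSat formulation on the SOFIE example above can be illustrated by brute force over the four hypotheses. A sketch only: the weights loosely follow the slide's pattern statistics, the hard functional dependency is encoded as a dominating weight, and SOFIE's actual customized approximation algorithm is not reproduced here.

```python
from itertools import product

# Hypotheses: two "expresses" variables and two spouse-fact candidates.
hyps = ["exp_husband", "exp_dating", "sp_Victoria_David", "sp_Rebecca_David"]

# Weighted clauses as (weight, evaluator); weights are illustrative.
clauses = [
    # seed fact Spouse(Hillary, Bill) co-occurred with "X and her husband Y"
    (100, lambda w: w["exp_husband"]),
    # "X dating with Y" has much weaker seed support
    (10,  lambda w: w["exp_dating"]),
    # occurs(husband, Victoria, David) & expresses(husband, Spouse) => fact
    (60,  lambda w: not w["exp_husband"] or w["sp_Victoria_David"]),
    # occurs(dating, Rebecca, David) & expresses(dating, Spouse) => fact
    (20,  lambda w: not w["exp_dating"] or w["sp_Rebecca_David"]),
    # hard functional dependency: David has at most one spouse
    (10**6, lambda w: not (w["sp_Victoria_David"] and w["sp_Rebecca_David"])),
]

def sat_weight(assignment):
    return sum(wt for wt, c in clauses if c(assignment))

best = max((dict(zip(hyps, v)) for v in product([False, True], repeat=len(hyps))),
           key=sat_weight)
print(best)
```

The maximizing assignment accepts Spouse(Victoria, David), rejects Spouse(Rebecca, David), and rejects the hypothesis that "dating" expresses Spouse: exactly the kind of joint conclusion about facts and patterns that the slides describe.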
POS-lifted n-gram itemsets as patterns:
X { PRP ADJ advisor } Y
X { his doctoral research, under the supervision of } Y
X { PRP doctoral research, IN DET supervision of } Y
Using noisy patterns loses precision & slows down MaxSat.
Confidence weights, using seeds and counter-seeds:
seeds: (MosheVardi, CatrielBeeri), (JimGray, MikeHarrison)
counter-seeds: (MosheVardi, RonFagin), (AlonHalevy, LarryPage)
confidence of pattern p ~ (# occurrences of p with seeds) / (# occurrences of p with counter-seeds)

Outline
• Part II – Extracting Knowledge
  • Pattern-based Extraction ✔
  • Consistency Reasoning ✔
  • Higher-arity Relations: Space & Time

Higher-arity Relations: Space & Time
• YAGO-2 Preview
                            Just Wikipedia    Incl. Gazetteer Data
#Relations                  86                92
#Classes                    563,374           563,997
#Entities                   2,639,853         9,819,683
#Facts                      495,770,281       996,329,323
 - basic relations          20,937,244        61,188,706
 - types & classes          8,664,129         181,977,830
 - space, time & proven.    466,168,908       753,162,787
Size (CSV format)           23.4 GB           37 GB
estimated precision > 95% (for basic relations excl. space, time & provenance)
www.mpi-inf.mpg.de/yago-naga/

French Marriage Problem (Revisited)
Facts in KB:
1: married(Hillary, Bill)
2: married(Carla, Nicolas); validFrom(2, 2008)
3: married(Angelina, Brad)
New fact candidates:
4: married(Cecilia, Nicolas); validFrom(4, 1996); validUntil(4, 2007)
5: married(Carla, Benjamin); validFrom(5, 2010)
6: married(Carla, Mick); validFrom(6, 2006)
7: divorced(Madonna, Guy); validFrom(7, 2008)
8: domPartner(Angelina, Brad)

Challenge: Temporal Knowledge Harvesting
For all people in Wikipedia (100,000s), gather all spouses, incl. divorced & widowed, and the corresponding time periods!
>95% accuracy, >95% coverage, in one night.
Consistency constraints are potentially helpful:
• functional dependencies: {husband, time} → {wife, time}
• inclusion dependencies: marriedPerson ⊆ adultPerson
• age/time/gender restrictions: birthdate + Δ < marriage < divorce

Difficult Dating
explicit dates vs. implicit dates, relative to other dates
(Even More Difficult) Implicit Dating
vague dates, relative dates, narrative text, relative order

TARSQI: Extracting Time Annotations
http://www.timeml.org/site/tarsqi/ [Verhagen et al.: ACL'05]
(caution: the automatic annotations contain extraction errors!)
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A prodemocracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory's next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong's people approve of the job he has been doing.
It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.

13 Relations between Time Intervals [Allen, 1984; Allen & Hayes, 1989]
A Before B / B After A
A Meets B / B MetBy A
A Overlaps B / B OverlappedBy A
A Starts B / B StartedBy A
A During B / B Contains A
A Finishes B / B FinishedBy A
A Equal B

Possible Worlds in Time [Wang, Yahya, Theobald: MUD Workshop '10]
Base facts as state relations with uncertain time intervals: playsFor(Beckham, Real) over '03–'07 and playsFor(Ronaldo, Real) over '00–'07, with per-bin confidences (independent or non-independent bins; the figure shows derived bins with probabilities such as 0.36, 0.16, 0.12, 0.08).
Derived facts: teamMates(Beckham, Ronaldo) ⇐ playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2)
• Closed and complete representation model (incl. lineage) – Stanford Trio project [Widom: CIDR'05; Benjelloun et al.: VLDB'06]
• Interval representation remains linear in the number of bins
• Confidence computation per bin is #P-complete
• In general requires possible-worlds-based sampling techniques (Gibbs-style sampling, Luby-Karp, etc.)
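Allen's interval relations listed above can be classified mechanically. A minimal sketch for intervals given as (start, end) pairs with start < end; the seven inverse relations are named with the slide's "-by"/"after" convention:

```python
def allen_relation(a, b):
    """Return which of Allen's 13 interval relations holds between a and b."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:  return "before"
    if b2 < a1:  return "after"
    if a2 == b1: return "meets"
    if b2 == a1: return "met-by"
    if a1 == b1 and a2 == b2: return "equal"
    if a1 == b1: return "starts" if a2 < b2 else "started-by"
    if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2: return "during"
    if a1 < b1 and b2 < a2: return "contains"
    # proper partial overlap; order of start points decides the direction
    return "overlaps" if a1 < b1 else "overlapped-by"

# e.g. overlapping playsFor intervals support a derived teamMates fact
print(allen_relation((2003, 2007), (2000, 2005)))  # overlapped-by
```

For the temporal-harvesting use case above, any of overlaps, during, contains, starts, finishes (and their inverses) or equal implies a non-empty intersection of the two validity intervals.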
Open Problems and Challenges in IE (I)
High precision & high recall at affordable cost: robust pattern analysis & reasoning; parallel processing, lazy / lifted inference, …
Types and constraints: soft rules & hard constraints, rich DL, beyond CWA; explore & understand different families of constraints
Declarative, self-optimizing workflows: incorporate pattern & reasoning steps into IE queries/programs
Scale, dynamics, life-cycle: grow & maintain a KB with near-human quality over long periods
Open-domain knowledge harvesting: turn names, phrases & table cells into entities & relations

Open Problems and Challenges in IE (II)
Temporal Querying (Revived): query language (T-SPARQL?), no schema; confidence weights & ranking
Gathering Implicit and Relative Time Annotations: biographies & news, relative orderings; aggregate & reconcile observations
Incomplete and Uncertain Temporal Scopes: incorrect, incomplete, unknown begin/end; vague dating
Consistency Reasoning: extended MaxSat, extended Datalog, probabilistic graphical models, etc., for resolving inconsistencies on uncertain facts & uncertain time

Outline
• Part II – Extracting Knowledge
  • Pattern-based Extraction ✔
  • Consistency Reasoning ✔
  • Higher-arity Relations: Space & Time ✔

References for Part II
• E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, A. Voskoboynik. Snowball: a prototype system for extracting relations from large text collections. SIGMOD, 2001.
• James Allen. Towards a general theory of action and time. Artif. Intell., 23(2), 1984.
• M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni. Open information extraction from the web. IJCAI, 2007.
• R. Baumgartner, S. Flesca, G. Gottlob. Visual web information extraction with Lixto. VLDB, 2001.
• S. Brin. Extracting patterns and relations from the World Wide Web. WebDB, 1998.
• M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1), 2008.
• M. E. Califf, R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI, 1999.
• P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan. DBLife: A community information management platform for the database research community. CIDR, 2007.
• A. Doan, L. Gravano, R. Ramakrishnan, S. Vaithyanathan (Eds.). Special issue on information extraction. SIGMOD Record, 37(4), 2008.
• O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates. Web-scale information extraction in KnowItAll. WWW, 2004.
• G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, S. Flesca. The Lixto data extraction project – back and forth between theory and practice. PODS, 2004.
• R. Gupta, S. Sarawagi. Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB, 2(1), 2009.
• M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. COLING, 1992.
• D. Hindle. Noun classification from predicate-argument structures. ACL, 1990.
• R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, S. Vaithyanathan, H. Zhu. SystemT: a system for declarative information extraction. SIGMOD Record, 37(4), 2008.
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti. Collective Annotation of Wikipedia Entities in Web Text. KDD, 2009.
• N. Kushmerick. Wrapper induction: efficiency and expressiveness. Artif. Intell., 118(1-2), 2000.
• J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML, 2001.
• X. Liu, Z. Nie, N. Yu, J.-R. Wen. BioSnowball: automated population of Wikis. KDD, 2010.
• A. McCallum, K. Schultz, S. Singh. FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs. NIPS, 2009.
• N. Nakashole, M. Theobald, G. Weikum. Find your Advisor: Robust Knowledge Gathering from the Web. WebDB, 2010.
• L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
• M. Richardson, P. Domingos. Markov Logic Networks. Machine Learning, 2006.
• D. Roth, W. Yih. Global Inference for Entity and Relation Identification via a Linear Programming Formulation. MIT Press, 2007.
• S. Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3), 2008.
• S. Sarawagi, W. W. Cohen. Semi-Markov conditional random fields for information extraction. NIPS, 2004.
• W. Shen, X. Li, A. Doan. Constraint-Based Entity Matching. AAAI, 2005.
• P. Singla, P. Domingos. Entity resolution with Markov Logic. ICDM, 2006.
• F. M. Suchanek, M. Sozio, G. Weikum. SOFIE: a self-organizing framework for information extraction. WWW, 2009.
• F. M. Suchanek, G. Ifrim, G. Weikum. Combining linguistic and statistical analysis to extract relations from web documents. KDD, 2006.
• C. Sutton, A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.
• R. C. Wang, W. W. Cohen. Language-independent set expansion of named entities using the web. ICDM, 2007.
• Y. Wang, M. Yahya, M. Theobald. Time-aware Reasoning in Uncertain Knowledge Bases. VLDB/MUD, 2010.
• D. S. Weld, R. Hoffmann, F. Wu. Using Wikipedia to bootstrap open information extraction. SIGMOD Record, 37(4), 2008.
• F. Wu, D. S. Weld. Autonomously semantifying Wikipedia. CIKM, 2007.
• F. Wu, D. S. Weld. Automatically refining the Wikipedia infobox ontology. WWW, 2008.
• A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland. TextRunner: Open information extraction on the web. HLT-NAACL, 2007.
• J. Zhu, Z. Nie, X. Liu, B. Zhang, J.-R. Wen. StatSnowball: a statistical approach to extracting entity relationships. WWW, 2009.
Outline
• Part I – What and Why ✔ – Available Knowledge Bases ✔
• Part II – Extracting Knowledge ✔
• Part III – Ranking and Searching
• Part IV – Conclusion and Outlook

Outline for Part III
• Part III.1: Querying Knowledge Bases – A short overview of SPARQL – Extensions to SPARQL
• Part III.2: Searching and Ranking Entities
• Part III.3: Searching and Ranking Facts

SPARQL
• Query language for RDF from the W3C
• Main component: select-project-join combination of triple patterns → graph pattern queries on the knowledge base

SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
[example graph: Mike_Myers and Jim_Carrey are actors (Jim_Carrey also a vegetarian), born in Scarborough and Newmarket, both located in Ontario, which is located in Canada; Albert_Einstein (physicist, vegetarian) and Otto_Hahn (chemist) are scientists, born in Ulm and Frankfurt, located in Germany, which is located in Europe]
SELECT ?person WHERE {
  ?person isA actor .
  ?person bornIn ?loc .
  ?loc locatedIn Ontario . }
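The graph-pattern semantics of the example query can be mimicked with a naive nested-loop join over a toy triple store. Illustrative only; real SPARQL engines use indexes and join reordering:

```python
# Toy triple store mirroring the slide's example graph.
triples = [
    ("Mike_Myers", "isA", "actor"), ("Jim_Carrey", "isA", "actor"),
    ("Mike_Myers", "bornIn", "Scarborough"), ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Scarborough", "locatedIn", "Ontario"), ("Newmarket", "locatedIn", "Ontario"),
    ("Albert_Einstein", "isA", "physicist"), ("Albert_Einstein", "bornIn", "Ulm"),
    ("Ulm", "locatedIn", "Germany"), ("Ontario", "locatedIn", "Canada"),
]

def match(pattern, binding):
    """Yield bindings extending `binding` that match one triple pattern;
    components starting with '?' are variables."""
    for triple in triples:
        b = dict(binding)
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                if b.setdefault(pat, val) != val:   # conflict with earlier binding
                    break
            elif pat != val:                        # constant mismatch
                break
        else:
            yield b

def select(patterns):
    """Join all triple patterns by threading variable bindings through them."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings

query = [("?person", "isA", "actor"),
         ("?person", "bornIn", "?loc"),
         ("?loc", "locatedIn", "Ontario")]
print(sorted(b["?person"] for b in select(query)))  # ['Jim_Carrey', 'Mike_Myers']
```

Each triple pattern acts as a selection, and shared variables act as join conditions, which is exactly the "select-project-join" reading of SPARQL given above.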
Find subgraphs of this form: constants (actor, Ontario) and variables (?person, ?loc) — the pattern matches Mike_Myers (born in Scarborough) and Jim_Carrey (born in Newmarket), both locations being located in Ontario.

SPARQL – More Features
• Eliminate duplicates in results:
SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn ?c}
• Return results in some order:
SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario} ORDER BY DESC(?person)
with optional LIMIT n clause
• Optional matches and filters on bound variables:
SELECT ?person WHERE {?person isA actor. OPTIONAL{?person bornIn ?loc}. FILTER (!BOUND(?loc))}
• More operators: ASK, DESCRIBE, CONSTRUCT

SPARQL: Extensions from W3C
W3C SPARQL 1.1 draft:
• Aggregations (COUNT, AVG, …)
• Subqueries
• Negation: syntactic sugar for OPTIONAL {?x … } FILTER(!BOUND(?x))

SPARQL: Extensions from Research (1)
More complex graph patterns:
• Transitive paths [Anyanwu et al., WWW07]:
SELECT ?p, ?c WHERE {
  ?p isA scientist . ?p ??r ?c . ?c isA Country . ?c locatedIn Europe .
  PathFilter(cost(??r) < 5) . PathFilter(containsAny(??r, ?t)) . ?t isA City . }
• Regular expressions [Kasneci et al., ICDE08]:
SELECT ?p, ?c WHERE { ?p isA ?s. ?s isA scientist.
?p (bornIn | livesIn | citizenOf) locatedIn* Europe . }

SPARQL: Extensions from Research (2)
Queries over federated RDF sources:
• Determine distribution of triple patterns as part of the query (for example in ARQ from Jena)
• Automatically route triple predicates to useful sources — potentially requires mapping of identifiers from different sources

RDF+SPARQL: Systems
• BigOWLIM
• OpenLink Virtuoso
• Jena with different backends
• Sesame
• OntoBroker
• SW-Store, Hexastore, RDF-3X (no reasoning)
System deployments with >10^11 triples (see http://esw.w3.org/LargeTripleStores)

Outline for Part III
• Part III.1: Querying Knowledge Bases
• Part III.2: Searching and Ranking Entities – Entity Importance: Graph Analysis – Entity Search: Language Models
• Part III.3: Searching and Ranking Facts

Why ranking is essential
• Queries often have a huge number of results:
– scientists from Canada
– conferences in Toronto
– publications in databases
– actors from the U.S.
• Ranking as integral part of search
• Huge number of app-specific ranking methods: paper/citation count, impact, salary, …
• Need for generic ranking

Extending Entities with Keywords
Remember: entities occur in facts in documents → associate entities with the terms in those documents.
[figure: Guido Westerwelle associated with terms such as chancellor, Germany, scientist, election, Stuttgart21; Nicolas Sarkozy with France, …]

Digression 1: Graph Authority Measures
Idea: incoming links are endorsements & increase page authority; authority is higher if links come from high-authority pages.
PR(q) = ε/|V| + (1−ε) · Σ_{(p,q)∈E} PR(p)/outdeg(p)
Authority(page q) = stationary probability of visiting q
Random walk: uniformly random choice of links + random jumps

Graph-Based Entity Importance
Combine several paradigms:
• Keyword search on associated terms to determine candidate entities
• PageRank or a similar measure to determine important entities
• Ranking can combine entity rank with keyword-based score

Digression 2: Language Models (LMs)
State-of-the-art model in text retrieval.
[figure: documents d1, d2 with language models LM(θ1), LM(θ2); which one "generated" query q?]
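The PageRank formula from Digression 1 can be sketched as power iteration on a toy graph (ε = 0.15; dangling nodes are ignored for brevity):

```python
def pagerank(edges, eps=0.15, iters=50):
    """Power iteration for PR(q) = eps/|V| + (1-eps) * sum over in-links
    (p,q) of PR(p)/outdeg(p). Assumes every node has at least one out-edge."""
    nodes = {n for e in edges for n in e}
    pr = {n: 1.0 / len(nodes) for n in nodes}          # uniform start
    outdeg = {n: sum(1 for p, _ in edges if p == n) for n in nodes}
    for _ in range(iters):
        nxt = {n: eps / len(nodes) for n in nodes}     # random-jump mass
        for p, q in edges:                             # endorsement mass
            nxt[q] += (1 - eps) * pr[p] / outdeg[p]
        pr = nxt
    return pr

# toy graph: both a and b endorse c, c endorses b
pr = pagerank([("a", "c"), ("b", "c"), ("c", "b")])
print(max(pr, key=pr.get))  # 'c' collects the most endorsements
```

Applied to an entity-relationship graph instead of a web graph, the same iteration yields the generic entity-importance score described above.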
• each document di has an LM: a generative probability distribution of terms with parameters θi
• query q is viewed as a sample from LM(θ1), LM(θ2), …
• estimate the likelihood P[q | LM(θi)] that q is a sample of the LM of document di (q is "generated by" di)
• rank by descending likelihoods (best "explanation" of q)

Language Models for Text: Example
[figure: document d (terms A A A A B B C C C D E E E E E) is a sample of model M, used for parameter estimation; estimate the likelihood of observing the query: P[A A B C E | M]]

Language Models for Text: Smoothing
[figure: document d plus a background corpus and/or smoothing used for parameter estimation; estimate the likelihood of observing the query: P[A B C E F | M]]
Variants: Laplace smoothing, Jelinek-Mercer, Dirichlet smoothing, …

Some LM Basics
s(d,q) = P[q|d] = Π_i P[q_i | d]   (independence assumption)
  ≈ Σ_i log( tf(q_i,d) / Σ_k tf(k,d) )   ← simple MLE: overfitting
s(d,q) = λ·P[q|d] + (1−λ)·P[q]   ← mixture model for smoothing; P[q] estimated from a log or corpus
  ≈ Σ_i log( λ·tf(q_i,d)/Σ_k tf(k,d) + (1−λ)·df(q_i)/Σ_k df(k) )
  ≈ Σ_i log( 1 + (λ/(1−λ)) · (tf(q_i,d)/Σ_k tf(k,d)) · (Σ_k df(k)/df(q_i)) )
rank by ascending "improbability":
KL(q|d) = Σ_i P[i|q] · log( P[i|q] / P[i|d] )   ← KL divergence (Kullback-Leibler), aka relative entropy

Entity Search with LM Ranking
query: keywords; answer: entities
LM(entity e) = probability distribution of words seen in the context of e (weighted by confidence)
s(e,q) = λ·P[q|e] + (1−λ)·P[q], or equivalently rank by KL(LM(q) | LM(e))
query q: "French player who won world championship"
candidate entities: e1: David Beckham, e2: Ruud van Nistelrooy, e3: Ronaldinho, e4: Zinedine Zidane, e5: FC Barcelona
context of e1: played for ManU, Real, LA Galaxy; David Beckham; champions league; England; lost match against France; married to spice girl; …
context of e4: Zizou; champions league 2002; Real Madrid; won final; …
more context of e4: Zinedine Zidane; best player; France; world cup 1998; … [Z. Nie et al.: WWW'07]

Outline for Part III
• Part III.1: Querying Knowledge Bases
• Part III.2: Searching and Ranking Entities
• Part III.3: Searching and Ranking Facts – General ranking issues – NAGA-style ranking – Language Models for facts

What makes a fact "good"?
Confidence: prefer results that are likely correct — accuracy of info extraction, trust in sources (authenticity, authority)
  e.g. bornIn(Jim Gray, San Francisco) from "Jim Gray was born in San Francisco" (en.wikipedia.org)
  vs. livesIn(Michael Jackson, Tibet) from "Fans believe Jacko hides in Tibet" (www.michaeljacksonsightings.com)
Informativeness: prefer results with salient facts — statistical estimation from frequency in answer, frequency on the Web, frequency in query logs
  e.g. q: Einstein isa ? → prefer Einstein isa scientist over Einstein isa vegetarian;
  q: ?x isa vegetarian → prefer Einstein isa vegetarian over Whocares isa vegetarian
Diversity: prefer a variety of facts
  e.g. for Einstein: E won …, E discovered …, E played … rather than repeated E won … facts
Conciseness: prefer results that are tightly connected — size of answer graph, cost of Steiner tree
  [figure: sample facts — Einstein won NobelPrize, Bohr won NobelPrize, Einstein isa vegetarian, Cruise isa vegetarian, Cruise born 1962, Bohr died 1962]

How can we implement this?
Confidence: empirical accuracy of IE; PR/HITS-style estimate of trust; combine into: max { accuracy(f,s) · trust(s) | s ∈ witnesses(f) }; PR/HITS-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …]
Informativeness: IR models: tf·idf … [K. Chang et al., …]; Statistical Language Models
Diversity: Statistical Language Models
Conciseness: graph algorithms (BANKS, STAR, …) [J.X.
Yu et al., S. Chakrabarti et al., B. Kimelfeld et al., A. Markovetz et al., B. C. Ooi et al., G. Kasneci et al., …]

LMs: From Entities to Facts
Document / entity LMs: the LM for a doc/entity is a probability distribution of words; the LM for a query is a (probability distribution of) words. LMs are rich for docs/entities but super-sparse for queries → richer query LM via query expansion, etc.
Triple LMs: the LM for facts is a (degenerate probability distribution of a) triple; the LM for queries is a (degenerate probability distribution of a) triple pattern → apples and oranges. Therefore:
• expand query variables by S, P, O values from the DB/KB
• enhance with witness statistics
• the query LM then is a probability distribution of triples!

LMs for Triples and Triple Patterns
triples (facts f), with witness counts (total 2600):
f1: Beckham p ManchesterU (200), f2: Beckham p RealMadrid (300), f3: Beckham p LAGalaxy (20), f4: Beckham p ACMilan (30), f5: Kaka p ACMilan (300), f6: Kaka p RealMadrid (150), f7: Zidane p ASCannes (20), f8: Zidane p Juventus (200), f9: Zidane p RealMadrid (350), f10: Tidjani p ASCannes (10), f11: Messi p FCBarcelona (400), f12: Henry p Arsenal (200), f13: Henry p FCBarcelona (150), f14: Ribery p BayernMunich (100), f15: Drogba p Chelsea (150), f16: Casillas p RealMadrid (20)
triple patterns (queries q):
q: Beckham p ?y → LM(q) + smoothing: Beckham p ManU 200/550, Beckham p Real 300/550, Beckham p Galaxy 20/550, Beckham p Milan 30/550
q: ?x p ASCannes → Zidane p ASCannes 20/30, Tidjani p ASCannes 10/30
q: ?x p ?y → Messi p FCBarcelona 400/2600, Zidane p RealMadrid 350/2600, Kaka p ACMilan 300/2600, …
q: Cruyff ?r FCBarcelona → Cruyff playedFor FCBarca 200/500, Cruyff playedAgainst FCBarca 50/500, Cruyff coached FCBarca 250/500
LM(q): { t → P[t | t matches q] ~ #witnesses(t) }
LM(answer f): { t → P[t | t matches f] ~ 1 for f }
smooth all LMs; rank results by ascending KL(LM(q) | LM(f))

LMs for Composite Queries
q: Select ?x,?c Where {?x bornIn France
. ?x playsFor ?c . ?c in UK . }
P[Henry bI F, Henry p Arsenal, Arsenal in UK] ~ (200/650) · (200/2600) · (160/500)
P[Drogba bI F, Drogba p Chelsea, Chelsea in UK] ~ (30/650) · (150/2600) · (140/500)
queries q with subqueries q1 … qn; results are n-tuples of triples t1 … tn
LM(q): P[q1 … qn] = Π_i P[qi]
LM(answer): P[t1 … tn] = Π_i P[ti]
KL(LM(q) | LM(answer)) = Σ_i KL(LM(qi) | LM(ti))
born-in facts: f21: Zidane bI F (200), f22: Tidjani bI F (20), f23: Henry bI F (200), f24: Ribery bI F (200), f25: Drogba bI F (30), f26: Drogba bI IC (100), f27: Zidane bI ALG (50)
playsFor facts: f1: Beckham p ManU (200), f7: Zidane p ASCannes (20), f8: Zidane p Juventus (200), f9: Zidane p RealMadrid (300), f12: Henry p Arsenal (200), f13: Henry p FCBarca (150), f14: Ribery p Bayern (100), f15: Drogba p Chelsea (150)
located-in facts: f31: ManU in UK (200), f32: Arsenal in UK (160), f33: Chelsea in UK (140)

Extensions: Keywords
Problem: not everything is triplified
• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQuery Full-Text)
Semantics: triples match the structural predicate, witnesses match the text predicate.
European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces?
Select ?p Where {
  ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe .
  ?p hasWon ?a . ?a Name AcademyAward .
  ?p contributedTo ?movie [western, gunfight, duel, sunset] .
  ?p composed ?music [classical, orchestra, cantata, opera] . }

Extensions: Keywords (cont.)
Grouping of keywords or phrases boosts expressiveness.
French politicians married to Italian singers?
Select ?p1, ?p2 Where {
  ?p1 instanceOf ?c1 [France, politics] .
  ?p2 instanceOf ?c2 [Italy, singer] .
  ?p1 marriedTo ?p2 . }
CS researchers whose advisors worked on the Manhattan project?
Select ?r, ?a Where {
  ?r instOf researcher ["computer science"] .
  ?a workedOn ?x ["Manhattan project"] .
  ?r hasAdvisor ?a . }
or, with relaxed predicates and classes:
Select ?r, ?a Where {
  ?r ?p1 ?o1 ["computer science"] .
  ?a ?p2 ?o2 ["Manhattan project"] .
  ?r ?p3 ?a . }

LMs for Keyword-Augmented Queries
q: Select ?x, ?c Where {
  France ml ?x [goalgetter, "top scorer"] .
  ?x p ?c .
  ?c in UK [champion, "cup winner", double] . }
subqueries qi with keywords w1 … wm; results are still n-tuples of triples ti
LM(qi): P[triple ti | w1 … wm] = λ · Σ_k P[ti | wk] + (1−λ) · P[ti]
LM(answer fi) analogous
KL(LM(q) | LM(answer)) = Σ_i KL(LM(qi) | LM(fi))
The result ranking prefers (n-tuples of) triples whose witnesses score high on the subquery keywords.

Extensions: Query Relaxation
q: … Where {?x bornIn F . ?x p ?c . ?c in UK . }
relaxed variants, e.g.:
q(2): … Where {?x bornIn IC . ?x p ?c . ?c in UK . }   (replace an entity by a related one)
q(4): … Where {?x bornIn ?b . ?x p ?c . ?c in UK . }   (replace an entity by a variable)
relaxed answers include: [Drogba bI IC, Drogba p Chelsea, Chelsea in UK], [Drogba resOf F, Drogba p Chelsea, Chelsea in UK], [Zidane bI F, Zidane p Real, Real in ESP]
LM(q*) = λ0·LM(q) + λ1·LM(q(1)) + λ2·LM(q(2)) + …
• replace entity e in q by e(i) in q(i): precompute P := LM(e ?p ?o) and Q := LM(e(i) ?p ?o); set λi ~ ½·(KL(P|Q) + KL(Q|P))
• replace relation r in q by r(i) in q(i): use LM(?s r(i) ?o)
• replace entity e in q by a variable ?x in q(i): use LM(?x r ?o)
LMs of entities, relations, … are probability distributions of triples!

Extensions: Diversification
q: Select ?p, ?c Where { ?p isa SoccerPlayer . ?p playedFor ?c .
}
Without diversification (top-10): 1 Beckham, ManchesterU; 2 Beckham, RealMadrid; 3 Beckham, LAGalaxy; 4 Beckham, ACMilan; 5 Zidane, RealMadrid; 6 Kaka, RealMadrid; 7 Cristiano Ronaldo, RealMadrid; 8 Raul, RealMadrid; 9 van Nistelrooy, RealMadrid; 10 Casillas, RealMadrid
With diversification: 1 Beckham, ManchesterU; 2 Beckham, RealMadrid; 3 Zidane, RealMadrid; 4 Kaka, ACMilan; 5 Cristiano Ronaldo, ManchesterU; 6 Messi, FCBarcelona; 7 Henry, Arsenal; 8 Ribery, BayernMunich; 9 Drogba, Chelsea; 10 Luis Figo, Sporting Lisbon
Rank results f1 … fk by ascending λ·KL(LM(q) | LM(fi)) − (1−λ)·KL(LM(fi) | LM({f1..fk}\{fi}))
Implemented by greedy re-ranking of the fi's in a candidate pool.

Searching and Ranking – Summary
• Don't re-invent the wheel: LMs are an elegant and expressive means for ranking; consider both data & workload statistics
• Extensions should be conceptually simple: informativeness, personalization, relaxation, diversity can all be captured in the same framework
• Unified ranking model for a complete query language: still work to do

References for Part III
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
• SPARQL New Features and Rationale, W3C Working Draft, 2 July 2009, http://www.w3.org/TR/2009/WD-sparql-features-20090702/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW, 2007.
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002.
• Soumen Chakrabarti: Dynamic personalized PageRank in entity-relation graphs. WWW, 2007.
• Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007.
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs.
CIKM, 2009
• Djoerd Hiemstra: Language Models. Encyclopedia of Database Systems, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Mounia Lalmas: XML Retrieval. Morgan & Claypool Publishers, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
• Zaiqing Nie, Yunxiao Ma, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma: Web object retrieval. WWW Conference, 2007
• Desislava Petkova, W. Bruce Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI, 2006
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases by web services. SIGMOD Conference, 2010
• Pavel Serdyukov, Djoerd Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR, 2008
• ChengXiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008

Outline
• Part I – What and Why ✔ – Available Knowledge Bases ✔
• Part II – Extracting Knowledge ✔
• Part III – Ranking and Searching ✔
• Part IV – Conclusion and Outlook

But back to the original question...
Will there ever be a famous singer called Elvis again?
?x hasGivenName "Elvis" .
?x type singer .

But back to the original question...
http://mpii.de/yago
We found him!
?x = Elvis_Costello
?singer = wordnet_singer_110599806
?d = 1954-08-25
Can we find out more about this guy?

But back to the original question...
http://mpii.de/yago
Alright, and even more?
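The two-pattern query behind this search (?x hasGivenName "Elvis" . ?x type singer) is evaluated by joining triple patterns over a triple store. Here is a minimal sketch in Python; the toy triples and the nested-loop engine are illustrative, not YAGO's actual data or implementation:

```python
# Minimal conjunctive triple-pattern matching, in the style of a
# SPARQL engine. The toy triples below are illustrative only.
TRIPLES = [
    ("Elvis_Presley",  "hasGivenName", "Elvis"),
    ("Elvis_Presley",  "type",         "singer"),
    ("Elvis_Costello", "hasGivenName", "Elvis"),
    ("Elvis_Costello", "type",         "singer"),
    ("Elvis_Costello", "bornOnDate",   "1954-08-25"),
]

def match(pattern, bindings):
    """Yield extended variable bindings for one triple pattern.
    Variables are strings starting with '?'."""
    for triple in TRIPLES:
        b = dict(bindings)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if b.get(term, value) != value:   # clash with earlier binding
                    ok = False
                    break
                b[term] = value
            elif term != value:                   # constant mismatch
                ok = False
                break
        if ok:
            yield b

def query(patterns):
    """Evaluate a conjunction of triple patterns via nested-loop join."""
    results = [{}]
    for pattern in patterns:
        results = [b2 for b in results for b2 in match(pattern, b)]
    return results

answers = query([("?x", "hasGivenName", "Elvis"),
                 ("?x", "type", "singer")])
print(sorted(b["?x"] for b in answers))
# prints ['Elvis_Costello', 'Elvis_Presley']
```

Real engines replace the nested-loop join with index lookups over sorted permutations of (subject, predicate, object), as in RDF-3X cited above.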
Linking Open Data: Goal
(figure: two small ontologies, Costellopedia with the fact "Elvis plays guitar" and YAGO with "Elvis born 1954")
Can we combine knowledge from different sources?

Linking Open Data: URIs
1. Define a name space: http://dbpedia.org/resource/ and http://costello.org/
2. Define entity names in that name space: http://dbpedia.org/resource/ElvisCostello and http://costello.org/Elvis
Every entity has a worldwide unique identifier (a Uniform Resource Identifier, URI). There is a W3C standard for that. [W3C URI]

Linking Open Data: Cool URIs
1. Define a name space
2. Define entity names in that name space
3. Make them accessible online: a client dereferences http://costello.org/Elvis and the server answers with the entity's data (e.g., born 1954)
There is a W3C description for that. [W3C CoolURI]

Linking Open Data: Links
1. Define a name space
2. Define entity names in that name space
3. Make them accessible online
4. Define equivalence links
This is an entity resolution problem. Use
• similar identifiers
• similar labels (names)
• keys (e.g., the ISBN)
• common properties
Goal of the W3C group [Bizer JSWIS 2009]

Linking Open Data: Status
Currently (2010):
• 200 ontologies
• 25 billion triples
• 400 million links
http://richard.cyganiak.de/2007/10/lod/imagemap.html

Querying Semantic Data
Sindice is an index for the Semantic Web developed at DERI in Galway, Ireland. http://sindice.com
Sindice exploits
• RDF dumps available on the Web
• RDF information embedded into HTML pages
• RDF data available by cool URIs
• inter-ontology links
[Tummarello ISWC 2007]

Querying Semantic Data
... far from perfect... but far from useless...
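Step 4 above (defining equivalence links) boils down to entity resolution. A minimal sketch, assuming shared key properties and label similarity as the matching signals; the entity records, the "born" key, and the similarity threshold are all illustrative, not part of any actual LOD toolchain:

```python
# Sketch: generate owl:sameAs links between two sources by comparing
# a shared key property (here: birth date) and label similarity.
# All entity data and the threshold are illustrative assumptions.
from difflib import SequenceMatcher

DBPEDIA = {
    "http://dbpedia.org/resource/ElvisCostello":
        {"label": "Elvis Costello", "born": "1954-08-25"},
}
COSTELLO_ORG = {
    "http://costello.org/Elvis":
        {"label": "Elvis Costello", "born": "1954-08-25"},
    "http://costello.org/Guitar":
        {"label": "guitar", "born": None},
}

def same_entity(a, b, threshold=0.9):
    """Match if a key property agrees, or if labels are near-identical."""
    if a["born"] and a["born"] == b["born"]:
        return True
    sim = SequenceMatcher(None, a["label"].lower(), b["label"].lower()).ratio()
    return sim >= threshold

links = [(u1, "owl:sameAs", u2)
         for u1, e1 in DBPEDIA.items()
         for u2, e2 in COSTELLO_ORG.items()
         if same_entity(e1, e2)]
for link in links:
    print(link)   # one sameAs link joining the two Elvis Costello URIs
```

The same pattern scales up with blocking (comparing only candidate pairs that share a token or key) instead of the quadratic double loop shown here.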
Conclusion
• We have seen the knowledge representation model of ontologies, RDF. In a nutshell, RDF is a kind of distributed entity-relationship model
• We have seen numerous existing knowledge bases: manually constructed (Cyc and WordNet) and automatically constructed (YAGO, DBpedia, Freebase, TrueKnowledge, etc.)
• We have seen techniques for creating such knowledge bases (pattern-based extraction and reasoning-based extraction, with uncertainty)
• We have seen techniques for querying and ranking the knowledge (by SPARQL and language-model-based ranking)
• We have seen that many knowledge bases already exist and that there is ongoing work to interlink them
• We have seen that there is indeed a promising singer called Elvis

The End
The slides are available at http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/
Feel free to contact us with further questions:
• Hady Lauw, Institute for Infocomm Research, Singapore, http://hadylauw.com
• Fabian M. Suchanek, INRIA Saclay, Paris, http://suchanek.name
• Martin Theobald, Max Planck Institute for Informatics, Saarbrücken, http://mpii.de/~mtb
• Ralf Schenkel, Saarland University, http://people.mmci.uni-saarland.de/~schenkel/

References for Part IV
• [W3C URI] W3C: "Architecture of the World Wide Web, Volume One". W3C Recommendation, 15 December 2004, http://www.w3.org/TR/webarch/
• [W3C CoolURI] W3C: "Cool URIs for the Semantic Web". W3C Interest Group Note, 3 December 2008, http://www.w3.org/TR/cooluris/
• [Bizer JSWIS 2009] C. Bizer, T. Heath, T. Berners-Lee: "Linked data – the story so far". International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009
• [Tummarello ISWC 2007] G. Tummarello, R. Delbru, E. Oren: "Sindice.com: Weaving the Open Linked Data". ISWC/ASWC 2007