10 Years of Probabilistic Querying – What Next?
Martin Theobald, University of Antwerp
Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig, Andre Melo, Iris Miliaraki, Luc de Raedt, Mauro Sozio, and Fabian Suchanek

“The important thing is to not stop questioning ... One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day.”
– Albert Einstein, 1936
(cf. “The Marvelous Structure of Reality”, Joseph M. Hellerstein, Keynote at WebDB 2003, San Diego)

Look, There is Structure!
C1: Text is not just “unstructured data”: there is a plethora of natural-language-processing techniques & tools:
- Part-of-Speech (POS) Tagging
- Named-Entity Recognition & Disambiguation (NERD)
- Dependency Parsing
- Semantic Role Labeling
But:
- Even the best NLP tools frequently yield errors.
- Facts found on the Web are logically inconsistent.
- Web-extracted knowledge bases are inherently incomplete.

Information Extraction
- YAGO/DBpedia et al.: high-confidence facts, e.g. bornOn(Jeff, 09/22/42), gradFrom(Jeff, Columbia), hasAdvisor(Jeff, Arthur), hasAdvisor(Surajit, Jeff), knownFor(Jeff, Theory); >120 M facts for YAGO2 (mostly from Wikipedia infoboxes).
- New fact candidates with confidences, e.g. type(Jeff, Author)[0.9], author(Jeff, Drag_Book)[0.8], author(Jeff, Cind_Book)[0.6], worksAt(Jeff, Bell_Labs)[0.7], type(Jeff, CEO)[0.4]; 100’s M additional facts from Wikipedia free-text.

YAGO Knowledge Base
3 M entities, 120 M facts, 100 relations, 200k classes; accuracy ~95%.
[Figure: excerpt of the YAGO graph: a class hierarchy (Entity; Person, Location, Organization; Scientist, Politician; Biologist, Physicist; City, State, Country) with instances and facts such as Max_Planck bornOn Apr 23, 1858, diedOn Oct 4, 1947, bornIn Kiel (locatedIn Schleswig-Holstein, locatedIn Germany), hasWon Nobel_Prize, fatherOf Erwin_Planck, plus means edges linking surface strings like “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, and “Angela Dorothea Merkel” to their entities.]
http://www.mpi-inf.mpg.de/yago-naga/

Linked Open Data
As of Sept. 2011: >200 linked-data sources, >30 billion RDF triples, >400 million owl:sameAs links.
http://linkeddata.org/

Maybe Even More Importantly: Linked Vocabularies!
- LinkedData.org: instance & class links between DBpedia, WordNet, OpenCyc, GeoNames, and many more.
- Schema.org: a common vocabulary released by Google, Yahoo!, and Bing to annotate Web pages, incl. links to DBpedia.
- Micro-formats such as RDFa (W3C):

  <html xmlns="http://www.w3.org/1999/xhtml"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        version="XHTML+RDFa 1.0" xml:lang="en">
    <head>
      <title>Martin's Home Page</title>
      <base href="http://adrem.ua.ac.be/~tmartin/" />
      <meta property="dc:creator" content="Martin" />
    </head>

(Source: http://en.wikipedia.org/wiki/Linked_data)
As of Sept. 2011: >5 million owl:sameAs links between DBpedia, YAGO, and Freebase.

Application I: Enrichment of Search Results
“Recent Advances in Structured Data and the Web.” Alon Y. Halevy, Keynote at ICDE 2013, Brisbane.
Application II: Machine Reading
[Figure: free-text plot summary of “The Girl with the Dragon Tattoo”, overlaid with extracted relations such as same, owns, uncleOf, hires, enemyOf, affairWith, and headOf:]
It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short-lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth, who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill.”
Etzioni, Banko, Cafarella: Machine Reading. AAAI 2006.
Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI 2010.

Application III: Natural-Language Question Answering
- evi.com (formerly trueknowledge.com)
- wolframalpha.com: >10 trillion(!) facts, >50,000 search algorithms, >5,000 visualizations

IBM Watson: Deep Question Answering
Example Jeopardy! clues:
- William Wilkinson's “An Account of the Principalities of Wallachia and Moldavia” inspired this author's most famous novel.
- This town is known as “Sin City” & its downtown is “Glitter Gulch”.
- As of 2010, this is the only former Yugoslav republic in the EU.
- 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain.
Combines question classification & decomposition with a variety of knowledge back-ends.
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.
www.ibm.com/innovation/us/watson/index.htm
Natural-Language QA over Linked Data
Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13
http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

<question id="4" answertype="resource" aggregation="false" onlydbo="true">
  <string lang="en">Which river does the Brooklyn Bridge cross?</string>
  <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
  <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
  <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
  <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string>
  <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
  <keywords lang="en">river, cross, Brooklyn Bridge</keywords>
  <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
  <keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
  <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
  <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
  <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
  <query>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX res: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . }
  </query>
</question>

INEX Linked Data Track, CLEF 2012-13
https://inex.mmci.uni-saarland.de/tracks/lod/

<topic id="2012374" category="Politics">
  <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue>
  <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title>
  <sparql_ft>
    SELECT ?s ?s1 WHERE {
      ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
      ?s1 <http://dbpedia.org/property/successor> ?s .
      FILTER FTContains (?s, "stepped down early") .
    }
  </sparql_ft>
</topic>

Outline
- Probabilistic Databases
  - Stanford's Trio System: Data, Uncertainty & Lineage
  - Handling Uncertain RDF Data: URDF (Max-Planck-Institute/U-Antwerp)
- Probabilistic & Temporal Databases
  - Sequenced vs. Non-Sequenced Semantics
  - Interval Alignment & Probabilistic Inference
- Probabilistic Programming & Statistical Relational Learning
  - Learning “Interesting” Deduction Rules
- Summary & Challenges

Probabilistic Databases: A Panacea for All of the Aforementioned Tasks?
C2: Probabilistic databases combine first-order logic and probability theory in an elegant way:
- Declarative: queries formulated in SQL/Relational Algebra/Datalog; support for updates, transactions, etc.
- Deductive: well-studied resolution algorithms for SQL/Relational Algebra/Datalog (top-down/bottom-up), indexes, automatic query optimization.
- Scalable (?): polynomial data complexity for query evaluation (SQL), but #P-completeness for probabilistic inference.

Probabilistic Database
A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di, e.g. over the relation WorksAt(Sub, Obj):
  { (Jeff, Stanford), (Jeff, Princeton) }   p = 0.42
  { (Jeff, Stanford) }                      p = 0.18
  { (Jeff, Princeton) }                     p = 0.28
  { }                                       p = 0.12
Special cases:
(I) Tuple-independent PDB: each tuple exists independently of the others.
  WorksAt(Jeff, Stanford)    p = 0.6
  WorksAt(Jeff, Princeton)   p = 0.7
(II) Block-independent PDB: tuples within a block are mutually exclusive alternatives.
  WorksAt(Jeff, Stanford)    p = 0.6
  WorksAt(Jeff, Princeton)   p = 0.4
Note: (I) and (II) are not equivalent!
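Not from the slides: a minimal Python sketch that expands the tuple-independent table above into its possible worlds and reproduces the four-instance distribution just shown.

  from itertools import product

  # Tuple-independent relation WorksAt: each tuple exists independently
  # with probability p.
  tuples = [("Jeff", "Stanford", 0.6), ("Jeff", "Princeton", 0.7)]

  # Enumerate all 2^n possible worlds together with their probabilities.
  for bits in product([True, False], repeat=len(tuples)):
      instance = [(s, o) for (s, o, _), b in zip(tuples, bits) if b]
      prob = 1.0
      for (_, _, p), b in zip(tuples, bits):
          prob *= p if b else 1.0 - p
      print(instance, round(prob, 2))
  # Prints the four instances with probabilities 0.42, 0.18, 0.28, and 0.12,
  # exactly the distribution encoded above.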
Query Semantics (“Marginal Probabilities”)
Run query Q against each instance Di; for each answer tuple t, sum up the probabilities of all instances Di in which t exists.

Stanford Trio System
[Widom: CIDR 2005]
Uncertainty-Lineage Databases (ULDBs) combine four concepts:
1. Alternatives
2. ‘?’ (maybe) annotations
3. Confidence values
4. Lineage

Trio's Data Model
1. Alternatives: uncertainty about a value.
  Saw(witness, color, car)
  Amy: (red, Honda) ∥ (red, Toyota) ∥ (orange, Mazda)
  → three possible instances.
2. ‘?’ (maybe): uncertainty about presence.
  Amy: (red, Honda) ∥ (red, Toyota) ∥ (orange, Mazda)
  Betty: (blue, Acura) ?
  → six possible instances.
3. Confidences: weighted uncertainty.
  Amy: (red, Honda) 0.5 ∥ (red, Toyota) 0.3 ∥ (orange, Mazda) 0.2
  Betty: (blue, Acura) 0.6 ?
  → still six possible instances, each with a probability.

So Far: The Model is Not Closed
  Saw(witness, car): Cathy: (Honda) ∥ (Mazda)
  Drives(person, car): (Jimmy, Toyota) ∥ (Jimmy, Mazda); (Billy, Honda) ∥ (Frank, Honda); (Hank, Honda)
  Suspects = π_person(Saw ⋈ Drives): Jimmy ?; Billy ∥ Frank ?; Hank ?
Without lineage, this result does not correctly capture the possible instances of the query result.

Example with Lineage
  ID 11: Saw: Cathy: (Honda) ∥ (Mazda)
  ID 21: Drives: (Jimmy, Toyota) ∥ (Jimmy, Mazda)
  ID 22: Drives: (Billy, Honda) ∥ (Frank, Honda)
  ID 23: Drives: (Hank, Honda)
  Suspects = π_person(Saw ⋈ Drives):
  ID 31: Jimmy ?          λ(31) = (11,2) ∧ (21,2)
  ID 32: Billy ∥ Frank ?  λ(32,1) = (11,1) ∧ (22,1);  λ(32,2) = (11,1) ∧ (22,2)
  ID 33: Hank ?           λ(33) = (11,1) ∧ 23
With lineage, the result correctly captures the possible instances of the query result (4).

Operational Semantics
Instead of expanding Dp into its possible instances D1, …, Dn, running Q over each, and collecting the result instances D1′, …, Dm′, a direct implementation evaluates Q over the compact representation Dp and yields a compact representation Dp′ of the result. But: the data complexity of confidence computation is #P-complete!
- Closure: the result representation Dp′ always exists.
- Completeness: any (finite) set of possible instances can be represented.

Summary of Trio's Data Model
Uncertainty-Lineage Databases (ULDBs): alternatives, ‘?’ (maybe) annotations, confidence values, and lineage.
Theorem: ULDBs are closed and complete.
Properties like minimization, equivalence, approximation, and membership were formally studied based on lineage.
[Benjelloun, Das Sarma, Halevy, Widom, Theobald: VLDB Journal 2008]

Basic Complexity Issue
[Suciu & Dalvi: SIGMOD 2005 tutorial “Foundations of Probabilistic Answers to Queries”]
Theorem [Valiant 1979]: for a Boolean expression E, computing Pr(E) is #P-complete.
- NP is the class of problems of the form “is there a witness?” (e.g., SAT).
- #P is the class of problems of the form “how many witnesses?” (e.g., #SAT).
The decision problem for 2CNF is in PTIME, but the counting problem for 2CNF is already #P-complete. (We will come back to this later.)
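To make lineage-based confidence computation concrete, here is a small Python sketch of mine (not Trio's implementation). The Saw/Drives tables on the slide carry no confidence values, so the values below are hypothetical; since each lineage formula here is a conjunction of alternatives from distinct, independent tuples, the product of confidences is exact.

  # Lineage of the Suspects result: each result alternative is a conjunction
  # of (tupleID, alternative) symbols from the base tables; tuple 23 has a
  # single alternative.
  lineage = {
      "Jimmy": [(11, 2), (21, 2)],
      "Billy": [(11, 1), (22, 1)],
      "Frank": [(11, 1), (22, 2)],
      "Hank":  [(11, 1), (23, 1)],
  }

  # Hypothetical confidences for the base alternatives (assumed for
  # illustration only; the slide's Saw/Drives tables carry none).
  conf = {(11, 1): 0.5, (11, 2): 0.5, (21, 2): 0.7,
          (22, 1): 0.6, (22, 2): 0.4, (23, 1): 1.0}

  # With independent base tuples and one alternative per tuple in each
  # conjunction, the confidence of a result alternative is the product
  # over its lineage symbols.
  for suspect, atoms in lineage.items():
      p = 1.0
      for atom in atoms:
          p *= conf[atom]
      print(suspect, round(p, 2))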
…Back to Information Extraction
Extraction yields conflicting fact candidates such as bornIn(Barack, Honolulu) vs. bornIn(Barack, Kenya).

Uncertain RDF (URDF): Facts & Rules
Extensional knowledge (the “facts”):
- High-confidence facts: an existing knowledge base (“ground truth”).
- New fact candidates: extracted fact candidates with confidences.
- Linked Data & integration of various knowledge sources: ontology merging or explicitly linked facts (owl:sameAs, owl:equivalentProperty).
→ A large “probabilistic database” of RDF facts.
Intensional knowledge (the “rules”):
- Soft rules: deductive grounding & lineage (Datalog/SLD resolution).
- Hard rules: consistency constraints (more general FOL rules).
→ Propositional & probabilistic inference, at query time!

Soft Rules vs. Hard Rules
(Soft) deduction rules, cf. deductive databases: Datalog, the core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc. For example, people may live in more than one place:
  livesIn(x,y) ← marriedTo(x,z) ∧ livesIn(z,y)   [0.8]
  livesIn(x,y) ← hasChild(x,z) ∧ livesIn(z,y)    [0.5]
(Hard) consistency constraints, cf. more general FOL constraints: Datalog with constraints, X-tuples in PDBs, owl:FunctionalProperty, owl:disjointWith, etc. For example, people are not born in different places/on different dates:
  bornIn(x,y) ∧ bornIn(x,z) → y = z
  bornOn(x,y) ∧ bornOn(x,z) → y = z
And people are not married to more than one person (at the same time, in most countries?):
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y ≠ z → disjoint(t1,t2)

URDF Running Example
RDF base facts (with confidences):
  hasAdvisor(Surajit, Jeff) [0.8]          hasAdvisor(David, Jeff) [0.7]
  worksAt(Jeff, Stanford) [0.9]
  graduatedFrom(Surajit, Princeton) [0.7]  graduatedFrom(Surajit, Stanford) [0.6]
  graduatedFrom(David, Princeton) [0.9]
  type(Jeff, Computer_Scientist) [1.0]     type(Surajit, Computer_Scientist) [1.0]
  type(David, Computer_Scientist) [1.0]    type(Stanford, University) [1.0]
  type(Princeton, University) [1.0]
Soft rule [0.4]:
  graduatedFrom(x,z) ← hasAdvisor(x,y) ∧ worksAt(y,z)
Hard rule (functional constraint):
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y = z
Derived fact candidates (marked [?] in the figure): gradFr(Surajit, Stanford), gradFr(David, Stanford).

Basic Types of Inference
- MAP inference: find the most likely assignment to the query variables y under given evidence x. Compute argmax_y P(y | x). (NP-complete; solved via MaxSAT.)
- Marginal/success probabilities: the probability that query y is true in a random world under given evidence x. Compute Σ_y P(y | x). (#P-complete already for conjunctive queries.)

General Route: Grounding & MaxSAT Solving
Query: graduatedFrom(x, y)
1) Grounding: consider only the facts (and rules) that are relevant for answering the query.
2) Build a propositional formula in CNF, consisting of the grounded soft & hard rules plus the weighted base facts:
  (¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))   [1000]
  (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))       [1000]
  (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))   [0.4]
  (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))       [0.4]
  worksAt(Jeff, Stanford) [0.9]   hasAdvisor(Surajit, Jeff) [0.8]   hasAdvisor(David, Jeff) [0.7]
  graduatedFrom(Surajit, Princeton) [0.7]   graduatedFrom(Surajit, Stanford) [0.6]   graduatedFrom(David, Princeton) [0.9]
3) Propositional reasoning: find a truth assignment to the facts such that the total weight of the satisfied clauses is maximized.
→ MAP inference computes the “most likely” possible world.
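A minimal sketch (mine, not URDF's actual implementation) of the grounding step: applying a rule is essentially a join of its body atoms over the base facts, emitting one weighted CNF clause per match.

  # Base facts relevant to the query, as (relation, subject, object) -> confidence.
  facts = {
      ("hasAdvisor", "Surajit", "Jeff"): 0.8,
      ("hasAdvisor", "David", "Jeff"): 0.7,
      ("worksAt", "Jeff", "Stanford"): 0.9,
  }

  RULE_WEIGHT = 0.4   # graduatedFrom(x,z) <- hasAdvisor(x,y) /\ worksAt(y,z)

  # Ground the rule by joining its two body atoms on the shared variable y.
  clauses = []
  for (rel1, x, y) in facts:
      if rel1 != "hasAdvisor":
          continue
      for (rel2, y2, z) in facts:
          if rel2 == "worksAt" and y2 == y:
              # body -> head in CNF: (~body1 \/ ~body2 \/ head)
              clauses.append(([f"~hasAdvisor({x},{y})", f"~worksAt({y},{z})",
                               f"graduatedFrom({x},{z})"], RULE_WEIGHT))

  for literals, w in clauses:
      print(" v ".join(literals), f"[{w}]")
  # Emits the two grounded soft clauses (for Surajit and David) shown above.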
URDF: MaxSAT Solving with Soft & Hard Rules
[Theobald, Sozio, Suchanek, Nakashole: VLDS 2012]
Special case: Horn clauses as soft rules & mutex constraints as hard rules.
S (mutex constraints):
  { graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
  { graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }
C (weighted Horn clauses, in CNF):
  (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))   [0.4]
  (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))       [0.4]
  worksAt(Jeff, Stanford) [0.9]   hasAdvisor(Surajit, Jeff) [0.8]   hasAdvisor(David, Jeff) [0.7]
  graduatedFrom(Surajit, Princeton) [0.7]   graduatedFrom(Surajit, Stanford) [0.6]   graduatedFrom(David, Princeton) [0.9]
Find argmax_y P(y | x); this resolves to a variant of MaxSAT for propositional formulas.

MaxSAT Algorithm
  Compute W0 = Σ_{clauses C} w(C) · P(C is satisfied);
  For each hard constraint (mutex set) S_t {
      For each fact f in S_t {
          Compute W_{f,t}+ = Σ_{clauses C} w(C) · P(C is satisfied | f = true);
      }
      Compute W_{S,t}- = Σ_{clauses C} w(C) · P(C is satisfied | all facts in S_t = false);
      Choose the truth assignment for S_t that maximizes W_{f,t}+ resp. W_{S,t}-;
      Remove the satisfied clauses C;  t++;
  }
- Runtime: O(|S| · |C|)
- Approximation guarantee: 1/2
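Below is a runnable Python sketch of this greedy scheme. It is my own simplification, not the URDF implementation: partial assignments are scored by the expected weight of satisfied clauses, and each still-unassigned fact is treated as an independent coin flip with probability 1/2 (an assumption for illustration); all identifiers are shorthand names of mine.

  # Weighted CNF clauses over the running example, as (positive literals,
  # negative literals, weight). Shorthands: advS/advD = hasAdvisor(Surajit/
  # David, Jeff), work = worksAt(Jeff, Stanford), gSS/gSP/gDS/gDP =
  # graduatedFrom(Surajit/David, Stanford/Princeton).
  clauses = [
      ({"gSS"}, {"advS", "work"}, 0.4),   # soft rule grounded for Surajit
      ({"gDS"}, {"advD", "work"}, 0.4),   # soft rule grounded for David
      ({"work"}, set(), 0.9), ({"advS"}, set(), 0.8), ({"advD"}, set(), 0.7),
      ({"gSP"}, set(), 0.7), ({"gSS"}, set(), 0.6), ({"gDP"}, set(), 0.9),
  ]
  mutex_sets = [{"gSS", "gSP"}, {"gDS", "gDP"}]   # hard rules: at most one true

  def p_sat(pos, neg, assign):
      # P(clause satisfied) when each unassigned fact is true w.p. 1/2.
      p_unsat = 1.0
      for v in pos:
          pv = {True: 1.0, False: 0.0}.get(assign.get(v), 0.5)
          p_unsat *= 1.0 - pv
      for v in neg:
          pv = {True: 1.0, False: 0.0}.get(assign.get(v), 0.5)
          p_unsat *= pv
      return 1.0 - p_unsat

  def expected_weight(assign):
      return sum(w * p_sat(pos, neg, assign) for pos, neg, w in clauses)

  assign = {}
  for block in mutex_sets:
      # Feasible options: exactly one fact of the block true, or all false.
      options = [{f: f == g for f in block} for g in block] + [{f: False for f in block}]
      assign.update(max(options, key=lambda o: expected_weight({**assign, **o})))

  # Greedily fix the remaining facts to their better polarity.
  for v in {v for pos, neg, _ in clauses for v in pos | neg} - assign.keys():
      assign[v] = expected_weight({**assign, v: True}) >= expected_weight({**assign, v: False})

  print(assign, round(expected_weight(assign), 3))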
Experiment (I): MAP Inference
- YAGO knowledge base: 2 M entities, 20 M facts.
- Query answering: deductive grounding & MaxSAT solving for 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …).
- Asymptotic runtime checks via synthetic (random) soft-rule expansions.
[Plots: URDF grounding & MaxSAT-solving runtimes as a function of |C| (# literals in the grounded soft rules) and |S| (# literals in the grounded hard rules), plus URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT).]

Basic Types of Inference
- MAP inference ✔ (NP-complete; via MaxSAT).
- Marginal/success probabilities: the probability that query y is true in a random world under given evidence x. Compute Σ_y P(y | x). (#P-complete already for conjunctive queries.)

Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog)
[Yahya, Theobald: RuleML 2011; Dylla, Miliaraki, Theobald: ICDE 2013]
Query: graduatedFrom(Surajit, y)
Rules:
  graduatedFrom(x,z) ← hasAdvisor(x,y) ∧ worksAt(y,z)   [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y = z
Relevant base facts, abbreviated as:
  A = graduatedFrom(Surajit, Princeton) [0.7]
  B = graduatedFrom(Surajit, Stanford) [0.6]
  C = hasAdvisor(Surajit, Jeff) [0.8]
  D = worksAt(Jeff, Stanford) [0.9]
(plus graduatedFrom(David, Princeton) [0.9], hasAdvisor(David, Jeff) [0.7], and the type facts with confidence 1.0)
SLD resolution yields two answers with their lineages:
  Q1 = graduatedFrom(Surajit, Princeton), lineage A
  Q2 = graduatedFrom(Surajit, Stanford), lineage B ∨ (C ∧ D)

Lineage & Possible Worlds
[Das Sarma, Theobald, Widom: ICDE 2008; Dylla, Miliaraki, Theobald: ICDE 2013]
1) Deductive grounding traces the lineage of the individual query answers over the dependency graph of the query.
2) The lineage is a DAG (not necessarily in CNF) over the grounded soft & hard rules and the probabilistic base facts. With the mutex constraint between the two answers:
  P(C ∧ D) = 0.8 × 0.9 = 0.72
  P(B ∨ (C ∧ D)) = 1 - (1 - 0.72) × (1 - 0.6) = 0.888
  P(Q1) = P(A ∧ ¬(B ∨ (C ∧ D))) = 0.7 × (1 - 0.888) = 0.0784
  P(Q2) = P(¬A ∧ (B ∨ (C ∧ D))) = (1 - 0.7) × 0.888 = 0.2664
3) Probabilistic inference computes marginals:
  P(Q): sum up the probabilities of all possible worlds that entail the query answer's lineage.
  P(Q | H): drop the “impossible” worlds, i.e., those violating the hard rules H.

Possible-Worlds Semantics
Enumerating all 16 truth assignments to A [0.7], B [0.6], C [0.8], D [0.9], where e.g. the world (A=1, B=1, C=1, D=1) has probability 0.7 × 0.6 × 0.8 × 0.9 = 0.3024 and all 16 world probabilities sum to 1.0, and summing over the worlds that entail each lineage formula yields P(Q1) = 0.0784 and P(Q2) = 0.2664. Conditioning on the hard rule H renormalizes over the remaining possible worlds; the slide reports a mass of 0.412 for these worlds, i.e. P(Q1 | H) = 0.0784 / 0.412 = 0.1903 and P(Q2 | H) = 0.2664 / 0.412 = 0.6466.

Inference in Probabilistic Databases
- Safe query plans [Dalvi, Suciu: VLDB Journal 2007]: confidences can be propagated along with the relational operators.
- Read-once functions [Sen, Deshpande, Getoor: PVLDB 2010]: a Boolean formula can (in polynomial time) be factorized into read-once form, where every variable occurs at most once.
- Knowledge compilation [Olteanu et al.: ICDT 2010, ICDT 2011]: a Boolean formula can be decomposed into an ordered binary decision diagram (OBDD), such that inference resolves to independent-and and independent-or operations over the decomposed formula.
- Top-k pruning [Ré, Dalvi, Suciu: ICDE 2007; Karp, Luby, Madras: J. Algorithms 1989]: the top-k answers can be returned based on lower and upper bounds, even without knowing their exact marginal probabilities. Multi-simulation: run multiple Markov-Chain-Monte-Carlo (MCMC) simulations in parallel.
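As a sanity check (a sketch of mine, feasible only for tiny examples), a brute-force possible-worlds enumeration over the four base facts reproduces the marginals from the Lineage & Possible Worlds slides above; the mutex constraint is encoded by requiring the competing answer's lineage to be false.

  from itertools import product

  P = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}   # base-fact marginals

  def stanford(w):   # lineage of the Stanford answer: B v (C ^ D)
      return w["B"] or (w["C"] and w["D"])

  p_q1 = p_q2 = 0.0
  for bits in product([True, False], repeat=4):
      w = dict(zip("ABCD", bits))
      pw = 1.0
      for v, b in w.items():             # world probability, tuple independence
          pw *= P[v] if b else 1.0 - P[v]
      if w["A"] and not stanford(w):     # Princeton answer, Stanford lineage false
          p_q1 += pw
      if stanford(w) and not w["A"]:     # Stanford answer, Princeton fact false
          p_q2 += pw

  print(round(p_q1, 4), round(p_q2, 4))  # 0.0784 and 0.2664, as on the slides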
Monte Carlo Simulation (I)
[Suciu & Dalvi: SIGMOD 2005 tutorial; Karp, Luby, Madras: J. Algorithms 1989]
Boolean formula: E = X1X2 ∨ X1X3 ∨ X2X3
Naïve sampling:
  cnt = 0
  repeat N times:
      randomly choose X1, X2, X3 ∈ {0,1}
      if E(X1, X2, X3) = 1 then cnt = cnt + 1
  P = cnt / N
  return P   /* estimate of the true Pr(E) */
Zero/One-Estimator Theorem: if N ≥ (1 / Pr(E)) × (4 ln(2/δ) / ε²), then Pr[ |P/Pr(E) - 1| > ε ] < δ.
Works for any E, but N may be very big for small Pr(E), so this is not in PTIME.

Monte Carlo Simulation (II)
Boolean formula in DNF: E = C1 ∨ C2 ∨ … ∨ Cm
Importance sampling (Karp-Luby):
  cnt = 0;  S = Pr(C1) + … + Pr(Cm)
  repeat N times:
      randomly choose i ∈ {1, 2, …, m} with probability Pr(Ci) / S
      randomly choose X1, …, Xn ∈ {0,1} such that Ci = 1
      if C1 = 0 and C2 = 0 and … and Ci-1 = 0 then cnt = cnt + 1
  P = (cnt / N) × S
  return P   /* estimate of the true Pr(E) */
Theorem: if N ≥ m × (4 ln(2/δ) / ε²), then Pr[ |P/Pr(E) - 1| > ε ] < δ.
This is better: for E in DNF, the estimate is computable in PTIME.
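A runnable Python version of the Karp-Luby estimator above (my sketch; the three-clause formula E and the fair-coin probabilities are the slide's example, all function names are mine):

  import random

  # E = (X1 & X2) | (X1 & X3) | (X2 & X3); each Xi is true with probability 1/2.
  clauses = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]
  p = {"X1": 0.5, "X2": 0.5, "X3": 0.5}

  def pr_clause(c):
      prob = 1.0
      for v in c:
          prob *= p[v]
      return prob

  def karp_luby(n_samples):
      S = sum(pr_clause(c) for c in clauses)
      cnt = 0
      for _ in range(n_samples):
          # Choose clause i with probability Pr(Ci)/S (roulette-wheel selection).
          r, i = random.random() * S, 0
          while r > pr_clause(clauses[i]):
              r -= pr_clause(clauses[i])
              i += 1
          # Sample a world conditioned on clause i being true: its variables
          # are forced to true, all others are independent coin flips.
          world = {v: v in clauses[i] or random.random() < p[v] for v in p}
          # Count the sample only if no earlier clause is satisfied as well.
          if not any(all(world[v] for v in clauses[j]) for j in range(i)):
              cnt += 1
      return S * cnt / n_samples

  random.seed(42)
  print(karp_luby(100_000))   # close to the exact Pr(E) = 0.5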
Top-k Ranking by Marginal Probabilities
[Dylla, Miliaraki, Theobald: ICDE 2013]
Query: graduatedFrom(Surajit, y), grounded via Datalog/SLD resolution.
Top-down grounding allows us to compute lower and upper bounds on the marginal probabilities of the answer candidates before the rules are fully grounded. Subgoals may still represent entire sets of answer candidates, captured by first-order lineage formulas such as
  Φ(Q1) = A,   Φ(Q2) = B ∨ ∃y gradFrom(Surajit, y)
→ entire sets of answer candidates represented by a formula Φ can be pruned at once.

Bounds for First-Order Formulas
Theorem 1: given a (partially grounded) first-order lineage formula Φ, e.g. Φ(Q2) = B ∨ ∃y gradFrom(S, y):
- Lower bound P_low (for all query answers obtainable from grounding Φ): substitute ∃y gradFrom(S, y) with false (or with true if negated).
  P_low(Q2) = P(B ∨ false) = P(B) = 0.6
- Upper bound P_up (for all query answers obtainable from grounding Φ): substitute ∃y gradFrom(S, y) with true (or with false if negated).
  P_up(Q2) = P(B ∨ true) = P(true) = 1.0
Proof (sketch): substituting a subformula with false reduces the number of models (possible worlds) that satisfy Φ; substituting it with true increases that number.

Convergence of Bounds
Theorem 2: let Φ1, …, Φn be a series of first-order lineage formulas obtained from grounding Φ via SLD resolution, and let φ be the propositional lineage formula of an answer obtained from this grounding procedure. Then rewriting each Φi according to Theorem 1 into P_{i,low} and P_{i,up} creates a monotonic series of lower and upper bounds that converges to P(φ):
  0 = P(false) ≤ P(B ∨ false) = 0.6 ≤ P(B ∨ (C ∧ D)) = 0.888 ≤ P(B ∨ true) = P(true) = 1
Proof (sketch, via induction): replacing true by a subformula reduces the number of models that satisfy Φ; replacing false by a subformula increases that number.

Top-k Pruning & Stopping Condition (Fagin-style)
[Fagin et al. 2001; Balke, Kießling 2002; Dylla, Miliaraki, Theobald: ICDE 2013]
Maintain two disjoint queues: a Top-k queue sorted by P_low and a Candidates queue sorted by P_up.
- Pruning: drop an answer Qj from the Candidates queue as soon as its upper bound P_up(Qj) falls below the k-th-largest lower bound.
- Stopping: at the t-th grounding step, stop and return the Top-k queue once min{ P_{t,low}(Qk) | Qk ∈ Top-k } > max{ P_{t,up}(Qj) | Qj ∈ Candidates }.
[Figure: per-answer lower and upper bounds converging over the number of SLD steps t; for k = 2, once the 2nd-largest lower bound exceeds all candidates' upper bounds, the top-2 query answers are returned.]

Experiment (II): Computing Marginals
- IMDB data with 26 M facts about movies, directors, actors, etc.
- 4 query patterns, each instantiated to 1,000 queries (reporting runtime averages), comparing top-k processing (k = 10/20/50) and multi-simulation against Postgres, MayBMS, and Trio:
  Q1: safe, non-repeating hierarchical
  Q2: unsafe, repeating hierarchical
  Q3: unsafe, head-hierarchical
  Q4: general unsafe
- Further measurements: runtime vs. the number of top-k results for a single join query, and the percentage of tuples scanned from the input relations.

Basic Types of Inference
- MAP inference ✔ (NP-complete; via MaxSAT).
- Marginal/success probabilities ✔ (#P-complete already for conjunctive queries).

Probabilistic & Temporal Database
A temporal-probabilistic database D^T_p (compactly) encodes a probability distribution over a finite set of deterministic database instances Di and a finite time domain T, e.g.:
  BornIn(DeNiro, Greenwich)   T = [1943, 1944)   p = 0.9
  BornIn(DeNiro, Tribeca)     T = [1998, 1999)   p = 0.6
  Wedding(DeNiro, Abbott)     T = [1936, 1940)   p = 0.3
  Wedding(DeNiro, Abbott)     T = [1976, 1977)   p = 0.7
  Divorce(DeNiro, Abbott)     T = [1988, 1989)   p = 0.8

Sequenced Semantics & Snapshot Reducibility
[Dignös, Gamper, Böhlen: SIGMOD 2012]
- Built-in semantics: reduce the temporal-relational operators to their non-temporal counterparts at each snapshot of the database.
- Coalesce/split tuples with consecutive time intervals based on their lineages.

Non-Sequenced Semantics
[Dylla, Miliaraki, Theobald: PVLDB 2013]
- Queries can freely manipulate timestamps just like regular attributes.
- A single temporal operator ≤T suffices to express all of Allen's 13 temporal relations (see the sketch below).
- Deduplicate tuples with overlapping time intervals based on their lineages.
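As a tiny illustration of the preceding point (mine, not the paper's implementation): over half-open intervals [b, e), each of Allen's relations reduces to a conjunction of endpoint comparisons, and strict < as well as equality follow from ≤ and negation, so a single ≤T operator suffices.

  # Intervals are half-open pairs (begin, end).

  def before(i, j):      # i lies strictly before j
      return i[1] < j[0]

  def meets(i, j):       # i ends exactly where j begins
      return i[1] == j[0]

  def overlaps(i, j):    # i starts first, the two intersect, j ends last
      return i[0] < j[0] and j[0] < i[1] and i[1] < j[1]

  wedding, divorce = (1976, 1977), (1988, 1989)
  print(before(wedding, divorce))   # True: Te1 <=T Tb2 holds strictly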
Temporal Alignment & Deduplication
Base facts: f1 = Wedding(DeNiro, Abbott) starting 1936, f2 = Wedding(DeNiro, Abbott) starting 1976, f3 = Divorce(DeNiro, Abbott) starting 1988, over the time domain [tmin, tmax).
Non-sequenced deduction rules:
  MarriedTo(X,Y)[Tb1, tmax) ← Wedding(X,Y)[Tb1, Te1) ∧ ¬Divorce(X,Y)[Tb2, Te2)
  MarriedTo(X,Y)[Tb1, Te2) ← Wedding(X,Y)[Tb1, Te1) ∧ Divorce(X,Y)[Tb2, Te2) ∧ Te1 ≤T Tb2
Grounding these rules yields deduced MarriedTo facts with lineages (f1 ∧ f3), (f1 ∧ ¬f3), (f2 ∧ f3), and (f2 ∧ ¬f3) over overlapping intervals; temporal alignment splits the time axis at 1936, 1976, and 1988, and deduplication merges the deduced tuples per aligned interval by disjoining their lineages.

Inference in Temporal-Probabilistic Databases
[Wang, Yahya, Theobald: MUD 2010; Dylla, Miliaraki, Theobald: PVLDB 2013]
Example rule: teamMates(x, y, T3) ← playsFor(x, z, T1) ∧ playsFor(y, z, T2) ∧ overlaps(T1, T2, T3)
Base facts with per-interval probabilities (as in the figure):
  playsFor(Beckham, Real, T1): 0.4 for '03-'05, 0.6 for '05-'07
  playsFor(Ronaldo, Real, T2): 0.1 for '00-'02, 0.2 for '03-'04, 0.4 for '04-'05, 0.2 for '05-'07
  playsFor(Zidane, Real, T3): analogous intervals
Derived facts: teamMates(Beckham, Ronaldo, T4) with 0.08 for '03-'04, 0.16 for '04-'05, and 0.12 for '05-'07, as well as teamMates(Beckham, Zidane, T5) and teamMates(Ronaldo, Zidane, T6). Derived facts that share base facts in their lineages are non-independent.
Key properties:
- Closed and complete representation model (incl. lineage).
- Temporal alignment is linear in the number of input intervals.
- Confidence computation per interval remains #P-hard; in general it requires Monte Carlo approximations (Luby-Karp for DNF, MCMC-style sampling), decompositions, or top-k pruning.
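A minimal sketch of the teamMates derivation (mine; the interval endpoints and probabilities are my reconstruction of the figure): intersect each pair of independent playsFor intervals and multiply their marginals.

  # (begin, end, probability) per interval; intervals are half-open [begin, end).
  beckham = [(2003, 2005, 0.4), (2005, 2007, 0.6)]
  ronaldo = [(2000, 2002, 0.1), (2003, 2004, 0.2), (2004, 2005, 0.4), (2005, 2007, 0.2)]

  teammates = []
  for b1, e1, p1 in beckham:
      for b2, e2, p2 in ronaldo:
          lo, hi = max(b1, b2), min(e1, e2)
          if lo < hi:
              # Independent base facts: the conjunction's probability is the product.
              teammates.append((lo, hi, p1 * p2))

  for lo, hi, p in teammates:
      print(f"teamMates(Beckham, Ronaldo) [{lo}, {hi}): {round(p, 2)}")
  # [2003, 2004): 0.08   [2004, 2005): 0.16   [2005, 2007): 0.12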
Experiment (III): Temporal Alignment & Probabilistic Inference
- 1,827 base facts with temporal annotations, extracted from free-text biographies on Wikipedia, IMDB.com, and biography.com.
- 11 handcrafted temporal deduction rules, e.g.:
  MarriedTo(X,Y)[Tb1, Te2) ← Wedding(X,Y)[Tb1, Te1) ∧ Divorce(X,Y)[Tb2, Te2) ∧ Te1 ≤T Tb2
- 21 handcrafted temporal consistency constraints, e.g.:
  BornIn(X,Y)[Tb1, Te1) ∧ MarriedTo(X,Y)[Tb2, Te2) → Te1 ≤T Tb2

Statistical Relational Learning & Probabilistic Programming
SRL combines first-order logic and probabilistic inference. It employs relational data as input, but with a focus also on learning the relations (facts, rules & weights), and relies on knowledge compilation for probabilistic inference.
- Markov Logic Networks (U. Washington): ground weighted first-order rules over a function-free Herbrand base into an undirected graphical model (a Markov Random Field); includes recent techniques for “lifted inference”.
- Probabilistic Programming (ProbLog, KU Leuven): deductive grounding over a set of base facts into a directed graphical model (SLD proofs → Bayesian network).

Learning Soft Deduction Rules
Goal: inductively learn a soft rule such as S: livesIn(x,y) :- bornIn(x,y), given
- G: the ground truth for livesIn (only partially known),
- KB: the knowledge base for livesIn (known positive examples),
- R: the facts inferred for livesIn from the body of the rule, i.e. from bornIn (only partially correct).
Confidence measure (a small sketch follows after the next two slides):
  confidence(S) = P(Head | Body) ≈ |Head ∧ Body| / |Body|
Approach:
- An inductive learning algorithm based on dynamic programming.
- Apriori-style pre-filtering & pruning of low-support join patterns.
- Adaptation of confidence and support measures from data mining.
- Learning “interesting” rules with constants and type constraints.

Learning “Interesting” Deduction Rules (I)
[Figure: relative-frequency plots of the income distribution versus quarterOfBirth (rule body income(x,y) ∧ quarterOfBirth(x,z)) and versus educationLevel (rule body income(x,y) ∧ educationLevel(x,z)) over actual US census data from Oct. 2009 (>1 billion RDF facts). The divergence from the “Overall population” curve shows a strong correlation of income with educationLevel, but not with quarterOfBirth.]

Learning “Interesting” Deduction Rules (II)
Learned rules of the form income(x, y) :- educationLevel(x, z), e.g.:
  income(x, “low”) :- educationLevel(x, “Nursery school to Grade 4”)
  income(x, “medium”) :- educationLevel(x, “Professional school degree”)
  income(x, “high”) :- educationLevel(x, “Professional school degree”)
Interestingness is measured via the divergence (Kullback-Leibler or χ²) of the conditional income distributions for “Nursery school to Grade 4” and “Professional school degree” from the “Overall population” distribution over a discretized income domain (low/medium/high).
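To make the confidence and support measures concrete, here is a small sketch of mine that scores the candidate rule livesIn(x,y) :- bornIn(x,y); the relation extensions are toy data assumed purely for illustration.

  # Toy extensions of the two relations (assumed data, for illustration only).
  born_in  = {("Alice", "Paris"), ("Bob", "Rome"), ("Carol", "Oslo"), ("Dave", "Kiel")}
  lives_in = {("Alice", "Paris"), ("Bob", "Berlin"), ("Carol", "Oslo")}

  # Candidate rule S: livesIn(x,y) :- bornIn(x,y).
  body = born_in                     # groundings of the rule body
  head_and_body = body & lives_in    # groundings where the head holds as well

  support = len(head_and_body)       # absolute support, as in Apriori
  confidence = support / len(body)   # ~ P(Head | Body)
  print(f"support = {support}, confidence = {confidence:.2f}")   # 2 and 0.50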
Summary & Challenges (I): Web-Scale Information Extraction
A spectrum of approaches, with human effort and ontological rigor increasing from surface names & patterns to canonicalized entities & relations:
- Open-domain & unsupervised (names & patterns), e.g. ⟨“N. Portman”, “honored with”, “Academy Award”⟩, ⟨“Jeff Bridges”, “expected to win”, “Oscar”⟩, ⟨“Bridges”, “nominated for”, “Academy Award”⟩. Systems: TextRunner, Probase, WebTables/FusionTables.
- Domain-oriented, with training data/facts (entities & relations), e.g. wonAward: Person × Prize, with facts such as type(Meryl_Streep, Actor), wonAward(Meryl_Streep, Academy_Award), wonAward(Natalie_Portman, Academy_Award), wonAward(Ethan_Coen, Palme_d'Or). Systems: StatSnowball/EntityCube, ReadTheWeb/NELL, Sofie/Prospera, Freebase, DBpedia 3.8, YAGO2.

Summary & Challenges (II): RDF is Not Enough!
- HMMs, CRFs, and PCFGs (not covered in this talk) yield much richer output structures than just triples.
- Extraction should move from facts to beliefs, modifiers, modalities, etc., and on to intensional knowledge (“rules”).
- We need more expressive but canonical representations of natural language (trees, graphs, objects, frames: F-logic, KL-ONE, CycL, OWL, etc.), all combined with structured probabilistic inference.

Summary & Challenges (III): Scalable Probabilistic Inference
- Exact lifted inference via Weighted First-Order Model Counting (WFOMC): a “domain-liftable” FO formula, e.g. ∀X,Y ∈ People: smokes(X) ∧ friends(X,Y) → smokes(Y), is compiled into a corresponding FO d-DNNF circuit. The probability of a query then depends only on the size(s) of the domain(s), a weight function for the first-order predicates, and the weighted model count over the FO d-DNNF.
  [Van den Broeck 2011]: compilation rules and inference algorithms for FO d-DNNFs.
  [Jha & Suciu 2011]: classes of SQL queries that admit polynomial-size (propositional) d-DNNFs.
- Approximate inference via belief propagation, MCMC-style sampling, etc.
- Scale-out via distributed grounding & inference: TrinityRDF (MSR), GraphLab2 (MIT).

Final Summary
C1: Text is not just unstructured data.
C2: Probabilistic databases combine first-order logic and probability theory in an elegant way.
C3: Natural-Language-Processing people, Database guys, and Machine-Learning folks: it's about time to join your forces!

Demo: urdf.mpi-inf.mpg.de

References
Maximilian Dylla, Iris Miliaraki, Martin Theobald: A Temporal-Probabilistic Database Model for Information Extraction. PVLDB 6(14), 2013 (to appear).
Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE 2013.
Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS 2012: 15-20.
Mohamed Yahya, Martin Theobald: D2R2: Disk-Oriented Deductive Reasoning in a RISC-Style RDF Engine. RuleML America 2011: 81-96.
Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive Reasoning in Uncertain RDF Knowledge Bases. CIKM 2011: 2557-2560.
Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Scalable Knowledge Harvesting with High Precision and High Recall. WSDM 2011: 227-236.
Maximilian Dylla, Mauro Sozio, Martin Theobald: Resolving Temporal Conflicts in Inconsistent RDF Knowledge Bases. BTW 2011: 474-493.
Yafang Wang, Mohamed Yahya, Martin Theobald: Time-aware Reasoning in Uncertain Knowledge Bases. MUD 2010: 51-65.
Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Find your Advisor: Robust Knowledge Gathering from the Web. WebDB 2010.
Anish Das Sarma, Martin Theobald, Jennifer Widom: LIVE: A Lineage-Supported Versioned DBMS. SSDBM 2010: 416-433.
Anish Das Sarma, Martin Theobald, Jennifer Widom: Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases. ICDE 2008: 1023-1032.
Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with Uncertainty and Lineage. VLDB Journal 17(2): 243-264, 2008.