Probabilities in Databases and Logics I Nilesh Dalvi and Dan Suciu University of Washington Two Lectures •Today: probabilistic database to model imprecisions probabilistic logics •Tomorrow: probabilistic database to model incompletness random graphs 2 Motivation Record reconciliation Information extraction Constraint violations Schema matching 3 Problem Setting Queries: Tables: A(x,y) :- Review(x,y), Movie(x,z), z > 1991 Review Movie name rating p Monkey Love good .5 title year p fair .2 Twelve Monkeys 1995 .8 fair .6 Monkey Love 1997 .4 poor .9 Monkey Love 1935 .9 title Monkey Love Pl 2005 .7 Twelve Monkeys fair .53 Monkey Love good .42 Monkey Love Pl fair .15 Problem: complexity of query evaluation Answers: Top k rating p 5 Two Problems Fixed schema S, conjunctive query Q(x,y) Query evaluation problem Fix answer tuple (a,b) Given database I, compute Pr(Q(a,b)) Top-k answering problem Fix k > 0 Given database I, find k answer tuples with highest probabilities 6 Related Work: DB Cavallo&Pitarelli:1987 Barbara,Garcia-Molina, Porter:1992 Lakshmanan,Leone,Ross&Subrahmanian:199 7 Fuhr&Roellke:1997 Dalvi&S:2004 Widom:2005 7 Related Work: Logic Query reliability [Gradel,Gurevitch,Hirsch’98] Degrees of belief [Bacchus,Grove,Halpern,Koller’96] Probabilistic Logic [Nielson] Probabilistic model checking [Kwiatkowska’02] Probabilistic Relational Model [Taskar,Abbeel,Koller’02] 8 Outline Definitions Query Evaluation Top-k answering (joint with Chris Re) Conclusions 9 Probabilistic Database •Schema S, Domain D, Set of instances Inst •Definition Probabilistic database is a probability distribution Pr : Inst → [0,1], ∑I Pr[I] = 1 If Pr[I] > 0 then I is called “possible world” 23 Probabilistic Database •Representation: Independent tuples: I-database DB over some schema Si Independent and disjoint tuples: ID-database DB over some schema Sid •Semantics: DB “means” probability distribution Pr over schema S 24 Representation I-Databases Reviewsi(M,S,p) Reviews(M,S) Mov Scor Movie Score P m42 good p m99 good p m76 poor p 1 2 3 Mov Scor Mov Scor Mov Scor Mov Scor Mov Scor Mov Scor Mov Scor m42 good m99 good m42 good m76 poor m42 good m42 1995 m42 good m76 poor m76 poor m99 good m99 good Pr[I1]= (1-p )*(1-p )*(1-p ) 1 2 3 m76 poor Pr[I4]= p *p *(1-p ) 1 2 3 Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1 Possible worlds semantics Pr[I8]= p *p *p 1 2 3 26 ID-Databases id Activities Activities Time Act Timed Activity P t walk t run t+1 walk p 1 p 2 p 3 Time Act Time Act Time Act Time Act Time Act t t t+1 t walk t run t+1 walk t+1 walk walk run walk Pr[I1]= (1-p -p )*(1-p ) 1 2 3 Pr[I3]= p *(1-p ) 2 3 Pr[I5]= p *p 1 3 Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1 28 ID subsumes I id Reviews i Reviews Movied Scored P m42 good p m99 good p m76 poor p = 1 2 3 Note: Movie Score P Reviewsid m42 good p m99 good p m76 poor p 1 2 3 Movie Score P m42 good p m99 good p m76 poor p means all tuples are disjoint 1 2 3 29 Queries •Syntax: conjunctive queries over schema S Q(y) :- Movie(x,y), Review(x,z), z >= 3 i Review i Movie id m42 m99 m76 m05 year 1995 2002 2002 2005 P 0.95 0.65 0.1 0.7 mid rating p m42 4 0.7 m42 m99 m99 5 5 4 0.45 0.82 0.68 m05 5 0.79 30 Two Query Semantics •Possible answer sets Given set A: Pr[{t | I ⊨ Q(t)} = A] Used for views •Possible This talk tuples Given tuple t: Pr[I ⊨ Q(t)] Used for query evaluation and top-k 31 Query Semantics id year m42 id year 1935 m76 1903 id year 2004 m99 m99 1901 m05 m76 1902 p1 p2 id year 1995 m87 1934 m99 1935 m44 1904 m05 2004 p3 p4 Q(y) :- Movie(x,y), Review(x,z) Tuple probabilities year 1935 2004 1995 . . . p p2 + p3 = 0.6 p1 + p3 = 0.5 p3 = 0.2 . . . top k 35 Summary on Data Model Data Model: Semantics = possible worlds Syntax = I-databases or ID-databases Queries: Syntax = unchanged (conjunctive queries) Semantics = tuple probabilities 38 Outline Definitions Query evaluation Top-k answering Conclusions 39 Problem Definition •Fix schema S, query Q, answer tuple t •Problem: given I/ID-database DB, compute Pr[I ⊨ Q(t)] notation: Pr[Q(t)] •Conventions: For upper bounds (P or #P): probabilities are rationals For lower bounds (#P): probabilities are 1/2 40 Query Evaluation on I-Databases •Outline Intuition Extensional plans: PTIME case Hard queries: #P-complete case Dichotomy Theorem 41 Q(y) :- Movie(x,y), Review(x,z) Intuition Reviewi Moviei id year p m42 1995 p1 m99 2002 p2 m76 2002 p3 m05 2005 p4 mid m42 m42 m42 m99 m99 m76 Answer Year rate 4 2 3 1 3 5 p q1 q2 q3 q4 q5 q6 p 1995 p1 × (1 - (1 - q1) ×(1 - q2)×(1 - q3)) 2002 1 - (1 - p2 × (1 - (1 - q4)×(1 ) ×- q5)) (1 - p3 × )q6 42 I-Extensional Plans [Barbara92,Lakshmanan97] Add p Join ⋈ p = p1 * p2 Projection ∏ p = 1-(1-p1)(1-p2)...(1-pn) Selection σ p = p Note: data complexity is PTIME 43 Q(y) :- Movie(x,y), Review(x,z) 1995 1-(1-pq1)(1-pq2)(1pq3) ∏ ∏ 1995 m1 pq1 p(1-(1-q1)(1-q2)(11995 m1 q3)) ⋈ 1 - (1-q1)(1-q2)(1m1 q3) 1995 m1 pq2 ⋈ 1995 m1 pq3 ∏ Movie Review 1995 p Movie m1 q1 m1 q2 m1 q3 INCORRECT! 1995 p Review m1 q1 m1 q2 m1 q3 CORREC 46 T #P-Complete Queries i R A i T S p A B B p p1 q1 p2 q2 p3 q3 p4 q4 Qbad :- Ri(x), S(x,y), Ti(y) Theorem: Data complexity is #P-complete 48 Proof: Theorem [Provan&Ball83] Counting the number of satisfying assignments for bipartite DNF is #Pcomplete Reduction: x2y3 V x1y2 V x4y3 V x3y1 i T S Ri A p A B B p x1 1/2 x2 y3 y1 1/2 x2 1/2 x1 y2 y2 1/2 x3 1/2 x4 y3 y3 1/2 x4 1/2 x3 y1 Qbad :- Ri(x), S(x,y), Ti(y) 49 I-Dichotomy Q = boolean conjunctive query Definition 1. For each variable x: goals(x) = set of goals that contain x Definition 2. Q is hierarchical if forall x, y: (a) goals(x) ∩ goals(y) = ∅, or (b) goals(x) ⊆ goals(y), or (c) goals(y) ⊆ goals(x) Q :- R(x),S(x,y),T(x,y,z),K(x,v) x y z R S v T K “hierarchical” Q :- R(x), S(x,y), T(y) x “non-hierarchical” R y S T 52 [Dalvi&S.’04] I-Dichotomy Schema Si = {R1i, R2i, . . ., Rmi} Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: or: Q is in PTIME Q has a correct extensional plan Q is hierarchical Q is #P-complete Q has subgoals R(x,...),S(x,y,...),T(y,...) 53 Proof Lemma 1. If Q is non-hierarchical, then #P-complete Proof: y x v R S z T K Q :- Ri(v,x), Si(x,y), Ti(y,z), Ki(z) rest is like for Qbad 54 Proof Lemma 2. If Q is hierarchical, then PTIME Proof: Case 1: has no root Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3) This is extensional join ⋈ 55 Proof Case 2: has root x x Dom={a1, a2, . . ., an} Pr(Q) = 1 - (1-Pr(Q(a1/x))(1-Pr(Q(a2/x))...(1-Pr(Q(an/x))) This is an extensional projection: ∏ QED 56 Query Evaluation on ID-Databases ID-extensional plans #P-complete queries Dichotomoy Theorem 57 Extensional Plans for ID-DBs Only difference: two kinds of projections: independent 1-(1-p1)...(1-pn) disjoint p1 + ... + pn 58 #P-Complete Queries Q1 :- Ri(x), Si(x,y), Ti(y) Q2 :- Rd(xd,y), Sd(yd,z) Q3 :- Rd(xd,y), Sd(zd,y) 59 [Dalvi&S.’04] I-DB Dichotomy Schema Sid s.t. each table is either Ri or Rid Theorem Let Q = conjunctive query w/o self-joins. Then one of the following holds: Q is in PTIME Q has a correct extensional plan or: Q is #P-complete Q has one of Q1, Q2, Q3 as subqueries 60 Extensions •Extensions of the dichotomoy theorem exists for: Mixed schemas (some relations are deterministic) Functional dependencies 61 Summary on Query Evaluation •Extensional plans: popular, efficient, BUT “Equivalent” plans lead to different results Some queries admit “correct” plans •Some simple queries: #P-complete complexity •Dichotomy •Future theorem work: remove ‘no-self-join’ restriction 62 Outline Definitions Query evaluation Top-k answering (joint with Chris Re) Conclusions 65 Top-k Ranking Problem •Fix schema S, query Q, number k > 0 •Problem: given I- or ID-database DB, find k answers t1,...,tk with highest probabilities Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ... •Note: Checking Pr[Q(ti)] > Pr[Q(tj)] is PP-complete Goal: efficient polynomial time approximation 69 Probabilities of Boolean Expressions A p e1 e2 e3 e1 p1 0 0 0 e2 p2 0 0 1 Pr (1-p1)(1-p2)(1p3) (1-p1)(1-p2)p3 e3 p3 0 1 0 (1-p1)p2(1-p3) 0 1 1 (1-p1)p2p3 1 0 0 p1(1-p2)(1-p3) 1 0 1 p1(1-p2)p3 1 1 0 p1p2(1-p3) 1 1 1 p1p2p3 What is the probability of e1⋀e2 ⋁ e1⋀e2 ⋁ e1⋀e3? Theorem #P-hard [Valiant] (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3 70 Monte Carlo Simulation Algorithm: radomly pick each e1, e2, e3 = false or true compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false ? repeat Approximate probability p with frequency p’ [Karp&Luby’83] Pr( |p’-p| < ε ) > 1-δ p’- ε p’ p’+ ε p Better: PTAS 71 Monte Carlo Simulation 0 p 1 N=0 N=1 N=2 N=3 72 The Multisimulation Problem 0 Year P 1995 2002 ?? ?? 1933 ?? 1984 ?? 1 Schedule simulation steps to find top-k 73 Multisimulation How to find the top k out of n ? 0 p1 1 5 p4 2 1 p2 p54 p3 3 Example: looking for top k=2; Which one simulate next ? 74 Multisimulation Critical region: (k’th left, k+1’th right) 0 1 k=2 75 Multisimulation Algorithm Case 1: pick a “double crosser” and simulate it 0 this k=2 1 76 Multisimulation Algorithm Case 2: pick both a “left” AND a “right” crosser 0 1 this and this k=2 77 Multisimulation Algorithm Case 3: pick a “max crosser” and simulate it 0 this k=2 1 78 Multisimulation Algorithm End: when critical region is “empty” 0 1 k=2 To sort the top k, find the top k-1, etc 79 Multisimulation Algorithm Theorem (1) It runs in < 2 Optimal # steps (2) no other deterministic algorithm does better 80 Experiments 82 Summary on Top-k Answering Simple algorithm, optimal (x2) w.r.t. a very powerful standard Marriage of probabilistic and top-k answers make probabilistic databases practical 83 Outline Definitions Query evaluation Top-k answering Conclusions 85 Conclusions Strong motivation from practical applications Opportunity to merge query and search technologies Probabilistic DB’s are hard ! Great opportunity for impactful theory work Tomorrow: applications of random graphs to model incompleteness in databases 87 Thank you ! Questions ?