Containment of Conjunctive Queries on Annotated Relations
TJ Green
University of Pennsylvania
Symposium on Database Provenance
University of Edinburgh
May 21, 2008

Provenance and Query Optimization
• Many kinds of semiring-based provenance annotations to choose from:
  – lineage
  – why-provenance
  – minimal witness why-provenance
  – provenance polynomials
  – ...
• These seem to keep track of more or less information
• A fundamental question: how does this affect query optimization?

Conjunctive Queries on K-Relations
• Datalog-style syntax for conjunctive queries (CQs):
    Q(x,y) :- R(x,z), R(z,y)
• Semantics of applying the CQ to a K-relation $R : D \times D \to K$:
    $Q(a,b) = \sum_{z \in D} R(a,z) \cdot R(z,b)$
• The number of repetitions of an atom in the body matters
• For unions of conjunctive queries (UCQs) (equivalent to positive RA), sum over CQs:
    P(x,y) :- R(x,z), R(z,y)
    P(x,y) :- R(x,w), R(y,w)
• Semantics of the UCQ applied to R, a sum over CQs:
    $P(a,b) = \sum_{z \in D} R(a,z) \cdot R(z,b) + \sum_{w \in D} R(a,w) \cdot R(b,w)$

Choice of K Affects Query Optimization
• K = N (bag semantics) differs from K = B (set semantics), e.g., the conjunctive queries
    Q1(x) :- R(x,y), R(x,z)
    Q2(u) :- R(u,v)
  are set-equivalent, but not bag-equivalent
• Known results for bag semantics:

                                     CQs                                        UCQs
  Bag containment ($\sqsubseteq_N$)  ? ($\Pi_2^p$-hard) [Chaudhuri&Vardi 93]    undecidable [Ioannidis&Ramakrishnan 95]
  Bag equivalence ($\equiv_N$)       isomorphism ($\cong$) [CV 93]              ?
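To make the gap between set and bag semantics concrete, the sum-of-products semantics of a CQ on a K-relation can be evaluated by brute force. The following is a minimal sketch, not code from the talk: the semiring encoding as a `(zero, one, plus, times)` tuple, the helper `eval_cq`, and the sample instance are all assumptions chosen for illustration, and only a single binary relation R is supported.

```python
from itertools import product

# A semiring is modelled as (zero, one, plus, times). This encoding is an
# illustrative assumption, not notation from the slides.
BOOL = (False, True, lambda a, b: a or b, lambda a, b: a and b)  # K = B, sets
NAT = (0, 1, lambda a, b: a + b, lambda a, b: a * b)             # K = N, bags


def eval_cq(head, body, relation, domain, semiring):
    """Evaluate a CQ over one binary K-relation R.

    head: tuple of variable names; body: list of (var, var) atoms over R;
    relation: dict mapping (a, b) to an annotation (absent tuples are zero).
    Returns a dict from output tuples to their non-zero annotations.
    """
    zero, one, plus, times = semiring
    variables = sorted({v for atom in body for v in atom} | set(head))
    result = {}
    for values in product(domain, repeat=len(variables)):
        valuation = dict(zip(variables, values))
        term = one
        for x, y in body:  # one factor per atom occurrence, so repeats matter
            term = times(term, relation.get((valuation[x], valuation[y]), zero))
        out = tuple(valuation[v] for v in head)
        result[out] = plus(result.get(out, zero), term)
    return {t: k for t, k in result.items() if k != zero}


domain = ["a", "b", "c"]
q1 = (("x",), [("x", "y"), ("x", "z")])  # Q1(x) :- R(x,y), R(x,z)
q2 = (("u",), [("u", "v")])              # Q2(u) :- R(u,v)

bag_r = {("a", "b"): 2, ("a", "c"): 1}   # hypothetical bag instance
set_r = {t: True for t in bag_r}         # the same tuples under set semantics

# Set semantics: both queries return the same annotated relation.
print(eval_cq(*q1, set_r, domain, BOOL), eval_cq(*q2, set_r, domain, BOOL))
# Bag semantics: Q1 squares the multiplicity that Q2 reports (9 versus 3).
print(eval_cq(*q1, bag_r, domain, NAT), eval_cq(*q2, bag_r, domain, NAT))
```

On this one instance the two queries agree over B and disagree over N, which is enough to show they are not bag-equivalent. Agreement on a single instance does not by itself establish set-equivalence.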

Our Contributions
• We make a systematic study of query containment and query equivalence for various provenance models
• We show that K-containment and K-equivalence of CQs and UCQs are decidable for lineage, why-provenance, and the provenance polynomials N[X], as well as a new model, B[X]
• The decision procedures are based on variations of containment mappings (covering, onto, and exact)
• We analyze the complexity in each case
• As a corollary of the decidability result for N[X]-equivalence of UCQs, we also fill in a gap in the chart for bag semantics:

                                     CQs                                        UCQs
  Bag containment ($\sqsubseteq_N$)  ? ($\Pi_2^p$-hard) [Chaudhuri&Vardi 93]    undecidable [Ioannidis&Ramakrishnan 95]
  Bag equivalence ($\equiv_N$)       isomorphism ($\cong$) [CV 93]              isomorphism ($\cong$)

K-Containment for Queries
• For a semiring K, define $a \le_K b \iff \exists c.\ a + c = b$. If $\le_K$ is a partial order, it is called the natural order, and K is said to be naturally ordered
• B, N, lineage, why-provenance, B[X], and N[X] are all naturally ordered
• We define K-containment using the natural order, and K-equivalence as equality of annotations:
    $Q_1 \sqsubseteq_K Q_2 \iff \forall I\ \forall t.\ Q_1(I)(t) \le_K Q_2(I)(t)$
    $Q_1 \equiv_K Q_2 \iff \forall I\ \forall t.\ Q_1(I)(t) = Q_2(I)(t)$

A Hierarchy of Semiring Provenance (1)
• Provenance polynomials $(N[X], +, \cdot, 0, 1)$
  – tracks calculations abstractly; most general
  – e.g., $2p^2r + 3ps + ps^3$
• Drop coefficients to get $(B[X], +, \cdot, 0, 1)$: $p^2r + ps + ps^3$
• Drop exponents to get why-provenance $(P(P(X)), \cup, \Cup, \emptyset, \{\emptyset\})$: $\{\{p,r\}, \{p,s\}\}$
• Flatten the set of sets to get lineage $(P(X), +, \cdot, \bot, \emptyset)$: $\{p,r,s\}$
• Drop, flatten, etc. correspond to surjective semiring homomorphisms

A Hierarchy of Semiring Provenance (2)
• Suppose $h : K_1 \to K_2$ is a semiring homomorphism. Then $a \le_{K_1} b$ implies $h(a) \le_{K_2} h(b)$. If h is also surjective, then $P \sqsubseteq_{K_1} Q$ implies $P \sqsubseteq_{K_2} Q$, because every $K_2$-instance is the image under h of some $K_1$-instance and queries commute with h
• Definition: $K_1 \preceq K_2$ means $P \sqsubseteq_{K_2} Q$ implies $P \sqsubseteq_{K_1} Q$
• Proposition: for any positive K, $B \preceq K \preceq N[X]$ (all the semirings we consider are positive)
• Proposition (Provenance Hierarchy): $B \preceq \text{lineage} \preceq \text{why-prov.} \preceq B[X] \preceq N[X]$

Containment Mappings
• A containment mapping from CQ Q to CQ P is a function $h : \mathrm{Vars}(Q) \to \mathrm{Vars}(P)$ such that
  – the head of Q is mapped to the head of P
  – every atom in the body of Q is mapped to an atom in the body of P
• Theorem [CM 77]: For CQs P, Q we have $P \sqsubseteq_B Q$ iff there is a containment mapping from Q to P
  – e.g.
      Q1(x) :- R(x,y), R(x,z)
      Q2(u) :- R(u,v)
  – h, which sends $u \mapsto x$ and $v \mapsto y$, is a containment mapping from Q2 to Q1, so $Q_1 \sqsubseteq_B Q_2$
• Checking for the existence of a containment mapping is NP-complete

Canonical Databases
• Take the body of a CQ, "freeze" it into a database instance [CM 77], and tag each tuple with a "tuple id"
• We denote by $\mathrm{can}_K(Q)$ the canonical database for Q with abstract tags from K
• e.g., for Q(w) :- R(u,v), R(v,w), the annotations of the two tuples of R are:

                                                   R(u,v)          R(v,w)
    $\mathrm{can}_{N[X]}(Q) = \mathrm{can}_{B[X]}(Q)$    $x_1$           $x_2$
    $\mathrm{can}_{lin}(Q)$                              $\{x_1\}$       $\{x_2\}$
    $\mathrm{can}_{why}(Q)$                              $\{\{x_1\}\}$   $\{\{x_2\}\}$

Lineage-Containment of CQs
• Covering set of containment mappings: for every atom A in the body of P there is a containment mapping $h : Q \to P$ with A in the image of h
• Theorem: For CQs P, Q the following are equivalent:
  1. $P \sqsubseteq_{lin} Q$
  2. $P(\mathrm{can}_{lin}(P)) \subseteq_{lin} Q(\mathrm{can}_{lin}(P))$
  3. there is a covering set of containment mappings from Q to P
• Note: covering sets of containment mappings were identified in [CV 93] as a necessary (but not sufficient) condition for bag-containment of CQs

Why-Containment of CQs
• A containment mapping is onto if it induces a surjection on atoms
• Theorem: For CQs P, Q the following are equivalent:
  1. $P \sqsubseteq_{why} Q$
  2. $P(\mathrm{can}_{why}(P)) \subseteq_{why} Q(\mathrm{can}_{why}(P))$
  3. there is an onto containment mapping $h : Q \to P$
• Note: onto containment mappings were identified in [CV 93] as a sufficient (but not necessary) condition for bag-containment of CQs

B[X]- and N[X]-Containment of CQs
• A containment mapping is exact if it induces a bijection on atoms
• Theorem: For CQs P, Q and for $K \in \{B[X], N[X]\}$ the following are equivalent:
  1. $P \sqsubseteq_K Q$
  2. $P(\mathrm{can}_K(P)) \subseteq_K Q(\mathrm{can}_K(P))$
  3. there is an exact containment mapping $h : Q \to P$
• Another way to think of exact containment mappings: by unifying variables in Q, you get a query isomorphic to P

So Far
• K-containment of CQs is decidable for all the provenance models in the hierarchy
• Next, we indicate which steps in the hierarchy are strict, and which collapse:
    $B \prec \text{lineage} \prec \text{why-prov.} \prec B[X] \approx N[X]$

Separating the Models for $\sqsubseteq$ of CQs
• $B \prec \text{lineage}$:
    Q1(x,y) :- R(x,y), R(x,z)
    Q2(x,y) :- R(x,y)
  $Q_1 \sqsubseteq_B Q_2$ but $Q_1 \not\sqsubseteq_{lin} Q_2$
• $\text{lineage} \prec \text{why}$:
    Q1(x) :- R(x,y), R(x,z)
    Q2(x) :- R(x,y)
  $Q_1 \sqsubseteq_{lin} Q_2$ but $Q_1 \not\sqsubseteq_{why} Q_2$
• $\text{why} \prec B[X]$:
    Q1(x,y) :- R(x,y)
    Q2(x,y) :- R(x,y), R(x,z)
  $Q_1 \sqsubseteq_{why} Q_2$ but $Q_1 \not\sqsubseteq_{B[X]} Q_2$

From Containment to Equivalence
• Onto (or exact) containment mappings in both directions imply that the CQs are isomorphic, so why-provenance, B[X], and N[X] collapse to:
    $P \equiv_{why} Q \iff P \equiv_{B[X]} Q \iff P \equiv_{N[X]} Q \iff P \cong Q$
• In contrast, for lineage, having covering sets of containment mappings in both directions does not imply isomorphism (but equivalence is still decidable)

From CQs to UCQs
• For idempotent semirings (where + is idempotent) this is easy.
  B, PosBool(B), lineage, why-provenance, and B[X] are idempotent; N[X] is not (omitted)
• Proposition [after SY 80]: If K is idempotent, then for UCQs P, Q we have $P \sqsubseteq_K Q$ iff for every CQ P' in P there is a CQ Q' in Q such that $P' \sqsubseteq_K Q'$
• Corollary: For idempotent K, the problems of checking K-equivalence of CQs and K-equivalence of UCQs are polynomially equivalent

N[X]- and Bag-Equivalence of UCQs
• As with CQs, N[X]-equivalence of UCQs turns out to be the same as isomorphism:
    Theorem: for UCQs P, Q, $P \equiv_{N[X]} Q$ iff $P \cong Q$
• Moreover, N[X]-equivalence and N-equivalence of UCQs coincide:
    Theorem: for UCQs P, Q, $P \equiv_{N[X]} Q$ iff $P \equiv_N Q$
• Thus:
    Corollary: for UCQs P, Q, $P \equiv_N Q$ iff $P \cong Q$

Summary: Complexity Results
• Theorem: checking for a covering set of containment mappings, an onto containment mapping, or an exact containment mapping is NP-complete
• Checking for query isomorphism ($\cong$ in the table): believed to be neither in P nor NP-complete

                           B           PosBool(B)    N                             Lineage  Why-Pr.  B[X]     N[X]
  CQs   $\sqsubseteq_K$    NP [CM 77]  NP [PODS 07]  ? ($\Pi_2^p$-hard) [CV 93]    NP-ct    NP-ct    NP-ct    NP-ct
  CQs   $\equiv_K$         NP ibid.    NP ibid.      $\cong$ ibid.                 NP-ct    $\cong$  $\cong$  $\cong$
  UCQs  $\sqsubseteq_K$    NP [SY 80]  NP ibid.      undec [IR 95]                 NP-ct    NP-ct    NP-ct    PSPACE
  UCQs  $\equiv_K$         NP ibid.    NP ibid.      $\cong$                       NP-ct    NP-ct    NP-ct    $\cong$

Summary: Provenance Hierarchy
• CQs:
    $\sqsubseteq_K$:  $B \approx \text{PosB.}(B) \prec \text{Lineage} \prec N \prec \text{Why-Pr.} \prec B[X] \approx N[X]$
    $\equiv_K$:       $B \approx \text{PosB.}(B) \prec \text{Lineage} \prec N \approx \text{Why-Pr.} \approx B[X] \approx N[X]$
• UCQs:
    $\sqsubseteq_K$:  $B \approx \text{PosB.}(B) \prec \text{Lineage} \prec \text{Why-Pr.} \prec B[X] \prec N[X]$
    $\equiv_K$:       $B \approx \text{PosB.}(B) \prec \text{Lineage} \prec \text{Why-Pr.} \prec B[X] \prec N[X]$

Related Work
• Already mentioned:
  – Set-containment and equivalence of CQs [Chandra&Merlin 77]
  – Set-containment and equivalence of UCQs [Sagiv&Yannakakis 80]
  – Bag-containment of UCQs [Ioannidis&Ramakrishnan 95]
  – Bag-equivalence of CQs [Chaudhuri&Vardi 93]
• Containment of CQs with where-provenance [Tan 03]
• Bag-set semantics [CV 93], combined semantics [Cohen 06]
  – For K-relations: the support operator of [Geerts&Poggi 08] generalizes duplicate elimination
• Bag-containment of CQs [Jayram+ 06]

Future Work
• Loose ends:
  – Lower bound for N[X]-containment of UCQs (we gave only a PSPACE upper bound)
  – Generalize results for specific semirings to semirings with certain properties?
• Beyond UCQs: Datalog
  – Is K-containment of Datalog programs the same as set-containment when K is a distributive lattice?
  – Is bag-equivalence/N[X]-equivalence undecidable for Datalog?
• Could the semiring framework give any insight into bag-containment of CQs?
• Query optimization for annotated XML
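
As a companion to the containment-mapping theorems in the talk, the four conditions (plain, covering, onto, exact) can be checked by brute-force enumeration of mappings. This is a minimal sketch under stated assumptions, not the author's decision procedure: the CQ encoding, the helper names `mapped_bodies` and `check_containment`, and the reading of "bijection on atoms" as equality of atom multisets are all choices made here for illustration. Constants are not supported, and the enumeration is exponential in the number of variables.

```python
from collections import Counter
from itertools import product


def variables(cq):
    """All variables of a CQ given as (head, body)."""
    head, body = cq
    return sorted(set(head) | {v for _, args in body for v in args})


def mapped_bodies(q, p):
    """For each containment mapping h from CQ q to CQ p, yield h(body of q)."""
    (q_head, q_body), (p_head, p_body) = q, p
    q_vars, p_vars = variables(q), variables(p)
    p_atoms = set(p_body)
    for values in product(p_vars, repeat=len(q_vars)):  # try every h
        h = dict(zip(q_vars, values))
        if tuple(h[v] for v in q_head) != tuple(p_head):
            continue                                     # head must map to head
        image = [(rel, tuple(h[v] for v in args)) for rel, args in q_body]
        if all(atom in p_atoms for atom in image):       # atoms map to atoms
            yield image


def check_containment(p, q):
    """Mapping conditions for 'p contained in q' (mappings go from q to p)."""
    images = list(mapped_bodies(q, p))
    p_body = p[1]
    covered = set().union(*images)  # atoms of p hit by at least one mapping
    return {
        "plain": bool(images),                                   # K = B
        "covering": covered == set(p_body),                      # lineage
        "onto": any(set(img) == set(p_body) for img in images),  # why-prov.
        "exact": any(Counter(img) == Counter(p_body) for img in images),
    }


R = "R"
a1 = (("x", "y"), [(R, ("x", "y")), (R, ("x", "z"))])  # Q1(x,y) :- R(x,y), R(x,z)
a2 = (("x", "y"), [(R, ("x", "y"))])                   # Q2(x,y) :- R(x,y)
b1 = (("x",), [(R, ("x", "y")), (R, ("x", "z"))])      # Q1(x) :- R(x,y), R(x,z)
b2 = (("x",), [(R, ("x", "y"))])                       # Q2(x) :- R(x,y)

print(check_containment(a1, a2))  # only "plain" holds
print(check_containment(b1, b2))  # "plain" and "covering" hold
print(check_containment(a2, a1))  # everything except "exact" holds
```

The three calls replay the separating examples from the slide on separating the models: each pair satisfies the condition for one level of the hierarchy and fails the condition for the next level up.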