Containment of Conjunctive Queries on Annotated Relations
Todd J. Green, University of Pennsylvania
March 25, 2009 @ ICDT 09, Saint Petersburg

The Need for Data Provenance
• Many new database applications must track where data came from (as it is combined and transformed by queries, schema mappings, etc.): data provenance
  – Debugging schema mappings
  – Assessing data quality, trustworthiness
  – Computing probabilities
  – Enforcing access control policies
  – Preserving the scientific record
• Must do this while also satisfying DBMS performance requirements and retaining compatibility with legacy systems

Challenge: Provenance May Affect Query Optimization
• Query optimization strategies depend fundamentally on issues of query containment and equivalence
  – Query minimization, rewriting queries using materialized views, etc.
• Well-known difference between set and bag semantics: consider
    Q(x,y) :– R(x,y)
    Q’(u,v) :– R(u,v), R(u,w)
  Under set semantics, Q and Q’ are equivalent; under bag semantics, they are not!
  (“redundant” join in Q’ affects output tuple multiplicities)
• Issues pointed out in [Buneman+ 01], reiterated in [Buneman+ 08]

Contributions
• We study containment and equivalence of conjunctive queries (CQs) and unions of conjunctive queries (UCQs), for provenance models captured by semiring-annotated relations:
  – Provenance polynomials (ORCHESTRA system) [Green+ 07]
  – Why-provenance [Buneman+ 01]
  – Data warehousing lineage [Cui+ 01]
  – Trio system lineage [Das Sarma+ 08]
• We give positive decidability results and complexity characterizations in (nearly) all cases
• We show interesting connections with same problems under set semantics and bag semantics

Outline
• Semiring-annotated relations (K-relations)
• Bounds based on semiring homomorphisms
• Results for provenance polynomials
• Overview of other results

A Unifying Framework for Data Provenance: Semiring Annotated Relations [Green+ PODS 07]
• Basic idea: annotate source tuples with tuple ids, combine and propagate during query processing
  – Abstract “+” records alternative use of data (union, projection)
  – Abstract “·” records joint use of data (join)
  – Yields space of annotations K
• K-relation: a relation whose tuples are annotated with elements from K
  – Notation: R(t) means annotation of t in K-relation R

Combining Annotations in Queries
Source tuples annotated with tuple ids from K:

  A   ID | Species | Img     | Character  | State | Annotation
      34 | L.catta | [image] | hand color | white | p
      47 | L.catta | [image] | hand color | white | q

  B   ID | Character  | State | Annotation
      61 | hand color | black | r

  C   ID | Species     | Img     | Annotation
      61 | Lemur catta | [image] | s

  D   Species     | Comm. Name        | Annotation
      Lemur catta | Ring-tailed Lemur | u

Union of conjunctive queries (UCQ):
    E(name, color) :– A(id, species, _, “hand color”, color), D(species, name)
    E(name, color) :– B(id, “hand color”, color), C(id, species, _), D(species, name)

Operation x·y means joint use of data annotated by x and data annotated by y:

  E   Comm. Name        | Hand Color | Annotation
      Ring-tailed Lemur | black      | r·s·u
      Ring-tailed Lemur | white      | p·u
      Ring-tailed Lemur | white      | q·u

Operation x+y means alternate use of data annotated by x and data annotated by y:

  E   Comm. Name        | Hand Color | Annotation
      Ring-tailed Lemur | black      | r·s·u
      Ring-tailed Lemur | white      | p·u + q·u

What Properties Do K-Relations Need?
• DBMS query optimizers choose from among many plans, assuming certain identities:
  – union is associative, commutative
  – join associative, commutative, distributive over union
  – projections and selections commute with each other and with union and join (when applicable)
• Equivalent queries should produce same provenance!
• Proposition [Green+ 07]. Above identities hold for positive relational algebra queries on K-relations iff (K, +, ·, 0, 1) is a commutative semiring

What is a Commutative Semiring?
• An algebraic structure (K, +, ·, 0, 1) where:
  – K is the domain
  – + is associative, commutative with 0 identity
  – · is associative, commutative with 1 identity
  – · is distributive over +
  – ∀ a ∈ K, a · 0 = 0 · a = 0
  (unlike ring, no requirement for additive inverses)
• Big benefit of semiring-based framework: one framework unifies many database semantics

Semirings Unify Commonly-Used Database Semantics
Standard database models:
  (B, ∨, ∧, ⊥, ⊤)             Set semantics
  (ℕ, +, ·, 0, 1)             Bag semantics
Incomplete/probabilistic data:
  (PosBool(X), ∨, ∧, ⊥, ⊤)    Conditional tables [Imielinski&Lipski 84]
  (P(Ω), ∪, ∩, ∅, Ω)          Probabilistic event tables [Fuhr&Rölleke 97]
Also ranked query models, dissemination policies, ...

Semirings Unify Provenance Models
X a set of indeterminates, can be thought of as tuple ids
  (N[X], +, ·, 0, 1)          Provenance polynomials [Green+ 07]: “most informative”
  (Lin(X), ∪, ∪*, ∅, ∅*)      Data warehousing lineage [Cui+ 00]: sets of contributing tuples
  (Why(X), ∪, ⋓, ∅, {∅})      Why-provenance [Buneman+ 01]: sets of sets of contributing tuples
  (Trio(X), +, ·, 0, 1)       Trio-style lineage [Das Sarma+ 08]: bags of sets of contributing tuples
  (B[X], +, ·, 0, 1)          Boolean prov. polynomials

A Hierarchy of Provenance
Example, from most informative to least informative:
  N[X]:        2p²r + pr + 5r² + s
  B[X]:        p²r + pr + r² + s     (drop coefficients)
  Trio(X):     3pr + 5r + s          (drop exponents)
  Why(X):      pr + r + s            (drop both exp. and coeff.)
  PosBool(X):  r + s                 (apply absorption: pr + r ≡ r)
  Lin(X):      prs                   (collapse terms)
  B:           true                  (non-zero?)
A path downward from K1 to K2 indicates that there exists a surjective semiring homomorphism h : K1 → K2

What Does Query Containment Mean for K-Relations?
• Notion of containment based on natural order for K: a ≤_K b iff there exists c s.t.
  a + c = b
  – When this is a partial order, call K naturally ordered; all semirings considered here are naturally ordered
• Lift to K-relations: R ≤_K R’ iff for all tuples t, R(t) ≤_K R’(t)
  – For K = B (set semantics), this is set-containment
  – For K = ℕ (bag semantics), this is bag-containment
  – For K = PosBool(X), this is logical implication
• Queries on K-relations: say that Q is K-contained in Q’ iff for all K-relations R, Q(R) ≤_K Q’(R)

Provenance Hierarchy and Query Containment
[Figure: the provenance hierarchy from N[X] (most informative, strongest notion of containment) down to B (least informative, weakest notion of containment); nodes: N[X], Trio(X), B[X], any K (positive K), Why(X), N, Lin(X), PosBool(X), B]
A path downward from K1 to K2 also indicates that for UCQs Q1, Q2, if Q1 is K1-contained in Q2, then Q1 is K2-contained in Q2

Prov. Hierarchy and Query Containment (2)
• Provenance hierarchy tells us something about relative behavior of K-containment for various K
• Doesn’t tell us which implications are strict; we’d also like to know whether containment/equivalence is even decidable!
• One case already known: Theorem [Grahne+ 97]. If K is a distributive lattice, then for UCQs Q, Q’, Q is K-contained in Q’ iff Q is set-contained in Q’
  – Distributive lattices are between PosBool(X) (for c-tables) and B in previous slide
  – Other examples: dissemination policies, prob. event tables, ...

Summary: Logical Implications of Containment/Equivalence
[Figure: four implication diagrams over N[X], B[X], Trio(X), Why(X), Lin(X), PosBool(X), N, and B; panels: CQs, cont.; CQs, equiv.; UCQs, cont.; UCQs, equiv.]
“K1 → K2” indicates that for CQs (UCQs), K1 cont. (equiv.) implies K2 cont. (equiv.)
All implications not marked “ ” are strict. Red arrows are from [Grahne+ 97].
• CQs separating the various notions of K-containment (among other examples):
  – Q(u) :– R(u,v), R(u,w) and Q’(x) :– R(x,y): Q is Lin(X)-contained in Q’, but Q is not bag-contained in Q’
  – Q(x,y) :– R(x,y) and Q’(u,v) :– R(u,v), R(u,w): Q’ is set-contained in Q, but Q’ is not Lin(X)-contained in Q
• Bag-equivalence of UCQs implies K-equivalence for provenance models (in fact, bag-equivalence implies K-equivalence for any K)

Tools for Main Results: Containment Mappings, Canonical Databases
Theorem [Chandra&Merlin 77]. For CQs Q, Q’, the following are equivalent:
  1. Q is (set-)contained in Q’
  2. Q(can(Q)) ⊆ Q’(can(Q)), where can(Q) is the canonical database for Q
  3. There is a containment mapping h : vars(Q’) → vars(Q)
Most of our results follow this template, with two key differences:
  – We use provenance-annotated canonical databases: e.g., for Q(x,y) :– R(x,z), R(z,y), can_N[X](Q) is
        R =  x | z | p
             z | y | q
  – We use variations of containment mappings: e.g., exact containment mapping: a containment mapping h : vars(Q’) → vars(Q) that induces a bijection between atoms of Q and atoms of Q’

N[X]-Containment/Equivalence of CQs
Natural order for N[X]: monomial-wise comparison of coefficients
  e.g., p² ≤_N[X] 2p² + pq but p² ≰_N[X] p³
Theorem. For CQs Q, Q’, the following are equivalent:
  1. Q is N[X]-contained in Q’
  2. Q(can_N[X](Q)) ≤_N[X] Q’(can_N[X](Q))
  3. There is an exact containment mapping h : vars(Q’) → vars(Q)
and checking containment is NP-complete
Corollary. Q and Q’ are N[X]-equivalent iff they are isomorphic (and checking equivalence is graph isomorphism-complete)

N[X]-Containment/Equivalence of UCQs
Theorem. For UCQs Q, Q’, if Q is not N[X]-contained in Q’, then there is a small counterexample, i.e., an N[X]-relation R s.t.
  – Size of R (tuples and their annotations) is polynomial in |Q| + |Q’|
  – Q(R) ≰_N[X] Q’(R)
Corollary. N[X]-containment of UCQs is in PSPACE
  – Exact complexity: don’t know!
Theorem. For UCQs Q, Q’, Q is N[X]-equivalent to Q’ iff Q and Q’ are isomorphic (and checking is again graph isomorphism-complete)

Highlights of Other Results
• Why(X) and Trio(X): CQ containment based on onto containment mappings
• Lin(X): CQ containment based on covering containment mappings
• These kinds of containment mappings have been used before, for checking bag-containment of CQs [Chaudhuri&Vardi 93]!
  – Decidability of this problem: open
  – But, onto containment mappings sufficient for bag-containment
  – And, covering containment mappings necessary for bag-containment
• Hence for CQs, Why(X)/Trio(X)-containment and Lin(X)-containment “sandwich” bag-containment

N[X]-Equivalence and Bag-Equivalence
• Theorem. For UCQs, N[X]-equivalence is the same as bag-equivalence
  – Proof idea. For polynomials A, B in N[X], we have A = B iff for all valuations ν : X → N, Eval_ν(A) = Eval_ν(B)
• We have used this idea in another ICDT 09 paper; and results there for Z-relations also hold for Z[X]-relations
  – A fact used in ORCHESTRA system @ Penn for optimizing change propagation with provenance

Summary: Complexity of Checking Containment/Equivalence of CQs/UCQs

               | B  | PosBool(X) | Lin(X) | Why(X) | Trio(X) | B[X] | N[X]      | N
  CQs,  cont   | NP | NP         | NP     | NP     | NP      | NP   | NP        | ? (Π₂ᵖ-hard)
  CQs,  equiv  | NP | NP         | NP     | GI     | GI      | GI   | GI        | GI
  UCQs, cont   | NP | NP         | NP     | NP     | ?       | NP   | in PSPACE | undec
  UCQs, equiv  | NP | NP         | NP     | NP     | GI      | NP   | GI        | GI

Bold type indicates results of this paper
“NP” indicates NP-complete, “GI” indicates graph isomorphism-complete
NP-complete/GI-complete considered “tractable” here
  – Complexity in size of query; queries small in practice

Related Work on Query Containment
• Set semantics [Chandra&Merlin 77], [Sagiv&Yannakakis 80], ...
• Bag, bag-set semantics [Lovász 67], [Chaudhuri&Vardi 93], [Ioannidis&Ramakrishnan 95], [Cohen+ 99], [Jayram+ 06], ...
  – Label systems of [Ioannidis&Ramakrishnan 95]: similar in spirit to K-relations
• Bilattice-annotated relations [Grahne+ 97], parametric databases [Lakshmanan&Shiri 01]
  – Also similar in spirit to K-relations
• Minimal-witness why-prov. [Buneman+ 01], where-prov. [Tan 03]
• Z-relations/Z[X]-relations [Green+ 09]

Conclusion
• When optimizers rewrite queries, the provenance of query answers may change! This paper helps us understand how.
• We have given positive decidability results and complexity characterizations for CQ/UCQ containment/equivalence on various kinds of provenance-annotated databases
• For optimizations common in commercial DBMSs (i.e., those compatible with bag semantics), we have shown that they imply no change in provenance

Open Problems for Future Work
• Decidability of Trio(X)-containment of UCQs?
• Exact complexity of N[X]-containment of UCQs? (GI-hard, in PSPACE)
• Complexity when UCQs are represented as positive relational algebra queries (exponentially more concise than UCQs)?
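
Worked example (illustrative sketch)
As a recap of the talk's running example, here is a minimal Python sketch that evaluates Q(x,y) :– R(x,y) and Q’(u,v) :– R(u,v), R(u,w) on the same two-tuple relation under three semirings: B (set semantics), ℕ (bag semantics), and N[X] (provenance polynomials). It is not taken from the talk or from ORCHESTRA. The dictionary encoding of a semiring, the helper names (var, evaluate, poly_leq), the sample relation, and the brute-force evaluation over the active domain are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical encoding: a semiring (K, +, ·, 0, 1) as a dict of its parts.
BOOL = {"zero": False, "one": True,
        "add": lambda a, b: a or b, "mul": lambda a, b: a and b}
NAT = {"zero": 0, "one": 1,
       "add": lambda a, b: a + b, "mul": lambda a, b: a * b}

# N[X]: a polynomial is a Counter mapping a monomial to its coefficient.
# A monomial is a sorted tuple of indeterminates, repeated per exponent,
# e.g. ("p", "p", "r") stands for p²r.
def poly_mul(a, b):
    out = Counter()
    for (m1, c1), (m2, c2) in product(a.items(), b.items()):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

POLY = {"zero": Counter(), "one": Counter({(): 1}),
        "add": lambda a, b: a + b, "mul": poly_mul}

def var(x):
    """The polynomial consisting of the single indeterminate x."""
    return Counter({(x,): 1})

def poly_leq(a, b):
    """Natural order on N[X]: monomial-wise comparison of coefficients."""
    return all(b[m] >= c for m, c in a.items())

def evaluate(head, atoms, db, K):
    """Evaluate the CQ `head :- atoms` on K-relations by brute force:
    multiply annotations across atoms (joint use), then add the products
    of all valuations that yield the same output tuple (alternative use)."""
    variables = sorted({v for _, args in atoms for v in args})
    domain = sorted({c for rel in db.values() for t in rel for c in t})
    result = {}
    for values in product(domain, repeat=len(variables)):
        nu = dict(zip(variables, values))
        ann = K["one"]
        for rel, args in atoms:
            t = tuple(nu[a] for a in args)
            ann = K["mul"](ann, db[rel].get(t, K["zero"]))
        if ann != K["zero"]:
            out = tuple(nu[h] for h in head)
            result[out] = K["add"](result.get(out, K["zero"]), ann)
    return result

# The two queries from the talk, as (head variables, body atoms).
Q = (("x", "y"), [("R", ("x", "y"))])
Qp = (("u", "v"), [("R", ("u", "v")), ("R", ("u", "w"))])

# One sample relation R = {(a,b), (a,c)}, annotated in each semiring.
rows = [("a", "b"), ("a", "c")]
db_set = {"R": {t: True for t in rows}}
db_bag = {"R": {t: 1 for t in rows}}
db_poly = {"R": {("a", "b"): var("p"), ("a", "c"): var("q")}}

set_equal = evaluate(*Q, db_set, BOOL) == evaluate(*Qp, db_set, BOOL)
bag_q, bag_qp = evaluate(*Q, db_bag, NAT), evaluate(*Qp, db_bag, NAT)
poly_q, poly_qp = evaluate(*Q, db_poly, POLY), evaluate(*Qp, db_poly, POLY)

print("same answers under set semantics:", set_equal)
print("bag multiplicities:", bag_q, "vs", bag_qp)
print("provenance of (a,b):", dict(poly_q[("a", "b")]),
      "vs", dict(poly_qp[("a", "b")]))
print("Q(R) ≤ Q'(R) at (a,b) in N[X]:",
      poly_leq(poly_q[("a", "b")], poly_qp[("a", "b")]))
```

On this one instance the two queries return the same answers under set semantics, different multiplicities under bag semantics, and provenance polynomials (p versus p² + pq for the tuple (a,b)) that are incomparable in the natural order of N[X]. A single instance can refute containment or equivalence but cannot establish it; the theorems in the talk use canonical databases and containment mappings for that.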