Version June 20, 2011

Default-all is dangerous!

Wolfgang Gatterbauer, Alexandra Meliou, Dan Suciu
3rd USENIX Workshop on the Theory and Practice of Provenance (Tapp'11)
Database group, University of Washington
http://db.cs.washington.edu/causality/

Overview: Provenance Definitions

Naive "SQL interpretation" definition
• Why? (witness): why-provenance = witness basis (αw). Buneman et al. [ICDT’01]
• Where? (propagation): where-provenance = propagation (αp). Buneman et al. [PODS’02]

QRI (Query-Rewrite-Insensitive) definition
• Why? (witness): minimal witness basis (αwm). Buneman et al. [ICDT’01]
• Where? (propagation): default-all propagation (αpd). Bhagwat et al. [VLDB’04]. Has problems if one interprets annotations on attribute values.
• Where? (propagation): minimal propagation (αpm). Proposed in this paper!

We do not discuss here whether QRI is desirable (see also Glavic, Miller [Tapp'11], independent work presented at this workshop), but merely point out that, if aiming for QRI, care has to be taken about the ramifications of the proposed semantics.
Note that minimal propagation is "stable", in contrast to default-all.

Example 1: Query-Rewrite-Insensitivity (QRI)

Why

Input R:
      A  B
t1    1  2
t2    1  3
t3    2  2

Query 1: Q1(x,y) :- R(x,y)

A  B  Why-provenance = witness basis (αw)
1  2  {{t1}}
1  3  {{t2}}
2  2  {{t3}}

Query 2 ≡ Query 1: Q2(x,y) :- R(x,y), R(_,y)

A  B  Witness basis (αw)   Minimal witness basis (αwm)   Lineage (αl)
1  2  {{t1},{t1,t3}}       {{t1}}                        {t1,t3}
1  3  {{t2}}               {{t2}}                        {t2}
2  2  {{t3},{t1,t3}}       {{t3}}                        {t1,t3}

Where (annotations are written after a caret: 2^b,f is the value 2 carrying annotations b and f)

Input Ra:
A    B
1^a  2^b
1^c  3^d
2^e  2^f

Query 1: Q1(x,y) :- Ra(x,y)

Where-provenance = propagation (αp):
A    B
1^a  2^b
1^c  3^d
2^e  2^f

Query 2 ≡ Query 1: Q2(x,y) :- Ra(x,y), Ra(_,y)

Where-provenance = propagation (αp):
A    B
1^a  2^b,f
1^c  3^d
2^e  2^b,f

Default-all propagation (αpd):
A      B
1^a,c  2^b,f
1^a,c  3^d
2^e    2^b,f

Minimal propagation (αpm):
A    B
1^a  2^b
1^c  3^d
2^e  2^f

Example adapted from Cheney et al. [Foundations and Trends in DBs’09]

Real example: Why Default-all is dangerous

Hanako queries a community DB for contents of LF-milk*:

Community Database Ra:
Food      Content
LF Milk   Cesium-137^b
LF Milk   Calcium^d
SC Water  Cesium-137^f

b: Bob, March 18, 2011: Don't drink, lots of Cesium!
f: Fuyumi, March 19, 2011: No Cesium, safe to drink!

Hanako's query: Q(y) :- Ra(‘LF Milk’,y)

Content
Cesium-137^???
Calcium^d

Default-all propagation makes her drink the milk:

Default-all propagation (αpd):
Content
Cesium-137^b,f
Calcium^d

b: Bob, March 18, 2011: Don't drink, lots of Cesium!
f: Fuyumi, March 19, 2011: No Cesium, safe to drink!

"semantically irrelevant information": annotations leak over from the SC Water tuple to LF Milk.

Minimal propagation (αpm):
Content
Cesium-137^b
Calcium^d

b: Bob, March 18, 2011: Don't drink, lots of Cesium!

"all relevant and only relevant"

* Note the one-to-one correspondence of this example with Example 1.

Definition: Minimal propagation (αpm)

Intuition: Return the intersection between:
• query-specific where-provenance (αp)
• and QRI minimal witness basis (αwm), which transforms 'sets of sets' into 'sets', hence something like QRI lineage

Example 1

Input Ra:
      A    B
t1    1^a  2^b
t2    1^c  3^d
t3    2^e  2^f

Query 2: Q2(x,y) :- Ra(x,y), Ra(_,y)

Where-provenance (αp), with αwm on the left and the minimal witness basis (αwm) on the right:
αwm    A    B      Minimal witness basis (αwm)
{t1}   1^a  2^b,f  {{t1}}
{t2}   1^c  3^d    {{t2}}
{t3}   2^e  2^b,f  {{t3}}

Minimal propagation (αpm):
      A    B
t4    1^a  2^b
t5    1^c  3^d
t6    2^e  2^f

"all relevant ... and only relevant"

Example 1: Illustration of "minimal" versus "all"

[Figure. Why-provenance: why-provenance (αw) versus minimal witness basis (αwm). Where-provenance: where-provenance (αp) versus default-all propagation (αpd) and minimal propagation (αpm).]

Interpretation of Annotations 1: Attribute Value*

Annotations on values of an attribute (here "population") for a particular entity (here "Athens").

Argument: Interpreting cell annotations as relevant to the tuple (entity) adds something that is not trivially modeled with normalized tables.

* Interpretation of annotations on entity attribute values favored by us and underlying our model

Interpretation of Annotations 2: Domain Value*

Domain value annotations*

Input Ra:
A    B
1^a  2^b
1^c  3^d
2^e  2^f

b: Bob, March 18, 2011: This number is a prime number.
f: Fuyumi, March 19, 2011: Two is not a prime number because it is even.

Input Sa: a table with a Date column (other columns elided) in which the value Dec 25 appears twice.

b: This is a holiday.
f: This is a holiday too !!!

Argument for default-all: If annotations are on domain values, then all annotations are relevant and should be retrieved.

* Alternative representation

Annotation table Sa:
B  annotation
2  b: Bob, March 18, 2011: This number is a prime number.
2  f: Fuyumi, March 19, 2011: Two is not a prime number because it is even.

Annotation table Sa:
Date    annotation
Dec 25  This is a holiday.
Counter-Argument: But then these annotations can be modeled in a separate, normalized table.

Alternative interpretation suggested by Wang-Chiew Tan (example created after conversation at Sigmod 2011)

Backup: Detailed Example 2

Input Ra:
      A    B
t1    1^a  2^b
t2    1^c  3^d
t3    2^e  2^f
t4    2^g  4^h

Q5(x,y) :- Ra(x,y), Ra(y,_), Ra(x,_)

Where-provenance (αp):
      A      B
t5    1^a,c  2^b,e,g
t6    2^e,g  2^e,f,g

Why-provenance (αw), minimal witness basis (αwm), and αwm (~QRI lineage):
      Why-provenance (αw)                        Minimal witness basis (αwm)   αwm (~QRI lineage)
t5    {{t1,t3},{t1,t2,t3},{t1,t4},{t1,t2,t4}}    {{t1,t3},{t1,t4}}             {t1,t3,t4}
t6    {{t3},{t3,t4}}                             {{t3}}                        {t3}

Default-all propagation (αpd):
A      B
1^a,c  2^b,e,f,g
2^e,g  2^b,e,f

Minimal propagation (αpm):
      A    B
t4    1^a  2^b,e,g
t5    2^e  2^e,f

αpd(t4,B,Q5) = αp(t4,B,Q6) with Q6(x,y) :- Ra(x,y), Ra(y,_), Ra(x,_), Sa(_,y)

Note that minimal propagation is not equivalent to just evaluating the where-provenance for the query Q7(x,y) :- Ra(x,y), Ra(y,_). E.g., αp(t5,B,Q7) = {e,f,g}.
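
The numbers in Detailed Example 2 can be re-derived by brute force. The sketch below is not from the slides: the names RA, provenance, minimal_witnesses and minimal_propagation are hypothetical, and it assumes one reading of the definition, namely that minimal propagation keeps exactly those where-provenance annotations that sit on an input tuple occurring in some minimal witness. It also assumes each annotation labels a single cell, as in the example.

```python
from itertools import product

# Annotated input Ra of Example 2: tuple id -> ((A value, A annotation), (B value, B annotation)).
RA = {
    "t1": ((1, "a"), (2, "b")),
    "t2": ((1, "c"), (3, "d")),
    "t3": ((2, "e"), (2, "f")),
    "t4": ((2, "g"), (4, "h")),
}

def provenance(head, body, rel):
    """Map each output tuple to (witness basis, where-provenance per head position).

    A body atom is a tuple of terms; a term is a variable name or None for "_".
    """
    result = {}
    for choice in product(rel, repeat=len(body)):  # one input tuple id per body atom
        binding, ok = {}, True
        for atom, tid in zip(body, choice):
            for term, (value, _) in zip(atom, rel[tid]):
                if term is not None and binding.setdefault(term, value) != value:
                    ok = False  # the same variable would need two different values
        if not ok:
            continue
        out = tuple(binding[v] for v in head)
        witnesses, where = result.setdefault(out, (set(), [set() for _ in head]))
        witnesses.add(frozenset(choice))
        # Propagate annotations from every body position that holds a head variable.
        for i, v in enumerate(head):
            for atom, tid in zip(body, choice):
                for term, (_, ann) in zip(atom, rel[tid]):
                    if term == v:
                        where[i].add(ann)
    return result

def minimal_witnesses(witnesses):
    """Keep only witnesses that have no proper subset in the witness basis."""
    return {w for w in witnesses if not any(o < w for o in witnesses)}

def minimal_propagation(head, body, rel):
    """Assumed reading: where-provenance restricted to tuples in some minimal witness."""
    out = {}
    for t, (witnesses, where) in provenance(head, body, rel).items():
        relevant = set().union(*minimal_witnesses(witnesses))
        allowed = {ann for tid in relevant for (_, ann) in rel[tid]}
        out[t] = [anns & allowed for anns in where]
    return out

Q5 = (("x", "y"), [("x", "y"), ("y", None), ("x", None)])
Q7 = (("x", "y"), [("x", "y"), ("y", None)])

mp = minimal_propagation(*Q5, RA)
assert mp[(1, 2)] == [{"a"}, {"b", "e", "g"}]   # matches the αpm row 1^a, 2^b,e,g
assert mp[(2, 2)] == [{"e"}, {"e", "f"}]        # matches the αpm row 2^e, 2^e,f
assert provenance(*Q7, RA)[(2, 2)][1][1] == {"e", "f", "g"}  # αp on Q7 keeps g
```

Enumerating every combination of input tuples is exponential in the number of body atoms, which is fine for a four-tuple example but is only meant to make the definitions concrete. Default-all propagation is left out because it would require enumerating equivalent queries.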