Extending Where-Provenance to deal with Primitive Data Value Operations

Stijn Vansummeren
Hasselt University and Transnational University of Limburg, Belgium

S. Vansummeren: Where-Provenance Revisited

Introduction

Chris curates a DB:

    create table DB(Beer, Type, Origin)
    insert into DB values(. . . )
    insert into DB select * from S where origin='USA'
    insert into DB select * from T where type='stout'

Contents of DB, grouped by the statement that added each row:

    Added by                               Beer      Type   Origin
    insert into DB values(. . . )          Duvel     blond  Belgium
    insert ... from S where origin='USA'   Heineken  blond  USA
    insert ... from S where origin='USA'   Bud       blond  USA
    insert ... from T where type='stout'   Guinness  stout  Ireland
- Where-provenance is vital for assessing trustworthiness
- Manual provenance recording is tedious and error-prone
- Automatic provenance recording support?

Outline

1 Current Where-Provenance for Queries
2 Extending Where-Provenance to deal with Primitive Data Value Operations

A simple model of data where-provenance

Copy-paste model: Buneman, Chapman, Cheney (SIGMOD 2006)

Ingredients:
- A space of data objects
- Every (sub)object is identifiable
- Provenance links only if copy
- No link? → Object is constructed

To store provenance, store identifier/color.

[Figure: example data space with colored (sub)objects, including tables over the columns A, B, and C, and a result table over A and B whose copied entries are linked to their sources.]

Propagation approach to where-provenance

Implicit where-provenance:
- Wang and Madnick (VLDB 1990) and Bhagwat et al. (VLDB 2004): for atomic data values in queries.
- Buneman et al. (ICDT 2007): provenance for tables, tuples, and atomic data values in queries and updates.

Example. On the input table R

    A  B
    1  2
    8  1

the query

    P[(select * from R where A <> 1)
      union
      (select A, 5 as B from R where A = 1)]

returns

    A  B
    1  5
    8  1

[Figure: the input and output tables are shown with colored cells; the colors are not recoverable from this text.]

Queries

Objects: atomic data values, records, sets

Complex object queries:

    e ::= x | a | (A : e, ..., B : e′) | e.A
        | {e, ..., e′} | ⋃e | {e′ | x ∈ e}
        | if e1 = e2 then et else ef

Example:

    select * from R where A <> 1  ≡  ⋃{if x.A = 1 then ∅ else {x} | x ∈ R}

Provenance for queries

Queries create new objects:
- Objects copied from the input retain their color.
- Objects resulting from constant, record, or set constructor are colored blank.

Example: P[R] returns the table R itself (rows (1, 2) and (8, 1)), with all colors retained.
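The two coloring rules can be tried out on the running example. The following is only a minimal sketch, not the formal semantics from the cited papers: the class names `Atom`, `Rec`, and `CSet`, the helper functions, the color names `c1` to `c7`, and the choice of `None` for the blank color are all assumptions made for illustration, and the sketch assumes that comprehension and union build a new (blank) set.

```python
from dataclasses import dataclass
from typing import Optional

BLANK = None  # assumption: the "blank" color is modeled as None


@dataclass(frozen=True)
class Atom:
    value: object
    color: Optional[str]


@dataclass(frozen=True)
class Rec:
    fields: tuple          # ((name, colored object), ...)
    color: Optional[str]

    def get(self, name):   # e.A: projection copies the field, color included
        return dict(self.fields)[name]


@dataclass(frozen=True)
class CSet:
    elems: frozenset
    color: Optional[str]


# Constructors always produce blank objects.
def const(a):
    return Atom(a, BLANK)

def record(**fields):
    return Rec(tuple(fields.items()), BLANK)

def mkset(*elems):
    return CSet(frozenset(elems), BLANK)

def big_union(s):
    return CSet(frozenset(x for inner in s.elems for x in inner.elems), BLANK)

def comprehension(body, s):
    return CSet(frozenset(body(x) for x in s.elems), BLANK)

def if_eq(e1, e2, e_then, e_else):
    # The test looks at values only, never at colors (atoms only in this sketch).
    return e_then if e1.value == e2.value else e_else


# Input table R from the slides; the color names c1..c7 are made up.
t1 = Rec((("A", Atom(1, "c1")), ("B", Atom(2, "c2"))), "c3")
t2 = Rec((("A", Atom(8, "c4")), ("B", Atom(1, "c5"))), "c6")
R = CSet(frozenset({t1, t2}), "c7")

# select * from R where A <> 1
selected = big_union(comprehension(
    lambda x: if_eq(x.get("A"), const(1), mkset(), mkset(x)), R))

# {(A : x.A, B : x.B) | x in R}
rebuilt = comprehension(lambda x: record(A=x.get("A"), B=x.get("B")), R)
```

In this sketch `selected` contains the original tuple `t2` with its color intact inside a blank set, while `rebuilt` contains blank records whose fields still carry the colors of the copied atoms.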
Further examples on the same input table R (rows (1, 2) and (8, 1)):

- P[{x | x ∈ R}] returns rows (1, 2) and (8, 1).
- P[⋃{if x.A = 1 then ∅ else {x} | x ∈ R}] returns the single row (8, 1).
- P[{(A : x.A, B : x.B) | x ∈ R}] returns rows (1, 2) and (8, 1).

[Figure: each result table is shown with colored cells; the colors are not recoverable from this text.]

Sanity Check

The implicit provenance semantics is sound:
every non-blank object in the output is identical to the unique object in the input with the same color.

The slides illustrate this on the examples P[R], P[{x | x ∈ R}], P[⋃{if x.A = 1 then ∅ else {x} | x ∈ R}], and P[{(A : x.A, B : x.B) | x ∈ R}].
Implicit vs explicit provenance

Theorem: For every query e there exists a query f that explicitly implements the implicit provenance semantics P[e].

[Figure: commuting diagram. A colored table is mapped by P[e] to a colored result; the conversion "to exp" turns each colored table into an explicit table of (clr, val) pairs, and f maps the explicit input to the explicit result.]

Expressiveness

Question: Can every sound explicit f be expressed implicitly?

Answer: No, the provenance semantics P[e] of a query e is:
1 Recording
2 Bounded Inventing

Theorem: Every explicit, sound, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[e] of a query e.

The implicit provenance semantics P[e] of a query e is recording:
- The explicit expression for P[e] never compares colors: if e1 = e2 then et else ef can occur only if e1 and e2 do not output colors.
- P[e] is therefore also propagating: it commutes with recolorings.

[Figure: commuting square. Applying a recoloring ρ to the input and then P[e] gives the same result as applying P[e] first and then ρ.]
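The "commutes with recolorings" property can be checked on a small instance. The sketch below rests on several assumptions not taken from the slides: tables are lists of rows, every cell is a `(value, color)` pair, a recoloring is a dictionary that renames colors and leaves blank (`None`) untouched, and the function names and colors `c1`, `d1`, and so on are made up. The function `color_peeking` is a hypothetical example of a query that inspects colors and therefore fails the check.

```python
BLANK = None  # assumption: the blank color is modeled as None


def recolor(table, rho):
    """Rename the color of every cell according to rho; blank stays blank."""
    return [{k: (v, rho.get(c, c)) for k, (v, c) in row.items()}
            for row in table]


def query(table):
    """Sketch of P[(select * from R where A <> 1)
                   union (select A, 5 as B from R where A = 1)]."""
    out = []
    for row in table:
        if row["A"][0] != 1:
            out.append(dict(row))                           # copied cells keep colors
        else:
            out.append({"A": row["A"], "B": (5, BLANK)})    # constant 5 is blank
    return out


def color_peeking(table):
    """Not recording: the selection condition compares a color."""
    return [row for row in table if row["A"][1] == "c1"]


R = [{"A": (1, "c1"), "B": (2, "c2")},
     {"A": (8, "c3"), "B": (1, "c4")}]
rho = {"c1": "d1", "c2": "d2", "c3": "d3", "c4": "d4"}

commutes = query(recolor(R, rho)) == recolor(query(R), rho)
peeking_commutes = color_peeking(recolor(R, rho)) == recolor(color_peeking(R), rho)
```

Here `commutes` holds because `query` only ever copies colors or emits blank, whereas `peeking_commutes` fails because the color `c1` no longer exists after recoloring.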
[Figure: the same commuting square for the explicit representation. Applying ρ to a table of (clr, val) pairs and then f gives the same result as applying f first and then ρ.]

Corollary: Non-propagating explicit f cannot be expressed implicitly.
But such f query provenance rather than define it.

Related work: [Geerts et al, ICDE 2006] and [Geerts and Van den Bussche, DBPL 2007] consider the queries f that explicitly query provenance, and show them to be expressively complete with respect to the color algebra, an extension of the relational algebra that implicitly queries provenance.

The implicit provenance semantics P[e] of a query e is bounded inventing:
- Only constants appearing in e will be colored blank.

[Figure: an explicit f mapping a table of (clr, val) pairs to a result table of (clr, val) pairs.]

Corollary: Unbounded inventing explicit f cannot be expressed implicitly.
But such f are not "domain preserving" w.r.t. provenance.
Expressive Completeness

Theorem [Buneman, Cheney, V. (ICDT 2007)]: Every explicit, sound, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[e] of a query e.

[Figure: commuting diagram. The conversion "to impl" turns explicit tables of (clr, val) pairs into colored tables; f on the explicit side corresponds to P[e] on the colored side.]

Implicit Provenance Semantics is a reasonable candidate to record provenance automatically since we do not lose flexibility.

However ...

We're lacking important real-world features: operations on atomic data, aggregates, external function calls, . . .

Example. On the input table R (rows (1, 2) and (8, 1)), the query

    P[(select * from R where A <> 1)
      union
      (select A, B + 3 from R where A = 1)]

returns rows (1, 5) and (8, 1).

- This is not bounded-inventing.
- Expressive completeness fails without the bounded-inventing requirement.

Outline

1 Current Where-Provenance for Queries
2 Extending Where-Provenance to deal with Primitive Data Value Operations

Revisiting the simple model of where-provenance

Ingredients:
- Every (sub)object is identifiable
- Provenance links only if copy
- No link? → Object is constructed

Let's record how objects are constructed!

[Figure: the example data space from the copy-paste model, with colored (sub)objects.]
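To see why B + 3 breaks bounded inventing, one can color the result of the addition blank, as the earlier rules would suggest, and look at which values end up blank. This is only an illustrative sketch under that assumption; the table encoding (rows of `(value, color)` cells), the function names, and the two sample inputs are hypothetical.

```python
BLANK = None  # assumption: the blank color is modeled as None


def plus3_query(table):
    """select A, B + 3 from R where A = 1, with the computed value colored blank."""
    return [{"A": row["A"], "B": (row["B"][0] + 3, BLANK)}
            for row in table if row["A"][0] == 1]


def blank_values(table):
    """All values that carry the blank color in a table."""
    return {v for row in table for (v, c) in row.values() if c is BLANK}


QUERY_CONSTANTS = {1, 3}  # the only constants written in the query text

R1 = [{"A": (1, "c1"), "B": (2, "c2")}]
R2 = [{"A": (1, "c1"), "B": (40, "c2")}]

blank1 = blank_values(plus3_query(R1))
blank2 = blank_values(plus3_query(R2))
```

The blank values (5 for `R1`, 43 for `R2`) depend on the input and are not among the constants of the query, so no fixed finite set of constants bounds what is colored blank.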
Provenance terms by example

Data space: the table R with rows (1, 2) and (8, 1). In the slides, the terms below contain colored boxes; these are not recoverable from this text and are written as [color].

    Provenance term                     Evaluates to
    [color]                             the unique object identified by this color
    (A : [color], B : 5)                the record (A : 1, B : 5)
    {(A : [color], B : 5), [color]}     the table with rows (1, 5) and (8, 1)
    [color] + 3                         5
    avg{[color], [color]}               1.5

Queries (Revisited)

Objects: atomic data values, records, sets

Complex object queries with primitives p : σ → atom:

    e ::= x | a | (A : e, ..., B : e′) | e.A
        | {e, ..., e′} | ⋃e | {e′ | x ∈ e}
        | if e1 = e2 then et else ef
        | sum(e) | e + e | ...

Example:

    select A, B + 3 from R where A = 1  ≡  ⋃{if x.A = 1 then {(A : x.A, B : x.B + 3)} else ∅ | x ∈ R}

Provenance for Queries (Revisited)

Queries create new objects:
- Objects copied from the input retain their color.
- Atomic constant, record, set constructor, external function induce corresponding operation on color terms.

Examples on the input table R (rows (1, 2) and (8, 1)); [color] stands for a colored box that is not recoverable from this text:

- P[R] returns R itself.
- P[{x | x ∈ R}] returns rows (1, 2) and (8, 1); the result set has provenance term {[color], [color]}.
- P[{(A : x.A, B : x.B) | x ∈ R}] returns rows (1, 2) and (8, 1); the result set has provenance term {(A : [color], B : [color]), (A : [color], B : [color])}, and each row has a provenance term of the form (A : [color], B : [color]).
Example with a primitive operation, on the input table R (rows (1, 2) and (8, 1)); [color] stands for a colored box that is not recoverable from this text:

    P[(select * from R where A <> 1)
      union
      (select A, B + 3 from R where A = 1)]

returns rows (1, 5) and (8, 1). The result set has provenance term {(A : [color], B : [color] + 3), [color]}, the row (1, 5) has provenance term (A : [color], B : [color] + 3), and the value 5 has provenance term [color] + 3.

Sanity Check

This provenance semantics is sound:
- Every object in the output is identical to the evaluation of its provenance term.
- Colors are used consistently. E.g., if a tuple has provenance term (A : [color], B : 5), then its A-field has provenance term [color].

Implicit vs explicit provenance (Revisited)

Theorem: For every query e there exists a query f that explicitly implements the implicit provenance semantics P[e].

[Figure: commuting diagram as before, now with provenance terms t1, ..., t7 in place of plain colors; "to exp" converts colored tables into explicit tables of (clr, val) pairs.]

Explicit Query Language:

    e ::= x | a | (A : e, ..., B : e′) | e.A
        | {e, ..., e′} | ⋃e | {e′ | x ∈ e}
        | if e1 = e2 then et else ef
        | sum(e) | e + e | ...
        | Pa | P(A : e, ..., B : e′) | P{e, ..., e′} | Psum(e) | e +P e | ...

Expressive Completeness (Revisited)

Question: Can every sound explicit f be expressed implicitly?

Answer: No, the provenance semantics P[e] of a query e is:
1 Recording (no comparison between prov terms)
2 Bounded Inventing (only finitely many constant atoms in prov terms)
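The provenance terms used in these slides can be modeled as a small expression tree that is evaluated against the data space. The following is a rough sketch rather than the construction from the talk: the class names, the `PRIMS` table, the representation of records as tuples of pairs, and the color names `c1`, `c2`, `c4`, `c5` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Color:        # refers to the unique input object with this color
    name: str


@dataclass(frozen=True)
class Const:        # an atomic constant written in the query
    value: object


@dataclass(frozen=True)
class RecTerm:      # record constructor applied to terms
    fields: tuple   # ((name, term), ...)


@dataclass(frozen=True)
class SetTerm:      # set constructor applied to terms
    elems: tuple


@dataclass(frozen=True)
class Prim:         # primitive operation such as + or avg
    op: str
    args: tuple


PRIMS = {
    "+": lambda a, b: a + b,
    "avg": lambda s: sum(s) / len(s),
}


def evaluate(term, space):
    """Evaluate a provenance term in a data space mapping colors to objects."""
    if isinstance(term, Color):
        return space[term.name]
    if isinstance(term, Const):
        return term.value
    if isinstance(term, RecTerm):
        return tuple((k, evaluate(t, space)) for k, t in term.fields)
    if isinstance(term, SetTerm):
        return frozenset(evaluate(t, space) for t in term.elems)
    if isinstance(term, Prim):
        return PRIMS[term.op](*(evaluate(a, space) for a in term.args))
    raise TypeError(f"unknown term: {term!r}")


# Atomic values of R: row (A=1, B=2) and row (A=8, B=1); color names are made up.
space = {"c1": 1, "c2": 2, "c4": 8, "c5": 1}

plus3 = Prim("+", (Color("c2"), Const(3)))                  # [color] + 3
average = Prim("avg", (SetTerm((Color("c2"), Color("c5"))),))
row = RecTerm((("A", Color("c1")), ("B", plus3)))           # (A : [color], B : [color] + 3)
```

Soundness in this setting means that an output object equals the evaluation of its term: the value 5 in the result row (1, 5) equals `evaluate(plus3, space)`. The only constant atom in these terms is the 3 written in the query, which is what the revised bounded-inventing condition restricts.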
Theorem: Every sound explicit, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[e] of a query e.

[Figure: commuting diagram. The conversion "to impl" turns explicit tables of (clr, val) pairs, now carrying provenance terms t1, ..., t7, into colored tables; f on the explicit side corresponds to P[e] on the colored side.]

Implicit Provenance Semantics is a reasonable candidate to record provenance automatically since we do not lose flexibility.

Wrapping up

In conclusion: Provenance terms help preserve expressive completeness in the presence of primitive operations on atomic data values.

Future work:
- Efficient implementation?
- Updates?