On the Expressiveness of Implicit Provenance in
Query and Update Languages
Peter Buneman (1), James Cheney (1), Stijn Vansummeren (2)
(1) University of Edinburgh, Scotland
(2) Hasselt University and Transnational University of Limburg, Belgium
P. Buneman, J. Cheney, S. Vansummeren
The Expressiveness of Implicit Provenance
Introduction
Provenance is:
the history of ownership of a valued object or work of art/literature
a record of origin, modification, influences
used to validate authenticity, integrity, validity of an object
Valuable because it is hard to collect and verify
Introduction
Chris curates a DB:
create table DB(Beer, Type, Origin)
insert into DB values ( . . . )
insert into DB select * from S where origin='USA'
insert into DB select * from T where type='stout'

Beer      Type   Origin
Duvel     blond  Belgium
Heineken  blond  USA
Bud       blond  USA
Guinness  stout  Ireland
Provenance is vital for assessing trustworthiness
Manual provenance recording is tedious and error-prone
Automatic provenance recording support?
Introduction
Problem statement
Given a query or update, define for each item in the output where (if anywhere) it comes from in the input.

Input R:
A B
1 2
8 1

(select * from R where A <> 1) union (select A, 5 as B from R where A = 1)

Output:
A B
1 5
8 1

Different definitions possible:
Wang and Madnick (VLDB 1990)
Buneman et al. (ICDT 2001)
Bhagwat et al. (VLDB 2005)
Which are suitable/to be preferred and why?
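A minimal sketch of the running example, assuming Python's standard sqlite3 module as a stand-in for the database (the slides do not name a system). SQLite does not accept parenthesised operands of union, so the parentheses are dropped. The result contains values only: nothing in it says where the 5 or the tuple (8, 1) came from, which is the gap the problem statement points at.

```python
import sqlite3

# In-memory database holding the example table R from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("create table R (A integer, B integer)")
conn.executemany("insert into R values (?, ?)", [(1, 2), (8, 1)])

# The query from the slide: keep rows with A <> 1, set B to 5 where A = 1.
query = """
    select * from R where A <> 1
    union
    select A, 5 as B from R where A = 1
"""
rows = sorted(conn.execute(query).fetchall())
print(rows)  # [(1, 5), (8, 1)]
```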
Other work on provenance
Why-provenance/lineage: describes, for each output item, which input items influenced it.
Woodruff and Stonebraker (ICDE 1997), Cui et al. (TODS 2000), Buneman et al. (ICDT 2001), Green et al. (PODS 2007)
Workflow provenance: for data processed through arbitrary workflows instead of queries/updates.
Bose and Frew (ACM Computing Surveys, 2005), Davidson et al. (Data Engineering Bulletin 2007)
Propagation-approach to provenance
Implicit where-provenance
Wang and Madnick (VLDB 1990) and Bhagwat et al. (VLDB 2004) for atomic data values in queries.
This talk: provenance for tables, tuples, and atomic data values in queries and updates.

Input R:
A B
1 2
8 1

P[(select * from R where A <> 1) union (select A, 5 as B from R where A = 1)]

Output:
A B
1 5
8 1
Queries
Objects: atoms – records – sets
Complex object queries:
e ::= x | a | (A : e, ..., B : e′) | e.A | {e, ..., e′} | ⋃e | {e′ | x ∈ e} | if e₁ = e₂ then e_t else e_f
Example:
select * from R where A <> 1
≡
⋃{ if x.A = 1 then ∅ else {x} | x ∈ R }
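The example translation can be mimicked with ordinary Python sets. This is only a sketch of the value-level semantics: the helpers rec and big_union are illustrative names, not part of the query language, and records are encoded as frozensets of (field, value) pairs so that they can be set members.

```python
def rec(**fields):
    """Build a hashable record from field=value pairs."""
    return frozenset(fields.items())

def big_union(sets):
    """Flatten a collection of sets into one set (the big-union operator)."""
    out = set()
    for s in sets:
        out |= s
    return out

# The example table R.
R = {rec(A=1, B=2), rec(A=8, B=1)}

# U{ if x.A = 1 then {} else {x} | x in R }
result = big_union([set() if dict(x)["A"] == 1 else {x} for x in R])
```

Each tuple contributes either the empty set or a singleton, so flattening leaves exactly the tuples with A <> 1.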
Provenance for queries
Queries create new objects
Objects copied from the input retain their color.
Objects resulting from constant, record, or set constructor are colored blank.

Input R:
A B
1 2
8 1

Examples:
P[R]
P[{x | x ∈ R}]
P[{(A : x.A, B : x.B) | x ∈ R}]
Each of these outputs a table with the same values as R:
A B
1 2
8 1

P[⋃{ if x.A = 1 then ∅ else {x} | x ∈ R }]
Output:
A B
8 1
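The two coloring rules above can be traced by hand on the four example queries. The sketch below is one reading of those rules, not the authors' implementation: the Obj type, the color names cR, c1, ..., c6, and the helper functions are all hypothetical, the blank color is encoded as None, and it is assumed that the flattened set is a newly constructed (blank) object and that x.A = 1 compares values while ignoring colors.

```python
from typing import NamedTuple, Optional

class Obj(NamedTuple):
    color: Optional[str]  # None stands for the blank color
    value: object         # int atom, tuple of (field, Obj) pairs, or frozenset of Obj

def atom(color, v):
    return Obj(color, v)

def record(color, **fields):
    return Obj(color, tuple(fields.items()))

def coll(color, members):
    return Obj(color, frozenset(members))

def get(rec, name):
    return dict(rec.value)[name]

# Input R; every table, tuple, and atom carries its own (hypothetical) color.
R = coll("cR", [
    record("c1", A=atom("c2", 1), B=atom("c3", 2)),
    record("c4", A=atom("c5", 8), B=atom("c6", 1)),
])

# P[R]: the input is returned as is, so every color is retained.
q1 = R

# P[{x | x in R}]: a new (blank) set whose members are copied tuples.
q2 = coll(None, [x for x in R.value])

# P[{(A: x.A, B: x.B) | x in R}]: blank set, blank records, copied atoms.
q3 = coll(None, [record(None, A=get(x, "A"), B=get(x, "B")) for x in R.value])

def flatten(s):
    # Assumption: the flattened set is a newly constructed, blank object.
    return coll(None, [m for inner in s.value for m in inner.value])

# P[U{ if x.A = 1 then {} else {x} | x in R }]: blank set, copied tuple.
q4 = flatten(coll(None, [
    coll(None, []) if get(x, "A").value == 1 else coll(None, [x])
    for x in R.value
]))
```

Under these assumptions the first three queries return the same values but differ in which objects keep their colors, which is the distinction the slide draws.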
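Implicit vs explicit provenance
Theorem: For every query e there exists a query f that explicitly implements the implicit provenance semantics P[e].
[Diagram: commuting square. P[e] maps the colored input table (A B: 1 2; 8 1) to the colored output table (A B: 1 5; 8 1); "to exp" translates each colored table into an explicit table of (clr, val) pairs, and f maps the explicit input to the explicit output.]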
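A rough sketch of what a "to exp" translation could look like, assuming that an explicit object pairs a clr field with a val field as the diagram's column names suggest. The Obj type, the BLANK marker, and the nested-tuple output format are illustrative choices, not the encoding used by the authors.

```python
from typing import NamedTuple, Optional

class Obj(NamedTuple):
    color: Optional[str]  # None stands for the blank color
    value: object         # int atom, tuple of (field, Obj) pairs, or frozenset of Obj

BLANK = "blank"  # hypothetical constant standing for the blank color

def to_exp(obj):
    """Turn a colored object into ordinary data made of (clr, val) pairs."""
    clr = BLANK if obj.color is None else obj.color
    v = obj.value
    if isinstance(v, frozenset):      # set: translate every member
        val = frozenset(to_exp(m) for m in v)
    elif isinstance(v, tuple):        # record: translate every field
        val = tuple((name, to_exp(child)) for name, child in v)
    else:                             # atom: keep the value
        val = v
    return (("clr", clr), ("val", val))

# A blank record (A: 1, B: 5) whose A atom was copied and whose B atom is new.
row = Obj(None, (("A", Obj("c2", 1)), ("B", Obj(None, 5))))
explicit_row = to_exp(row)
```

After translation the colors are plain data, so an ordinary query f can read and produce them.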
Expressiveness
Question: Can every explicit f be expressed implicitly?
Answer: No, the provenance semantics P[e] of a query e is:
1. Copying
2. Recording
3. Bounded inventing
Theorem: Every explicit, copying, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[e] of a query e.
Expressiveness
The implicit provenance semantics P[e] of a query e is copying:
Every non-blank object in the output is identical to the unique object in the input with the same color.
Corollary:
Non-copying explicit f cannot be expressed implicitly.
[Diagram: the running example, input table (A B: 1 2; 8 1) and output table (A B: 1 5; 8 1), shown both as colored tables related by P[e] and as explicit (clr, val) tables related by f.]
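The copying property can be phrased as a small check over a colored input and output. The sketch below assumes a hypothetical Obj encoding with None as the blank color and made-up color names; implicit_out is a hand-written reading of P[e] for the running query, and non_copying_out is an invented counterexample in which the new value 5 still carries the color of the input atom 2.

```python
from typing import NamedTuple, Optional

class Obj(NamedTuple):
    color: Optional[str]  # None stands for the blank color
    value: object         # int atom, tuple of (field, Obj) pairs, or frozenset of Obj

def record(color, **fields):
    return Obj(color, tuple(fields.items()))

def coll(color, members):
    return Obj(color, frozenset(members))

def subobjects(obj):
    """Yield obj and every object nested inside it."""
    yield obj
    v = obj.value
    if isinstance(v, frozenset):
        children = list(v)
    elif isinstance(v, tuple):
        children = [child for _, child in v]
    else:
        children = []
    for c in children:
        yield from subobjects(c)

def is_copying(inp, out):
    """Every non-blank output object equals the input object of that color."""
    by_color = {o.color: o for o in subobjects(inp) if o.color is not None}
    return all(o.color is None or by_color.get(o.color) == o
               for o in subobjects(out))

kept = record("c4", A=Obj("c5", 8), B=Obj("c6", 1))
R = coll("cR", [record("c1", A=Obj("c2", 1), B=Obj("c3", 2)), kept])

# Blank record with a copied A atom and a blank constant 5.
implicit_out = coll(None, [record(None, A=Obj("c2", 1), B=Obj(None, 5)), kept])
# The value 5 claims color c3, but c3 belongs to the atom 2 in the input.
non_copying_out = coll(None, [record(None, A=Obj("c2", 1), B=Obj("c3", 5)), kept])
```

Here is_copying(R, implicit_out) holds while is_copying(R, non_copying_out) does not, mirroring the corollary.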
Expressiveness
The implicit provenance semantics P[e] of a query e is propagating:
It never compares colors and hence commutes with recolorings.
Corollary:
Non-propagating explicit f cannot be expressed implicitly.
But such f query provenance rather than define it.
Related work:
[Geerts et al, ICDE 2006] and [Geerts and Van den Bussche, DBPL 2007] consider the queries f that explicitly query provenance, and show that the color algebra, an extension of the relational algebra that implicitly queries provenance, is expressively complete with respect to them.
Restriction to recording f:
(if e₁ = e₂ then e₃ else e₄) can occur in f only if e₁ and e₂ do not output colors.
Recording f always define propagating functions.
[Diagram: a recoloring ρ is applied to the colored input table (A B: 1 2; 8 1) and to the colored output table (A B: 1 5; 8 1); applying P[e] before or after ρ gives the same result, and likewise for f on the explicit (clr, val) tables.]
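Commuting with recolorings can be tested directly on the running example. In this sketch the Obj encoding, the color names, and the recoloring rho are hypothetical; p_query is a hand-written reading of P[e] for the select/union query, and color_test is an invented explicit function that inspects a color and therefore fails to commute.

```python
from typing import NamedTuple, Optional

class Obj(NamedTuple):
    color: Optional[str]  # None stands for the blank color
    value: object         # int atom, tuple of (field, Obj) pairs, or frozenset of Obj

def record(color, **fields):
    return Obj(color, tuple(fields.items()))

def coll(color, members):
    return Obj(color, frozenset(members))

def get(rec, name):
    return dict(rec.value)[name]

def recolor(rho, obj):
    """Apply the color mapping rho everywhere; blank stays blank."""
    color = None if obj.color is None else rho.get(obj.color, obj.color)
    v = obj.value
    if isinstance(v, frozenset):
        v = frozenset(recolor(rho, m) for m in v)
    elif isinstance(v, tuple):
        v = tuple((name, recolor(rho, child)) for name, child in v)
    return Obj(color, v)

def p_query(table):
    """Copy tuples with A <> 1; rebuild the others with B = 5 (blank)."""
    out = []
    for x in table.value:
        if get(x, "A").value != 1:
            out.append(x)
        else:
            out.append(record(None, A=get(x, "A"), B=Obj(None, 5)))
    return coll(None, out)

def color_test(table):
    """Hypothetical non-propagating f: selects tuples by their color."""
    return coll(None, [x for x in table.value if x.color == "c4"])

R = coll("cR", [
    record("c1", A=Obj("c2", 1), B=Obj("c3", 2)),
    record("c4", A=Obj("c5", 8), B=Obj("c6", 1)),
])
rho = {"c4": "c9", "c2": "c7"}

commutes = recolor(rho, p_query(R)) == p_query(recolor(rho, R))
color_test_commutes = recolor(rho, color_test(R)) == color_test(recolor(rho, R))
```

The first comparison succeeds because p_query only moves colors around; the second fails because color_test branches on a specific color.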
Expressiveness
The implicit provenance semantics P[e] of a query e is bounded inventing:
Only constants appearing in e will be colored blank.
Corollary:
Unbounded inventing explicit f cannot be expressed implicitly.
But such f are not “domain preserving” w.r.t. provenance.
[Diagram: an explicit f applied to a (clr, val) table with A = 1, B = 2.]
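One way to read the bounded inventing condition is as a containment check: the blank atoms of the output must come from the finite set of constants written in the query. The sketch below assumes a hypothetical Obj encoding with None as blank; the two outputs are hand-built illustrations, and the second (a blank 3, which is not a constant of the running query) is an invented counterexample.

```python
from typing import NamedTuple, Optional

class Obj(NamedTuple):
    color: Optional[str]  # None stands for the blank color
    value: object         # int atom, tuple of (field, Obj) pairs, or frozenset of Obj

def record(color, **fields):
    return Obj(color, tuple(fields.items()))

def coll(color, members):
    return Obj(color, frozenset(members))

def blank_atoms(obj):
    """Collect the values of all blank-colored atoms inside obj."""
    v = obj.value
    if isinstance(v, frozenset):
        return set().union(*[blank_atoms(m) for m in v])
    if isinstance(v, tuple):
        return set().union(*[blank_atoms(child) for _, child in v])
    return {v} if obj.color is None else set()

def invents_only(out, constants):
    """True if every blank atom of out is one of the given constants."""
    return blank_atoms(out) <= set(constants)

# Constants appearing in the running select/union query.
query_constants = {1, 5}

implicit_out = coll(None, [
    record(None, A=Obj("c2", 1), B=Obj(None, 5)),
    record("c4", A=Obj("c5", 8), B=Obj("c6", 1)),
])
invented_out = coll(None, [record(None, A=Obj("c2", 1), B=Obj(None, 3))])
```

A single output can only refute the property; establishing it for a function f requires the same finite constant set to work for all inputs.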
Main Result
Theorem: Every explicit, copying, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[e] of a query e.
[Diagram: commuting square. f maps the explicit (clr, val) input table to the explicit output table; "to impl" translates each explicit table into a colored table (A B: 1 2; 8 1 and A B: 1 5; 8 1), and P[e] maps the colored input to the colored output.]
Implicit provenance semantics is a reasonable candidate for recording provenance automatically, since we do not lose flexibility.
Proof Sketch
Lemma: Every explicit, recording f is polymorphic w.r.t. colors.
[Diagram: f maps the explicit (clr, val) input table (A B: 1 2; 8 1) to the explicit output table (A B: 1 5; 8 1); a substitution α on the clr columns relates both tables to a second pair of tables, which are related by P[f′].]
Proof Sketch
[Diagram: f on the explicit (clr, val) tables is simulated implicitly by composing P[enc], P[f′], and P[dec]; "to impl" relates the explicit input and output tables to the colored tables (A B: 1 2; 8 1) and (A B: 1 5; 8 1).]
Questions and Answers
Question: Is the ability to construct deeply nested objects vital here?
Answer: No, it is possible to construct P[e] equivalent to f such that e never constructs deeper intermediate objects than f.
Question: What happens when f is propagating instead of recording?
Conjecture: Every propagating f can be written as a recording one.
What about updates?
Updates are usually no more expressive than queries.

Input R:
A B
1 2
8 1

update R set B = 5 where A = 1
(select * from R where A <> 1) union (select A, 5 as B from R where A = 1)

Both produce:
A B
1 5
8 1

With provenance, compare on the same input:
P[update R set B = 5 where A = 1]
P[(select * from R where A <> 1) union (select A, 5 as B from R where A = 1)]
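At the level of plain values, the update and the query above are interchangeable. A minimal sketch, assuming Python's standard sqlite3 module as the database (not named in the slides) and dropping the parentheses around the union operands, which SQLite does not accept; the helper fresh_table is an illustrative name.

```python
import sqlite3

def fresh_table():
    """Return a new in-memory database containing the example table R."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table R (A integer, B integer)")
    conn.executemany("insert into R values (?, ?)", [(1, 2), (8, 1)])
    return conn

# Variant 1: modify R in place with the update.
upd = fresh_table()
upd.execute("update R set B = 5 where A = 1")
after_update = sorted(upd.execute("select * from R").fetchall())

# Variant 2: compute the same table with the query.
qry = fresh_table()
after_query = sorted(qry.execute(
    "select * from R where A <> 1 "
    "union "
    "select A, 5 as B from R where A = 1"
).fetchall())
```

Both variants yield the rows (1, 5) and (8, 1); any difference between them can only show up once colors are tracked.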
Updates
Objects: atoms – records – sets
Complex object updates (Liefke and Davidson, 1999):
u ::= skip | u₁; u₂ | repl e | [x]u | insert e | remove e | iter u | A ⇒ u | add e | drop A
Theorem: Updates and queries are equally expressive under the “normal” semantics
Provenance for updates
Updates modify objects
Objects retain their color, unless they are replaced by another object.
Input R:
A B
1 2
8 1

P[iter [x] B ⇒ repl(if x.A = 1 then 5 else x.B)]

Output:
A B
1 5
8 1
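The update rule can be traced by hand on this example. The sketch below is one reading of it, with a hypothetical Obj encoding (None as blank) and made-up color names: the table and its tuples are modified rather than rebuilt, so they keep their colors, and only the replaced B object changes. The constant 5 is new and therefore blank, while x.B is copied and keeps its color.

```python
from typing import NamedTuple, Optional

class Obj(NamedTuple):
    color: Optional[str]  # None stands for the blank color
    value: object         # int atom, tuple of (field, Obj) pairs, or frozenset of Obj

def record(color, **fields):
    return Obj(color, tuple(fields.items()))

def coll(color, members):
    return Obj(color, frozenset(members))

def get(rec, name):
    return dict(rec.value)[name]

def apply_update(table):
    """Hand-written P[iter [x] B => repl(if x.A = 1 then 5 else x.B)]."""
    updated = []
    for x in table.value:
        a, b = get(x, "A"), get(x, "B")
        # repl replaces the B object: the constant 5 is blank, x.B is a copy.
        new_b = Obj(None, 5) if a.value == 1 else b
        # The tuple itself is modified, not rebuilt, so it keeps its color.
        updated.append(Obj(x.color, (("A", a), ("B", new_b))))
    # The table is modified in place as well and keeps its color.
    return Obj(table.color, frozenset(updated))

R = coll("cR", [
    record("c1", A=Obj("c2", 1), B=Obj("c3", 2)),
    record("c4", A=Obj("c5", 8), B=Obj("c6", 1)),
])
out = apply_update(R)
```

Under this reading the tuple (1, 5) still carries the color of the input tuple (1, 2), whereas a query that builds the tuple with a record constructor would leave it blank.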
Implicit vs explicit provenance
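Theorem: For every update u there exists a query f that explicitly implements the implicit provenance semantics P[u].
[Diagram: commuting square. P[u] maps the colored input table (A B: 1 2; 8 1) to the colored output table (A B: 1 5; 8 1); "to exp" translates each colored table into an explicit table of (clr, val) pairs, and f maps the explicit input to the explicit output.]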
Expressiveness
The implicit provenance semantics P[u] of an update u is:
kind preserving
recording
bounded inventing
[Diagram: P[u] maps the colored table (A B: 1 2; 8 1) to a table with columns A B C (1 5 2; 8 1 8).]
Theorem: Every explicit, kind preserving, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[u] of an update.
Summary
Queries and updates implicitly define provenance
Queries create new objects, updates modify them
Readily implemented
Formal justification for the proposed semantics
It is expressively complete w.r.t. natural classes of queries that explicitly manipulate provenance.