Extending Where-Provenance to deal with Primitive Data Value Operations Stijn Vansummeren

Hasselt University and Transnational University of Limburg, Belgium
S. Vansummeren
Where-Provenance Revisited
Introduction
Chris curates a DB
create table DB(Beer, Type, Origin)
insert into DB values(...)
insert into DB
select * from S where origin='USA'
insert into DB
select * from T where type='stout'

Resulting table DB:
Beer      Type   Origin
Duvel     blond  Belgium
Heineken  blond  USA
Bud       blond  USA
Guinness  stout  Ireland

Where-provenance is vital for assessing trustworthiness
Manual provenance recording is tedious and error-prone
Automatic provenance recording support?
Outline
1. Current Where-Provenance for Queries
2. Extending Where-Provenance to deal with Primitive Data Value Operations
A simple model of data where-provenance
Copy-paste model
Buneman, Chapman, Cheney (SIGMOD 2006)
Ingredients:
A space of data objects
Every (sub)object is identifiable
Provenance links only if copy
No link? → Object is constructed
[Figure: example data space containing atomic values and tables with columns A, B, and C]
A simple model of data where-provenance
Copy-paste model
Buneman, Chapman, Cheney (SIGMOD 2006)
To store provenance, store identifier/color
[Figure: the example data space, annotated with the identifier/color that is stored for individual objects]
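The copy-paste model can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the `Obj` class, the `construct` and `copy_of` helpers, and the use of integers as identifiers are all hypothetical names chosen here.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_fresh = count(1)  # hypothetical source of fresh identifiers


@dataclass(frozen=True)
class Obj:
    """A data object: a value, its own identifier, and an optional provenance link."""
    value: object
    ident: int                    # every (sub)object is identifiable
    source: Optional[int] = None  # identifier/color of the object it was copied from


def construct(value):
    """Build a new object: no provenance link, so it counts as constructed."""
    return Obj(value, next(_fresh))


def copy_of(obj):
    """Copy an object: the copy gets its own identifier and a link to its source."""
    return Obj(obj.value, next(_fresh), source=obj.ident)


original = construct(8)
pasted = copy_of(original)   # linked to `original`
invented = construct(5)      # no link: constructed
```

Storing only the source identifier is enough to answer "where did this value come from?" for `pasted`, while `invented` is recognizable as constructed because its link is empty.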
Propagation-approach to where-provenance
Implicit where-provenance
Wang and Madnick (VLDB 1990) and Bhagwat et al. (VLDB 2004): provenance for atomic data values in queries.
Buneman et al. (ICDT 2007): provenance for tables, tuples, and atomic data values in queries and updates.

Input R:
A  B
1  2
8  1

P[(select * from R where A <> 1)
  union (select A, 5 as B from R where A = 1)]

Output:
A  B
1  5
8  1
Queries
Objects:
atomic data values – records – sets
Complex object queries:
e ::= x | a | (A : e, ..., B : e′) | e.A
    | {e, ..., e′} | ⋃e | {e′ | x ∈ e} | if e_1 = e_2 then e_t else e_f
Example
select * from R where A <> 1
≡
⋃{if x.A = 1 then ∅ else {x} | x ∈ R}
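To make the correspondence between the SQL query and the complex object query concrete, here is a small Python sketch of the example. The encoding is an assumption made for illustration: records are frozensets of (field, value) pairs, and `record`, `field`, and `big_union` are hypothetical helper names.

```python
def record(**fields):
    """A record, encoded as a frozenset of (field, value) pairs so it can be a set member."""
    return frozenset(fields.items())


def field(rec, name):
    """Projection e.A."""
    return dict(rec)[name]


def big_union(sets):
    """The big-union operator: flatten a collection of sets into one set."""
    out = set()
    for s in sets:
        out |= s
    return frozenset(out)


R = frozenset({record(A=1, B=2), record(A=8, B=1)})

# ⋃{ if x.A = 1 then ∅ else {x} | x ∈ R }
result = big_union(
    frozenset() if field(x, "A") == 1 else frozenset({x})
    for x in R
)
```

Here `result` contains only the record with A = 8, which is what `select * from R where A <> 1` returns on this input.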
Provenance for queries
Queries create new objects
Objects copied from the input retain their color.
Objects resulting from a constant, record, or set constructor are colored blank.

Examples on the input table R:
A  B
1  2
8  1

P[R]
Returns R itself. The table, its tuples, and its atomic values are all copied, so all keep their colors.

P[{x | x ∈ R}]
Returns the same tuples in a newly constructed set. The set is blank; the tuples and their atomic values keep their colors.

P[⋃{if x.A = 1 then ∅ else {x} | x ∈ R}]
Returns only the tuple (A : 8, B : 1), which keeps its color, inside a newly constructed (blank) set.
A  B
8  1

P[{(A : x.A, B : x.B) | x ∈ R}]
Returns the same values, but the set and both records are newly constructed and therefore blank. Only the atomic values keep their colors.
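The two coloring rules can be traced by hand in Python. The sketch below is an assumption-laden illustration, not the authors' implementation: the class `C`, the helpers `rec` and `get`, and the color names `r`, `t1`, `a1`, and so on are hypothetical, and each query is written out directly instead of being interpreted from the grammar.

```python
from dataclasses import dataclass
from typing import Optional

BLANK = None  # the blank color of constructed objects


@dataclass(frozen=True)
class C:
    """A colored object: an atom, a record (tuple of (field, C) pairs), or a set (frozenset of C)."""
    color: Optional[str]
    value: object


def rec(color, **fields):
    return C(color, tuple(fields.items()))


def get(r, name):
    """Projection: returns the stored field itself, so its color is retained."""
    return dict(r.value)[name]


R = C("r", frozenset({
    rec("t1", A=C("a1", 1), B=C("b1", 2)),
    rec("t2", A=C("a2", 8), B=C("b2", 1)),
}))

# P[R]: everything is copied, so every color is kept.
p_identity = R

# P[{x | x ∈ R}]: the set is constructed (blank); the tuples are copied.
p_rebuild_set = C(BLANK, frozenset(x for x in R.value))

# P[{(A : x.A, B : x.B) | x ∈ R}]: set and records are constructed; atoms are copied.
p_rebuild_records = C(BLANK, frozenset(
    rec(BLANK, A=get(x, "A"), B=get(x, "B")) for x in R.value
))

# P[(select * from R where A <> 1) union (select A, 5 as B from R where A = 1)]
p_union = C(BLANK, frozenset(
    x if get(x, "A").value != 1 else rec(BLANK, A=get(x, "A"), B=C(BLANK, 5))
    for x in R.value
))
```

In `p_union`, the tuple with A = 8 still carries color `t2`, while the rewritten tuple is blank, its A value keeps color `a1`, and the constant 5 is blank.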
Sanity Check
The implicit provenance semantics is sound
Every non-blank object in the output is identical to the unique object in the input with the same color.
[Figure: the property illustrated on the input table R (A : 1, 8; B : 2, 1) for P[R], P[{x | x ∈ R}], P[⋃{if x.A = 1 then ∅ else {x} | x ∈ R}], and P[{(A : x.A, B : x.B) | x ∈ R}]]
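The soundness property can be phrased as an executable check. The following Python sketch assumes a hypothetical encoding (class `C`, with `None` as the blank color) and hypothetical helper names `subobjects` and `is_sound`; it also assumes that colors in the input are unique, as the property requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class C:
    color: Optional[str]  # None is the blank color
    value: object         # atom, tuple of (field, C) pairs, or frozenset of C


def subobjects(obj):
    """Yield obj and every object nested inside it."""
    yield obj
    if isinstance(obj.value, frozenset):
        for member in obj.value:
            yield from subobjects(member)
    elif isinstance(obj.value, tuple):
        for _, child in obj.value:
            yield from subobjects(child)


def is_sound(inp, out):
    """Every non-blank output object equals the input object with the same color."""
    by_color = {o.color: o for o in subobjects(inp) if o.color is not None}
    return all(
        o.color in by_color and by_color[o.color] == o
        for o in subobjects(out)
        if o.color is not None
    )


R = C("r", frozenset({
    C("t1", (("A", C("a1", 1)), ("B", C("b1", 2)))),
    C("t2", (("A", C("a2", 8)), ("B", C("b2", 1)))),
}))

good = C(None, frozenset(R.value))  # P[{x | x ∈ R}]: new blank set, copied tuples
bad = C("a1", 5)                    # claims color a1 although a1 identifies the value 1
```

With these definitions `is_sound(R, good)` holds, while `is_sound(R, bad)` fails because the value colored `a1` in the input is 1, not 5.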
Implicit vs explicit provenance
Theorem: For every query e there exists a query f that explicitly implements the implicit provenance semantics P[e].
[Figure: commuting diagram. The colored input and the colored output of P[e] are each translated ("to exp") into an explicit representation that stores a color (clr) next to every value (val); f maps the explicit input to the explicit output.]
Expressiveness
Question: Can every sound explicit f be expressed implicitly?
Answer: No, the provenance semantics P[e] of a query e is:
1. Recording
2. Bounded inventing
Theorem: Every explicit, sound, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[e] of a query e.
Expressiveness
The implicit provenance semantics P[e] of a query e is recording:
The explicit expression for P[e] never compares colors: if e_1 = e_2 then e_t else e_f can occur only if e_1 and e_2 do not output colors.
P[e] is therefore also propagating: it commutes with recolorings ρ.
[Figure: commuting square. Recoloring the input with ρ and then applying P[e] gives the same result as applying P[e] first and then recoloring the output with ρ; the same square is drawn for the explicit query f.]
Corollary:
Non-propagating explicit f cannot be expressed implicitly.
But such f query provenance rather than define it.
Related work:
[Geerts et al., ICDE 2006] and [Geerts and Van den Bussche, DBPL 2007] consider the queries f that explicitly query provenance, and show them expressively complete with respect to the color algebra, an extension of the relational algebra that implicitly queries provenance.
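The propagating property can be checked on a small instance. In this Python sketch the encoding (class `C`, `None` as blank), the helper `recolor`, the dictionary `rho`, and the two hand-written queries are all hypothetical; `p_nonpropagating` is a deliberately color-inspecting query added only for contrast.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class C:
    color: Optional[str]  # None is the blank color
    value: object         # atom, tuple of (field, C) pairs, or frozenset of C


def recolor(obj, rho):
    """Apply the recoloring rho to every color in obj; blank stays blank."""
    color = None if obj.color is None else rho[obj.color]
    if isinstance(obj.value, frozenset):
        value = frozenset(recolor(m, rho) for m in obj.value)
    elif isinstance(obj.value, tuple):
        value = tuple((name, recolor(child, rho)) for name, child in obj.value)
    else:
        value = obj.value
    return C(color, value)


def p_query(R):
    """P[(select * from R where A <> 1) union (select A, 5 as B from R where A = 1)]."""
    out = set()
    for x in R.value:
        fields = dict(x.value)
        if fields["A"].value != 1:
            out.add(x)  # copied tuple keeps its color
        else:
            out.add(C(None, (("A", fields["A"]), ("B", C(None, 5)))))
    return C(None, frozenset(out))


def p_nonpropagating(R):
    """Selects tuples by inspecting their color, so it queries provenance."""
    return C(None, frozenset(x for x in R.value if x.color == "t1"))


R = C("r", frozenset({
    C("t1", (("A", C("a1", 1)), ("B", C("b1", 2)))),
    C("t2", (("A", C("a2", 8)), ("B", C("b2", 1)))),
}))
rho = {"r": "r2", "t1": "u1", "t2": "u2",
       "a1": "x1", "b1": "y1", "a2": "x2", "b2": "y2"}

commutes = p_query(recolor(R, rho)) == recolor(p_query(R), rho)
breaks = p_nonpropagating(recolor(R, rho)) == recolor(p_nonpropagating(R), rho)
```

On this input `commutes` is true and `breaks` is false: the color-comparing query returns an empty set once `t1` has been renamed, which is why it cannot be the provenance semantics of any query e.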
Expressiveness
The implicit provenance semantics P[e] of a query e is bounded inventing:
Only constants appearing in e will be colored blank.
[Figure: an explicit query f applied to an explicit table of colors (clr) and values (val)]
Corollary:
Unbounded inventing explicit f cannot be expressed implicitly.
But such f are not "domain preserving" w.r.t. provenance.
Expressive Completeness
Theorem [Buneman, Cheney, V. (ICDT 2007)]: Every explicit, sound, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[e] of a query e.
[Figure: commuting diagram. The explicit input and output of f are translated ("to impl") into colored tables, on which P[e] computes the same result.]
Implicit provenance semantics is a reasonable candidate for recording provenance automatically, since we do not lose flexibility.
However ...
We're lacking important real-world features
Operations on atomic data, aggregates, external function calls, ...

Input R:
A  B
1  2
8  1

P[(select * from R where A <> 1)
  union (select A, B + 3 from R where A = 1)]

Output:
A  B
1  5
8  1

This is not bounded-inventing
Expressive completeness fails without the bounded-inventing requirement
Outline
1. Current Where-Provenance for Queries
2. Extending Where-Provenance to deal with Primitive Data Value Operations
Revisiting the simple model of where-provenance
Ingredients:
Every (sub)object is identifiable
Provenance links only if copy
No link? → Object is constructed
Let's record how objects are constructed!
[Figure: the example data space of atomic values and tables with columns A, B, and C]
Provenance terms by example
Data space: the table R
A  B
1  2
8  1

Provenance term          Evaluates to
c                        the unique object identified by this color
(A : c, B : 5)           record (A : 1, B : 5)
{(A : c, B : 5), c′}     table with rows (A : 1, B : 5) and (A : 8, B : 1)
c + 3                    5
avg{c, c′}               1.5

(c and c′ stand for colors of objects in the data space; the slides draw them as colored boxes.)
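Provenance terms and their evaluation can be sketched as a small term language in Python. Everything named here is an assumption for illustration: the classes `Color`, `Const`, `Rec`, `SetOf`, and `Prim`, the `evaluate` function, the `PRIMS` table, and the color names in `space`. Which colored values the slide feeds to `avg` is not recoverable, so the two B values of R are assumed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Color:
    name: str


@dataclass(frozen=True)
class Const:
    value: object


@dataclass(frozen=True)
class Rec:
    fields: tuple   # tuple of (field name, term) pairs


@dataclass(frozen=True)
class SetOf:
    members: tuple  # tuple of terms


@dataclass(frozen=True)
class Prim:
    op: str
    args: tuple     # tuple of argument terms


PRIMS = {"+": lambda a, b: a + b, "avg": lambda s: mean(s)}


def evaluate(term, space):
    """Evaluate a provenance term against a data space mapping colors to objects."""
    if isinstance(term, Color):
        return space[term.name]
    if isinstance(term, Const):
        return term.value
    if isinstance(term, Rec):
        return tuple((n, evaluate(t, space)) for n, t in term.fields)
    if isinstance(term, SetOf):
        return frozenset(evaluate(t, space) for t in term.members)
    if isinstance(term, Prim):
        return PRIMS[term.op](*(evaluate(a, space) for a in term.args))
    raise TypeError(f"not a provenance term: {term!r}")


# Data space for R = {(A : 1, B : 2), (A : 8, B : 1)}; records are tuples of pairs.
space = {"a1": 1, "b1": 2, "a2": 8, "b2": 1, "t2": (("A", 8), ("B", 1))}

new_record = Rec((("A", Color("a1")), ("B", Const(5))))
new_table = SetOf((new_record, Color("t2")))
plus_three = Prim("+", (Color("b1"), Const(3)))
average = Prim("avg", (SetOf((Color("b1"), Color("b2"))),))
```

Evaluating `new_record` gives the record (A : 1, B : 5), `plus_three` gives 5, and `average` gives 1.5, matching the table of examples.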
Queries – Revisited
Objects:
atomic data values – records – sets
Complex object queries with primitives p : σ → atom:
e ::= x | a | (A : e, ..., B : e′) | e.A
    | {e, ..., e′} | ⋃e | {e′ | x ∈ e} | if e_1 = e_2 then e_t else e_f
    | sum(e) | e + e | ...
Example
select A, B + 3 from R where A = 1
≡
⋃{if x.A = 1 then {(A : x.A, B : x.B + 3)} else ∅ | x ∈ R}
Provenance for Queries – Revisited
Queries create new objects
Objects copied from the input retain their color.
Atomic constants, record constructors, set constructors, and external functions induce the corresponding operation on color terms.

Examples on the input table R:
A  B
1  2
8  1

In the terms below, c_R is the color of R, c_1 and c_2 are the colors of its two tuples, and c_1A, c_1B, c_2A, c_2B are the colors of their atomic values. These names are introduced here for readability; the slides draw the colors as colored boxes.

P[R]
The output is R itself, with provenance term c_R.

P[{x | x ∈ R}]
The output set has provenance term {c_1, c_2}.

P[{(A : x.A, B : x.B) | x ∈ R}]
The output set has provenance term {(A : c_1A, B : c_1B), (A : c_2A, B : c_2B)}; each record has the corresponding record term.

P[(select * from R where A <> 1)
  union (select A, B + 3 from R where A = 1)]
Output:
A  B
1  5
8  1
The output set has provenance term {(A : c_1A, B : c_1B + 3), c_2}. The tuple (A : 1, B : 5) has term (A : c_1A, B : c_1B + 3), and its B value 5 has term c_1B + 3.
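The revised semantics, in which constructors and primitives build provenance terms, can be traced on the B + 3 example. The Python sketch below rests on assumptions: the class `T`, the helpers `const`, `plus`, `make_record`, and `make_set`, the tuple encoding of terms, and the color names are hypothetical, and the query is written out by hand instead of being interpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T:
    term: object   # provenance term: a color name or a nested tuple
    value: object  # atom, tuple of (field, T) pairs, or frozenset of T


def const(v):
    """An atomic constant induces a constant term."""
    return T(("const", v), v)


def plus(x, y):
    """The primitive + induces a + node over the argument terms."""
    return T(("+", x.term, y.term), x.value + y.value)


def make_record(**fields):
    """The record constructor induces a record term over the field terms."""
    return T(("rec", tuple((n, f.term) for n, f in fields.items())),
             tuple(fields.items()))


def make_set(members):
    """The set constructor induces a set term over the member terms."""
    members = frozenset(members)
    return T(("set", frozenset(m.term for m in members)), members)


def p_query(R):
    """P[(select * from R where A <> 1) union (select A, B + 3 from R where A = 1)]."""
    out = []
    for x in R.value:
        f = dict(x.value)
        if f["A"].value != 1:
            out.append(x)  # copied tuple keeps its color as its term
        else:
            out.append(make_record(A=f["A"], B=plus(f["B"], const(3))))
    return make_set(out)


R = T("r", frozenset({
    T("t1", (("A", T("a1", 1)), ("B", T("b1", 2)))),
    T("t2", (("A", T("a2", 8)), ("B", T("b2", 1)))),
}))

result = p_query(R)
new_tuple = next(t for t in result.value if dict(t.value)["A"].value == 1)
```

The computed value 5 no longer appears as an unexplained blank: its term records that it is the value colored `b1` plus the constant 3.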
Sanity Check
This provenance semantics is sound
Every object in the output is identical to the evaluation of its provenance term.
Colors are used consistently.
E.g., if a tuple has provenance term (A : c, B : 5), then its A-field has provenance term c.
Implicit vs explicit provenance – Revisited
Theorem: For every query e there exists a query f that explicitly implements the implicit provenance semantics P[e].
[Figure: commuting diagram as before ("to exp"), with provenance terms t1, ..., t7 stored in the clr columns of the explicit output]
Explicit Query Language:
e ::= x | a | (A : e, ..., B : e′) | e.A
    | {e, ..., e′} | ⋃e | {e′ | x ∈ e} | if e_1 = e_2 then e_t else e_f
    | sum(e) | e + e | ...
    | Pa | P(A : e, ..., B : e′) | P{e, ..., e′} | Psum(e) | e +P e | ...
Expressive Completeness – Revisited
Question: Can every sound explicit f be expressed implicitly?
Answer: No, the provenance semantics P[e] of a query e is:
1. Recording (no comparison between provenance terms)
2. Bounded inventing (only finitely many constant atoms in provenance terms)
Theorem: Every sound, explicit, recording, and bounded inventing f can be expressed implicitly as the provenance semantics P[e] of a query e.
[Figure: commuting diagram ("to impl") relating the explicit tables of values (val) and provenance terms t1, ..., t7 (clr) to the implicitly annotated tables]
Implicit provenance semantics is a reasonable candidate for recording provenance automatically, since we do not lose flexibility.
Wrapping up
In conclusion:
Provenance terms help preserve expressive completeness in the
presence of primitive operations on atomic data values
Future work
Efficient implementation?
Updates?