Provenance as Dependency Analysis

Provenance in Databases Workshop
May 21, 2007
James Cheney
University of Edinburgh
Amal Ahmed, Umut Acar
Toyota Technological Institute, Chicago
What is provenance?
● “Where did that result come from?”
  – Where-provenance, annotation propagation
  – Idea: Trace sequence of “copies”
● “Why is that result in the output?”
  – Why-provenance, lineage
  – Idea: Trace data on which output “depends”
Previous work
● Lots!
  – “Polygen” model [Wang, Madnick '90]
  – Lineage [Cui, Widom, Wiener '00]
  – Why- and where- [Buneman, Khanna, Tan '01]
  – Copy-paste [Buneman, Chapman, Cheney, Vansummeren '06, '07]
  – Semirings [Green, Karvounarakis, Tannen '07]
(Why) Aren't we done?
Why we're not happy
There is a big gap between the strong informal motivations/descriptions
(“relevant”, “depends on”, “comes from”, “causes”)
and the formal definitions of previous work on provenance.
I find most of them somewhat ad hoc, including the ones I've worked on.
No offense!
What is provenance for?
● Understand/debug query
  – Show what query is “really” doing
● Show what data a result “relies on”
  – What we must trust to believe result
● Show where errors in result “come from”
  – What to blame if there is an error
● Propagate annotations on input to “relevant” output
What else is good for similar problems?
● Program dependence graphs
  – for understanding/debugging programs
● Information flow analysis
  – for tracking data secrecy and integrity
● Program slicing
  – for identifying parts of program on which wrong output depends
Key idea: dependency analysis
Goal
● Explore connection to PL ideas
  – dependency
  – information flow
  – program slicing
● Improve understanding of provenance
  – How to generalize to richer query languages?
  – What formal guarantees can/should it provide?
Simple dependency analysis
Input:   w = 5, x = 1, y = 0, z = 1
Program: x := y;
         y := 12;
         z := w + y;
Output:  w = 5, x = 0, y = 12, z = 17
● If we change part of input, what part of output might change?
● Trace back to find last “use” (assignment)
● Make use-def chains
● Now simplify
This dependence information correctly approximates program behavior if
there exist f, g, h, c such that (primed variables are output values)
    w' = f(w)    x' = g(y)    y' = c    z' = h(w, y)
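The use-def tracing above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `PROGRAM` encoding (assignment target plus the variables it reads) and the function name `input_dependences` are hypothetical and not taken from the talk. Note that the sketch reports z as depending on w alone, because y has already been overwritten by a constant when z is assigned; that is still compatible with z' = h(w, y), since h may ignore y.

```python
# Hypothetical encoding of the slide's program: (target, variables read).
PROGRAM = [
    ("x", ["y"]),       # x := y
    ("y", []),          # y := 12 (a constant, reads nothing)
    ("z", ["w", "y"]),  # z := w + y
]

def input_dependences(program, variables):
    """Map each output variable to the input variables it may depend on."""
    # Before the program runs, each variable depends only on its own input.
    deps = {v: {v} for v in variables}
    for target, reads in program:
        # Follow use-def chains: the target inherits the dependences that
        # the variables it reads have at this point in the program.
        inherited = set()
        for r in reads:
            inherited |= deps[r]
        deps[target] = inherited
    return deps

deps = input_dependences(PROGRAM, ["w", "x", "y", "z"])
for var in sorted(deps):
    print(var, "depends on", sorted(deps[var]))
```

An empty set for y corresponds to the constant c on the slide.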
Dependency analysis: if-then
Program: x := y;
         if (z > 0) x := w;
         else y := z;
Run 1:   input w = 5, x = 1, y = 0, z = 1      output w = 5, x = 5, y = 0, z = 1
Run 2:   input w = 5, x = 1, y = 0, z = -42    output w = 5, x = 0, y = -42, z = -42
● Control dependences needed to reflect alternative branch
● To get all dependences,
  – need to consider what did happen
  – but also what might have happened
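To see why the branch not taken matters, one can run the slide's program twice and compare outputs. The sketch below is a perturbation experiment rather than a dependency analysis, and the helper name `changed_outputs` is made up for illustration. On the first input the executed statements only copy w into x, yet changing z alone changes both x and y, which is exactly the control dependence on the test `z > 0`.

```python
def run(w, x, y, z):
    """The slide's program, transcribed directly."""
    x = y
    if z > 0:
        x = w
    else:
        y = z
    return {"w": w, "x": x, "y": y, "z": z}

def changed_outputs(inputs, var, new_value):
    """Names of outputs that differ when a single input variable is changed."""
    before = run(**inputs)
    after = run(**{**inputs, var: new_value})
    return {name for name in before if before[name] != after[name]}

inputs = {"w": 5, "x": 1, "y": 0, "z": 1}
print(run(**inputs))
print(sorted(changed_outputs(inputs, "z", -42)))  # effect of flipping the branch
print(sorted(changed_outputs(inputs, "y", 7)))    # effect of a pure data change
```

A single perturbation only witnesses that an output may change; it cannot show that an output never changes.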
Extending to DBs: main idea
● Given input DB, query Q
● Link each “part” p of input to all “parts” of output that depend on p
  – Intuition: If we change DB at p, then linked parts may change
  – The rest of the output, not depending on p, must not change if we change the DB at p.
1000 words
[Figure: query Q maps DB to Q(DB); part p of DB is linked to parts q1, q2, q3 of Q(DB). A second database DB' differs from DB only at p, and Q(DB') contains q1', q2', q3'.]
Provenance links are “dependency correct” if...
● Whenever DB and DB' are “the same except at p”
● Then Q(DB) and Q(DB') are “the same except at q1,...,qn”
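The criterion can be tried out on a toy model. The sketch below rests on strong assumptions that are not in the talk: the database and the query result are flat Python dicts with fixed, hypothetical part names, and only a finite list of candidate values at one fixed database is tested, whereas the definition quantifies over all DB and DB'. The names `dependency_correct`, `good_links` and `bad_links` are illustrative.

```python
def dependency_correct(query, db, links, candidates):
    """Perturbation check of the criterion at one database `db`.

    `db` and the query result are dicts from part names to values;
    `links[p]` is the set of output parts linked to input part p.
    """
    base = query(db)
    for p in db:
        linked = links.get(p, set())
        for value in candidates:
            changed_db = {**db, p: value}   # "the same except at p"
            result = query(changed_db)
            for q in base:
                if q not in linked and result[q] != base[q]:
                    return False            # an unlinked part changed
    return True

def query(d):
    # The straight-line program from the earlier slides, as a function.
    return {"w": d["w"], "x": d["y"], "y": 12, "z": d["w"] + 12}

db = {"w": 5, "x": 1, "y": 0, "z": 1}
good_links = {"w": {"w", "z"}, "y": {"x"}}
bad_links = {"w": {"w"}, "y": {"x"}}        # forgets that z depends on w

print(dependency_correct(query, db, good_links, range(-3, 4)))  # True
print(dependency_correct(query, db, bad_links, range(-3, 4)))   # False
```

Passing this check is only evidence, not a proof, of dependency correctness; failing it is a genuine counterexample.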
Easier said than done.
● Several challenges
  – How do we address “parts” of DB?
  – What is a “change”?
  – What do we mean by “same place” in two query results on changed input?
● Gloss over here, details in paper.
● Show basic idea using pictures
  – Much easier and more fun
  – but hard to formalize...
Selection
R            σ_{A=1}(R)
1 2 3        1 2 3
1 3 4        1 3 4
2 3 4
● Change 2 to 42 in the first row of R: the change is visible in the same field of the first output row
● Change the first field of the first row from 1 to 42: the row 1 2 3 disappears from the output, leaving only 1 3 4
● Change the first field of the third row from 2 to 1: a third row 1 3 4 appears in the output
● “Data” dependences: a changed value reappears in the output (first case)
● “Control” dependences: a changed A value decides whether a row appears in the output (second and third cases)
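The two kinds of dependence for selection can be recorded while the query runs. The following is a rough sketch, not the semantics defined in the paper: the row ids `r1`..`r3`, the choice of column index 0 for A, and the function name `select_a_equals` are assumptions made for illustration.

```python
def select_a_equals(rows, value):
    """Selection on column A (index 0) that also records dependences.

    `rows` maps a hypothetical row id to a tuple of field values.
    Returns (output_rows, data_deps, control_deps).
    """
    output, data_deps, control_deps = [], {}, set()
    for rid, row in rows.items():
        # Every row's A field is tested, so which rows appear in the
        # output depends on each of them (control dependence).
        control_deps.add((rid, 0))
        if row[0] == value:
            out_index = len(output)
            output.append(row)
            for col in range(len(row)):
                # Each output field is a copy of exactly one input field.
                data_deps[(out_index, col)] = (rid, col)
    return output, data_deps, control_deps

R = {"r1": (1, 2, 3), "r2": (1, 3, 4), "r3": (2, 3, 4)}
out, data, control = select_a_equals(R, 1)
print(out)
print(data[(0, 1)])      # where the middle field of the first output row came from
print(sorted(control))
```

Here the A field of the unselected row `r3` is still a control dependence, matching the slide where changing it from 2 to 1 adds a row to the output.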
Projection
R            π_{2,3}(R)
1 2 3        2 3
1 3 4        3 4
2 3 4        3 4
Only data dependences
Join
R        S        R JOIN S
1 2      2 3      1 2 3
1 3      2 4      1 2 4
2 3      3 4      1 3 4
                  2 3 4
Mix of dependences
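One way to picture the “mix” is to tag each output row of the join with the fields it copies and the fields that were compared to produce it. The sketch below is a simplification under assumptions: it treats R as having columns (A, B) and S as (B, C), uses hypothetical addresses of the form (table, row index, column index), and records only the two join fields that matched for each row. It does not record that changing some other B value could create additional output rows.

```python
def join_with_deps(r, s):
    """Join R(A, B) with S(B, C) on B, tagging each output row.

    Returns a list of (row, data_deps, control_deps) triples.
    """
    out = []
    for i, (a, b) in enumerate(r):
        for j, (b2, c) in enumerate(s):
            if b == b2:
                row = (a, b, c)
                # One source field per output field (B is taken from R).
                data = [("R", i, 0), ("R", i, 1), ("S", j, 1)]
                # The row exists because these two fields were equal.
                control = {("R", i, 1), ("S", j, 0)}
                out.append((row, data, control))
    return out

R = [(1, 2), (1, 3), (2, 3)]
S = [(2, 3), (2, 4), (3, 4)]
result = join_with_deps(R, S)
for row, data, control in result:
    print(row, data, sorted(control))
```

The join field thus plays both roles: it is copied into the output and it is tested to decide which rows exist.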
Union
R        S        R ∪ S
1 2      2 3      1 2
1 3      2 4      1 3
2 3      4 4      2 3
                  2 3
                  2 4
                  4 4
Only data dependences
Difference
R        S        R - S
1 2      2 3      1 2
1 3      2 4      1 3
2 3      3 4
Only control dependences
Grouping/aggregation
R        SUM
1 2      1 5
1 3      2 3
2 3
● Data dependence on aggregated field
● Control dependence on grouped field
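The split between the two fields can be made concrete with a small grouped sum. This is a sketch under assumptions: rows are (A, B) pairs, grouped on A and summed on B, and field addresses are hypothetical (row index, column index) pairs. It records, per group, only the rows that actually ended up in that group; a change to a grouped field elsewhere could also move a row into the group, which this simple bookkeeping does not capture.

```python
from collections import defaultdict

def sum_by_group(rows):
    """Group (A, B) rows by A and sum B, recording dependences per group."""
    totals = defaultdict(int)
    data_deps = defaultdict(set)     # aggregated fields feeding each sum
    control_deps = defaultdict(set)  # grouped fields that placed a row in the group
    for i, (a, b) in enumerate(rows):
        totals[a] += b
        data_deps[a].add((i, 1))
        control_deps[a].add((i, 0))
    return dict(totals), dict(data_deps), dict(control_deps)

R = [(1, 2), (1, 3), (2, 3)]
totals, data, control = sum_by_group(R)
print(totals)
print(sorted(data[1]), sorted(control[1]))
```

For group 1 the sum 5 has data dependences on the B fields of the first two rows and control dependences on their A fields.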
What is actually in the paper
● Precise definition & formalization of dependency correctness for NRC queries
● Proof that “minimal” dependency provenance is noncomputable
● Define provenance semantics that is dependency-correct
  – but might be “inaccurate”
● Define static approximation
  – via type-based analysis
Future work
● Negation, = handled inelegantly
  – As in most/all other techniques
  – “Equal except for” is too strong
  – Can we do better?
● Implementation
  – Prototype implementation now
  – How to scale dynamic tracking to large DBs?
Conclusions
● Dependence analysis
  – provides a solid foundation for provenance
  – can track data, row, table dependences
  – can deal with grouping, aggregation
● Minimizing nontrivial (undecidable)
  – Introduced dynamic and static approximation techniques
● Lots of possible variations to explore!