Reflection: A Tool to Model and Reason about Provenance
Dirk Van Gucht
Indiana University
Computer Science Department
vgucht@cs.indiana.edu
Overview
• Discussion on data provenance: data + program
• What is reflection?
• Why reflection?
• How is reflection realized in programming and query
languages?
• Is reflection useful to model and reason about provenance?
Motivation
• What is the provenance of a data object?
• How should it be modeled?
• How should it be computed?
• How do we manipulate and query it?
Should we separate these concerns?
Should we seek an integration in a single database programming environment?
What is provenance?
• Workflow provenance of a piece of data is information recording the complete history of its derivation.
It describes how derived data has been calculated from raw information, tracking the
interaction of programs (software) as well as the involvement of external devices (sensors, cameras,
collecting equipment).
• Data provenance of a piece of data is
1. the description of its origin(s) and
2. the process by which it arrived from this origin.
Data provenance = data + program
Where and Why Provenance
1. Where-provenance: Where does a given piece of
data come from?
The where-provenance of a data object is simply
the identification of the source element(s) from which
the object is copied or derived.
2. Why-provenance: Why is a given piece of data in
the database?
In addition to the where-provenance of a data object, one keeps the justification or explanation for
the appearance of that object, since it can come to exist through the execution of a query, a copy-paste
operation, a computation, an update, etc.
Modeling data provenance
• Annotations
• Colors
• Expressions over colors
• Semi-ring elements
• Dataflow and workflow models
Computing Data Provenance
Lazy: compute the provenance of data only
when needed. (Debugging)
Eager: compute data provenance by carrying
the provenance of data along as the data is
transformed. (Annotation propagation)
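The eager strategy can be illustrated with a minimal Python sketch of annotation propagation. It is an illustration under stated assumptions, not the method of any particular system: the relation contents, the cell-identifier scheme ("R.t1.C") and the helper names `annotate`, `project` and `select_eq` are all hypothetical.

```python
# Eager where-provenance: every value carries the set of source cells
# it was copied from. Cell ids such as "R.t1.C" are a made-up convention.

def annotate(rel_name, rows):
    """Attach a source-cell annotation to every value of a base relation."""
    return [
        {att: (val, frozenset({f"{rel_name}.t{i}.{att}"})) for att, val in row.items()}
        for i, row in enumerate(rows, start=1)
    ]

def project(rows, atts):
    """Projection copies values, so their annotations travel along unchanged."""
    return [{a: row[a] for a in atts} for row in rows]

def select_eq(rows, att, const):
    """Selection keeps whole tuples; annotations are untouched."""
    return [row for row in rows if row[att][0] == const]

# Hypothetical base relation R over scheme {P, C}.
R = annotate("R", [{"P": "Fred", "C": "Ann"}, {"P": "Ann", "C": "Bob"}])

children_of_fred = project(select_eq(R, "P", "Fred"), ["C"])
print(children_of_fred)
```

Each output value arrives together with the identifier of the cell it was copied from, so no recomputation is needed when the provenance is later queried. A lazy strategy would instead store only the plain values and re-trace the query when asked.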
Manipulating and querying provenance
“Which process created a data product?”
“When was it created?”
“What were the input data products to a process?”
“Find differences between two workflow runs with slightly
different services”
etc.
What is reflection? (Stemple et al.)
• The ability of a program to generate new
program fragments and to integrate these
into its own execution.
• New program fragments depend on
the current code of the program and the
current environment.
Why reflection?
There are many scenarios wherein both data
and program become part of database applications. (Provenance is one of these.)
• Stonebraker et al. argue for stored procedures (POSTGRES);
• Jim Gray urges the database community to lower
the wall between data and program (SIGMOD Innovation Award 1993);
• Asilomar report: unification of data
and programs needs to be high on the database
research agenda.
Query languages for data and programs
Scenarios
meta-data
ontological data
database designs
constraints
indexes
queries
views
procedures
rules (if condition then action)
access control
mappings
wrappers
provenance
···
How is reflection realized in programming
languages?
• Reification:
– Input: Program (Process, Action)
– Output: Data
• Evaluation:
– Input: Data
– Output: Program (Process, Action)
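The two directions can be sketched in Python with the standard `ast` module, as one possible illustration rather than a canonical definition of reflection. Reification turns program text into a tree that can be inspected and changed like data; evaluation turns that data back into a running computation. The variable names are illustrative, and `ast.unparse` assumes Python 3.9 or later.

```python
import ast

source = "3 + 4"  # program text

# Reification: the program becomes a data structure (an AST).
tree = ast.parse(source, mode="eval")
operands = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]

# The reified program can be transformed like any other data:
# here the addition is swapped for a multiplication.
tree.body.op = ast.Mult()
new_source = ast.unparse(tree)

# Evaluation: data becomes a running program again.
original_result = eval(source)
new_result = eval(compile(tree, "<reified>", "eval"))

print(operands, new_source, original_result, new_result)
```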
Scheme
• Reification: quote (')
• Evaluation: eval
> (+ 3 4)
7
> '(+ 3 4)
(+ 3 4)
> (cdr '(+ 3 4))
(3 4)
> (eval '(+ 3 4))
7
Scheme
> ((lambda (x) (+ x x)) 2)
4
> '((lambda (x) (+ x x)) 2)
((lambda (x) (+ x x)) 2)
> (eval '((lambda (x) (+ x x)) 2))
4
Dynamic SQL
• Reification: string building
• Evaluation: EXEC
-- Static query
SELECT A FROM R
-- The same query, built as a string and then executed
DECLARE @query varchar(100)
SET @query = 'SELECT' + ' A' + ' FROM' + ' R'
EXEC (@query)
Dynamic SQL - Stored Queries
Queries
QueryName: varchar    QueryCode: varchar
QA                    'SELECT A FROM R'
QB                    'SELECT B FROM R'

DECLARE @query varchar(100)
SELECT @query = QueryCode
FROM Queries
WHERE QueryName = 'QA'
EXEC (@query)
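The stored-queries idea can also be tried out with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the contents of R and the helper name `run_stored_query` are hypothetical, and SQLite stands in for the SQL Server dialect used on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (A TEXT, B INTEGER);
    INSERT INTO R VALUES ('a1', 1), ('a2', 2);
    CREATE TABLE Queries (QueryName TEXT PRIMARY KEY, QueryCode TEXT);
    INSERT INTO Queries VALUES ('QA', 'SELECT A FROM R'),
                               ('QB', 'SELECT B FROM R');
""")

def run_stored_query(conn, name):
    """Fetch the query text stored under `name` (data) and execute it (program)."""
    row = conn.execute(
        "SELECT QueryCode FROM Queries WHERE QueryName = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return conn.execute(row[0]).fetchall()

qa = run_stored_query(conn, "QA")
qb = run_stored_query(conn, "QB")
print(qa, qb)
```

Executing text read from a table is only safe when the contents of that table are trusted, which is one reason string-based reification is considered messy.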
Reflective query languages?
• Design a relational query language with reflection
• Start from a relational QL and add a limited number of primitives
The problem of reification
How should queries (procedures, etc.) be represented as data in tables?
Not unique.
Can be messy.
Reflective relational algebra
• A: relational algebra programs
• RA: reflective relational algebra
– evaluation applied to A programs
• R2A: full reflective relational algebra
– evaluation applied to A and RA programs
Relational algebra programs
An A program is a sequence of statements.
Each statement has the form ‘X := E’, where X is a
(relation) variable and E is a term.
Each term is one of the following:
• a relation name;
• a constant of the form {(A : a)}, where A is an
attribute name and a ∈ dom(A);
• a (relation) variable;
• if X, X1, X2 are variables, then the following are
terms:
– X1 ∪ X2 (union);
– X1 − X2 (difference);
– X1 ⋈ X2 (equi-join);
– π̂_A(X) (projecting out attribute A);
– σ_{A=B}(X) and σ_{A<B}(X) (selections); and
– ρ_{A/A′}(X) (renaming of attribute A to A′).
An A program
R is a relation over scheme {P, C}.
Find the children and grandchildren of Fred.
1) X1 := R;
2) X2 := {(P : Fred)};
3) X3 := X1 ⋈ X2;
4) X4 := ρ_{C/C′}(X3);
5) X5 := ρ_{P/C′}(X1);
6) X6 := X4 ⋈ X5;
7) X7 := π̂_{C′}(X6);
8) X3 := X3 ∪ X7;
9) X8 := π̂_P(X3).
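To check what the nine statements compute, the program can be replayed in Python. This is a sketch under assumptions: relations are modelled as sets of frozensets of (attribute, value) pairs, the join is implemented as a natural join on shared attribute names, and the instance of R is made up for the example.

```python
def rel(*rows):
    """Build a relation: a set of tuples, each a frozenset of (attribute, value) pairs."""
    return {frozenset(r.items()) for r in rows}

def join(x, y):
    """Natural join: combine tuples that agree on all shared attributes."""
    out = set()
    for s in x:
        for t in y:
            ds, dt = dict(s), dict(t)
            if all(ds[a] == dt[a] for a in ds.keys() & dt.keys()):
                out.add(frozenset({**ds, **dt}.items()))
    return out

def rename(x, old, new):
    """Rename attribute `old` to `new` in every tuple."""
    return {frozenset((new if a == old else a, v) for a, v in t) for t in x}

def project_out(x, att):
    """Drop attribute `att` from every tuple."""
    return {frozenset((a, v) for a, v in t if a != att) for t in x}

# Hypothetical instance of R over scheme {P, C} (parent, child).
R = rel({"P": "Fred", "C": "Ann"}, {"P": "Ann", "C": "Bob"}, {"P": "Sue", "C": "Tom"})

X1 = R
X2 = rel({"P": "Fred"})
X3 = join(X1, X2)            # Fred's children
X4 = rename(X3, "C", "C'")
X5 = rename(X1, "P", "C'")
X6 = join(X4, X5)            # child C' of Fred is the parent of C
X7 = project_out(X6, "C'")   # Fred's grandchildren
X3 = X3 | X7
X8 = project_out(X3, "P")

names = sorted(dict(t)["C"] for t in X8)
print(names)
```

On this instance the result contains Ann (a child of Fred) and Bob (a grandchild), while Tom is excluded.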
Reification of an A program into a program
relation
sno  var  op  att-1  att-2  arg-1  arg-2  rel  const   statement
1    X1   -   -      -      -      -      R    -       X1 := R;
2    X2   -   P      -      -      -      -    Fred    X2 := {(P : Fred)};
3    X3   ⋈   -      -      X1     X2     -    -       X3 := X1 ⋈ X2;
4    X4   ρ   C      C′     X3     -      -    -       X4 := ρ_{C/C′}(X3);
5    X5   ρ   P      C′     X1     -      -    -       X5 := ρ_{P/C′}(X1);
6    X6   ⋈   -      -      X4     X5     -    -       X6 := X4 ⋈ X5;
7    X7   π̂   C′     -      X6     -      -    -       X7 := π̂_{C′}(X6);
8    X3   ∪   -      -      X3     X7     -    -       X3 := X3 ∪ X7;
9    X8   π̂   P      -      X3     -      -    -       X8 := π̂_P(X3).
(- marks an empty cell)
Reification into program relations
• A programs can be reified into program relations.
• To construct program relations of A programs that
depend on the database, we need another operator
that generates numbers.
X := number_sno(R)

R =
A
a
b
c

number_sno(R) =
sno  A
1    a
2    b
3    c
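A minimal Python sketch of such a numbering operator is shown here. The function name `number` and the choice to number tuples in sorted order are assumptions made so that the example is reproducible; the slides do not state which order the operator picks.

```python
def number(rows, att="sno"):
    """Extend each tuple with a distinct number 1..n stored under `att`.

    Assumption: tuples are numbered in sorted order of their contents.
    """
    ordered = sorted(rows, key=lambda r: sorted(r.items()))
    return [{att: i, **row} for i, row in enumerate(ordered, start=1)]

# The relation R from the slide, deliberately given out of order.
R = [{"A": "b"}, {"A": "c"}, {"A": "a"}]
numbered = number(R)
print(numbered)
```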
Reification of an A program that depends
on the database
Compute the program relation for π_A(R) given the catalog information Att-R:

Att-R
att
A
B
C
D

1. Compute Y1 and Y2:

Y1     Y2
att    att
B      C
       D

2. Compute from Y1 the relation Y3:

var  op  att-1  att-2  arg-1  arg-2  rel  const
X    π̂   B      -      R      -      R    -

3. Compute from Y2 the relation Y4:

var  op  att-1  att-2  arg-1  arg-2  rel  const
X    π̂   C      -      X      -      X    -
X    π̂   D      -      X      -      X    -

4. Y5 := Y3 ∪ Y4
5. Y6 := number_sno(Y5)

sno  var  op  att-1  att-2  arg-1  arg-2  rel  const
1    X    π̂   B      -      R      -      R    -
2    X    π̂   C      -      X      -      X    -
3    X    π̂   D      -      X      -      X    -
(- marks an empty cell)
Reflective relational algebra RA
A term of RA is either
1. a term of A;
2. an expression of the form number_N(X), with X a
variable; or
3. an expression of the form eval(X), with X a variable.
Semantics of eval(X): if X holds a program relation
representing an A program P, then eval(X) executes P
and evaluates to P's final result.
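The semantics of eval can be sketched as a small Python interpreter over program relations. This is an illustration under assumptions, not the definition used in the underlying work: a program relation is a list of dictionaries, the column names are written `att1`, `arg1`, and so on, operators are spelled out as strings, selections are omitted for brevity, and the instance of R is hypothetical.

```python
def join(x, y):
    """Natural join on shared attribute names."""
    out = set()
    for s in x:
        for t in y:
            ds, dt = dict(s), dict(t)
            if all(ds[a] == dt[a] for a in ds.keys() & dt.keys()):
                out.add(frozenset({**ds, **dt}.items()))
    return out

def rename(x, old, new):
    return {frozenset((new if a == old else a, v) for a, v in t) for t in x}

def project_out(x, att):
    return {frozenset((a, v) for a, v in t if a != att) for t in x}

def eval_program(program, db):
    """Run the statements in `sno` order; return the value of the last one."""
    env, result = {}, set()
    for st in sorted(program, key=lambda s: s["sno"]):
        op, a1, a2 = st.get("op"), st.get("arg1"), st.get("arg2")
        if op is None and st.get("rel") is not None:   # X := R
            result = db[st["rel"]]
        elif op is None:                                # X := {(A : a)}
            result = {frozenset({(st["att1"], st["const"])})}
        elif op == "join":
            result = join(env[a1], env[a2])
        elif op == "union":
            result = env[a1] | env[a2]
        elif op == "diff":
            result = env[a1] - env[a2]
        elif op == "rename":
            result = rename(env[a1], st["att1"], st["att2"])
        elif op == "project_out":
            result = project_out(env[a1], st["att1"])
        else:
            raise ValueError(f"unknown operator: {op}")
        env[st["var"]] = result
    return result

def stmt(sno, var, **cells):
    """One row of a program relation; missing cells are simply absent."""
    return {"sno": sno, "var": var, **cells}

# The children-and-grandchildren-of-Fred program, reified as data.
program = [
    stmt(1, "X1", rel="R"),
    stmt(2, "X2", att1="P", const="Fred"),
    stmt(3, "X3", op="join", arg1="X1", arg2="X2"),
    stmt(4, "X4", op="rename", att1="C", att2="C'", arg1="X3"),
    stmt(5, "X5", op="rename", att1="P", att2="C'", arg1="X1"),
    stmt(6, "X6", op="join", arg1="X4", arg2="X5"),
    stmt(7, "X7", op="project_out", att1="C'", arg1="X6"),
    stmt(8, "X3", op="union", arg1="X3", arg2="X7"),
    stmt(9, "X8", op="project_out", att1="P", arg1="X3"),
]
db = {"R": {frozenset({("P", "Fred"), ("C", "Ann")}),
            frozenset({("P", "Ann"), ("C", "Bob")})}}
answer = eval_program(program, db)
print(sorted(dict(t)["C"] for t in answer))
```

Because the program is ordinary data, it can be built, filtered or rewritten by other queries before it is handed to `eval_program`.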
Evaluation
Y6 =
sno  var  op  att-1  att-2  arg-1  arg-2  rel  const
1    X    π̂   B      -      R      -      R    -
2    X    π̂   C      -      X      -      X    -
3    X    π̂   D      -      X      -      X    -
(- marks an empty cell)
Y7 := eval(Y6)
Notice that the RA program is unaffected by
changes to the attributes of R:
meta-data independence.
“What is known about John?” in A
Catalog information:
rel       att
Persons   name
Persons   age
Children  parent
Children  child
Hobbies   person
Hobbies   hobby
Query answer:
rel       att     val
Persons   name    John
Persons   age     24
Children  parent  John
Children  child   Steve
Children  child   Iris
Hobbies   person  John
Hobbies   hobby   ping-pong
Hobbies   hobby   math
For every fixed database scheme, this query can be expressed in the relational algebra as
⋃_{R,A,A′} (rel : R) ⋈ (att : A) ⋈ ρ_{A/val}(π_A(σ_{A′=John}(R)))
This formulation of the query is meta-data dependent.
“What is known about John?” in RA
⋃_{R,A,A′} (rel : R) ⋈ (att : A) ⋈ ρ_{A/val}(π_A(σ_{A′=John}(R)))
1. Generate all triples of the form (R, A, A′);
2. For each (R, A, A′), make the program relation fragment
that corresponds to
(rel : R) ⋈ (att : A) ⋈ ρ_{A/val}(π_A(σ_{A′=John}(R)));
3. Build an auxiliary relation for the unions;
4. Apply the number operator to create a proper ordering on the program statements;
5. eval this program relation.
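The five steps can be imitated in Python as a rough sketch. It is a simplification under assumptions: the base tuples in `db` are hypothetical (chosen to be consistent with the answer table for John), each query fragment is kept as a plain dictionary rather than as rows of a program relation, and steps 3 to 5 are collapsed into a single union over the evaluated fragments.

```python
# Hypothetical database instance.
db = {
    "Persons":  [{"name": "John", "age": 24}],
    "Children": [{"parent": "John", "child": "Steve"},
                 {"parent": "John", "child": "Iris"}],
    "Hobbies":  [{"person": "John", "hobby": "ping-pong"},
                 {"person": "John", "hobby": "math"}],
}

# Catalog information (rel, att), read off the data rather than hard-coded.
catalog = [(rel, att) for rel, rows in db.items() for att in rows[0]]

# Step 1: all triples (R, A, A') where A and A' are attributes of R.
triples = [(r, a, a2) for r, a in catalog for r2, a2 in catalog if r == r2]

# Step 2: one reified query fragment per triple, kept as plain data.
fragments = [
    {"rel": r, "project": a, "select": a2, "const": "John"}
    for r, a, a2 in triples
]

def eval_fragment(frag, db):
    """Evaluate (rel : R) join (att : A) join rho_{A/val}(pi_A(sigma_{A'=const}(R)))."""
    return {
        (frag["rel"], frag["project"], row[frag["project"]])
        for row in db[frag["rel"]]
        if row[frag["select"]] == frag["const"]
    }

# Steps 3-5, collapsed: evaluate every fragment and union the results.
known = set().union(*(eval_fragment(f, db) for f in fragments))
for fact in sorted(known, key=str):
    print(fact)
```

Nothing in the code names a specific relation or attribute, so adding a relation or a column to `db` changes the answer but not the program, which is the meta-data independence the RA formulation aims for.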
Other scenarios
• Given catalog information and relations, create a
triple store to represent the union of all the data in
the database.
• Assume a Log storing queries.
– Determine the queries that depend on R. (syntactical)
– Determine the queries in the log that reference
the most relations. (syntactical)
– Determine the queries in the log that return an
empty answer on the current database. (semantical)
– View expansion: replace, in each query in the
log, each view name by its definition as given in
the system catalog. (syntactical)
– Given a list of new view definitions (under the
old names), which queries in the log give a different answer on the current instance under the
new view definitions? (semantical)
Discussion
• What is the provenance of a data object?
Provenance = data + program?
• How should it be modeled?
• How should it be computed?
• How do we manipulate and query it?
Should we separate these concerns?
Should we seek an integration in a single database programming environment?