Reflection: A Tool to Model and Reason about Provenance
Dirk Van Gucht
Indiana University
Computer Science Department
vgucht@cs.indiana.edu

Overview
• Discussion on data provenance: data + program
• What is reflection?
• Why reflection?
• How is reflection realized in programming and query languages?
• Is reflection useful to model and reason about provenance?

Motivation
• What is the provenance of a data object?
• How should it be modeled?
• How should it be computed?
• How do we manipulate and query it?
Should we separate these concerns? Should we seek an integration in a single database programming environment?

What is provenance?
• Workflow provenance of a piece of data is information recording the complete history of its derivation. It describes how derived data has been calculated from raw information, tracking the interaction of programs (software) as well as the involvement of external devices (sensors, cameras, collecting equipment).
• Data provenance of a piece of data is
  1. the description of its origin(s) and
  2. the process by which it arrived from this origin.
Data provenance = data + program

Where and Why Provenance
1. Where-provenance: Where does a given piece of data come from? The where-provenance of a data object is simply the identification of the source element(s) from which the object is copied or derived.
2. Why-provenance: Why is a given piece of data in the database? In addition to the where-provenance of a data object, one keeps the justification or explanation for the appearance of that object, because it can come to exist through the execution of a query, a copy-paste operation, a computation, an update, etc.

Modeling data provenance
• Annotations
• Colors
• Expressions over colors
• Semi-ring elements
• Dataflow and workflow models

Computing Data Provenance
Lazy: compute the provenance of data only when needed.
(Debugging)
Eager: compute data provenance by carrying the provenance of data along as the data is transformed. (Annotation propagation)

Manipulating and querying provenance
“Which process created a data product?”
“When was it created?”
“What were the input data products to a process?”
“Find differences between two workflow runs with slightly different services.”
etc.

What is reflection? (Stemple et al.)
• The ability of a program to generate new program fragments and to integrate these into its own execution.
• New program fragments depend on the current code of the program and the current environment.

Why reflection?
There are many scenarios wherein both data and program become part of database applications. (Provenance is one of these.)
• Stonebraker et al. argue for stored procedures (POSTGRES);
• Jim Gray urges the database community to lower the wall between data and program (SIGMOD Innovation Award 1993);
• Asilomar report: unification of data and programs needs to be high on the database research agenda.

Query languages for data and programs
Scenarios
• meta-data
• ontological data
• database designs
• constraints
• indexes
• queries
• views
• procedures
• rules (if condition then action)
• access control
• mappings
• wrappers
• provenance
• ···

How is reflection realized in programming languages?
• Reification:
  – Input: Program (Process, Action)
  – Output: Data
• Evaluation:
  – Input: Data
  – Output: Program (Process, Action)

Scheme
• Reification: quote (')
• Evaluation: eval

    > (+ 3 4)
    7
    > '(+ 3 4)
    (+ 3 4)
    > (cdr '(+ 3 4))
    (3 4)
    > (eval '(+ 3 4))
    7

Scheme

    > ((lambda (x) (+ x x)) 2)
    4
    > '((lambda (x) (+ x x)) 2)
    ((lambda (x) (+ x x)) 2)
    > (eval '((lambda (x) (+ x x)) 2))
    4

Dynamic SQL
• Reification: string building
• Evaluation: EXEC

    SELECT A FROM R

    -- the same query, built as a string and then executed
    DECLARE @query varchar(100)
    SET @query = 'SELECT' + ' A' + ' FROM' + ' R'
    EXEC (@query)

Dynamic SQL - Stored Queries

    Queries
    QueryName: varchar    QueryCode: varchar
    QA                    'SELECT A FROM R'
    QB                    'SELECT B FROM R'

    -- fetch the stored query text, then execute it
    DECLARE @query varchar(100)
    SELECT @query = QueryCode FROM Queries WHERE QueryName = 'QA'
    EXEC (@query)

Reflective query languages?
• Design a relational query language with reflection
• Start from a relational QL and add a limited number of primitives

The problem of reification
How should queries (procedures, etc.) be represented as data in tables? The representation is not unique, and it can be messy.

Reflective relational algebra
• A: relational algebra programs
• RA: reflective relational algebra
  – evaluation applied to A programs
• R2A: full reflective relational algebra
  – evaluation applied to A and RA programs

Relational algebra programs
An A program is a sequence of statements. Each statement has the form ‘X := E’, where X is a (relation) variable and E is a term. Each term is one of the following:
• a relation name;
• a constant of the form {(A : a)}, where A is an attribute name and a ∈ dom(A);
• a (relation) variable;
• if X, X1, X2 are variables, then the following are terms:
  – X1 ∪ X2 (union);
  – X1 − X2 (difference);
  – X1 ⋈ X2 (equi-join);
  – π̂_A(X) (projecting out attribute A);
  – σ_{A=B}(X) and σ_{A<B}(X) (selections); and
  – ρ_{A/A′}(X) (renaming of attribute A to A′).

An A program
R is a relation over scheme {P, C}. Find the children and grandchildren of Fred.
    1) X1 := R;
    2) X2 := {(P : Fred)};
    3) X3 := X1 ⋈ X2;
    4) X4 := ρ_{C/C′}(X3);
    5) X5 := ρ_{P/C′}(X1);
    6) X6 := X4 ⋈ X5;
    7) X7 := π̂_{C′}(X6);
    8) X3 := X3 ∪ X7;
    9) X8 := π̂_P(X3).

Reification of an A program into a program relation
(- marks an empty cell)

    statement                 sno  var  op  att-1  att-2  arg-1  arg-2  rel  const
    1) X1 := R;               1    X1   -   -      -      -      -      R    -
    2) X2 := {(P : Fred)};    2    X2   -   P      -      -      -      -    Fred
    3) X3 := X1 ⋈ X2;         3    X3   ⋈   -      -      X1     X2     -    -
    4) X4 := ρ_{C/C′}(X3);    4    X4   ρ   C      C′     X3     -      -    -
    5) X5 := ρ_{P/C′}(X1);    5    X5   ρ   P      C′     X1     -      -    -
    6) X6 := X4 ⋈ X5;         6    X6   ⋈   -      -      X4     X5     -    -
    7) X7 := π̂_{C′}(X6);      7    X7   π̂   C′     -      X6     -      -    -
    8) X3 := X3 ∪ X7;         8    X3   ∪   -      -      X3     X7     -    -
    9) X8 := π̂_P(X3).         9    X8   π̂   P      -      X3     -      -    -

Reification into program relations
• A programs can be reified into program relations.
• To construct program relations of A programs that depend on the database, we need another operator that generates numbers.

    X := number_sno(R)

    R =
    A
    a
    b
    c

    number_sno(R) =
    sno  A
    1    a
    2    b
    3    c

Reification of an A program that depends on the database
Compute the program relation for π_A(R) given the catalog information

    Att-R
    att
    A
    B
    C
    D

1. Compute Y1 and Y2

    Y1        Y2
    att       att
    B         C
              D

2. Compute from Y1 the relation Y3

    var  op  att-1  att-2  arg-1  arg-2  rel  const
    X    π̂   B      -      R      -      R    -

3. Compute from Y2 the relation Y4

    var  op  att-1  att-2  arg-1  arg-2  rel  const
    X    π̂   C      -      X      -      X    -
    X    π̂   D      -      X      -      X    -

4. Y5 := Y3 ∪ Y4
5. Y6 := number_sno(Y5)

    sno  var  op  att-1  att-2  arg-1  arg-2  rel  const
    1    X    π̂   B      -      R      -      R    -
    2    X    π̂   C      -      X      -      X    -
    3    X    π̂   D      -      X      -      X    -

Reflective relational algebra RA
A term of RA is either
1. a term of A;
2. an expression of the form number_N(X), with X a variable; or
3. an expression of the form eval(X), with X a variable.
Semantics of eval(X): if X holds a program relation representing an A program P, then eval(X) executes P and evaluates to P’s final result.

Evaluation

    sno  var  op  att-1  att-2  arg-1  arg-2  rel  const
    1    X    π̂   B      -      R      -      R    -
    2    X    π̂   C      -      X      -      X    -
    3    X    π̂   D      -      X      -      X    -

    Y7 := eval(Y6)

Notice that the RA program is unaffected by changes to the attributes of R.
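The semantics of eval given above (run the statements of a program relation in sno order and return the last result) can be illustrated with a small interpreter. This is only a minimal Python sketch under stated assumptions, not the slides' formal definition: relations are modeled as sets of frozensets of (attribute, value) pairs, the operator names "join", "union", "diff", "rename" and "project_out" stand in for ⋈, ∪, −, ρ and π̂, selections are omitted, and the helper names relation, stmt and eval_program as well as the sample contents of R are hypothetical.

```python
def relation(*rows):
    """Build a relation (a set of tuples) from dicts mapping attribute -> value."""
    return {frozenset(r.items()) for r in rows}

def join(x, y):
    """Join: merge pairs of tuples that agree on every shared attribute."""
    out = set()
    for s in x:
        for t in y:
            ds, dt = dict(s), dict(t)
            if all(ds[a] == dt[a] for a in ds.keys() & dt.keys()):
                out.add(frozenset({**ds, **dt}.items()))
    return out

def project_out(x, att):
    """Drop attribute `att` from every tuple (the pi-hat operator)."""
    return {frozenset((a, v) for a, v in t if a != att) for t in x}

def rename(x, old, new):
    """Rename attribute `old` to `new` in every tuple."""
    return {frozenset((new if a == old else a, v) for a, v in t) for t in x}

def eval_program(program, db):
    """Execute a program relation in sno order and return the final result."""
    env = {}
    value = set()

    def lookup(name):
        # A name is a program variable if already assigned, else a database relation.
        return env[name] if name in env else db[name]

    for row in sorted(program, key=lambda r: r["sno"]):
        op = row.get("op")
        if op == "join":
            value = join(lookup(row["arg1"]), lookup(row["arg2"]))
        elif op == "union":
            value = lookup(row["arg1"]) | lookup(row["arg2"])
        elif op == "diff":
            value = lookup(row["arg1"]) - lookup(row["arg2"])
        elif op == "rename":
            value = rename(lookup(row["arg1"]), row["att1"], row["att2"])
        elif op == "project_out":
            value = project_out(lookup(row["arg1"]), row["att1"])
        elif "const" in row:
            value = relation({row["att1"]: row["const"]})  # constant {(A : a)}
        else:
            value = db[row["rel"]]                          # plain relation name
        env[row["var"]] = value
    return value

def stmt(sno, var, **fields):
    """One row of a program relation; empty cells are simply left out."""
    return {"sno": sno, "var": var, **fields}

# The nine-statement program for the children and grandchildren of Fred.
FRED = [
    stmt(1, "X1", rel="R"),
    stmt(2, "X2", att1="P", const="Fred"),
    stmt(3, "X3", op="join", arg1="X1", arg2="X2"),
    stmt(4, "X4", op="rename", att1="C", att2="C'", arg1="X3"),
    stmt(5, "X5", op="rename", att1="P", att2="C'", arg1="X1"),
    stmt(6, "X6", op="join", arg1="X4", arg2="X5"),
    stmt(7, "X7", op="project_out", att1="C'", arg1="X6"),
    stmt(8, "X3", op="union", arg1="X3", arg2="X7"),
    stmt(9, "X8", op="project_out", att1="P", arg1="X3"),
]

# Hypothetical instance of R over scheme {P, C}.
db = {"R": relation({"P": "Fred", "C": "Ann"}, {"P": "Ann", "C": "Bob"},
                    {"P": "Bob", "C": "Cid"}, {"P": "Zed", "C": "Yan"})}

answer = eval_program(FRED, db)
print(sorted(dict(t)["C"] for t in answer))  # ['Ann', 'Bob']
```

The same interpreter runs the three-statement π̂ program held in Y6. Because the program is ordinary data, regenerating it from the catalog and re-running eval_program requires no change to the interpreter when R gains or loses attributes.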
This property is called meta-data independence.

“What is known about John?” in A

    rel       att
    Persons   name
    Persons   age
    Children  parent
    Children  child
    Hobbies   person
    Hobbies   hobby

    rel       att     val
    Persons   name    John
    Persons   age     24
    Children  parent  John
    Children  child   Steve
    Children  child   Iris
    Hobbies   person  John
    Hobbies   hobby   ping-pong
    Hobbies   hobby   math

For every fixed database scheme, this query can be expressed in the relational algebra as

    ⋃_{R,A,A′} (rel : R) ⋈ (att : A) ⋈ ρ_{A/val}(π_A(σ_{A′=John}(R)))

This formulation of the query is meta-data dependent.

“What is known about John?” in RA

    ⋃_{R,A,A′} (rel : R) ⋈ (att : A) ⋈ ρ_{A/val}(π_A(σ_{A′=John}(R)))

1. Generate all triples of the form (R, A, A′);
2. For each (R, A, A′), make the program relation fragment that corresponds to (rel : R) ⋈ (att : A) ⋈ ρ_{A/val}(π_A(σ_{A′=John}(R)));
3. Build an auxiliary relation for the unions;
4. Use the number operator to create a proper ordering on the program statements;
5. eval this program relation.

Other scenarios
• Given catalog information and relations, create a triple store to represent the union of all the data in the database.
• Assume a Log storing queries.
  – Determine the queries that depend on R. (syntactical)
  – Determine the queries in the log that reference the most relations. (syntactical)
  – Determine the queries in the log that return an empty answer on the current database. (semantical)
  – View expansion: replace, in each query in the log, each view name by its definition as given in the system catalog. (syntactical)
  – Given a list of new view definitions (under the old names), which queries in the log give a different answer on the current instance under the new view definitions? (semantical)

Discussion
• What is the provenance of a data object? Provenance = data + program?
• How should it be modeled?
• How should it be computed?
• How do we manipulate and query it?
Should we separate these concerns? Should we seek an integration in a single database programming environment?
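The two syntactical log scenarios listed under "Other scenarios" (which logged queries depend on R, and which reference the most relations) need no evaluation at all once queries are stored as program relations: they are ordinary queries over the rel column. The following Python sketch illustrates this under assumptions that are not in the slides: the Log is modeled as a list of program-relation rows, each tagged with a hypothetical query identifier qid, and the sample log contents and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical Log: one row per statement of a logged query.
# `qid` is an assumed extra column identifying the query a statement belongs to.
LOG = [
    {"qid": "q1", "sno": 1, "var": "X1", "rel": "R"},
    {"qid": "q1", "sno": 2, "var": "X2", "op": "project_out", "att1": "P", "arg1": "X1"},
    {"qid": "q2", "sno": 1, "var": "X1", "rel": "S"},
    {"qid": "q3", "sno": 1, "var": "X1", "rel": "R"},
    {"qid": "q3", "sno": 2, "var": "X2", "rel": "S"},
    {"qid": "q3", "sno": 3, "var": "X3", "op": "join", "arg1": "X1", "arg2": "X2"},
]

def relations_used(log):
    """Map each query id to the set of relation names in its `rel` column."""
    used = defaultdict(set)
    for row in log:
        if "rel" in row:
            used[row["qid"]].add(row["rel"])
    return used

def depends_on(log, name):
    """Queries that mention relation `name` (a purely syntactic test)."""
    return sorted(q for q, rels in relations_used(log).items() if name in rels)

def most_relations(log):
    """Queries that reference the largest number of distinct relations."""
    used = relations_used(log)
    top = max(len(rels) for rels in used.values())
    return sorted(q for q, rels in used.items() if len(rels) == top)

print(depends_on(LOG, "R"))   # ['q1', 'q3']
print(most_relations(LOG))    # ['q3']
```

The semantical scenarios (empty answers, changed answers under new view definitions) differ in that they would additionally require evaluating each logged program against the current database.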