Provenance as Dependency Analysis

Provenance in Databases Workshop, May 21, 2007
James Cheney, University of Edinburgh
Amal Ahmed and Umut Acar, Toyota Technological Institute, Chicago

What is provenance?
● “Where did that result come from?”
  – Where-provenance, annotation propagation
  – Idea: Trace sequence of “copies”
● “Why is that result in the output?”
  – Why-provenance, lineage
  – Idea: Trace data on which output “depends”

Previous work
● Lots!
  – “Polygen” model [Wang, Madnick '90]
  – Lineage [Cui, Widom, Wiener '00]
  – Why- and where- [Buneman, Khanna, Tan '01]
  – Copy-paste [Buneman, Chapman, C, Vansummeren '06, '07]
  – Semirings [Green, Karvounarakis, Tannen '07]
● (Why) Aren't we done?

Why we're not happy
There is a big gap between the strong informal motivations/descriptions (“relevant”, “depends on”, “comes from”, “causes”) and the formal definitions of previous work on provenance.
I find most of them somewhat ad hoc, including the ones I've worked on. No offense!

What is provenance for?
● Understand/debug query
  – Show what query is “really” doing
● Show what data a result “relies on”
  – What we must trust to believe result
● Show where errors in result “come from”
  – What to blame if there is an error
● Propagate annotations on input to “relevant” output

What else is good for similar problems?
● Program dependence graphs
  – for understanding/debugging programs
● Information flow analysis
  – for tracking data secrecy and integrity
● Program slicing
  – for identifying parts of program on which wrong output depends
● Key idea: dependency analysis

Goal
● Explore connection to PL ideas
  – dependency
  – information flow
  – program slicing
● Improve understanding of provenance
  – How to generalize to richer query languages?
  – What formal guarantees can/should it provide?

Simple dependency analysis
Input:   w = 5, x = 1, y = 0, z = 1
Program: x := y; y := 12; z := w + y;
Output:  w = 5, x = 0, y = 12, z = 17
● If we change part of input... what part of output might change?
● Trace back to find last “use” (assignment)
● Make use-def chains
● Now simplify
This dependence information correctly approximates program behavior if there exist f, g, h, c such that
  w' = f(w)
  x' = g(y)
  y' = c
  z' = h(w, y)

Dependency analysis: if-then
Program: x := y; if (z > 0) x := w; else y := z;
Input:   w = 5, x = 1, y = 0, z = 1
Output:  w = 5, x = 5, y = 0, z = 1
Input:   w = 5, x = 1, y = 0, z = -42
Output:  w = 5, x = 0, y = -42, z = -42
● Control dependences needed to reflect alternative branch
● To get all dependences,
  – need to consider what did happen
  – but also what might have happened

Extending to DBs: main idea
● Given input DB, query Q
● Link each “part” p of input to all “parts” of output that depend on p
  – Intuition: If we change DB at p, then linked parts may change
  – The rest of the output, which does not depend on p, must not change if we change the DB at p.

1000 words
[Figure: query Q maps DB to Q(DB); a part p of DB is linked to parts q1, q2, q3 of Q(DB). A second input DB', changed at p, is mapped to Q(DB').]
Provenance links are “dependency correct” if...
● Whenever DB and DB' are “the same except at p”
● Then Q(DB) and Q(DB') are “the same except at q1,...,qn”

Easier said than done.
● Several challenges
  – How do we address “parts” of DB?
  – What is a “change”?
  – What do we mean by “same place” in two query results on changed input?
● Gloss over here, details in paper.
● Show basic idea using pictures
  – Much easier and more fun
  – but hard to formalize...

Selection
R
  1  2  3
  1  3  4
  2  3  4
σ_{A=1}(R)
  1  2  3
  1  3  4
● Change the second field of the first row of R to 42: visible in the same field of the output row, now (1 42 3)
● Change the first field of the first row of R to 42: visible in the output, which is now just (1 3 4)
● Change the first field of the third row of R to 1: visible in the output, which gains a third row (1 3 4)
● Two kinds of links: “data” dependences and “control” dependences

Projection
R
  1  2  3
  1  3  4
  2  3  4
π_{2,3}(R)
  2  3
  3  4
  3  4
● Only data dependences

Join
R          S
  1  2       2  3
  1  3       2  4
  2  3       3  4
R JOIN S
  1  2  3
  1  2  4
  1  3  4
  2  3  4
● Mix of dependences

Union
[Figure: example tables R, S and R ∪ S]
● Only data dependences

Difference
R          S
  1  2       2  3
  1  3       2  4
  2  3       3  4
R - S
  1  2
  1  3
● Only control dependences

Grouping/aggregation
R
  1  2
  1  3
  2  3
Result (second column is SUM)
  1  5
  2  3
● Data dependence on aggregated field
● Control dependence on grouped field

What is actually in the paper
● Precise definition & formalization of dependency correctness for NRC queries
● Proof that “minimal” dependency provenance is noncomputable
● Define provenance semantics that is dependency-correct
  – but might be “inaccurate”
● Define static approximation
  – via type-based analysis

Future work
● Negation, = handled inelegantly
  – As in most/all other techniques
  – “Equal except for” is too strong
  – Can we do better?
● Implementation
  – Prototype implementation now
  – How to scale dynamic tracking to large DBs?

Conclusions
● Dependence analysis
  – provides a solid foundation for provenance
  – can track data, row, table dependences
  – can deal with grouping, aggregation
● Minimizing nontrivial (undecidable)
  – Introduced dynamic and static approximation techniques
● Lots of possible variations to explore!
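The two small programs from the "Simple dependency analysis" and "Dependency analysis: if-then" slides can be analyzed with a short Python sketch. It is only an illustration, not the analysis from the paper: the tuple encoding of statements and the names `assigned_vars` and `analyze` are assumptions made here. Each variable is mapped to the set of input variables its final value may depend on, and the variables read by a condition are added as control dependences to everything assigned in either branch.

```python
# Hypothetical encoding of a tiny imperative language:
#   ("assign", target, vars_read)          x := y   ->  ("assign", "x", {"y"})
#   ("if", vars_in_condition, then_block, else_block)

def assigned_vars(block):
    """Variables assigned anywhere in a block, including nested branches."""
    found = set()
    for stmt in block:
        if stmt[0] == "assign":
            found.add(stmt[1])
        else:
            found |= assigned_vars(stmt[2]) | assigned_vars(stmt[3])
    return found

def analyze(block, deps):
    """Map each variable to the input variables its final value may depend on."""
    deps = dict(deps)  # do not modify the caller's map
    for stmt in block:
        if stmt[0] == "assign":
            _, target, reads = stmt
            new = set()
            for v in reads:
                new |= deps[v]          # data dependences of the right-hand side
            deps[target] = new
        else:
            _, cond_reads, then_block, else_block = stmt
            control = set()
            for v in cond_reads:
                control |= deps[v]      # what the branch decision depends on
            then_deps = analyze(then_block, deps)
            else_deps = analyze(else_block, deps)
            touched = assigned_vars(then_block) | assigned_vars(else_block)
            merged = {}
            for v in deps:
                # consider what did happen and what might have happened
                merged[v] = then_deps[v] | else_deps[v]
                if v in touched:
                    merged[v] |= control
            deps = merged
    return deps

initial = {v: {v} for v in "wxyz"}

# x := y; y := 12; z := w + y;
straight_line = [("assign", "x", {"y"}),
                 ("assign", "y", set()),
                 ("assign", "z", {"w", "y"})]

# x := y; if (z > 0) x := w; else y := z;
if_then = [("assign", "x", {"y"}),
           ("if", {"z"}, [("assign", "x", {"w"})], [("assign", "y", {"z"})])]

print(analyze(straight_line, initial))
print(analyze(if_then, initial))
```

For the straight-line program the sketch reports that x depends on y, that y depends on nothing (it is a constant), and that z depends on w only. That is consistent with the slide's z' = h(w, y), which is a coarser but still correct approximation. For the if-then program, x is reported as depending on w, y and z: the dependence on z is the control dependence that reflects the alternative branch.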
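The "change here, visible here" pictures for selection can also be tried out directly. The sketch below is an empirical experiment under stated assumptions, not the formal definition from the paper: to sidestep the question of what the "same place" is in two query results, output rows are keyed by the index of the input row they came from, and the helper names `select_a_eq_1` and `affected_parts` are invented for this example. It changes one input field and reports which output parts differ, where a part is either a single field or the presence of a whole row.

```python
R = [(1, 2, 3), (1, 3, 4), (2, 3, 4)]  # table from the Selection slides

def select_a_eq_1(table):
    """Selection A=1: keep rows whose first field is 1, keyed by input row index."""
    return {i: row for i, row in enumerate(table) if row[0] == 1}

def affected_parts(query, table, row, col, new_values):
    """Output parts that differ for at least one replacement of table[row][col].

    A part is (row_id, col) for one field, or (row_id, None) for the
    presence or absence of a whole output row.
    """
    before = query(table)
    affected = set()
    for value in new_values:
        changed = [list(r) for r in table]
        changed[row][col] = value
        after = query([tuple(r) for r in changed])
        for rid in before.keys() | after.keys():
            if (rid in before) != (rid in after):
                affected.add((rid, None))            # row appeared or vanished
            elif before[rid] != after[rid]:
                for c in range(len(before[rid])):
                    if before[rid][c] != after[rid][c]:
                        affected.add((rid, c))       # field value changed
    return affected

# Second field of the first row: the change is copied into the output field.
print(affected_parts(select_a_eq_1, R, 0, 1, [42]))
# First field of the first row: the whole output row disappears.
print(affected_parts(select_a_eq_1, R, 0, 0, [42]))
# First field of the third row: a new output row appears.
print(affected_parts(select_a_eq_1, R, 2, 0, [1]))
# Second field of the third row: that row is never selected, so nothing changes.
print(affected_parts(select_a_eq_1, R, 2, 1, [42]))
```

The first case behaves like a data dependence and the next two like control dependences on the tested field. Trying a handful of replacement values can only reveal dependences, not rule them out, so this is a way to explore the definition rather than a check of dependency correctness.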
