Provenance as Dependency Analysis

Provenance in Databases Workshop
May 21, 2007

James Cheney
University of Edinburgh
Amal Ahmed, Umut Acar
Toyota Technological Institute, Chicago

What is provenance?
● “Where did that result come from?”
  – Where-provenance, annotation propagation
  – Idea: Trace sequence of “copies”
● “Why is that result in the output?”
  – Why-provenance, lineage
  – Idea: Trace data on which output “depends”

Previous work
● Lots!
  – “Polygen” model [Wang, Madnick '90]
  – Lineage [Cui, Widom, Wiener '00]
  – Why- and where- [Buneman Khanna Tan '01]
  – Copy-paste [Buneman, Chapman, C, Vansummeren '06, '07]
  – Semirings [Green, Karvounarakis, Tannen '07]
● (Why) Aren't we done?

Why we're not happy
There is a big gap between the strong informal motivations/descriptions
  “relevant”, “depends on”, “comes from”, “causes”
and the formal definitions of previous work on provenance.
I find most of them somewhat ad hoc, including the ones I've worked on. No offense!

What is provenance for?
● Understand/debug query
  – show what query is “really” doing
● Show what data a result “relies on”
  – What we must trust to believe result
● Show where errors in result “come from”
  – What to blame if there is an error
● Propagate annotations on input to “relevant” output

What else is good for similar problems?
● Program dependence graphs
  – for understanding/debugging programs
● Information flow analysis
  – for tracking data secrecy and integrity
● Program slicing
  – for identifying parts of program on which wrong output depends
● Key idea: dependency analysis

Goal
● Explore connection to PL ideas
  – dependency
  – information flow
  – program slicing
● Improve understanding of provenance
  – How to generalize to richer query languages?
  – What formal guarantees can/should it provide?

Simple dependency analysis
Input:    w = 5   x = 1   y = 0    z = 1
Program:  x := y;  y := 12;  z := w + y;
Output:   w = 5   x = 0   y = 12   z = 17
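The three-assignment program above can be analyzed mechanically. What follows is a minimal sketch, not the method from the talk or the paper: it assumes a hypothetical encoding of a straight-line program as a list of (target, variables read) pairs, and the name `dependencies` is made up for illustration. It computes, for each output variable, the set of input variables it may depend on.

```python
def dependencies(program, variables):
    """Map each variable to the set of INPUT variables its final value depends on.

    `program` is a list of (target, variables_read) pairs in execution order.
    This encoding is an assumption of the sketch, not part of the slides.
    """
    # Before the program runs, every variable depends only on its own input value.
    deps = {v: {v} for v in variables}
    for target, read in program:
        # The new value depends on whatever the variables it reads depend on.
        # A constant assignment reads nothing, so its dependence set is empty.
        deps[target] = set().union(*(deps[r] for r in read))
    return deps


# x := y;  y := 12;  z := w + y;
program = [("x", ["y"]), ("y", []), ("z", ["w", "y"])]
deps = dependencies(program, ["w", "x", "y", "z"])

for var in sorted(deps):
    print(var, "depends on", sorted(deps[var]))
```

Any superset of the sets computed here is still a correct, though less precise, approximation: reporting a dependence that does not exist is safe, while missing one that does exist is not.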
Simple dependency analysis
Input:    w = 5   x = 1   y = 0    z = 1
Program:  x := y;  y := 12;  z := w + y;
Output:   w = 5   x = 0   y = 12   z = 17
If we change part of input... what part of output might change?
Trace back to find last “use” (assignment)
Make use-def chains
Now simplify

Simple dependency analysis
[Diagram: inputs w, x, y, z connected to outputs w, x, y, z by edges labeled f, g, c, h]
This dependence information correctly approximates program behavior if there exist f, g, h, c such that
  w' = f(w)
  x' = g(y)
  y' = c
  z' = h(w, y)

Dependency analysis: if-then
Input:    w = 5   x = 1   y = 0   z = 1
Program:  x := y;  if (z > 0) x := w; else y := z;
Output:   w = 5   x = 5   y = 0   z = 1

Dependency analysis: if-then
Input:    w = 5   x = 1   y = 0     z = -42
Program:  x := y;  if (z > 0) x := w; else y := z;
Output:   w = 5   x = 0   y = -42   z = -42

Dependency analysis: if-then
Control dependences needed to reflect alternative branch
● To get all dependences,
  – need to consider what did happen
  – but also what might have happened

Extending to DBs: main idea
● Given input DB, query Q
● Link each “part” p of input to all “parts” of output that depend on p
  – Intuition: If we change DB at p, then linked parts may change
  – The rest of the output, which does not depend on p, must not change if we change the DB at p.

1000 words
[Figure: query Q maps DB to Q(DB); a part p of DB is linked to the parts q1, q2, q3 of Q(DB)]

Provenance links are “dependency correct” if...
Whenever DB and DB' are “the same except at p”,
then Q(DB) and Q(DB') are “the same except at q1,...,qn”
[Figure: Q maps DB to Q(DB) and DB' to Q(DB'); the inputs differ only at p and p', the outputs only at q1, q2, q3 and q1', q2', q3']

Easier said than done.
● Several challenges
  – How do we address “parts” of DB?
  – What is a “change”?
  – What do we mean by “same place” in two query results on changed input?
● Gloss over here, details in paper.
● Show basic idea using pictures
  – Much easier and more fun
  – but hard to formalize...

Selection
R          σ_A=1(R)
1 2 3      1 2 3
1 3 4      1 3 4
2 3 4

Selection
R          σ_A=1(R)
1 42 3     1 42 3
1 3 4      1 3 4
2 3 4
Change here: second field of the first row of R
Visible here: second field of the first row of the output

Selection
R          σ_A=1(R)
42 2 3     1 3 4
1 3 4
2 3 4
Change here: first field of the first row of R
Visible here: the first row is gone from the output

Selection
R          σ_A=1(R)
1 2 3      1 2 3
1 3 4      1 3 4
1 3 4      1 3 4
Change here: first field of the third row of R
Visible here: a third row appears in the output

Selection
R          σ_A=1(R)
1 2 3      1 2 3
1 3 4      1 3 4
2 3 4
“Data” dependences
“Control” dependences

Projection
R          π_2,3(R)
1 2 3      2 3
1 3 4      3 4
2 3 4      3 4
Only data dependences

Join
R        S        R JOIN S
1 2      2 3      1 2 3
1 3      2 4      1 2 4
2 3      3 4      1 3 4
                  2 3 4
Mix of dependences

Union
R        S        R ∪ S
1 2      2 3      1 2
1 3      2 4      1 3
2 3      4 4      2 3
                  2 3
                  2 4
                  4 4
Only data dependences

Difference
R        S        R - S
1 2      2 3      1 2
1 3      2 4      1 3
2 3      3 4
Only control dependences

Grouping/aggregation
R        SUM
1 2      1 5
1 3      2 3
2 3
Data dependence on aggregated field
Control dependence on grouped field

What is actually in the paper
● Precise definition & formalization of dependency correctness for NRC queries
● Proof that “minimal” dependency provenance is noncomputable
● Define provenance semantics that is dependency-correct
  – but might be “inaccurate”
● Define static approximation
  – via type-based analysis

Future work
● Negation, = handled inelegantly
  – As in most/all other techniques
  – “Equal except for” is too strong
  – Can we do better?
● Implementation
  – Prototype implementation now
  – How to scale dynamic tracking to large DBs?

Conclusions
● Dependence analysis
  – provides a solid foundation for provenance
  – can track data, row, table dependences
  – can deal with grouping, aggregation
● Minimizing nontrivial (undecidable)
  – Introduced dynamic and static approximation techniques
● Lots of possible variations to explore!
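
The selection pictures from the talk can be mimicked in a few lines. This is a rough sketch under stated assumptions, not the provenance semantics defined in the paper: rows are plain tuples, a “part” of the input is addressed by a hypothetical (row index, column index) pair, and `select_with_deps` is an illustrative name. Treating the tested field of every input row as a control dependence of the whole output is one reading of the “what might have happened” idea.

```python
def select_with_deps(rows, col, value):
    """Select rows with row[col] == value and record dependences.

    Returns (out_rows, data_deps, control_deps). Input parts are addressed
    as (row_index, column_index); this scheme is an assumption of the sketch.
    """
    out_rows, data_deps = [], []
    for i, row in enumerate(rows):
        if row[col] == value:
            out_rows.append(row)
            # Data dependences: each output field is a copy of one input field.
            data_deps.append([(i, j) for j in range(len(row))])
    # Control dependences: changing the tested field of any input row can make
    # a row appear in, or vanish from, the output.
    control_deps = {(i, col) for i in range(len(rows))}
    return out_rows, data_deps, control_deps


R = [(1, 2, 3), (1, 3, 4), (2, 3, 4)]
out_rows, data_deps, control_deps = select_with_deps(R, col=0, value=1)

# Change a data field: only the linked output field differs.
out_data, _, _ = select_with_deps([(1, 42, 3), (1, 3, 4), (2, 3, 4)], 0, 1)

# Change a tested field: a row vanishes from the output.
out_ctrl, _, _ = select_with_deps([(42, 2, 3), (1, 3, 4), (2, 3, 4)], 0, 1)

print(out_rows, out_data, out_ctrl)
```

Comparing `out_rows` with `out_data` and `out_ctrl` reproduces two of the slide pictures: the first change shows up in a single output field, the second removes a whole row.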