The Future of Provenance
James Cheney
Principles of Provenance Theme Closing Lecture
May 15, 2009
Friday, 15 May 2009

Who am I? What do I do?

Provenance is...
  Record of...
    identity
    creation
    ownership
    influences

Provenance is...
  Evidence of...
    authenticity
    integrity
    quality
    “good process”

Provenance is...
  Needed for
    accountability
    transparency
    credit
    blame

Why is data provenance important?
  For traditional (paper) information:
    Creation process leaves “paper trail”
    Easier to detect modification, copying, forgery
    Can usually judge a book by its cover
  For electronic information:
    Often no such thing as a “bit trail”
    Easy to forge, plagiarize, alter data undetected
    Can't judge a database by its cover - there isn't one
  Provenance essential for judging quality of data

eScience Motivations
  Scientific workflows
  Scientific databases
  “Electronic lab notebooks”?

Provenance helps understand problems

Scientific Workflows
  Workflows: interfaces to Grid computing
  Provenance needed to understand, repeat computation

Curated scientific data
  Data “curated” by scientists
  Manual quality control
  Provenance needed for audit, error recovery

Other motivations
  Financial
  eGovernment
  Healthcare
  ... all have needs for transparency, accountability to meet regulatory/legislative requirements

Provenance failures can be expensive

Provenance in 2004 election
  “Killian documents”/“Rathergate”

Principles of Provenance?
  What is (and isn’t) provenance?
  Why do we (think we) need it?
  How will we know when we have “enough”?
  What is hard about this “obvious” problem?
Goals
  Define, model provenance
  Encourage multi-disciplinary interaction
    Within computer science
    Between CS and natural/social sciences
  Disseminate leading research on provenance

Plan
  4 “research symposia”
    5-10 visitors, 3-5 days each
  3 “workshops”
    one added midstream
  Opening, closing, ad hoc lectures
    Umut Acar, Stuart Madnick, Michael Hicks

Workshop I (Nov 2007)
  Principles of Provenance
  Hosted by ICMS
  Retroactively adopted as Theme kickoff
  1.5 days, 15 speakers
  Helped clarify plan for Theme

Research Symposium (May 2008)
  Provenance in Databases
  Biology (Dunbar), astronomy (Mann)
  Uncertain, probabilistic databases (Green, Tannen)
  Where-provenance (Vansummeren)
  Workflow/dataflow (Kwasnikowska)
  Reflection (Van Gucht)

[Slide excerpts: S. Vansummeren, Where-Provenance Revisited]
  Revisiting the simple model of where-provenance
    Ingredients:
      Every (sub)object is identifiable
      Provenance links only if copy
      No link? → Object is constructed
  Dynamic SQL - Stored Queries
    Queries
      QueryName: varchar   QueryCode: varchar
      QA                   'SELECT A FROM R'
      QB                   'SELECT B FROM R'
    Let’s record how objects are constructed!
      DECLARE @query varchar
      SELECT QueryCode INTO @query FROM Queries WHERE QueryName = 'QA'
      EXEC @query
  Run Inference Rules
    [Figure: example tables with columns A, B, C and the inference rules; not recoverable from the extraction]

Research Symposium II (Oct 2008)
  Provenance in Workflows
  Organized by B. Ludaescher & J. Freire
  Salt Lake City, UT
  Talks by
    Barga, Simmhan, Plale
    Ludaescher, Freire, Missier
    Van den Bussche, Kwasnikowska, Hidders
  (I didn’t make it to this one)
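The where-provenance ingredients from the May 2008 symposium excerpt (identifiable sub-objects, links only on copy, no link for constructed objects) can be sketched roughly as follows. This is a minimal illustration, not the model from the talk: the contents of `R`, the cell-location encoding, and the helper names `select_column`, `construct`, and `run_stored` are all assumptions made for the example, and the query "parser" only handles the two stored queries shown on the slide.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A location identifies one cell: (table name, row index, column name).
Loc = Tuple[str, int, str]
# A cell is a value plus its where-provenance link (None = constructed).
Cell = Tuple[object, Optional[Loc]]

# Hypothetical contents for the table R (illustrative values only).
TABLES: Dict[str, List[Dict[str, int]]] = {
    "R": [{"A": 1, "B": 2}, {"A": 8, "B": 5}],
}

# Stored queries, mirroring the Queries(QueryName, QueryCode) table.
QUERIES: Dict[str, str] = {
    "QA": "SELECT A FROM R",
    "QB": "SELECT B FROM R",
}


def select_column(table: str, column: str) -> List[Cell]:
    """Copy one column: each output cell links back to the cell it copies."""
    return [(row[column], (table, i, column))
            for i, row in enumerate(TABLES[table])]


def construct(table: str, fn: Callable[[Dict[str, int]], object]) -> List[Cell]:
    """Compute new values: nothing is copied, so there is no link."""
    return [(fn(row), None) for row in TABLES[table]]


def run_stored(name: str) -> Dict[str, object]:
    """Run a stored 'SELECT <col> FROM <table>' query and record which
    stored query text the result was built from."""
    _, column, _, table = QUERIES[name].split()
    return {
        "built_from": ("Queries", name, "QueryCode"),
        "cells": select_column(table, column),
    }


copied = run_stored("QA")                                     # copies, with links
computed = construct("R", lambda row: row["A"] + row["B"])    # constructed, no links
```

Here `copied["cells"]` carries a link for every value, while every entry of `computed` has `None` as its link, which is the "no link means constructed" case. Recording `built_from` is one simple way to note how a result of dynamic SQL was constructed; other encodings are equally plausible.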
[Slide excerpts from Symposium II talks; diagrams are not recoverable from the extraction]

Jackson Nets (1/3)
  R1: Sequential place split
  R2: Sequential transition split
  R3: Loop addition
  R4: AND split
  R5: OR split
  Reducing WF net to a single place generates a type
  Rules independently established by Piotr Chrzastowski-Wachtel et al, BPM 2003
  [Per-rule net diagrams and type equations garbled in extraction]

Provenance in Science: Not a new issue
  Lab notebooks have been used for a long time
    Reproduce results
    Evidence in patent disputes
  [Figure: notebook page with annotation and observed data; DNA recombination, by Lederberg]
  Freire
  Provenance Analytics – Provenance in Workflows Symposium, 2008

Wrong Support Can Make Scientists Unhappy
  “in Taverna we worked on fancy knowledge-based descriptive techniques for services so that workflows would be composed automatically. It turned out that this wasn’t what the users wanted at all. They wanted quick ways of finding a relevant service and then they wanted help for them to build workflows themselves.”
  De Roure and Goble, 2007

III: recoverable loss of precision
  P1 = λX. X^2
  P2 = λX. f X
  P3 = λX1. λX2. X1 + X2
  Let P0:Y1 = [a1...an], P0:Y2 = [b1...bm]
  Then P1:Y = [a1^2...an^2], P2:Y = c, P3:Y = [a1^2+c ... am^2+c]
  And lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }
  “f is index-preserving”
    P2:Y = [c1...ci...cm], P3:Y = [a1^2+c1 ... ai^2+ci ... am^2+cm]
    lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2[i] }
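The lineage example above (coarse lineage when P2 collapses its input, element-level lineage when f is index-preserving) can be sketched as follows. This is an illustrative toy, not the workflow system from the talk: the `Ref` and `Item` encodings, the helper names, and the sample input values are assumptions, and `sum` merely stands in for some non-index-preserving f.

```python
from typing import Callable, List, Optional, Set, Tuple

# A lineage reference names a source port and one index into its list;
# index None means "the whole list" (coarse-grained lineage).
Ref = Tuple[str, Optional[int]]
# An item is a value paired with the set of references it depends on.
Item = Tuple[float, Set[Ref]]


def source(port: str, values: List[float]) -> List[Item]:
    """Wrap raw inputs so that element i depends on (port, i)."""
    return [(v, {(port, i)}) for i, v in enumerate(values)]


def elementwise(fn: Callable[[float], float], xs: List[Item]) -> List[Item]:
    """Index-preserving processor: output[i] depends only on input[i]."""
    return [(fn(v), set(lin)) for v, lin in xs]


def collapse(fn: Callable[[List[float]], float], port: str,
             xs: List[Item]) -> Item:
    """Non-index-preserving processor: one output depending on the whole list."""
    return (fn([v for v, _ in xs]), {(port, None)})


def add_lists(xs: List[Item], ys: List[Item]) -> List[Item]:
    """P3 when both inputs are lists: add pairwise and union the lineage."""
    return [(a + b, la | lb) for (a, la), (b, lb) in zip(xs, ys)]


def add_scalar(xs: List[Item], c: Item) -> List[Item]:
    """P3 when the second input is a single value c."""
    cv, cl = c
    return [(a + cv, la | cl) for a, la in xs]


y1 = source("P0:Y1", [1.0, 2.0, 3.0])      # hypothetical a1..a3
y2 = source("P0:Y2", [10.0, 20.0, 30.0])   # hypothetical b1..b3
p1 = elementwise(lambda v: v * v, y1)      # P1 squares each element

# Case 1: f collapses its input, so P3:Y[i] depends on all of P0:Y2.
coarse = add_scalar(p1, collapse(sum, "P0:Y2", y2))
# Case 2: f is index-preserving, so P3:Y[i] depends only on P0:Y2[i].
fine = add_lists(p1, elementwise(lambda v: v + 1.0, y2))
```

With these inputs, `coarse[1]` has lineage `{("P0:Y1", 1), ("P0:Y2", None)}` while `fine[1]` has `{("P0:Y1", 1), ("P0:Y2", 1)}`, matching the two lineage sets on the slide.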
Workshop II (Feb 2009)
  Theory and Practice of Provenance
  Support from USENIX
  Collocated with File and Storage Technologies (FAST2009), San Francisco, CA
  9-member PC reviewed & selected program
  2 invited speakers
  5 full papers
  8 short papers

[Slide excerpts from Workshop II talks; figures are not recoverable from the extraction]
  File System Provenance
    User Actions: cp /a /b, mv /b /bakfldr/
    I verb nouns => I process files
    From Footprints to Paths
  USENIX TAPP 09--Story Book: An Efficient Extensible Provenance Framework
    [Tables with columns Action, Preferred, Left Parent, Right Parent, Source and Seq num, KeyVal, AttrName, AttrVal; rows garbled]
  Provenance Tracking Semantics
    (Provenance Tracking) Operational Semantics
    [Reduction rule annotated “provenance aggregation”; garbled]

Research Symposium III (Mar 2009)
  Provenance in Software Systems
  Explore connections with programming languages & software engineering
  Sensor networks & scientific data (Skalka)
  Bidirectional computation (Foster)
  Traceability (Stevens)
  Audit (Vaughan)
  Security (Chong)

[Slide excerpts from Symposium III talks]
  Thinking about decryption failures.
    [Figure not recoverable]

    transformation Sim (m1 : MM ; m2 : MM) {
      top relation ContainersMatch {
        inter1,inter2 : MM::Inter;
        checkonly domain m1 c1:Container {inter = inter1};
        checkonly domain m2 c2:Container {inter = inter2};
        where {IntersMatch (inter1,inter2);}
      }
      relation IntersMatch {
        thing1,thing2 : MM::Thing;
        checkonly domain m1 i1:Inter {thing = thing1};
        checkonly domain m2 i2:Inter {thing = thing2};
        where {ThingsMatch (thing1,thing2);}
      }
      relation ThingsMatch {
        s : String;
        checkonly domain m1 thing1:Thing {value = s};
        checkonly domain m2 thing2:Thing {value = s};
      }
    }

    [Diagram: Model M1 and Model M2, with objects xa:Container, xi:Inter, xc:Thing, xd:Thing, a1:Container, i1:Inter, tc1:Thing and values "c", "d"]

  Data security
    T satisfies data security for execution 〈l1=v1, …, ln=vn ; c〉 ⇓ v if:
      〈l1=v1, …, ln=vn ; c〉 ⇓ v ∈ T
      and for any execution 〈l1=w1, …, ln=wn ; c〉 ⇓ w
        if vj = wj for all low security lj    (if inputs look the same)
        then 〈l1=w1, …, ln=wn ; c〉 ⇓ w ∈ T    (then T describes execution)
    Semantics for Provenance Security, Stephen Chong, Harvard University.

Workshop III (Apr 2009)
  Use Cases for Provenance
  Added mid-stream
  Goals: Elicit ambitious goals, “use cases” from scientific or other users or from people working directly with them
  Was not easy to find speakers!

[Slide excerpts from Workshop III talks; figures are not recoverable from the extraction]
  Dam Safety Control Scenario (3/8)
    Concepts – legacy data
    Highly heterogeneous data: e.g., archive must comprise legacy information as project drawings and handwritten observations!
    National Laboratory of Civil Engineering & INESC-ID: Information Systems Group
    Digital Preservation of heterogeneous Data, 22-04-2009
  Hypotheses confirmed
    [Image panels labelled Zim1,E12.5; E43rik,E12.5; ApoBEC2,E11.5; HoxA2,E12.5]
  Why are those questions so important?
    WELL → Becomes → Coded into → Used to constrain → Map or Model
    The raw data
      from 1867
      obsolete terminology
      drilled for water NOT geology!!!
    © NERC All rights reserved

Research Symposium IV (May 2009)
  Provenance in Secure & Advanced Systems
  Connect to security, privacy, audit in file, DB, OS research
  Databases (Miklau, Re)
  Security (Chapman, LeFevre, Martin, Hicks)
  Systems (Seltzer, Gehani)

[Slide excerpts from Symposium IV talks]
  Secure Logging Infrastructure
    Jun Ho Huh and Andrew Martin (2008)
  Why Integrate Provenance?
    Implemented Fable as part of the Links web programming language
      We call it “security-enhanced” Links
  More Forensic Analysis
    Check digital chain of custody from f0 to f1
      Chain(f0, f1) := Chain(f, f1) ∧ Output(p, f) ∧ Input(p, f0) ∧ e :! Output(p, f) ∧ e :! Input(p, f0)
      (the relation extracted as “e :!” is an unrecovered symbol)
    Find files derived from f0
      Derivatives(f0) := Input(p, f0) ∧ Output(p, f1) ∧ Derivatives(f1)
    System Support for Forensic Inference, PSACS: May 2009
  Log table, History table and Example Queries
    Log table
      Columns: client, IP, time, type, eid, name, dept, sal
      [Rows scrambled in extraction: ins/upd/del events by clients Jack and Kate on eid 101 and 201]
    History table
      eid  name   dept   sal  from  to
      101  Bob    Sales  10   0     100
      101  Bob    Sales  12   100   200
      101  Bob    Mgmt   12   200   300
      101  Bob    Mgmt   15   300   now
      201  Chris  HR     8    0     300
      201  Chris  Mgmt   10   300   500
    Queries: Q1, Q2 [query text not recoverable]

Lessons Learned
  What is provenance?
    need formal definitions
  Why do we need it?
    Need clear use cases/goals
  When do we have “enough”?
    complete “causal”, “dependence” chain?
  Key challenges

What is provenance?
  Information that...
    links “real world” entities to electronic data
    places data “in context”
    “explains” relationship between input & output of a (computational) process
    gives evidence for (or against) accepting data at face value
  Theme activities led to formal definitions and more exploration of design space

Why do we need provenance?
  Use cases workshop:
    Bioinformatics analysis/discovery
    Engineering failure analysis
    Social science & policymaking
    Security
    Data reuse & scientific recordkeeping

What is “enough” provenance?
  Still not so clear
  Need clear definition of “enough” for each application
  Some guidelines:
    “Causal models”, “actual cause”?
    Capturing all “true dependences”?

Key challenges
  Combine insights from different parts
    workflow vs data provenance
    security
  Build systems exhibiting new ideas
    both practical and theoretical challenges
  “Provenance everywhere”?
    both benefits and risks

Key challenges
  Identified causality as a key concept
    see also information flow, dependence
  But still a lot to do...

What would Hume do?
  Is this all in our heads?
    as Hume believed about causality?
  Jury’s still out

A cross-cutting concern
  Provenance a recurring theme in many aspects of CS
  Should be studied on its own, much like
    concurrency
    security
    incremental computation

Future steps
  Plan to apply for a Dagstuhl Seminar on provenance
    hopefully, invite everyone involved in theme
    and people who weren’t able to make it
  Follow-up publications planned
  Research (finally!)

Evaluating the Theme
  Did we succeed?
    Well... we were careful not to promise concrete results
    Unrealistic in 1 year anyway
  Encouraging signs, even so:
    ~5 surveys, invited papers, other publications by TLs
    Positive comments from participants
    USENIX impressed with TaPP workshop
      will support for 2-3 years

Irresponsible Prognostication
  Next 10 years
  Over next 10 years...
    rich provenance tracking in most computer systems
    Web standards, interoperability
    laws and regulatory guidelines
    security and privacy challenges to overcome
  (Vision courtesy of Margo Seltzer)

Conclusions
  This concludes the Principles of Provenance Theme
  It has been a lot of work...
  Also a lot of fun
  http://wiki.esi.ac.uk/Principles_of_Provenance

Acknowledgments
  Thanks to
    > 40 Theme invited speakers
    Peter Buneman
    Bertram Ludaescher
    Anna Kenway, Lee Callaghan, and everyone else at eSI