The Future of
Provenance
James Cheney
Principles of Provenance Theme
Closing Lecture
May 15, 2009
Friday, 15 May 2009
Who am I? What do I do?
Provenance is...
Record of...
identity
creation
ownership
influences
Provenance is...
Evidence of...
authenticity
integrity
quality
“good process”
Provenance is...
Needed for
accountability
transparency
credit
blame
Why is data provenance important?
For traditional (paper) information:
Creation process leaves “paper trail”
Easier to detect modification, copying, forgery
Can usually judge a book by its cover
For electronic information:
Often no such thing as a “bit trail”
Easy to forge, plagiarize, alter data undetected
Can't judge a database by its cover - there isn't one
Provenance essential for judging quality of data
eScience Motivations
Scientific workflows
Scientific databases
“Electronic lab notebooks”?
Provenance helps
understand problems
Scientific Workflows
Workflows: interfaces to Grid computing
Provenance needed to understand, repeat computation
Curated scientific data
Data “curated” by scientists
Manual quality control
Provenance needed for audit, error recovery
Other motivations
Financial
eGovernment
Healthcare
... all have needs for
transparency, accountability
to meet regulatory/legislative requirements
Provenance failures
can be expensive
Provenance in 2004 election
“Killian documents”/“Rathergate”
Principles of Provenance?
What is (and isn’t) provenance?
Why do we (think we) need it?
How will we know when we have
“enough”?
What is hard about this “obvious”
problem?
Goals
Define, model provenance
Encourage multi-disciplinary interaction
Within computer science
Between CS and natural/social sciences
Disseminate leading research on
provenance
Plan
4 “research symposia”
5-10 visitors, 3-5 days each
3 “workshops”
one added midstream
Opening, closing, ad hoc lectures
Umut Acar, Stuart Madnick, Michael Hicks
Workshop I (Nov 2007)
Principles of Provenance
Hosted by ICMS
Retroactively adopted as Theme kickoff
1.5 days, 15 speakers
Helped clarify plan for Theme
Research Symposium I (May 2008)
Provenance in Databases
Biology (Dunbar), astronomy (Mann)
Uncertain, probabilistic databases (Green, Tannen)
Where-provenance (Vansummeren)
Workflow/dataflow (Kwasnikowska)
Reflection (Van Gucht)
Where-Provenance Revisited (S. Vansummeren)
Revisiting the simple model of where-provenance
Ingredients:
Every (sub)object is identifiable
Provenance links only if copy
No link? → Object is constructed
Let's record how objects are constructed!
Dynamic SQL - Stored Queries
Queries table (QueryName: varchar, QueryCode: varchar):
QA   'SELECT A FROM R'
QB   'SELECT B FROM R'
DECLARE @query varchar
SELECT QueryCode INTO @query
FROM Queries
WHERE QueryName = 'QA'
EXEC @query
Run Inference Rules
[Figure: example tables over columns A, B, C and the inference rules; not recoverable from the extracted text]
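The where-provenance ingredients on the Vansummeren slide (identifiable objects, links only for copies, no link for constructed objects) can be illustrated with a small sketch. This is only an illustration under assumptions: the names `Cell`, `new_cell`, `copy_cell`, `add_cells` and `origin` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_ids = count(1)  # hypothetical identifier supply: every (sub)object is identifiable


@dataclass(frozen=True)
class Cell:
    ident: int
    value: object
    source: Optional[int] = None  # where-provenance link; None means "constructed"


def new_cell(value):
    """Create a cell with no provenance link (constructed, not copied)."""
    return Cell(next(_ids), value)


def copy_cell(cell):
    """Copy a cell; the copy links back to the cell it was copied from."""
    return Cell(next(_ids), cell.value, source=cell.ident)


def add_cells(a, b):
    """Compute a new value; the result is constructed, so it carries no link."""
    return new_cell(a.value + b.value)


def origin(cell, store):
    """Follow copy links back to the cell that was originally constructed."""
    while cell.source is not None:
        cell = store[cell.source]
    return cell
```

In this sketch a copied value can always be traced to its origin, while a computed value such as a sum stops the trace, which is the gap the slide proposes to close by recording how objects are constructed.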
Research Symposium II (Oct 2008)
Provenance in Workflows
Organized by B. Ludaescher & J. Freire
Salt Lake City, UT
Talks by
Barga, Simmhan, Plale
Ludaescher, Freire, Missier
Van den Bussche, Kwasnikowska, Hidders
(I didn’t make it to this one)
Provenance in Science: Not a new issue (Freire)
Lab notebooks have been used for a long time
– Reproduce results
– Evidence in patent disputes
[Figure: lab notebook page showing observed data and annotation; DNA recombination, by Lederberg]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Jackson Nets (1/3)
Reducing WF net to a single place generates a type
R1: Sequential place split
R2: Sequential transition split
R3: Loop addition
R4: AND split
R5: OR split
Rules independently established by Piotr Chrzastowski-Wachtel et al, BPM 2003
[Figure: one net fragment and one type equation per rule; symbols not recoverable from the extracted text]
Wrong Support Can Make Scientists Unhappy
“In Taverna we worked on fancy knowledge-based descriptive techniques for services so that workflows would be composed automatically. It turned out that this wasn’t what the users wanted at all. They wanted quick ways of finding a relevant service and then they wanted help for them to build workflows themselves.”
De Roure and Goble, 2007
III: recoverable loss of precision
[Figure: dataflow in which P0 (outputs Y1, Y2) feeds P1 and P2 (input X, output Y), whose outputs feed P3 (inputs X1, X2, output Y); ports typed s or l(s)]
P1 = λX. X^2
P2 = λX. f X
P3 = λX1. λX2. X1 + X2
Let P0:Y1 = [a1...an], P0:Y2 = [b1...bm]
Then P1:Y = [a1^2...an^2], P2:Y = c
P3:Y = [a1^2+c ... am^2+c]
And lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }
If “f is index-preserving”:
P2:Y = [c1...ci...cm]
P3:Y = [a1^2+c1 ... ai^2+ci ... am^2+cm]
lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2[i] }
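The lineage example from the workflow symposium slides (P1 squares its input, P2 applies f, P3 adds the two results) can be sketched as follows. This is a minimal sketch under assumptions: the function names and the encoding of lineage as (port, index) pairs are hypothetical, and `None` as an index stands for "the whole list".

```python
def run_workflow(y1, y2, f):
    """Run P1, P2, P3 on the two outputs of P0."""
    p1 = [a * a for a in y1]            # P1 squares each element
    p2 = f(y2)                          # P2 applies f to the whole list
    if isinstance(p2, list):            # f returned one value per element
        return [x + c for x, c in zip(p1, p2)]
    return [x + p2 for x in p1]         # f returned a single value c


def lineage_p3(i, index_preserving):
    """Lineage of P3:Y[i] as a set of (port, index) pairs."""
    deps = {("P0:Y1", i)}               # P1:Y[i] depends only on P0:Y1[i]
    if index_preserving:
        deps.add(("P0:Y2", i))          # element i of f's output comes from element i
    else:
        deps.add(("P0:Y2", None))       # f is opaque: depend on all of P0:Y2
    return deps
```

Knowing that f is index-preserving shrinks the lineage of each output element from the whole of P0:Y2 to a single element, which is the precision the slide describes as recoverable.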
Workshop II (Feb 2009)
Theory and Practice of Provenance
Support from USENIX
Co-located with the USENIX Conference on File and Storage Technologies (FAST 2009), San Francisco, CA
9-member PC reviewed & selected program
2 invited speakers
5 full papers
8 short papers
File System Provenance
User Actions:
cp /a /b
mv /b /bakfldr/
I verb nouns => I process files
From Footprints to Paths
[Figure: provenance graph over a, b and bakfldr/ with edges labeled cp and mv; 2/23/2009]
Story Book: An Efficient Extensible Provenance Framework (USENIX TaPP '09)
[Figure: provenance tables with columns Action, Preferred, Left Parent, Right Parent, Source and Seq num, KeyVal, AttrName, AttrVal, and a graph of values annotated with sets such as {document3}, {document5}, {user input}, {102, cas}, {102, wounded}, {104, cas}, {104, injured}, {104, wounded}; layout not recoverable from the extracted text]
Provenance Tracking Semantics
(Provenance Tracking) Operational Semantics
[Rule: p outputs value v with provenance κv on channel a with provenance κa, in parallel with Q; the resulting message on a carries v with the aggregated provenance snd(p, κa); κv]
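The "From Footprints to Paths" idea, turning user actions such as cp /a /b and mv /b /bakfldr/ into a derivation path for a file, can be sketched as below. This is an illustration only: the function names are hypothetical, and it assumes that a destination ending in a slash means "into this folder, keeping the file name".

```python
import posixpath


def apply_actions(actions):
    """Map each resulting path to the (verb, source path) that produced it."""
    derived = {}
    for verb, src, dst in actions:
        if dst.endswith("/"):  # assumption: trailing slash means "into this folder"
            dst = posixpath.join(dst, posixpath.basename(src))
        derived[dst] = (verb, src)
    return derived


def path_history(path, derived):
    """Walk derivation edges back from a file, oldest action first.

    Assumes the recorded actions contain no cycles.
    """
    steps = []
    while path in derived:
        verb, src = derived[path]
        steps.append((verb, src, path))
        path = src
    return list(reversed(steps))


actions = [("cp", "/a", "/b"), ("mv", "/b", "/bakfldr/")]
derived = apply_actions(actions)
history = path_history("/bakfldr/b", derived)
# history == [("cp", "/a", "/b"), ("mv", "/b", "/bakfldr/b")]
```

The sketch records only one edge per resulting path, so a later action that overwrites a file replaces the earlier edge; a real system would need versioned nodes.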
Research Symposium III (Mar 2009)
Provenance in Software Systems
Explore connections with programming
languages & software engineering
Sensor networks & scientific data (Skalka)
Bidirectional computation (Foster)
Traceability (Stevens)
Audit (Vaughan)
Security (Chong)
transformation Sim (m1 : MM ; m2 : MM)
{
  top relation ContainersMatch
  {
    inter1,inter2 : MM::Inter;
    checkonly domain m1 c1:Container {inter = inter1};
    checkonly domain m2 c2:Container {inter = inter2};
    where {IntersMatch (inter1,inter2);}
  }
  relation IntersMatch
  {
    thing1,thing2 : MM::Thing;
    checkonly domain m1 i1:Inter {thing = thing1};
    checkonly domain m2 i2:Inter {thing = thing2};
    where {ThingsMatch (thing1,thing2);}
  }
  relation ThingsMatch
  {
    s : String;
    checkonly domain m1 thing1:Thing {value = s};
    checkonly domain m2 thing2:Thing {value = s};
  }
}
[Figure: Model M1 and Model M2, with objects xa:Container, xi:Inter, xc:Thing, xd:Thing, a1:Container, i1:Inter, tc1:Thing and values "c" and "d"]
Thinking about decryption failures.
Data security
T satisfies data security for execution 〈l1=v1, …, ln=vn ; c〉 ⇓ v if:
T describes 〈l1=v1, …, ln=vn ; c〉 ⇓ v, and
for any execution 〈l1=w1, …, ln=wn ; c〉 ⇓ w,
if vj = wj for all low security lj (if inputs look the same)
then T describes 〈l1=w1, …, ln=wn ; c〉 ⇓ w
Semantics for Provenance Security, Stephen Chong, Harvard University.
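The data security condition on the Chong slide can be made concrete with a small brute-force check. This is a sketch under strong assumptions, not the formal semantics from the talk: a program is modeled as a Python function from a store to a value, the trace T is modeled as a predicate `describes(store, value)`, and alternative stores are enumerated over a small finite `domain`. All of these names are hypothetical.

```python
from itertools import product


def satisfies_data_security(describes, program, store, low, domain):
    """Check that T (given as `describes`) reveals nothing about high inputs.

    T must describe the actual execution, and also every execution whose
    low-security inputs agree with the actual store.
    """
    if not describes(store, program(store)):
        return False
    names = sorted(store)
    for combo in product(domain, repeat=len(names)):
        other = dict(zip(names, combo))
        if all(other[name] == store[name] for name in low):
            if not describes(other, program(other)):
                return False
    return True


def program(s):
    """The output depends only on the low-security input x."""
    return s["x"] + 1


def t_low(s, v):
    """A trace that mentions only x and the output."""
    return s["x"] == 1 and v == 2


def t_high(s, v):
    """A trace that also pins down the high-security input h."""
    return s["x"] == 1 and s["h"] == 0 and v == 2


store = {"x": 1, "h": 0}
ok = satisfies_data_security(t_low, program, store, low={"x"}, domain=[0, 1])
leaky = satisfies_data_security(t_high, program, store, low={"x"}, domain=[0, 1])
```

Here `ok` is true because `t_low` describes every execution that agrees on x, while `leaky` is false because `t_high` stops describing the execution once h changes, so seeing that trace would reveal h.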
Workshop III (Apr 2009)
Use Cases for Provenance
Added mid-stream
Goals:
Elicit ambitious goals, “use cases”
from scientific or other users
or from people working directly with them
Was not easy to find speakers!
Dam Safety Control Scenario (3/8)
Concepts – legacy data
Highly heterogeneous data: e.g., archive must comprise legacy information such as project drawings and handwritten observations!
National Laboratory of Civil Engineering & INESC-ID: Information Systems Group
Digital Preservation of heterogeneous Data, 22-04-2009
Hypotheses confirmed
[Figure: image panels labeled Zim1,E12.5; E43rik,E12.5; ApoBEC2,E11.5; HoxA2,E12.5]
Why are those questions so important?
[Figure: flow diagram with nodes WELL, The raw data, and Map or Model, and arrows labeled "Becomes", "Coded into", "Used to constrain"]
The raw data: from 1867, obsolete terminology, drilled for water NOT geology!!!
© NERC All rights reserved
Research Symposium IV (May 2009)
Provenance in Secure & Advanced Systems
Connect to security, privacy, audit
in file, DB, OS research
Databases (Miklau, Re)
Security (Chapman, LeFevre, Martin, Hicks)
Systems (Seltzer, Gehani)
Secure Logging Infrastructure
Jun Ho Huh and Andrew Martin (2008)
Why Integrate Provenance? More Forensic Analysis
Check digital chain of custody from f0 to f1
Chain(f0, f1) := Chain(f, f1) ∧ Output(p, f) ∧ Input(p, f0) ∧ e :! Output(p, f) ∧ e :! Input(p, f0)
Find files derived from f0
Derivatives(f0) := Input(p, f0) ∧ Output(p, f1) ∧ Derivatives(f1)
System Support for Forensic Inference – p. 16/17
Implemented Fable as part of the Links web programming language
– We call it “security-enhanced” Links
PSACS: May 2009
Log table, History table and Example Queries
Log table
client  IP     time  type  eid  name   dept   sal
Jack    1.1.1  0     ins   101  Bob    Sales  10
Jack    2.1.1  100   upd   101  -      -      12
Kate    3.1.1  200   upd   101  -      Mgmt   -
Kate    4.1.1  300   upd   101  -      -      15
Jack    1.1.1  0     ins   201  Chris  HR     8
Jack    2.1.1  300   upd   201  -      Mgmt   10
Kate    4.1.1  500   del   201  -      -      -
History table
eid  name   dept   sal  from  to
101  Bob    Sales  10   0     100
101  Bob    Sales  12   100   200
101  Bob    Mgmt   12   200   300
101  Bob    Mgmt   15   300   now
201  Chris  HR     8    0     300
201  Chris  Mgmt   10   300   500
[Example queries Q1 and Q2: text not recoverable from the extracted slide]
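The relationship between the Log table and the History table on the slide above can be sketched by replaying the log. This is a minimal sketch under assumptions: `build_history` is a hypothetical name, "-" in an update is read as "unchanged", "now" marks a row that is still current, and the client and IP columns are left out because the history does not use them.

```python
LOG = [  # (time, type, eid, name, dept, sal), as read from the slide's Log table
    (0, "ins", 101, "Bob", "Sales", 10),
    (100, "upd", 101, "-", "-", 12),
    (200, "upd", 101, "-", "Mgmt", "-"),
    (300, "upd", 101, "-", "-", 15),
    (0, "ins", 201, "Chris", "HR", 8),
    (300, "upd", 201, "-", "Mgmt", 10),
    (500, "del", 201, "-", "-", "-"),
]


def build_history(log):
    """Replay log rows into (eid, name, dept, sal, from, to) history rows."""
    current = {}   # eid -> (name, dept, sal, valid_from)
    history = []
    for time, kind, eid, name, dept, sal in sorted(log, key=lambda row: row[0]):
        old = current.pop(eid, None)
        if old is not None:
            # Any new event for this eid closes the currently valid version.
            history.append((eid, *old[:3], old[3], time))
        if kind == "ins":
            current[eid] = (name, dept, sal, time)
        elif kind == "upd" and old is not None:
            merged = tuple(o if n == "-" else n
                           for o, n in zip(old[:3], (name, dept, sal)))
            current[eid] = (*merged, time)
        # "del": the version stays closed and nothing new is opened
    for eid, (name, dept, sal, start) in current.items():
        history.append((eid, name, dept, sal, start, "now"))
    return sorted(history, key=lambda row: (row[0], row[4]))


HISTORY = build_history(LOG)
```

Replaying the seven log rows yields the six history rows shown on the slide, for example `(101, "Bob", "Mgmt", 15, 300, "now")` for the version of employee 101 that is still valid.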
Lessons Learned
What is provenance?
Need formal definitions
Why do we need it?
Need clear use cases/goals
When do we have “enough”?
complete “causal”, “dependence” chain?
Key challenges
What is provenance?
Information that...
links “real world” entities to electronic data
places data “in context”
“explains” relationship between input & output of a
(computational) process
gives evidence for (or against) accepting data at face
value
Theme activities led to formal definitions
and more exploration of design space
Why do we need provenance?
Use cases workshop:
Bioinformatics analysis/discovery
Engineering failure analysis
Social science & policymaking
Security
Data reuse & scientific recordkeeping
What is “enough” provenance?
Still not so clear
Need clear definition of “enough” for
each application
Some guidelines:
“Causal models”, “actual cause”?
Capturing all “true dependences”?
Key challenges
Combine insights from different areas
workflow vs data provenance
security
Build systems exhibiting new ideas
both practical and theoretical challenges
“Provenance everywhere”?
both benefits and risks
Key challenges
Identified causality as a key concept
see also information flow, dependence
But still a lot to do...
What would Hume do?
Is this all in our heads,
as Hume believed about causality?
Jury’s still out
A cross-cutting concern
Provenance is a recurring theme in many
areas of CS
Should be studied on its own, much like
concurrency
security
incremental computation
Future steps
Plan to apply for a Dagstuhl Seminar on
provenance
hopefully, invite everyone involved in theme
and people who weren’t able to make it
Follow-up publications planned
Research (finally!)
Evaluating the Theme
Did we succeed?
Well... we were careful not to promise concrete results
Unrealistic in 1 year anyway
Encouraging signs, even so:
~5 surveys, invited papers, other publications by TLs
Positive comments from participants
USENIX impressed with TaPP workshop
will support it for 2-3 years
Irresponsible Prognostication
Over the next 10 years...
rich provenance tracking in most computer
systems
Web standards, interoperability
laws and regulatory guidelines
security and privacy challenges to overcome
(Vision courtesy of Margo Seltzer)
Conclusions
This concludes the Principles of Provenance
Theme
It has been a lot of work...
Also a lot of fun
http://wiki.esi.ac.uk/Principles_of_Provenance
Acknowledgments
Thanks to
> 40 Theme invited speakers
Peter Buneman
Bertram Ludaescher
Anna Kenway, Lee Callaghan, and
everyone else at eSI