Layering in Provenance Systems Provenance in Secure and Advanced Computer Systems

Layering in Provenance Systems
Margo Seltzer
May 13, 2009
Provenance in Secure and Advanced Computer Systems
The Vision: Provenance Everywhere
• All data has provenance.
• Applications generate provenance.
• Systems generate provenance.
• Users generate provenance.
• Provenance is:
– Secure.
– Queryable.
– Globally searchable.
• There are provenance-aware algorithms.
PSACS: May 2009
The Problem: Provenance Comes from Different Places
• Depending on the source, provenance is attached to different kinds of objects:
– Operating system: files
– Database systems: tuples
– Workflow engines: objects
– Applications:
• Variables (from an interpreter)
• Links (from a browser)
Data are related
• Tuples live in files.
• Files comprise data sets.
• Browsers write files.
• Variables relate to each other.
• Objects may be files, tuples, or data sets.
Must integrate provenance from different representations.
Why Integrate Provenance?
Outline
• Provenance disclosure and integration
• Layering and provenance
• Parting remarks
Provenance Observation versus Disclosure
• Disclosed provenance:
– Provenance that is explicitly provided.
– Provider understands semantics of the data referenced by
provenance.
– Example: This image is the result of aligning these other two
images.
• Observed provenance:
– Provenance deduced by interpreting events.
– Observer translates event into a provenance relationship.
– Example: Process P wrote file F, therefore file F depends on
process P.
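The observation step can be sketched in C as a function that turns one event into one dependency. The `event` and `dependency` types and the `observe` function are hypothetical names made up for illustration, and the rule for reads (the process depends on the file it read) is an assumption that the slide does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event and dependency types, for illustration only. */
enum event_kind { EV_READ, EV_WRITE };

struct event {
    enum event_kind kind;
    const char *process;   /* e.g. "P" */
    const char *file;      /* e.g. "F" */
};

struct dependency {
    const char *dependent; /* the object that depends ... */
    const char *source;    /* ... on this object */
};

/* Observation: translate one event into one provenance relationship.
 * A write makes the file depend on the process (the slide's example).
 * A read makes the process depend on the file (an assumption here). */
struct dependency observe(struct event ev)
{
    struct dependency d;

    assert(ev.process != NULL && ev.file != NULL);
    if (ev.kind == EV_WRITE) {
        d.dependent = ev.file;
        d.source = ev.process;
    } else {
        d.dependent = ev.process;
        d.source = ev.file;
    }
    return d;
}
```

Given a write event for process P and file F, `observe` returns a dependency whose `dependent` is F and whose `source` is P. A disclosing application would skip this translation and hand over the relationship directly.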
Your observed provenance is my disclosed provenance.
• The distinction between observed and
disclosed provenance is one of vantage point.
• A file system observes that the workflow
engine produced the file atlas.x.gif.
• The workflow engine can disclose that
atlas.x.gif is the result of a 5-step
process that began with reading warp.air.
Problem Overview
• Systems capture provenance at different levels of abstraction:
– File systems: files and processes
– Database systems: tuples and queries
– Workflow engines: objects and operators
– Interpreters: variables and operations
– Browsers: URLs and traversals
• Users want to query across these abstractions.
Use Case: PA-Browser
• Browsers capture a user’s search and traversal
patterns.
• Action: User inadvertently downloads a virus.
• Without layering:
– Browser knows this came from virus.com.
– File system knows what files were affected.
• With layering:
– How did user get to the virus?
– What else was downloaded from that site?
– Are there other files that might be similarly tainted?
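Once browser-level and file-system-level records share one graph, the "similarly tainted" question reduces to reachability. A minimal sketch, assuming nodes are small integer ids and each edge records that one node was derived from another. The `edge` type, `MAX_NODES`, and `mark_tainted` are hypothetical names, not part of any system described in the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16

/* Edge: node `to` was derived from node `from` (ids are array indices). */
struct edge { int from; int to; };

/* Marks in tainted[] every node transitively derived from `origin`.
 * Repeats passes over the edge list until no new node is marked. */
void mark_tainted(const struct edge *edges, size_t nedges,
                  int origin, bool tainted[MAX_NODES])
{
    bool changed = true;

    assert(origin >= 0 && origin < MAX_NODES);
    for (int i = 0; i < MAX_NODES; i++)
        tainted[i] = false;
    tainted[origin] = true;

    while (changed) {
        changed = false;
        for (size_t i = 0; i < nedges; i++) {
            assert(edges[i].from >= 0 && edges[i].from < MAX_NODES);
            assert(edges[i].to >= 0 && edges[i].to < MAX_NODES);
            if (tainted[edges[i].from] && !tainted[edges[i].to]) {
                tainted[edges[i].to] = true;
                changed = true;
            }
        }
    }
}
```

With node 0 standing for the URL at virus.com, a browser-layer edge 0→1 (the downloaded file) and a file-system-layer edge 1→2 (a file later written from it) are both followed, so the query crosses the layer boundary without any name translation.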
Use Case: PA-Python Applications
• Python wrappers generate a trace of processing steps
internal to Python.
• Usage: Program reads 100 input files, uses two of
them to produce a graph.
• Without layering:
– Python knows which files were actually used to produce the
graph.
– File system knows that Python read 100 files and produced
an output file.
• With layering:
– Can identify that two input files lead directly to the output file.
Integrating Requires Layering
• Layering implies that provenance collection and
tracking systems interact directly with one another.
• Why not a centralized provenance repository?
– Requires a mechanism to translate names.
– Every participant must agree on naming convention.
– Must be able to generate references to objects created by
other participants.
– What happens when you add a new participant with a new
naming mechanism?
• Layering provides a natural way to transmit and
integrate provenance.
Outline
• Provenance disclosure and integration
• Layering and provenance
• Parting remarks
Provenance-Aware Agents
• An agent that is provenance-aware:
– Accepts disclosed provenance from others.
– Observes events and generates provenance from
them.
– Discloses provenance to others.
• Implications:
– Both input and output are disclosed provenance
– Participation in an integrated provenance-aware
system requires an API for disclosed provenance.
DPAPI: The Disclosed Provenance API
• Grew out of our experience designing and
building PASS (Provenance-Aware Storage
Systems).
• Used as the universal internal API between
components in the PASS architecture.
• Used to extend PASS to NFS.
• Used by provenance-aware applications.
• Has evolved through three generations.
PSACS: May 2009
15
DPAPI Concepts
• Pnode
– Unique ID assigned at object creation.
– Never recycled.
– Used to access an object’s provenance.
• Provenance record
– An attribute/value pair.
– Plain value or cross-reference.
• Version
– Objects change; changes are reflected in
versions.
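One way to picture these three concepts is as C data types. This is a sketch under stated assumptions: `pnode_id`, `version_no`, `prov_record`, and the helper functions are hypothetical stand-ins, not the actual DPAPI definitions, and the 64-bit and 32-bit widths are arbitrary choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; the real DPAPI definitions may differ. */
typedef uint64_t pnode_id;   /* unique per object, never recycled */
typedef uint32_t version_no; /* incremented as the object changes */

enum value_kind { VAL_PLAIN, VAL_XREF };

/* A cross-reference names a specific version of another object. */
struct xref { pnode_id pnode; version_no version; };

/* A provenance record: an attribute plus a plain value or a cross-reference. */
struct prov_record {
    const char *attribute;
    enum value_kind kind;
    union {
        const char *plain;
        struct xref ref;
    } value;
};

/* Ids only ever increase, so a pnode is never handed out twice. */
static pnode_id next_pnode = 1;

pnode_id alloc_pnode(void)
{
    return next_pnode++;
}

struct prov_record make_plain(const char *attr, const char *v)
{
    struct prov_record r;

    assert(attr != NULL && v != NULL);
    r.attribute = attr;
    r.kind = VAL_PLAIN;
    r.value.plain = v;
    return r;
}

struct prov_record make_xref(const char *attr, pnode_id p, version_no ver)
{
    struct prov_record r;

    assert(attr != NULL);
    r.attribute = attr;
    r.kind = VAL_XREF;
    r.value.ref.pnode = p;
    r.value.ref.version = ver;
    return r;
}
```

A record such as `make_xref("INPUT", p, 0)` says that the object carrying the record depends on version 0 of the object with pnode `p`. The attribute name "INPUT" is only an example.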
DPAPI Functions
• Pass_read: Reads data with a reference to its
provenance.
• Pass_write: Writes data with provenance.
• Pass_freeze: Subsequent modifications to object
create a new version.
• Pass_mkobj: Create an object to represent
something at a different abstraction layer.
• Pass_reviveobj: Given a pnode number, obtain a
reference to the appropriate object.
• Pass_sync: Flush an object’s provenance to disk.
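The freeze semantics can be illustrated with a small in-memory mock. This is one reading of the bullet above, not PASS code: it assumes the version number changes on the first write after a freeze rather than at the freeze itself, and `mock_object`, `mock_freeze`, and `mock_write` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a provenance-aware object. */
struct mock_object {
    unsigned version; /* current version number */
    bool frozen;      /* set by freeze, cleared by the next write */
    size_t nwrites;   /* total writes, for inspection */
};

/* Freeze: the current version becomes immutable. */
void mock_freeze(struct mock_object *o)
{
    assert(o != NULL);
    o->frozen = true;
}

/* Write: the first write after a freeze starts a new version.
 * Returns the version that received the write. */
unsigned mock_write(struct mock_object *o)
{
    assert(o != NULL);
    if (o->frozen) {
        o->version++;
        o->frozen = false;
    }
    o->nwrites++;
    return o->version;
}
```

Starting from version 0, a write lands in version 0. After `mock_freeze`, the next write lands in version 1, and further writes stay in version 1 until the object is frozen again.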
Example Stack: NFS
[Figure: software stack with components Application, PA-Application, libpass, PASS, and NFS, connected by the Syscall API and DPAPI interfaces.]
Example 5-stack
[Figure: software stack with components PA-Python Application, PA Python Library, PA-Python Interpreter, PASS, and NFS, connected by the lib API, Syscall API, and DPAPI interfaces.]
Benefits to Layering
• Ability to query across layers.
• Access objects by the name that is
meaningful to the user.
• Automatic association between names at
different layers.
• Associate related objects named differently.
• Extensible data model.
Outline
• Provenance disclosure and integration
• Layering and provenance
• Parting remarks
Lessons Learned (1)
• Guidelines for making applications or
systems provenance-aware:
– Identify what provenance you want to collect.
• Create objects as necessary using dpapi_mkobj.
• Accumulate provenance records for those objects.
– Replace read calls with dpapi_read calls.
– Replace write calls with dpapi_write calls.
– Use cross-references to relate objects.
– If necessary, export DPAPI to higher layers.
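The guidelines can be walked through with an in-memory mock: create objects, then write data together with records that cross-reference the inputs. Everything here (`sketch_object`, `sketch_record`, `sketch_mkobj`, `sketch_write`) is a hypothetical stand-in and does not call the real DPAPI. The mock counts bytes instead of storing them.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RECORDS 8

/* Hypothetical record: an attribute plus a cross-reference to an object id. */
struct sketch_record {
    const char *attr;
    int xref_obj;
};

/* Hypothetical object: its data size and the records accumulated so far. */
struct sketch_object {
    int id;
    size_t bytes;
    struct sketch_record recs[MAX_RECORDS];
    size_t nrecs;
};

static int next_id = 1;

/* Stand-in for the mkobj step: create an object with a fresh id. */
struct sketch_object sketch_mkobj(void)
{
    struct sketch_object o;

    o.id = next_id++;
    o.bytes = 0;
    o.nrecs = 0;
    return o;
}

/* Stand-in for a provenance-carrying write: the data and the records
 * that describe it arrive in the same call, so they stay together. */
size_t sketch_write(struct sketch_object *dst, const void *data, size_t len,
                    const struct sketch_record *recs, size_t nrecs)
{
    assert(dst != NULL && (data != NULL || len == 0));
    assert(recs != NULL || nrecs == 0);
    assert(dst->nrecs + nrecs <= MAX_RECORDS);

    for (size_t i = 0; i < nrecs; i++)
        dst->recs[dst->nrecs++] = recs[i];
    dst->bytes += len;
    return len;
}
```

An application would create one object per input and output, then pass a record such as `{ "INPUT", in.id }` with each write to the output. Passing data and records in one call is what keeps provenance and data consistent.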
Lessons Learned (2)
• Application architecture dictates how difficult
this is.
– Firefox’s modular architecture makes it difficult to
have provenance and data flow together through
the browser.
• APIs are never done.
– DPAPI continues to evolve.
– Added two new calls early in 2009.
Lessons Learned (3)
• Differentiating applications from substrates:
– We initially thought that our Python wrappers
made Python provenance-aware.
– Instead, they enabled provenance-aware Python
applications.
– Making Python provenance-aware requires
changes to the interpreter -- similar to those to
make an operating system provenance-aware.
Making Provenance Ubiquitous
• One size does not fit all.
• Provenance is useful at all levels of the
system:
– Capture semantics of applications.
– Capture execution mode of interpreter.
– Capture system dependencies.
• Data and provenance live in a world with
many names.
Layering Enables Interoperability
• Data objects are the point of interoperability.
– Users exchange or share data, not provenance.
– Users query provenance.
• The names people associate with their data
must be available in provenance queries.
• A layered approach associates names with
one another.
• Layering enables consistency between
provenance and data.
New Layers
• We have explored layering in:
– Operating system
– Network-attached storage
– Interpreters
– Language libraries
– Browsers
– Workflow engines (Kepler)
• We welcome new layers to our stack:
– Database?
Thank You!
Margo Seltzer
margo@eecs.harvard.edu
May 13, 2009
Provenance in Secure and
Advanced Computer Systems
DPAPI (detail)
int dpapi_freeze(int fd);
int dpapi_mkobj(int reference_fd);
int dpapi_revive_obj(int reference_fd, __pnode_t pnode, version_t version);
ssize_t paread(int fd, void *data, size_t datalen,
               __pnode_t *pnode_ret, version_t *version_ret);
ssize_t pawrite(int fd, const void *data, size_t datalen,
                const struct dpapi_addition *records, unsigned numrecords);
int dpapi_sync(int fd);