Layering in Provenance Systems

Margo Seltzer
May 13, 2009
Provenance in Secure and Advanced Computer Systems

The Vision: Provenance Everywhere

• All data has provenance.
• Applications generate provenance.
• Systems generate provenance.
• Users generate provenance.
• Provenance is:
  – Secure.
  – Queryable.
  – Globally searchable.
• There are provenance-aware algorithms.

PSACS: May 2009

The Problem: Provenance Comes from Different Places

• Depending on the source, provenance is attached to different kinds of objects:
  – Operating system: files
  – Database systems: tuples
  – Workflow engines: objects
  – Applications:
    • Variables (from an interpreter)
    • Links (from a browser)

Data are related

• Tuples live in files.
• Files comprise data sets.
• Browsers write files.
• Variables relate to each other.
• Objects may be files, tuples, or data sets.

Must integrate provenance from different representations.

Why Integrate Provenance?

Outline

• Provenance disclosure and integration
• Layering and provenance
• Parting remarks

Provenance Observation versus Disclosure

• Disclosed provenance:
  – Provenance that is explicitly provided.
  – Provider understands the semantics of the data referenced by the provenance.
  – Example: This image is the result of aligning these other two images.
• Observed provenance:
  – Provenance deduced by interpreting events.
  – Observer translates an event into a provenance relationship.
  – Example: Process P wrote file F, therefore file F depends on process P.

Your observed provenance is my disclosed provenance.

• The distinction between observed and disclosed provenance is one of vantage point.
• A file system observes that the workflow engine produced the file atlas.x.gif.
• The workflow engine can disclose that atlas.x.gif is the result of a 5-step process that began with reading warp.air.
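The observed/disclosed distinction above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the `Dependency` record and the `observe` and `disclose` helpers are hypothetical names invented here, not part of PASS or any real provenance system. An observer turns raw read/write events into dependencies, while a provider states dependencies directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """One provenance relationship: `target` depends on `source`."""
    target: str
    source: str
    origin: str  # "observed" or "disclosed"


def observe(events):
    """Translate raw (process, operation, path) events into dependencies.

    A write means the file depends on the process; a read means the
    process depends on the file. (Hypothetical helper for illustration.)
    """
    deps = []
    for process, operation, path in events:
        if operation == "write":
            deps.append(Dependency(path, process, "observed"))
        elif operation == "read":
            deps.append(Dependency(process, path, "observed"))
    return deps


def disclose(target, sources):
    """Record dependencies that the provider states explicitly."""
    return [Dependency(target, source, "disclosed") for source in sources]


# What a file system might see: the engine read one file and wrote another.
events = [
    ("workflow-engine", "read", "warp.air"),
    ("workflow-engine", "write", "atlas.x.gif"),
]
observed = observe(events)

# What the workflow engine itself could state about the same output.
disclosed = disclose("atlas.x.gif", ["warp.air"])

for dep in observed + disclosed:
    print(f"{dep.origin}: {dep.target} depends on {dep.source}")
```

The observer only learns that the output depends on the process; the discloser can name the input directly, which is the semantic gap the slides describe.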

Problem Overview

• Systems capture provenance at different levels of abstraction:
  – File systems: files and processes
  – Database systems: tuples and queries
  – Workflow engines: objects and operators
  – Interpreters: variables and operations
  – Browsers: URLs and traversals
• Users want to query across these abstractions.

Use Case: PA-Browser

• Browsers capture a user’s search and traversal patterns.
• Action: User inadvertently downloads a virus.
• Without layering:
  – Browser knows this came from virus.com.
  – File system knows what files were affected.
• With layering:
  – How did the user get to the virus?
  – What else was downloaded from that site?
  – Are there other files that might be similarly tainted?

Use Case: PA-Python Applications

• Python wrappers generate a trace of processing steps internal to Python.
• Usage: A program reads 100 input files and uses two of them to produce a graph.
• Without layering:
  – Python knows which files were actually used to produce the graph.
  – File system knows that Python read 100 files and produced an output file.
• With layering:
  – Can identify that two input files lead directly to the output file.

Integrating Requires Layering

• Layering implies that provenance collection and tracking systems interact directly with one another.
• Why not a centralized provenance repository?
  – Requires a mechanism to translate names.
  – Every participant must agree on a naming convention.
  – Must be able to generate references to objects created by other participants.
  – What happens when you add a new participant with a new naming mechanism?
• Layering provides a natural way to transmit and integrate provenance.

Outline

• Provenance disclosure and integration
• Layering and provenance
• Parting remarks

Provenance-Aware Agents

• An agent that is provenance-aware:
  – Accepts disclosed provenance from others.
  – Observes events and generates provenance from them.
  – Discloses provenance to others.
• Implications:
  – Both input and output are disclosed provenance.
  – Participation in an integrated provenance-aware system requires an API for disclosed provenance.

DPAPI: The Disclosed Provenance API

• Grew out of our experience designing and building PASS (Provenance-Aware Storage Systems).
• Used as the universal internal API between components in the PASS architecture.
• Used to extend PASS to NFS.
• Used by provenance-aware applications.
• Has evolved through three generations.

DPAPI Concepts

• Pnode
  – Unique ID assigned at object creation.
  – Never recycled.
  – Used to access an object’s provenance.
• Provenance record
  – An attribute/value pair.
  – The value is either a plain value or a cross-reference.
• Version
  – Objects change; changes are reflected in versions.

DPAPI Functions

• Pass_read: Reads data and returns a reference to its provenance.
• Pass_write: Writes data with provenance.
• Pass_freeze: Subsequent modifications to the object create a new version.
• Pass_mkobj: Creates an object to represent something at a different abstraction layer.
• Pass_reviveobj: Given a pnode number, obtains a reference to the appropriate object.
• Pass_sync: Flushes an object’s provenance to disk.

Example Stack: NFS

[Diagram. Labels: Application, Syscall API, PASS, PA-Application, libpass, DPAPI, NFS]

Example 5-stack

[Diagram. Labels: PA-Python Application, PA Python Library, lib API, PA-Python Interpreter, Syscall API, PASS, NFS, DPAPI]

Benefits to Layering

• Ability to query across layers.
• Access objects by the name that is meaningful to the user.
• Automatic association between names at different layers.
• Associate related objects named differently.
• Extensible data model.
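To make the DPAPI concepts from the earlier slides more concrete (pnodes, provenance records, versions, freezing), here is a toy in-memory model in Python, followed by the kind of ancestry query that layering is meant to enable. This is a sketch under assumptions, not PASS code: `ToyStore`, its method names, and the `NAME` and `INPUT` attribute names are all hypothetical, and the example uses three input files rather than the 100 in the PA-Python use case.

```python
import itertools


class ToyStore:
    """In-memory stand-in for a provenance store (hypothetical, not PASS)."""

    def __init__(self):
        self._next_pnode = itertools.count(1)  # IDs only grow, never recycled
        self.version = {}   # pnode -> current version number
        self.frozen = {}    # pnode -> True if the current version is frozen
        self.records = {}   # (pnode, version) -> list of (attribute, value)

    def mkobj(self):
        """Create an object and return its freshly assigned pnode."""
        pnode = next(self._next_pnode)
        self.version[pnode] = 0
        self.frozen[pnode] = False
        self.records[(pnode, 0)] = []
        return pnode

    def freeze(self, pnode):
        """Mark the current version so the next write starts a new one."""
        self.frozen[pnode] = True

    def write(self, pnode, records):
        """Attach attribute/value records, starting a new version if frozen."""
        if self.frozen[pnode]:
            self.version[pnode] += 1
            self.frozen[pnode] = False
            self.records[(pnode, self.version[pnode])] = []
        self.records[(pnode, self.version[pnode])].extend(records)

    def ancestors(self, pnode, version):
        """Follow INPUT cross-references transitively from one version."""
        seen, stack = set(), [(pnode, version)]
        while stack:
            for attribute, value in self.records[stack.pop()]:
                if attribute == "INPUT" and value not in seen:
                    seen.add(value)
                    stack.append(value)
        return seen


store = ToyStore()
inputs = {name: store.mkobj() for name in ("a.dat", "b.dat", "c.dat")}
for name, pnode in inputs.items():
    store.write(pnode, [("NAME", name)])  # plain-value record

# The application discloses that only two inputs fed the graph.
graph = store.mkobj()
store.write(graph, [
    ("NAME", "graph.png"),
    ("INPUT", (inputs["a.dat"], 0)),  # cross-reference: (pnode, version)
    ("INPUT", (inputs["b.dat"], 0)),
])
store.freeze(graph)
store.write(graph, [("INPUT", (inputs["c.dat"], 0))])  # goes to version 1

print(store.ancestors(graph, 0))
print(store.version[graph])
```

Cross-references are modeled as `(pnode, version)` tuples so that a query can tell which version of an input an output depended on; a real store would also link each new version to its predecessor, which this sketch omits.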

Outline

• Provenance disclosure and integration
• Layering and provenance
• Parting remarks

Lessons Learned (1)

• Guidelines for making applications or systems provenance-aware:
  – Identify what provenance you want to collect.
    • Create objects as necessary using dpapi_mkobj.
    • Accumulate provenance records for those objects.
  – Replace read calls with dpapi_read calls.
  – Replace write calls with dpapi_write calls.
  – Use cross-references to relate objects.
  – If necessary, export DPAPI to higher layers.

Lessons Learned (2)

• Application architecture dictates how difficult this is.
  – Firefox’s modular architecture makes it difficult to have provenance and data flow together through the browser.
• APIs are never done.
  – DPAPI continues to evolve.
  – Added two new calls early in 2009.

Lessons Learned (3)

• Differentiating applications from substrates:
  – We initially thought that our Python wrappers made Python provenance-aware.
  – Instead they enabled provenance-aware Python applications.
  – Making Python provenance-aware requires changes to the interpreter, similar to those needed to make an operating system provenance-aware.

Making Provenance Ubiquitous

• One size does not fit all.
• Provenance is useful at all levels of the system:
  – Capture semantics of applications.
  – Capture execution mode of interpreter.
  – Capture system dependencies.
• Data and provenance live in a world with many names.

Layering Enables Interoperability

• Data objects are the point of interoperability.
  – Users exchange or share data, not provenance.
  – Users query provenance.
• The names people associate with their data must be available in provenance queries.
• A layered approach associates names with one another.
• Layering enables consistency between provenance and data.
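The point that a layered approach associates names with one another can be sketched as a small index that maps the names used at each layer to one shared object identifier. Everything here is an assumption for illustration: `NameIndex`, the numeric object ID, and the example path and URL are hypothetical and do not come from PASS.

```python
from collections import defaultdict


class NameIndex:
    """Associate names from different layers with one shared object ID."""

    def __init__(self):
        self._by_name = {}                  # (layer, name) -> object ID
        self._by_object = defaultdict(set)  # object ID -> {(layer, name)}

    def associate(self, object_id, layer, name):
        """Record that `name` at `layer` refers to `object_id`."""
        self._by_name[(layer, name)] = object_id
        self._by_object[object_id].add((layer, name))

    def aliases(self, layer, name):
        """Return every (layer, name) pair known for the named object."""
        object_id = self._by_name[(layer, name)]
        return sorted(self._by_object[object_id])


index = NameIndex()
# Hypothetical example: one downloaded object, known by two names.
index.associate(17, "file system", "/home/user/Downloads/setup.exe")
index.associate(17, "browser", "http://virus.com/setup.exe")

# A user can start from whichever name is meaningful to them.
print(index.aliases("browser", "http://virus.com/setup.exe"))
```

Keying the lookup on `(layer, name)` rather than on the name alone avoids collisions when two layers happen to use the same string for different objects.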

New Layers

• We have explored layering in:
  – Operating system
  – Network-attached storage
  – Interpreters
  – Language libraries
  – Browsers
  – Workflow engines (Kepler)
• We welcome new layers to our stack:
  – Database?

Thank You!

Margo Seltzer
margo@eecs.harvard.edu
May 13, 2009
Provenance in Secure and Advanced Computer Systems

DPAPI (detail)

  int dpapi_freeze(int fd);
  int dpapi_mkobj(int reference_fd);
  int dpapi_revive_obj(int reference_fd, __pnode_t pnode, version_t version);
  ssize_t paread(int fd, void *data, size_t datalen, __pnode_t *pnode_ret, version_t *version_ret);
  ssize_t pawrite(int fd, const void *data, size_t datalen, const struct dpapi_addition *records, unsigned numrecords);
  int dpapi_sync(int fd);

Why Integrate Provenance?