Provenance-based reasoning in e-Science
Professor Luc Moreau
L.Moreau@ecs.soton.ac.uk
University of Southampton

Acknowledgements
Simon Miles, Paul Groth, Miguel Branco (PASOA Southampton)
Victor Tan, Liming Chen, Fenglian Xu (EU Provenance Southampton)
Ian Wootten, Shrija Rajbhandari, Omer Rana, David Walker (PASOA Cardiff)
Steven Willmott, Javier Vazquez, Laszlo Varga, Arpad Andics, John Ibbotson, Neil Hardman, Alexis Biller (EU Provenance)

Overview
Context
Provenance Concept & Definitions
Architectural Design
Provenance based Reasoning
Protocol for P-Assertions Recording
Provenance Queries
Conclusions

Context: Importance of Past Processes

Context (1)
Bioinformatics: verification and auditing of "experiments" (e.g. for drug approval)
High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)

Context (2)
Aerospace engineering: maintain a historical record of design processes, up to 99 years.
Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients

Concepts & Definitions

Provenance: common sense definition
Oxford English Dictionary:
  the fact of coming from some particular source or quarter; origin, derivation
  the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the ultimate derivation and passage of an item through its various owners.
Merriam-Webster Online dictionary:
  the origin, source
  the history of ownership of a valued object or work of art or literature
Concept vs representation

Provenance Definition
Our definition of provenance in the context of e-Science, for which process matters to end users:
  The provenance of a piece of data is the process that led to that piece of data
Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases

Provenance "Lifecycle": Core Interfaces to Provenance Store
[Diagram labels: Application, Results, Provenance Store]
Record Documentation of Execution
Query Provenance of Data
Administer Store and its contents

Nature of Documentation
We represent the provenance of some data by documenting the process that led to the data:
  documentation can be complete or partial;
  it can be accurate or inaccurate;
  it can present conflicting or consensual views of the actors involved;
  it can provide operational details of execution or it can be abstract.

p-assertion
A given element of process documentation will be referred to as a p-assertion
p-assertion: an assertion that is made by an actor and pertains to a process.
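
To make the p-assertion definition and the three store interfaces concrete, here is a minimal sketch in Python. It is an illustration only, not the PASOA or PReServ API: the names `PAssertion`, `ProvenanceStore`, `process_id` and `remove_process` are hypothetical, and the store is a plain in-memory list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PAssertion:
    """An assertion made by an actor that pertains to a process."""
    actor: str       # the actor making the assertion
    process_id: str  # hypothetical identifier of the process it pertains to
    content: str     # what is asserted, kept as free text in this sketch


@dataclass
class ProvenanceStore:
    """Toy in-memory store with record, query and administer operations."""
    assertions: List[PAssertion] = field(default_factory=list)

    def record(self, assertion: PAssertion) -> None:
        """Record documentation of execution."""
        self.assertions.append(assertion)

    def query(self, process_id: str) -> List[PAssertion]:
        """Return the p-assertions that document one process."""
        return [a for a in self.assertions if a.process_id == process_id]

    def remove_process(self, process_id: str) -> int:
        """Administer the store: drop one process's documentation.

        Returns the number of p-assertions removed.
        """
        before = len(self.assertions)
        self.assertions = [a for a in self.assertions
                           if a.process_id != process_id]
        return before - len(self.assertions)


store = ProvenanceStore()
store.record(PAssertion("client", "run-1", "sent request"))
store.record(PAssertion("service", "run-1", "received request"))
store.record(PAssertion("client", "run-2", "sent request"))

run1 = store.query("run-1")
print([(a.actor, a.content) for a in run1])
```

Because documentation may be partial or conflicting, the sketch stores whatever each actor asserts and leaves interpretation to query time.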
Three views of provenance
Conceptual: provenance as a concept
Computer based, recording time: the set of all p-assertions recorded during execution
  Since our goal is to support as many use cases as possible, we record as much as we can
  Scalability concern and optimisation of recording for specific queries
Computer based, query time: a provenance query results in a set of p-assertions or other derived representation
  So, the challenge is to identify a subset of assertions that is relevant to the user (as expressed by query)
  Hence, we need clearly specified scoping mechanisms

From Computer to Physical World
[Diagram labels: Application, Provenance Store, Data, Physical artefact]
Application producing electronic data
Application is composed of a set of services
During execution, p-assertions are recorded
Provenance of data can be queried
Electronic data is a proxy for the physical artefact
Assuming a one to one mapping between services and physical "actuators/sensors", e.g. robot
Physical actuators/sensors result in physical artefact
Can we derive the provenance of the physical artefact from a query about the provenance of the electronic data?
What if mapping is not one to one?

Architectural Design

Service Oriented Architecture
Broad definition of service as a component that takes some inputs and produces some outputs.
Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition.
Interactions with services take place with messages that are constructed according to the services' interface specifications.
The term actor denotes either a client or a service in a SOA.
A process is defined as execution of a workflow

Process Documentation (1)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3, M4]
Actor 1: I received M1, M4; I sent M2, M3
Actor 2: I received M3; I sent M4
From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4)
If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages

Process Documentation (2)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3, M4]
Actor 1: M2 is in reply to M1; M3 is caused by M1; M2 is caused by M4
Actor 2: M4 is in reply to M3
These assertions help identify order of messages, but not how data were computed

Process Documentation (3)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3, M4, with functions f1, f2 and f]
Actor 1: M3 = f1(M1); M2 = f2(M1,M4)
Actor 2: M4 = f(M3)
These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc)

Process Documentation (4)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3, M4]
I used 386 cluster
Request sat in queue for 6min
I used sparc processor
I used algorithm x version x.y.z

Types of p-assertions (1)
Interaction p-assertion: an assertion of the contents of a message by an actor that has sent or received that message
  I received M1, M4
  I sent M2, M3

Types of p-assertions (2)
Relationship p-assertion: an assertion, made by an actor, that describes how the actor obtained output data or messages by applying some function to some input data or messages.
  M2 is in reply to M1
  M3 is caused by M1
  M2 is caused by M4
  M3 = f1(M1)
  M2 = f2(M1,M4)

Types of p-assertions (3)
Actor state p-assertion: an assertion made by an actor about its internal state in the context of a specific interaction
  I used sparc processor
  I used algorithm x version x.y.z

Data flow
Interaction p-assertions allow us to specify a flow of data between actors
Relationship p-assertions allow us to characterise the flow of data "inside" an actor
Overall data flow (internal + external) constitutes a DAG

Provenance Modelling

Interfaces to Provenance Store
[Diagram labels: Application, Results, Provenance Store]
Record Documentation of Execution
Query Provenance of Data
Administer Store and its contents

Provenance-based reasoning

The case (1)
Static Validation
  Operates on workflow source code
  Programming language static analyses (e.g. type inference, escape analysis, etc.)
  Workflow specific (concurrency analysis, graph-based partitioning, model checking, quality of service)
  Workflow script may not be accessible or may be expressed in a language not supported by the analysis tool
Dynamic Validation
  Service based: interface matching, runtime type checking

The case (2)
Provenance based reasoning
  Allows for validation of experiments after execution
  Third parties, such as reviewers and other scientists, may want to verify that the results obtained were computed correctly according to some criteria.
  These criteria may not be known when the experiment was designed or run.
  Important because science progresses (and models evolve!)

Bioinformatics Scenario
A biologist has a set of proteins, for each of which he/she wishes to determine a particular biological property

Experiment Design
The biologist designs a high-level plan of an experiment, describing each activity that must be performed
Each activity determines new information from analysing the information discovered in previous steps
The steps can be seen as a linked flow of data from the protein to the final property
[Figure: HIGH LEVEL PLAN]
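
The data-flow DAG described under "Data flow" can be traversed mechanically. The sketch below takes the relationship p-assertions from Process Documentation (3), namely M3 = f1(M1), M2 = f2(M1,M4) and M4 = f(M3), and collects everything a message transitively depends on. The dictionary layout and the function name `provenance` are assumptions made for illustration; they are not taken from the PASOA software.

```python
# Relationship p-assertions from the slides, stored as
# output -> (function name, list of inputs). The layout is illustrative.
RELATIONSHIPS = {
    "M3": ("f1", ["M1"]),
    "M2": ("f2", ["M1", "M4"]),
    "M4": ("f", ["M3"]),
}


def provenance(item, relationships):
    """Return the set of items that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        # Items with no recorded relationship are treated as raw inputs.
        _function, inputs = relationships.get(current, (None, []))
        for dependency in inputs:
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


m2_provenance = provenance("M2", RELATIONSHIPS)
print(sorted(m2_provenance))
```

In this toy graph M2 depends on M1 and M4 directly and on M3 through M4. The same traversal would follow a protein through each planned activity to the final property, provided every service recorded its relationship p-assertions.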
Experiment Services
For each activity in the plan, the biologist must decide on the concrete service they will perform
Each service may be designed by the biologist him/herself or adopted from the work of another biologist
For each service there is a description of that service stating:
  what the service does
  what type of data it analyses (its inputs)
  what type of results it produces (its outputs)
All the descriptions are stored in a registry
[Figure: Services A, B and C and the Registry. Description of Service A: Function: ….. Inputs: ….. Outputs: …..]

Performing Experiment
The biologist now performs the experiment as many times as they wish
The details of each experimental process are documented in a provenance store
This is achieved by each service documenting its own execution using the recording interface
[Figure: Services A, B, C, D and E and the Provenance Store]

Provenance Questions
Later, the biologist examines the experiment results and has questions about the validity of the process that produced them
1. Did the services I used actually fulfil my high level plan?
2. Two of the experiments were performed on the same protein but have different results: did I alter the services between these experiments?
3. Did I perform each service on the type of data that the service was intended to analyse, i.e. were the inputs and outputs of each activity compatible?

Answering the Questions
Using the documentation in the Provenance Store, we can reconstruct the process that led to each result
Along with the high level plan and the descriptions in the registry, we have all the information required to answer the questions

Q1: Did the experiment follow the plan?
[Diagram labels: Provenance Store, Registry, Description of Service A (Function: ….. Inputs: ….. Outputs: …..)]
Retrieve documentation of experiment that led to a result
Retrieve procedure descriptions
Compare procedure function to planned activity

Q2: Do services differ between experiments?
[Diagram labels: Provenance Store, Service A (two versions)]
Retrieve documentation of experiments
Highlight differences in services between experiments

Q3: Were the inputs and outputs compatible?
[Diagram labels: Provenance Store, Registry, Description of Service A, Description of Service B (each with Function: ….. Inputs: ….. Outputs: …..)]
Retrieve each pair of services performed in an experiment, where one service's output is the other's input
Retrieve descriptions for each service
Compare the output type of the first service with the input type of the second

Ontological Reasoning
In some cases, the high-level activity may be described in a more general way than the service which performs it
Also, one service's input may be a generalisation of the preceding service's output
Therefore, exact matching of types may produce a false negative: the biologist will wrongly be told the experiment was invalid
By using an ontology, describing how types are related, we can reason about types and determine whether they are truly compatible
[Ontology example: Protein is a generalisation of Human Protein]

P-assertions Recording
Recording patterns
Protocol: PReP
Properties
Implementation: PReServ
Performance

Separate Store Pattern (service)
[Diagram labels: Provenance Store, Record P-Assertions, Invocation, Result, Service]

Separate Store Pattern (client)
[Diagram labels: Provenance Store, Record P-Assertions, Invocation, Client, Result]

Context Passing Pattern
[Diagram labels: Provenance Store, Provenance Store, Record P-Assertions, Record P-Assertions, context, Client, Service]

Shared Store Pattern
[Diagram labels: Provenance Store, Record P-Assertions, Client, Record P-Assertions, Service]

Repeated Application of Patterns
[Diagram labels: Provenance Store 1, Provenance Store 2, Provenance Store 3, Client Initiator, Workflow Enactment Engine, Service 1]

PReP: P-Assertion Recording Protocol
Formalisation as abstract machine
Properties
  Termination
  Liveness
  Safety
  Stateless

PReServ [Groth et al. 04]
Implementation of PReP protocol and Query Interface
Provenance store implemented as a Web Service
Client side libraries for using Provenance Store
Axis Handler for automatically recording communication between Axis-based Web Services

PReServ Implementation Diagram
[Diagram labels: WS Client, PS Client Side Library, Axis Handler, Web Service, Provenance Service, Backend Store Interface, Query Actor WS, DB Store, In-Memory Store, Backend Stores, …; legend: WS Calls, Java Calls]

Evaluation in a Bioinformatics Application
Bioinformatics workflow studying compressibility of biological sequences
Implemented as a VDT workflow, scheduled by Condor
Each service, script, command records p-assertions [HPDC'05]

Bioinformatics Application (2)
[Figures: Recording Scalability; Querying Scalability]

Query API

Structure of Documentation
The documentation of processes recorded by actors can be categorised into a hierarchy
[Figure: hierarchy with nodes All documentation, Message exchange, Message sender's view, Message receiver's view, Message content, State of actor during exchange, Relationships]

Query Interface [Miles et al. 05]
Purpose
  Obtain the provenance of some specific data
  Allow for "navigation" of the data structure representing provenance
Abstract interface
  Allows us to view the provenance store as if containing XML data structures
  Independent of technology used for running application and internal store representation
  Seamless navigation of application dependent and application independent provenance representation

XML Query Languages
Two existing query languages provide ways of navigating hierarchical data: XPath and XQuery
For instance, we can use XPath to refer to:
  The message exchange with ID 345
  The client's view of that exchange
  The body of the message exchanged
    //messageExchange[id="345"]/clientView/messageContent

Navigating Message Content
If message content is in XML format, or can be mapped to it, then XPath and XQuery can be used to navigate into the message content
For example, we can add application-specific navigation to the previous XPath:
  The SOAP envelope that encloses the message
  The body of the message within the envelope
  The customer name within the body
    //messageExchange[id="345"]/clientView/messageContent/soap:envelope/soap:body//customerName

Other Query Requirements
Execution Filtering: include/exclude all p-assertions that are marked as part of an execution by a single actor.
Functionality Filtering: include/exclude p-assertions that have one of a given set of operation types.
Process Filtering: include/exclude p-assertions that belong to a given (set of) process(es).
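
The two XPath expressions from the query slides can be tried against a small XML document. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so it is a stand-in and not the PReServ query interface. The document layout, the `provenanceStore` root element and the customer name are invented for illustration. Element names, including the lower-case `soap:envelope` and `soap:body`, follow the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML view of a provenance store. Only the element names
# that appear in the slides' XPath expressions are taken from the source.
STORE_XML = """
<provenanceStore xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <messageExchange>
    <id>345</id>
    <clientView>
      <messageContent>
        <soap:envelope>
          <soap:body>
            <order><customerName>Alice</customerName></order>
          </soap:body>
        </soap:envelope>
      </messageContent>
    </clientView>
  </messageExchange>
  <messageExchange>
    <id>346</id>
    <clientView>
      <messageContent>unrelated</messageContent>
    </clientView>
  </messageExchange>
</provenanceStore>
"""

# Prefix-to-URI map so that "soap:" can be used inside the path.
NAMESPACES = {"soap": "http://schemas.xmlsoap.org/soap/envelope/"}

root = ET.fromstring(STORE_XML)

# ElementTree requires a relative path, hence the leading "." before "//".
CONTENT_PATH = './/messageExchange[id="345"]/clientView/messageContent'
contents = root.findall(CONTENT_PATH)

# Application-specific navigation into the message content.
NAME_PATH = CONTENT_PATH + "/soap:envelope/soap:body//customerName"
customer_names = [e.text for e in root.findall(NAME_PATH, NAMESPACES)]

print(len(contents), customer_names)
```

The first path is application independent, while the second reaches into the message body, which is the "seamless navigation" the query interface aims for. A full XPath or XQuery engine would be needed for anything beyond this subset.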
Conclusions
Mostly unexplored area that is crucial to develop trusted systems
  Definition of provenance
  Specification of provenance representation
  Recording protocol
  Querying interfaces
  Reasoning based on provenance
Presents lots of opportunities

Conclusions
Current work: System and protocol designing, architecture specification, generic support for use cases
Pursue the deployment in concrete application and performance evaluation
Work towards a standardisation proposal
Download our software from www.pasoa.org
Tell us about your use cases: we are keen to find new collaborations in this space!

Publications
1. Sylvia C. Wong, Simon Miles, Weijian Fang, Paul Groth and Luc Moreau. Provenance-based Validation of E-Science Experiments. In Proceedings of the International Semantic Web Conference (ISWC'05), Nov 2005.
2. Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.
3. Paul T. Groth. Recording Provenance in Service-Oriented Architectures. 9 Month Report, University of Southampton; Faculty of Engineering, Science and Mathematics; School of Electronics and Computer Science, 2004.
4. Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
5. Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.
6. Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.
7. Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proceedings of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.