Provenance-based reasoning in e-Science Professor Luc Moreau

L.Moreau@ecs.soton.ac.uk
University of Southampton
Acknowledgements




Simon Miles, Paul Groth, Miguel Branco (PASOA Southampton)
Victor Tan, Liming Chen, Fenglian Xu (EU Provenance Southampton)
Ian Wootten, Shrija Rajbhandari, Omer Rana, David Walker (PASOA Cardiff)
Steven Willmott, Javier Vazquez, Laszlo Varga, Arpad Andics, John Ibbotson, Neil Hardman, Alexis Biller (EU Provenance)
Overview







Context
Provenance Concept & Definitions
Architectural Design
Provenance-based Reasoning
Protocol for P-Assertions Recording
Provenance Queries
Conclusions
Context:
Importance of Past Processes
Context (1)
Bioinformatics: verification and
auditing of “experiments” (e.g.
for drug approval)
High Energy Physics:
tracking, analysing, verifying
data sets in the ATLAS
Experiment of the Large
Hadron Collider (CERN)
Context (2)
Aerospace engineering:
maintain a historical record
of design processes, up to
99 years.
Organ transplant management:
tracking of previous decisions,
crucial to maximise the efficiency
in matching and recovery rate of
patients
Concepts & Definitions
Provenance:
common sense definition

Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the ultimate derivation and passage of an item through its various owners.

Merriam-Webster Online dictionary:
• the origin, source;
• the history of ownership of a valued object or work of art or literature
Concept vs representation
Provenance Definition

Our definition of provenance in the context of e-Science, where process matters to end users:


The provenance of a piece of data is
the process that led to that piece of
data
Our aim is to conceive a computer-based
representation of provenance that allows us
to perform useful analysis and reasoning to
support our use cases
Core Interfaces to Provenance Store
[Diagram: Provenance "Lifecycle". An Application produces Results. Three interfaces to the Provenance Store are shown: Record Documentation of Execution; Query Provenance of Data; Administer Store and its contents.]
Nature of Documentation

We represent the provenance of some data by documenting the process that led to the data:
• documentation can be complete or partial;
• it can be accurate or inaccurate;
• it can present conflicting or consensual views of the actors involved;
• it can provide operational details of execution or it can be abstract.
p-assertion

A given element of process documentation will be referred to as a p-assertion.
• p-assertion: an assertion that is made by an actor and pertains to a process.
Three views of provenance
• Conceptual: provenance as a concept.
• Recording time (computer based): the set of all p-assertions recorded during execution. Since our goal is to support as many use cases as possible, we record as much as we can. Scalability concern and optimisation of recording for specific queries.
• Query time (computer based): a provenance query results in a set of p-assertions or other derived representation. So, the challenge is to identify a subset of assertions that is relevant to the user (as expressed by query). Hence, we need clearly specified scoping mechanisms.
From Computer to Physical World
[Diagram: an Application, a Provenance Store and electronic Data on the computer side; physical actuators/sensors and a physical artefact on the physical side.]
• Application producing electronic data. The application is composed of a set of services.
• During execution, p-assertions are recorded.
• Provenance of data can be queried.
• Physical actuators/sensors result in physical artefact.
• Electronic data is a proxy for physical artefact.
• Assuming a one to one mapping between services and physical “actuators/sensors”, e.g. robot.
• Can we derive the provenance of the physical artefact from a query about the provenance of the electronic data?
• What if mapping is not one to one?
Architectural Design
Service Oriented
Architecture





• Broad definition of a service: a component that takes some inputs and produces some outputs.
• Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition.
• Interactions with services take place through messages that are constructed according to each service's interface specification.
• The term actor denotes either a client or a service in a SOA.
• A process is defined as the execution of a workflow.
Process Documentation (1)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4.]
Actor 1 asserts:
• I received M1, M4
• I sent M2, M3
Actor 2 asserts:
• I received M3
• I sent M4
From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4).
If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages.
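The derivation described on this slide (pairing the "I sent" and "I received" assertions that mention the same message) can be sketched as follows. This is only an illustrative sketch: the dictionary layout and the function name `match_views` are assumptions, not part of the PASOA software.

```python
# Interaction p-assertions from the slide, grouped by asserting actor.
# The dictionary layout is a hypothetical encoding chosen for this sketch.
sent = {"Actor 1": ["M2", "M3"], "Actor 2": ["M4"]}
received = {"Actor 1": ["M1", "M4"], "Actor 2": ["M3"]}


def match_views(sent, received):
    """Pair the sender's and receiver's view of each message.

    Only messages documented by both parties are returned.
    """
    senders = {msg: actor for actor, msgs in sent.items() for msg in msgs}
    receivers = {msg: actor for actor, msgs in received.items() for msg in msgs}
    both = sorted(senders.keys() & receivers.keys())
    return {msg: (senders[msg], receivers[msg]) for msg in both}


exchanges = match_views(sent, received)
for msg, (sender, receiver) in exchanges.items():
    print(f"{msg}: sent by {sender}, received by {receiver}")
```

M1 and M2 do not appear in the result because only one view of each is asserted here. Nothing in the result says how M4 depends on M3, which is the limitation the slide points out.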
Process Documentation (2)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4.]
Actor 1 asserts:
• M2 is in reply to M1
• M3 is caused by M1
• M2 is caused by M4
Actor 2 asserts:
• M4 is in reply to M3
These assertions help identify order of messages, but not how data were computed.
Process Documentation (3)
[Diagram: Actor 1 (functions f1 and f2) and Actor 2 (function f) exchanging messages M1, M2, M3 and M4.]
• M3 = f1(M1)
• M2 = f2(M1,M4)
• M4 = f(M3)
These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc).
Process Documentation (4)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4, annotated with assertions about actor state.]
• I used 386 cluster
• Request sat in queue for 6min
• I used sparc processor
• I used algorithm x version x.y.z
Types of p-assertions (1)

Interaction p-assertion: an assertion of the contents of a message by an actor that has sent or received that message
• I received M1, M4
• I sent M2, M3
Types of p-assertions (2)

Relationship p-assertion: an assertion, made by an actor, that describes how the actor obtained output data or messages by applying some function to some input data or messages.
• M2 is in reply to M1
• M3 is caused by M1
• M2 is caused by M4
• M3 = f1(M1)
• M2 = f2(M1,M4)
Types of p-assertions (3)

Actor state p-assertion: an assertion made by an actor about its internal state in the context of a specific interaction
• I used sparc processor
• I used algorithm x version x.y.z
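To make the three kinds of p-assertion concrete, here is a minimal sketch of how they might be represented as records. The class and field names are assumptions made for illustration, and the choice of asserting actor and interaction for the actor state example is hypothetical, since the slides do not say which actor made it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str    # actor making the assertion
    view: str        # "sender" or "receiver"
    message_id: str
    content: str     # asserted contents of the message


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    output: str                # message or data item obtained
    relation: str              # e.g. "f2", "is caused by", "is in reply to"
    inputs: Tuple[str, ...]    # messages or data items it was obtained from


@dataclass(frozen=True)
class ActorStatePAssertion:
    asserter: str
    message_id: str  # interaction in whose context the state is asserted
    state: str


# Examples taken from the slides; the message contents are placeholders.
assertions = [
    InteractionPAssertion("Actor 1", "receiver", "M1", "(contents of M1)"),
    InteractionPAssertion("Actor 1", "sender", "M3", "(contents of M3)"),
    RelationshipPAssertion("Actor 1", "M2", "f2", ("M1", "M4")),
    # Hypothetical attribution: the asserter and interaction are assumed.
    ActorStatePAssertion("Actor 2", "M3", "I used algorithm x version x.y.z"),
]

relationships = [a for a in assertions if isinstance(a, RelationshipPAssertion)]
print(relationships[0].inputs)
```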
Data flow



Interaction p-assertions allow us to
specify a flow of data between actors
Relationship p-assertions allow us to
characterise the flow of data “inside” an
actor
Overall data flow (internal + external)
constitutes a DAG
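As an illustration of the data-flow view, the relationship p-assertions from the earlier slides (M3 = f1(M1), M2 = f2(M1,M4), M4 = f(M3)) can be turned into a dependency graph and traversed. This is a sketch under stated assumptions: the dictionary encoding and the function `provenance_of` are hypothetical, and it relies on the standard-library `graphlib` module (Python 3.9 or later). Messages are identified by ID, so the sender's and receiver's views of a message refer to the same node.

```python
from graphlib import TopologicalSorter

# Each item maps to the set of items it was derived from.
depends_on = {
    "M3": {"M1"},         # M3 = f1(M1)
    "M2": {"M1", "M4"},   # M2 = f2(M1, M4)
    "M4": {"M3"},         # M4 = f(M3)
}


def provenance_of(item, depends_on):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in depends_on.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# static_order() raises graphlib.CycleError if the data flow is not a DAG.
order = list(TopologicalSorter(depends_on).static_order())
ancestors = provenance_of("M2", depends_on)

print(order)
print(sorted(ancestors))
```

The topological order lists each item after everything it depends on, and the ancestor set of M2 is the portion of the graph a provenance query for M2 would need to return.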
Provenance Modelling
Interfaces to Provenance Store
[Diagram: an Application produces Results. Three interfaces to the Provenance Store are shown: Record Documentation of Execution; Query Provenance of Data; Administer Store and its contents.]
Provenance-based reasoning
The case (1)

Static Validation





• Operates on workflow source code
• Programming language static analyses (e.g. type inference, escape analysis, etc.)
• Workflow specific (concurrency analysis, graph-based partitioning, model checking, quality of service)
• Workflow script may not be accessible or may be expressed in a language not supported by analysis tool
Dynamic Validation

Service based: interface matching, runtime type
checking
The case (2)

Provenance based reasoning




Allows for validation of experiments after
execution
Third parties, such as reviewers and other
scientists, may want to verify that the results
obtained were computed correctly according to
some criteria.
These criteria may not be known when the
experiment was designed or run.
Important because science progresses (and
models evolve!)
Bioinformatics Scenario

A biologist has a set of
proteins, for each of which
he/she wishes to determine a
particular biological property
Experiment Design



• The biologist designs a high-level plan of an experiment, describing each activity that must be performed
• Each activity determines new information from analysing the information discovered in previous steps
• The steps can be seen as a linked flow of data from the protein to the final property
[Diagram: high-level plan leading from the protein to the unknown property.]
Experiment Services




For each activity in the plan, the biologist
must decide on the concrete service they
will perform
Each service may be designed by the
biologist him/herself or adopted from the
work of another biologist
For each service there is a description of that service stating:
• what the service does
• what type of data it analyses (its inputs) and
• what type of results it produces (its outputs)
All the descriptions are stored in a registry
[Diagram: Service A, Service B and Service C, with their descriptions stored in the Registry.]
Description of Service A
Function: …..
Inputs: …..
Outputs: …..
Performing Experiment


The biologist now performs the experiment as many times as they wish
The details of each experimental process are documented in a provenance store

This is achieved by each service documenting its own execution using the recording interface
[Diagram: Services A, B, C, D and E, and the Provenance Store.]
Provenance Questions

Later, the biologist examines the experiment results and has questions about the validity of the process that produced them
1. Did the services I used actually fulfil my high
level plan?
2. Two of the experiments were performed on the
same protein but have different results – did I
alter the services between these experiments?
3. Did I perform each service on the type of data
that the service was intended to analyse, i.e.
were the inputs and outputs of each activity
compatible?
Answering the Questions


Using the documentation in the
Provenance Store, we can reconstruct
the process that led to each result
Along with the high level plan and the
descriptions in the registry we have all
the information required to answer the
questions
Q1: Did the experiment follow
the plan?
• Retrieve documentation of experiment that led to a result (Provenance Store)
• Retrieve procedure descriptions (Registry), e.g. Description of Service A: Function: ….., Inputs: ….., Outputs: …..
• Compare procedure function to planned activity
Q2: Do services differ between
experiments?
• Retrieve documentation of experiments (Provenance Store)
• Highlight differences in services between experiments
[Diagram: two recorded versions of Service A shown side by side.]
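A minimal sketch of the comparison step for Q2, assuming the documentation retrieved for each experiment has already been reduced to a mapping from service name to the details recorded about that service. The mapping layout, the service details and the function name `diff_services` are hypothetical.

```python
def diff_services(run_a, run_b):
    """Return the services whose recorded details differ between two runs.

    The result maps each differing service to its (run_a, run_b) details;
    a service missing from one run is reported with None on that side.
    """
    differences = {}
    for service in sorted(set(run_a) | set(run_b)):
        details_a = run_a.get(service)
        details_b = run_b.get(service)
        if details_a != details_b:
            differences[service] = (details_a, details_b)
    return differences


# Hypothetical documentation of two experiments on the same protein.
experiment_1 = {
    "Service A": {"algorithm": "x", "version": "1.0"},
    "Service B": {"algorithm": "y", "version": "2.3"},
}
experiment_2 = {
    "Service A": {"algorithm": "x", "version": "1.1"},
    "Service B": {"algorithm": "y", "version": "2.3"},
}

changed = diff_services(experiment_1, experiment_2)
print(changed)
```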
Q3: Were the inputs and outputs
compatible?
• Retrieve each pair of services performed in an experiment, where one service’s output is the other’s input (Provenance Store)
• Retrieve descriptions for each service (Registry), e.g. Description of Service A and Description of Service B, each stating Function: ….., Inputs: ….., Outputs: …..
• Compare the output type of the first service with the input type of the second
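The three steps for Q3 can be sketched with exact type matching as follows. The registry contents, type names and the function `incompatible_links` are hypothetical, chosen only to illustrate the check.

```python
# Hypothetical registry descriptions: one input type and one output type each.
registry = {
    "Service A": {"inputs": "ProteinSequence", "outputs": "EncodedSequence"},
    "Service B": {"inputs": "EncodedSequence", "outputs": "CompressedSize"},
    "Service C": {"inputs": "SequenceSample", "outputs": "Report"},
}

# Pairs (producer, consumer) reconstructed from the provenance store:
# the producer's output was used as the consumer's input.
links = [("Service A", "Service B"), ("Service B", "Service C")]


def incompatible_links(links, registry):
    """Return the pairs whose output and input types do not match exactly."""
    return [
        (producer, consumer)
        for producer, consumer in links
        if registry[producer]["outputs"] != registry[consumer]["inputs"]
    ]


problems = incompatible_links(links, registry)
print(problems)
```

Exact string comparison is deliberately naive here; it flags any pair whose type names differ, even if one type is a specialisation of the other.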
Ontological Reasoning




• In some cases, the high-level activity may be described in a more general way than the service which performs it
• Also, one service’s input may be a generalisation of the preceding service’s output
• Therefore, exact matching of types may produce a false negative: the biologist will wrongly be told the experiment was invalid
• By using an ontology, describing how types are related, we can reason about types and determine whether they are truly compatible
[Ontology diagram: Protein is a generalisation of Human Protein.]
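A minimal sketch of the subsumption check, assuming the ontology is reduced to a single-parent "is generalisation of" hierarchy. The dictionary encoding and the function name `is_compatible` are assumptions; a real ontology language would support richer reasoning.

```python
# Maps each type to its direct generalisation (from the slide's example).
generalisation_of = {"HumanProtein": "Protein"}


def is_compatible(output_type, input_type, generalisation_of):
    """True if output_type equals input_type or is a specialisation of it."""
    current = output_type
    visited = set()
    while current is not None and current not in visited:
        if current == input_type:
            return True
        visited.add(current)  # guards against a cyclic hierarchy
        current = generalisation_of.get(current)
    return False


print(is_compatible("HumanProtein", "Protein", generalisation_of))
print(is_compatible("Protein", "HumanProtein", generalisation_of))
```

A service that accepts any Protein can consume a Human Protein, so the first check succeeds where exact matching would report a false negative; the reverse direction is still rejected.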
P-assertions Recording





Recording patterns
Protocol: PReP
Properties
Implementation: PReServ
Performance
Separate Store Pattern (service)
[Diagram: a Service receives an Invocation and returns a Result; the Service records p-assertions in a Provenance Store.]
Separate Store Pattern (client)
[Diagram: a Client sends an Invocation and receives a Result; the Client records p-assertions in a Provenance Store.]
Context Passing Pattern
[Diagram: a Client and a Service each record p-assertions in their own Provenance Store; a context is passed between Client and Service.]
Shared Store Pattern
[Diagram: a Client and a Service both record p-assertions in the same Provenance Store.]
Repeated Application of Patterns
[Diagram: Client Initiator, Workflow Enactment Engine and Service 1, with Provenance Store 1, Provenance Store 2 and Provenance Store 3.]
PReP: P-Assertion Recording
Protocol


Formalisation as
abstract machine
Properties




Termination
Liveness
Safety
Stateless
PReServ [Groth et al. 04]




Implementation of PReP protocol and
Query Interface
Provenance store implemented as a
Web Service
Client side libraries for using
Provenance Store
Axis Handler for automatically recording
communication between Axis-based
Web Services
PReServ Implementation
Diagram
[Diagram: a WS Client and a Web Service, each with an Axis Handler and a PS Client Side Library, and a Query Actor WS with a PS Client Side Library, connected to the Provenance Service by WS calls. Within the Provenance Service, Java calls go through the Backend Store Interface to the backend stores: DB Store, In-Memory Store, …]
Evaluation in a Bioinformatics
Application



• Bioinformatics workflow studying compressibility of biological sequences
• Implemented as a VDT workflow, scheduled by Condor
• Each service, script, command records p-assertions [HPDC’05]
Bioinformatics Application (2)

Recording Scalability

Querying Scalability
Query API
Structure of Documentation

The documentation of processes
recorded by actors can be categorised
into a hierarchy
All documentation
Message exchange
Message sender’s view
Message content
Message exchange
Message receiver’s view
State of actor during exchange
Relationships
Query Interface [Miles et al. 05]

Purpose



Obtain the provenance of some specific data
Allow for “navigation” of the data structure
representing provenance
Abstract interface



Allows us to view the provenance store as if
containing XML data structures
Independent of technology used for running
application and internal store representation
Seamless navigation of application dependent and
application independent provenance
representation
XML Query Languages


Two existing query languages provide
ways of navigating hierarchical data:
XPath and XQuery
For instance, we can use XPath to refer to:



The message exchange with ID 345
The client’s view of that exchange
The body of the message exchanged
//messageExchange[id="345"]/clientView/messageContent
Navigating Message Content


If message content is in XML format, or can
be mapped to it, then XPath and XQuery can
be used to navigate into the message content
For example, we can add application-specific
navigation to the previous XPath:



The SOAP envelope that encloses the message
The body of the message within the envelope
The customer name within the body
//messageExchange[id="345"]/clientView/messageContent/soap:envelope/soap:body//customerName
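To show what such navigation looks like in practice, here is a sketch that evaluates both paths with Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. The XML document is a hypothetical stand-in for the store's abstract view, not the actual PReServ schema, and the customer name is invented. Two adjustments are assumptions of the sketch: paths start with `.//` because ElementTree evaluates them relative to an element, and the SOAP elements use their standard capitalised names (`Envelope`, `Body`) in the SOAP 1.1 namespace.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

# Hypothetical XML view of a provenance store holding two exchanges.
store_xml = f"""
<provenanceStore xmlns:soap="{SOAP_NS}">
  <messageExchange>
    <id>345</id>
    <clientView>
      <messageContent>
        <soap:Envelope>
          <soap:Body>
            <order><customerName>Alice</customerName></order>
          </soap:Body>
        </soap:Envelope>
      </messageContent>
    </clientView>
  </messageExchange>
  <messageExchange>
    <id>346</id>
    <clientView><messageContent/></clientView>
  </messageExchange>
</provenanceStore>
""".strip()

root = ET.fromstring(store_xml)
namespaces = {"soap": SOAP_NS}

# Application-independent navigation: the client's view of exchange 345.
base = ".//messageExchange[id='345']/clientView/messageContent"
content = root.find(base)

# Application-specific navigation into the SOAP message itself.
names = root.findall(base + "/soap:Envelope/soap:Body//customerName", namespaces)

print(content is not None)
print([name.text for name in names])
```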
Other Query Requirements



• Execution Filtering: include/exclude all p-assertions that are marked as part of an execution by a single actor.
• Functionality Filtering: include/exclude p-assertions that have one of a given set of operation types.
• Process Filtering: include/exclude p-assertions that belong to a given (set of) process(es).
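The three filtering requirements can be sketched as a single scoping function over recorded p-assertions. The record fields (`actor`, `operation`, `process`), the sample data and the function name `scope` are assumptions made for illustration and do not reflect the actual query interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAssertion:
    actor: str       # actor whose execution the assertion is part of
    operation: str   # operation type of the documented interaction
    process: str     # process the assertion belongs to
    text: str


def scope(assertions, actors=None, operations=None, processes=None,
          exclude=False):
    """Keep (or, with exclude=True, drop) assertions matching every
    criterion that is given; a criterion left as None matches everything."""
    def matches(a):
        return ((actors is None or a.actor in actors)
                and (operations is None or a.operation in operations)
                and (processes is None or a.process in processes))
    return [a for a in assertions if matches(a) != exclude]


# Hypothetical recorded documentation for two runs.
recorded = [
    PAssertion("Service A", "encode", "run-1", "I received M1"),
    PAssertion("Service B", "compress", "run-1", "I used algorithm x"),
    PAssertion("Service A", "encode", "run-2", "I received M5"),
]

run_1 = scope(recorded, processes={"run-1"})                  # process filtering
not_a = scope(recorded, actors={"Service A"}, exclude=True)   # execution filtering
encodes = scope(recorded, operations={"encode"})              # functionality filtering
print(len(run_1), len(not_a), len(encodes))
```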
Conclusions
Conclusions






Mostly unexplored area that is crucial to
develop trusted systems
Definition of provenance
Specification of provenance representation
Recording protocol
Querying interfaces
Reasoning based on provenance

Presents lots of opportunities
Conclusions

Current work:





System and protocol design, architecture specification, generic support for use cases
Pursue the deployment in concrete application
and performance evaluation
Work towards a standardisation proposal
Download our software from www.pasoa.org
Tell us about your use cases: we are keen to
find new collaborations in this space!
Publications
1. Sylvia C. Wong, Simon Miles, Weijian Fang, Paul Groth and Luc Moreau. Provenance-based Validation of E-Science Experiments. In Proceedings of the International Semantic Web Conference (ISWC'05), Nov 2005.
2. Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.
3. Paul T. Groth. Recording Provenance in Service-Oriented Architectures. 9 Month Report, University of Southampton; Faculty of Engineering, Science and Mathematics; School of Electronics and Computer Science, 2004.
4. Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
5. Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.
6. Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.
7. Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.