Liming Chen, Victor Tan,
Fenglian Xu, Alexis Biller,
Paul Groth, Simon Miles,
John Ibbotson, Michael Luck and Luc Moreau
• Asking questions about the provenance of something, i.e. the process by which it came to be as it is, is essential in many domains
• We are working with bioinformaticians, medics, aerospace engineers, physicists and have found a wide range of questions they wish to ask
• A simple example application can:
– Clarify the requirements on software to aid answering those questions
– Be used to explain the issues involved to non-domain experts
– Be extended in controlled ways to explore issues that arise in ‘real’ applications
• Recent work of the EU Provenance project:
– Developed a logical architecture for software to aid answering provenance-related questions, along with other research on security, scalability and user tool support.
– Now being applied to two project applications: organ transplant management (UPC, Spain) and aerospace engineering (DLR, Germany)
– The logical architecture document should be released next week: keep an eye on www.gridprovenance.org
• Recent work of the PASOA project:
– Has focused on e-Science applications and has gathered requirements, developed protocols and software
– EU Provenance used PASOA software for the work described in this talk
– PASOA will be discussed in the following two presentations
• The example application
• Asking provenance-related questions
• The example as a service-oriented process
• Recording documentation of a process
• What does the example show us?
• What are the limits of the example?
• Conclusions
• INGREDIENTS
– 110g (4oz) Butter
110g (4oz) Caster Sugar
110g (4oz) Self-raising Flour
2 Eggs
Vanilla Essence or 1 tsp Grated Lemon Rind
• RECIPE
– Preheat oven to 190°C: 375°F: Gas 5.
Whisk together the butter and sugar until light and creamy.
Add the beaten eggs gradually with a little of the flour.
Fold in the remaining sieved flour and add the flavouring.
Divide equally between two 15cm (6 inch) sandwich tins.
Bake for 20 - 25 minutes.
Turn out on to a wire rack to cool.
• This is not such a contrived example!
www.thefoody.com
• The baking process (from the diagram):
– Whisk 20g butter and 20g sugar together to get mixture 1
– Beat 2 eggs for 2 minutes, then mix the beaten eggs with mixture 1 to obtain mixture 2
– Fold 100g flour together with mixture 2 to get mixture 3
– Set the baking temperature to 180˚C and the baking time to 30min, put mixture 3 into the oven and obtain a cake
• Some questions can be asked after baking a cake
• Answers to the questions can be found if we record details of the baking process during its execution
• The details of the baking process are what we call the provenance of a cake
• Did we follow the recipe accurately?
– Did we use the correct ingredients at the right time?
– Did we provide the correct quantities? Correct units?
– Did we perform actions for the right duration?
We need to keep a record of all actions performed with all their parameters (such as the number of eggs used)
• Organ transplant example : Did the medics follow the correct procedure?
• Bioinformatics example : Did I analyse an amino acid sequence using tools that actually only apply to nucleotide sequences?
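The recipe check above can be sketched as a comparison between the recipe and a record of the actions actually performed. This is a minimal illustration only: the record layout and the names `RECIPE`, `recorded` and `check_recipe` are hypothetical and are not taken from the project software. The recipe values come from the baking process shown earlier in the talk; the recorded values (3 eggs, no baking step) are made up to show a deviation.

```python
# Hypothetical record layout: (action name, parameters).
RECIPE = [
    ("whisk", {"butter_g": 20, "sugar_g": 20}),
    ("beat_and_mix", {"eggs": 2, "beating_time_min": 2}),
    ("fold", {"flour_g": 100}),
    ("bake", {"temperature_c": 180, "baking_time_min": 30}),
]

def check_recipe(recipe, recorded):
    """Return a list of human-readable deviations from the recipe."""
    deviations = []
    for step, (expected, actual) in enumerate(zip(recipe, recorded), start=1):
        exp_name, exp_params = expected
        act_name, act_params = actual
        if exp_name != act_name:
            deviations.append(f"step {step}: expected {exp_name}, recorded {act_name}")
            continue
        # Compare every parameter the recipe prescribes for this action.
        for key, value in exp_params.items():
            if act_params.get(key) != value:
                deviations.append(
                    f"step {step} ({exp_name}): {key} expected {value}, "
                    f"recorded {act_params.get(key)}"
                )
    if len(recorded) < len(recipe):
        deviations.append(f"process stopped after step {len(recorded)}")
    return deviations

# Illustrative record: one egg too many, and the bake step never happened.
recorded = [
    ("whisk", {"butter_g": 20, "sugar_g": 20}),
    ("beat_and_mix", {"eggs": 3, "beating_time_min": 2}),
    ("fold", {"flour_g": 100}),
]
deviations = check_recipe(RECIPE, recorded)
```

The check is only possible because every action was recorded together with all of its parameters; a record that said only "eggs were added" could not answer the question.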
• Other factors can affect the baking process:
– Amount of flour required varies with altitude
– The oven is broken and bakes at a different temperature from the one requested
We need to know the “internal state” of the different entities participating in the baking process (such as actual oven temperature or oven altitude)
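One way to picture this is a record that carries the actor's internal state next to the request it received. The sketch below is an assumption for illustration: the `InteractionRecord` class, its `actor_state` field, the 5 degree tolerance and the sample readings (150˚C, 40m) are all hypothetical and do not come from the project software.

```python
from dataclasses import dataclass, field

@dataclass
class InteractionRecord:
    actor: str
    operation: str
    inputs: dict
    # Hypothetical field: state the actor reports about itself.
    actor_state: dict = field(default_factory=dict)

def temperature_discrepancies(records, tolerance_c=5):
    """Return (actor, requested, actual) where the two temperatures differ."""
    found = []
    for record in records:
        requested = record.inputs.get("temperature_c")
        actual = record.actor_state.get("actual_temperature_c")
        if requested is None or actual is None:
            continue  # this actor did not report a temperature
        if abs(requested - actual) > tolerance_c:
            found.append((record.actor, requested, actual))
    return found

records = [
    InteractionRecord("Whisk", "whisk", {"butter_g": 20, "sugar_g": 20}),
    InteractionRecord(
        "Oven", "bake",
        {"temperature_c": 180, "baking_time_min": 30},
        {"actual_temperature_c": 150, "altitude_m": 40},  # illustrative readings
    ),
]
problems = temperature_discrepancies(records)
```

Without the oven's own report of its actual temperature, the record would show only that 180˚C was requested, and the broken oven would go unnoticed.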
• Organ transplant example : By what criteria did a team decide to accept or reject an organ?
• Bioinformatics example : What script was used by the services to perform each stage of the experiment?
• Did we use the same amounts of ingredients for baking cake 1 and cake 2? Or the same proportions?
• What was the longest step in the execution of a recipe?
• Why did we not finish the process? Where did we stop?
The process that led to a given cake should be delimited and analysable
• Organ transplant example: Which patient’s death led to the organ now being transplanted?
• Bioinformatics example: What samples led to the final analysis result?
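Questions such as "what led to this result?" amount to walking backwards from a product through the steps that produced it. The sketch below assumes a hypothetical `PRODUCED_BY` mapping built from the baking steps described earlier; the step names and the `raw_inputs` function are illustrative, not part of the project software.

```python
# Hypothetical mapping: product -> (step that produced it, items it was made from).
PRODUCED_BY = {
    "Mixture 1": ("whisk", ["Butter", "Sugar"]),
    "Mixture 2": ("mix", ["Mixture 1", "Eggs"]),
    "Mixture 3": ("fold", ["Flour", "Mixture 2"]),
    "Cake": ("bake", ["Mixture 3"]),
}

def raw_inputs(item, produced_by):
    """Return the items with no recorded producer that `item` derives from."""
    if item not in produced_by:
        return {item}  # nothing recorded about how this item was made
    _, inputs = produced_by[item]
    result = set()
    for source in inputs:
        result |= raw_inputs(source, produced_by)
    return result

ingredients = raw_inputs("Cake", PRODUCED_BY)
```

The same traversal answers the bioinformatics question if samples take the place of ingredients and analysis stages take the place of baking steps.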
• Did the baker follow the user’s instructions (regardless of any claim from the baker)?
• Did each step of the baking process follow the user’s instructions?
– Did they receive the correct instructions?
– Did they follow the received instructions?
All entities should document their view of a process because it may vary
• Organ transplant example : Were there differing opinions on the suitability of an organ for transplant?
• Bioinformatics example : I claim I used a database in my experiments whose license allows me to patent my results: does the database owner confirm this?
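If the sender and the receiver each document the same interaction, their two views can be compared directly. The sketch below is a minimal illustration under assumed names: `compare_views`, `user_view` and `baker_view` are hypothetical, and the mismatching baking time of 25 minutes is invented for the example.

```python
def compare_views(view_a, view_b):
    """Return {parameter: (a_value, b_value)} wherever the two views differ."""
    keys = set(view_a) | set(view_b)
    return {
        key: (view_a.get(key), view_b.get(key))
        for key in sorted(keys)
        if view_a.get(key) != view_b.get(key)
    }

# What the user says it sent, and what the baker says it received.
user_view = {"temperature_c": 180, "baking_time_min": 30}
baker_view = {"temperature_c": 180, "baking_time_min": 25}  # illustrative mismatch

differences = compare_views(user_view, baker_view)
```

A difference here does not say who is right, only that the two accounts disagree and that the question needs further investigation.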
• We implemented the application as a set of Web Services, and then implemented clients that answered the provenance-related questions by querying the provenance store
• This involved mapping the scenario onto a service-oriented architecture
[Figure: the example as a service-oriented process.
Actors: User, Baker, Whisk, Beat & Mix, Fold, Oven (Bake).
Messages exchanged: Sugar + Flour + Beating Time + Temperature; Butter + Sugar; Mixture 1; Mixture 1 + Eggs + Beating Time; Mixture 2; Flour + Mixture 2; Mixture 3; Mixture 3 + Temperature + Baking Time; Cake; Cake.]
[Figure: the same actors (User, Baker, Whisk, Beat & Mix, Fold, Oven (Bake)) together with a Provenance Store.]
After baking, the provenance store contains a trace of the different activities that were involved in the production of a cake.
The provenance of a cake is the documentation of the process that led to that cake
Baker (Sugar, Flour, Beating Time, Temperature)
Whisk (Butter, Sugar)
WhiskReturn (Mixture 1)
Beat&Mix (Mixture 1, Eggs, Beating Time)
Beat&MixReturn (Mixture 2)
Fold (Flour, Mixture 2)
FoldReturn (Mixture 3)
OvenBake (Mixture 3, Temperature, Baking Time)
OvenBakeReturn (Cake)
BakerReturn (Cake)
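The trace above can be read as a list of request and response messages. As a rough sketch, assuming a hypothetical in-memory representation (the real provenance store is a Web Service with its own protocol, which is not shown here), requests can be paired with their responses and then queried:

```python
# The trace above as (message name, contents) pairs, copied as listed.
TRACE = [
    ("Baker", ["Sugar", "Flour", "Beating Time", "Temperature"]),
    ("Whisk", ["Butter", "Sugar"]),
    ("WhiskReturn", ["Mixture 1"]),
    ("Beat&Mix", ["Mixture 1", "Eggs", "Beating Time"]),
    ("Beat&MixReturn", ["Mixture 2"]),
    ("Fold", ["Flour", "Mixture 2"]),
    ("FoldReturn", ["Mixture 3"]),
    ("OvenBake", ["Mixture 3", "Temperature", "Baking Time"]),
    ("OvenBakeReturn", ["Cake"]),
    ("BakerReturn", ["Cake"]),
]

def pair_invocations(trace):
    """Map each operation name to (request contents, response contents)."""
    requests = {
        name: contents for name, contents in trace if not name.endswith("Return")
    }
    paired = {}
    for name, contents in trace:
        if name.endswith("Return"):
            operation = name[: -len("Return")]
            paired[operation] = (requests[operation], contents)
    return paired

def producer_of(item, paired):
    """Return the operations whose response contained `item`."""
    return [op for op, (_, outputs) in paired.items() if item in outputs]

paired = pair_invocations(TRACE)
```

For example, `producer_of("Mixture 2", paired)` names the Beat&Mix operation, and `paired["OvenBake"][0]` lists what the oven was asked to bake and with which settings.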
• We distinguish
– process documentation (the documentation recorded into a provenance store about a process)
– provenance (the information retrieved from a provenance store about a process)
• This is because we have found there to be different requirements on each
[Figure: Process documentation → Processing → Provenance]
• Should allow questions about the provenance of entities to be answered
• Should follow a consistent, application-independent structure so that independent parties can record documentation that is easily combined
– e.g. oven may be owned by someone other than the user, but their documentation is combined to answer whether the requested temperature was used
• Should state exactly what those recording it know to have happened, not confuse it with what they guessed or inferred had happened
– e.g. baker states that it put the cake in the oven, not that the cake was successfully baked, because the oven may have been broken
• Should give the client asking for the provenance of something control over the scope of the answer
– e.g. whether the process that produced the flour is included in the provenance of the cake
• Should provide the information relevant to answering a client’s or user’s questions, without swamping them with detail
– e.g. report how much flour was used rather than giving the XML structure sent between application components
• May (in order to achieve the above) include inferred information
– e.g. infer from baker putting mixture in oven and getting cake out that the cake was successfully baked from the mixture
• Should allow different parties to record independent documentation if they want to
– e.g. user and baker can record independently, allowing discrepancies to be noticed
• Should have no dependence on any one workflow engine/language, and no requirement for (explicit) workflows to be used at all
– e.g. our example application was written in Java, and baking in reality follows a plan in someone’s head
• Should have independence from any one product of a process: should not be necessary to store process documentation with any one result of a process
– e.g. the provenance of the cake, the provenance of the ingredients and the provenance of the intermediate mixtures overlap, so we cannot claim the documentation ‘belongs’ to any one of them
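The split between process documentation and provenance, and the client's control over scope, can be sketched as a query over recorded assertions. Everything below is an assumption for illustration: the record layout, the `provenance` function and its `stop_at` parameter are hypothetical, and the "Mill" record (wheat milled into flour) is invented to stand for "the process that produced the flour".

```python
# Process documentation: what each actor asserts it did (hypothetical layout).
DOCUMENTATION = [
    {"asserter": "Mill", "operation": "mill",
     "inputs": ["Wheat"], "outputs": ["Flour"]},  # invented upstream process
    {"asserter": "Whisk", "operation": "whisk",
     "inputs": ["Butter", "Sugar"], "outputs": ["Mixture 1"]},
    {"asserter": "Beat&Mix", "operation": "beat_and_mix",
     "inputs": ["Mixture 1", "Eggs"], "outputs": ["Mixture 2"]},
    {"asserter": "Fold", "operation": "fold",
     "inputs": ["Flour", "Mixture 2"], "outputs": ["Mixture 3"]},
    {"asserter": "Oven", "operation": "bake",
     "inputs": ["Mixture 3"], "outputs": ["Cake"]},
]

def provenance(item, documentation, stop_at=()):
    """Return the operations that led to `item`.

    Items listed in `stop_at` are not expanded, which lets the client
    decide how far back the answer should reach.
    """
    steps = []
    frontier = [item]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current in seen or current in stop_at:
            continue
        seen.add(current)
        for record in documentation:
            if current in record["outputs"]:
                steps.append(record["operation"])
                frontier.extend(record["inputs"])
    return steps

full = provenance("Cake", DOCUMENTATION)
scoped = provenance("Cake", DOCUMENTATION, stop_at={"Flour"})
```

The same documentation serves both queries: `full` reaches back to the milling of the flour, while `scoped` stops at the flour itself. The documentation is stored once, independently of any one product, and the provenance is computed per question.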
• The current example has limitations:
– The physical world is treated as if it mapped directly to the electronic world. How does a baker record documentation in a provenance store Web Service? Through a GUI? If the GUI goes wrong, or the baker uses it wrongly, do we still have sound process documentation?
– None of the objects in the process have constituent parts that we may want to independently find the provenance of
– Assumes a single provenance store that every service happily submits documentation to
• …but the strength of the example is that it can be simply extended to remove these limitations
• The simple example allows us to determine the requirements on software to record process documentation and make it available to users
• We have used it as a testbed, extending it to explore other aspects of provenance (along with other applications)
• It is rich enough to continue extending to mirror, in a controlled way, issues discovered in the future
• IBM United Kingdom Limited
• University of Southampton
• University of Wales, Cardiff
• Deutsches Zentrum für Luft- und Raumfahrt e.V.
• Universitat Politecnica de Catalunya
• Magyar Tudomanyos Akademia Szamitastechnikai es Automatizalasi Kutato Intezet