Provenance and pragmatics

Computer and database experts are doing much work to address provenance issues in formal terms. Meanwhile, organisations which collect, organise and serve data have evolved pragmatic practices of differing levels of functionality. The aim of this breakout group was to expose the existing practice of these organisations and to discuss it in the context of database methods. Participants from four different domains presented brief overviews of the practice within their domain:

Engineering – the maintenance of lists of components, with different versions, used in electronic engineering products
Astronomy – data from sky surveys at different wavelengths, and catalogues of objects
Environmental simulation – data generated by simulations of the marine environment in a river mouth
Bioinformatics – shared data collections of biomolecular information

The key drivers and key issues differ across the four domains:

Engineering – key driver for provenance: ensuring correctness; key issue: recording “used in” information to allow updates
Astronomy – key driver for provenance: evoking scientific trust; key issue: data transformation from raw data
Simulation – key driver for provenance: ensuring comparability of data, liability issues; key issue: tweaking of details of the simulation creates incompatibilities
Bioinformatics – key driver for provenance: evoking trust, data quality, attribution; key issue: assertions based on whole-database comparisons

Motivations for capturing provenance information

The disciplines under consideration identified a range of motivations for wishing to collect provenance information. These were broadly shared, but the emphasis differed from discipline to discipline.

Quality/Trust – Provenance information was important in enabling scientists to assess how much trust they could reasonably place in particular information. This was particularly true in astronomy. It was noted that “cheap” and easily reproduced information (e.g., DNA sequence) is less jeopardised by poor provenance information, as scientists will typically resequence to check data crucial to their scientific direction. By contrast, there is substantial concern about the provenance of expensive information (e.g., macromolecular structures), where repeating experiments is impractical, or of “one-chance-to-collect” information (e.g., cosmic events).

Attribution – Correct attribution of scientific findings depends on the databases which include them carrying provenance information. In the biological arena this is seen as an important part of scientific credit, though it breaks down where composite objects are derived from the work of hundreds of scientists. It seems less of an issue for large-scale genome data or sky surveys, where the attribution is clear. Liability (blame assignment) was seen as the dark side of attribution.

Priority – In the biological world, patent documentation searches may need to know exactly who created what, and when. At the EBI, special systems to support this have been developed (at the expense of the European Patent Office).

Interpretation – Provenance of the information can be important to ensuring its correct interpretation. This requirement was particularly important for the environmental simulation data, where one would like to draw conclusions by comparing periodic data generated over a long time scale by evolving simulation methods.

Interoperability – Connections between different data collections depend on the ability to identify equivalent (or related) objects.

Roll-back – Often the data presented are the result of refinement/reduction/transformation of “raw” data. If this pre-processing turns out to be suspect, it is necessary to re-examine or recompute the data; a minimal sketch of one way to record such derivations follows this list.
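To make the Roll-back motivation concrete, the following is a minimal sketch (not taken from any of the participating groups; all names are illustrative) of a derivation record: enough provenance about a refinement/reduction/transformation step that the derived data can later be re-examined or recomputed if the pre-processing of the raw data is called into question.

```python
# Illustrative sketch only: a record of how derived data were produced from raw
# inputs, sufficient to support roll-back (re-examination or recomputation).
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


def checksum(path: str) -> str:
    """Content hash of a raw input file, so later changes are detectable."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


@dataclass
class DerivationRecord:
    """What was derived, from which raw inputs, by which tool and parameters."""
    derived_id: str
    raw_inputs: Dict[str, str]     # raw file path -> checksum at derivation time
    tool: str                      # name and version of the transformation tool
    parameters: Dict[str, object]  # settings used for the transformation

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    def stale_inputs(self) -> List[str]:
        """Raw inputs whose current checksum no longer matches the record."""
        return [p for p, h in self.raw_inputs.items() if checksum(p) != h]


def derive(derived_id: str, raw_paths: List[str], tool: str,
           parameters: Dict[str, object],
           transform: Callable[[List[str], Dict[str, object]], object]):
    """Run a transformation and keep the provenance needed to roll it back."""
    record = DerivationRecord(
        derived_id=derived_id,
        raw_inputs={p: checksum(p) for p in raw_paths},
        tool=tool,
        parameters=parameters,
    )
    return transform(raw_paths, parameters), record
```

If such a record is stored alongside the derived data, a suspect pre-processing step can be answered by rerunning the recorded tool and parameters over the recorded raw inputs, or by flagging derived entries whose raw inputs have since changed.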
Cost/Benefit/Risk

It became clear that, in practice, data curation centres are either implicitly or explicitly making cost/benefit decisions about the collection and propagation of provenance information, along with an intuitive assessment of the risks associated with failure to do so. Thus:

Cheap, reproducible data carry little risk, as they can be redetermined, so there is less pressure to record provenance.
High-aggregation provenance (e.g., information derived from comparison with large, changing databases) seems expensive to collect well, and its suppliers simply give up on provenance.
Expensive data (e.g., sky surveys, macromolecular structures) create pressure to capture provenance, because doing so is cheap by comparison with recollecting the data.
Unique event data (e.g., cosmic events) cannot be verified after the fact, so they are only as good as the associated provenance information.
Life-critical data (e.g., drug ingredient pedigree, aircraft component versions) create pressure to capture provenance because of the high risk associated with not doing so; we are therefore prepared to tolerate a fairly high cost in capturing the information.

Reducing the cost of provenance information

A defeatist might conclude that we effectively only create provenance information when it is essential or easy, and give up otherwise. However, the group was of the view that there were some obvious strategies to reduce the burden of collecting the information, and was willing to accept that there might also be some non-obvious ones.

Archive rather than serve provenance – Sometimes the need to access the provenance information may be rare enough that it merits little more than securing the information so that it can be reworked later. For example, old versions of sequence databases are typically held, but it is substantial work to retrieve them.

Automate – If the tools which data producers use along the way to creating their data record what they do in a standard way, then this information can be “harvested” to create high-quality provenance information with less effort.

…and the non-obvious – Can the task of recording the provenance of a complex and ever-changing database be turned into a database engineering problem rather than a scientific domain problem? Could the very technology used to maintain the database also maintain the provenance information? The domain specialists look to the computer scientists for help here.
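The closing question lends itself to a small illustration. The sketch below assumes a relational store (SQLite, purely for the sake of a runnable example) and shows the database technology itself maintaining provenance: an ordinary trigger fills in a provenance table whenever a curated entry is updated, so no extra effort is required from whoever submits the change. The table, column and identifier names are assumptions made for illustration, not a recommended schema.

```python
# Illustrative sketch: let the DBMS that maintains the data also maintain its
# provenance, via a trigger that records every revision of a curated entry.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    id      TEXT PRIMARY KEY,
    value   TEXT,
    version INTEGER DEFAULT 1
);

-- Every revision of an entry is recorded here, with old/new values and a timestamp.
CREATE TABLE provenance (
    entry_id   TEXT,
    version    INTEGER,
    old_value  TEXT,
    new_value  TEXT,
    changed_at TEXT DEFAULT (datetime('now'))
);

-- The database itself captures provenance on every update of a curated value.
CREATE TRIGGER record_provenance
AFTER UPDATE OF value ON entry
BEGIN
    UPDATE entry SET version = OLD.version + 1 WHERE id = OLD.id;
    INSERT INTO provenance (entry_id, version, old_value, new_value)
    VALUES (OLD.id, OLD.version + 1, OLD.value, NEW.value);
END;
""")

conn.execute("INSERT INTO entry (id, value) VALUES ('P12345', 'original annotation')")
conn.execute("UPDATE entry SET value = 'revised annotation' WHERE id = 'P12345'")

# One provenance row per revision, e.g.
# ('P12345', 2, 'original annotation', 'revised annotation')
for row in conn.execute(
        "SELECT entry_id, version, old_value, new_value FROM provenance"):
    print(row)
```

The point of the sketch is only that the same machinery which keeps the data consistent can also keep its history, turning provenance capture into a database engineering concern rather than an extra task for the scientific domain experts.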