Position Statement: Musings on Provenance, Workflow and (Semantic Web) Annotations for Bioinformatics

Prof. Carole Goble, University of Manchester, UK. carole@cs.man.ac.uk

Provenance: birthplace, cradle, place of origin
Lineage: line, line of descent, descent, bloodline, pedigree, ancestry, parentage, stock, filiation, derivation (the descendants of an individual, the kinship relation between an individual and their progenitors, inherited properties shared with others of your bloodline)
History: the aggregate of past events, the continuum of events occurring in succession, account, chronicle, story, a body of knowledge, all that is remembered of the past preserved in writing.
Derive: deduce, infer, deduct, obtain, come from, educe, develop or evolve, descend.
[Source: WordNet Online, Princeton web site, 29th September 2002, 13:30 BST, edited by Carole Goble on 7th October 2002]

Is Carole Goble attending this workshop? Web page http://www-fp.mcs.anl.gov/~foster/provenance/ has Carol Goble and after 24/09/02 Carole Goble. Are these the same? Who made the change and why? When is the position statement due? Email message from Foster on 10/09/02 says 07/10/02. Email message from Foster on 24/09/02 says 03/10/02; the web page says 03/10/02. How long should the position statement be? Email 10/09/02 says 1-5 pages, web page says 1-2 pages. How many copies of the uncorrected or different data are there? Did changes get propagated? What was the situation on 05/09/02? What do you believe? I’m working off the 10/09/02 data: that’s what was copied to my calendar!

Provenance in Bioinformatics

The biological science community is highly fragmented. Different disciplines act autonomously, producing data repositories and analytical tools that operate over them in isolation.
Rather than a few international facilities producing vast amounts of data that needs to be accessible, biology copes with a very large number of sites (potentially thousands of individual laboratories) around the world, each using cheap, commodity technology to continuously generate substantial quantities of different kinds of data. Currently 500 public data repositories are in active service. Thus most biological knowledge resides in a large number of heterogeneous and distributed resources; the data presents serious analytical and linkage challenges, and to answer even the simplest question requires the intelligent interlinking of multiple repositories, which must be up-to-date and reliable.

In many resources, each record is analogous to an individual publication with not only raw data, but also additional annotation supplied by a small number of human experts (curators) or automated systems and published as biological literature (increasingly in electronic form). The annotations are the accumulated knowledge attributed to a sequence, structure, protein, etc. Annotations are typically semi-structured texts that make some use of keywords and controlled vocabularies, under the premise that a scientist will read and interpret the texts.

A distinction is made between primary and secondary databases: primary databases hold the raw data, such as sequence (GenBank, EMBL); secondary databases hold added-value results, accumulating and editing data from others and holding new data arising from analysis. For example, SWISS-PROT and PRINTS are secondary databases. Databases are published from the production database at regular intervals, ranging from daily to quarterly. Users can access the databases online for the most up-to-date information or download versions locally. Most sites download for provenance reasons.
SWISS-PROT draws on TrEMBL (which draws on EMBL) and holds distilled information about proteins; PRINTS draws on SWISS-PROT to collect custom-built GPCR ‘fingerprints’. PRINTS annotations are aggregates culled from SWISS-PROT annotations, Medline abstracts, OMIM, GRAP and PDB entries, etc. GPCR draws on PRINTS, and so on. Secondary databases not only hold new, derived data but also accumulate copies (possibly corrected, often editorialised, sometimes transformed) of information from the primary and secondary databases they draw from. This viral propagation of information makes tracking sources of data crucial for forming a view on its quality, validity, and freshness. If our curator updates the fingerprint some time later and gets different results, she will need to be told that a change has occurred and what that change is. If she does not trust the result, she will want to check on the raw data to ascertain the root of the problem. Unfortunately, audit trails are rarely kept and hardly ever published with the information. Two examples of annotation are given at the end of this paper.

I use the term information and not data: data often carries the implicit connotation that it is numeric, whereas it is often descriptive (the function of a gene product, for example, is described through the controlled vocabulary Gene Ontology (footnote 1)).

The annotation pipeline uses both automated and human means of analysis and integration organised into a workflow. Although traditional database integration techniques to create virtual federated databases or warehouses have a major role to play, workflows are seen by many bioinformaticians as the primary interoperation mechanism, not just for annotation but also for the general representation of in silico experiments.
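The "draws on" chain described in this section (SWISS-PROT on TrEMBL on EMBL, PRINTS on SWISS-PROT and others, GPCR on PRINTS) can be read as a small directed graph. The following is a minimal sketch, not a description of how any of these databases actually track their sources: the dictionary and function names are assumptions made for illustration, and only the relationships named in the text are included.

```python
from collections import deque

# "Draws on" relationships as stated in the text; nothing else is assumed.
DRAWS_ON = {
    "TrEMBL": ["EMBL"],
    "SWISS-PROT": ["TrEMBL"],
    "PRINTS": ["SWISS-PROT", "Medline", "OMIM", "GRAP", "PDB"],
    "GPCR": ["PRINTS"],
}


def upstream_sources(resource, draws_on=DRAWS_ON):
    """Return every resource that `resource` directly or indirectly draws on."""
    seen = set()
    queue = deque(draws_on.get(resource, []))
    while queue:
        source = queue.popleft()
        if source in seen:
            continue
        seen.add(source)
        queue.extend(draws_on.get(source, []))
    return seen


def downstream_dependants(resource, draws_on=DRAWS_ON):
    """Return every resource that might need to hear about a change to `resource`."""
    return {
        name for name in draws_on
        if resource in upstream_sources(name, draws_on)
    }
```

Under these assumptions, `upstream_sources("GPCR")` lists everything a curator might have to check when a result looks wrong, and `downstream_dependants("EMBL")` lists the resources that would need to be told about a correction at the root.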
The e-Scientist is at the centre of the in silico experimentation process: interacting with the executing processes (for example, altering the parameters of a BLAST alignment analysis); navigating between databases; and interacting with colleagues. Workflows are specified in abstract (a search of a protein database, followed by an alignment analysis, followed by a user-defined filter, followed by another database search); instantiated with concrete services (SWISS-PROT > BLASTp > myFilter > PDB); invoked (SWISS-PROT version 40 at EBI > BLASTp NCBI with default parameters > interactive filtering > PDB SDSC); and dynamically interacted with and altered (BLASTp doesn’t generate sensible results, so alter some parameters and rerun that activity before continuing, storing intermediate results in myRepository). Throughout the procedures, personal local collections are both used and dynamically created by siphoning off intermediate and final results.

The focus is not just on the ends (the result) but also on the means (the experimental process) of acquiring those results. This information is essential in order to promote reuse and to justify the findings later, particularly in understanding the quality and provenance of derived information. These workflows represent experimental practice and know-how, and are valuable knowledge commodities in their own right, to be reused, shared, adapted and annotated with interpretations of quality, provenance, security etc. Without such shared know-how, the increased prevalence and complexity of experimental techniques will lead to unnecessary replication of experiments, setting up of equipment in inappropriate ways, or drawing conclusions that are not fully justified by the technique that has been followed. Thus data results arising from processes should be linked to those processes. Personal notes are made on findings; the parameters of tools such as BLAST are customised and tuned.
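The progression described above (abstract specification, concrete services, invocation at a particular site and version, then interactive alteration) could be recorded in a single structure per activity. This is a minimal sketch under stated assumptions: the `Step` class, its field names and the `settings` parameter are hypothetical and are not taken from myGrid or from any real BLAST interface; only the service, site and version names come from the example in the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One workflow activity, recorded at the abstract, concrete and invoked levels."""
    abstract: str                      # what the step does, independent of any service
    service: Optional[str] = None      # concrete service chosen at instantiation
    provider: Optional[str] = None     # site that ran the service at invocation
    version: Optional[str] = None      # resource version, where known
    parameters: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # superseded parameter settings

    def rerun_with(self, **changes):
        """Keep the old settings before altering parameters for a rerun."""
        self.history.append(dict(self.parameters))
        self.parameters.update(changes)


def summarise(workflow):
    """Render the workflow as 'service vN at provider > ...'."""
    parts = []
    for step in workflow:
        label = step.service or step.abstract
        if step.version:
            label += f" v{step.version}"
        if step.provider:
            label += f" at {step.provider}"
        parts.append(label)
    return " > ".join(parts)


workflow = [
    Step("protein database search", "SWISS-PROT", "EBI", version="40"),
    Step("alignment analysis", "BLASTp", "NCBI", parameters={"settings": "default"}),
    Step("user-defined filter", "myFilter"),
    Step("database search", "PDB", "SDSC"),
]

# BLASTp gave poor results, so alter a (hypothetical) setting and rerun that step.
workflow[1].rerun_with(settings="custom")
```

Keeping the superseded settings in `history` rather than overwriting them is the design choice that matters here: it preserves the means as well as the ends, so the record can later justify why the final parameters were chosen.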
Moreover, the information that in silico experiments draw upon is not stable, but rather subject to constant refreshing. What if the human geneticists refine their view of the target site, or find that it was actually irrelevant? We need to be alerted to such information in order to be able to refine, or abandon, our analysis.

Thus, it would seem self-evident that provenance is a crucial prerequisite to informed understanding of biological information in primary, secondary and personal repositories. Our biologists and bioinformaticians agree. However, current practice for the systematic capture, propagation and processing of provenance metadata is at best sporadic. Commonly, once completed, the process (a.k.a. workflow) is lost, or privately held in a lab book or a README file and not propagated. For example, the SWISS-PROT public database keeps reasonable provenance records for internal use but doesn’t publish them.

What is provenance?

Provenance is an overloaded term (in biology at least). The quotation from WordNet at the beginning of this position statement illustrates some of the many takes on provenance. Is it audit? Or quality? Or evolution? Is it for justification or reproducibility? Is it for sharing know-how? There are many categories; it plays many roles; it applies to many different kinds of information; it is intended for different uses.

We can agree that it is metadata. Of course, one scientist’s metadata is another’s data. Thus a serialised workflow instance with all its parameter settings and values is a provenance record for the data arising from it, but also itself needs provenance information (which workflow specification was it instantiated from, who enacted it, was it interactively steered, and if so how).
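The observation that a provenance record is itself an object needing provenance suggests a recursive structure. The sketch below is one possible reading, not a proposed standard: the `Record` class, its fields and all identifiers in the example are hypothetical, chosen only to mirror the questions asked above (which specification, who enacted it, was it steered).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Any object in the experiment; its provenance is simply another Record."""
    identifier: str
    content: dict
    provenance: Optional["Record"] = None


def provenance_chain(record):
    """Follow provenance links outward and return the identifiers encountered."""
    chain = []
    current = record.provenance
    while current is not None:
        chain.append(current.identifier)
        current = current.provenance
    return chain


# Metadata about the workflow instance (all values are illustrative).
enactment = Record(
    "enactment-note-1",
    {"specification": "spec-A", "enacted_by": "a curator", "interactively_steered": True},
)
# The serialised workflow instance: metadata for the result, data for the note above.
instance = Record("workflow-instance-1", {"parameters": {"settings": "default"}}, enactment)
result = Record("result-1", {"hits": 3}, instance)
```

Here `provenance_chain(result)` walks from the result to the workflow instance and then to the note about how that instance was enacted, which is the sense in which one scientist's metadata is another's data.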
All objects in our experimental (in silico, in vivo, in vitro) world attract provenance: a bench experiment, an instrument, a database, a document, a database entry, an analytical tool, a workflow instance as executed, a service provider, a grid service (e.g. an ontology server, a distributed query processor, a registry), a notification event (when was it sent, who sent it?). The term “data provenance” gives the misleading impression that provenance only applies to database entries of scientific data. Moreover, an object could be defined at a range of granularities (part of a data record, a whole record, a whole database; an input of a workflow activity, a workflow activity, the whole workflow).

1 http://www.geneontology.org

In a survey of biologists and bioinformaticians we found that the origins of bioinformatics tools and personal data were important. The recording of in silico experiments, and whether data retrieved could be attributed to a wetlab or in silico experiment from a particular establishment, is essential. The provenance of data held in public bio-repositories was thought to be of lesser importance since most users were aware of the assumptions and limitations associated with it.

These musings have an impact on key properties, representation, generation, use and lifetime management. For example, provenance intended for an entirely automated analysis process has different requirements from that intended for a scientist to remind themselves about why they dismissed some results based on their personal opinion of the experimenter.

What form does provenance take?

Provenance should be metadata that is intended for machine consumption. Free text makes it suitable for humans, not computation. There are two major forms: (a) an annotation attached to an object or collection of objects, such as a database entry, in a structured, semi-structured or free text form; (b) a derivation path such as a workflow, a database query, or a program and its parameters.
The derivation path could be a copy (copying a part of a SWISS-PROT data entry into a PRINTS record) or an edit, for example evolution of a workflow (substituting a BLAST for PSI-BLAST or altering the parameters of a BLAST activity while a workflow is being enacted). What is the relationship between these two?

What does provenance represent?

Obviously the seven W’s (Who, What, Where, Why, When, Which, (W)how). Provenance does fundamentally depend upon being able to assert identity, and to attach an identity to an object (which might be a piece of XML in a document). So it also depends on security (is this really the service it says it is?) and trust. The Life Science Identifier (LSID) is a step towards a uniform common reference format for entries in public databases [1].

Where does provenance come from?

In a survey we found that “the bioinformatics processes of in silico experiments are generally not recorded by users since it is extremely time-consuming to record the large amounts of metadata and intermediary results with enough detail to make the process repeatable by another user.” Setting aside whether provenance is about repeatability, certain forms of provenance might be incidental, gathered by recording a detailed audit trail of process. Others, such as “goals” or “hypothesis”, are (somehow) supplied by the scientist. How much will users be prepared to record? How do we persuade service providers to supply and maintain provenance information?

What is provenance used for?

There seem to be a number of reasons; here are some:

Reliability & quality: do we trust, believe or respect the source or the process that led to this object? “The problem is: the databases are God-awful … If the data is still fundamentally flawed, then better algorithms add little”. If the origin of flawed information can be identified, what are the legal implications?
Justification & audit: an accurate historical record of the source and method of the in silico experiment, equivalent to that found in (wet) lab books and reproducible for the “methods” sections of papers.

Reusability, reproducibility & repeatability: a derivation path is not just a record of what has been done, but also a means by which others can repeat and validate the experiment. However, repeating an in silico experiment is not the same as reproducing it. Reproducibility is only possible if exactly the same conditions (same database, same version, same content, same tool, same algorithm, same indexes) can be recreated, and that is only possible if snapshots are kept. Many biologists, fearing that versions of public databases will disappear, make a practice of copying versions locally. It is not possible, for example, to get hold of SWISS-PROT version 35 unless you made your own copy. Reusing and repeating is using the provenance as know-how, learning from history and disseminating knowledge and best practice (appropriate settings for parameters for example).

Change & evolution: Audit trails support change management. Annotations may become invalid because of changes in the underlying data or contradictory annotations elsewhere. Provenance derivation paths become event notification routes.

Ownership, security, credit & copyright: Objects originate from somewhere. As objects migrate so must their provenance. Service providers would like credit for their databases being used (their funding may depend upon it, as might accounting procedures).

Other considerations include:

Immutability: some metadata is mutable, and some is not. Is provenance intrinsically immutable?

Migration & storage: as information travels through various databases, so should its provenance. But how should provenance be stored? Separately from its data, or together with it? Some will be encapsulated (e.g. ownership), others held separately (e.g.
a workflow linking several data entries, none of which is owned by the workflow owner).

Aggregation: as information aggregates in annotations, so does its provenance. How does this work when copying pieces of text, phrases or controlled vocabulary terms in some database annotation? What does it mean to aggregate provenance information?

Versioning: as objects version, how does their provenance version?

Provenance and the Semantic Web

Provenance is metadata. Much is descriptive annotation. The Semantic Web is fundamentally about the representation of computationally accessible metadata through annotations and controlled vocabularies. We might use RDF as a means of representing provenance, and use graph matching to aggregate provenance data. We could use ontology languages such as DAML+OIL and OWL (footnote 2), or even Topic Maps, to represent the controlled vocabularies used in provenance records. Semantic web annotation mechanisms have already been used in some Grid projects, for example in Geodise (footnote 3) for annotating logs of CFD optimisation runs [2]. myGrid (footnote 4) plans to use the COHSE annotation system [3] for associating workflows, data and other XML-based myGrid objects. myGrid already uses DAML+OIL to annotate myGrid services with domain and service metadata [4]. However, one must be clear what one means by annotation [5].

The questions are: (a) how can semantic web technologies be used for provenance? For example, are RDF query languages up to the challenges? Can we use RDF graphs to aggregate provenance information? What role can reasoning play in provenance annotations? (b) what are the provenance implications that using semantic web technologies brings? For example, when the EMBL-EBI (footnote 5) publishes a new service they might create new concepts in an ontology to describe it, which should have provenance information associated.

Final word: KISS (footnote 6). A small contribution is valuable. There will be a continuum of provenance forms, not one solution.
Doing something as simple as automatically representing a workflow instance in XML and making it available through a portal would be helpful.

References

[1] LSID. http://www.i3c.org
[2] Chen L, Cox SJ, Goble C, Keane AJ, Roberts A, Shadbolt NR, Smart P, Tao F. Engineering Knowledge for Grid Applications – Geodise Leveraging Knowledge for Grid Computing. Submitted to EuroWeb 2002.
[3] Carr L, Bechhofer S, Goble CA, Hall W. Conceptual Linking; Ontology-based Open Hypermedia. In WWW10, 10th World Wide Web Conference, Hong Kong, May 2001.
[4] Wroe C, Stevens R, Goble CA, Roberts A, Greenwood M. A suite of DAML+OIL Ontologies to Describe Bioinformatics Web Services and Data. To appear in International Journal of Cooperative Information Systems.
[5] Bechhofer S, Carr L, Goble CA, Kampa S, Miles-Board T. The Semantics of Semantic Annotation. To appear in: ODBASE: First International Conference on Ontologies, DataBases, and Applications of Semantics for Large Scale Information Systems, Irvine, California, October 2002.

2 OWL Web Ontology Language 1.0 http://www.w3.org/TR/owl-ref/
3 http://www.geodise.org
4 http://www.mygrid.org.uk
5 European Molecular Biology Laboratory-European Bioinformatics Institute
6 Keep It Simple, Stupid.

ID   PRIO_HUMAN STANDARD; PRT; 253 AA.
AC   P04156;
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093 [NCBI, ExPASy, Israel, Japan]; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H., Prusiner S.B., Dearmond S.J.;
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [4]
RP   STRUCTURE BY NMR OF 118-221.
RX   MEDLINE=20359708; PubMed=10900000; [NCBI, ExPASy, EBI, Israel, Japan]
RA   Calzolai L., Lysek D.A., Guntert P., von Schroetter C., Riek R., Zahn R., Wuethrich K.;
RT   "NMR structures of three single-residue variants of the human prion protein.";
RL   Proc. Natl. Acad. Sci. U.S.A. 97:8340-8345(2000).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES,
CC       LIKE: CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER
CC       SYNDROME (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS;
CC       SCRAPIE IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE)
CC       IN CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC
CC       WASTING DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE
CC       ENCEPHALOPATHY(EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES
CC       ILLUSTRATE THREE MANIFESTATIONS OF CNS DEGENERATION: (1)
CC       INFECTIOUS (2) SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME,
CC       CWD, BSE, FSE, EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF
CC       PRION-INFECTED FOODSTUFFS.
CC   -!- SIMILARITY: BELONGS TO THE PRION FAMILY.
DR   HSSP; P04925; 1AG2. [HSSP ENTRY / SWISS-3DIMAGE / PDB]
DR   MIM; 176640; -. [NCBI / EBI]
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
DR   PROSITE; PS00291; PRION_1; 1.
DR   PROSITE; PS00706; PRION_2; 1.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism; Disease mutation.

Excerpt from a typical SWISS-PROT entry.

Prion protein signature
PROSITE; PS00291 PRION_1; PS00706 PRION_2
BLOCKS; BL00291
PFAM; PF00377 prion
INTERPRO; IPR000817

1. STAHL, N. AND PRUSINER, S.B. Prions and prion proteins. FASEB J. 5 2799-2807 (1991).
2. BRUNORI, M., CHIARA SILVESTRINI, M. AND POCCHIARI, M. The scrapie agent and the prion hypothesis. TRENDS BIOCHEM.SCI. 13 309-313 (1988).
3. PRUSINER, S.B. Scrapie prions. ANNU.REV.MICROBIOL. 43 345-374 (1989).

Prion protein (PrP) is a small glycoprotein found in high quantity in the brain of animals infected with certain degenerative neurological diseases, such as sheep scrapie and bovine spongiform encephalopathy (BSE), and the human dementias Creutzfeldt-Jacob disease (CJD) and Gerstmann-Straussler syndrome (GSS). PrP is encoded in the host genome and is expressed both in normal and infected cells. During infection, however, the PrP molecules become altered and polymerise, yielding fibrils of modified PrP protein.

PrP molecules have been found on the outer surface of plasma membranes of nerve cells, to which they are anchored through a covalent-linked glycolipid, suggesting a role as a membrane receptor. PrP is also expressed in other tissues, indicating that it may have different functions depending on its location.

The primary sequences of PrP's from different sources are highly similar: all bear an N-terminal domain containing multiple tandem repeats of a Pro/Gly rich octapeptide; sites of Asn-linked glycosylation; an essential disulphide bond; and 3 hydrophobic segments. These sequences show some similarity to a chicken glycoprotein, thought to be an acetylcholine receptor-inducing activity (ARIA) molecule. It has been suggested that changes in the octapeptide repeat region may indicate a predisposition to disease, but it is not known for certain whether the repeat can meaningfully be used as a fingerprint to indicate susceptibility.

Excerpt from a distilled PRINTS annotation