Carole Goble Position Statement: Musings on Provenance, Workflow and (Semantic Web) Annotations for Bioinformatics

Position Statement: Musings on Provenance, Workflow and (Semantic Web)
Annotations for Bioinformatics
Prof. Carole Goble, University of Manchester, UK. carole@cs.man.ac.uk
Provenance: birthplace, cradle, place of origin
Lineage: line, line of descent, descent, bloodline, pedigree, ancestry, parentage, stock, filiation, derivation
(the descendants of an individual, the kinship relation between an individual and their progenitors,
inherited properties shared with others of your bloodline)
History: the aggregate of past events, the continuum of events occurring in succession, account, chronicle,
story, a body of knowledge, all that is remembered of the past preserved in writing.
Derive: deduce, infer, deduct, obtain, come from, educe, develop or evolve, descend.
[Source: WordNet Online, Princeton web site, 29th September 2002, 13:30 BST, edited by Carole Goble on 7th October 2002]

Is Carole Goble attending this workshop? Web page http://www-fp.mcs.anl.gov/~foster/provenance/
has Carol Goble and after 24/09/02 Carole Goble. Are these the same? Who made the change and
why?

When is the position statement due? Email message from Foster on 10/09/02 says 07/10/02. Email
message from Foster on 24/09/02 says 03/10/02; Web page says 03/10/02;
How long should the position statement be? Email 10/09/02 says 1-5 pages, web page says 1-2 pages.
How many copies of the uncorrected or different data are there? Did changes get propagated? What was the
situation on 05/09/02? What do you believe? I’m working off 10/09/02 data – that’s what was copied to
my calendar!
Provenance in Bioinformatics
The biological science community is highly fragmented. Different disciplines act autonomously, producing
data repositories and analytical tools that operate over them in isolation. Rather than a few international
facilities producing vast amounts of data that needs to be accessible, biology copes with a very large
number of sites (potentially 1000's of individual laboratories) around the world each using cheap,
commodity technology to continuously generate substantial quantities of different kinds of data. Currently
500 public data repositories are in active service. Thus most biological knowledge resides in a large number
of heterogeneous and distributed resources; the data presents serious analytical and linkage challenges, and
to answer even the simplest question requires the intelligent interlinking of multiple repositories, which
must be up-to-date and reliable.
In many resources, each record is analogous to an individual publication with not only raw data, but also
additional annotation supplied by a small number of human experts (curators) or automated systems and
published as biological literature (increasingly in electronic form). The annotations are the accumulated
knowledge attributed to a sequence, structure, protein, etc. Annotations are typically semi-structured texts
that make some use of keywords and controlled vocabularies, under the premise that a scientist will read
and interpret the texts. A distinction is made between primary and secondary databases: primary databases
hold the raw data, such as sequence (GenBank, EMBL); secondary databases are those holding added-value
results accumulating and editing data from others and holding new data arising from analysis. For example,
SWISS-PROT and PRINTS are secondary databases. Databases are published from the production database
at regular intervals, ranging from daily to quarterly. Users can access the databases online for the most
up-to-date information or download versions locally. Most sites download for provenance reasons.
SWISS-PROT draws on TrEMBL (which draws on EMBL) and holds distilled information about proteins;
PRINTS draws on SWISS-PROT to collect custom-built GPCR ‘fingerprints’. PRINTS annotations are
aggregates culled from SWISS-PROT annotations, Medline abstracts, OMIM, GRAP and PDB entries, etc.
GPCR draws on PRINTS, and so on. Secondary databases not only hold new, derived data but also
accumulate copies (possibly corrected, often editorialised, sometimes transformed) of information from
the primary and secondary databases they draw from. This viral propagation of information makes tracking
sources of data crucial for forming a view on its quality, validity, and freshness. If our curator updates the
fingerprint some time later and gets different results, she will need to be told that a change has occurred and
what that change is. If she does not trust the result, she will want to check on the raw data to ascertain the
root of the problem. Unfortunately, audit trails are rarely kept and hardly ever published with the
information. Two examples of annotation are given at the end of this paper.
I use the term information and not data – data often carries the implicit connotation that it is numeric
whereas it is often descriptive (the function of a gene product, for example, is described through the
controlled vocabulary Gene Ontology; see footnote 1).
The annotation pipeline uses both automated and human means of analysis and integration organised into a
workflow. Although traditional database integration techniques to create virtual federated databases or
warehouses have a major role to play, workflows are seen by many bioinformaticians as the primary
interoperation mechanism, not just for annotation but also for the general representation of in silico
experiments. The e-Scientist is at the centre of the in silico experimentation process: interacting with the
executing processes (for example, altering the parameters of a BLAST alignment analysis); navigating
between databases; and interacting with colleagues. Workflows are specified in abstract (a search of a
protein database, followed by an alignment analysis, followed by a user-defined filter, followed by another
database search); instantiated with concrete services (SWISS-PROT > BLASTp > myFilter > PDB);
invoked (SWISS-PROT version 40 at EBI > BLASTp NCBI with default parameters > interactive filtering
> PDB SDSC); dynamically interacted with and altered (BLASTp doesn’t generate sensible results, so alter
some parameters and rerun that activity before continuing, storing intermediate results in myRepository).
Throughout the procedures personal local collections are both used and dynamically created by siphoning
off intermediate and final results. The focus is not just on the ends (the result) but also on the means (the
experimental process) of acquiring those results. This information is essential in order to promote reuse and
to justify the findings later, particularly in understanding the quality and provenance of derived
information. These workflows represent experimental practice and know-how, and are valuable knowledge
commodities in their own right: commodities to be reused, shared, adapted and annotated with
interpretations of quality, provenance, security etc.
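
The progression from abstract specification to invoked instance can be pictured as successive bindings of the same steps. What follows is a minimal Python sketch under stated assumptions, not a description of any existing workflow system: the `Step` class and the `stage` function are hypothetical names, and the service and provider names are simply those of the example in the text.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Step:
    role: str                        # abstract description of the activity
    service: Optional[str] = None    # concrete service bound at instantiation
    provider: Optional[str] = None   # site that ran it, recorded at invocation
    settings: str = ""               # version or parameters used at invocation


def stage(steps: Tuple[Step, ...]) -> str:
    """Classify a workflow by how far its steps have been bound."""
    if all(s.provider for s in steps):
        return "invoked"
    if all(s.service for s in steps):
        return "instantiated"
    return "abstract"


abstract = (
    Step("protein database search"),
    Step("alignment analysis"),
    Step("user-defined filter"),
    Step("database search"),
)

services = ("SWISS-PROT", "BLASTp", "myFilter", "PDB")
instantiated = tuple(replace(s, service=name) for s, name in zip(abstract, services))

# Providers and settings follow the example in the text; "interactive" for the
# filter step is an assumption, since the text names no provider for it.
runs = (("EBI", "version 40"), ("NCBI", "default parameters"),
        ("interactive", ""), ("SDSC", ""))
invoked = tuple(replace(s, provider=p, settings=cfg)
                for s, (p, cfg) in zip(instantiated, runs))

print(stage(abstract), stage(instantiated), stage(invoked))
```

Keeping each stage as a separate immutable value means the abstract specification survives alongside every instance enacted from it, which is the link between result and process that the text argues for.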
The increased prevalence and complexity of experimental techniques will lead to unnecessary replication of
experiments, setting up of equipment in inappropriate ways, or drawing conclusions that are not fully
justified by the technique that has been followed. Thus data results arising from processes should be linked
to those processes.
Personal notes are made on findings; the parameters of tools such as BLAST are customised and tuned.
Moreover the information that in silico experiments draw upon is not stable, but rather subject to constant
refreshing—what if the human geneticists refine their view of the target site, or find that it was actually
irrelevant? We need to be alerted to such information in order to be able to refine, or abandon, our analysis.
Thus, it would seem self-evident that provenance is a crucial prerequisite to informed understanding of
biological information in primary, secondary and personal repositories. Our biologists and bioinformaticians
agree. However, current practice for the systematic capture, propagation and processing of provenance
metadata is at best sporadic. Commonly, once completed the process (a.k.a. workflow) is lost, or privately
held in a lab book or a README file and not propagated. For example, the SWISS-PROT public database
keeps reasonable provenance records for internal use but doesn’t publish them.
What is provenance?
Provenance is an overloaded term (in biology at least). The quotation from WordNet at the beginning of
this position statement illustrates some of the many takes on provenance. Is it audit? Or quality? Or
evolution? Is it for justification or reproducibility? Is it for sharing know-how? There are many categories;
it plays many roles; it applies to many different kinds of information; it is intended for different uses. We
can agree that it is metadata. Of course, one scientist’s metadata is another’s data. Thus a serialised
workflow instance with all its parameter settings and values is a provenance record for the data arising from
it, but also itself needs provenance information (which workflow specification was it instantiated from,
who enacted it, was it interactively steered, and if so how). All objects in our experimental (in silico, in
vivo, in vitro) world attract provenance: a bench experiment, an instrument, a database, a document, a
database entry, an analytical tool, a workflow instance as executed, a service provider, a grid service (e.g.
an ontology server, a distributed query processor, a registry), a notification event (when was it sent, who
sent it?). The term “data provenance” gives the misleading impression that provenance only applies to
database entries of scientific data.
Footnote 1: http://www.geneontology.org
Moreover, an object could be defined at a range of granularities (part of a data record, a whole record, a
whole database; an input of a workflow activity, a workflow activity, the whole workflow). In a survey of
biologists and bioinformaticians we found that the origins of bioinformatics tools and personal data were
important. The recording of in silico experiments, and whether data retrieved could be attributed to a
wet-lab or in silico experiment from a particular establishment, is essential. The provenance of data held in
public bio-repositories was thought to be of lesser importance since most users were aware of the
assumptions and limitations associated with it.
These musings have an impact on key properties, representation, generation, use and lifetime
management. For example, provenance intended for an entirely automated analysis process has different
requirements than that intended for a scientist to remind themselves about why they dismissed some results
based on their personal opinion of the experimenter.
What form does provenance take?
Provenance should be metadata that is intended for machine consumption. Free text makes it suitable for
humans not computation. There are two major forms:
(a) an annotation attached to an object or collection of objects, such as a database entry, in a structured,
semi-structured or free text form;
(b) a derivation path such as a workflow, a database query, or a program and its parameters. The derivation
path could be a copy (copying a part of a SWISS-PROT data entry into a PRINTS record) or an edit,
for example evolution of a workflow (substituting a BLAST for PSI-BLAST or altering the parameters
of a BLAST activity while a workflow is being enacted).
What is the relationship between these two?
What does provenance represent?
Obviously the seven W’s (Who, What, Where, Why, When, Which, (W)how). Provenance does
fundamentally depend upon being able to assert identity, and to attach an identity to an object (which might
be a piece of XML in a document). So it also depends on security (is this really the service it says it is?)
and trust. The Life Science Identifier (LSID) is a step towards a uniform common reference format for
entries in public databases [1].
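
As an illustration of the seven W's, here is a minimal Python sketch of a record with one field per question and a helper that reports which questions were left unanswered. It is an assumption-laden sketch, not a proposed schema: the `ProvenanceRecord` class, the `missing` helper, the reading of "which" as the tool used, and every field value are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ProvenanceRecord:
    who: str = ""     # agent responsible for the object
    what: str = ""    # identifier of the object being described
    where: str = ""   # site or service where it was produced
    why: str = ""     # goal or hypothesis, supplied by the scientist
    when: str = ""    # time of production
    which: str = ""   # tool or resource used (one possible reading of "which")
    how: str = ""     # parameters or method


def missing(record: ProvenanceRecord) -> List[str]:
    """Return the names of the W's that were left unrecorded."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Illustrative values only; the identifier is a made-up placeholder.
record = ProvenanceRecord(
    who="a curator",
    what="example:alignment-result-1",
    where="NCBI",
    when="2002-10-07",
    which="BLASTp",
    how="default parameters",
)
print(missing(record))
```

Here only `why` comes back as missing, which matches the observation that incidental facts can be captured from an audit trail while goals and hypotheses have to be supplied by the scientist.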
Where does provenance come from?
In a survey we found that “the bioinformatics processes of in silico experiments are generally not recorded
by users since it is extremely time-consuming to record the large amounts of metadata and intermediary
results with enough detail to make the process repeatable by another user.” Setting aside whether
provenance is about repeatability, certain forms of provenance might be incidental, gathered by recording a
detailed audit trail of process. Others, such as “goals” or “hypothesis”, are (somehow) supplied by the
scientist. How much will users be prepared to record? How do we persuade service providers to supply and
maintain provenance information?
What is provenance used for?
There seem to be a number of reasons; here are some:
Reliability & quality: do we trust, believe or respect the source or the process that led to this object? “The
problem is: the databases are God-awful … If the data is still fundamentally flawed, then better algorithms
add little”. If the origin of flawed information can be identified, what are the legal implications?
Justification & audit: an accurate historical record of the source and method of the in silico experiment,
equivalent to that found in (wet) lab books and reproducible for the “methods” sections of papers.
Reusability, reproducibility & repeatability: a derivation path is not just a record of what has been done,
but also a means by which others can repeat and validate the experiment. However, repeating an in silico
experiment is not the same as reproducing it. Reproducibility is only possible if exactly the same
conditions (same database, same version, same content, same tool, same algorithm, same indexes) are
reproducible, and that is only possible if snapshots are kept. Many biologists, fearing that versions of public
databases will disappear, make a practice of copying versions locally. It is not possible, for example, to get
hold of SWISS-PROT version 35 unless you made your own copy. Reusing and repeating is using the
provenance as know-how, learning from history and disseminating knowledge and best practice
(appropriate settings for parameters for example).
Change & evolution: Audit trails support change management. Annotations may become invalid because
of changes in the underlying data or contradictory annotations elsewhere. Provenance derivation paths
become event notification routes.
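
The idea that derivation paths double as notification routes can be sketched as a graph traversal: when one resource changes, everything that draws on it, directly or indirectly, should be told. The Python below is a simplified illustration only. The `DRAWS_ON` table reduces the database chain described earlier to one source per database, and the `downstream` function is a hypothetical name.

```python
from collections import deque
from typing import Dict, List

# database -> sources it draws on (simplified from the chain in the text)
DRAWS_ON: Dict[str, List[str]] = {
    "TrEMBL": ["EMBL"],
    "SWISS-PROT": ["TrEMBL"],
    "PRINTS": ["SWISS-PROT"],
    "GPCR": ["PRINTS"],
}


def downstream(changed: str, draws_on: Dict[str, List[str]]) -> List[str]:
    """List every resource that should be notified when `changed` is updated."""
    # Invert the table so each source maps to the databases that consume it.
    consumers: Dict[str, List[str]] = {}
    for target, sources in draws_on.items():
        for source in sources:
            consumers.setdefault(source, []).append(target)

    # Breadth-first walk along the inverted derivation edges.
    seen = {changed}
    queue = deque([changed])
    to_notify: List[str] = []
    while queue:
        node = queue.popleft()
        for consumer in consumers.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                to_notify.append(consumer)
                queue.append(consumer)
    return to_notify


print(downstream("SWISS-PROT", DRAWS_ON))
```

With this table, a change to SWISS-PROT is routed to PRINTS and then to GPCR, while a change to EMBL reaches every database in the chain.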
Ownership, security, credit & copyright: Objects originate from somewhere. As objects migrate so must
their provenance. Service providers would like credit for their databases being used (their funding may
depend upon it, as might accounting procedures).
Other considerations include:
Immutability: some metadata is mutable, and some is not. Is provenance intrinsically immutable?
Migration & storage: as information travels through various databases, so should its provenance. But how
should provenance be stored? Separately from its data, or together with it? Some will be encapsulated (e.g.
ownership), others stored separately (e.g. a workflow linking several data entries, none of which is owned by the
workflow owner).
Aggregation: as information aggregates in annotations, so does its provenance. How does this work when
copying pieces of text, phrases or controlled vocabulary terms in some database annotation? What does it
mean to aggregate provenance information?
Versioning: as objects version, how does their provenance version?
Provenance and the Semantic Web
Provenance is metadata. Much is descriptive annotation. The Semantic Web is fundamentally about the
representation of computationally accessible metadata through annotations and controlled vocabularies. We
might use RDF as a means of representing provenance, and use graph matching to aggregate provenance
data. We could use ontology languages such as DAML+OIL and OWL (footnote 2), or even Topic Maps, to represent
the controlled vocabularies used in provenance records. Semantic web annotation mechanisms have already
been used in some Grid projects, for example in Geodise (footnote 3) for annotating logs of CFD optimisation runs [2].
myGrid (footnote 4) plans to use the COHSE annotation system [3] for associating workflows, data and other
XML-based myGrid objects. myGrid already uses DAML+OIL to annotate myGrid services with domain and
service metadata [4]. However, one must be clear what one means by annotation [5]. The questions are:
(a) how can semantic web technologies be used for provenance? For example, are RDF query languages
up to the challenges? Can we use RDF graphs to aggregate provenance information? What role can
reasoning play in provenance annotations?
(b) what are the provenance implications that using semantic web technologies brings? For example, when
the EMBL-EBI (footnote 5) publishes a new service they might create new concepts in an ontology to describe it,
which should have provenance information associated.
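
To make question (a) a little more concrete, the following Python sketch treats a provenance graph as a set of subject-predicate-object triples, aggregates two graphs by set union and answers a simple pattern query. It deliberately uses plain tuples rather than an RDF toolkit, so it shows only the shape of the idea; the `drawsOn` predicate and the `merge` and `match` helpers are hypothetical names.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)


def merge(*graphs: Set[Triple]) -> Set[Triple]:
    """Aggregate graphs by union; identical statements collapse into one."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def match(graph: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Two partial views of the same derivation chain, held by different parties.
view_a = {("PRINTS", "drawsOn", "SWISS-PROT"), ("SWISS-PROT", "drawsOn", "TrEMBL")}
view_b = {("SWISS-PROT", "drawsOn", "TrEMBL"), ("TrEMBL", "drawsOn", "EMBL")}

combined = merge(view_a, view_b)
print(len(combined))
print(match(combined, p="drawsOn", o="TrEMBL"))
```

Union works here only because both views use the same names for the same things, which is why aggregation depends on agreed identifiers and controlled vocabularies.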
Final word: KISS (footnote 6). A small contribution is valuable. There will be a continuum of provenance forms, not
one solution. Doing something as simple as automatically representing a workflow instance in XML and
making it available through a portal would be helpful.
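
As a rough indication of how little is needed for that first step, the Python sketch below serialises a workflow instance to XML using only the standard library. The element and attribute names (`workflowInstance`, `step`, `param`) are invented for illustration and do not follow any existing schema; the step values are those of the invoked example given earlier.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def workflow_to_xml(name: str, steps: List[Dict]) -> str:
    """Serialise an executed workflow as a small XML document."""
    root = ET.Element("workflowInstance", {"name": name})
    for position, step in enumerate(steps, start=1):
        element = ET.SubElement(
            root, "step", {"order": str(position), "service": step["service"]}
        )
        # Record each parameter or setting that was in force for this step.
        for key, value in step.get("params", {}).items():
            ET.SubElement(element, "param", {"name": key}).text = value
    return ET.tostring(root, encoding="unicode")


steps = [
    {"service": "SWISS-PROT", "params": {"version": "40", "provider": "EBI"}},
    {"service": "BLASTp", "params": {"provider": "NCBI", "parameters": "default"}},
    {"service": "myFilter", "params": {"mode": "interactive"}},
    {"service": "PDB", "params": {"provider": "SDSC"}},
]

document = workflow_to_xml("example-run", steps)
print(document)
```

The resulting string can be parsed back with `ET.fromstring`, so the same record serves both as something a portal can display and as input to later processing.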
References
[1] LSID. http://www.i3c.org
[2] Chen L, Cox SJ, Goble C, Keane AJ, Roberts A, Shadbolt NR, Smart P, Tao F. Engineering Knowledge for Grid Applications –
Geodise Leveraging Knowledge for Grid Computing. Submitted to EuroWeb 2002.
[3] Carr L, Bechhofer S, Goble CA, Hall W. Conceptual Linking: Ontology-based Open Hypermedia. In: WWW10, 10th World Wide
Web Conference, Hong Kong, May 2001.
[4] Wroe C, Stevens R, Goble CA, Roberts A, Greenwood M. A suite of DAML+OIL Ontologies to Describe Bioinformatics Web
Services and Data. To appear in: International Journal of Cooperative Information Systems.
[5] Bechhofer S, Carr L, Goble CA, Kampa S, Miles-Board T. The Semantics of Semantic Annotation. To appear in: ODBASE: First
International Conference on Ontologies, DataBases, and Applications of Semantics for Large Scale Information Systems, Irvine,
California, October 2002.

Footnote 2: OWL Web Ontology Language 1.0 http://www.w3.org/TR/owl-ref/
Footnote 3: http://www.geodise.org
Footnote 4: http://www.mygrid.org.uk
Footnote 5: European Molecular Biology Laboratory-European Bioinformatics Institute
Footnote 6: Keep It Simple, Stupid.
ID   PRIO_HUMAN STANDARD; PRT; 253 AA.
AC   P04156;
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093 [NCBI, ExPASy, Israel, Japan]; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H., Prusiner S.B., Dearmond S.J.;
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [4]
RP   STRUCTURE BY NMR OF 118-221.
RX   MEDLINE=20359708; PubMed=10900000; [NCBI, ExPASy, EBI, Israel, Japan]
RA   Calzolai L., Lysek D.A., Guntert P., von Schroetter C., Riek R., Zahn R., Wuethrich K.;
RT   "NMR structures of three single-residue variants of the human prion protein.";
RL   Proc. Natl. Acad. Sci. U.S.A. 97:8340-8345(2000).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE HOST GENOME AND IS
CC       EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND ANIMALS INFECTED WITH
CC       NEURODEGENERATIVE DISEASES KNOWN AS TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC       DISEASES, LIKE: CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME (GSS),
CC       FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE IN SHEEP AND GOAT; BOVINE
CC       SPONGIFORM ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME);
CC       CHRONIC WASTING DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM ENCEPHALOPATHY
CC       (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY(EUE) IN NYALA AND GREATER KUDU. THE
CC       PRION DISEASES ILLUSTRATE THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED FOODSTUFFS.
CC   -!- SIMILARITY: BELONGS TO THE PRION FAMILY.
DR   HSSP; P04925; 1AG2. [HSSP ENTRY / SWISS-3DIMAGE / PDB]
DR   MIM; 176640; -. [NCBI / EBI]
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
DR   PROSITE; PS00291; PRION_1; 1.
DR   PROSITE; PS00706; PRION_2; 1.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism; Disease mutation.
Excerpt from a typical SWISS-PROT entry.
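
Each line of such an entry is tagged with a two-letter code (ID, AC, DE, RN, CC, DR and so on), so even this semi-structured text can yield some machine-readable provenance. The Python sketch below groups lines by code and pulls out the names of the cross-referenced databases from the DR lines. It is an illustration, not a full parser: the exact column spacing in `SAMPLE` is an assumption, the function names are hypothetical, and the sample holds just four lines taken from the excerpt.

```python
from collections import defaultdict
from typing import Dict, List

# A few lines from the excerpt; the spacing after the code is assumed.
SAMPLE = """\
ID   PRIO_HUMAN STANDARD; PRT; 253 AA.
AC   P04156;
DR   PRINTS; PR00341; PRION.
DR   PROSITE; PS00291; PRION_1; 1.
"""


def group_by_code(text: str) -> Dict[str, List[str]]:
    """Collect the content of each line under its two-letter line code."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue  # skip blank or truncated lines
        code, content = line[:2], line[2:].strip()
        grouped[code].append(content)
    return dict(grouped)


def cross_references(grouped: Dict[str, List[str]]) -> List[str]:
    """Name the databases that the DR lines point to."""
    return [entry.split(";")[0].strip() for entry in grouped.get("DR", [])]


entry = group_by_code(SAMPLE)
print(entry["AC"])
print(cross_references(entry))
```

For the sample above this reports PRINTS and PROSITE as linked databases. Nothing in the lines says when or by whom those links were made, which is the gap the position statement is concerned with.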
Prion protein signature
PROSITE; PS00291 PRION_1; PS00706 PRION_2
BLOCKS; BL00291
PFAM; PF00377 prion
INTERPRO; IPR000817
1. STAHL, N. AND PRUSINER, S.B.
Prions and prion proteins.
FASEB J. 5 2799-2807 (1991).
2. BRUNORI, M., CHIARA SILVESTRINI, M. AND POCCHIARI, M.
The scrapie agent and the prion hypothesis.
TRENDS BIOCHEM.SCI. 13 309-313 (1988).
3. PRUSINER, S.B.
Scrapie prions.
ANNU.REV.MICROBIOL. 43 345-374 (1989).
Prion protein (PrP) is a small glycoprotein found in high quantity in the brain of animals infected with
certain degenerative neurological diseases, such as sheep scrapie and bovine spongiform encephalopathy
(BSE), and the human dementias Creutzfeldt-Jacob disease (CJD) and Gerstmann-Straussler syndrome (GSS). PrP
is encoded in the host genome and is expressed both in normal and infected cells. During infection,
however, the PrP molecules become altered and polymerise, yielding fibrils of modified PrP protein.
PrP molecules have been found on the outer surface of plasma membranes of nerve cells, to which they are
anchored through a covalent-linked glycolipid, suggesting a role as a membrane receptor. PrP is also
expressed in other tissues, indicating that it may have different functions depending on its location.
The primary sequences of PrP's from different sources are highly similar: all bear an N-terminal domain
containing multiple tandem repeats of a Pro/Gly rich octapeptide; sites of Asn-linked glycosylation; an
essential disulphide bond; and 3 hydrophobic segments. These sequences show some similarity to a chicken
glycoprotein, thought to be an acetylcholine receptor-inducing activity (ARIA) molecule. It has been
suggested that changes in the octapeptide repeat region may indicate a predisposition to disease, but it is
not known for certain whether the repeat can meaningfully be used as a fingerprint to indicate
susceptibility.
Excerpt from a distilled PRINTS annotation