Knowledge and Provenance: A knowledge model perspective
Carole Goble, University of Manchester, UK

Talk roadmap
• What is this provenance about and for?
• Knowledge for Provenance
• Knowledge technologies
• How do we represent knowledge for and about provenance?
• The Provenance of Knowledge
• Where do knowledge assertions come from?

my Context
Knowledge-driven Middleware for data intensive in silico experiments in biology
http://www.mygrid.org.uk

A real bio provenance log

Any and every experimental item attracts provenance (so long as you can ID it).
• Experimental design components
  – workflow specifications; query specifications; notes describing objectives; applications; databases; relevant papers; the web pages of important workers, services
• Experimental instances that are records of enacted experiments
  – data results; a history of services invoked by a workflow engine; instances of services invoked; parameters set for an application; notes commenting on the results
• Experimental glue that groups and links design and instance components
  – a query and its results; a workflow linked with its outcome; links between a workflow and its previous and subsequent versions; a group of all these things linked to a document discussing the conclusions of the biologist

Provenance is metadata …
• intended for sharing, retrieving, integrating, aggregating and processing.
• generated with the hope that it is comprehensive enough to be future-proofed.
• recorded for those who we do not yet know will use the object and who will likely use it in a different way.
• machine-computable; free text is of limited help.
• Provenance is the knowledge that makes
  – an item interpretable and reusable within a context
  – an item reproducible or at least repeatable.
• It's part of the information model of any system

Question: What ATPase superfamily proteins are found in mouse?
1. Q9CQV8 O70468 143B_MOUSE from Swiss-Prot version 30, 05/11/02, 16:45 GMT, EBI server.
[Slide callout: Database query (know-what)]
2.
O70455, P54775 143B_MOUSE from Swiss-Prot version 29, 05/11/02 16:45 GMT, local copy.
3. P43686 and P54775 derived by a distributed query over DB1 and DB2.
4. InterPro (no particular version) is a pattern database for protein superfamilies and domains for GPCRs but you need an account.
5. The publicly available workflow mouse ATPase (http://www.somelab.edu/bio/carole/wf/3345.wsfl) will generate the result from data in your personal repository and you have permission to run the services it needs. Click to run it.
6. The Attwood lab expertise is in nucleotide binding proteins (ATPase superfamily proteins are nucleotide binding proteins).
7. Jones published a new paper on this in Nature Genetics two weeks ago, and you have an account to access it on-line.
8. Smith in your lab asked this question yesterday and the answer he got is annotated by a commentary in his e-Log Book.
9. P43686 (human) calculated by applying the algorithm ABC located at NCBI using data in database AAA.
[Slide callouts: Virtual data products (know-how); Workflow (know-how); Personalised profile (know-whom-to); Collaboration & community (know-where, know-when); Digital archive (know-which); Provenance (know-wherefrom); Replicas (know-which); Ontology and Inference (know-whether); Authorisation, Authentication and Accounting (know-who); Explanation (know-why); Annotation & notes (know-that)]

Provenance is contextual metadata
• We look at the same things in different ways and different things in the same way
• Our data alone does not describe our work
• We have to capture this context.
Hero http://hero.geog.psu.edu/ Hero_knowledge_management.pdf Downloaded 301103

Provenance forms
[Figure node label: mass = 200, decay = bb]
• Derivations
  – A path like a workflow, script or query.
  – Linking items, usually in a directed graph.
  – An explanation of when, by whom and how something was produced.
  – Execution and process-centric
• Annotations
  – Attached to items or collections of items, in a structured, semi-structured or free text form.
  – Annotations on one item or linking items.
  – An explanation of why, when, where, who, what, how.
  – Data-centric
[Figure: graph nodes labelled with parameter settings: mass = 200; decay = ZZ or WW; stability = 1 or 3; event = 8; plot = 1; LowPt = 20; HighPt = 10000]

Workflows as in silico experiments
• Freefluo workflow enactment engine
  – WSFL
  – Scufl
• Semantic workflow discovery
  – Finding workflows that others have done, and that I have done myself
• Semantic service discovery
  – Finding classes of services
  – Guiding service composition
  – (We don’t do automated composition)
• Dynamic workflow enactment service discovery and invocation
  – Choose service instances when running workflow
• User involvement

Semantic discovery – services & workflows
• A registry browser
• Services and workflows in registry have RDF and OWL descriptions
• Selection by the types of inputs they use, outputs they produce, the bioinformatics tasks they perform…
• Querying using RDQL over RDF UDDI registry for operational metadata
• Matching using FaCT OWL classification for concept-based metadata
• A workflow wizard

Provenance forms in myGrid
• Derivations
  – FreeFluo Workflow Enactment Engine provides a detailed provenance record stored in the myGrid Information Repository (mIR) describing what was done, with what services and when
  – XML document, soon to be an RDF model
• Annotations
  – Every mIR object has Dublin Core provenance properties described in an attribute value model

Provenance of data
• Operational execution trail
[Diagram: a process record with input Gene:AC005412.6; output SNP:000010197; run_for urn: Clare Jennings; start time; end time; by_service lsid:HGVBase_retrieve]

Provenance of knowledge
• Declarative semantic execution trail
contains_single_nucleotide_polymorphism
[Diagram: the statement above, linking Gene:AC005412.6 to SNP:000010197, carries an "as stated by" link to the process record (input Gene:AC005412.6; output SNP:000010197; run_for urn: Claire Jennings; start time; end time; by_service lsid:HGVBase_retrieve)]

Provenance of knowledge
• Trust and attribution
[Diagram: the same contains_single_nucleotide_polymorphism statement and process record, with the additional labels "disputed by" and urn: Carole Goble]

Provenance of knowledge
• Aggregation and integration
[Diagram: the same statement with two "as stated by" labels: the HGVBase_retrieve process record (run_for urn: Claire Jennings) and a second process record (run_for urn: Bill Jones; start time; end time; by_service lsid:BIGDbretrieve)]

20,000 feet and ground level
Top Down provenance
  – What is going on?
  – Unification and summaries of collective provenance knowledge.
  – Collaborative, Awareness, Experience base, Scientific Corporate memory.
  – “What projects have something to do with human SNPs?”
  – “What experiments use the PSI-BLAST service regardless of version?”
Bottom Up provenance
  – Where did this data object http://doh.dah.ac.uk/… come from?
  – Which version of SwissProt was run in workflow http://blah.ac.uk/…?

Build up layers of provenance knowledge
[Diagram layers: User, Trust, Domain, Experiment, Execution, Data, Services, Workflow]

Provenance for People and Machines
[Diagram axis labels: Subjective / Objective; Contextual / Context-free; People / Machines; Manual/semi-automated / Automated. Items placed on the axes: Experiment, User, Trust, Services, Domain, Data, Execution, Workflow]

1. Explicitly capture Context
• Reuse methods and strategies (e.g., protocols)
• Make explicit the situational bias that is normally implicit
• Enable future generations of scientists to follow our work
• To capture meaning, we must devise a way of representing concepts and their relationships
Hero http://hero.geog.psu.edu/ Hero_knowledge_management.pdf Downloaded 301103

1.
Explicitly capture Context
Using models and terms
  – that can be shared and interpreted
  – that are extensible and preclude premature restrictions
  – that are navigable and computationally processable
Hero http://hero.geog.psu.edu/ Hero_knowledge_management.pdf Downloaded 301103

2. Bridge islands of exported provenance
[Diagram: separate islands labelled Service 1, Service 2, Workflow 1, Experimental Investigation 1, Data 1]

Not all exports are the same
[Diagram: Service 1, Service 2, Workflow 1, Experimental Investigation 1, Data 1]

So we need to…
• Uniquely identify items through URIs and Life Science Identifiers (GSH/GSR/Handle.net…)
• Explicitly expose provenance by assertions in a common data model…
• Publish and share consensually agreed ontologies so we can share the provenance metadata and add in background knowledge…
• Then we can query, filter, integrate and aggregate the provenance metadata
• … and reason over it to infer more provenance metadata using rules
• … and attribute trust to the provenance …
• Flexibly, so that we do not cast models and terms in stone, and so can cope with different degrees of description.

What’s an Ontology?
• A common vocabulary of terms
• Some specification of the meaning of the terms
• Concepts, relationships, axioms
• A shared consensual understanding for people and machines

W3C Metadata language/model: Resource Description Framework
• Common model for metadata
• Assertions as triples (subject, predicate, object) forming graphs.
• Associate URIs (LSIDs) with other URIs (LSIDs).
• Associate URIs with OWL concepts (which are URIs).
• RDQL, repositories, integration tools, presentation tools
• Query over, Link together, Aggregate, Integrate assertions.
• Avoids pre-commitment
  – Self-describing
  – Incremental
  – Extensible
  – Advantage and drawback.
[Diagram labels: Data, Workflow, Experiment, User, Service]
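The idea of assertions as (subject, predicate, object) triples that can be queried and aggregated can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the identifiers echo the gene/SNP example used elsewhere in the talk, but the `process:1` identifier, the `urn:ClaireJennings` spelling, and the `match` helper are hypothetical and are not myGrid's actual data model or an RDF library API.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical operational trail exported by one component.
run_log: Set[Triple] = {
    ("process:1", "input", "Gene:AC005412.6"),
    ("process:1", "output", "SNP:000010197"),
    ("process:1", "by_service", "lsid:HGVBase_retrieve"),
}

# Hypothetical annotations exported by another component.
annotations: Set[Triple] = {
    ("Gene:AC005412.6", "contains_single_nucleotide_polymorphism", "SNP:000010197"),
    ("process:1", "run_for", "urn:ClaireJennings"),
}


def match(graph: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for triple in graph:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


# Aggregation is a set union because both exports share identifiers.
merged = run_log | annotations

# Query over the merged graph: what did process:1 produce?
outputs = sorted(obj for _, _, obj in match(merged, s="process:1", p="output"))

# Everything asserted about process:1, whichever export it came from.
about_process = sorted(match(merged, s="process:1"))
```

Because nothing in the triple structure fixes a schema in advance, a new export can add predicates without changing existing assertions, which is one reading of "avoids pre-commitment".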
Graphic based on Tim Berners-Lee http://www.w3.org/2003/Talks/0521-www-keynote-tbl/slide22-0.html

Bridging islands
[Diagram: Service 1, Service 2, Workflow 1, Experimental Investigation 1, Data 1]

Bridging islands: Concepts and LSID
[Diagram: Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1, with RDF labels]

W3C Ontology language/model: OWL
• Continuum of expressivity
  – Concepts, roles, individuals, axioms
  – From simple frames to description logics
  – Sound and complete formal semantics
  – Compositional and property based
• Reasoning to infer classification
• Eas(ier) to extend and evolve and merge ontologies
• A web language
• Tools, tools, tools!
[Diagram labels: DAML, OIL, RDF, DAML+OIL, OWL]

Bridging islands: Concepts and LSIDs
[Diagram: Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1, with RDF labels]

Bridging islands: Concepts and LSIDs
[Diagram: Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1, with RDF and LSID labels]

Layers of Knowledge Languages
[Diagram layers: Attribution, Explanation, Rules & Inference, Ontologies, Metadata, Standard Syntax, Identity]
Wedding cake courtesy of Tim Berners-Lee

myGrid: everything has a concept & LSID
[Diagram labels: Workflows, Literature, Provenance record of workflow runs, Notes, Ontologies, People, Data holdings, Services]

Linking objects to objects via URIs and LSIDs
• People who wrote the workflow
• Literature
• People to notify of the workflow status
• Provenance record of workflow runs
• Provenance of the workflow template. Related workflows.
• Notes
• Data holdings
• Ontologies describing workflows
• Services used

Generated link anchors
Lymphocyte and neutrophil are subsumed by the concept white blood cell

Annotating a workflow log with concepts
5. Create the annotation
4. Provide a description
3. Select the concept
1. Choose the ontology
2.
Select an area to annotate with

Generating provenance
[Diagram labels: Data and metadata from the run; RDF+OWL; Scufl; Workflow execution; Template; startTime, endTime, service instances invoked …; RDF+OWL; Identify workflow; mIR; Input data & parameters; OWL descriptions; RDF; Bind services; FreeFluo WFEE; Execution Provenance log; Workflow knowledge template; Knowledge Provenance log; registry; Knowledge arising from workflow; RDF+OWL]

P Afflard et al., The Grid(s)? @ Novartis, presented at PRISM PharmaGrid retreat, July 2003

William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal, Supporting Collaborative Science through a Knowledge and Data Management Portal, in 1st Semantic Web Conference (ISWC2003) Workshop on Retrieval of Scientific Data, Florida, USA, October 2003

Two views of a gravity model concept from the Hero CODEX web tool
• An ontological description shows how one geoscientist constructs a model
• A social network reveals which users favour different instances of the model, with edge length suggesting the degree of support.

Collaboratory for Multi-Scale Chemical Science
• CMCS “Pedigree Graph” portlet showing provenance relationships between resources (colour coded by original relationship type).
• CMCS Pedigree Browser showing the metadata and relationships of the selected data set.
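A pedigree view of the kind described above amounts to following relationship links back from a selected resource. The sketch below shows that traversal in plain Python. The resource identifiers and the single `derived_from` relation are hypothetical and chosen for illustration; they are not the CMCS data model, which records several relationship types.

```python
from collections import deque
from typing import Dict, List

# Hypothetical "derived from" links: each resource maps to the
# resources it was directly produced from.
derived_from: Dict[str, List[str]] = {
    "plot:1": ["result:7"],
    "result:7": ["workflow_run:3", "dataset:A"],
    "workflow_run:3": ["workflow_template:1", "dataset:A"],
}


def pedigree(item: str, edges: Dict[str, List[str]]) -> List[str]:
    """Return every ancestor of item, breadth-first, without duplicates."""
    seen = {item}
    order: List[str] = []
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:  # shared ancestors are reported once
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Bottom-up question: where did plot:1 come from?
ancestors = pedigree("plot:1", derived_from)
```

Tracking visited resources keeps the walk finite even if the recorded links happen to contain a cycle.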
Provenance dimensions connected by concepts and identifiers
[Diagram labels: project, Services, Author, Workflow instances, workflow template]
Based on http://www.w3.org/2003/Talks/0521-www-keynote-tbl/slide22-0.html

Reflections: annotations
• Annotation metadata model for myGrid holdings is a graph
  – If it waddles like RDF and quacks like RDF, it’s RDF
  – Experiments in RDF scalability
  – Co-existence of RDF and other data models (relational)
• Acquisition of annotations and adverts
  – Automated by mining WSDL docs, mining ws-info docs
  – Deep annotation works OK for bioinformatics service concepts (it’s an EMBL record) but…
  – Annotating with biologically meaningful concepts is harder
    • Data in the mIR (it’s a lymphocyte)
    • Manual annotation cost is high!
  – Service/workflow publication tools
• Dealing with change
  – Ontology changes; service changes; annotations change.

Random Thoughts
• Where does the knowledge come from (see Luc)?
• How do we model trust (see Luc)?
• Scalability of Semantic Web technologies?
• Visualisation of knowledge (see Monica)?
• What’s the lifecycle of provenance?
• Different knowledge models for different disciplines?

knowledge
• Layers of provenance
• Provenance that is domain knowledge
• Provenance for context vs execution workflow provenance
• People vs machine
• Different models for different items, but they still need to be integrated
• Technologies for sharing and integrating that are flexible.

Talk provenance
• myGrid http://www.mygrid.org.uk
  – Jun Zhao, Mark Greenwood, Chris Wroe, Phil Lord, Chris Greenhalgh, Luc Moreau, Robert Stevens
• Hero http://hero.geog.psu.edu/
  – William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal
• Collaboratory for Multi-Scale Chemical Science CMCS
  – James D. Myers, Carmen Pancerella, Carina Lansing, Karen L. Schuchardt, Brett Didier
• Chimera
  – Michael Wilde, Ian Foster
• Knowledge Space
  – Novartis
• And special thanks to Ian Cottam for heroic support when my laptop died yesterday afternoon.