Knowledge and Provenance : A knowledge model perspective Carole Goble,

Knowledge and Provenance:
A knowledge model perspective
Carole Goble,
University of Manchester, UK
Talk roadmap
What is this provenance
about and for?
Knowledge for
Provenance
Knowledge
technologies
How do we represent
knowledge for and about
provenance?
The Provenance
of Knowledge
Where do knowledge
assertions come from?
my Context
Knowledge-driven
Middleware for data
intensive in silico
experiments in biology
http://www.mygrid.org.uk
A real bio provenance log
Any and every experimental item attracts
provenance (so long as you can ID it).
• Experimental design components
– workflow specifications; query specifications; notes describing objectives; applications; databases; relevant papers; the web pages of important workers, services
• Experimental instances that are records of enacted experiments
– data results; a history of services invoked by a workflow engine; instances of services invoked; parameters set for an application; notes commenting on the results
• Experimental glue that groups and links design and instance components
– a query and its results; a workflow linked with its outcome; links between a workflow and its previous and subsequent versions; a group of all these things linked to a document discussing the conclusions of the biologist
Provenance is metadata …
• intended for sharing, retrieving, integrating,
aggregating and processing.
• generated with the hope that it is comprehensive
enough to be future-proofed.
• recorded for those who we do not yet know will use
the object and who will likely use it in a different way.
• machine-processable: free text is of limited help.
• Provenance is the knowledge that makes
– An item interpretable and reusable within a context
– An item reproducible or at least repeatable.
• It's part of the information model of any system
Question: What ATPase superfamily proteins are found in mouse?
1. Q9CQV8 O70468 143B_MOUSE from Swiss-Prot version 30, 05/11/02, 16:45 GMT, EBI server.
2. O70455, P54775 143B_MOUSE from Swiss-Prot version 29, 05/11/02 16:45 GMT, local copy.
3. P43686 and P54775 derived by a distributed query over DB1 and DB2.
4. InterPro (no particular version) is a pattern database for protein superfamilies and domains for GPCR’s but you need an account.
5. The publicly available workflow mouse ATPase (http://www.somelab.edu/bio/carole/wf/3345.wsfl) will generate the result from data in your personal repository and you have permission to run the services it needs. Click to run it.
6. The Attwood lab expertise is in nucleotide binding proteins (ATPase superfamily proteins are nucleotide binding proteins).
7. Jones published a new paper on this in Nature Genetics two weeks ago, and you have an account to access it on-line.
8. Smith in your lab asked this question yesterday and the answer he got is annotated by a commentary in his e-Log Book.
9. P43686 (human) calculated by applying the algorithm ABC located at NCBI using data in database AAA
• Database query (know-what)
• Virtual data products (know-how)
• Workflow (know-how)
• Personalised profile (know-whom-to)
• Collaboration & community (know-where, know-when)
• Digital archive (know-which)
• Provenance (know-wherefrom)
• Replicas (know-which)
• Ontology and Inference (know-whether)
• Authorisation, Authentication and Accounting (know-who)
• Explanation (know-why)
• Annotation & notes (know-that)
Provenance is contextual metadata
• We look at the
same things in
different ways
and different
things in the
same way
• Our data alone
does not describe
our work
• We have to
capture this
context.
Hero http://hero.geog.psu.edu/
Hero_knowledge_management.pdf
Downloaded 301103
Provenance forms
• Derivations
– A path like a workflow, script or query.
– Linking items, usually in a directed graph.
– An explanation of when, who, how something was produced.
– Execution Process-centric
• Annotations
– Attached to items or collections of items, in a structured, semi-structured or free text form.
– Annotations on one item or linking items.
– An explanation of why, when, where, who, what, how.
– Data-centric
[Figure: a derivation graph of data items, each labelled with parameter annotations such as mass = 200, decay = WW, stability = 1, event = 8, plot = 1, LowPt = 20, HighPt = 10000]
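The idea of a derivation as a directed graph linking items can be sketched in a few lines of Python. This is only an illustration: the item names (`plot_1`, `events_8` and so on) and the `derived_from` mapping are hypothetical and are not taken from any system shown in the talk.

```python
from collections import deque

# Hypothetical derivation graph: each item maps to the items it was derived from.
derived_from = {
    "plot_1": ["events_8"],
    "events_8": ["stable_sample"],
    "stable_sample": ["ww_sample"],
    "ww_sample": ["mass_200"],
    "mass_200": [],
}


def lineage(item, graph):
    """Return every ancestor of `item`, nearest first (breadth-first walk)."""
    seen, order = set(), []
    queue = deque(graph.get(item, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order
```

Calling `lineage("plot_1", derived_from)` walks back through the chain of items the plot depends on, which is the "where did this come from" question a derivation record is meant to answer.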
Workflows as in silico experiments
• Freefluo workflow enactment engine
– WSFL
– Scufl
• Semantic Workflow discovery
– Finding workflows that others have done, and that I have done myself
• Semantic service discovery
– Finding classes of services
– Guiding service composition
– (We don’t do automated composition)
• Dynamic workflow enactment service discovery and invocation
– Choose services instances when running workflow
• User involvement
Semantic discovery – services & workflows
A registry browser
• Services and workflows in registry have RDF and OWL descriptions
• Selection by the types of inputs they use, outputs they produce, the bioinformatics tasks they perform…
• Querying using RDQL over RDF UDDI registry for operational metadata
• Matching using FaCT OWL classification for concept-based metadata
A workflow wizard
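Selecting services by the types of inputs they use, the outputs they produce and the task they perform could look roughly like the sketch below. It uses plain Python rather than RDQL or an OWL classifier, and every name in it (`ServiceDescription`, `REGISTRY`, `find_services`, the service and type names) is a hypothetical stand-in, not the myGrid registry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical, simplified stand-in for a registry entry."""
    name: str
    inputs: frozenset
    outputs: frozenset
    task: str


REGISTRY = [
    ServiceDescription("blast_like_search", frozenset({"protein_sequence"}),
                       frozenset({"similarity_report"}), "similarity_search"),
    ServiceDescription("snp_lookup", frozenset({"gene_id"}),
                       frozenset({"snp_id"}), "retrieval"),
    ServiceDescription("sequence_fetch", frozenset({"gene_id"}),
                       frozenset({"protein_sequence"}), "retrieval"),
]


def find_services(registry, consumes=None, produces=None, task=None):
    """Return names of services matching every criterion that was given."""
    matches = []
    for service in registry:
        if consumes is not None and consumes not in service.inputs:
            continue
        if produces is not None and produces not in service.outputs:
            continue
        if task is not None and task != service.task:
            continue
        matches.append(service.name)
    return matches
```

A real registry would match on ontology concepts, so that a request for a general input type also finds services described with a more specific one; this sketch only does exact string matching.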
Provenance forms in myGrid
• Derivations
– FreeFluo Workflow
Enactment Engine
provides a detailed
provenance record
stored in the myGrid
Information Repository
(mIR) describing what
was done, with what
services and when
– XML document, soon to
be an RDF model
• Annotations
– Every mIR object has
Dublin Core
provenance properties
described in an attribute
value model
Provenance of data
• Operational execution trail
[Diagram: a process node with the properties input (Gene:AC005412.6), output (SNP:000010197), run_for (urn: Claire Jennings), by_service (lsid:HGVBase_retrieve), start time and end time]
Provenance of knowledge
• Declarative semantic execution trail
[Diagram: the same execution trail, with Gene:AC005412.6 and SNP:000010197 now linked by the assertion contains_single_nucleotide_polymorphism, which is "as stated by" the process run for urn: Claire Jennings by lsid:HGVBase_retrieve]
Provenance of knowledge
• Trust and attribution
[Diagram: the assertion Gene:AC005412.6 contains_single_nucleotide_polymorphism SNP:000010197, as stated by the process run for urn: Claire Jennings by lsid:HGVBase_retrieve, is disputed by urn: Carole Goble]
Provenance of knowledge
• Aggregation and integration
[Diagram: the assertion Gene:AC005412.6 contains_single_nucleotide_polymorphism SNP:000010197 is "as stated by" two processes: one run for urn: Claire Jennings by lsid:HGVBase_retrieve and one run for urn: Bill Jones by lsid:BIGDbretrieve, each with its own start time and end time]
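The step from an execution trail to attributed knowledge can be illustrated with plain Python tuples standing in for RDF triples. This is a sketch under assumptions: the process identifiers `process:1` and `process:2`, the underscore-joined predicate names and the helper functions are hypothetical; the gene, SNP, people and service names come from the slides.

```python
# Execution trail: facts about process runs (process identifiers are hypothetical).
TRAIL = {
    ("process:1", "input", "Gene:AC005412.6"),
    ("process:1", "output", "SNP:000010197"),
    ("process:1", "run_for", "urn: Claire Jennings"),
    ("process:1", "by_service", "lsid:HGVBase_retrieve"),
    ("process:2", "run_for", "urn: Bill Jones"),
    ("process:2", "by_service", "lsid:BIGDbretrieve"),
}

# The knowledge-level assertion, treated as a thing that can itself be described.
ASSERTION = ("Gene:AC005412.6",
             "contains_single_nucleotide_polymorphism",
             "SNP:000010197")

# Statements about the assertion: who states it and who disputes it.
ATTRIBUTION = {
    (ASSERTION, "as_stated_by", "process:1"),
    (ASSERTION, "as_stated_by", "process:2"),
    (ASSERTION, "disputed_by", "urn: Carole Goble"),
}


def objects(subject, predicate, facts):
    """All objects of (subject, predicate, ?) in a set of triples, sorted."""
    return sorted(o for s, p, o in facts if s == subject and p == predicate)


def stated_for(assertion):
    """People on whose behalf a process stating this assertion was run."""
    people = set()
    for process in objects(assertion, "as_stated_by", ATTRIBUTION):
        people.update(objects(process, "run_for", TRAIL))
    return sorted(people)
```

Keeping the assertion separate from the statements about it is what lets two independent runs support the same claim while a third party disputes it.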
20,000 feet and ground level
Top Down provenance
– What is going on?
– Unification and
summaries of collective
provenance knowledge.
– Collaborative,
Awareness, Experience
base, Scientific
Corporate memory.
– “What projects have
something to do with
human SNPs?”
– “What experiments use
the PSI-BLAST service
regardless of version?”
Bottom Up provenance
– Where did this data
object
http://doh.dah.ac.uk/…
come from?
– Which version of SwissProt was run in workflow
http:/blah.ac.uk/…?
[Diagram: layers of provenance knowledge: User, Trust, Domain, Experiment, Execution, Data, Services, Workflow]
Build up layers of provenance knowledge
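The difference between a top-down and a bottom-up question can be sketched over a small list of run records. Everything in the data is hypothetical (experiment and workflow identifiers, the `resource` and `version` fields, the version labels); only the two styles of question come from the slide.

```python
# Hypothetical run records; identifiers and version labels are illustrative only.
RUNS = [
    {"experiment": "exp-A", "workflow": "wf-1", "resource": "PSI-BLAST", "version": "v1"},
    {"experiment": "exp-B", "workflow": "wf-2", "resource": "PSI-BLAST", "version": "v2"},
    {"experiment": "exp-B", "workflow": "wf-2", "resource": "Swiss-Prot", "version": "v30"},
    {"experiment": "exp-C", "workflow": "wf-3", "resource": "Swiss-Prot", "version": "v29"},
]


def experiments_using(resource, runs):
    """Top down: which experiments used a resource, regardless of version?"""
    return sorted({r["experiment"] for r in runs if r["resource"] == resource})


def versions_in_workflow(workflow, resource, runs):
    """Bottom up: which version(s) of a resource did one workflow use?"""
    return sorted({r["version"] for r in runs
                   if r["workflow"] == workflow and r["resource"] == resource})
```

The top-down query deliberately ignores the version field, while the bottom-up query starts from one identified workflow and reads the detail back out.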
Provenance for People and Machines
[Diagram: provenance layers (User, Experiment, Trust, Domain, Services, Data, Execution, Workflow) placed between two poles: People (subjective, contextual, manual/semi-automated) and Machines (objective, context-free, automated)]
1. Explicitly capture Context
Reuse methods and
strategies (e.g., protocols)
Make explicit the
situational bias that is
normally implicit
Enable future generations
of scientists to follow our
work
To capture meaning, we
must devise a way of
representing concepts
and their relationships
Hero http://hero.geog.psu.edu/
Hero_knowledge_management.pdf
Downloaded 301103
1. Explicitly capture Context
Using models and terms
that can be shared and
interpreted
that are extensible and
preclude premature
restrictions
that are navigable and
computationally
processable
Hero http://hero.geog.psu.edu/
Hero_knowledge_management.pdf
Downloaded 301103
2. Bridge islands of exported provenance
[Diagram: separate islands of provenance: Service 1, Service 2, Workflow 1, Experimental Investigation 1, Data 1]
Not all exports are the same
[Diagram: the same islands (Service 1, Service 2, Workflow 1, Experimental Investigation 1, Data 1), each exporting its provenance differently]
So we need to…
• Uniquely identify items through URIs and Life Science Identifiers (GSH/GSR/Handle.net…)
• Explicitly expose provenance by assertions in a common data model…
• Publish and share consensually agreed ontologies so we can share the provenance metadata and add in background knowledge…
• Then we can query, filter, integrate and aggregate the provenance metadata …
• and reason over it to infer more provenance metadata using rules …
• and attribute trust to the provenance …
• Flexibly, so that we do not cast models and terms in stone, and so can cope with different degrees of description.
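Reasoning over provenance to infer more provenance can be as simple as one forward rule. The sketch below assumes a hypothetical rule ("if a process has input X and output Y, then Y is derived from X") and a hypothetical `derived_from` predicate; neither is stated in the talk, they are only there to make the idea concrete.

```python
def infer_derived_from(facts):
    """Apply one rule: process input X + process output Y => (Y, derived_from, X)."""
    inputs, outputs = {}, {}
    for subject, predicate, obj in facts:
        if predicate == "input":
            inputs.setdefault(subject, set()).add(obj)
        elif predicate == "output":
            outputs.setdefault(subject, set()).add(obj)
    inferred = {
        (out, "derived_from", inp)
        for process in inputs.keys() & outputs.keys()
        for out in outputs[process]
        for inp in inputs[process]
    }
    # Return the original facts plus whatever the rule added.
    return set(facts) | inferred


facts = {
    ("process:1", "input", "Gene:AC005412.6"),
    ("process:1", "output", "SNP:000010197"),
    ("process:2", "input", "Gene:AC005412.6"),  # no output recorded, so no inference
}
enriched = infer_derived_from(facts)
```

Because the inferred triples live in the same model as the recorded ones, they can be queried, filtered and aggregated in exactly the same way.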
What’s an Ontology?
A common vocabulary
of terms
Some specification of
the meaning of the
terms
Concepts,
relationships, axioms
A shared consensual
understanding for
people and machines
W3C Metadata language/model
Resource Description Framework
• Common model for metadata
• Assertions as triples (subject, predicate, object) forming graphs.
• Associate URIs (LSIDs) with other URIs (LSIDs).
• Associate URIs with OWL concepts (which are URIs).
• RDQL, repositories, integration tools, presentation tools
• Query over, Link together, Aggregate, Integrate assertions.
• Avoids pre-commitment
– Self-describing
– Incremental
– Extensible
– Advantage and drawback.
[Diagram: Data, Workflow, Experiment, User, Service]
Graphic based on Tim Berners-Lee
http://www.w3.org/2003/Talks/0521-www-keynote-tbl/slide22-0.html
Bridging islands
[Diagram: the islands Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1]
Bridging islands: Concepts and LSID
[Diagram: the islands Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1, each exporting its provenance as RDF]
W3C Ontology language/model: OWL
• Continuum of expressivity
– Concepts, roles, individuals, axioms
– From simple frames to description logics
– Sound and complete formal semantics
– Compositional and property based
• Reasoning to infer classification
• Eas(ier) to extend and evolve and merge ontologies
• A web language
• Tools, tools, tools!
[Diagram: language lineage: RDF, DAML, OIL, DAML+OIL, OWL]
Bridging islands: Concepts and LSIDs
[Diagram: the islands Service 1, Service 2, Workflow 1, Experimental Investigation 1 and Data 1, each exporting its provenance as RDF]
Bridging islands: Concepts and LSIDs
[Diagram: the same islands and their RDF exports, now linked to one another through shared LSIDs]
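Bridging islands through shared identifiers can be sketched as a set union followed by a query that crosses the former boundary. The LSID-style identifiers under `example.org`, the predicate names and both helper functions are hypothetical; the point is only that a common triple model plus shared identifiers makes the merge trivial.

```python
# Two hypothetical provenance exports; the LSID-style identifiers are made up.
RUN = "urn:lsid:example.org:runs:1"
SERVICE = "urn:lsid:example.org:services:1"
DATA = "urn:lsid:example.org:data:1"

WORKFLOW_EXPORT = {
    (RUN, "used_service", SERVICE),
    (RUN, "produced", DATA),
}
SERVICE_EXPORT = {
    (SERVICE, "performs_task", "retrieval"),
}


def merge(*exports):
    """Union of triple sets; shared identifiers join the graphs automatically."""
    merged = set()
    for export in exports:
        merged |= export
    return merged


def tasks_behind(data_item, facts):
    """Which tasks were performed by the services that produced a data item?"""
    runs = {s for s, p, o in facts if p == "produced" and o == data_item}
    services = {o for s, p, o in facts if p == "used_service" and s in runs}
    return sorted(o for s, p, o in facts
                  if p == "performs_task" and s in services)
```

Neither export alone can answer `tasks_behind(DATA, ...)`; the answer only appears once the two are merged, because the service identifier occurs in both.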
Layers of Knowledge Languages
Attribution
Explanation
Rules &
Inference
Ontologies
Metadata
Standard
Syntax
Identity
Wedding cake courtesy of Tim Berners-Lee
myGrid
everything has a concept & LSID
Workflows
Literature
Provenance
record of
workflow runs
Notes
Ontologies
People
Data holdings
Services
Linking objects to objects via URIs and LSIDs
People who
wrote the
workflow
Literature
People to notify
of the workflow
status
Provenance
record of
workflow runs
Provenance of the workflow
template. Related
workflows.
Notes
Data holdings
Ontologies
describing
workflows
Services used
Generated link
anchors
Lymphocyte and
neutrophil are
subsumed by the
concept white
blood cell
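The subsumption example above (lymphocyte and neutrophil under white blood cell) can be mimicked with a toy hierarchy walk. This is not a description-logic reasoner such as FaCT, only a sketch; the extra `cell` level, the `data:` identifiers and the function names are hypothetical.

```python
# Toy concept hierarchy: child concept -> parent concept.
PARENT = {
    "lymphocyte": "white blood cell",
    "neutrophil": "white blood cell",
    "white blood cell": "cell",  # hypothetical extra level for illustration
}

# Hypothetical annotations linking data items to concepts.
ANNOTATIONS = {
    "data:1": "lymphocyte",
    "data:2": "neutrophil",
    "data:3": "cell",
}


def is_subsumed_by(concept, ancestor, parent=PARENT):
    """True if `concept` equals `ancestor` or sits anywhere below it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = parent.get(concept)
    return False


def items_about(concept):
    """Items annotated with `concept` or with any concept it subsumes."""
    return sorted(item for item, annotated in ANNOTATIONS.items()
                  if is_subsumed_by(annotated, concept))
```

A query for white blood cell data then also returns items annotated only as lymphocyte or neutrophil, which is what makes concept-based link generation more useful than keyword matching.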
Annotating a workflow log with concepts
1. Choose the ontology
2. Select an area to annotate with
3. Select the concept
4. Provide a description
5. Create the annotation
Generating provenance
[Diagram: a Scufl workflow template is identified in the mIR, supplied with input data & parameters, and bound to services using OWL descriptions from the registry; the FreeFluo WFEE runs the workflow execution. Data and metadata from the run (startTime, endTime, service instances invoked …) go into an execution provenance log in RDF. Knowledge arising from the workflow, via a workflow knowledge template, goes into a knowledge provenance log in RDF+OWL.]
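Recording startTime, endTime and the service invoked for each step can be sketched as a thin wrapper around a call. This is an assumption-laden illustration and not how FreeFluo works internally: `run_step`, the log record layout and the example step are all hypothetical.

```python
from datetime import datetime, timezone


def run_step(service_name, func, inputs, log):
    """Invoke one step and append an execution provenance record to `log`."""
    start = datetime.now(timezone.utc)
    output = func(inputs)
    end = datetime.now(timezone.utc)
    log.append({
        "service": service_name,
        "input": inputs,
        "output": output,
        "startTime": start,
        "endTime": end,
    })
    return output


# Usage: a trivial stand-in "service" that upper-cases a sequence string.
provenance_log = []
result = run_step("to_upper", str.upper, "acgt", provenance_log)
```

Each record could then be turned into triples about a process node, giving the operational trail from which knowledge-level assertions are later built.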
P Afflard et al The Grid(s)? @ Novartis presented at PRISM PharmaGrid
retreat, July 2003
William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal Supporting Collaborative Science through a
Knowledge and Data Management Portal in 1st Semantic Web Conference (ISWC2003) Workshop on Retrieval
of Scientific Data, Florida, USA, October 2003
Two views of a gravity model concept
from the Hero CODEX web tool
• An ontological
description shows
how one
geoscientist
constructs a
model
• a social network reveals
which users favour
different instances of the
model, with edge length
suggesting the degree of
support.
Collaboratory for Multi-Scale Chemical
Science
CMCS “Pedigree
Graph” portlet
showing provenance
relationships between
resources (colour
coded by original
relationship type).
CMCS Pedigree
Browser
showing the
metadata and
relationships of the
selected data set.
Provenance dimensions connected by concepts
and identifiers
[Diagram: provenance dimensions: project, Services, Author, Workflow instances, workflow template]
Based on http://www.w3.org/2003/Talks/0521-www-keynote-tbl/slide22-0.html
Reflections: annotations
• The annotation metadata model for myGrid holdings is a graph
– If it waddles like RDF and quacks like RDF, it's RDF
– Experiments in RDF scalability
– Co-existence of RDF and other data models (relational)
• Acquisition of annotations and adverts
– Automated by mining WSDL docs, mining ws-info docs
– Deep annotation works ok for bioinformatic service
concepts (it’s an EMBL record) but…
– Annotating with biologically meaningful concepts is
harder
• Data in the mIR (it’s a lymphocyte)
• Manual annotation cost is high!
– Service/workflow publication tools
• Dealing with change
– Ontology changes; service changes; annotations
change.
Random Thoughts
• Where does the knowledge come from (see Luc)?
• How do we model trust (see Luc)?
• Scalability of Semantic Web technologies?
• Visualisation of knowledge (see Monica)?
• What’s the lifecycle of provenance?
• Different knowledge models for different disciplines?
knowledge
• Layers of provenance
• Provenance that is domain knowledge
• Provenance for context vs execution
• People vs machine
• Different models for different items but still needs to be integrated
• Technologies for sharing and integrating that are flexible.
[Diagram label: workflow provenance]
Talk provenance
• myGrid http://www.mygrid.org.uk
– Jun Zhao, Mark Greenwood, Chris Wroe, Phil Lord, Chris
Greenhalgh, Luc Moreau, Robert Stevens
• Hero http://hero.geog.psu.edu/
– William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal
• Collaboratory for Multi-Scale Chemical Science CMCS
– James D. Myers, Carmen Pancerella, Carina Lansing,
Karen L. Schuchardt, Brett Didier
• Chimera
– Michael Wilde, Ian Foster
• Knowledge Space
– Novartis
• And special thanks to Ian Cottam for heroic support
when my laptop died yesterday. Afternoon.