And now, for something completely different…

advertisement
And now, for something
completely different…
Web 2.0 + Web 3.0
= Web 5.0?
The HSFBCY + CIHR + Microsoft Research
SADI and CardioSHARE Projects
Mark Wilkinson & Bruce McManus
Heart + Lung Institute
iCAPTURE Centre, St. Paul’s Hospital, UBC
Who am i?
Mark Wilkinson
Carole Goble
Carole Goble
“Shopping for …data
should be as
easy as shopping for
shoes!!”
Carole’s Shoes
My shoes
It is hard to fill
Carole’s shoes!
Who am i?
Geneticist &
Molecular Biologist
Informatician
Software
Middleware
Robert Stevens
Shouldn’t
be
seen!
Then why am I
presenting to you?
enables
useful
and exciting
behaviours
Cool!
For cardiovascular
researchers
Please indulge me
as I expose
my underwear
…BRIEFly ;-)
(sorry, I couldn’t resist)
The Problem
The Problem
The Holy Grail:
(circa 2002)
Align the promoters of all serine threonine kinases
involved exclusively in the regulation of cell sorting
during wound healing in blood vessels.
Retrieve and align 2000nt 5' from every serine/threonine kinase
in Mus musculus expressed exclusively in the tunica [I | M |A]
whose expression increases 5X or more within 5 hours of
wounding but is not activated during the normal development of
blood vessels, and is <40% homologous in the active site to
kinases known to be involved in cell-cycle regulation in any
other species.
The Problem
The Problem
The Solution??
Not really…
Why not?
Heart
Heart
Occam’s Razor
“Pluralitas non est ponenda
sine neccesitate.”
“Plurality should not be
posited without necessity."
“Biology is hard!”
So… what can we do?
What we need
Our “information systems”
are
DATA-centric
Our “information systems”
are NOT
KNOWLEDGE-centric
(Source: Clarke and Rollo, Education and Training, 2001)
The REAL Solution
Machine-readable
Knowledge
Knowledge
How do we make knowledge
machine-readable?
Ontology
Ontology (Gr: “things which exist” +-logy)
An explicit formal specification of how to
represent the objects, concepts and other
entities that are assumed to exist in some
area of interest and the relationships that
hold among them.
Problem…
Ontology Spectrum
Because it
Catalog/
ID
Thesauri
“narrower
term”
relation
Terms/
glossary
WHY?
Informal
is-a
Because I
say so!
fulfils XXX
Frames
Selected
(Properties) Logical
Formal
is-a
Formal
Value
instance
Restrs.
Constraints
(disjointness,
inverse, …)
General
Logical
constraints
Originally from AAAI 1999- Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty;
– updated by McGuinness.
Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
My Definition of Ontology
(for this talk)
Ontologies explicitly define
the things that exist in “the
world” based on what
properties each kind of
thing must have
Ontology Spectrum
Catalog/
ID
Thesauri
“narrower
term”
relation
Terms/
glossary
Informal
is-a
Frames
Selected
(Properties) Logical
Formal
is-a
Formal
Value
instance
Restrs.
Constraints
(disjointness,
inverse, …)
General
Logical
constraints
My goal with this talk
Clay Shirky
“Ontology is Over-rated”
http://www.shirky.com/writings/ontology_overrated.html
COST
Catalog/
ID
Thesauri
“narrower
term”
relation
Terms/
glossary
Informal
is-a
Frames
Selected
(Properties) Logical
Formal
is-a
Formal
Value
instance
Restrs.
Constraints
(disjointness,
inverse, …)
General
Logical
constraints
COMPREHENSIBILITY
Catalog/
ID
Thesauri
“narrower
term”
relation
Terms/
glossary
Informal
is-a
Frames
Selected
(Properties) Logical
Formal
is-a
Formal
Value
instance
Restrs.
Constraints
(disjointness,
inverse, …)
General
Logical
constraints
Likelihood of being “right”
Catalog/
ID
Thesauri
“narrower
term”
relation
Terms/
glossary
Informal
is-a
Frames
Selected
(Properties) Logical
Formal
is-a
Formal
Value
instance
Restrs.
Constraints
(disjointness,
inverse, …)
General
Logical
constraints
Here’s my argument…
Semantic Web?
An information system where machines can receive
information from one source, re-interpret it, and
correctly use it for a purpose that the source had
not anticipated.
Semantic Web?
If we cannot achieve those two things, then IMO we
don’t have a “semantic web”, we only have a
distributed (??), linked database… and that isn’t
particularly exciting…
Where is the semantic web?
Catalog/
ID
Thesauri
“narrower
term”
relation
Terms/
glossary
Informal
is-a
Frames
Selected
(Properties) Logical
Formal
is-a
Constraints
(disjointness,
inverse, …)
Formal
Value
instance
Restrs.
REASON: “Because I say so” is not open to re-interpretation
General
Logical
constraints
The games we have been
playing for ~2 years
SADI
Find. Integrate. Analyse.
Founding partner
CardioSHARE
Data + Knowledge for Cardiologists.
Founding partner
Brief History
of
CardioSHARE
Automating the interaction between biologists and
their preferred data and analytical resources
(Running since 2002)
Moby defines an ontology of
datatypes, and a WS registry
that is aware of that ontology
I’m studying a mutation in my
favorite organism
I want to know what mutations
in the equivalent gene look like
in another organism
Oh no…
The Moby
Semantic Web Service
Approach
WORKFLOW
Analytical Workflow construction
Using Moby
No explicit coordination between providers
Automated discovery of appropriate tools
Automated execution of selected tools
The machine “understands” the data-type you
have in-hand, and assists you in choosing the
next step in your analysis.
This got me thinking…
Datatypes…
Biology?!?!?
CardioSHARE
The Cardiovascular Semantic Health
And Research Environment
Theatre Analogy
The Components of a Play
The Script
The Stage
The Casting Director & directors
The Actors
The Components of CardioSHARE
Cardiovascular Ontologies
(The Scripts)
SHARE
(The Stage & Set Decorator)
SADI
(The Casting Director)
Data
(The Actors)
DATA
the “actors”
in CardioSHARE
ACTORS:
No intelligence
No Use or Function
(…without a script)
more successful
when clean!!
No Intelligence?
No Use?
Data exhibits “Late binding”
Late binding:
“purpose and meaning”
of the data is
not determined until
the moment it is required
benefit/Consequence
of Late binding:
data is amenable to
constant re-interpretation
How do we achieve this?
Because we
DO NOT CLASSIFY
our data…
we simply hang properties on it
These are axioms
that enable interpretation
Catalog/
ID
Thesauri
“narrower
term”
relation
Terms/
glossary
Frames
Selected
(Properties) Logical
Formal
is-a
Informal
is-a
These are
classification systems
Formal
Value
instance
Restrs.
Constraints
(disjointness,
inverse, …)
General
Logical
constraints
The Components of CardioSHARE
Cardiovascular Ontologies
(The Scripts)
SHARE
(The Stage & Set Decorator)
SADI
(The Casting Director)
Data
(The Actors)
SADI
“Semantic Automated
Discovery and Integration”
SADI
=
…but smarter…
SADI
Instead of specifying
the analysis or database
you specify the
Biological/clinical
RELATIONSHIP
of interest
“Take this gene sequence,
run the BLAST algorithm to
search for genes with similar sequence,
and give me the result”
SADI
“get me the homologues for this gene”
SADI knows that the BLAST algorithm
is used to search for homologous
gene sequences
SADI finds a BLAST server on the Web
and executes your search for you
SADI
By analogy:
SADI, the casting director, knows which
actors are best able to fit the part,
automatically finds them,
and casts them in the play
SADI Web Services
WS Discovery is aided by reasoning
WS interfaces are defined in OWL as two
classes: Input and Output
Input and Output classes include the
property restriction axioms that define
the input to/output from the service
“mini ontology”
SADI Web Services
The WS consumes input data (native RDF)
and adds a new property
(predicate+object) to it.
The Components of CardioSHARE
Cardiovascular Ontologies
(The Scripts)
SHARE
(The Stage & Set Decorator)
SADI
(The Casting Director)
Data
(The Actors)
SHARE
SHARE is a completely generic “stage”
upon which a “play” is going to be
performed
It’s the place where the actors are
assembled to do their thing for an
audience
SHARE
Simply a place where you can ask
ANY question (i.e. the Play)
and see the result
SHARE
At the moment it isn’t
particularly impressive…
By analogy, we have a stage, but we
haven’t hired any set decorators yet!
You’ll see what I mean in a minute…
DEMO
Recap
what we just saw
A SPARQL database query was entered into
the SHARE environment
The query was passed to SADI and was interpreted
based on the properties being asked-about
SADI searched-for, found, and accessed the databases
and/or analytical tools required to generate those
properties
“The play was performed”
Recap
what we just saw
We asked, and answered a complex
“database query”
WITHOUT A DATABASE!!
The Holy Grail:
Align the promoters of all serine threonine kinases
involved exclusively in the regulation of cell sorting
during wound healing in blood vessels.
Retrieve and align 2000nt 5' from every serine/threonine kinase
in Mus musculus expressed exclusively in the tunica [I | M |A]
whose expression increases 5X or more within 5 hours of
wounding but is not activated during the normal development of
blood vessels, and is <40% homologous in the active site to
kinases known to be involved in cell-cycle regulation in any
other species.
The Components of CardioSHARE
Cardiovascular Ontologies
(The Scripts)
SHARE
(The Stage & Set Decorator)
SADI
(The Casting Director)
Data
(The Actors)
Cardiovascular
Ontologies:
Allow us to refer to complex clinical/molecular
phenomenon inside of our query
(i.e. Biological Knowledge!)
Allows experts to define these concepts, and
then have their expert-knowledge re-used by
non-experts in their own queries.
HUH?
QUERY:
Concept:
Retrieve images of
mutations from genes in
“Homologous
organism XXXMutant
that
Image” to this
share homology
gene in organism YYY
WORKFLOW
Phrased in terms of properties:
Find image P where {
Gene Q hasImage image P
Gene Q hasSequence Sequence Q
Gene R hasSequence Sequence R
Sequence Q similarTo Sequence R
Gene R = “my gene of interest” }
…but these are simply axioms…
HomologousMutantImage is {
Gene Q hasImage image P
Gene Q hasSequence Sequence Q
Gene R hasSequence Sequence R
Sequence Q similarTo Sequence R
Gene R = “my gene of interest” }
QUERY:
Retrieve
homologous
mutant images for
gene XXX
CardioSHARE
We construct small, independent OWL classes
representing these kinds of concepts
These classes simplify the construction of complex
queries by “encapsulating” data discovery, retrieval,
and analysis pipelines into simple, easy-to-understand
words and phrases.
CardioSHARE
These Classes are shared on the Web such that
third-parties, potentially with different
expertise, can utilize the expertise of the person
who designed the Class.
Easily share your expertise with others!
Easily utilize the expertise of others!
CardioSHARE
We are not building massive ontologies!
Publish small, independent Class definitions
Cheap
Scalable
Flexible
Don’t try to describe all of biology!
Importantly!
CardioSHARE/SADI releases Semantic Systems
from the constraints of pure logical reasoning
If the defining property restriction of your class
maps to e.g. a statistical or other analytical
service, then classification into that class can be
achieved through non-logical means
Biology is hard!
(Too hard to define purely logically)
DEMO #2
The Holy Grail:
Align the promoters of all serine threonine kinases
involved exclusively in the regulation of cell sorting
during wound healing in blood vessels.
Retrieve and align 2000nt 5' from every serine/threonine kinase
in Mus musculus expressed exclusively in the tunica [I | M |A]
whose expression increases 5X or more within 5 hours of
wounding but is not activated during the normal development of
blood vessels, and is <40% homologous in the active site to
kinases known to be involved in cell-cycle regulation in any
other species.
CardioSHARE architecture: Increasingly complex ontological layers
organize data into richer concepts, even hypotheses
Hypothesis
Ischemia
SADI
Web
“agents”
Hypertension
Blood Pressure
Database 1
Database 2
Analysis
AlgorithmXX
Recap
SADI interprets queries
(SPARQL + OWL Class Definitions)
Determine which properties are available,
and which need to be discovered/generated
Discovery of services via on-the-fly
“classification” of local data with small OWL
Classes representing service interfaces
Recap
CardioSHARE encapsulates individual data
retrieval and analysis workflows into
OWL Classes
An ontology of 1 class
Low-cost, high accuracy
Semantic Web
An information system where machines can receive
information from one source, re-interpret it, and
correctly use it for a purpose that the source had
not anticipated.
What we achieve
Re-interpretation :
The SADI data-store simply collects
properties, and matches them up with
OWL Classes in a SPARWL query and/or
from individual service provider’s
WS interface
What we achieve
Novel re-use:
Because we don’t pre-classify, there is no
way for the provider to dictate how their
data should be used. They simply add
their properties into the “cloud” and
those properties are used in whatever
way is appropriate for me.
What we achieve
Data remains distributed – no warehouse!
Data is not “exposed” as a SPARQL
endpoint  greater provider-control over
computational resources
Yet data appears to be a SPARQL
endpoint… no modification of SPARQL or
reasoner required.
Recap
CardioSHARE allows researchers to:
- Ask questions in natural, intuitive ways
- Execute complex analyses without a bioinformatician
- Access output from databases and analytical tools in
exactly the same way
- Share hypotheses and models in an EXPLICIT manner
- Evaluate other’s hypotheses over your own data
Join us!
SHARE and SADI are Open-Source projects
undertaken with the collaboration of,
and for the benefit of, the biological and bioinformatics
research community.
YOUR participation is welcome
as we move from prototype
to “the real world”!
Fin
Download