Sharing the wealth Bruce McManus & Mark Wilkinson iCAPTURE

BioMOBY
the one that almost got away
Mark Wilkinson, iCAPTURE Centre
UBC, Vancouver, Canada
MOBY-S Update for VanBug
Vancouver, BC, Canada, 2004
Make some sense of this mess!
Along came web services
• Relatively recently added to the bioinformatics
tool-belt
• Didn’t help the situation much…
– A web service that consumes “string” data types
might be expecting a fasta sequence, or a
keyword.
– No clear way for a machine to know which
– UDDI/WSDL is not very useful in solving this
problem
• Biology/Bioinformatics has a lot of data-types!
Who is MOBY’s audience?
• Information is distributed
– Beyond Flybase, MIPS, EnsEMBL and TAIR
– MOST data never makes it off the scientist’s hard drive
– This data should be added to the global scientific archive
• Biologists, by and large, are willing and able, but…
– The Web was embraced enthusiastically by biologists
– Most wet labs run a website in which they present at least some of
their results and data through HTML or CGI
– Unfortunately, this only adds to the chaos…
The interoperability solution we design must be
simple enough for a Biologist, with a little bit of
computer knowledge, to implement on their own
The MOBY Plan
• Define data-types commonly used in bioinformatics
• Organize these into an ontology
• Ontologically define web service inputs and outputs
• Register the inputs and outputs of each service
provider in a “yellow pages” registry
• Machines can find an appropriate service
• Machines can execute that service unattended
Overview of MOBY-S Transactions
[Figure: MOBY hosts & services (Alignment, Sequence, Gene names, Express., Phylogeny, Protein, Primers, Alleles, …) and MOBY Central]
MOBY-S Data Types
• My disappointment with web services not being (easily)
able to distinguish between a Fasta sequence and a
keyword led me to spend a lot of time thinking about
data-types.
• This consideration became the core focus of MOBY-S
• Constraints on MOBY-S are much more severe than on
an “archetypal” computer-science solution
– our target audience is not high-level programmers
– Defining data types with XML schema is a non-starter: IT WILL
NEVER HAPPEN!
MOBY-S in detail
• MOBY-S Data typing system: Semantic Type
• MOBY-S Data typing system: Syntactic Type
• The MOBY-S Service Ontology
• The MOBY Central Registry
MOBY-S Semantic Typing:
Namespaces
• Any identifiable piece of data is an “entity”
• Identifiers fall into particular “Namespaces”
– NCBI has gi numbers (gi Namespace)
– GO Terms have accession numbers (GO Namespace)
• Namespaces indicate data’s semantic type.
– GO:0003476 represents a Gene Ontology Term, not a sequence
– gi|163483 represents a GenBank record
• However, we cannot tell if it is protein, RNA, or DNA sequence
• Namespace+ID is sufficient to specify a particular “entity”
• The namespace is assumed to be sufficiently descriptive
of the data’s semantic type that a service provider can
define their interface in terms of Namespaces
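A minimal Python sketch of the Namespace+ID idea, assuming only the two identifier spellings shown above (GO:0003476 and gi|163483). The Entity class and parse_identifier helper are hypothetical names for illustration, not part of any MOBY library.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """An entity is fully specified by its namespace plus its ID."""
    namespace: str
    id: str


def parse_identifier(text: str) -> Entity:
    """Split 'GO:0003476' or 'gi|163483' into (namespace, id).

    Assumption: only the two separators seen on the slides are handled.
    """
    for separator in ("|", ":"):
        if separator in text:
            namespace, _, identifier = text.partition(separator)
            return Entity(namespace, identifier)
    raise ValueError(f"no namespace found in {text!r}")


go_term = parse_identifier("GO:0003476")   # a Gene Ontology term
genbank = parse_identifier("gi|163483")    # a GenBank record
```

Note that the namespace says what kind of thing the record is, but, as the slide points out, not whether a GenBank record holds protein, RNA, or DNA.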
MOBY-S Syntactic Typing:
The Object Ontology
• Syntactic types are defined by a GO-like
ontology
– Type (Class) name at each node
– Edges define the relationships between one Class and another
– Gene Ontology used as a model because of its obvious success
and comprehension by the model organism community
• Edges define one of three relationship types
– ISA
• Inheritance relationship
• All properties of the parent are present in the child
– HASA
• Container relationship of ‘exactly 1’
– HAS
• Container relationship with ‘1 or more’
A portion of the MOBY-S
Object Ontology
ISA inheritance
relationship
• Classes become more specialized as you
move along the ISA relationship hierarchy
DNA_Sequence ISA Nucleotide_Sequence ISA Generic_Sequence ISA Virtual_Sequence ISA Object
• Objects do not become more complex as
a result of ISA relationships alone
HASA & HAS
relationships
• HASA and HAS relationships make
Classes more complex by embedding
Classes within Classes
• Virtual_Sequence ISA Object
• Virtual_Sequence HASA Length (Integer)
• Generic_Sequence ISA Virtual_Sequence
• Generic_Sequence HASA Sequence (String)
• Annotated_GIF ISA Image (base_64_GIF)
• Annotated_GIF HAS Description (String)
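The three relationship types can be sketched as a small Python data structure. This is an illustration only: the OntologyClass type and the helper names are hypothetical, and placing Integer, String and Image directly under Object is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class OntologyClass:
    name: str
    isa: Optional[str] = None  # single ISA parent
    # (relationship, article name, member class); relationship is
    # "HASA" (exactly 1) or "HAS" (1 or more)
    members: List[Tuple[str, str, str]] = field(default_factory=list)


ONTOLOGY: Dict[str, OntologyClass] = {
    c.name: c
    for c in [
        OntologyClass("Object"),
        OntologyClass("Integer", isa="Object"),   # assumed placement
        OntologyClass("String", isa="Object"),    # assumed placement
        OntologyClass("Image", isa="Object"),     # assumed placement
        OntologyClass("Virtual_Sequence", isa="Object",
                      members=[("HASA", "Length", "Integer")]),
        OntologyClass("Generic_Sequence", isa="Virtual_Sequence",
                      members=[("HASA", "Sequence", "String")]),
        OntologyClass("Nucleotide_Sequence", isa="Generic_Sequence"),
        OntologyClass("DNA_Sequence", isa="Nucleotide_Sequence"),
        OntologyClass("Annotated_GIF", isa="Image",
                      members=[("HAS", "Description", "String")]),
    ]
}


def lineage(name: str) -> List[str]:
    """Return class names from `name` up to the root, following ISA edges."""
    chain = []
    current: Optional[str] = name
    while current is not None:
        chain.append(current)
        current = ONTOLOGY[current].isa
    return chain


def all_members(name: str) -> List[Tuple[str, str, str]]:
    """Collect members inherited through ISA, root-most class first."""
    collected = []
    for class_name in reversed(lineage(name)):
        collected.extend(ONTOLOGY[class_name].members)
    return collected
```

Here ISA only adds depth to the lineage; a class gains members solely through HASA and HAS, so DNA_Sequence carries exactly the Length and Sequence members it inherits.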
Legacy file formats
• Inheriting from “String” allows us to define ontological classes that
represent legacy data types
• NCBI_Blast_Report ISA text-formatted ISA String
<NCBI_Blast_Report namespace="NCBI_gi" id="115325">
TBLASTN 2.0.4 [Feb-24-1998]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|1401126
(504 letters)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters
Searching...done
                                                                     Score    E
Sequences producing significant alignments:                          (bits)  Value
gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...   1009  0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...     58  4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                    53  1e-05
gb|U12856|ATU12856 Arabidopsis thaliana Col-0 abscisic acid inse...     53  1e-05
</NCBI_Blast_Report>
Binaries
• We base64 encode binaries, and again define data classes that
inherit from String
• base64_encoded_jpeg ISA text/base64 ISA text/plain ISA String
<base64_encoded_jpeg namespace="TAIR_image" id="3343532">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
BAgTDFdlc3Rlcm4gQ2FwZTESMBAGA1UEBxMJQ2FwZSBUb3duMQ8wDQYDVQQKEwZUaGF3dGUx
HTAbBgNVBAsTFENlcnRpZmljYXRlIFNlcnZpY2VzMSgwJgYDVQQDEx9QZXJzb25hbCBGcmVl
bWFpbCBSU0EgMjAwMC44LjMwMB4XDTAyMDkxNTIxMDkwMVoXDTAzMDkxNTIxMDkwMVowQjEf
MB0GA1UEAxMWVGhhd3RlIEZyZWVtYWlsIE1lbWJlcjEfMB0GCSqGSIb3DQEJARYQamprM0Bt
</base64_encoded_jpeg>
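Wrapping a binary this way can be sketched with the Python standard library. The wrap_binary helper is a hypothetical name, and the four payload bytes are just the start of a JPEG header used as stand-in data; the attribute layout follows the example above rather than any formal MOBY specification.

```python
import base64
from xml.sax.saxutils import quoteattr


def wrap_binary(class_name: str, namespace: str, ident: str, payload: bytes) -> str:
    """Base64-encode `payload` and wrap it in a single XML element."""
    encoded = base64.b64encode(payload).decode("ascii")
    return (
        f"<{class_name} namespace={quoteattr(namespace)} id={quoteattr(ident)}>"
        f"{encoded}"
        f"</{class_name}>"
    )


# Stand-in payload: the first four bytes of a typical JPEG file.
xml_text = wrap_binary("base64_encoded_jpeg", "TAIR_image", "3343532",
                       b"\xff\xd8\xff\xe0")
```

Because base64 output uses only letters, digits, "+", "/" and "=", the encoded text needs no further XML escaping.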
Extending legacy data types
• With legacy data-types defined, we can extend them as we see fit
• annotated_jpeg ISA base64_encoded_jpeg
• annotated_jpeg HASA 2D_Coordinate_set
• annotated_jpeg HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
<CrossReference>
<Object namespace="TAIR_Allele" id="ufo-1"/>
</CrossReference>
<2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
<CrossReference>
<Object namespace="TAIR_Tissue" id="122"/>
</CrossReference>
<Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
<Integer namespace="" id="" articleName="y_coordinate">663</Integer>
</2D_Coordinate_set>
<String namespace="" id="" articleName="Description">
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</annotated_jpeg>
The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
<CrossReference>
<Object namespace="TAIR_Allele" id="ufo-1"/>
</CrossReference>
<2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
<CrossReference>
<Object namespace="TAIR_Tissue" id="122"/>
</CrossReference>
<Integer namespace="" id="" articleName="x_coordinate"> 3554 </Integer>
<Integer namespace="" id="" articleName="y_coordinate"> 663 </Integer>
</2D_Coordinate_set>
<String namespace="" id="" articleName="Description">
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</annotated_jpeg>
The Object Ontology:
Defines an XML Schema
• Object Ontology terms have semantically rich
names, but this is for human intuition only
– DNA Sequence
– Annotated_GIF
• Object Ontology does not define what these
data-types mean – NO SEMANTICS
• It does define the XML schema of their
representation - SYNTAX
The Object Ontology:
Defines an XML Schema!
• The position of an ontology node precisely
defines the syntax by which that node will be
represented
• End-users can define new data-types without
having to write an XML schema!
– This was an important aim of the project
• Similarly you can, at run-time, determine the
schema of any incoming XML by querying the
ontology.
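As a rough illustration of how an ontology position can dictate syntax, the sketch below walks the ISA chain to collect a class's members and then emits XML in the attribute layout used in the earlier examples (namespace, id, articleName). The ONTOLOGY table, schema_for and to_xml are hypothetical names, only HASA members are modelled, and this is not the actual MOBY serialization code.

```python
import xml.etree.ElementTree as ET

# class name -> (ISA parent, [(article name, member class), ...]); HASA only
ONTOLOGY = {
    "Object": (None, []),
    "Integer": ("Object", []),   # assumed placement
    "String": ("Object", []),    # assumed placement
    "Virtual_Sequence": ("Object", [("Length", "Integer")]),
    "Generic_Sequence": ("Virtual_Sequence", [("Sequence", "String")]),
}


def schema_for(class_name):
    """Return the members a class must carry, inherited ones first."""
    parent, members = ONTOLOGY[class_name]
    inherited = schema_for(parent) if parent else []
    return inherited + members


def to_xml(class_name, namespace, ident, values):
    """Serialize `values` in the order the ontology dictates."""
    root = ET.Element(class_name, {"namespace": namespace, "id": ident})
    for article_name, member_class in schema_for(class_name):
        child = ET.SubElement(
            root, member_class,
            {"namespace": "", "id": "", "articleName": article_name},
        )
        child.text = str(values[article_name])
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml("Generic_Sequence", "NCBI_gi", "163483",
                  {"Length": 8, "Sequence": "ATGGCCTA"})
```

The same schema_for lookup is what a client could run on incoming XML: given the element name, it learns which child elements to expect without ever reading an XML schema file.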
The Service Ontology
• A simple ISA hierarchy
• Rooted in the base “Service” transformation
(never instantiated)
• Primitive types include:
– Analysis
– Parsing
– Registration
– Retrieval
– Resolution
The Service Ontology
[Figure: a portion of the Service Ontology]
Parsing ISA Service
Analysis ISA Service
Parse_NCBI_Blast ISA Parsing
Alignment ISA Analysis
Blast ISA Alignment
WU_Blast ISA Blast
NCBI_Blast ISA Blast
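A simple ISA hierarchy like this is easy to query. The sketch below assumes the diagram is read as Parsing and Analysis under Service, Parse_NCBI_Blast under Parsing, Alignment under Analysis, Blast under Alignment, and WU_Blast and NCBI_Blast under Blast; the function names are hypothetical.

```python
from typing import Dict, List, Optional

# child -> parent, as read from the diagram (an assumption about its layout)
SERVICE_ISA: Dict[str, str] = {
    "Parsing": "Service",
    "Analysis": "Service",
    "Parse_NCBI_Blast": "Parsing",
    "Alignment": "Analysis",
    "Blast": "Alignment",
    "WU_Blast": "Blast",
    "NCBI_Blast": "Blast",
}


def is_a(service_type: str, ancestor: str) -> bool:
    """True if `service_type` equals `ancestor` or descends from it."""
    current: Optional[str] = service_type
    while current is not None:
        if current == ancestor:
            return True
        current = SERVICE_ISA.get(current)
    return False


def descendants(ancestor: str) -> List[str]:
    """All strictly more specialized service types, sorted by name."""
    return sorted(t for t in SERVICE_ISA if t != ancestor and is_a(t, ancestor))
```

A client asking for any "Alignment" service could use descendants("Alignment") to widen the search to Blast, NCBI_Blast and WU_Blast.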
MOBY Central: The yellow pages
• MOBY Central is a registry for MOBY-compliant
services
• Not UDDI-based
• Services register:
– “Service Signature” - a triple of [input, service_type, output]
– A human readable description of the service
– The URL to the service interface
• Provides two types of interfaces:
– Register/Deregister
– Search/Retrieve
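The registry idea can be sketched as a toy in-memory "yellow pages". Everything named here is hypothetical: the ServiceRecord and Registry classes, the name field used as a key, the demo service and its example.org URL. Matching is by exact string only, whereas a real registry would presumably consult the ontologies; this is not the MOBY Central API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ServiceRecord:
    name: str           # hypothetical unique key, used for deregistration
    input_type: str     # service signature: input
    service_type: str   # service signature: service type
    output_type: str    # service signature: output
    description: str    # human-readable description
    url: str            # where the service interface lives


class Registry:
    """Toy registry exposing Register/Deregister and Search/Retrieve."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceRecord] = {}

    def register(self, record: ServiceRecord) -> None:
        self._services[record.name] = record

    def deregister(self, name: str) -> bool:
        """Remove a service; return False if it was not registered."""
        return self._services.pop(name, None) is not None

    def search(self, input_type: Optional[str] = None,
               service_type: Optional[str] = None,
               output_type: Optional[str] = None) -> List[ServiceRecord]:
        """Return records matching every criterion that is not None."""
        wanted = {"input_type": input_type,
                  "service_type": service_type,
                  "output_type": output_type}
        return [
            record for record in self._services.values()
            if all(value is None or getattr(record, key) == value
                   for key, value in wanted.items())
        ]


registry = Registry()
registry.register(ServiceRecord(
    name="demo_blast",
    input_type="DNA_Sequence",
    service_type="NCBI_Blast",
    output_type="NCBI_Blast_Report",
    description="Hypothetical BLAST service used only for this example",
    url="http://example.org/moby/demo_blast",
))
hits = registry.search(input_type="DNA_Sequence")
```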
A simple MOBY-S browser is
embedded in Gbrowse
• gbrowse_moby can be configured to
execute MOBY Services in response
to mouse-clicks in the Gbrowse
sequence viewer.
• It isn’t a powerful client, but it reveals
some interesting MOBYesque
behaviours…
Semantic Web “on the fly”!
• This simple browser behaves very much like a
semantic web browser
– Information from non-coordinated service providers
is discovered at run-time in response to queries.
• It does so without semantics - Syntax only!
Semantic Web “on the fly”!
• Perhaps Interoperability is not a semantic
problem?
• Data Integration may be more of a semantic
problem (??)
• Service Discovery, however, definitely is a
semantic problem
Ugh…. Tedious!
• The simple browser is frustrating in many
ways
– design once, run once
– Analysis of only one data-element at a time
– No way to extract the data at the end of the analysis
– No provision information is saved
• myGrid has been working on similar problems
• The BioMOBY project has secretly absconded
with one of the myGrid employees, and he
now works for us! Shhhhhhh! ;-)
TAVERNA
A fantastic client program that can
talk to MOBY Central
and execute MOBY Services
Taverna was written by Tom Oinn
with MOBY input by Martin Senger
as part of the myGrid project
MOBY-S: On reflection
• Two years into the project
• >140 services registered and growing
• ~20 independent service providers (not part of
the BioMOBY project)
• Codebase not yet developed beyond a working
prototype
• myGrid is making great progress, and has 25X
more funding than we have!
• It is now time to step back and take a critical
look at what we achieved, where we failed, and
where to go from here
What MOBY got RIGHT
• Open source, community driven
1. Involving the model organism community right from
the start has made an enormous impact on the
early acceptance and adoption of MOBY
2. Rapid feedback on success/failure
– we had “real” users right from the prototype stage!
3. The community has been very forgiving of
“hiccups” because they are included in the
development process
What MOBY got RIGHT
• Data typing
1. Does not attempt to re-structure legacy data-types
– passed verbatim in a lightweight XML wrapper.
– There are TONS of parsers out there
– Entire software projects are built around extracting
information from these legacy formats.
2. Ontology dictates data structure/sub-structure
– XML can be parsed, with the “meaning” of each substructure encountered being defined by the ontology
– Thus MOBY data is more “self-describing” than XML
even with an XML schema
What MOBY got RIGHT
• Data typing
3. Provides a foundation for future data-type
definitions
– New data-types can be defined by end-users
– New data-types can be defined in a structured, machine-readable way, rather than by new ad hoc flat-file format.
– Unsophisticated data providers have an “environment”
that structures their thinking about the data they are
providing.
– XML schema creation is unnecessary
– REMEMBER WHO OUR TARGET AUDIENCE IS!!
4. Object ontology simplifies creation of visualization
tools in an environment where the number/nature
of data types is changing daily.
What MOBY got RIGHT
• Data typing
5. Provides a standard way of annotating the
data object, and/or any of its sub-structures
– Annotations are kept separate from the data itself
(versus e.g. hypertext)
– Multiple annotations per data component
– Mechanism for indicating the semantic
relationship between the annotation and the data
being annotated
6. Separation of the semantic data-type from
its syntax
– The same data “entity” can be instantiated in a
wide variety of ways
What MOBY got RIGHT
• Data typing
7. Despite all of this potential richness, the data can
be remarkably simple!!!
– Often a single XML tag is all that is required
– REMEMBER WHO OUR TARGET AUDIENCE IS!!
What MOBY got RIGHT
• Messaging structure
1. Having a predictable messaging layer
dramatically simplifies the interoperability
problem
– Yes, I know, this goes against the most fundamental
rules of the “open world” Web!
– REMEMBER WHO OUR TARGET AUDIENCE IS!!
2. Provides a standardized structure into which
provision information can be added
3. Dictates what constitutes an “error”
– “I don’t know” is NOT an error in MOBY
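One way to picture the distinction between "I don't know" and an error is a lookup that returns an empty result for an unknown identifier and raises only for a malformed request. This is a sketch under stated assumptions: the retrieve function, the TOY_DATA table and its single record are hypothetical and do not reflect the actual MOBY message format.

```python
from typing import Dict, List, Tuple

# Hypothetical toy records keyed by (namespace, id)
TOY_DATA: Dict[Tuple[str, str], str] = {("demo_ns", "1"): "ATGGCC"}


def retrieve(namespace: str, ident: str) -> List[str]:
    """Return matching records; an unknown ID yields an empty list."""
    if not namespace or not ident:
        # A malformed request is a genuine error.
        raise ValueError("malformed query: namespace and id are both required")
    hit = TOY_DATA.get((namespace, ident))
    # "I don't know" is a valid, empty answer, not an exception.
    return [hit] if hit is not None else []
```

A client chaining several services can then treat an empty answer as a normal outcome and keep going, reserving error handling for requests that were actually wrong.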
What MOBY got WRONG
• Service typing
The problem with MOBY
Chickens go in;
Pies come out!
The problem with MOBY
What sort o’
pies?
The problem with MOBY
Apple!
What MOBY got WRONG
• Service typing
1. Describing bioinformatics services is HARD!
2. The MOBY plan was to simply describe them
“the way a biologist speaks”
1. “I’m going to Blast this sequence” → Service type Blast
2. “I need to retrieve this sequence” → Service type Retrieve
3. This doesn’t really work very well, since
services can be arbitrarily complex.
What MOBY got WRONG
• Service typing
– MOBY Service ontology suffers from single-parenting
• A “Blast Report Parsing” service is a unique node in the
ontology.
• Better to have a service described as the intersection of a
variety of orthogonal concepts:
• A Blast Report Parser is “a Parser that operates on a Blast
Report datatype.”
• The TAMBIS project (same research team as myGrid) is a
perfect example of how this can and should be done.
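Describing a service as an intersection of orthogonal concepts can be illustrated with plain facet matching. This is only a sketch of the idea, not how TAMBIS or myGrid implement it; the facet names (operation, operates_on), the service names and the Fasta and Generic_Sequence assignments are hypothetical.

```python
from typing import Dict, List

# Each service is described by independent facets rather than one ontology node.
SERVICES: Dict[str, Dict[str, str]] = {
    "blast_report_parser": {"operation": "Parsing",
                            "operates_on": "NCBI_Blast_Report"},
    "fasta_parser": {"operation": "Parsing",
                     "operates_on": "Fasta"},
    "blast_search": {"operation": "Alignment",
                     "operates_on": "Generic_Sequence"},
}


def find(**facets: str) -> List[str]:
    """Return services whose description matches every requested facet."""
    return sorted(
        name for name, description in SERVICES.items()
        if all(description.get(key) == value for key, value in facets.items())
    )


parsers = find(operation="Parsing")
blast_parsers = find(operation="Parsing", operates_on="NCBI_Blast_Report")
```

With facets, "a Parser that operates on a Blast Report datatype" needs no dedicated node: it falls out of combining two independent descriptions.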
What MOBY got WRONG
• Service typing
– MOBY desperately needs a “legitimate” service type
ontology
– myGrid has one, and a registry as well
– We will soon completely devolve our service
description & discovery layer to myGrid
• i.e. the end of MOBY Central
– They have enough funding to ensure that the code is
robust and well-designed
– Can we make service description simple enough for
biologists, even with the rich myGrid ontologies?
– REMEMBER WHO OUR TARGET AUDIENCE IS!!
Usage of MOBY Central
[Figure: MOBY Central API calls per month, January to July 2004; vertical axis from 0 to 400,000 API calls]
Early Adopters
The PlaNet Consortium
PlaNet Consortium
Members
• Institute for Bioinformatics (IBI) / MIPS,
Neuherberg
• Flanders Interuniversity Institute for
Biotechnology (VIB), Gent
• Genoplante-Info, Evry
• Nottingham Arabidopsis Stock Centre (NASC),
Nottingham
• John-Innes-Centre, Norwich
• Plant Research International (PRI), Wageningen
• Centro Nacional de Biotecnología, Madrid (CNB)
Early Adopters
CGIAR
Generation Challenge Program
GCP Consortium Members
Early Adopters
Commonwealth Scientific And
Industrial Research Organization
CSIRO
• Will begin deploying services in
~January
Unexpected phenomenon
• In every case, these consortia have set up their
own instances of the MOBY Central registry
– This was not how I had expected that MOBY would be used!
– Could be due to the lack of a descriptive service ontology
– Could be sociological
– Could be security (MOBY Central API is open)
– Probably a bit of each…
• This is a critical observation when it comes to
architectural decisions vis-à-vis registry setup
– Deployment of “boutique” registries must be TRIVIAL!
– This will be an important consideration in our collaboration with
myGrid…
Hey, those are all plant databases!
• For some reason, MOBY has been more
rapidly adopted by the plant community
than by other communities
• Could be personal (My PhD is in Botany)
• Could be ethical
The heart is also
biologically
important!
(Murray and Lopez, The global burden of disease : a comprehensive
assessment of mortality and disability from diseases, injuries, and risk
factors in 1990 and projected to 2020, 1996)
CVD-Related Deaths for 2001
(By WHO Region, Deaths in Thousands)
(Source: World Health Organization, The World Health Report
2002: Reducing Risks and Promoting Healthy Life, 2002)
Sharing the wealth
Mark Wilkinson & Bruce McManus
iCAPTURE Centre for Cardiovascular
and Pulmonary Research
UBC, Vancouver, British Columbia
Canada
Toward Optimal Knowledge Delivery
in the Cardiovascular Sciences
“Sometimes what your
listeners hear is more
interesting than what you’ve
actually said.”
~ Don Moyer, Harvard Business Review
(I am once again talking about vaporware….)
“In 25 years,
[information] will
double every
three months.
What will that do
for learning
requirements?”
~Doug Engelbart
“Information is not
knowledge.”
~Albert Einstein
“Science is
organized
knowledge.”
~Herbert Spencer
“Where is all the knowledge we
lost with information?”
~T. S. Eliot
(Source: Clarke and Rollo, Education and Training, 2001)
Problems of the post-genomic era
• Too much information!
• Too little knowledge
• Once you have data, how do you:
– Share it
– Manage it
– Use it
– Package it
– Translate it
– Apply it
– Turn it into knowledge!
"If HP knew what HP
knows, we'd be three
times more profitable."
~Lew Platt, Non-executive
Chairman of The Boeing
Company, former CEO of
Hewlett-Packard Company
BioMOBY and myGrid are
not the solution either!!
• Deal with data (aggregation) not
knowledge (organization)
• We have to take the next step
• Move from a data-centric architecture to a
knowledge-centric architecture
Occam’s Razor
“Pluralitas non est ponenda
sine necessitate.”
“Plurality should not be
posited without necessity."
“Why posit from simplicity
when the full complexity
could be available?”
Nosology: (Gr noso “disease” +-logy)
a classification or list of diseases
Ontology (Gr: “things which exist” +-logy)
An explicit formal specification of how to
represent the objects, concepts and other
entities that are assumed to exist in some
area of interest and the relationships that
hold among them.
Capturing and encoding
knowledge is hard!
(it is also research, no matter what others may tell you!)
• Requires extensive collaboration between
biomedical domain experts, and knowledge
management experts (ontologists)
• At least the tools and standards are now
becoming more stable…
• We also have a trail to follow!
Exemplary
Case
• Mission
– provide bioinformatics support and
integration of research initiatives to
the cancer research community.
• Works with intramural and extramural groups
to develop Initiative-Specific Modules
• Modules connected through intelligent
interfaces, coordinated through an NCI
Core Module (i.e. ontology) and deployed
through open source tools and systems
• NCICB serves as a focal point for cancer
research informatics planning worldwide
• On the downside
– The ontology is a bit monolithic
– Requires 12+ full-time personnel to maintain
– Monolithic ontologies become quite fragile…
• OBO and TAMBIS have shown the power of
lightweight, modular, orthogonal ontologies
• This may be a better solution…??
Duplicating NCI’s success
• We need something like this for the
cardiovascular sciences
• How can we duplicate the caCORE
success story with fewer resources?
CardioSHARE
Cardiovascular Semantic
Health And Research
Environment
Wilkinson & McManus
Grant Proposal to Genome Canada, 2004
(Source: Clarke and Rollo, Education and Training, 2001)
CardioSHARE architecture: Increasingly complex ontological
layers organize data into richer concepts, even hypotheses
[Figure labels: Hypothesis; Ischemia; Hypertension; Blood Pressure; BioMOBY & Semantic Web “agents”; Database 1; Database 2; Database 3]
Friends and Participants
Bruce McManus – iCAPTURE Centre, UBC
Lincoln Stein - CSHL
Damian Gessler, Andrew Farmer, Gary Schiltz - NCGR
Bill Crosby, Matthew Links, Luke McCarthy – U of S
Martin Senger – myGrid @ EBI
Heiko Schoof, Rebecca Ernst – MIPS
Lukas Mueller – formerly at TAIR
Midori Harris – GO Consortium
Mike Niemi – IBM
Fiona Cunningham, Shuly Avraham – CSHL
Ken Stuebe – SDSC
Carole Goble, Phillip Lord – myGrid @ U Manchester
Funding and equipment donations from:
Genome Canada/Genome Prairie, Canada
National Science Foundation (NSF), USA
Canadian Bioinformatics Resource, NRC, Halifax
Open-Bio Foundation
IBM