Structuring the life science resourceome for Semantic Systems Biology: lessons from the

advertisement
Structuring the life science
resourceome for Semantic Systems
Biology: lessons from the
BioGateway project
Erick Antezana
Dept. of Plant Systems Biology
VIB/University of Ghent, BELGIUM
erick.antezana@psb.ugent.be
1
Contents
1.
2.
3.
4.
5.
6.
Semantic Systems Biology
Data integration and exploitation
BioGateway
Concluding remarks
Discussion
Next steps
2
The four steps of Systems Biology
1.
Define all of the components of the system, build
model, simulate and predict
2.
Systematically perturb and monitor components of
the system
3.
Reconcile the experimentally observed responses
with those predicted by the model
4.
Design and perform new perturbation experiments to
distinguish between multiple or competing model
hypotheses
Kitano, Science, 2002
3
Mathematical model
Data analysis
Information extraction
New information to model
Model Refinement
Systems
Biology Cycle
Experimentation,
Data generation
Dynamical simulations and
hypothesis formulation
Experimental design
4
Semantic Knowledge Base
Information extraction,
Knowledge formalization
Consistency checking
Querying
Automated reasoning
Semantic
Systems
Biology Cycle
Experimentation,
Data generation
Hypothesis formulation
Experimental design
5
BioGateway
Some motivating questions
•
Cancer: what candidate genes are involved in
cell cycle control, S-phase to G2 transition,
DNA damage response and skin cancer?
•
Gastrin: what genes correlate with cancer and
the use of anti-acids, and are involved in the
gastrin response, and are associated with cell
cycle control?
•
Inflammation: give me genes that are
mentioned in the context of high carbohydrate
intake and play a role in process P and are
within S steps from a GO ontology term related
to inflammation
=> Keyword-based search gets OBSOLETE
6
BioGateway Design
• Quick query results: performance, choice:
“tuned” RDF (no OWL), 1 graph per resource
• Human “readable” output: labels, no IDs or
URI…
• Good practice:
– Standards (RDF) => orthogonality, …
– Representation issues (e.g. n-ary relations)
• Transitive closure: is_a, part_of
7
Transitive closure graphs
• A transitive closure was constructed for the subsumption
relation (is a) and the partonomy relation (part of)
• If A is a B, and B is a C, then A is a C is also added to the
graph.
• Many interesting queries can be done in a performant way
with it, like 'What are the proteins that are located in the
cell nucleus or any subpart thereof?'
• The graphs without transitive closure are available for
querying as well.
8
BioGateway
• Automatic pipeline
– Run on a regular basis (~3 months)
– Latest data available (from scratch)
• Uses Virtuoso Open Server
– Open Source software that can host a triple
store
– Can build this from RDF files
– Has a DB backend
• Supports SPARQL*
http://www.openlinksw.com/virtuoso/
*http://www.w3.org/TR/rdf-sparql-query/
9
BioGateway architecture
10
BioGateway graphs
• 1 Swiss-Prot file, the section of UniProt KB of proteins
• 1 NCBI file with the taxonomy of organisms
• 1 Metaonto file with information about OBO Foundry
ontologies
• 2 Metarel files with relation type properties
• 5 CCO files with integrated information about cell cycle
proteins
• 44 OBO Foundry files with diverse biomedical
information + Transitive Closure
• 51 Transitive Closure files to enhance query abilities
• 893 GOA files with GO annotations
BioGateway holds ~175 million triples!!!
11
BioGateway graphs
Each RDF-resource in BioGateway has a URI of this form:
http://www.semantic-systems-biology.org/SSB#resource_id
Each RDF-graph in BioGateway has a URI of this form:
http://www.semantic-systems-biology.org/graph_name
12
Sample RDF-ication: GOA
UniProtKB O03042 O03042 GO:0000287 GOA:spkw|GO_REF:0000004 IEA
“The protein O03042 (Ribulose bisphosphate carboxylase large chain) is
annotated with the GO term GO:0000287 (Magnesium Ion Binding), a term
in the Molecular Function subtree from GO. Therefore, O03042 has the
molecular function of binding magnesium ion. This fact is supported by IEA,
that is, Inferred from Electronic Annotation.”
13
A library of queries*
• The drop-down box contains (so far) 34 queries:
– 13 protein-centric biological queries:
•
•
•
•
•
The role of proteins in diseases
Their interactions
Their functions
Their locations
…
– 21 ontological queries:
• Browsing abilities in RDF like getting the neighborhood, the
path to the root, the children,...
• Meta-information about the ontologies, graphs, relations
• Queries to show the possibilities of SPARQL on BioGateway,
like counting, filtering, combining graphs,...
• …
14
* http://www.semantic-systems-biology.org/biogateway/querying
All the queries are explained in a tutorial*
For every query the name, the
parameters and the function are
indicated at the top.
The parameters are
indicated in red.
15
* http://www.semantic-systems-biology.org/biogateway/tutorial
http://www.semantic-systems-biology.org/sparql-viewer/
Limit
The SPARQL-endpoint
Execute
The prefixes
The query without the prefixes
The URI's in blue.
The results:
9 proteins
Labeled arrows
to extra 16
information
Conclusions / Results
• BioGateway: RDF store for Biosciences
• Data integration pipeline: BioGateway
• Queries and knowledge sources and system
design go hand-in-hand (user interaction)
• Still far from answering “complex” questions
• SW technologies add a new dimension of
Knowledge integration to Systems Biology
• Existing integration obstacles due to:
• diversity of data formats
• lack of formalization approaches
• Calls for ‘foundry’ type initiative for RDF
17
Next steps
• More data sources (e.g. Nutrigenomics, pathways
etc.)
• RDF rules (e.g. RuleML)
• A more user-friendly interface
• Reasoning
• OBO cross products
• …
18
Discussion
• W3C standards – limitations (e.g. spatiotemporal information, microarrays experiments)
• Biological identifiers: URIs, LSIDs, MIRIAM
URIs, etc. They should be scalable & resolvable.
• Lack of semantic content: poor axiomatisation,
inadequately codified. Use of standard
languages (e.g. RDF).
• Adequate tools: not adapted for real-size
problems (e.g. reasoning). Designed with a
universal architecture in mind.
19
SSB at a community level*
• Semantic bio-content: encourage and
facilitate
• Best practices for such creation (standards)
• Mechanism for identifying biological entities
• Bridge semantic technology developers and
life scientists (=the users)
WIKI: http://www.bio.ntnu.no/systemsbiology/ssbwiki/doku.php
20
Principles
1. Orthogonality
2. A “common language” (e.g. RDF)
3. Unique ID + resolution (e.g. purl.org)
4. Comply to: ULO (e.g. BFO), RO, …
5. Explicit semantics
6. Rich axiomatisation
7. Application-driven development (e.g. SB)
8. Peer review (community evaluation)
9. Tooling (e.g. visualisation)
10. Licensing (e.g. CC)
21
Acknowledgements
•
•
•
•
•
•
•
•
•
Martin Kuiper (NTNU, NO)
Vladimir Mironov (NTNU, NO)
Mikel Egaña (U Manchester, UK)
Robert Stevens (U Manchester, UK)
Ward Blonde (U Ghent, BE)
Bernard De Baets (U Ghent, BE)
Alan Ruttenberg (Science Commons, US)
Alistair Rutherford (www.netthreads.co.uk)
Users
http://ww.semantic-systems-biology.org
22
EXTRA slides
23
Metarel
• Metarel is a generic ontological hierarchy for
relation types, consistent with OBOF and RDF.
• It includes meta-information like transitivity,
reflexivity and composition.
• BioMetarel includes all the biological relation
types that are used in BioGateway.
• We are still testing the exploitation of
composition, like A located in B and B part of
C, gives A located in C.
24
The RDF export specifications
• The RDF is automatically generated with ontoperl, our own ontology API.
• Many choices for the RDF specifications were
made during the testing of the queries.
• The resources are available either as part of an
integrated graph or as individual graphs.
• BioMetarel, a relation ontology, provides labels
for the URIs of the relations.
• OWL(XML/RDF) was avoided because it is too
verbose. We preferred RDF optimized for
querying.
25
BioGateway
The homepage of SSB, including BioGateway as a
first step towards this idea.
26
Use the buttons for
prefixes and other
constructs
Type a query here.
Click Run!
27
Select a query in the
drop-down box
The query editor
Click on Run to execute
the query
28
Parameterizing the queries made
easy.
29
The neighborhood of the human protein 1443F in the RDF-graph
The resulting triples (arrows)
are represented as a small
grammatical sentence:
subject, predicate, object.
Outgoing arrows
Incoming arrows
30
998 RDF-files can be
downloaded from the
Resources page
The graph
names can
be used to
query or
combine
individual
graphs for
quicker
answers or
more
specific
information
31
The results
appear in a
separate
window
32
Download