Structuring the life science resourceome for Semantic Systems Biology: lessons from the BioGateway project Erick Antezana Dept. of Plant Systems Biology VIB/University of Ghent, BELGIUM erick.antezana@psb.ugent.be 1 Contents 1. 2. 3. 4. 5. 6. Semantic Systems Biology Data integration and exploitation BioGateway Concluding remarks Discussion Next steps 2 The four steps of Systems Biology 1. Define all of the components of the system, build model, simulate and predict 2. Systematically perturb and monitor components of the system 3. Reconcile the experimentally observed responses with those predicted by the model 4. Design and perform new perturbation experiments to distinguish between multiple or competing model hypotheses Kitano, Science, 2002 3 Mathematical model Data analysis Information extraction New information to model Model Refinement Systems Biology Cycle Experimentation, Data generation Dynamical simulations and hypothesis formulation Experimental design 4 Semantic Knowledge Base Information extraction, Knowledge formalization Consistency checking Querying Automated reasoning Semantic Systems Biology Cycle Experimentation, Data generation Hypothesis formulation Experimental design 5 BioGateway Some motivating questions • Cancer: what candidate genes are involved in cell cycle control, S-phase to G2 transition, DNA damage response and skin cancer? • Gastrin: what genes correlate with cancer and the use of anti-acids, and are involved in the gastrin response, and are associated with cell cycle control? • Inflammation: give me genes that are mentioned in the context of high carbohydrate intake and play a role in process P and are within S steps from a GO ontology term related to inflammation => Keyword-based search gets OBSOLETE 6 BioGateway Design • Quick query results: performance, choice: “tuned” RDF (no OWL), 1 graph per resource • Human “readable” output: labels, no IDs or URI… • Good practice: – Standards (RDF) => orthogonality, … – Representation issues (e.g. n-ary relations) • Transitive closure: is_a, part_of 7 Transitive closure graphs • A transitive closure was constructed for the subsumption relation (is a) and the partonomy relation (part of) • If A is a B, and B is a C, then A is a C is also added to the graph. • Many interesting queries can be done in a performant way with it, like 'What are the proteins that are located in the cell nucleus or any subpart thereof?' • The graphs without transitive closure are available for querying as well. 8 BioGateway • Automatic pipeline – Run on a regular basis (~3 months) – Latest data available (from scratch) • Uses Virtuoso Open Server – Open Source software that can host a triple store – Can build this from RDF files – Has a DB backend • Supports SPARQL* http://www.openlinksw.com/virtuoso/ *http://www.w3.org/TR/rdf-sparql-query/ 9 BioGateway architecture 10 BioGateway graphs • 1 Swiss-Prot file, the section of UniProt KB of proteins • 1 NCBI file with the taxonomy of organisms • 1 Metaonto file with information about OBO Foundry ontologies • 2 Metarel files with relation type properties • 5 CCO files with integrated information about cell cycle proteins • 44 OBO Foundry files with diverse biomedical information + Transitive Closure • 51 Transitive Closure files to enhance query abilities • 893 GOA files with GO annotations BioGateway holds ~175 million triples!!! 11 BioGateway graphs Each RDF-resource in BioGateway has a URI of this form: http://www.semantic-systems-biology.org/SSB#resource_id Each RDF-graph in BioGateway has a URI of this form: http://www.semantic-systems-biology.org/graph_name 12 Sample RDF-ication: GOA UniProtKB O03042 O03042 GO:0000287 GOA:spkw|GO_REF:0000004 IEA “The protein O03042 (Ribulose bisphosphate carboxylase large chain) is annotated with the GO term GO:0000287 (Magnesium Ion Binding), a term in the Molecular Function subtree from GO. Therefore, O03042 has the molecular function of binding magnesium ion. This fact is supported by IEA, that is, Inferred from Electronic Annotation.” 13 A library of queries* • The drop-down box contains (so far) 34 queries: – 13 protein-centric biological queries: • • • • • The role of proteins in diseases Their interactions Their functions Their locations … – 21 ontological queries: • Browsing abilities in RDF like getting the neighborhood, the path to the root, the children,... • Meta-information about the ontologies, graphs, relations • Queries to show the possibilities of SPARQL on BioGateway, like counting, filtering, combining graphs,... • … 14 * http://www.semantic-systems-biology.org/biogateway/querying All the queries are explained in a tutorial* For every query the name, the parameters and the function are indicated at the top. The parameters are indicated in red. 15 * http://www.semantic-systems-biology.org/biogateway/tutorial http://www.semantic-systems-biology.org/sparql-viewer/ Limit The SPARQL-endpoint Execute The prefixes The query without the prefixes The URI's in blue. The results: 9 proteins Labeled arrows to extra 16 information Conclusions / Results • BioGateway: RDF store for Biosciences • Data integration pipeline: BioGateway • Queries and knowledge sources and system design go hand-in-hand (user interaction) • Still far from answering “complex” questions • SW technologies add a new dimension of Knowledge integration to Systems Biology • Existing integration obstacles due to: • diversity of data formats • lack of formalization approaches • Calls for ‘foundry’ type initiative for RDF 17 Next steps • More data sources (e.g. Nutrigenomics, pathways etc.) • RDF rules (e.g. RuleML) • A more user-friendly interface • Reasoning • OBO cross products • … 18 Discussion • W3C standards – limitations (e.g. spatiotemporal information, microarrays experiments) • Biological identifiers: URIs, LSIDs, MIRIAM URIs, etc. They should be scalable & resolvable. • Lack of semantic content: poor axiomatisation, inadequately codified. Use of standard languages (e.g. RDF). • Adequate tools: not adapted for real-size problems (e.g. reasoning). Designed with a universal architecture in mind. 19 SSB at a community level* • Semantic bio-content: encourage and facilitate • Best practices for such creation (standards) • Mechanism for identifying biological entities • Bridge semantic technology developers and life scientists (=the users) WIKI: http://www.bio.ntnu.no/systemsbiology/ssbwiki/doku.php 20 Principles 1. Orthogonality 2. A “common language” (e.g. RDF) 3. Unique ID + resolution (e.g. purl.org) 4. Comply to: ULO (e.g. BFO), RO, … 5. Explicit semantics 6. Rich axiomatisation 7. Application-driven development (e.g. SB) 8. Peer review (community evaluation) 9. Tooling (e.g. visualisation) 10. Licensing (e.g. CC) 21 Acknowledgements • • • • • • • • • Martin Kuiper (NTNU, NO) Vladimir Mironov (NTNU, NO) Mikel Egaña (U Manchester, UK) Robert Stevens (U Manchester, UK) Ward Blonde (U Ghent, BE) Bernard De Baets (U Ghent, BE) Alan Ruttenberg (Science Commons, US) Alistair Rutherford (www.netthreads.co.uk) Users http://ww.semantic-systems-biology.org 22 EXTRA slides 23 Metarel • Metarel is a generic ontological hierarchy for relation types, consistent with OBOF and RDF. • It includes meta-information like transitivity, reflexivity and composition. • BioMetarel includes all the biological relation types that are used in BioGateway. • We are still testing the exploitation of composition, like A located in B and B part of C, gives A located in C. 24 The RDF export specifications • The RDF is automatically generated with ontoperl, our own ontology API. • Many choices for the RDF specifications were made during the testing of the queries. • The resources are available either as part of an integrated graph or as individual graphs. • BioMetarel, a relation ontology, provides labels for the URIs of the relations. • OWL(XML/RDF) was avoided because it is too verbose. We preferred RDF optimized for querying. 25 BioGateway The homepage of SSB, including BioGateway as a first step towards this idea. 26 Use the buttons for prefixes and other constructs Type a query here. Click Run! 27 Select a query in the drop-down box The query editor Click on Run to execute the query 28 Parameterizing the queries made easy. 29 The neighborhood of the human protein 1443F in the RDF-graph The resulting triples (arrows) are represented as a small grammatical sentence: subject, predicate, object. Outgoing arrows Incoming arrows 30 998 RDF-files can be downloaded from the Resources page The graph names can be used to query or combine individual graphs for quicker answers or more specific information 31 The results appear in a separate window 32