Trevor Paterson and Andy Law
Roslin Institute, Scotland
Aims:
- To develop ‘enabling technologies’ for comparative genomics.
- To integrate disparate resources (genomic mapping, DNA sequence, evolutionary relationships, functional information) across species boundaries.
- To inform and expedite genomic mapping, particularly in non-model organisms.
Collaborators:
Farm animal, crop and microbial genomics; Bioinformatics; Computer Sciences; Statistics.
- Roslin Institute: Dr. Andy Law (project co-ordinator), Dr. Trevor Paterson
- EBI: Dr. Peter Rice, Tony Burdett
- IFR: Dr. Ian Roberts
- JIC: Dr. Jo Dicks (RA Dr. Robert Davey)
- Manchester: Dr. Robert Stevens, Dr. Andrew Gibson
- Newcastle (Maths & Stats): Dr. Darren Wilkinson, Dr. Richard Boys (RA Dr. Madhuchhanda Bhattacharjee)
- Newcastle (Computing Science): Dr. Neil Wipat, Dr. Matthew Pocock, Professor Paul Watson
- SCRI: Dr. David Marshall
DISPARATE GENOMIC MAPPING DATA
- for individual species
- multiple datatypes
- in many non-standard formats and databases
- archived in many locations, variety of access protocols
- data of variable quality and completeness
PLUS ONLINE BIOINFORMATICS RESOURCES
- DNA sequence and genome projects
- Gene structure and function
- Protein structure, family, function
- Evolutionary history, orthology, homology
- Phenotypes (genetic traits and diseases)
- Population genetics
- Gene expression patterns
- Publications
Current integration between datasources and across species is largely manual.
Why do biologists want to integrate mapping data across species?
What are they trying to do?
GOAL: MAP, IDENTIFY AND UNDERSTAND GENES BEHIND PHENOTYPES (i.e. DISEASES & TRAITS)
ComparaGrid aims to assist this process by exploiting existing mapping data across species boundaries.
UNDERLYING BIOLOGICAL PRINCIPLE BEHIND
CROSS-SPECIES MAP COMPARISON
Conservation of Synteny:
“Conservation of (blocks of) gene order throughout chromosomal evolution”
As species evolve and diverge, their chromosomes rearrange through duplications, inversions, translocations etc., but blocks of genes can be traced through evolutionary history between even relatively divergent species (e.g. chicken and man).
Therefore the known gene order in these blocks in one species can inform or predict the order of evolutionarily related genes (orthologues) in other species.
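The principle can be sketched in a few lines of Python. This is only an illustration, assuming each chromosome is available as an ordered list of gene names and that the orthologue pairs are already known; the function name `conserved_blocks`, the gene names and the data are hypothetical and are not part of ComparaGrid.

```python
def conserved_blocks(order_a, order_b, orthologues, min_len=2):
    """Find runs of genes in species A whose orthologues sit next to each
    other in species B, in the same or the reversed (inverted) orientation."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    # Project the species A gene order onto positions in species B,
    # skipping genes with no known orthologue.
    projected = [(g, pos_b[orthologues[g]]) for g in order_a
                 if orthologues.get(g) in pos_b]
    blocks, current, direction = [], [], 0
    for gene, pos in projected:
        if current:
            step = pos - current[-1][1]
            # Extend the block only while positions keep moving one step
            # at a time in a single direction.
            if step in (1, -1) and direction in (0, step):
                direction = step
                current.append((gene, pos))
                continue
            if len(current) >= min_len:
                blocks.append([g for g, _ in current])
        current, direction = [(gene, pos)], 0
    if len(current) >= min_len:
        blocks.append([g for g, _ in current])
    return blocks


# Hypothetical data: genes a1..a3 are inverted in species B, a4..a5 are not.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b3", "b2", "b1", "b4", "b5"]
orthologues = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}
blocks = conserved_blocks(order_a, order_b, orthologues)
```

Here `blocks` holds two conserved blocks, one inverted and one in the original orientation. A real analysis would also have to tolerate gaps, duplications and uncertain orthology, which this sketch ignores.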
[Figure: chromosomal evolution from an ancestral chromosome to modern species A and B. Labels: speciation event, breakage, duplicative inversion, inversion, species A’. Timeline: 20M years ago, 10M, now.]
[Figure: chromosomal evolution from an ancestral chromosome to modern species A and B, with the annotation: sequence similarity & conserved synteny => orthology. Labels: speciation event, breakage, duplicative inversion, inversion, species A’. Timeline: 20M years ago, 10M, now.]
COMPARATIVE GENOMICS USE CASE
Agribusiness wants to map the underlying genetic basis of the ‘Tasty Bacon’ trait (a QTL).
[Figure: QTL (genetic) map.]
COMPARATIVE GENOMICS USE CASE
The position of the QTL is correlated across various types of pig genetic map.
[Figure: ‘Tasty Bacon’ QTL map, linkage map, radiation hybrid map.]
COMPARATIVE GENOMICS USE CASE
There is a ‘known’ homology between a pig marker/sequence in this region and the human genome.
[Figure: columns Pig and Human; maps shown: QTL map, linkage map, radiation hybrid map, cytogenetic map. Annotation: DNA sequence similarity => homology =>? orthology.]
COMPARATIVE GENOMICS USE CASE
A physical map of BAC clones exists for this region of the human genome.
[Figure: columns Pig and Human; maps shown: QTL map, linkage map, radiation hybrid map, cytogenetic map, physical mapping (BAC1, BAC2, BAC3).]
COMPARATIVE GENOMICS USE CASE
There are known chicken expressed sequences homologous to human gene sequences in this region.
[Figure: columns Pig, Human and Chicken; maps shown: QTL map, linkage map, radiation hybrid map, cytogenetic map, physical mapping (BAC1, BAC2, BAC3); chicken sequences EST1 and EST2.]
COMPARATIVE GENOMICS USE CASE
Gene expression data for these chick ESTs might correlate with a trait similar to ‘tastiness’.
[Figure: columns Pig, Human and Chicken; maps shown: QTL map, linkage map, radiation hybrid map, cytogenetic map, physical mapping (BAC1, BAC2, BAC3); chicken sequences EST1 and EST2; expression analysis.]
COMPARATIVE GENOMICS USE CASE
The literature may detail functions of human genes in this region, and homologies to genes in other species, helping the researcher predict candidate genes in pigs responsible for tastiness.
[Figure: columns Pig, Human and Chicken; maps shown: QTL map, linkage map, radiation hybrid map, cytogenetic map, physical mapping (BAC1, BAC2, BAC3); chicken sequences EST1 and EST2; expression analysis; linked references.]
COMPARATIVE GENOMICS USE CASE:
HOW CAN WE AUTOMATE THIS?
- Provide architecture to link and traverse data sources: GRID / Web services
- Provide data standards to allow this: syntax and semantics of data
- Formalise the links between data: these relationships are data too, and they are what the biologists care about
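Treating the links as data makes the manual use case a graph traversal. The sketch below assumes the links have already been collected as (source, link type, target) triples; the identifiers and link types (`flankedBy`, `homologousTo`, `contains`, `hasExpressionData`) are invented for illustration and do not come from any ComparaGrid vocabulary.

```python
from collections import deque


def find_path(links, start, goal):
    """Breadth-first search over typed links. Returns the list of
    (source, link_type, target) steps from start to goal, or None."""
    graph = {}
    for src, kind, dst in links:
        graph.setdefault(src, []).append((kind, dst))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for kind, dst in graph.get(node, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(node, kind, dst)]))
    return None


# Hypothetical links mirroring the 'Tasty Bacon' use case.
links = [
    ("pig:TastyBaconQTL", "flankedBy", "pig:MarkerX"),
    ("pig:MarkerX", "homologousTo", "human:BAC2"),
    ("human:BAC2", "contains", "human:GeneY"),
    ("human:GeneY", "homologousTo", "chicken:EST1"),
    ("chicken:EST1", "hasExpressionData", "chicken:Expr1"),
]
path = find_path(links, "pig:TastyBaconQTL", "chicken:EST1")
link_types = [kind for _, kind, _ in path]
```

The returned path records which kind of relationship was crossed at each step, which is the information a biologist would need in order to judge how far to trust the result.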
WHAT DOES COMPARAGRID NEED TO INTEGRATE
DATASOURCES IN A BIOLOGICALLY
RELEVANT FASHION?
A lightweight Exchange Standard or a heavyweight
Ontology in OWL-DL?
1. Lightweight Mapping from RDB Schema to a standard
Minimally: a data exchange standard (defines structure and vocabulary for data exchange): XML Schema? RDF?
(A ‘straightforward’ mapping by data providers; the integration logic that handles the meaning of ‘relationships’ must be in the Application.)
WHAT DOES COMPARAGRID NEED TO INTEGRATE
DATASOURCES?
A lightweight Exchange Standard or a heavyweight
Ontology in OWL-DL?
2. More Heavyweight Mapping
Capturing the Semantics of the Data: a defined RDFS Vocabulary?
(The mapping is still quite lightweight; data is better defined and more reliably integrated; integration of data can be automatic; Applications can rely on the semantics.)
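What an application gains from an RDFS vocabulary can be shown with a toy example. The sketch below applies two standard RDFS entailment rules (type propagation through `rdfs:subClassOf` and statement propagation through `rdfs:subPropertyOf`) to plain Python tuples. The `cg:` terms and the pig identifiers are made up for illustration; a real system would use an RDF library rather than this hand-rolled loop.

```python
SUBCLASS, SUBPROP, TYPE = "rdfs:subClassOf", "rdfs:subPropertyOf", "rdf:type"


def rdfs_closure(triples):
    """Apply two RDFS entailment rules until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # (x type C), (C subClassOf D) => (x type D)
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # (x p y), (p subPropertyOf q) => (x q y)
                if p2 == SUBPROP and p == s2:
                    new.add((s, o2, o))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Hypothetical vocabulary and data.
vocabulary = {
    ("cg:LinkageMap", SUBCLASS, "cg:Map"),
    ("cg:isLinkageMapOf", SUBPROP, "cg:isMapOf"),
}
data = {
    ("pigLinkageMap7", TYPE, "cg:LinkageMap"),
    ("pigLinkageMap7", "cg:isLinkageMapOf", "pigChr7"),
}
inferred = rdfs_closure(vocabulary | data)
```

After the closure, a query for every `cg:Map` or every `cg:isMapOf` statement finds the pig linkage map, even though the data provider only used the more specific terms.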
WHAT DOES COMPARAGRID NEED TO INTEGRATE
DATASOURCES?
A lightweight Exchange Standard or a heavyweight
Ontology in OWL-DL?
3. Heavyweight Mapping
Semantically represent the Relationships between Data (and Relationships between Relationships…): Formal Ontology (OWL-DL)
(The mapping from datasource to Ontology is complex and specialist; automatic integration and inference are possible over data represented as individuals of the ontology.)
DO WE NEED YET ANOTHER ONTOLOGY?
• We think comparative genomics is very different from other biological knowledge domains…(SO, OBO, GO…)
• We need to integrate both abstract and physical data – experimental observations positioning ‘markers’ on abstract maps, and physical locations of ‘features’ on representations of DNA sequences
• Metadata is important – we need to treat mapping data as assertions that might be accepted or rejected on the basis of quality, provenance and trust
• We need to represent evolutionary relationships between mapped objects – these are also assertions, not facts, based on the relatedness of underlying physical objects (sequence similarity).
• Integration between datasources depends on accepting these evolutionary assertions!
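A minimal sketch of what treating mapping data as assertions could look like follows. The `Assertion` fields, the source names and the scores are assumptions made for illustration, not part of the ComparaGrid model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str
    relation: str
    obj: str
    source: str   # provenance: who made the claim (hypothetical labels)
    score: float  # quality measure, assumed normalised to the range 0..1


def accepted(assertions, trusted_sources, min_score):
    """Keep only the assertions the user is prepared to believe."""
    return [a for a in assertions
            if a.source in trusted_sources and a.score >= min_score]


# Hypothetical cross-species assertions.
assertions = [
    Assertion("pig:MarkerX", "orthologousTo", "human:GeneY", "curated", 0.9),
    Assertion("human:GeneY", "orthologousTo", "chicken:EST1", "blast_pipeline", 0.4),
]
strict = accepted(assertions, {"curated", "blast_pipeline"}, min_score=0.5)
lenient = accepted(assertions, {"curated", "blast_pipeline"}, min_score=0.3)
```

With the strict threshold only the pig to human link survives, so the chicken data cannot be reached; with the lenient threshold both links are available. The same data therefore integrates differently depending on which evolutionary assertions the user accepts.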
IDEALIZED COMPARAGRID ARCHITECTURE:
The OWL Ontology forms the 'semantic glue' to integrate data sources and express cross species queries.
The mapping between the data source schema and the integration schema (the CG OWL Ontology) is critical.
COMPARAGRID STACK ARCHITECTURE:
A publisher service automates mapping DB Schema to OWL
Bespoke mapping rules map from DB-OWL to CG-OWL
[Figure: the stack, with layers labelled raw data, syntax, semantics and aggregation; components labelled SQL, DB, CG, publisher service, transformer service and integrator.]
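The three stages can be mimicked with toy functions. This is a sketch of the idea only, not the ComparaGrid services: the functions `publish`, `transform` and `integrate`, the `db:` and `cg:` terms, and the two example schemas are all hypothetical.

```python
def publish(table, key, rows):
    """Publisher: mechanically turn rows into triples named after the schema."""
    triples = set()
    for row in rows:
        subject = f"{table}/{row[key]}"
        triples.add((subject, "rdf:type", f"db:{table}"))
        for column, value in row.items():
            if column != key:
                triples.add((subject, f"db:{table}.{column}", value))
    return triples


def transform(triples, rules):
    """Transformer: rewrite schema-level terms into shared terms
    using bespoke, per-database mapping rules."""
    return {(s, rules.get(p, p), rules.get(o, o)) for s, p, o in triples}


def integrate(*sources):
    """Integrator: aggregate triples that now share one vocabulary."""
    return set().union(*sources)


# Two hypothetical databases with different schemas for the same idea.
pig = publish("marker", "id", [{"id": "SW1", "chromosome": "7"}])
chick = publish("locus", "acc", [{"acc": "EST1", "chr": "5"}])
pig_rules = {"db:marker": "cg:Marker", "db:marker.chromosome": "cg:isLocatedOn"}
chick_rules = {"db:locus": "cg:Marker", "db:locus.chr": "cg:isLocatedOn"}
integrated = integrate(transform(pig, pig_rules), transform(chick, chick_rules))
located = {s for s, p, o in integrated if p == "cg:isLocatedOn"}
```

Only the `transform` step needs knowledge of each database; once its rules exist, a single query over `cg:isLocatedOn` covers both sources.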
BUILDING THE COMPARAGRID ONTOLOGY
Stage I (Biologists & Bioinformaticians input)
• Define the Scope of the Domain
• Collect the terminology used in the Domain
• Interview practising experts
• Document some use cases
• Observe how the experts perform an analysis
• Define the terms and relationships necessary
• Model the knowledge domain
OUTPUTS:
a model of the knowledge domain
a prototype ontology (in OWL-DL): terms and relationships necessary to represent the data and the relationships between data
(Using Protégé).
BUILDING THE COMPARAGRID ONTOLOGY
Stage II (Biologists, Bioinformaticians, Ontologists)
• Hold workshops for panels of experts across the scope of the domain (animal, plant, microbe).
• Confirm the Concepts and Relationships that are required.
• Confirm our model of the knowledge domain.
• Iterate and refine the prototype ontology representing this model.
OUTPUT: version 1 prototype ComparaGrid OWL Ontology
HIERARCHY OF CONCEPTS IN THE
COMPARAGRID ONTOLOGY
COMPARAGRID ONTOLOGY:
Simple Relationships = Properties
[Figure: hierarchy of Object to Object Properties.]
[Figure: hierarchy of Object to Value Properties.]
In OWL-DL complex relationships can be modelled as Concepts
Simple RDF Statement Representation of a Relationship
[Diagram: two DomainConcepts, Map and Chromosome, joined by the property isMapOf.]
Richer Representation as OWL Class
[Diagram: isMapOf modelled as a class (a UnidirectionalRelationship) linked to the DomainConcepts Map and Chromosome by the properties relatesFrom and relatesTo, and to the DomainConcept Citation by the property hasEvidence; Citation has the value property identifier (a string of characters).]
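The difference between the two representations can be sketched with plain tuples. The property names `relatesFrom`, `relatesTo`, `hasEvidence` and `identifier` are taken from the diagram; the direction chosen here (`relatesFrom` for the subject, `relatesTo` for the object), the individual names and the helper functions are assumptions.

```python
def reify(statement, rel_id, evidence=None):
    """Expand one (subject, property, object) statement into a relationship
    individual that can carry its own properties, such as evidence."""
    s, p, o = statement
    triples = {
        (rel_id, "rdf:type", p),
        (rel_id, "relatesFrom", s),  # assumed direction: from the subject
        (rel_id, "relatesTo", o),    # assumed direction: to the object
    }
    if evidence is not None:
        triples.add((rel_id, "hasEvidence", evidence))
    return triples


def collapse(triples, rel_id):
    """Recover the simple (subject, property, object) view of a relationship."""
    facts = {p: o for s, p, o in triples if s == rel_id}
    return (facts["relatesFrom"], facts["rdf:type"], facts["relatesTo"])


# Hypothetical individuals.
simple = ("Map_1", "isMapOf", "Chromosome_7")
rich = reify(simple, "rel_1", evidence="Citation_1")
rich.add(("Citation_1", "identifier", "example-citation-id"))
```

The simple statement has nowhere to attach a citation, whereas the richer form gives the relationship an identity of its own (`rel_1`) at the cost of four triples instead of one; `collapse(rich, "rel_1")` recovers the original statement when the extra detail is not needed.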
The Importance of Relationships
Biologists and Bioinformaticians see an important conceptual difference between:
The ‘nuts and bolts’ relationships within the data
(‘EXPERIMENTAL OBSERVATIONS’ and ‘FACTS’)
Vs
The biological hypotheses (‘ASSERTIONS’)
Hopefully the richness and expressivity of OWL-DL will give us the opportunity to capture the subtleties of the different types of relationships and how they may relate to each other.
Critically we want to infer over the data represented as individuals – not merely over properties of the ontology
COMPARAGRID ONTOLOGY:
Complex Relationships (as Concepts)
BUILDING THE COMPARAGRID ONTOLOGY
Stage III (Expert Ontologists)
• Refactor the prototype ontology according to good design principles
• Build a core upper-level comparative mapping domain ontology that will integrate with other domains
• Incorporate additional modules to represent specific subdomains (Genetic Variation, Abstract Mapping Concepts, Evidence, Evolutionary Relationships etc.)
OUTPUT: modularised ComparaGrid OWL Ontology
THE MODULARISED COMPARAGRID ONTOLOGY
BUILDING THE COMPARAGRID ONTOLOGY
Timescale
• Stage I : 6 months
• Stage II : 6 months
• Stage III : ongoing / 3 years
Problem: how do we develop the architecture and software when we don’t have a final Ontology or model?
• Use the Prototype version?
• Use small hack ontologies for demonstration data?
But can we be sure the principles will work for the final larger, more complex Ontology?
USING THE COMPARAGRID ONTOLOGY:
Querying distributed resources through the
ComparaGrid Stack Architecture
• Tools for converting DB schema to OWL ontology
• Under Development…
• Automatic query translations up and down the stack
• Allows queries to be expressed and resolved in OWL
– should allow automated reasoning and inference
Roslin Ark Database’s experience as Data
Providers (and Biologists/Users)
• We want to export and import data in a reusable format
• We could build all our own applications using a common data format, allowing us to traverse data sets according to assertions made between the data.
• …but we want to use ComparaGrid’s ‘clever’ integration and query through OWL
• i.e. we want to exchange data as OWL, so we have to incorporate mapping from schema to OWL into our service architecture
Roslin Ark Database’s experience as Data
Providers
Problems:
• We are waiting for the ‘final’ ontology
• We are waiting for the stack architecture
(…which is waiting for the ontology)
• The ComparaGrid Architecture/Toolset is being designed to map from DB schema to OWL, but our DB schema captures none of our domain model; our mapping should be from Object model to OWL
• We have to implement our own mapping to OWL
• We want to progress and ACTUALLY DO SOME BIOLOGY!
[Figure: ArkDB service architecture. Labels: ArkIIDB schema (object table, relationship table); iBATIS; object model (Java objects); web app; drawing applet; Java application; download app; Betwixt / XSLT; CG XML/RDF; CG-OWL; RDFS vocabulary; web service.]
ComparaGrid Ontology:
Where are we at…and Why?
• Prototype OWL Ontology created:
- used to demonstrate mapping of ArkDB to Webservices.
- Ontology is flabby and poorly designed?
- Mapping from Java to OWL/XML is a cumbersome/manual process.
• Refactoring/modularising the ComparaGrid OWL Ontology is non-trivial (a Research Project in its own right!), so we are not able to use a ‘final’ ontology to drive the development of services.
• Until we have a working common data format or ontology we can’t start to import and export further datasources.
ComparaGrid Ontology:
Where are we at…and Why?
• Implementation of the ComparaGrid stack integration and query architecture is ongoing.
• Automated / assisted mapping tools are under development (DB relational schema → DB-OWL → CG-OWL). [Using hack ontology fragments in the interim.]
• We need further tools to support mapping from any ad hoc database or object model to OWL.
ComparaGrid Ontology:
Where are we at…and Why?
• As data providers, Roslin ArkDB is dependent on the tools and infrastructure being developed by ComparaGrid, without knowing how much added value an ontology will give.
• We hope that the ontology will allow us to represent the ‘interesting’ biological relationships
• That it will facilitate automated integration and data traversal
• That it will allow inference of new knowledge automatically
• However, the burden is put on the data mapping process – a more lightweight approach (e.g. RDF/RDFS) would simplify this, but might require that applications understand the context of information sources.
• RDF(S) is becoming quite well supported – and allows some inference over semantic relationships.
WOULD IT BE GOOD ENOUGH FOR US?