The Development of an Ontology for Data Integration and Query in Comparative Genomics.

Trevor Paterson and Andy Law

Roslin Institute, Scotland

Aims:

- To develop ‘enabling technologies’ for comparative genomics.

- To integrate disparate resources (genomic mapping, DNA sequence, evolutionary relationships, functional information) across species boundaries.

- To inform and expedite genomic mapping, particularly in non-model organisms.

Collaborators:

Farm animal, crop and microbial genomics; Bioinformatics; Computer Sciences; Statistics.

- Roslin Institute: Dr. Andy Law (project co-ordinator), Dr. Trevor Paterson

- EBI: Dr. Peter Rice, Tony Burdett

- IFR: Dr. Ian Roberts

- JIC: Dr. Jo Dicks (RA Dr. Robert Davey)

- Manchester: Dr. Robert Stevens, Dr. Andrew Gibson

- Newcastle (Maths & Stats): Dr. Darren Wilkinson, Dr. Richard Boys (RA Dr. Madhuchhanda Bhattacharjee)

- Newcastle (Computing Science): Dr. Neil Wipat, Dr. Matthew Pocock, Professor Paul Watson

- SCRI: Dr. David Marshall

DISPARATE GENOMIC MAPPING DATA

- for individual species

- multiple datatypes

- in many non-standard formats and databases

- archived in many locations, variety of access protocols

- data of variable quality and completeness

PLUS ONLINE BIOINFORMATICS RESOURCES

- DNA sequence and genome projects

- Gene structure and function

- Protein structure, family, function

- Evolutionary history, orthology, homology

- Phenotypes (genetic traits and diseases)

- Population genetics

- Gene expression patterns

- Publications

Current integration between datasources and across species is largely manual.

i.e. difficult, error-prone and very inefficient.

Why do Biologists want to integrate mapping data across species…?

What are they trying to do..?

GOAL: MAP, IDENTIFY AND UNDERSTAND GENES BEHIND PHENOTYPES (i.e. DISEASES & TRAITS)

ComparaGrid aims to assist this process by exploiting existing mapping data across species boundaries.

UNDERLYING BIOLOGICAL PRINCIPLE BEHIND CROSS-SPECIES MAP COMPARISON

Conservation of Synteny:

“Conservation of (blocks of) gene order throughout chromosomal evolution”

As species evolve and diverge, their chromosomes rearrange through duplications, inversions, translocations etc., but blocks of genes can be traced through evolutionary history between even relatively divergent species (e.g. chicken and man).

Therefore the known gene order in these blocks in one species can inform or predict the order of evolutionarily related genes (orthologues) in other species.
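As an illustration of the principle, the following minimal sketch finds runs of genes in one species whose orthologues sit next to each other in a second species. It assumes a one-to-one orthologue table and simple linear gene orders; the function name `conserved_blocks` and all gene identifiers are hypothetical and are not part of ComparaGrid.

```python
from typing import Dict, List


def conserved_blocks(order_a: List[str], order_b: List[str],
                     orthologue: Dict[str, str], min_len: int = 2) -> List[List[str]]:
    """Return runs of genes in species A whose orthologues are adjacent in species B.

    Adjacent means consecutive positions in order_b, in either orientation,
    so an inverted block still counts as conserved.
    Assumes the orthologue table is one-to-one.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks: List[List[str]] = []
    current: List[str] = []
    prev_pos = None
    for gene in order_a:
        pos = pos_b.get(orthologue.get(gene))
        if pos is not None and prev_pos is not None and abs(pos - prev_pos) == 1:
            current.append(gene)          # extends the current block
        else:
            if len(current) >= min_len:   # close the previous block
                blocks.append(current)
            current = [gene] if pos is not None else []
        prev_pos = pos
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


# Invented example: genes a1..a3 are inverted in species B, a4..a5 moved elsewhere.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b3", "b2", "b1", "b9", "b5", "b4"]
orthologue = {f"a{i}": f"b{i}" for i in range(1, 6)}
blocks = conserved_blocks(order_a, order_b, orthologue)
```

In this toy example the inverted run a1..a3 and the pair a4..a5 are reported as two separate blocks, which is the kind of evidence used to predict gene order in a less well mapped species.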

[Figure: chromosomal evolution along a timeline (20M years ago, 10M years ago, now). Labels: ancestral chromosome, speciation event, breakage, duplicative inversion, inversion, modern species A, A’ and B.]

[Figure, repeated with annotation: Sequence Similarity & Conserved Synteny => Orthology HYPOTHESIS.]

COMPARATIVE GENOMICS USE CASE

Agribusiness wants to map the underlying genetic basis of the ‘Tasty Bacon’ Trait (a QTL).

[Figure: QTL (Genetic) Map]

COMPARATIVE GENOMICS USE CASE

The position of the QTL is correlated on various types of Pig Genetic maps.

[Figure labels: Tasty Bacon QTL Map, Linkage Map, Radiation Hybrid Map]

COMPARATIVE GENOMICS USE CASE

There is a ‘known’ homology between a Pig Marker/Sequence in this region and the human genome

[Figure: Pig and Human columns. Labels: QTL Map, Linkage Map, Radiation Hybrid Map, Cytogenetic Map; DNA Sequence Similarity => Homology =>? Orthology]

COMPARATIVE GENOMICS USE CASE

A Physical Map of BAC clones exists for this region of the Human Genome.

[Figure: Pig and Human columns. Labels: QTL Map, Linkage Map, Radiation Hybrid Map, Cytogenetic Map, Physical Mapping (BAC1, BAC2, BAC3)]

COMPARATIVE GENOMICS USE CASE

There are known chicken expressed sequences homologous to Human Gene Sequences in this region.

[Figure: Pig, Human and Chicken columns. Labels: QTL Map, Linkage Map, Radiation Hybrid Map, Cytogenetic Map, Physical Mapping (BAC1, BAC2, BAC3), EST1, EST2]

COMPARATIVE GENOMICS USE CASE

Gene expression Data for these Chick ESTs might correlate with a trait similar to ‘Tastiness’

[Figure: Pig, Human and Chicken columns. Labels: QTL Map, Linkage Map, Radiation Hybrid Map, Cytogenetic Map, Physical Mapping (BAC1, BAC2, BAC3), EST1, EST2, Expression Analysis]

COMPARATIVE GENOMICS USE CASE

The literature may detail Functions of Human genes in this region, and homologies to genes in other species – helping the researcher predict candidate genes in Pigs responsible for tastiness

[Figure: Pig, Human and Chicken columns. Labels: QTL Map, Linkage Map, Radiation Hybrid Map, Cytogenetic Map, Physical Mapping (BAC1, BAC2, BAC3), EST1, EST2, Expression Analysis, Linked References]
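The manual steps of this use case amount to following links from one data item to the next. A minimal sketch of that traversal, assuming the links have already been collected as (source, relationship, target) records, is given below. All identifiers and relationship names are invented for illustration.

```python
from collections import deque

# Hypothetical links, each (source, relationship, target); every name is invented.
LINKS = [
    ("pig:QTL_tasty_bacon", "locatedOn", "pig:linkage_map_region"),
    ("pig:linkage_map_region", "containsMarker", "pig:marker_M1"),
    ("pig:marker_M1", "homologousTo", "human:sequence_S1"),
    ("human:sequence_S1", "containedIn", "human:BAC2"),
    ("human:BAC2", "containsGene", "human:gene_G1"),
    ("chicken:EST1", "homologousTo", "human:gene_G1"),
]


def find_path(start, goal, links):
    """Breadth-first search over the links, traversable in both directions.

    Returns a list alternating node, relationship, node, ... or None.
    """
    neighbours = {}
    for source, rel, target in links:
        neighbours.setdefault(source, []).append((rel, target))
        neighbours.setdefault(target, []).append((rel + "^-1", source))
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for rel, nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [rel, nxt])
    return None


path = find_path("pig:QTL_tasty_bacon", "chicken:EST1", LINKS)
```

The returned path records which relationship was used at every step, so a biologist can inspect exactly which homology or mapping assertions the result depends on.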

COMPARATIVE GENOMICS USE CASE:

HOW CAN WE AUTOMATE THIS?

Provide Architecture to Link and Traverse Data Sources…

=> GRID / Web services

Provide Data Standards to allow this

=> Syntax and Semantics of Data

Formalise the Links between Data:

=> these Relationships are Data too

=> these are what the Biologists care about

WHAT DOES COMPARAGRID NEED TO INTEGRATE DATASOURCES IN A BIOLOGICALLY RELEVANT FASHION?

A lightweight Exchange Standard or a heavyweight Ontology in OWL-DL?

1. Lightweight Mapping from RDB Schema to standard

Minimally: a data exchange standard (defines structure and vocabulary for data exchange):

=> XML Schema? RDF?

(a ‘straightforward’ mapping by data providers; integration logic handling the meaning of ‘relationships’ must be in the Application)
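A minimal sketch of option 1, assuming a purely structural exchange record: the provider serialises a marker position as XML, and any understanding of what the record means stays in the consuming application. The element and attribute names (`marker`, `mapLocation`, `unit`) are hypothetical and do not come from any ComparaGrid schema.

```python
import xml.etree.ElementTree as ET


def marker_to_xml(name, species, map_name, position_cm):
    """Serialise one mapped marker as a small XML record (structure only)."""
    marker = ET.Element("marker", {"name": name, "species": species})
    location = ET.SubElement(marker, "mapLocation", {"map": map_name, "unit": "cM"})
    location.text = str(position_cm)
    return ET.tostring(marker, encoding="unicode")


def read_position(xml_text):
    """Consumer side: the application must already know what 'mapLocation' means."""
    root = ET.fromstring(xml_text)
    return float(root.find("mapLocation").text)


xml_text = marker_to_xml("M1", "pig", "linkage_map_1", 12.5)
position = read_position(xml_text)
```

Nothing in the record says how a pig marker relates to a human sequence; that logic would have to be written into each application.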

WHAT DOES COMPARAGRID NEED TO INTEGRATE DATASOURCES?

A lightweight Exchange Standard or a heavyweight Ontology in OWL-DL?

2. More Heavyweight Mapping

Capturing the Semantics of the Data

=> Defined RDFS Vocabulary?

(mapping still quite lightweight; data is better defined and more reliably integrated; integration of data can be automatic; Applications can rely on semantics)
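To show what ‘relying on semantics’ could look like with an RDFS vocabulary, here is a minimal sketch that applies two standard RDFS entailment rules (sub-property and sub-class) to a set of triples. It is written in plain Python rather than with an RDF library, and the `cg:` terms are invented placeholders, not the real ComparaGrid vocabulary.

```python
def rdfs_closure(triples):
    """Apply two RDFS entailment rules until no new triples appear.

    rdfs7: (p subPropertyOf q) and (s p o)   => (s q o)
    rdfs9: (C subClassOf D) and (x type C)   => (x type D)
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p2 == "rdfs:subPropertyOf" and s2 == p:
                    new.add((s, o2, o))
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Hypothetical vocabulary and data.
triples = {
    ("cg:isOrthologueOf", "rdfs:subPropertyOf", "cg:isHomologueOf"),
    ("cg:LinkageMap", "rdfs:subClassOf", "cg:Map"),
    ("pig:map1", "rdf:type", "cg:LinkageMap"),
    ("pig:geneA", "cg:isOrthologueOf", "human:geneA"),
}
closure = rdfs_closure(triples)
```

An application that asks for all homologues, or for all maps, then finds the orthologue statement and the linkage map without any provider-specific code.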

WHAT DOES COMPARAGRID NEED TO INTEGRATE DATASOURCES?

A lightweight Exchange Standard or a heavyweight Ontology in OWL-DL?

3. Heavyweight Mapping

Semantically represent the Relationships between Data (and Relationships between Relationships…):

=> Formal Ontology (OWL-DL)

(mapping from datasource to Ontology is complex and specialist; automatic integration and inference is possible over data represented as individuals of the ontology)

DO WE NEED YET ANOTHER ONTOLOGY?

• We think comparative genomics is very different from other biological knowledge domains…(SO, OBO, GO…)

• We need to integrate both abstract and physical data – experimental observations positioning ‘markers’ on abstract maps, and physical locations of ‘features’ on representations of DNA sequences

• Metadata is important – we need to treat mapping data as assertions that might be accepted or rejected on the basis of quality, provenance and trust

• We need to represent evolutionary relationships between mapped objects – these are also assertions, not facts, based on the relatedness of underlying physical objects (sequence similarity).

• Integration between datasources depends on accepting these evolutionary assertions!
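A minimal sketch of treating mapping and evolutionary statements as assertions with metadata, assuming each assertion carries a source and a numeric quality score. The field names, the score scale and the acceptance rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str
    relation: str
    obj: str
    source: str    # provenance: who made the assertion
    score: float   # quality measure in [0, 1]; the scale is an assumption


def accepted(assertions, trusted_sources, min_score):
    """Keep only assertions from trusted sources that meet the quality threshold."""
    return [a for a in assertions
            if a.source in trusted_sources and a.score >= min_score]


# Invented example data.
assertions = [
    Assertion("pig:geneA", "isOrthologueOf", "human:geneA", "curated_db", 0.95),
    Assertion("pig:geneB", "isOrthologueOf", "human:geneB", "auto_blast", 0.60),
    Assertion("pig:geneC", "isOrthologueOf", "human:geneC", "curated_db", 0.40),
]
usable = accepted(assertions, trusted_sources={"curated_db"}, min_score=0.5)
```

Only the accepted assertions would be offered to the integration layer, so two users with different trust settings can legitimately see different cross-species links.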

IDEALIZED COMPARAGRID ARCHITECTURE:

The OWL Ontology forms the 'semantic glue' to integrate data sources and express cross species queries.

The mapping between the data source schema and the integration schema (the CG OWL Ontology) is critical.

COMPARAGRID STACK ARCHITECTURE:

A publisher service automates mapping DB Schema to OWL

Bespoke mapping rules map from DB-OWL to CG-OWL

[Figure: stack layers (Raw data, Syntax, Semantics, Aggregation). Labels: SQL, DB, CG, Raw data, Publisher service, Transformer service, Integrator]
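The three services can be sketched as three small functions. This is only an illustration under strong simplifying assumptions: database rows are plain dictionaries, ‘DB-OWL’ and ‘CG-OWL’ are reduced to lists of triples, and the bespoke mapping rules are a property-renaming table. None of the function names or `cg:` properties come from the ComparaGrid software.

```python
def publish(source, table, rows):
    """Publisher: turn each row into DB-level triples named after table and columns."""
    triples = []
    for row in rows:
        subject = f"{source}:{table}/{row['id']}"
        for column, value in row.items():
            if column != "id":
                triples.append((subject, f"{table}.{column}", value))
    return triples


def transform(triples, rules):
    """Transformer: rename DB-level properties to shared-ontology properties."""
    return [(s, rules[p], o) for s, p, o in triples if p in rules]


def integrate(*sources):
    """Integrator: aggregate triples from several sources, dropping duplicates."""
    merged = set()
    for triples in sources:
        merged.update(triples)
    return merged


# Invented row and hypothetical mapping rules.
db_triples = publish("pig", "marker", [{"id": 1, "name": "M1", "pos_cm": 12.5}])
rules = {"marker.name": "cg:hasName", "marker.pos_cm": "cg:hasPosition"}
cg_triples = transform(db_triples, rules)
merged = integrate(cg_triples, cg_triples)
```

The publishing step is mechanical, while the rule table is where a data provider has to encode what their columns actually mean.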

BUILDING THE COMPARAGRID ONTOLOGY

Stage I (Biologists & Bioinformaticians input)

• Define the Scope of the Domain

• Collect the terminology used in the Domain

• Interview practising experts

• Document some use cases

• Observe how the experts perform an analysis

• Define the terms and relationships necessary

• Model the knowledge domain

OUTPUTS:

a model of the knowledge domain

a prototype ontology (in OWL-DL): terms and relationships necessary to represent the data and the relationships between data

(Using Protégé).

BUILDING THE COMPARAGRID ONTOLOGY

Stage II (Biologists, Bioinformaticians, Ontologists)

• Hold workshops for panels of experts across the scope of the domain (animal, plant, microbe).

• Confirm the Concepts and Relationships that are required.

• Confirm our model of the knowledge domain.

• Iterate and refine the prototype ontology representing this model.

OUTPUT: version 1 prototype ComparaGrid OWL Ontology

HIERARCHY OF CONCEPTS IN THE COMPARAGRID ONTOLOGY

COMPARAGRID ONTOLOGY:

Simple Relationships = Properties

Hierarchy of Object to Object Properties

Hierarchy of Object to Value Properties

In OWL-DL complex relationships can be modelled as Concepts

Simple RDF Statement Representation of a Relationship:

Map (DomainConcept) -- isMapOf (property) --> Chromosome (DomainConcept)

Richer Representation as OWL Class:

isMapOf (Unidirectional Relationship) -- relatesFrom (property) --> Map (DomainConcept)

isMapOf (Unidirectional Relationship) -- relatesTo (property) --> Chromosome (DomainConcept)

isMapOf (Unidirectional Relationship) -- hasEvidence (property) --> Citation (DomainConcept)

Citation (DomainConcept) -- identifier (property) --> <String of Characters> (Value)
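The difference between the two representations can be sketched with plain triples. The helper below rewrites one simple statement as a relationship node that carries `relatesFrom`, `relatesTo` and `hasEvidence`, using the property names from the slide. The node naming scheme and the citation identifier are invented, and this is not the actual ComparaGrid encoding.

```python
def reify(rel_id, subject, relationship_class, obj, citation_id):
    """Rewrite (subject, relationship, obj) as a relationship individual with evidence."""
    rel = f"_:rel{rel_id}"
    citation = f"citation:{citation_id}"
    return [
        (rel, "rdf:type", relationship_class),
        (rel, "relatesFrom", subject),
        (rel, "relatesTo", obj),
        (rel, "hasEvidence", citation),
        (citation, "identifier", citation_id),
    ]


# Simple form: one statement, with nowhere to attach evidence.
simple = ("Map", "isMapOf", "Chromosome")

# Richer form: the relationship is itself an individual that can carry metadata.
rich = reify(1, "Map", "isMapOf", "Chromosome", "example-citation-1")
```

The richer form costs five triples instead of one, but it gives the relationship an identity that evidence, provenance or further relationships can refer to.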

The Importance of Relationships

Biologists and Bioinformaticians see an important conceptual difference between:

The ‘nuts and bolts’ relationships within the data

(‘EXPERIMENTAL OBSERVATIONS’ and ‘FACTS’)

Vs

The biological hypotheses (‘ASSERTIONS’)

Hopefully the richness and expressivity of OWL-DL will give us the opportunity to capture the subtleties of the different types of relationships and how they may relate to each other.

Critically we want to infer over the data represented as individuals – not merely over properties of the ontology

COMPARAGRID ONTOLOGY:

Complex Relationships (as Concepts)

BUILDING THE COMPARAGRID ONTOLOGY

Stage III (Expert Ontologists)

• Refactor the prototype ontology according to good design principles

• Build a core upper-level comparative mapping domain ontology that will integrate with other domains

• Incorporate additional modules to represent specific subdomains (Genetic Variation, Abstract Mapping Concepts, Evidence, Evolutionary Relationships etc.)

OUTPUT: modularised ComparaGrid OWL Ontology

THE MODULARISED COMPARAGRID ONTOLOGY

BUILDING THE COMPARAGRID ONTOLOGY

Timescale

• Stage I : 6 months

• Stage II : 6 months

• Stage III : ongoing / 3 years

Problem: how do we develop the architecture and software when we don’t have a final Ontology or model?

• Use the Prototype version?

• Use small hack ontologies for demonstration data?

But can we be sure the principles will work for the final larger, more complex Ontology?

USING THE COMPARAGRID ONTOLOGY:

Querying distributed resources through the ComparaGrid Stack Architecture

• Tools for converting DB schema to OWL ontology

• Under Development…

• Automatic query translations up and down the stack

• Allows queries to be expressed and resolved in OWL – should allow automated reasoning and inference

Roslin Ark Database’s experience as Data Providers (and Biologists/Users)

• We want to export and import data in a reusable format

• We could build all our own applications using a common data format, allowing us to traverse data sets according to assertions made between the data.

• ...but we want to use ComparaGrid’s ‘clever’ integration and query through OWL

• i.e. we want to exchange data as OWL, so we have to incorporate mapping from schema to OWL into our service architecture

Roslin Ark Database’s experience as Data Providers

Problems:

• We are waiting for the ‘final’ ontology

• We are waiting for the stack architecture (…which is waiting for the ontology)

• The ComparaGrid Architecture/Toolset is being designed to map from DB schema to OWL, but our DB schema captures none of our domain model… our mapping should be from Object model to OWL.

• We have to implement our own mapping to OWL….

• We want to progress and ACTUALLY DO SOME BIOLOGY!
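A minimal sketch of mapping from an object model, rather than from a database schema, to ontology statements. The ArkDB object model is in Java; Python is used here only to keep the illustration short. The `Marker` class, its fields and the `cg:` property names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Marker:
    identifier: str
    name: str
    chromosome: str


# Hypothetical mapping from object fields to ontology properties.
PROPERTY_FOR_FIELD = {"name": "cg:hasName", "chromosome": "cg:isLocatedOn"}


def object_to_triples(obj, class_iri, property_for_field):
    """Describe a domain object as triples, using only the mapped fields."""
    subject = f"ark:{obj.identifier}"
    triples = [(subject, "rdf:type", class_iri)]
    for field in fields(obj):
        prop = property_for_field.get(field.name)
        if prop is not None:
            triples.append((subject, prop, getattr(obj, field.name)))
    return triples


marker_triples = object_to_triples(
    Marker("m42", "M1", "chr7"), "cg:Marker", PROPERTY_FOR_FIELD)
```

Because the mapping table is keyed on object fields, it follows the domain model that the applications already use, instead of the generic object and relationship tables in the database.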

[Figure: ArkDB service architecture. Labels: ArkIIDB Schema (Object Table, Relationship Table), Ibatis, Object Model, Java Objects, Java Application, Web App, Drawing Applet, Download App, Betwixt / XSLT, CG XML/RDF, CG-OWL, RDFS Vocabulary, Web Service]

ComparaGrid Ontology: Where are we at…and Why?

• Prototype OWL Ontology created:

- used to demonstrate mapping of ArkDB to Webservices.

- Ontology is flabby and poorly designed?

- Mapping from Java to OWL/XML is a cumbersome, manual process.

• Refactoring/modularising the ComparaGrid OWL Ontology is non-trivial (Research Project in its own right!). We are not able to use a ‘final’ ontology to drive the development of services.

• Until we have a working common data format or ontology we can’t start to import and export further datasources.

ComparaGrid Ontology: Where are we at…and Why?

• Implementation of the ComparaGrid stack integration and query architecture is ongoing.

• Automated / assisted mapping tools are under development.

(DB relational schema => DB-OWL => CG-OWL)

[Using hack ontology fragments in the interim.]

• We need further tools to support mapping from any ad hoc database or object model to OWL.

ComparaGrid Ontology: Where are we at…and Why?

• As a data provider, Roslin ArkDB is dependent on the tools and infrastructure being developed by ComparaGrid, without knowing how much added value an ontology will give.

• We hope that the ontology will allow us to represent the ‘interesting’ biological relationships

• That it will facilitate automated integration and data traversal

• That it will allow inference of new knowledge automatically

• However, the burden is put on the data mapping process. A more lightweight approach (e.g. RDF/RDFS) would simplify this, but might require that applications understand the context of information sources.

• RDF(S) is becoming quite well supported, and allows some inference over semantic relationships.

WOULD IT BE GOOD ENOUGH FOR US?
