Integrating Genomic Mapping Data
using Ontologies:
A View from the Closed World
Trevor Paterson and Andy Law
(Roslin Institute, Scotland)
and the ComparaGrid Consortium
Aims:
- To develop ‘enabling technologies’ for comparative
genomics.
- To integrate disparate resources (genomic mapping,
DNA sequence, evolutionary relationships, functional
information) across species boundaries.
- In order to inform and expedite genomic mapping:
particularly in non-model organisms.
Consortium Members:
Farm animal, crop and microbial genomics
Bioinformatics
Computer Sciences
Ontologists
Statisticians
Dr. Andy Law
(project co-ordinator)
Dr Trevor Paterson
Roslin Institute
Dr. Peter Rice
Tony Burdett
EBI
Dr. Ian Roberts
IFR
Dr. Jo Dicks
RA Dr Robert Davey
JIC
Dr. Robert Stevens
Dr Andrew Gibson
Manchester
Dr. Darren Wilkinson
Dr. Richard Boys
(RA Dr Madhuchhanda
Bhattacharjee )
Newcastle
(Maths & Stats)
Dr. Neil Wipat
Dr. Matthew Pocock
Professor Paul Watson
Newcastle
(Computing
Science)
Dr. David Marshall
SCRI
Biological Goal
To MAP, IDENTIFY and UNDERSTAND genes
behind phenotypes (e.g. diseases & commercially
important traits)
ComparaGrid aims to assist this process by
exploiting existing genetic mapping data (held in
multiple online databases and resources) across
species boundaries.
UNDERLYING BIOLOGICAL PRINCIPLE BEHIND
CROSS-SPECIES MAP COMPARISON
Conservation of Synteny:
“Conservation of (blocks of) gene order
throughout chromosomal evolution”
As species evolve and diverge, their chromosomes rearrange through
duplications, inversions, translocations etc - but blocks of genes can be
traced through evolutionary history between even relatively divergent
species (e.g. chicken and man).
Therefore the known gene order in these blocks in one species can
inform/predict the order (and existence) of evolutionarily related
genes (orthologues) in other species.
Recognizing Orthology:
gene sequence similarity → homology
best sequence similarity + conserved position → orthology
duplication in one species cf. another → paralogy
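The three rules of thumb above can be read as a small decision procedure. The Python sketch below is one hypothetical encoding: the function name, its boolean inputs and the fallback label are illustrative assumptions, not part of ComparaGrid, and real orthology calling is considerably more involved.

```python
def classify_relationship(similar_sequence, best_match=False,
                          conserved_position=False,
                          duplicated_in_one_species=False):
    """Toy encoding of the slide's rules; every input is an assumed flag."""
    if not similar_sequence:
        # No recorded similarity: absence of evidence, not proof of absence.
        return "no recorded relationship"
    if duplicated_in_one_species:
        # Duplication in one species compared with another.
        return "paralogy"
    if best_match and conserved_position:
        # Best sequence similarity plus conserved map position.
        return "orthology"
    # Sequence similarity alone.
    return "homology"


print(classify_relationship(True, best_match=True, conserved_position=True))
```

Note that the first branch deliberately avoids returning "unrelated": whether a missing record means NOT or UNKNOWN is exactly the question the rest of the talk raises.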
COMPARATIVE GENOMICS USE CASE
Agribusiness wants to map the underlying genetic basis of the
‘Tasty Bacon’ Trait ( a QTL ).
The ‘Tasty Bacon’ QTL has been genetically mapped.
[Figure: the ‘Tasty Bacon’ QTL placed on a pig QTL (genetic) map]
The position of the QTL is correlated on various types of Pig
Genetic maps.
[Figure: the ‘Tasty Bacon’ QTL aligned across the pig QTL map, linkage map and radiation hybrid map]
There is a ‘known’ homology between a Pig Marker/Sequence in
this region and the human genome.
[Figure: pig QTL, linkage, radiation hybrid and cytogenetic maps linked to the human genome by DNA sequence similarity (→ homology, ? orthology)]
A Physical Map of BAC clones exists for this region of the Human
Genome.
[Figure: pig QTL, linkage, radiation hybrid and cytogenetic maps linked to a human physical map of BAC clones (BAC1, BAC2, BAC3)]
There are known chicken expressed sequences homologous to
Human Gene Sequences in this region.
[Figure: pig maps linked to the human physical map (BAC1, BAC2, BAC3), which is in turn linked to chicken ESTs (EST1, EST2) from an EST library]
Gene expression Data for these Chick ESTs might correlate with a
trait similar to ‘Tastiness’.
[Figure: pig maps, human BAC clones (BAC1, BAC2, BAC3) and chicken ESTs (EST1, EST2), with expression analysis attached to the chicken ESTs]
The literature may detail Functions of Human genes in this region,
and homologies to genes in other species – helping the researcher
predict candidate genes in Pigs responsible for tastiness.
[Figure: pig maps, human BAC clones (BAC1, BAC2, BAC3) and chicken ESTs (EST1, EST2), with physical mapping, expression analysis and linked references attached]
THE DATA: FACTS AND ASSERTIONS
- Maps (positions of 'markers' on representations of chromosomes)
e.g. genetic linkage, radiation hybrid, linkage association,
cytogenetic, QTLs etc.
- Genomic Sequences for model organisms – with annotations
including gene positions and structures
- DNA Sequence Databases (for genes, clones etc.)
- Gene and Protein Function Databases
- Gene Family and Homology Databases (including Gene
Orthology Databases)
WE ONLY RECORD POSITIVE DATA!
RDBMS data queries operate under the Closed World Assumption.
The scientist has to decide whether absence of data means NOT or UNKNOWN!
Assumptions Interpreting Data
• (Almost) all our data is recorded in RDBMS
• (Almost) all our data is 'positive'
– (A maps on X, A is related to B, ...)
• Negative data is rarely explicitly recorded
– (A does not map on X, A is not related to B, ...)
• Some (lots) of our data is contradictory
– (A maps on X at position (x,y), A maps on X at position (p,q), A
maps on Y at position (s,t), ...)
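A minimal Python sketch of these two properties of the data follows. The records and positions are invented for illustration (the marker and map names simply echo the A/X/Y placeholders above), and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical positive mapping records: (marker, map, start, end).
records = [
    ("A", "X", 10, 12),
    ("A", "X", 40, 45),
    ("A", "Y", 3, 4),
    ("B", "X", 20, 21),
]


def find_conflicts(rows):
    """Return markers that are recorded at more than one placement."""
    placements = defaultdict(set)
    for marker, map_name, start, end in rows:
        placements[marker].add((map_name, start, end))
    return {m: sorted(p) for m, p in placements.items() if len(p) > 1}


def maps_on(marker, map_name, rows):
    """Closed-world lookup: True only if a positive record exists."""
    return any(m == marker and mp == map_name for m, mp, _, _ in rows)


conflicts = find_conflicts(records)
print(conflicts)                    # marker A has three competing placements
print(maps_on("B", "Y", records))   # False: no record, which may only mean 'unknown'
```

The lookup returns False for "B maps on Y" simply because no row says so; nothing in the data distinguishes "tested and absent" from "never tested".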
Linking the Data: Traversing Relationships
Between Maps and Species
Can we create a system that integrates and queries across datasources
implementing the reasoning of an expert scientist?
• Expert biological reasoning
– Currently performed manually/in cerebro by the expert (difficult, error-prone,
inefficient)
– Integrates explicit facts and recorded assertions and relationships
– Accounts for the implicit assumptions of the knowledge domain
– Relies on 'working' assumptions that are implicit, and dependent on the data
values and the amount of data available
– Relies on interpreting and extrapolating the relationships and gaps in the
data manually, e.g. interpreting what is implied by the nonexistence of information
Linking the Data: Traversing Relationships
Between Species
Consider maps in two species:
[Figure: maps of speciesA (marker labels act2, act6, mysA, mysB, trp, orf231) and speciesB (marker labels act2, act6, mys, pseudo-mys, trp, tyr)]
Various datasources may record various facts and assertions relating
'markers' in speciesA with speciesB:
• A-act2 is an orthologue of B-act2
• B-mys and B-pseudo-mys are homologous
• the sequence of A-act6 is similar to B-act6 and B-act2
• A-mysA and A-mysB are paralogous
• the sequences of A-mysB, B-mys and B-pseudo-mys are similar
• the sequences of A-orf231 and B-trp are related

• Clearly not ALL the possible relationships between the markers are
recorded (or known) – but we don't even know if they have been investigated
or not.
• There are some necessarily true deductions that we can make – and some
possibly true deductions.
• For example, is A-trp related to B-trp? Does absence of information imply
that it isn't (Closed World) or allow that it might be (Open World)?
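The A-trp/B-trp question can be made concrete with a small Python sketch. It stores the recorded assertions from this example as unordered marker pairs and answers the same question under both assumptions. Collapsing orthology, homology, paralogy and similarity into one generic "related" link is a simplification made here, and the function names are hypothetical.

```python
# Recorded assertions from the example, as unordered pairs of markers.
RECORDED = {
    frozenset({"A-act2", "B-act2"}),        # orthologue
    frozenset({"B-mys", "B-pseudo-mys"}),   # homologous / similar
    frozenset({"A-act6", "B-act6"}),        # similar sequence
    frozenset({"A-act6", "B-act2"}),        # similar sequence
    frozenset({"A-mysA", "A-mysB"}),        # paralogous
    frozenset({"A-mysB", "B-mys"}),         # similar sequence
    frozenset({"A-mysB", "B-pseudo-mys"}),  # similar sequence
    frozenset({"A-orf231", "B-trp"}),       # related sequence
}


def related_closed(x, y):
    """Closed world: anything not recorded is taken to be false."""
    return frozenset({x, y}) in RECORDED


def related_open(x, y):
    """Open world: anything not recorded is unknown (None), not false."""
    return True if frozenset({x, y}) in RECORDED else None


print(related_closed("A-trp", "B-trp"))  # False
print(related_open("A-trp", "B-trp"))    # None
```

Neither answer is obviously what a biologist wants: False overstates what is known, while None is returned for every unrecorded pair, however implausible.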
The message is that there are no knowns.
There are things we know that we know.
There are known unknowns,
that is to say there are things we now know
we don’t know.
But there are also unknown unknowns
- things we do not know we don’t know.
Donald Rumsfeld
The interpretation of Data/Assertions (i.e. our
assumptions) might also change according to the
particular data values.
• 3 markers on the Pig Map are orthologous to markers on Human
Map HSA3: -A-B-D-
• 1 marker on our Pig Map is orthologous to a marker on HSA1: -F-
• where are the orthologues of C and E in HSA (if they exist!)?
Pig Map 1: -A-B-C-D-E-F-
• predict C might lie in the ABD cluster on HSA3
• E might be linked to the ABD cluster or be linked to F on
HSA1 – (or maybe elsewhere........)
Pig Map 2: -A-B-C-D------(----)-----E-F-
• still predict C might lie in the ABD cluster on HSA3
• we might now predict that E is more likely to be linked to
F on HSA1
The interpretation of Assertions (i.e. our assumptions)
changes according to context
• In the preceding examples, using the data in our RDBMS (Closed World), we can
discover conserved chromosomal regions and reason that the syntenic conservations
between speciesA and speciesB may include other (unknown) genes.
• But nothing in those statements precludes any other evolutionary relationships also
being true (i.e. they are just unknown or unrecorded).
• If we transfer the knowledge to open world reasoning, how do we exclude potential
other truths? Or how do we restrict which possible other truths we wish to
postulate/test?
• A biologist would interpret the available assertions and alter their assumptions based
on the available 'supporting evidence' (the amount, type and quality of
assertions/evidence recorded).
• e.g. They might choose to ignore possible (open world) interpretations (Occam's
Razor) and might in some cases assume unknown knowledge is 'negative' – but they
must also account for the reality that they do not have all the information – they are
working in an open world system!
• How can we mimic this using open world first order logic reasoning (with OWL-DL)?
Linking the Data
Can we create a system that integrates and queries across
datasources implementing the reasoning of an expert scientist?
• Data Requirements
– Need for standard representation of data
– Need for explicit and standardised 'relationships' between data
– Explicit assertions, with associated values to 'weight' these assertions
• evidence
• quality
• provenance
• Requirements for Automated Reasoning
– capturing/understanding the supporting 'background' assumptions
– what types of assertions are there
– what types of assumptions are there in domain (implicit)
– how do we currently use these in cerebro
– how do we capture the semantics of these assertions and assumptions
to facilitate automated reasoning
AUTOMATING COMPARATIVE GENOMICS
Provide Architecture to Link, Integrate and Traverse Data Sources:
→ GRID/Web-services
Provide Data Standards to allow this:
→ Syntax and Semantics of Data
Formalise the Links between Data: (e.g. sequence similarities,
evolutionary hypotheses)
→ these relationships and assertions are Data too
→ these are 'more interesting' to Biologists than mere 'facts'
→ is there an important conceptual difference between:
The ‘nuts and bolts’ relationships within the data
(‘EXPERIMENTAL OBSERVATIONS’ and ‘FACTS’)
versus
The biological hypotheses (‘ASSERTIONS’)?
ComparaGrid aims to allow integration across datasources –
using a common representation of the data types – using a
shared Domain Ontology (in OWL-DL) to represent the data
classes and relationships.
The Domain Ontology represents a consistent open world
model of the classes and potential restrictions and
relationships on and between classes.
Representing data as instances of classes of the shared
Domain Ontology allows the potential to perform queries and
data retrieval using Reasoning Engines. However there are
potential pitfalls in translating data from a closed world
representation (RDBMS) to open world representation (OWL-DL).
ComparaGRID Design
[Figure: ComparaGRID architecture. DISTRIBUTED DATABASES hold DATA with different SYNTAX, TERMINOLOGY and SEMANTICS (CLOSED WORLD). The COMPARAGRID ONTOLOGY MAPPER exports and converts this into DATA AS COMPARAGRID OWL-DL ONTOLOGY, with common SYNTAX, TERMINOLOGY and SEMANTICS (OPEN WORLD). The COMPARAGRID DATA INTEGRATOR integrates, caches and reasons over the data (SEMI-OPEN WORLD). The USER issues ontology queries and data queries.]
Modelling the Domain and Representing Data.
• The Domain Ontology seeks to model the knowledge domain
• Attempts to capture many explicit and implicit truths
• Some implicit truths/interpretations are more complex and context
dependent
• Map data – Straightforward?
– if a marker is at one position on a map then it cannot be at another different position
– → OWL exclusion on class – fundamental property of a mapping
– (but still have problems with contradictory data............)
• Relationship data – Interpretation is Complex and Contextual
– if a marker has an evolutionary relationship with another marker (in this or another species)
that does not preclude it from having an evolutionary relationship with any other marker (in this
or any other species)
– database usually does not record negative statements
– → OWL cannot exclude at class level
– humans may choose to exclude unless stated
– open world reasoning might require explicit negation – or anything might be true!
– can we 'fake' closing the open world by generating negations for assertions not explicitly
stated
• but is this correct? (sometimes? not always?)
• and is it tractable?
• do we need extensions to OWL - epistemic operators?
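The tractability worry can be illustrated with a back-of-envelope Python sketch that 'closes' the world by materialising a negation for every unordered marker pair lacking a positive assertion. The marker names, the two positive assertions and the `close_world` helper are all invented for illustration.

```python
from itertools import combinations


def close_world(markers, positive):
    """Return every unordered marker pair that is NOT positively asserted."""
    return [pair
            for pair in map(frozenset, combinations(sorted(markers), 2))
            if pair not in positive]


# 100 hypothetical markers with only two recorded relationships.
markers = [f"m{i}" for i in range(100)]
positive = {frozenset({"m0", "m1"}), frozenset({"m2", "m3"})}

negations = close_world(markers, positive)
print(len(negations))  # 4948 generated negations for 2 positive facts
```

The number of generated negations grows as n(n-1)/2 for a single relationship type, so tens of thousands of markers would yield hundreds of millions of negative statements, most of which assert something nobody has tested.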
Genomic Information at Roslin
www.thearkdb.org
• multispecies mapping resource
• mapping data generated by variety of experimental techniques
• by numerous laboratories
• stored in a relational database
• accessed through a Java webapp using an object-relational
data model
• basically contains maps, 'markers', positions of markers on
maps, and relationships between markers and other markers (e.g.
through related DNA sequences)
The ARK Relational Model
Stores all data as relationships between objects
Objects: typed objects
alias
assertion
cytogenetic_map
chromosome
clone
contig
sequence
gene
laboratory
linkage_map
map
marker
physical_map
probe
primer
publication
QTL
radiation_hybrid_map
external_sequence
species
marker_type
...
Relationships: typed relationships (also objects):
Object <typed relationship> Object
also_known_as
belongs_to_species
belongs_to_group
has_associated_sequence
has_primer
is_an_alias_of
is_id_for
belongs_to_container
is_type
maps
is_map_of
has_associated_gene
supported_by
...
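The "Object <typed relationship> Object" pattern can be sketched in Python as below. The type and relationship names are taken from the lists above, but the class names, the example instances (S0001, GENE1, pig) and the lookup helper are hypothetical and do not reflect the actual ArkDB schema.

```python
from typing import NamedTuple


class Obj(NamedTuple):
    kind: str   # an object type, e.g. "marker", "gene", "species"
    name: str


class Relationship(NamedTuple):
    subject: Obj
    predicate: str  # a relationship type, e.g. "belongs_to_species"
    target: Obj


# Hypothetical instances for illustration only.
marker = Obj("marker", "S0001")
gene = Obj("gene", "GENE1")
pig = Obj("species", "pig")

store = [
    Relationship(marker, "belongs_to_species", pig),
    Relationship(marker, "has_associated_gene", gene),
]


def targets_of(subject, predicate, rows):
    """All objects linked from `subject` by the given relationship type."""
    return [r.target for r in rows
            if r.subject == subject and r.predicate == predicate]


print(targets_of(marker, "belongs_to_species", store))
```

Each row is a subject, predicate, object statement, which is why a model of this shape maps naturally onto RDF-style triples.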
Developmental aims for ArkDB
(Using the ComparaGrid Architecture)
• traverse relationships/assertions to join up data between maps and
species
• allow intelligent reasoning over these relationships
• based on EVIDENCED assertions (ultimately sequence relationships)
• weight traversal of relationships on the quality and provenance of the
evidence
• allow user guided data discovery (select which type of assertions to
traverse)
• allow automatic data discovery
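A minimal Python sketch of weighted, user-guided traversal follows, assuming each assertion carries a confidence weight between 0 and 1 and that the score of a path is the product of its weights. The edges, weights, assertion type names and scoring rule are all invented for illustration; how evidence should actually be weighted is an open question here.

```python
import heapq

# Hypothetical directed assertions: (source, assertion_type, target, weight).
EDGES = [
    ("pig:QTL1", "maps_near", "pig:M1", 0.9),
    ("pig:M1", "sequence_similarity", "human:G1", 0.8),
    ("human:G1", "orthology", "chicken:EST1", 0.5),
    ("pig:M1", "literature", "chicken:EST1", 0.2),
]


def best_scores(start, edges, allowed_types):
    """Best path score (product of weights) to each reachable node,
    following only the assertion types the user has selected."""
    best = {start: 1.0}
    heap = [(-1.0, start)]           # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue                 # stale heap entry
        for src, kind, dst, weight in edges:
            if src == node and kind in allowed_types:
                candidate = score * weight
                if candidate > best.get(dst, 0.0):
                    best[dst] = candidate
                    heapq.heappush(heap, (-candidate, dst))
    return best


everything = {"maps_near", "sequence_similarity", "orthology", "literature"}
print(best_scores("pig:QTL1", EDGES, everything))
print(best_scores("pig:QTL1", EDGES, {"maps_near", "literature"}))
```

Restricting the allowed assertion types changes both which nodes are reachable and how strongly they are supported, which is the kind of control that user-guided discovery would expose.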
Developmental aims for ArkDB - Implications
(Using the ComparaGrid Architecture)
• data in RDBMS – model approximates to RDF statements
• representing this data in OWL-DL for ComparaGRID
• mapping closed world to open world
• how does this affect our data export – e.g. representing 'null information'
• what is the implication for reasoning in an open world over data that
comes from a closed world
• can we 'close down' parts of the open world – is this possible generically –
or is it a run time option – can we use epistemic operators
• how difficult is it to ask...
• for genes known to be in some region or involved in some phenotype?
• for genes that might be, or probably are, involved?
• for genes definitely NOT involved?
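The three questions can be contrasted in a short Python sketch, assuming the (rare) case where some explicit negative assertions are recorded. The gene names and both helper functions are hypothetical.

```python
ALL_GENES = {"geneA", "geneB", "geneC", "geneD", "geneE"}
KNOWN_INVOLVED = {"geneA", "geneB"}   # positive assertions
KNOWN_NOT_INVOLVED = {"geneC"}        # explicit negative assertions (rare)


def partition(all_genes, known, excluded):
    """Open-world reading: unrecorded genes remain possible."""
    return {
        "known": set(known),
        "possible": all_genes - known - excluded,
        "definitely_not": set(excluded),
    }


def closed_world_not_involved(all_genes, known):
    """Closed-world reading: every gene without a positive record is 'not'."""
    return all_genes - known


print(partition(ALL_GENES, KNOWN_INVOLVED, KNOWN_NOT_INVOLVED))
print(closed_world_not_involved(ALL_GENES, KNOWN_INVOLVED))
```

Under the closed-world reading geneD and geneE are reported as not involved; under the open-world reading they are merely possible, and only geneC can be excluded with any justification.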
Conclusions
• We need a way to formalise/standardise the representation not only of the data in our
RDBMS (e.g. positional mappings) but also of the relationships between objects
(within and between species).
– OWL is good for modelling the Domain, and providing a standardized data representation
• We would like to 'reason' across our data
• We need to capture assumptions in our domain – but these might change according
to the context of the data
• Representing data as instances in an open world OWL-DL ontology allows open
world reasoning
• This may cause us grief
– satisfying a query that has no known answer returns all possible answers?
– can we live with this? – or can we replicate the power of 'human reasoning' capturing our
context dependent assumptions (nonmonotonic logic?)
– e.g. how do we treat null data?
• Can it be done?......(over to you Matt)