The Aim
• Integrate genomic mapping data across different
data sources on the Internet
• Allow exploration of mapping information
• Discover or predict new information by exploiting
Conservation of Synteny across species boundaries
Trevor Paterson 31 May 2016
The Problems
• Lots of different types of mapping data
• from different types of experiments
• in various species
• held in different locations and computer systems
• different representations of data
• different terminology
• different storage formats
• data of variable quality
• …et cetera
(Part Of) The Solution
• Define a language/terminology to describe and
represent mapping data unambiguously
• Capturing both the meaning of the data and the
relationships represented in the data.
• Use this language as a common representation for
exchanging and querying data and providing results.
• Perhaps use the semantics captured in this
language to automatically discover new information.
What is an Ontology?
DEFINED TERMINOLOGY
Terms + Definitions
A controlled common vocabulary for describing and for querying data unambiguously
ONTOLOGY
Defined Terminology + How terms are used
Captures semantics of data
• checks data is consistent
• infers new information
• facilitates semantically complex queries
[Figure: arrow running from "Defined Terminology" to "Formal Logics", labelled "Increasing Formality of Ontology"]
Defining the ComparaGRID Domain Ontology.
An ontology is a (more or less) 'formal' specification of a domain
of knowledge
(Here: Genomic Mapping data across all species…)
• What types of concepts are there (defined terms, things
we need to talk about)
• And how these concepts (might or necessarily) relate to
each other
• Can be used to control the vocabulary used for storing or
describing data
• Can represent Formal Logics: allow 'reasoning' about
data (Software can check the validity of data and deduce new
information).
Formal Logics….What?
If we 'know' that:
1. Concept A is related to Concept B [ α → β ]
&
2. Concept B is related to Concept C [ β → γ ]
Can we deduce/reason anything about possible
relationships between Concepts A and C?
[ α ?? γ ]
Formal Logics….Why?
For example
If in species A
Gene A is syntenic with Gene B   α↔β
&
Gene B is syntenic with Gene C   β↔γ
&
Gene C is syntenic with Gene D   γ↔δ
If we have defined Synteny to be Transitive,
we can deduce that:
Gene A is syntenic with Gene D   α↔δ
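The deduction on this slide can be sketched in a few lines of Python. This is only an illustration, not the ComparaGRID reasoner: it assumes synteny is treated as both symmetric and transitive within one species, and the function name `synteny_closure` is made up for the example.

```python
from itertools import combinations

def synteny_closure(asserted):
    """Return every gene pair implied by symmetric, transitive synteny."""
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    # Merge the groups of every asserted syntenic pair.
    for a, b in asserted:
        parent[find(a)] = find(b)

    # Collect the members of each merged group.
    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)

    # Every pair inside a group is syntenic.
    return {frozenset(pair)
            for members in groups.values()
            for pair in combinations(sorted(members), 2)}

asserted = [("A", "B"), ("B", "C"), ("C", "D")]
inferred = synteny_closure(asserted)
a_d_deduced = frozenset({"A", "D"}) in inferred
```

Here `a_d_deduced` is true even though the pair A/D was never stated in the input, which is the kind of new information a reasoner can derive once Transitivity is part of the definition.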
Formal Logics….Why?
More complex example
If in species A
Gene A1 is syntenic with Gene A2   α1↔α2
&
Gene A2 is syntenic with Gene A3   α2↔α3 ?↔? α4
And in a somewhat related Species B
Gene B2 is syntenic with Gene B3   β2↔β3
&
Gene B3 is syntenic with Gene B4   ‘β1’ ?↔ β3↔β4
And sequence comparisons establish that A2/B2, A3/B3 and A4/B4 are
orthologues,
we might postulate that an orthologue of A1 may be found syntenic
with B2, B3 and B4, and that A4 may be syntenic with A1, A2 and A3.
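As a rough sketch of this cross-species reasoning, a synteny group from one species can be projected onto the other through the known orthologue pairs, and any projected gene not yet in the local group becomes a postulated member. The helper names (`project`, `postulated`) and the placeholder label `orthologue(A1)` are assumptions made for the example; the result is a hypothesis to test, not an established fact.

```python
def project(group, orthologue):
    """Translate a synteny group into the other species.

    Genes with no known orthologue become a placeholder such as
    'orthologue(A1)'.
    """
    return {orthologue.get(gene, f"orthologue({gene})") for gene in group}

def postulated(own_group, other_group, orthologue):
    """Genes postulated to be syntenic with own_group."""
    projected = project(other_group, orthologue)
    if not projected & own_group:
        return set()  # no shared anchor genes, so no basis for a prediction
    return projected - own_group

species_a = {"A1", "A2", "A3"}   # known synteny group in species A
species_b = {"B2", "B3", "B4"}   # known synteny group in species B
b_to_a = {"B2": "A2", "B3": "A3", "B4": "A4"}   # orthologues from sequence comparison
a_to_b = {a: b for b, a in b_to_a.items()}

new_in_a = postulated(species_a, species_b, b_to_a)
new_in_b = postulated(species_b, species_a, a_to_b)
```

With the data from the slide, `new_in_a` contains A4 and `new_in_b` contains the placeholder for the as yet unidentified orthologue of A1.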
Exploiting Conserved Synteny
to predict candidate genes
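[Figure: a Pig chromosome region aligned with the corresponding Human region. The pig region is bracketed and labelled "QTL is in here somewhere". Markers A, B, C, D and E appear on both chromosomes and are joined by ‘Orthology’ links, illustrating "Conservation of Synteny – with Conserved Gene Order". The human chromosome also carries Gene 1, Gene 2, Gene 3 and Gene 4; human genes in the region corresponding to the QTL are labelled "These are potential candidate genes". A further region is labelled "Conservation of Synteny ?".]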
The ComparaGRID Ontology
The ComparaGRID ontology defines the terminology
used in the domain of Comparative Genomics, and
how this terminology can be used.
There are two components of the ontology
1. Classes (Concepts, or terms with definitions)
and
2. Properties (simple relationships, between a Class
and a Value: the value can be another Class, or a
simple number etc.)
Example: Concepts and Properties
Map is a Concept: It has a definition
‘The abstract (typically linear) representation of an
informational macromolecule or chromosome etc.,
allowing the positioning of identifiable markers along
the length of the map...’
hasScaleUnit is a Property: In our ontology we can
define which Concepts can have particular
Properties, and which Concepts may be the values
of particular Properties.
Ontology Statement: Map: hasScaleUnit: ScaleUnit
Real Data: <RFxWL_Uppsala:Chromosome1>
hasScaleUnit <centiMorgan>
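A minimal sketch of how such a statement could be checked in software, assuming a Property simply records which Concept may be its subject (domain) and which may be its value (range). The `Property` class, the `instance_of` table and the `conforms` function are hypothetical names for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Property:
    name: str
    domain: str   # Concept allowed to have the property
    range: str    # Concept allowed as the value of the property

# Ontology Statement: Map: hasScaleUnit: ScaleUnit
HAS_SCALE_UNIT = Property("hasScaleUnit", domain="Map", range="ScaleUnit")

# Which Concept each data item is an instance of.
instance_of = {
    "RFxWL_Uppsala:Chromosome1": "Map",
    "centiMorgan": "ScaleUnit",
}

def conforms(subject, prop, value):
    """True if the data statement respects the property's domain and range."""
    return (instance_of.get(subject) == prop.domain
            and instance_of.get(value) == prop.range)

valid = conforms("RFxWL_Uppsala:Chromosome1", HAS_SCALE_UNIT, "centiMorgan")
reversed_ok = conforms("centiMorgan", HAS_SCALE_UNIT, "RFxWL_Uppsala:Chromosome1")
```

The real data statement from the slide passes the check (`valid` is true), while the same statement written the wrong way round is rejected (`reversed_ok` is false).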
Building the Ontology
The process of Ontology Definition involves collecting all
the terms and relationships in the knowledge domain
Providing definitions for terms: Concepts
Classifying Concepts into related groups in a hierarchical
tree
Defining the relationships found in the data: Properties
Specifying the permitted domain and range for these
properties
Specifying which properties are allowed, which must
always be true, and which are disallowed
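The last step, stating which properties are allowed, required or disallowed, might look like the following sketch. The property names are taken from elsewhere in these slides, but the particular rules given for Map are invented purely to illustrate the idea and are not the actual ComparaGRID definitions.

```python
# Hypothetical rules for one Concept (assumed for illustration only).
RULES = {
    "Map": {
        "required": {"isMapOf"},
        "allowed": {"isMapOf", "hasScaleUnit", "hasMarker"},
        "disallowed": {"hasLatinName"},
    }
}

def check(concept, properties):
    """List the ways a set of properties breaks the rules for a Concept."""
    rules = RULES[concept]
    used = set(properties)
    problems = [f"missing required property {p}"
                for p in sorted(rules["required"] - used)]
    problems += [f"disallowed property {p}"
                 for p in sorted(used & rules["disallowed"])]
    problems += [f"property {p} is not declared for {concept}"
                 for p in sorted(used - rules["allowed"] - rules["disallowed"])]
    return problems

ok = check("Map", ["isMapOf", "hasScaleUnit"])
bad = check("Map", ["hasLatinName"])
```

An empty list (`ok`) means the data item is consistent with the rules; `bad` reports both the missing required property and the disallowed one.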
CONCEPTS
SIMPLE RELATIONSHIPS
transcribedFrom
Microsatellite
QuantitativeTrait
PartOf
identifier
hasAbbreviation
TechniqueUsed
Chromosome
COMPLEX RELATIONSHIPS
DNADuplication
Orthology
Interval Position
Mapping
Reciprocal
BestMatch
GeneticLinkageMap
Example: Modelling Maps
WHAT IS A MAP? – information about the presence and ordering of
Markers on an abstract representation of a macromolecule (DNA
Molecule, Chromosome or even a Polypeptide).
• ‘Linkage Group’: the simplest ‘Map’
• a collection or set of markers that are inherited together – without implied
order.
• i.e. the relationship between a ‘Marker’ and a ‘Linkage Group’ is a
‘Containment’: the Linkage Group contains Markers.
• A true ‘Map’
• has some sort of ordering of Markers belonging to a Linkage Group.
• i.e. the relationship between a Marker and a Map is a ‘Mapping’ which has a
‘Position’. This Position may be purely ordinal, or may be co-ordinate and be
associated with Scale Units. The Map maps Markers with a Position.
• A Map is a specialized type-of Linkage Group
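The distinction drawn above can be sketched as a small class hierarchy: a Linkage Group only contains Markers, while a Map is a specialised Linkage Group that also gives each Marker a Position. The class and method names are assumptions made for the example, and `MARKER_2` is a made-up second marker.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Position:
    value: float
    scale_unit: Optional[str] = None   # None means a purely ordinal position

@dataclass
class LinkageGroup:
    name: str
    markers: set = field(default_factory=set)

    def contain(self, marker):
        """Containment: the group holds the marker, with no implied order."""
        self.markers.add(marker)

@dataclass
class Map(LinkageGroup):
    positions: dict = field(default_factory=dict)

    def map_marker(self, marker, position):
        """Mapping: the marker is contained and also given a Position."""
        self.contain(marker)
        self.positions[marker] = position

    def ordered_markers(self):
        return sorted(self.positions, key=lambda m: self.positions[m].value)

chromosome_map = Map("RFxWL_Uppsala:Chromosome1")
chromosome_map.map_marker("MCW0010", Position(35.3, "centiMorgan"))
chromosome_map.map_marker("MARKER_2", Position(12.0, "centiMorgan"))  # hypothetical marker
order = chromosome_map.ordered_markers()
```

Because `Map` inherits from `LinkageGroup`, anything that works on a Linkage Group (such as asking which markers it contains) also works on a Map, while only a Map can report an ordering.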
Modelling Maps
Workshop One distinguished two types of Maps:
Physical Maps
Probabilistic Maps
1. Physical Map
A map of the locations of identifiable landmarks on ‘DNA’ (e.g.,
restriction-enzyme cutting sites, genes), regardless of inheritance. At the highest
resolution, distance is measured in base pairs; other units may be used. For a
given genome, the lowest-resolution physical map might be the banding patterns
on the different chromosomes; the highest-resolution physical map of a DNA
Molecule is its complete nucleotide sequence.
e.g. Contig Map; Cytogenetic Map; Breakpoint Map; Deletion Map;
Fingerprint Map; Restriction Site Map; Sequence Map (Amino Acid, DNA, RNA)
Modelling Maps
2. Probabilistic Map
A map of the relative locations of markers on a chromosome, derived
from an experimental analysis tracking the propensity of markers to be inherited
together following natural or induced chromosomal disruption, i.e. based on some
probabilistic measure of closeness.
e.g. Genetic Linkage Map; Meiotic Linkage Map; Radiation Hybrid Map;
HAPPY Map
In addition we might represent
3. Integrated Map
A map combining mapping data from multiple map sources and experiments
The Importance of Relationships
Defining concepts is ‘easy’….;-)
In many respects, defining concepts such as maps, genes, positions and
chromosomes to represent the species-specific maps in existing
datasources is straightforward. This language defines the nuts and bolts
used to represent and exchange the data between individual datasources.
However, some concepts are problematic – even within one datasource
e.g. what is meant by a ‘Marker’?
Even more complicated are the Relationships that we want to express
between data, in different datasources and between different organisms.
And this represents the primary scientific challenge for ComparaGRID.
The Importance of Relationships
For example
A pig database records the mapping of some marker PIGA on map SSC9 at position
30.1, and associates that marker observation with a technique (PCR) and some
reagents (primers P1 and P2, with sequences S1 and S2).
[Figure: marker PIGA placed between positions 30 and 31 on map SSC9]
PIGA
hasEvidence: PCRDetection
hasReagent: Primer1 (with sequence S1), Primer2 (with sequence S2)
The Importance of Relationships
A cattle database records the mapping of a marker COWX on map BTA4 at position
105.3, and associates that marker observation with a technique (PCR) and some
reagents (primers P1 and P2, with sequences S1 and S2).
[Figure: marker COWX placed between positions 105 and 106 on map BTA4]
COWX
hasEvidence: PCRDetection
hasReagent: Primer1 (with sequence S1), Primer2 (with sequence S2)
The Importance of Relationships
• Pig primers P1 and P2, with sequences S1 and S2 (detecting Marker A), are identical to
cow primers P1 and P2, with sequences S1 and S2 (detecting Marker X).
• What can we say about the possible relationships between Marker A and Marker X?
The Importance of Relationships
What can we discover about the relationships between these mapping data?
(And HOW can we discover any relationships between these data?)
• Can we draw any inference between the use of identical primer sequences and a
similar detection technique?
• Does this imply a relationship between the cattle and pig markers?
• Does it imply homology?
• Is it evidence that they are or could be considered the ‘same’ marker?
• How good or reliable is any such inference?
• How can we represent different values/qualities of such inferences – to allow
‘weighting’ of evidence?
• How can we accumulate different strands of evidence to establish a real
relationship between these markers and these regions of the two genomes?
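One possible first step towards answering these questions is simply to find the marker pairs that share reagents and record that fact as a piece of evidence, leaving the interpretation (homology, identity, reliability) to later rules. The sketch below assumes each database record has been reduced to a small dictionary; the field names and the evidence label `IdenticalPrimerSequences` are invented for the example, and S1 and S2 stand in for the real primer sequences.

```python
pig_db = [{"marker": "PIGA", "map": "SSC9", "position": 30.1,
           "technique": "PCR", "primers": ("S1", "S2")}]
cattle_db = [{"marker": "COWX", "map": "BTA4", "position": 105.3,
              "technique": "PCR", "primers": ("S1", "S2")}]

def shared_reagent_evidence(db_a, db_b):
    """Record marker pairs detected with the same technique and primer pair."""
    evidence = []
    for a in db_a:
        for b in db_b:
            same_primers = frozenset(a["primers"]) == frozenset(b["primers"])
            if same_primers and a["technique"] == b["technique"]:
                evidence.append({
                    "from": a["marker"],
                    "to": b["marker"],
                    "kind": "IdenticalPrimerSequences",  # hypothetical label
                    "status": "candidate",  # an observation, not a proven relationship
                })
    return evidence

found = shared_reagent_evidence(pig_db, cattle_db)
```

Each record in `found` is only one strand of evidence. How much weight it deserves, and how it combines with other strands, is exactly the open question raised on this slide.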
ComparaGRID Ontology: Classification of Relationships
Some of the relationships that we want to capture in our
data can be represented by simple ‘binary’ properties:
Concept A – Property – Concept B
Example properties: hasPosition, hasScaleUnit, hasProduct, hasEvidence, hasPart,
mappedOn, containedOn, hasMarker, hasValue, hasLatinName
Simple relationships can be represented as Properties:
DomainConcept – property – DomainConcept:
TaxonConcept <Human> – hasKingdom – Kingdom <Animalia>
DomainConcept – property – Value:
TaxonConcept <Human> – hasLatinName – <String of Characters> “Homo sapiens”
ComparaGRID Ontology: Classification of Relationships
Other relationships are more complicated, and might link
multiple concepts and have properties attached to them.
These are modelled as complex concepts so that we can
represent more details about them:
Mapping
Synteny
Orthology
Paralogy
Containment
Similarity
TaxonomicIdentification
IsMapOf
More complex relationships are modelled as Concepts:
Simple Representation
Map (DomainConcept) – isMapOf (property) – Chromosome (DomainConcept)
Richer Representation
isMapOf is itself a Concept (a Unidirectional Relationship) with its own properties:
• relatesFrom (property): Map (DomainConcept)
• relatesTo (property): Chromosome (DomainConcept)
• hasEvidence (property): Citation (DomainConcept), which has the property
identifier with Value <String of Characters>
Mapping is a type of Relationship
[Figure: diagram relating Marker, Containment, Mapping and isMapOf. Labels recoverable from the figure:]
• Marker (relatesFrom / hasMarker): any DomainConcept that can experimentally be
placed on a Map/LinkageGroup, e.g. Gene, Gene Product, Genetic Variation, QTL,
Phenotype, STS, EST, SNP, nucleotide etc. Example: MCW0010
• Containment (a Relationship): hasMarker MCW0010; containedOn LinkageGroup
(example: LG 1)
• Mapping (a Relationship): hasMarker MCW0010; mappedOn Map (example:
RFxWL_Uppsala Chromosome 1); hasEvidence Evidence; hasPosition Position, with
hasValue 35.300 and hasScaleUnit ScaleUnit centiMorgan
• isMapOf (relatesTo): you can make a map of any DomainConcept made of a
biological informational macromolecule (DNA, RNA, Protein...), i.e. a
HeritableStructure
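To make the idea of a relationship modelled as a Concept more concrete, the sketch below represents the example Mapping as an object in its own right and then flattens it into simple (subject, property, value) statements. The class layout, the `to_triples` helper and the node identifier `mapping1` are assumptions for illustration; only the property names and the example values come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Position:
    value: float
    scale_unit: str

@dataclass(frozen=True)
class Mapping:
    has_marker: str
    mapped_on: str
    has_position: Position
    has_evidence: Optional[str] = None   # e.g. a citation identifier

def to_triples(mapping, node_id):
    """Flatten a Mapping into (subject, property, value) statements."""
    position_id = f"{node_id}_position"
    triples = [
        (node_id, "hasMarker", mapping.has_marker),
        (node_id, "mappedOn", mapping.mapped_on),
        (node_id, "hasPosition", position_id),
        (position_id, "hasValue", mapping.has_position.value),
        (position_id, "hasScaleUnit", mapping.has_position.scale_unit),
    ]
    if mapping.has_evidence is not None:
        triples.append((node_id, "hasEvidence", mapping.has_evidence))
    return triples

example = Mapping("MCW0010", "RFxWL_Uppsala:Chromosome1",
                  Position(35.3, "centiMorgan"))
triples = to_triples(example, "mapping1")   # "mapping1" is a hypothetical identifier
```

Because the Mapping has its own identifier, extra details such as evidence or a position can be attached to it, which a single binary property between Marker and Map could not carry.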
What’s the point of all this
ontological classification etc?
A structured classification makes it easier for the human
user to understand and navigate the terminology.
The ‘meaning’ of terms is more precisely captured – and
how the terms relate to each other.
We can see how terms used in different datasets relate to
each other. We can integrate datasets that are described
using this common vocabulary.
We can link data and make inferences between species –
based on formalised rules and conditions.
Automatic classification and reasoning about data is
feasible.