The Aim
• Integrate genomic mapping data across different data sources on the Internet
• To allow exploration of mapping information
• To discover or predict new information
• By exploiting Conservation of Synteny across species boundaries.

Trevor Paterson 31 May 2016

The Problems
• Lots of different types of mapping data
• from different types of experiments
• in various species
• different locations/computer systems
• different representations of data
• different terminology
• different storage formats
• data of variable quality
• …et cetera

(Part Of) The Solution
• Define a language/terminology to describe and represent mapping data unambiguously
• Capturing both the meaning of the data and the relationships represented in the data.
• Use this language as a common representation for exchanging and querying data and providing results.
• Perhaps use the semantics captured in this language to automatically discover new information.

What is an Ontology?
DEFINED TERMINOLOGY: Terms + Definitions
A controlled common vocabulary for describing and for querying data unambiguously
ONTOLOGY: Defined Terminology + How terms are used
Captures semantics of data
• checks data is consistent
• infers new information
• facilitates semantically complex queries
[Figure: increasing formality of ontology, from Defined Terminology to Formal Logics]

Defining the ComparaGRID Domain Ontology
Ontology: a (more or less) 'formal' specification of a domain of knowledge (here: genomic mapping data across all species…)
• What types of concepts are there (defined terms, things we need to talk about)
• And how these concepts (might or necessarily) relate to each other
• Can be used to control the vocabulary used for storing or describing data
• Can represent Formal Logics: allow 'reasoning' about data (software can check the validity of data and deduce new information).

Formal Logics… What?
If we 'know' that:
1. Concept A is related to Concept B [ α β ]
2. Concept B is related to Concept C [ β γ ]
Can we deduce/reason anything about possible relationships between Concepts A and C? [ α ?? γ ]

Formal Logics… Why?
For example, if in species A:
  Gene A is syntenic with Gene B   α↔β
& Gene B is syntenic with Gene C   β↔γ
& Gene C is syntenic with Gene D   γ↔δ
If we have defined Synteny to be Transitive, we can deduce that:
  Gene A is syntenic with Gene D   α↔δ

Formal Logics… Why?
More complex example. If in species A:
  Gene A1 is syntenic with Gene A2   α1↔α2
& Gene A2 is syntenic with Gene A3   α2↔α3   (?↔? α4)
And in a somewhat related Species B:
  Gene B2 is syntenic with Gene B3   β2↔β3
& Gene B3 is syntenic with Gene B4   β3↔β4   (‘β1’ ?↔)
And sequence comparisons establish that A2/B2, A3/B3 and A4/B4 are orthologues…
We might be able to postulate that an orthologue of A1 might be found syntenic with B2, B3 and B4, and that A4 might be syntenic with A1, A2 and A3.

Exploiting Conserved Synteny to predict candidate genes
[Figure: pig and human maps with shared markers A to E linked by ‘Orthology’. Labels: ‘Pig QTL is in here somewhere’; Gene 1, Gene 2, Gene 3, Gene 4 (‘These are potential candidate genes’); ‘Conservation of Synteny, with Conserved Gene Order’; ‘Conservation of Synteny?’]

The ComparaGRID Ontology
The ComparaGRID ontology defines the terminology used in the domain of Comparative Genomics, and how this terminology can be used. There are two components of the ontology:
1. Classes (Concepts, or terms with definitions), and
2. Properties (simple relationships between a Class and a Value: the value can be another Class, or a simple number etc.)

Example: Concepts and Properties
Map is a Concept: it has a definition, ‘The abstract (typically linear) representation of an informational macromolecule or chromosome etc., allowing the positioning of identifiable markers along the length of the map...’
hasScaleUnit is a Property: in our ontology we can define which Concepts can have particular Properties, and which Concepts may be the values of particular Properties.
Ontology Statement: Map: hasScaleUnit: ScaleUnit
Real Data: <RFxWL_Uppsala:Chromosome1> hasScaleUnit <centiMorgan>

Building the Ontology
The process of Ontology Definition involves:
• collecting all the terms and relationships in the knowledge domain
• providing definitions for terms: Concepts
• classifying Concepts into related groups in a hierarchical tree
• defining the relationships found in the data: Properties
• specifying the permitted domain and range for these properties
• specifying which properties are allowed, which must always be true, and which are disallowed
[Figure: example terms grouped under CONCEPTS, SIMPLE RELATIONSHIPS and COMPLEX RELATIONSHIPS: transcribedFrom, Microsatellite, QuantitativeTrait, PartOf, identifier, hasAbbreviation, TechniqueUsed, Chromosome, DNADuplication, Orthology, Interval, Position, Mapping, Reciprocal BestMatch, GeneticLinkageMap]

Example: Modelling Maps
WHAT IS A MAP? Information about the presence and ordering of Markers on an abstract representation of a macromolecule (DNA Molecule, Chromosome or even a Polypeptide).
• ‘Linkage Group’: the simplest ‘Map’
  • a collection or set of markers that are inherited together, without implied order.
  • i.e. the relationship between a ‘Marker’ and a ‘Linkage Group’ is a ‘Containment’: the Linkage Group contains Markers.
• A true ‘Map’
  • has some sort of ordering of Markers belonging to a Linkage Group.
  • i.e. the relationship between a Marker and a Map is a ‘Mapping’ which has a ‘Position’. This Position may be purely ordinal, or may be co-ordinate and be associated with Scale Units. The Map maps Markers with a Position.
• A Map is a specialized type of Linkage Group

Modelling Maps
Workshop One distinguished two types of Maps:
• Physical Maps
• Probabilistic Maps
1. Physical Map
A map of the locations of identifiable landmarks on ‘DNA’ (e.g., restriction-enzyme cutting sites, genes), regardless of inheritance. At the highest resolution, distance is measured in base pairs; other units may be used.
For a given genome, the lowest-resolution physical map might be the banding patterns on the different chromosomes; the highest-resolution physical map of a DNA Molecule is its complete nucleotide sequence.
e.g. Contig Map; Cytogenetic Map; Breakpoint Map; Deletion Map; Fingerprint Map; Restriction Site Map; Sequence Map (Amino Acid, DNA, RNA)

Modelling Maps
2. Probabilistic Map
A map of the relative locations of markers on a chromosome, derived from an experimental analysis tracking the propensity of markers to be inherited together following natural or induced chromosomal disruption, i.e. based on some probabilistic measure of closeness.
e.g. Genetic Linkage Map; Meiotic Linkage Map; Radiation Hybrid Map; HAPPY Map
In addition we might represent
3. Integrated Map
A map combining mapping data from multiple map sources and experiments

The Importance of Relationships
Defining concepts is ‘easy’… ;-)
In many respects, defining concepts such as maps, genes, positions, chromosomes etc. to represent the species-specific maps in existing datasources is straightforward. This language defines the nuts and bolts used to represent and exchange the data between individual datasources.
However, some concepts are problematic, even within one datasource, e.g. what is meant by a ‘Marker’?
Even more complicated are the Relationships that we want to express between data, in different datasources and between different organisms. This represents the primary scientific challenge for ComparaGRID.
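The ‘nuts and bolts’ from the Modelling Maps slides can be sketched in code. The Python below is only an illustration under stated assumptions: the class names (Position, LinkageGroup, Map, PhysicalMap, ProbabilisticMap) are hypothetical, chosen to mirror the slide terms, and the marker names and positions are made up. It is not the ComparaGRID implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Position:
    """A location on a Map: purely ordinal, or co-ordinate with a scale unit."""
    value: float
    scale_unit: Optional[str] = None  # None means a purely ordinal position


@dataclass
class LinkageGroup:
    """Simplest 'map': a set of markers inherited together, no implied order."""
    name: str
    markers: set = field(default_factory=set)

    def contain(self, marker: str) -> None:
        # Containment relationship: the Linkage Group contains the Marker.
        self.markers.add(marker)


@dataclass
class Map(LinkageGroup):
    """A Map is a specialized Linkage Group whose markers also have Positions."""
    positions: dict = field(default_factory=dict)

    def map_marker(self, marker: str, position: Position) -> None:
        # Mapping relationship: containment plus a Position.
        self.contain(marker)
        self.positions[marker] = position

    def ordered_markers(self) -> list:
        # Marker names sorted by their position value along the map.
        return sorted(self.positions, key=lambda m: self.positions[m].value)


class PhysicalMap(Map):
    """Landmark locations on the molecule itself, e.g. in base pairs."""


class ProbabilisticMap(Map):
    """Relative locations based on a probabilistic measure of closeness."""


# Hypothetical data, for illustration only.
linkage_map = ProbabilisticMap("example_map")
linkage_map.map_marker("marker_b", Position(35.3, "centiMorgan"))
linkage_map.map_marker("marker_a", Position(12.0, "centiMorgan"))
print(linkage_map.ordered_markers())
```

Making Map a subclass of LinkageGroup means that anything written for a Linkage Group (such as asking which markers it contains) also works on a Map, which is one way to read ‘a Map is a specialized type of Linkage Group’.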
The Importance of Relationships
For example: a pig database records the mapping of some marker PIGA on a map at position SSC9: 30.1, and associates that marker observation with a technique (PCR) and some reagents (primers P1 and P2, with sequences S1 and S2).
[Figure: map SSC9, interval 30 to 31, with marker PIGA]
PIGA
  hasEvidence: PCRDetection
  hasReagent: Primer1 (with sequence S1), Primer2 (with sequence S2)

The Importance of Relationships
A cattle database records the mapping of a marker COWX on a map at position BTA4: 105.3, and associates that marker observation with a technique (PCR) and some reagents (primers P1 and P2, with sequences S1 and S2).
[Figure: map BTA4, interval 105 to 106, with marker COWX]
COWX
  hasEvidence: PCRDetection
  hasReagent: Primer1 (with sequence S1), Primer2 (with sequence S2)

The Importance of Relationships
• The pig primers P1 and P2, with sequences S1 and S2 (detecting Marker A), are identical to the cow primers P1 and P2, with sequences S1 and S2 (detecting Marker X).
• What can we say about the possible relationships between Marker A and Marker X?

The Importance of Relationships
What can we discover about the relationships between these mapping data? (And HOW can we discover any relationships between these data?)
• Can we draw any inference from the use of identical primer sequences and a similar detection technique?
• Does this imply a relationship between the cattle and pig markers?
• Does it imply homology?
• Is it evidence that they are or could be considered the ‘same’ marker?
• How good or reliable is any such inference?
• How can we represent different values/qualities of such inferences, to allow ‘weighting’ of evidence?
• How can we accumulate different strands of evidence to establish a real relationship between these markers and these regions of the two genomes?
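One way to approach the ‘HOW’ is to index marker observations by their evidence and look for cross-species matches. The sketch below assumes a hypothetical MarkerObservation record and a hypothetical candidate_links function; "S1" and "S2" stand in for real primer sequences. It only proposes candidate relationships; it does not answer whether a shared primer pair implies homology, nor how such evidence should be weighted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkerObservation:
    species: str
    marker: str
    map_name: str
    position: float
    technique: str
    primer_sequences: frozenset  # sequences of the primers used as reagents


def candidate_links(observations):
    """Pair observations from different species sharing technique and primers.

    Each returned (obs_a, obs_b) pair is a hypothesis to be weighed against
    other evidence, not an established relationship.
    """
    by_evidence = defaultdict(list)
    for obs in observations:
        by_evidence[(obs.technique, obs.primer_sequences)].append(obs)

    links = []
    for group in by_evidence.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if a.species != b.species:
                    links.append((a, b))
    return links


# The two records from the slides; "S1"/"S2" are placeholders for sequences.
pig = MarkerObservation("pig", "PIGA", "SSC9", 30.1,
                        "PCRDetection", frozenset({"S1", "S2"}))
cow = MarkerObservation("cattle", "COWX", "BTA4", 105.3,
                        "PCRDetection", frozenset({"S1", "S2"}))

links = candidate_links([pig, cow])
for a, b in links:
    print(f"candidate link: {a.marker} ({a.map_name}) ~ {b.marker} ({b.map_name})")
```

Keeping the result as a list of candidate pairs, without asserting that the markers are the ‘same’, leaves room to attach a weight or further strands of evidence to each pair later.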
ComparaGRID Ontology: Classification of Relationships
Some of the relationships that we want to capture in our data can be represented by simple ‘binary’ properties:
  Concept A   Property   Concept B
Example properties: hasPosition, hasScaleUnit, hasProduct, hasEvidence, hasPart, mappedOn, containedOn, hasMarker, hasValue, hasLatinName

Simple relationships can be represented as Properties:
  DomainConcept   property   DomainConcept
  TaxonConcept <Human>   hasKingdom   Kingdom <Animalia>
  DomainConcept   property   Value
  TaxonConcept <Human>   hasLatinName   <String of Characters> “Homo sapiens”

ComparaGRID Ontology: Classification of Relationships
Other relationships are more complicated, and might link multiple concepts and have properties attached to them. These are modelled as complex concepts so that we can represent more details about them:
Mapping, Synteny, Orthology, Paralogy, Containment, Similarity, TaxonomicIdentification, IsMapOf

More complex relationships are modelled as Concepts:
Simple Representation:
  Map (DomainConcept)   isMapOf (property)   Chromosome (DomainConcept)
Richer Representation:
  isMapOf (Unidirectional Relationship)
    relatesFrom (property): Map (DomainConcept)
    relatesTo (property): Chromosome (DomainConcept)
    hasEvidence (property): Citation (DomainConcept)
      identifier (property): <String of Characters> (Value)

Mapping is a type of Relationship
• hasMarker: any Concept that can experimentally be placed on a Map/LinkageGroup, e.g. Gene, Gene Product, Genetic Variation, QTL, Phenotype, STS, EST, SNP, nucleotide etc.
• isMapOf: you can make a map of any DomainConcept made of a biological informational macromolecule (DNA, RNA, Protein...)
[Figure: Relationship hierarchy with Containment and Mapping, drawn over DomainConcept, LinkageGroup, Map and HeritableStructure. Property labels: relatesFrom, relatesTo, hasMarker, containedOn, mappedOn, isMapOf, hasEvidence, hasPosition, hasValue, hasScaleUnit. Example values shown: MCW0010, RFxWL_Uppsala, Chromosome 1, LG 1, 35.300, centiMorgan, Evidence, Position, ScaleUnit]

What’s the point of all this ontological classification etc?
• A structured classification makes it easier for the human user to understand and navigate the terminology.
• The ‘meaning’ of terms is more precisely captured, as is how the terms relate to each other.
• We can see how terms used in different datasets relate to each other.
• We can integrate datasets that are described using this common vocabulary.
• We can link data and make inferences between species, based on formalised rules and conditions.
• Automatic classification and reasoning about data is feasible.
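The transitive synteny deduction from the ‘Formal Logics… Why?’ slide is one small instance of such automatic reasoning. The sketch below assumes that synteny is treated as both symmetric and transitive, so that asserted pairs can be merged into groups with a union-find structure. The function names are hypothetical, and a real ontology reasoner would derive this from the property definitions instead of hand-written code.

```python
def syntenic_groups(pairs):
    """Group genes connected by asserted synteny (symmetric and transitive)."""
    parent = {}

    def find(gene):
        # Return the representative of the group containing this gene.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return list(groups.values())


def is_syntenic(pairs, x, y):
    """True if x and y fall in the same group deduced from the asserted pairs."""
    return any(x in group and y in group for group in syntenic_groups(pairs))


# Asserted facts from the slide: A-B, B-C, C-D.
asserted = [("GeneA", "GeneB"), ("GeneB", "GeneC"), ("GeneC", "GeneD")]
print(is_syntenic(asserted, "GeneA", "GeneD"))
```

The pair (GeneA, GeneD) is never asserted directly; it follows only because the relationship has been declared transitive, which is the kind of rule the ontology is meant to formalise.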