Integrating Genomic Mapping Data using Ontologies: A View from the Closed World
Trevor Paterson and Andy Law (Roslin Institute, Scotland) and the ComparaGrid Consortium

Aims:
- To develop ‘enabling technologies’ for comparative genomics.
- To integrate disparate resources (genomic mapping, DNA sequence, evolutionary relationships, functional information) across species boundaries.
- To inform and expedite genomic mapping, particularly in non-model organisms.

Consortium Members:
Farm animal, crop and microbial genomics; Bioinformatics; Computer Sciences; Ontologists; Statisticians
• Roslin Institute: Dr. Andy Law (project co-ordinator), Dr. Trevor Paterson
• EBI: Dr. Peter Rice, Tony Burdett
• IFR: Dr. Ian Roberts
• JIC: Dr. Jo Dicks (RA Dr. Robert Davey)
• Manchester: Dr. Robert Stevens, Dr. Andrew Gibson
• Newcastle (Maths & Stats): Dr. Darren Wilkinson, Dr. Richard Boys (RA Dr. Madhuchhanda Bhattacharjee)
• Newcastle (Computing Science): Dr. Neil Wipat, Dr. Matthew Pocock, Professor Paul Watson
• SCRI: Dr. David Marshall

Biological Goal
To MAP, IDENTIFY and UNDERSTAND genes behind phenotypes (e.g. diseases & commercially important traits).
ComparaGrid aims to assist this process by exploiting existing genetic mapping data (held in multiple online databases and resources) across species boundaries.

UNDERLYING BIOLOGICAL PRINCIPLE BEHIND CROSS-SPECIES MAP COMPARISON
Conservation of Synteny: “Conservation of (blocks of) gene order throughout chromosomal evolution”
As species evolve and diverge, their chromosomes rearrange through duplications, inversions, translocations etc., but blocks of genes can be traced through evolutionary history between even relatively divergent species (e.g. chicken and man). Therefore the known gene order in these blocks in one species can inform/predict the order (and existence) of evolutionarily related genes (orthologues) in other species.
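As a rough illustration of conserved synteny, the following Python sketch collects runs of genes in one species whose recorded orthologues are immediate neighbours in the other species. The function name, the gene names and the strict adjacency rule are assumptions made for this example, not ComparaGrid code. It also breaks a run at any gene with no recorded orthologue, which real synteny analysis would usually tolerate.

```python
def conserved_blocks(order_a, order_b, orthologue_of, min_len=2):
    """Return runs of species A genes whose orthologues are adjacent in species B.

    order_a, order_b: gene names in chromosomal order (hypothetical data).
    orthologue_of: dict mapping a species A gene to its species B orthologue.
    A run continues while consecutive A genes map to neighbouring B positions
    in one consistent direction, so a simple inverted block is still found.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current, direction = [], [], 0
    for gene in order_a:
        b_gene = orthologue_of.get(gene)
        if b_gene is None or b_gene not in pos_b:
            # No recorded orthologue: close the current run (a simplification).
            if len(current) >= min_len:
                blocks.append(current)
            current, direction = [], 0
            continue
        if current:
            step = pos_b[b_gene] - pos_b[orthologue_of[current[-1]]]
            if abs(step) == 1 and direction in (0, step):
                current.append(gene)
                direction = step
                continue
            # Adjacency broken: store the finished run if it is long enough.
            if len(current) >= min_len:
                blocks.append(current)
        current, direction = [gene], 0
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


# Illustrative data: a1-a3 are inverted in species B, a4 has no known orthologue.
order_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
order_b = ["b3", "b2", "b1", "bX", "b5", "b6"]
orthologues = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5", "a6": "b6"}
print(conserved_blocks(order_a, order_b, orthologues))
```

In this toy data the unmapped gene a4 sits between two conserved blocks, which is the situation where a biologist would go looking for its orthologue near bX in the other species.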
Recognizing Orthology:
• gene sequence similarity → homology
• best sequence similarity + conserved position → orthology
• duplication in one species cf. another → paralogy

COMPARATIVE GENOMICS USE CASE
• Agribusiness wants to map the underlying genetic basis of the ‘Tasty Bacon’ Trait (a QTL).
• The ‘Tasty Bacon’ QTL has been genetically mapped (QTL (Genetic) Map).
• The position of the QTL is correlated on various types of Pig Genetic maps (QTL Map, Linkage Map, Radiation Hybrid Map).
• There is a ‘known’ homology between a Pig Marker/Sequence in this region and the human genome (DNA sequence similarity: homology? orthology?).
• A Physical Map of BAC clones exists for this region of the Human Genome.
• There are known chicken expressed sequences homologous to Human Gene Sequences in this region.
• Gene expression Data for these Chick ESTs might correlate with a trait similar to ‘Tastiness’.
• The literature may detail Functions of Human genes in this region, and homologies to genes in other species, helping the researcher predict candidate genes in Pigs responsible for tastiness.
[Figure: the ‘Tasty Bacon’ QTL linked across Pig, Human and Chicken data. Layers shown: QTL Map, Linkage Map, Radiation Hybrid Map, Cytogenetic Map, Physical Mapping (BAC1, BAC2, BAC3), EST Library (EST1, EST2), Expression Analysis, Linked References.]

THE DATA: FACTS AND ASSERTIONS
- Maps (positions of 'markers' on representations of chromosomes) e.g.
genetic linkage, radiation hybrid, linkage association, cytogenetic, QTLs etc.
- Genomic Sequences for model organisms, with annotations including gene positions and structures
- DNA Sequence Databases (for genes, clones etc.)
- Gene and Protein Function Databases
- Gene Family and Homology Databases (including Gene Orthology Databases)

WE ONLY RECORD POSITIVE DATA!
RDBMS data query works on the Closed World Assumption.
The scientist has to decide whether absence of data means NOT or UNKNOWN!

Assumptions: Interpreting Data
• (Almost) all our data is recorded in RDBMS
• (Almost) all our data is 'positive'
– (A maps on X, A is related to B, ...)
• Negative data is rarely explicitly recorded
– (A does not map on X, A is not related to B, ...)
• Some (lots) of our data is contradictory
– (A maps on X at position (x,y), A maps on X at position (p,q), A maps on Y at position (s,t), ...)

Linking the Data: Traversing Relationships Between Maps and Species
Can we create a system that integrates and queries across datasources implementing the reasoning of an expert scientist?
• Expert biological reasoning
– Currently performed manually/in cerebro by the expert (difficult, error-prone, inefficient)
– Integrating explicit facts and recorded assertions and relationships
– Accounting for the implicit assumptions of the knowledge domain
– Relying on 'working' assumptions that are implicit, and dependent on the data values and the amount of data available
– Relies on interpreting and extrapolating the relationships and gaps in the data/relationships manually e.g.
interpreting what is implied by the nonexistence of information

Linking the Data: Traversing Relationships Between Species
Consider maps in two species: speciesA and speciesB.
[Figure: maps of speciesA and speciesB carrying the markers act2, act6, mys, mysA, mysB, pseudo-mys, orf231, trp and tyr.]
Various datasources may record various facts and assertions relating 'markers' in speciesA with speciesB:
• A-act2 is an orthologue of B-act2
• B-mys and B-pseudo-mys are homologous
• the sequence of A-act6 is similar to B-act6 and B-act2
• A-mysA and A-mysB are paralogous
• the sequences of A-mysB, B-mys and B-pseudo-mys are similar
• the sequences of A-orf231 and B-trp are related
• Clearly not ALL the possible relationships between the markers are recorded (or known), but we don't even know if they have been investigated or not.
• There are some necessarily true deductions that we can make, and some possibly true deductions.
• For example, is A-trp related to B-trp? Does absence of information imply that it isn't (Closed World) or allow that it might be (Open World)?

“The message is that there are no knowns. There are things we know that we know. There are known unknowns, that is to say there are things we now know we don’t know. But there are also unknown unknowns - things we do not know we don’t know.” Donald Rumsfeld

The interpretation of Data/Assertions (i.e. our assumptions) might also change according to the particular data values.
• Pig Map 1: -A-B-C-D-E-F-
• Pig Map 2: -A-B-C-D------(----)-----E-F-
• 3 markers on the Pig Map are orthologous to markers on Human Map HSA3: -A-B-D-
• 1 marker on our Pig Map is orthologous to a marker on HSA1: -F-
• where are the orthologues of C and E in HSA (if they exist!)?
Given Pig Map 1:
• predict C might lie in the ABD cluster on HSA3
• E might be linked to the ABD cluster or be linked to F on HSA1
– (or maybe elsewhere...)
Given Pig Map 2 (a large gap between D and E):
• still predict C might lie in the ABD cluster on HSA3
• we might now predict that E is more likely to be linked to F on HSA1

The interpretation of Assertions (i.e.
our assumptions) changes according to context
• In the preceding examples, using the data in our RDBMS (Closed World) we can discover conserved chromosomal regions and reason that the syntenic conservation between speciesA and speciesB may include other (unknown) genes.
• But nothing in those statements precludes any other evolutionary relationships also being true (i.e. they are just unknown or unrecorded).
• If we transfer the knowledge to open world reasoning, how do we exclude potential other truths? Or how do we restrict which possible other truths we wish to postulate/test?
• A biologist would interpret the available assertions and alter their assumptions based on the available 'supporting evidence' (the amount, type and quality of assertions/evidence recorded)
• e.g. they might choose to ignore possible (open world) interpretations (Occam's Razor) and might in some cases assume unknown knowledge is 'negative', but they must also account for the reality that they do not have all the information: they are working in an open world system!
• How can we mimic this using open world first order logic reasoning (with OWL-DL)?

Linking the Data
Can we create a system that integrates and queries across datasources implementing the reasoning of an expert scientist?
• Data Requirements
– Need for standard representation of data
– Need for explicit and standardised 'relationships' between data
– Explicit assertions, with associated values to 'weight' these assertions
  • evidence
  • quality
  • provenance
• Requirements for Automated Reasoning
– capturing/understanding the supporting 'background' assumptions
– what types of assertions are there?
– what types of (implicit) assumptions are there in the domain?
– how do we currently use these in cerebro?
– how do we capture the semantics of these assertions and assumptions to facilitate automated reasoning?

AUTOMATING COMPARATIVE GENOMICS
• Provide Architecture to Link, Integrate and Traverse Data Sources: GRID/Web-services
• Provide Data Standards to allow this: Syntax and Semantics of Data
• Formalise the Links between Data (e.g. sequence similarities, evolutionary hypotheses)
– these relationships and assertions are Data too
– these are 'more interesting' to Biologists than mere 'facts'
• Is there an important conceptual difference between:
– the ‘nuts and bolts’ relationships within the data (‘EXPERIMENTAL OBSERVATIONS’ and ‘FACTS’)
– versus the biological hypotheses (‘ASSERTIONS’)?

ComparaGrid aims to allow integration across datasources
– using a common representation of the data types
– using a shared Domain Ontology (in OWL-DL) to represent the data classes and relationships.
The Domain Ontology represents a consistent open world model of the classes and of the potential restrictions and relationships on and between classes.
Representing data as instances of classes of the shared Domain Ontology opens up the possibility of performing queries and data retrieval using Reasoning Engines.
However, there are potential pitfalls in translating data from a closed world representation (RDBMS) to an open world representation (OWL-DL).
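The pitfall can be made concrete with a small Python sketch. The same store of positive assertions answers the earlier question "is A-trp related to B-trp?" differently depending on how silence is read. The relation names, the symmetric treatment of relations and both query functions are hypothetical; the sketch imitates neither SQL nor an OWL-DL reasoner, only the two readings of a missing fact.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Positive assertions only, as exported from a typical RDBMS (illustrative data).
ASSERTIONS = {
    ("A-act2", "orthologue_of", "B-act2"),
    ("A-mysA", "paralogue_of", "A-mysB"),
    ("A-orf231", "related_to", "B-trp"),
}


def holds(subject, relation, obj, facts=ASSERTIONS):
    # Every relation is treated as symmetric (an assumption of this sketch).
    return (subject, relation, obj) in facts or (obj, relation, subject) in facts


def ask_closed_world(subject, relation, obj):
    """Closed world: anything that is not recorded is taken to be false."""
    return Answer.TRUE if holds(subject, relation, obj) else Answer.FALSE


def ask_open_world(subject, relation, obj, negated=frozenset()):
    """Open world: only an explicit negation gives FALSE; silence is UNKNOWN."""
    if holds(subject, relation, obj):
        return Answer.TRUE
    if holds(subject, relation, obj, facts=negated):
        return Answer.FALSE
    return Answer.UNKNOWN


print(ask_closed_world("A-trp", "related_to", "B-trp"))
print(ask_open_world("A-trp", "related_to", "B-trp"))
```

The closed world call reports FALSE and the open world call reports UNKNOWN for the same data. An open world FALSE requires someone to have recorded the negative statement, which the source databases rarely do.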
ComparaGRID Design
[Figure: ComparaGRID architecture. Components shown: USER (ontology query, data query); DISTRIBUTED DATABASES holding DATA with different SYNTAX, TERMINOLOGY and SEMANTICS (CLOSED WORLD); COMPARAGRID ONTOLOGY MAPPER; DATA AS COMPARAGRID OWL-DL ONTOLOGY with common SYNTAX, TERMINOLOGY and SEMANTICS (OPEN WORLD); COMPARAGRID DATA INTEGRATOR, which integrates, caches and reasons over data (SEMI-OPEN WORLD). Other labels: export, convert.]

Modelling the Domain and Representing Data
• The Domain Ontology seeks to model the knowledge domain
• Attempts to capture many explicit and implicit truths
• Some implicit truths/interpretations are more complex and context dependent
• Map data: Straightforward?
– if a marker is at one position on a map then it cannot be at another different position
– OWL exclusion on class: a fundamental property of a mapping
– (but we still have problems with contradictory data...)
• Relationship data: Interpretation is Complex and Contextual
– if a marker has an evolutionary relationship with another marker (in this or another species) that does not preclude it from having an evolutionary relationship with any other marker (in this or any other species)
– the database usually does not record negative statements
– OWL cannot exclude at class level
– humans may choose to exclude unless stated
– open world reasoning might require explicit negation, or anything might be true!
– can we 'fake' closing the open world by generating negations for assertions not explicitly stated?
  • but is this correct? (sometimes? not always?)
  • and is it tractable?
  • do we need extensions to OWL (epistemic operators)?

Genomic Information at Roslin
www.thearkdb.org
• multispecies mapping resource
• mapping data generated by a variety of experimental techniques
• by numerous laboratories
• stored in a relational database
• accessed through a Java webapp using an object-relational data model
• basically contains maps, 'markers', positions of markers on maps, and relationships between markers and other markers (e.g.
through related DNA sequences)

The ARK Relational Model
Stores all data as relationships between objects
Objects: typed objects
alias, assertion, cytogenetic_map, chromosome, clone, contig, sequence, gene, laboratory, linkage_map, map, marker, physical_map, probe, primer, publication, QTL, radiation_hybrid_map, external_sequence, species, marker_type, ...
Relationships: typed relationships (also objects): Object <typed relationship> Object
also_known_as, belongs_to_species, belongs_to_group, has_associated_sequence, has_primer, is_an_alias_of, is_id_for, belongs_to_container, is_type, maps, is_map_of, has_associated_gene, supported_by, ...

Developmental aims for ArkDB (Using the ComparaGrid Architecture)
• traverse relationships/assertions to join up data between maps and species
• allow intelligent reasoning over these relationships
• based on EVIDENCED assertions (ultimately sequence relationships)
• weight traversal of relationships on the quality and provenance of the evidence
• allow user guided data discovery (select which type of assertions to traverse)
• allow automatic data discovery

Developmental aims for ArkDB - Implications (Using the ComparaGrid Architecture)
• data in RDBMS: model approximates to RDF statements
• representing this data in OWL-DL for ComparaGRID
• mapping closed world to open world
• how does this affect our data export?
– e.g. representing 'null information'
• what is the implication for reasoning in an open world over data that comes from a closed world?
• can we 'close down' parts of the open world?
– is this possible generically
– or is it a run time option
– can we use epistemic operators
• how difficult is it to ask.....
– for genes known to be in some region or involved in some phenotype
– for genes that might be (or probably are) involved
– for genes definitely NOT involved

Conclusions
• We need a way to formalise/standardise the representation not only of the data in our RDBMS (e.g. positional mappings) but also of the relationships between objects (within and between species).
– OWL is good for modelling the Domain, and providing a standardized data representation
• We would like to 'reason' across our data
• We need to capture assumptions in our domain
– but these might change according to the context of the data
• Representing data as instances in an open world OWL-DL ontology allows open world reasoning
• This may cause us grief
– satisfying a query that has no known answer returns all possible answers?
– can we live with this?
– or can we replicate the power of 'human reasoning', capturing our context dependent assumptions (nonmonotonic logic?), e.g. how do we treat null data?
• Can it be done?......(over to you Matt)
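One hedged way to picture the three kinds of question posed before the conclusions (genes known to be involved, genes that might be, genes definitely NOT involved) is to store explicit negative assertions alongside the positive ones. The scientist can then declare, per phenotype, that the recorded data should be treated as complete. The Python sketch below uses invented gene and phenotype names; the `closed_for` argument is a stand-in for locally closing the world at run time, not an OWL-DL feature or an epistemic operator.

```python
def classify_genes(genes, phenotype, positive, negative, closed_for=frozenset()):
    """Partition genes into (known, possible, excluded) for one phenotype.

    positive / negative: sets of recorded (gene, phenotype) pairs.
    closed_for: phenotypes the user declares fully investigated, so that a
    missing record is read as 'not involved' (a local closed world choice).
    A contradictory pair present in both sets is reported as known here.
    """
    known, possible, excluded = [], [], []
    for gene in genes:
        key = (gene, phenotype)
        if key in positive:
            known.append(gene)
        elif key in negative or phenotype in closed_for:
            excluded.append(gene)
        else:
            possible.append(gene)  # open world: no record means unknown
    return known, possible, excluded


# Hypothetical records for the 'Tasty Bacon' use case.
genes = ["geneA", "geneB", "geneC"]
positive = {("geneA", "tasty_bacon")}
negative = {("geneB", "tasty_bacon")}

# Open world reading: geneC stays a candidate.
print(classify_genes(genes, "tasty_bacon", positive, negative))
# Locally closed reading: geneC is excluded because nothing was recorded.
print(classify_genes(genes, "tasty_bacon", positive, negative, closed_for={"tasty_bacon"}))
```

Only the second call lets absence of data exclude geneC, so the closed world assumption becomes a choice the user makes per query rather than a property of the store.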