The OBO Foundry
A Gold Standard Approach to Ontology Evaluation
Barry Smith
http://ontology.buffalo.edu/smith
http://ontologist.com

Two types of ontology
natural-science ontologies capture terminology-level knowledge underlying the best current science
contrasted with administrative ontologies (e.g. billing ontologies, blood bank ontologies, lab workflow ontologies) prepared for specific, local purposes

scientific ontologies have special features
Every term in a scientific ontology must be such that the developers of the ontology believe it to refer to some entity on the basis of the best current evidence
scientific ontologies are realism-based

For scientific ontologies
reusability is crucial
compatibility with neighboring scientific ontologies
it is generalizations that are important = universals, types, kinds

An ontology is a representation of universals
We learn about universals in reality from looking at the results of scientific experiments in the form of scientific theories
experiments relate to what is particular
science describes what is general

what is the difference between an ontology and a scientific theory?
an ontology is also a terminological standardization
WHAT DOES THIS MEAN?

1st aspect: additivity
cell = def. plant cell, consisting of protoplast and cell wall; ... [Plant Ontology]
what happens when the users of the Plant Ontology need to consider bacterial pathogens in plants?
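One way to read the additivity requirement: two ontologies are additive when their terms can be pooled without a single label ending up with incompatible definitions. The sketch below is a hypothetical illustration, not part of any OBO Foundry tooling. The dictionaries, the second definition of "cell", and the function name find_label_conflicts are all assumptions made for the example.

```python
def find_label_conflicts(ontology_a, ontology_b):
    """Return labels defined in both ontologies with different definitions."""
    conflicts = {}
    for label, definition in ontology_a.items():
        other = ontology_b.get(label)
        if other is not None and other != definition:
            conflicts[label] = (definition, other)
    return conflicts


# Abbreviated from the Plant Ontology definition quoted above.
plant_terms = {"cell": "plant cell, consisting of protoplast and cell wall"}
# Hypothetical entry standing in for a neighboring ontology that covers bacteria.
microbe_terms = {"cell": "basic structural unit of an organism"}

conflicts = find_label_conflicts(plant_terms, microbe_terms)
print(sorted(conflicts))
```

A non-empty result signals that the two term sets cannot simply be added together: the narrow plant-specific definition of "cell" would have to be revised before bacterial cells can be annotated with it.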
2nd aspect: calibration with reality
gold standard kilogram
the same universal is defined by reference either to some artifact or to some universal physical constant (for realists there is no problem here)

VIM: the International Vocabulary of Metrology
(i) repeated measurements always give rise to some variation in values,
(ii) one can never be sure (fallibilism) that one has got the true value,
Hence: (iii) there are no true values.
To keep happy those who dismiss the notion of the true value, the international community is agreeing to a set of terms which intentionally allow two possible interpretations
once again: bad philosophy leads to bad standards
Compare: http://ontology.buffalo.edu/medo/Wuesteria.pdf

from: The NIST Reference on Constants, Units and Uncertainty
The creation of the decimal Metric System at the time of the French Revolution and the subsequent deposition of two platinum standards representing the meter and the kilogram, on 22 June 1799, in the Archives de la République in Paris can be seen as the first step in the development of the present International System of Units.

from: The NIST Reference on Constants, Units and Uncertainty
In the 1860s Maxwell and Thomson ‘formulated the requirement for a coherent system of units with base units and derived units. In 1874 the British Association for the Advancement of Science introduced the CGS system, a three-dimensional coherent unit system based on the three mechanical units centimeter, gram and second, using prefixes ranging from micro to mega to express decimal submultiples and multiples. The following development of physics as an experimental science was largely based on this system.’

Base and Derived Units
Units based on undefined SI dimensions: meter, second, kilogram, ampere, candela, kelvin, mole.
Units based on defined SI dimensions: volume, area, velocity, acceleration, newton, joule, pascal, coulomb, farad, henry, hertz, lumen, lux, ohm, etc.
Dimensions can be multiplied and divided (meters/second).

The SI System of Units is a qualitative ontology:
it captures qualitative dimensions of reality to which quantities can be applied (it captures measurable dimensions of reality)
there is a degree of conventionality in the choice of basic vs. derived units, and in the standard [e.g. the Paris meter] that is used to define the unit in each dimension

but the dimensions themselves exist independently of our conventions
so that an ontology of these dimensions is a true representation of an independently existing reality

Quantities are Universals
Ingvar Johansson:
Many different things can simultaneously have a mass of 5kg (length of 4m, etc.).
Determinate quantities are universals, which means that they have many instances

Units Ontology
developed in conjunction with PATO, the Phenotypic qualities ontology
obo.sourceforge.net/cgi-bin/detail.cgi?quality

fiat subtypes of qualities
[Diagram labels: quality, spatial quality, length, weight, temperature, …, 1mm, 1cm, 1g, 1kg; relation: is_a]

Representation of measurements
[Diagram labels: quality, unit, spatial quality, length, weight, temperature, mm, cm, kg, g; relations: is_a, measurement_of]

Ingvar Johansson:
(a) no object can possibly at one and the same time take two values of the same quantity dimension
(b) in case of additive quantities, only quantities of the same dimension can be added together to give rise to a sum:
no material object can have two masses, and masses can only be added to other masses

Controlled vocabulary
Each SI unit is represented by a symbol, not an abbreviation. The use of unit symbols is regulated by precise rules.
These symbols are the same in every language of the world, even though the names of the units themselves vary in spelling according to national conventions.

The SI system of units gives you:
a gold standard controlled vocabulary for the expression of scientific results which makes these results comparable and integratable:
my hypotheses can be checked against your data
my measuring equipment can be calibrated against your measuring equipment (because each can be calibrated against the same gold standard)
the SI system of units can serve as a gold standard because it is a true reflection of an independent reality

a system of units is a legend for measurement data
[Figure labels: heartrate, speed, cadence, torque, power]

compare: legends for maps

Creating a system of units is not easy;
it has to match the way the measurable dimensions are interconnected in reality
it may need to be revised in light of new discoveries about how reality is structured

after Maxwell and Thomson
the subsequent development of physics as an experimental science was largely based on their system of standardized units.

analogous achievements
also in chemistry: IUPAC InChI
and in molecular biology, for proteins, enzymes, genes, etc.: IUBMB, HUGO Gene Nomenclature Committee, etc.

Periodic Table

the goal of realist ontology
to generalize this achievement, specifically in biology and in medicine (where forces are at work which tend to thwart standardization of vocabulary)
to move from standardizations of nouns to standardizations of sentences

gene expression data
realist ontologies are legends for data
where in the body?
in what kind of cell?
what kind of disease process?
need for semantic annotation of data

the Gene Ontology is already a de facto standard

natural language labels organized in a graph-theoretic structure, designed to make the data
cognitively accessible to human beings
algorithmically accessible to machines
linked up to other data resources because the same labels have been used

compare: legends for cartoons (for diagrams in scientific texts)

ontologies are legends for mathematical equations
x_i = vector of measurements of gene i
k = the state of the gene (as “on” or “off”)
θ_i = set of parameters of the Gaussian model
...

or chemistry diagrams
Prasanna, et al. Chemical Compound Navigator: A Web-Based Chem-BLAST, Chemical Taxonomy-Based Search Engine for Browsing Compounds. PROTEINS: Structure, Function, and Bioinformatics 63:907–917 (2006)

annotation using common ontologies yields integration of databases
[Figure labels: GlyProt, MouseEcotope, Holliday junction helicase complex, DiabetInGene, GluChem]

What is mapping (1)
“Given two ontologies A and B, mapping one ontology with another means that for each concept (node) in ontology A, we try to find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse.”
M. Ehrig and Y. Sure, Ontology mapping - an integrated approach. In Proceedings of the First European Semantic Web Symposium, ESWS 2004, volume 3053 of Lecture Notes in Computer Science, pages 76–91, Heraklion, Greece, May 2004. Springer Verlag.

What is mapping (2)
“the task of relating the vocabulary of two ontologies in such a way that the mathematical structure of ontological signatures and their intended interpretations, as specified by the ontological axioms, are respected”.
[ontological signature = a hierarchy of concept symbols together with a set of relation symbols whose arguments are defined over the concepts of the concept hierarchy]
Y. Kalfoglou and M. Schorlemmer, Ontology mapping: the state of the art. Knowl. Eng. Rev., 18(1): 2003.

What is mapping (3)
“a formal expression that states the semantic relation between two entities belonging to different ontologies”,
“Simple examples are: concept c1 in ontology O1 is equivalent to concept c2 in ontology O2; concept c1 in ontology O1 is similar to concept c2 in ontology O2; individual i1 in ontology O1 is the same as individual i2 in ontology O2”
P. Bouquet et al. KnowledgeWeb deliverable D2.2.1. Specification of a common framework for characterizing alignment.

One way to support ontology matching (and evaluation)
have experts manually prepare for each given matching problem a gold standard to which matching efforts could be compared.
M. Ehrig and J. Euzenat, Relaxed Precision and Recall for Ontology Matching, in: Proc. K-Cap 2005 workshop on Integrating ontology, Banff (CA), p. 25-32, 2005.

Gold standard methodology for ontology evaluation
is very expensive
who are the experts?
sometimes cannot be done for political reasons
• UMLS metathesaurus
even a gold standard can contain errors

Solution: The OBO Foundry
1. some large pieces already exist (especially Gene Ontology, Foundational Model of Anatomy)
2. processes of unification and reform already in place
3. all participants aiming for additivity
4. procedures for constant update in light of scientific advance
http://obofoundry.org

The GO methodology of annotations
science basis of the GO: trained experts curating peer-reviewed literature
RESULT: a slowly growing computer-interpretable map of biological reality within which major databases are automatically integrated in semantically searchable form
Contrast: data-mining based approaches to ontology construction

Systematic annotation of references to gene products in literature
• leads to improvements and extensions of the ontology
• leads to better annotations
• leads to a virtuous cycle of improvement in the quality and reach of both future annotations and the ontology itself

Five bangs for your GO buck
science base
cross-species database integration
cross-granularity database integration
through links to the entities in biological reality
semantic searchability links people to software

First step (2003)
a shared portal for (so far) 58 ontologies (low regimentation)
http://obo.sourceforge.net
NCBO BioPortal

Second step (2004)
reform efforts initiated, e.g. linking GO to other OBO ontologies to ensure orthogonality

GO

id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone."
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375

Osteoblast differentiation: Processes whereby an osteoprogenitor cell or a cranial neural crest cell acquires the specialized features of an osteoblast, a bone-forming cell which secretes extracellular matrix.
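The tag-value layout of the osteoblast entry above lends itself to simple machine processing. A minimal sketch, assuming each tag sits on its own line in `tag: value` form; parse_stanza is a hypothetical helper written for this illustration, not a function of any OBO library, and the definition text is shortened.

```python
from collections import defaultdict

STANZA = """\
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix."
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
"""


def parse_stanza(text):
    """Collect tag -> list of values; tags such as relationship may repeat."""
    tags = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at the first ": " only, so identifiers like CL:0000062 survive.
        tag, _, value = line.partition(": ")
        tags[tag].append(value.strip())
    return dict(tags)


term = parse_stanza(STANZA)
# Keep only the target identifiers of develops_from relationships.
develops_from = [value.split()[1] for value in term["relationship"]
                 if value.startswith("develops_from ")]
print(term["id"][0], term["is_a"], develops_from)
```

Once the cell type is available in this form, a definition such as the one for osteoblast differentiation can point to CL:0000062 by identifier rather than repeating free text.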
+ Cell type = New Definition

Third step (2006)
The OBO Foundry
http://obofoundry.org/

A prospective standard
designed to guarantee interoperability of ontologies from the very start (contrast to: post hoc mapping)
established March 2006
12 initial candidate OBO ontologies, focused primarily on basic science domains
several being constructed ab initio
by influential consortia who have the authority to impose their use on large parts of the relevant communities.

The OBO Foundry
undergoing rigorous reform:
GO Gene Ontology
ChEBI Chemical Ontology
CL Cell Ontology
FMA Foundational Model of Anatomy
PaTO Phenotype Quality Ontology
SO Sequence Ontology
new:
CARO Common Anatomy Reference Ontology
CTO Clinical Trial Ontology
FuGO Functional Genomics Investigation Ontology
PrO Protein Ontology
RnaO RNA Ontology
RO Relation Ontology

Ontology | Scope | URL | Custodians
Cell Ontology (CL) | cell types from prokaryotes to mammals | obo.sourceforge.net/cgi-bin/detail.cgi?cell | Jonathan Bard, Michael Ashburner, Oliver Hofman
Chemical Entities of Biological Interest (ChEBI) | molecular entities | ebi.ac.uk/chebi | Paula Dematos, Rafael Alcantara
Common Anatomy Reference Ontology (CARO) | anatomical structures in human and model organisms | (under development) | Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland
Foundational Model of Anatomy (FMA) | structure of the human body | fma.biostr.washington.edu | JLV Mejino Jr., Cornelius Rosse
Functional Genomics Investigation Ontology (FuGO) | design, protocol, data instrumentation, and analysis | fugo.sf.net | FuGO Working Group
Gene Ontology (GO) | cellular components, molecular functions, biological processes | www.geneontology.org | Gene Ontology Consortium
Phenotypic Quality Ontology (PaTO) | qualities of anatomical structures | obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value | Michael Ashburner, Suzanna Lewis, Georgios Gkoutos
Protein Ontology (PrO) | protein types and modifications | (under development) | Protein Ontology Consortium
Relation Ontology (RO) | relations | obo.sf.net/relationship | Barry Smith, Chris Mungall
RNA Ontology (RnaO) | three-dimensional RNA structures | (under development) | RNA Ontology Consortium
Sequence Ontology (SO) | properties and features of nucleic sequences | song.sf.net | Karen Eilbeck

The OBO Foundry
GOALS
to provide a FRAMEWORK OF RULES to counteract the current policy of ad hoc creation of new ontologies by each clinical research group
REUSABILITY: if data-schemas are formulated using a single well-integrated framework ontology system in widespread use, then this data will to this degree itself become more widely accessible and usable

The OBO Foundry
GOALS
to serve as BENCHMARK FOR IMPROVEMENTS: once a system of interoperable reference ontologies is there, it will make sense to calibrate existing terminologies in its terms in order to achieve more robust alignment and greater domain coverage

Gold standard
Two aspects:
1. an expression of practice carried out perfectly (for example, the optimal therapy for a given medical problem)
2. based on complete acceptance or consensus: everyone qualified to render a judgement would agree to what the gold standard is.
Friedman CP, Wyatt J. Evaluation Methods in Medical Informatics

Gold standards are worth approximating. That is, “tarnished” or “fuzzy” standards are better than no standards at all. ... studies comparing the performance of information resources against imperfect standards, so long as the degree of imperfection has been estimated, represent a stronger approach than those that bypass the issue of a standard altogether.
Friedman CP, Wyatt J.
Evaluation Methods in Medical Informatics

Gold standards can also be partial:
to serve ontology matching and evaluation it is enough to have ontologies comprehending even selected aspects of biomedical reality, provided the assertions contained in these ontologies are universally true
in non-closed worlds, gold standards will always be partial
in complex disciplines gold standards will always be evolving

the constraint of universality
OBO Foundry ontologies accept only those relations between their terms which obtain universally (= for all instances)
lung is_a anatomical structure
lobe of lung part_of lung
Compare:
electrons have a negative electric charge
electrons have a negative electric charge of 1.6 × 10^-19 coulomb

Principle of Low Hanging Fruit
Ontologies should include even absolutely trivial assertions (assertions you know to be universally true)
herpes virus is_a virus
Computers need to be led by the hand

if the standard is to work it has to simulate the achievements of the SI system of units
• simple
• controlled vocabulary
• wide acceptance
• uncontroversial
• allows cross-disciplinary, cross-experimenter calibration
• my data can confirm or disconfirm your hypothesis
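The comparison of a matching effort against an expert-prepared gold standard, including a partial one, can be sketched with ordinary precision and recall. This is an illustration under stated assumptions, not a method taken from the talk: correspondences are modeled as (term in A, term in B) pairs, a partial gold standard is taken to judge only the terms of A that it actually covers, and every identifier below is invented for the example.

```python
def evaluate_alignment(candidate, gold):
    """Precision and recall of candidate pairs against a partial gold standard.

    Pairs whose source term the gold standard does not cover are left
    unjudged (an assumption of this sketch).
    """
    covered = {source for source, _ in gold}
    judged = {pair for pair in candidate if pair[0] in covered}
    correct = judged & gold
    precision = len(correct) / len(judged) if judged else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical identifiers; "A:" and "B:" stand for two ontologies.
gold = {("A:lung", "B:lung"), ("A:lobe_of_lung", "B:pulmonary_lobe")}
candidate = {
    ("A:lung", "B:lung"),              # agrees with the gold standard
    ("A:lobe_of_lung", "B:lung"),      # disagrees with the gold standard
    ("A:osteoblast", "B:osteoblast"),  # not covered, so not judged
}

precision, recall = evaluate_alignment(candidate, gold)
print(precision, recall)
```

Leaving uncovered pairs unjudged is one possible reading of a partial standard; counting them as errors would instead penalize a matcher for going beyond what the experts have reviewed.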