Data Representation and the Role of Ontologies
PHAR 201/Bioinformatics I
Philip E. Bourne
Department of Pharmacology, UCSD
• Prerequisite reading: Genome Research (2001) 11:1425-1433
PHAR 201 Lecture 4, 2012

Consider this Course a Workflow in How You Will Handle Data (Regardless of Type) For the Rest of Your Lives
We Use Macromolecular Structure Data to Illustrate the Process And Hence Learn Structural Bioinformatics in the Process
Workflow figure (steps):
• Data in
• Recognize redundancy in the data
• Classify the data
• Understand the scope and complexity of the data
• Understand the methods to physically instantiate the model
• Analyze the data
• Understand the experiment to understand the errors
• Understand how to best represent (model) the data
• Discover new science from the data

Agenda
• Before there were ontologies there was mmCIF
• Briefly review the history of ontology development
• Review the Gene Ontology (GO)
  – Motivation
  – Features
  – Related research activities around GO

The PDB Format
• A full description is here
• It was designed around an 80 column punched card!
• It was designed to be human readable
• It is used by almost every piece of software that deals with structural data

The PDB Format - Records
• Every PDB file may be broken into a number of lines terminated by an end-of-line indicator. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator.
• Each line in the PDB file is self-identifying. The first six columns of every line contain a record name, left-justified and blank-filled. This must be an exact match to one of the stated record names.
• The PDB file may also be viewed as a collection of record types. Each record type consists of one or more lines.
• Each record type is further divided into fields.
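
The record rules above (80-column lines, record name in the first six columns, left-justified and blank-filled) can be sketched in a few lines of Python. This is a minimal illustration, not a PDB parser: the sample lines are made-up stand-ins rather than a real entry, and the function names `record_name` and `group_by_record_type` are hypothetical.

```python
# Made-up stand-in lines (not from a real PDB entry), padded to 80 columns
# as the format requires.
SAMPLE_LINES = [
    "HEADER    EXAMPLE ENTRY".ljust(80),
    "REMARK   3 FIRST EXAMPLE REMARK.".ljust(80),
    "REMARK   4 SECOND EXAMPLE REMARK.".ljust(80),
    "END".ljust(80),
]


def record_name(line: str) -> str:
    """Return the record name held in columns 1-6, with the blank fill removed."""
    return line[:6].rstrip()


def group_by_record_type(lines):
    """Collect lines into record types, keyed by record name."""
    groups = {}
    for line in lines:
        groups.setdefault(record_name(line), []).append(line)
    return groups


groups = group_by_record_type(SAMPLE_LINES)
for name, members in groups.items():
    print(name, len(members))
```

Slicing by column position rather than splitting on whitespace reflects the fixed-column design: fields are defined by where they sit on the line, not by separators.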
The PDB Format – An Example – The Header

The PDB Format – An Example – The Atomic Coordinates

The Description – Atom Records

What is Wrong with this Approach?
• The description and the data are separate
• Parsing is a nightmare – the most complex piece of code we have in our research laboratory probably remains the PDB parser
• There are no relationships between items of data
• Some data just cannot be parsed
• The fixed column format cannot represent some of today’s structures …

Structures are Spread Over Multiple Files – Most Users are Not Aware of this

PDB Format: Important Components of the Data are Lost to All But Humans
    REMARK   3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF
    REMARK   3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R
    REMARK   3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN
    REMARK   3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL
    REMARK   3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0
    REMARK   3 ANGSTROMS.
    REMARK   4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH
    REMARK   4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK
    REMARK   4 ENTRIES *2EBX*, *3EBX*). EA DIFFERS FROM EB BY A SINGLE
    REMARK   4 SUBSTITUTION - EA ASN 26 FOR EB HIS 26. THE EA STARTING
    REMARK   4 MODEL WAS OBTAINED FROM A MOLECULAR REPLACEMENT STUDY IN
    REMARK   4 WHICH COORDINATES FOR 309 OF THE 475 ATOMS IN THE EB
    REMARK   4 STRUCTURE (*2EBX*) WERE USED.

mmCIF Was Developed to Address these Problems
Methods in Enzymology (1997) 277, 571-590

mmCIF – Scope of the Initial Effort
• All PDB data should be captured
• Describe a paper’s materials and methods section
• Describe the biologically active molecule
• Fully describe secondary structure but not tertiary or quaternary
• Describe details of chemistry (inc. 2D)
• Meaningful 3D views

mmCIF - Extract from a Data File
    loop_
    _atom_site.group_PDB
    _atom_site.type_symbol
    _atom_site.label_atom_id
    _atom_site.label_comp_id
    _atom_site.label_asym_id
    _atom_site.label_seq_id
    _atom_site.label_alt_id
    _atom_site.Cartn_x
    _atom_site.Cartn_y
    _atom_site.Cartn_z
    _atom_site.occupancy
    _atom_site.B_iso_or_equiv
    _atom_site.footnote_id
    _atom_site.entity_id
    _atom_site.entity_seq_num
    _atom_site.id
    ATOM N N  VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
    ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
    ATOM C C  VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3

mmCIF - Extract from the Dictionary
    save__atom_site.Cartn_x
        _item_description.description
    ;   The x atom site coordinate in angstroms specified according to
        a set of orthogonal Cartesian axes related to the cell axes as
        specified by the description given in
        _atom_sites.Cartn_transform_axes.
    ;
        _item.name                   '_atom_site.Cartn_x'
        _item.category_id            atom_site
        _item.mandatory_code         no
        _item_aliases.alias_name     '_atom_site_Cartn_x'
        _item_aliases.dictionary     cifdic.c94
        _item_aliases.version        2.0
        loop_
        _item_dependent.dependent_name
            '_atom_site.Cartn_y'
            '_atom_site.Cartn_z'
        _item_related.related_name   '_atom_site.Cartn_x_esd'
        _item_related.function_code  associated_esd
        _item_sub_category.id        cartesian_coordinate
        _item_type.code              float
        _item_type_conditions.code   esd
        _item_units.code             angstroms

Summary
• mmCIF has provided the PDB with a robust data representation which serves as the conceptual and physical schema upon which the current RCSB, PDBe and PDBj are built
• This work predated XML and XML Schema but embodies the important concepts inherent in these descriptions
• mmCIF was later converted exactly into XML; the XML form is now used more than mmCIF itself, but much less than the old PDB format
• PDB format will be phased out over a period of years

Agenda
• Before there were ontologies there was mmCIF
• Briefly review the history of ontology development
• Review the Gene Ontology (GO)
  – Motivation
  – Features
  – Related research activities around GO

Formal Definitions Taken from Knowledge Engineering …
1. A systematic account of existence.
2. (From philosophy) An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
3. For AI systems, what "exists" is that which can be represented. When the knowledge about a domain is represented in a declarative language, the set of objects that can be represented is called the universe of discourse. We can describe the ontology of a program by defining a set of representational terms. Definitions associate the names of entities in the universe of discourse (e.g. classes, relations, functions or other objects) with human-readable text describing what the names mean, and formal axioms that constrain the interpretation and well-formed use of these terms. Formally, an ontology is the statement of a logical theory.

Formal Definitions Taken from Knowledge Engineering Continued
4. A set of agents that share the same ontology will be able to communicate about a domain of discourse without necessarily operating on a globally shared theory. We say that an agent commits to an ontology if its observable actions are consistent with the definitions in the ontology. The idea of ontological commitment is based on the Knowledge-Level perspective.
5. The hierarchical structuring of knowledge about things by subcategorizing them according to their essential (or at least relevant and/or cognitive) qualities. See subject index. This is an extension of the previous senses of "ontology" (above) which has become common in discussions about the difficulty of maintaining subject indices.
We will not focus too much on the formal definitions
But more on how these formal concepts have been applied to biology

The History of Ontologies from a Biological Perspective …
• Early biological database efforts (1990s) adopted knowledge bases as a model, e.g. RiboWeb
• They used the products from the AI community, e.g. Ontolingua
• Some of the concepts of knowledge bases remain, notably ontologies, but they are now mostly cast in more familiar commercial frameworks, e.g. relational databases

The History of Ontologies from a Biological Perspective Continued
• The biological community in general was slow to see the value
• The medical informatics community adopted ontologies early
• Late 1990s: database providers in particular began to work together, the Gene Ontology (GO) being a major product of this effort
• 1998-2004: ontologies were the rage, warranted their own session at bioinformatics meetings and came to be taken seriously by the biological community
• 2004 onward: accepted as part of biological data representation and use

The History of Ontologies from a Biological Perspective Continued
• Centers established to support the maintenance of ontologies:
  – The Open Biomedical Ontologies (OBO) Foundry
  – National Center for Biomedical Ontology (BioPortal 2.0)

What Isn’t An Ontology?
• A database or program
  – Because they share internal formats only; it is not global
• A table of contents
  – Because it is not a formal representation of the concepts
• A terminology (aka controlled vocabulary)
  – Because it is a set of terms without a formal structure of how they relate

Examples of Valuable Terminologies (Controlled Vocabularies) That Are Not Ontologies
• ICD-9 for diseases
• SNOMED/RCD codes for symptoms
• EC Numbers (?)
• Taxonomy
• SMILES strings

Ontology As Language
• The ontology becomes the language of the domain it describes
• The language = syntax + semantics
• While that language must be understood by computers, human readability counts

Ontology as Contract
    Purposes of Ontologies         Parties to the contract
    data exchange                  programmers
    unification/translation        data admins
    calling knowledge services     programmers, netbots
    representing theories          scientists
    human communication            collaborators

Ontology Specifications
• XML – provides a syntax for structured documents
• XML Schema – a language for structuring XML documents and adding data types
• RDF – a data model for objects and the relations between them, which can be represented in XML
• RDF Schema – describes properties and classes of RDF resources, with semantics to generalize
• OWL 2 – Web Ontology Language – adds more vocabulary, particularly for relationships between classes (e.g. disjointness, cardinality)

Here is Another One..
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2010-09-22_colored.html

References: Ontologies in Bioinformatics
• Bio-ontologies workshops since 1997
• Historical papers on knowledge sharing
• mmCIF as an ontology - Westbrook and Bourne (2000) Bioinformatics 16(2) 159-168
• Review 2006 – Bodenreider and Stevens, Briefings in Bioinformatics

Agenda
• Before there were ontologies there was mmCIF
• Briefly review the history of ontology development
• Review the Gene Ontology (GO)
  – Motivation
  – Features
  – Related research activities around GO

References
• GO Itself - Creating the Gene Ontology Resource: Design and Implementation, Genome Research (2001) 11:1425-1433
• Nucleic Acids Res. 2010 Jan;38(Database issue):D331-5. Epub 2009 Nov 17.
• The GO Website - http://www.geneontology.org
• Application of GO – The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro, Genome Res. 2003 Apr;13(4):662-72. Epub 2003 Mar 12

Brief History
• Started by Saccharomyces Genome Database, FlyBase and the Mouse Genome Database
• Grown to a consortium of members (see here)

Roles of the GO Consortium
• Write and maintain the ontologies themselves
• Associate the ontologies to genes in the respective databases of members
• Provide tools to facilitate the development and maintenance of ontologies

Gene Ontology (GO)
http://www.geneontology.org/
• Three levels of annotation:
  – Molecular function - what a gene product does at the biochemical level
  – Biological process - a broad biological perspective; not currently a pathway (no dynamics or dependencies)
  – Cellular component - location within cellular structures (e.g. Golgi apparatus) and macromolecular complexes (e.g. ribosome)

GO Goals
From Genome Res 2001 Aug;11(8):1425-33

Structure of GO: Directed Acyclic Graph (DAG)
Example from molecular function:
• Parents: Transmembrane receptor; Protein tyrosine kinase
• Child: Transmembrane receptor tyrosine protein kinase (is_a each parent)

Structure of GO: Directed Acyclic Graph (DAG)
Relationship of Child to Parent
• is_a – the child represents an instance of the parent: a mitotic chromosome is_a chromosome
• part_of – the child is a component of the parent: a telomere is part_of a chromosome

Example - Molecular Function

Example - Biological Process

Example - Cellular Location

Use of GO within the PDB
http://pdb.rcsb.org

Use of GO Within the Open Literature

Some Issues
– Levels of granularity
– Species specificity
  • Chitin metabolism is part of cuticle synthesis in fly
  • Chitin metabolism is part of cell wall organization in yeast

Some Issues
• GO is dynamic – parent-child relationships can change
• When does a process begin and end?
• is_a and part_of are not always clear – is the actin cytoskeleton is_a cytoskeleton or part_of cytoskeleton?
• A community effort

Relationship to Gene Products
• A gene product is a protein or functional RNA
• A gene product may have more than one function and therefore be related to multiple GO terms
• The name of a gene product may only reflect one of its functions

GO is Really 3 Independent Ontologies
• Annotation of a gene product by one ontology is independent of its annotation by another ontology
• Example: products of the MDH1, MDH2 and MDH3 genes are all isoforms of malate dehydrogenase in yeast with the same function, but localize to different cellular locations and are involved in different biochemical processes

Evidence Codes
• The evidence for assigning a gene product to a GO term itself has a controlled vocabulary

Research Applications of GO
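
The is_a and part_of relationships from the GO structure slides can be sketched as a small directed acyclic graph. The following is a minimal illustration using only the example terms from the slides. The dictionary layout and the `ancestors` function are assumptions made for this sketch, not the GO file format or any GO tool's API, and terms from different GO ontologies are placed in one dictionary purely for brevity.

```python
# Hypothetical mini-graph: child term -> list of (relationship, parent) edges.
# A term may have more than one parent, which is why GO is a DAG, not a tree.
EDGES = {
    "mitotic chromosome": [("is_a", "chromosome")],
    "telomere": [("part_of", "chromosome")],
    "transmembrane receptor tyrosine protein kinase": [
        ("is_a", "transmembrane receptor"),
        ("is_a", "protein tyrosine kinase"),
    ],
}


def ancestors(term, edges=EDGES):
    """Return every term reachable from `term` by following parent edges."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in edges.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


for term in EDGES:
    print(term, "->", sorted(ancestors(term)))
```

Because GO is dynamic and parent-child relationships can change, storing the edges as data and computing ancestors on demand keeps a sketch like this independent of any one snapshot of the ontology.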