Basic reading, writing and informatics skills for biomedical research
Segment 4. Other types of database and browser
24 August 2012 Ganesha Associates

Biological databases
• A database is an indexed collection of information.
• Some databases contain mainly text, but others contain image, sequence or structural data.
• A browser is a means of visualising this information and the relationships between data elements.
• There is a growing amount of information in publicly available databases.
• For example, in 2011 the Nucleic Acids Research journal's online Molecular Biology Database Collection listed 1380 databases.
• The National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) host some of the most important databases used for biomedical research.
• Wikipedia also contains a list of biological databases.
• Which databases are relevant to your project?

Data, data everywhere…
• “Rapid release of prepublication data has served the field of genomics well.”
• “With close to one million gene-expression data sets now in publicly accessible repositories, researchers can identify disease trends without ever having to enter a laboratory.”
• “Most researchers agree that open access to data is the scientific ideal, so what is stopping it happening [in other fields]?”
• “Earth scientists need better incentives, rewards and mechanisms to achieve free and open data exchange.”

The database problem
• Volume of digital data (both high throughput and text)
  – One second of HD video = 2000 pages of text
• Distributed systems and databases, lack of data standards, incompatible data formats
• Costs of creation, curation and maintenance
• Retrieval: semantic search, metadata, images…

The problem – biomedical research
[Figure: diagram of interlinked biomedical data resources and data types. Labels, in extraction order: Gene Expression Warehouse, OMIM, Disease, ExPASy SwissProt, PDB, ExPASy Enzyme, Protein, Enzyme, LocusLink, Affy Fragment, Known Gene, MGD, Sequence, Metabolite, SNP, SPAD, Sequence Cluster, NCBI dbSNP, Genbank, NMR, Pathway, UniGene, KEGG]

Cross-database search today – NCBI

The problem – healthcare
Journal of the American Medical Association (JAMA), Vol 284, No 4, July 26th 2000
• 12,000 deaths/year from unnecessary surgery
• 7,000 deaths/year from medication errors in hospitals
• 20,000 deaths/year from other errors in hospitals
• 80,000 deaths/year from infections in hospitals
• 106,000 deaths/year from non-error, adverse effects of medications
These total 225,000 deaths per year in the US from iatrogenic causes, which makes them the third leading cause of death. "Iatrogenic" is the term used when a patient dies as a direct result of treatment by a physician, whether from misdiagnosis of the ailment or from adverse reactions to the drugs used to treat the illness (drug reactions are the most common cause).

The problem – healthcare
• 17-year innovation adoption curve from discovery to accepted standards of practice
• Even if a standard is accepted, patients have a 50:50 chance of receiving appropriate care and a 5-10% probability of incurring a preventable, anticipatable adverse event
• Medical literature doubling every 19 years
  – Doubles every 22 months for AIDS care
• 2 million facts needed to practice
• Genomics and personalized medicine will increase the problem exponentially
• Typical drug order today with decision support accounts for, at best, age, weight, height, labs, other active meds, allergies, diagnoses

So how will we find things in databases?
• A search engine collects, indexes, parses, and stores data to facilitate fast and accurate information retrieval.
• Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics (statistics), informatics, physics and computer science.

Semantic levels
[Figure: semantic features (Definition, Synonyms, Classification (is_a), Properties (has_a), Other relations) set against types of vocabulary resource (Keywords, Dictionary, Controlled vocabulary, Thesaurus, Taxonomy, Ontology)]

The Gene Ontology organisation
• The objective of GO is to provide controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.
• These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.
• The controlled vocabularies of terms are structured to allow both attribution and querying to be at different levels of granularity.
• http://www.geneontology.org

An example of annotation
Mitochondrial P450 (CC24 PR01238; MITP450CC24)
• GO cellular component term: mitochondrial inner membrane; GO:0005743
• GO molecular function term: monooxygenase activity; GO:0004497
• GO biological process term: electron transport; GO:0006118

MicroArray data analysis with GO
[Figure: gene expression over time in attacked versus control samples, with gene groups labelled by GO term and pathway: Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes, Puparial adhesion, Molting cycle hemocyanin, Amino acid catabolism, Lipid metabolism, Peptidase activity, Protein catabolism]
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

GoPubMed
• GoPubMed is a knowledge-based search engine for biomedical texts. The Gene Ontology (GO) and Medical Subject Headings (MeSH) serve as a "table of contents" that structures the millions of articles in the MEDLINE database.
• GoPubMed is one of the first Web 2.0 search engines.
• The system was developed at the Technical University of Dresden by Michael Schroeder and his team and at Transinsight.
• http://www.gopubmed.org

Medline Cognition
Cognition's Semantic NLP understands:
• Word stems: the roots of words.
• Words and phrases, with the individual meanings of ambiguous words and phrases listed out.
• The morphological properties of each word or phrase, e.g., what type of plural it takes, what type of past tense, and how it combines with affixes like "re" and "ation".
• How to disambiguate word senses. This allows Cognition's technology to pick the correct meaning of an ambiguous word in context.
• The synonym relations between word meanings.
• The ontological relations between word meanings. One can think of this as a hierarchical grouping of meanings, or a gigantic "family tree of English" with mothers, daughters, and cousins.
• The syntactic and semantic properties of words. This is particularly useful with verbs: Cognition encodes the types of objects that different verb meanings can occur with.

iHOP
Information Hyperlinked over Proteins. iHOP provides the network of genes and proteins as a natural way of accessing the millions of abstracts in PubMed.

iHOP
• The minimal information view contains general information, like the symbol, name and organism of a gene. Moreover it provides:
  – Useful links to external resources (e.g. UniProt, NCBI, OMIM, etc.)
  – Links to other iHOP views on this gene
  – Homologues
• Other views contain all sentences found in the literature:
  – For the main gene of a page and other genes (gene B) which interact.
  – That mention the main gene together with relevant biomedical terms such as lymphoma.
• Sentences are ranked by significance, so that screening a few sentences is usually sufficient to gain an idea of a gene's function.

GenMAPP
• GenMAPP is a free computer application designed to visualize gene expression and other genomic data on maps representing biological pathways and groupings of genes.
• Integrated with GenMAPP are programs to perform a global analysis of gene expression or genomic data in the context of hundreds of pathway MAPPs and thousands of Gene Ontology terms.

Automatic rendering of pathway interactions

Other ways to search – BLAST, PubChem, UCSC Genome Browser
By sequence – BLAST:
>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGA
TAAGGACGGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA
CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCG
GGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGG
GGGGGCG
By structure – PubChem:

Example of BLAST search results

PC Compound Record

UCSC Genome Browser
• The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide.
• The Gene Sorter shows expression, homology and other information on groups of genes that can be related in many ways.
• Blat quickly maps your sequence to the genome.
• The Table Browser provides convenient access to the underlying database.
• VisiGene lets you browse through a large collection of in situ mouse and frog images to examine expression patterns.
• Genome Graphs allows you to upload and display genome-wide data sets.
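
Worked example: reading a FASTA record
The "By sequence – BLAST" slide shows a query in FASTA format: a header line beginning with ">" followed by one or more lines of sequence. The sketch below is one possible way to read such text before submitting it to a sequence search. It is illustrative only and not part of BLAST; the function names parse_fasta and gc_content and the toy record are assumptions made for this example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined and normalised to upper case
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gc_content(sequence):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Hypothetical toy record, used only to exercise the functions
example = """>toy_sequence illustrative only
GAATTC
CGGAAG
"""

for header, seq in parse_fasta(example):
    print(header, len(seq), gc_content(seq))
```

Pasting the slide's DinoDNA record into parse_fasta in place of the toy record would give its header and the joined sequence. Checking the parsed length against the range stated in the header (nt 1-1200) is a quick way to notice that the slide shows only the start of the sequence.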