Basic reading, writing and informatics skills for biomedical research
Segment 4. Other types of database and browser
10 April 2008
Copyright: Ganesha Associates 2008

Biological databases
• A database is an indexed collection of information.
• Some databases contain mainly text, but others contain sequence or structural data.
• A browser is a means of visualising this information and the relationships between data elements.
• There is a growing amount of information in publicly available databases.
• Each year, the journal Nucleic Acids Research publishes a database issue. The 2007 issue lists 968 editorially selected biomolecular databases.

The database problem
• Volume of data (both high throughput and text)
• Complexity
• Distributed systems and databases
• Incompatible data formats
• Multi-disciplinary
• Multi-lingual
• Inability to share knowledge
• Ambiguity of terminology

The problem – biomedical research
[Figure: diagram relating biomedical data types to the separate databases that hold them. Database labels: Gene Expression Warehouse, OMIM, ExPASy SwissProt, PDB, ExPASy Enzyme, LocusLink, Affy Fragment, MGD, SPAD, NCBI dbSNP, GenBank, UniGene, KEGG. Data type labels: Disease, Protein, Enzyme, Known Gene, Sequence, Sequence Cluster, Metabolite, SNP, NMR, Pathway.]

The question – biomedical research

The problem – biomedical research

The problem - pharmabiotech

The problem - healthcare
• 17-year innovation adoption curve from discovery into accepted standards of practice
• Even if a standard is accepted, patients have a 50:50 chance of receiving appropriate care, and a 5-10% probability of incurring a preventable, anticipatable adverse event
• Medical literature doubling every 19 years
  – Doubles every 22 months for AIDS care
• 2 million facts needed to practice
• Genomics and Personalized Medicine will increase the problem exponentially
• A typical drug order today with decision support accounts for, at best, Age, Weight, Height, Labs, Other Active Meds, Allergies, Diagnoses

The problem - healthcare
Journal of the American Medical Association (JAMA), Vol 284, No 4, July 26th 2000
• 12,000 deaths/year from unnecessary surgery
• 7,000 deaths/year from medication errors in hospitals
• 20,000 deaths/year from other errors in hospitals
• 80,000 deaths/year from infections in hospitals
• 106,000 deaths/year from non-error, adverse effects of medications
These total 225,000 deaths per year in the US from iatrogenic causes, which ranks them as the third leading cause of death. Iatrogenic is a term used when a patient dies as a direct result of treatment by a physician, whether from misdiagnosis of the ailment or from adverse reactions to drugs used to treat the illness (drug reactions are the most common cause).

How do we find things in databases?
• Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
• Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics (statistics), informatics, physics and computer science.

Indexing
• The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
• Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power.
• For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
• The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

Inverted indexing
• An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents, in this case allowing full text search.
• There are two main variants of inverted indexes:
  – A record level inverted index contains a list of references to documents for each word.
  – A word level inverted index additionally contains the positions of each word within a document.
• The latter form offers more functionality (like phrase searches), but needs more time and space to be created.

Example
• Texts T0 = "it is what it is", T1 = "what is it" and T2 = "it is a banana" have the following inverted file index (where the integers in the brackets refer to the subscripts of T0, T1 etc.):
  – "a": {2}
  – "banana": {2}
  – "is": {0, 1, 2}
  – "it": {0, 1, 2}
  – "what": {0, 1}
• A search for the terms "what", "is" and "it" would give the set {0, 1}, the documents that contain all three terms.

Example (cont’d)
• In the full inverted index, where the pairs are document numbers and local word numbers, "banana": {(2, 3)} means the word "banana" is in the third document (T2), and it is the fourth word in that document (position 3):
  – "a": {(2, 2)}
  – "banana": {(2, 3)}
  – "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
  – "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
  – "what": {(0, 2), (1, 0)}
• A phrase search for "what is it" gets hits for all the words in both document 0 and document 1, but the terms occur consecutively only in document 1.
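
The two index variants in the example can be reproduced in a few lines of Python. This is only a minimal sketch under simple assumptions: words are split on whitespace, and the function names (build_record_index, build_word_index, search_all, search_phrase) are illustrative and do not come from any particular search engine.

```python
from collections import defaultdict

# The three example texts T0, T1 and T2 from the slide
texts = ["it is what it is", "what is it", "it is a banana"]


def build_record_index(docs):
    """Record level index: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return index


def build_word_index(docs):
    """Word level index: map each word to (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return index


def search_all(record_index, terms):
    """Return the documents that contain every term (set intersection)."""
    postings = [record_index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_phrase(word_index, phrase):
    """Return the documents in which the words of the phrase occur consecutively."""
    words = phrase.split()
    if not words:
        return set()
    # Every occurrence of the first word is a candidate start of the phrase
    candidates = set(word_index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        positions = set(word_index.get(word, []))
        # Keep a start only if the next word sits at the expected position
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in positions}
    return {doc_id for doc_id, _ in candidates}


record_index = build_record_index(texts)
word_index = build_word_index(texts)

print(search_all(record_index, ["what", "is", "it"]))  # {0, 1}
print(search_phrase(word_index, "what is it"))         # {1}
```

The phrase search needs the positions, which is why only the word level index can answer it; the record level index is smaller but can only say which documents contain the terms.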

Indexing algorithms
• Semantic
  – Stop words
  – Stemming
  – Synonyms
  – Thesauri
  – Ontologies
• Syntactic
  – Word order
  – Word type
  – Natural language processing
• Statistical
  – Word frequency
  – Word proximity

PubMed Related Articles Algorithm (I)
• The neighbors of a document are those documents in the database that are the most similar to it.
• The similarity between documents is measured by the words they have in common, with some adjustment for document lengths.
• To carry out such a program, one must first define what a word is.
• For us, a word is basically an unbroken string of letters and numerals with at least one letter of the alphabet in it.
• Words end at hyphens, spaces, new lines, and punctuation.
• A list of 310 common, but uninformative, words (also known as stopwords) is eliminated from processing at this stage.

PubMed Related Articles Algorithm (II)
• Next, a limited amount of stemming of words is done.
• Words from the abstract of a document are classified as text words.
• Words from titles are also classified as text words, but words from titles are added in a second time to give them a small advantage in the local weighting scheme.
• MeSH terms are placed in a third category, and a MeSH term with a subheading qualifier is entered twice, once without the qualifier and once with it.
• These three categories of words (or phrases in the case of MeSH) comprise the representation of a document.
• No other fields, such as Author or Journal, enter into the calculations.
• See http://ii.nlm.gov/MTI/related.shtml for more info.
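
The word definition and the double counting of title words described above can be sketched in Python. This is a rough illustration, not the NLM implementation: the stopword set is a short hypothetical stand-in for the 310-word list, stemming and MeSH terms are left out, and plain cosine similarity is used as an assumed substitute for the actual weighting scheme, which the slides do not specify.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the 310-word stopword list
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def tokenize(text):
    """Split text into words: unbroken runs of letters and numerals.

    Hyphens, spaces, new lines and punctuation all end a word. Tokens
    without at least one letter, and stopwords, are dropped.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if re.search(r"[a-z]", t) and t not in STOPWORDS]


def represent(title, abstract):
    """Count text words; title words are counted once and then added a second time."""
    counts = Counter(tokenize(abstract))
    title_words = tokenize(title)
    counts.update(title_words)  # title words as ordinary text words
    counts.update(title_words)  # added a second time for a small advantage
    return counts


def similarity(a, b):
    """Cosine similarity: shared-word score adjusted for document length (assumed)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up documents, for illustration only
doc1 = represent("Gene ontology", "Ontology terms describe gene products")
doc2 = represent("Gene products", "Terms of the ontology describe a gene")
doc3 = represent("Inverted index", "Searching documents with an index")

print(tokenize("Toll-like receptor 4 in the 2008 study"))
print(similarity(doc1, doc2) > similarity(doc1, doc3))
```

Ranking every other document by its similarity to a given document and keeping the top few gives that document's neighbors in the sense used above.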

Ontologies, thesauri and taxonomies
• An ontology is a controlled vocabulary that describes objects and the relations between them in a formal way, and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest.
• A thesaurus is a controlled list of terms linked together by semantic, hierarchical, and associative or equivalence relationships.
• A taxonomy is a set of interdependent concepts arranged in a lattice based on their relationships.

Semantic inference
[Figure: progression Keywords, Dictionary, Controlled vocabulary, Thesaurus, Taxonomy, Ontology, with the labels Discovery, Integration and Prediction.]

Semantic levels
[Table: semantic features (Definition, Synonyms, Classification (is_a), Properties (has_a), Other relations) set against resource types (Keywords, Dictionary, Controlled vocabulary, Thesaurus, Taxonomy, Ontology).]

The Medical Subject Headings classification
• Controlled vocabulary, thesaurus.
• MeSH terms are arranged in a hierarchy of "MeSH Tree Structures".
• When PubMed searches a MeSH term, it will automatically include narrower terms in the search, if applicable. This is also called "automatic explosion."
• When you click Go, PubMed will look for a match in up to four lists. It looks first for a match in the MeSH Translation Table. If it doesn't find a match, it looks in the Journals Translation Table, then in the Phrase List, and finally in the Author Index.

The Gene Ontology organisation
• The objective of GO is to provide controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.
• These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.
• The controlled vocabularies of terms are structured to allow both attribution and querying to be at different levels of granularity.
• http://www.geneontology.org

Gene Ontology organisation
• GO collaborators have developed three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.
• There are three separate aspects to this effort:
  – They write and maintain the ontologies themselves
  – They make cross-links between the ontologies and the genes and gene products in the collaborating databases
  – They develop tools that facilitate the creation, maintenance and use of ontologies.
• Useful links: http://www.amigo.org

Is_a and part_of relationships
[Figure: example of is_a and part_of relationships between terms. Clark et al., 2005]

An example of annotation
Mitochondrial P450 (CC24 PR01238; MITP450CC24)
GO cellular component term: mitochondrial inner membrane; GO:0005743
GO molecular function term: monooxygenase activity; GO:0004497
GO biological process term: electron transport; GO:0006118

MicroArray data analysis with GO
[Figure: gene expression over time, attacked vs. control, arranged by gene tree (Pearson), all genes (14010). Gene clusters are labelled with GO terms: Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes, Puparial adhesion, Molting cycle, hemocyanin, Amino acid catabolism, Lipid metabolism, Peptidase activity, Protein catabolism.]
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

GoPubMed
• GoPubMed is a knowledge-based search engine for biomedical texts. The Gene Ontology (GO) and Medical Subject Headings (MeSH) serve as a "table of contents" to structure the millions of articles in the MEDLINE database.
• GoPubMed is one of the first Web 2.0 search engines.
• The system was developed at the Technical University of Dresden by Michael Schroeder and his team and at Transinsight.
• http://www.gopubmed.org

Medline Cognition
Cognition's Semantic NLP understands:
• Word stems: the roots of words
• Words/phrases, with individual meanings of ambiguous words and phrases listed out
• The morphological properties of each word/phrase, e.g., what type of plural does it take, what type of past tense, how does it combine with affixes like "re" and "ation"
• How to disambiguate word senses. This allows Cognition's technology to pick the correct meaning of ambiguous words in context
• The synonym relations between word meanings
• The ontological relations between word meanings; one can think of this as a hierarchical grouping of meanings or a gigantic "family tree of English" with mothers, daughters, and cousins
• The syntactic and semantic properties of words. This is particularly useful with verbs, for example. Cognition encodes the types of objects different verb meanings can occur with.

iHOP
Information Hyperlinked over Proteins.
iHOP provides the network of genes and proteins as a natural way of accessing the millions of abstracts in PubMed.

iHOP
• The minimal information view contains general information, like the symbol, name and organism of a gene. Moreover it provides:
  – Useful links to external resources (e.g. UniProt, NCBI, OMIM, etc.)
  – Links to other iHOP views on this gene
  – Homologues
• Other views contain all sentences found in the literature:
  – For the main gene of a page and other genes (gene B) which interact with it.
  – That mention the main gene together with relevant biomedical terms such as lymphoma.
• Sentences are ranked by significance, so that screening a few sentences will usually be sufficient to gain an idea of a gene's function.

GenMAPP
• GenMAPP is a free computer application designed to visualize gene expression and other genomic data on maps representing biological pathways and groupings of genes.
• Integrated with GenMAPP are programs to perform a global analysis of gene expression or genomic data in the context of hundreds of pathway MAPPs and thousands of Gene Ontology Terms.

Automatic rendering of pathway interactions

Other ways to search – BLAST, PubChem, UCSC Genome Browser
By sequence – BLAST:
>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGA
TAAGGACGGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA
CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCG
GGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGG
GGGGGCG
By structure – PubChem:

Example of BLAST search results

PC Compound Record

UCSC Genome Browser
• The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide.
• The Gene Sorter shows expression, homology and other information on groups of genes that can be related in many ways.
• Blat quickly maps your sequence to the genome.
• The Table Browser provides convenient access to the underlying database.
• VisiGene lets you browse through a large collection of in situ mouse and frog images to examine expression patterns.
• Genome Graphs allows you to upload and display genome-wide data sets.

Cross-database search - NCBI

And for the future?

Practical activity 4 - Non-bibliographic databases
• Total duration - ca. 2 hours.
• If you are a geneticist, biochemist or cell biologist, go to the NCBI Minicourses page and do one of the courses described there. These resources are also valuable if you are interested in the molecular biology of disease.
• If you are a medicinal chemist or a pharmacologist, take a look at the PubChem resource and find out how to follow links from a given compound to related data such as bioactivity studies, literature abstracts, protein sequences, protein structures, genes and diseases.
• If you are a clinician, find out more about evidence-based medicine and apply the PICO approach to building a specific, focused, answerable question using PubMed.
• If you are none of the above, short-list the database resources relevant to your field of interest.
• Discuss your findings with the class.