Database Resources The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information www.ncbi.nih.gov NCBI Mission …. to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. What does this involve ? • creating automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; • facilitating the use of such databases and software by the research and medical community; • coordinating efforts to gather biotechnology information both nationally and internationally; • performing research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules. What is a Database ? • A model or representation of some aspect of the real world • An organized collection of data. May contain many different types of data • Coherent, consistent and designed for a specific purpose • A computational system for managing and querying the data. What is a Database ? • A collection of information organized in such a way that a computer program can quickly select desired pieces of data. • An electronic filing system • Traditional databases are organized by fields, records, and files. • A field is a single piece of information; • a record is one complete set of fields • a file is a collection of records. For example, a telephone book is analogous to a file. It contains a list of records, each of which consists of three fields: name, address, and telephone number. What is a Database ? • To access information from a database, you need a database management system (DBMS). This is a collection of programs that enables you to enter, organize, and select data in a database. • Most molecular biology databases primarily use relational database management systems (RDBMS). Relational Database • A relational database is like a large spreadsheet. Each field is a column, each row is an entry. Relational databases use a set of tables to organize data. • Each entry must be unambiguously identified • Names are not reliable e.g. incorrectly assigned gene function • Unique IDs (UID)s are used, e.g. in GenBank these are called accession numbers UID Name Sequence Quality Value BU039022 PP_LEa0001A01f CATACAAAT … 35 BU039057 PP_LEa0001B17f TACGGCTAC … 28 Relational Database • Achieving consistency • Repeated information is stored in a single place. • Only one copy needs to be updated Sequence UID Definition Locus Accession Taxonomy ID* Sequence Taxonomy Taxonomy ID* Genus Species Ref Index* UID Medline ID * May be referred Ref Index Medline ID Authors Title Journal * May be referred to indirectly via an index to by a secondary ID Relational Database • Language used is SQL or structured query language • Easy to understand (essentially English?) • Relatively consistent across RBDMS • Supplies a set of commands to define tables, insert data and make queries • Queries • SELECT some fields FROM some table WHERE some condition is met • E.g. select accession, sequence FROM sequence WHERE Accession = BU039022 BU039022 CATACAAATACTGCTACHTAAATC …. • More complex queries require two or more tables be joined to produce a result Relational Database • Most RDBMS do not allow users to directly query the database by SQL. • An ill formed query can overload or crash the system • SQL still too complex for biologists? • Provide a search interface for the user instead • E.g. user enters a phrase and the database identifies what part of the database should be searched. • The queries that make it through the web interface have to be translated to SQL Relational database : Example GenBank Query What Constitutes a Good Database ? • • • • • • Broad coverage of the chosen topic Up to date information gathering Curated Support staff Commitment to the future Good query interface Issues for Molecular Biological Databases ? • • • • Annotation Archives Updates Redundancy Issues for Molecular Biological Databases ? • Annotation • Adding biological information to genome sequence. Textual descriptive information • Correctness • Many genes are incorrectly annotated. May assign a function to a novel gene from a similar sequence that may itself be incorrectly annotated so the error is propagated throughout the database. • Routine error • Quality • Expert or non expert curation? Who provided the curation? • Is there any biological verification? • What vocabulary is used • Has their been any peer review ? Issues for Molecular Biological Databases ? • Archival Quality • • Is the database archival or curated Can the same data be recovered later • Don’t overwrite primary key (each accession numbers) • The best databases note any changes to the data. • Updates • • • • How often is the database updated? Major databases take direct submissions • Only the direct submitter can make changes, even if you can prove its wrong. When is a sequence finished ? How is annotation updated as more knowledge is available • Redundancy • • This is a major issue, how do we deal with it without losing potentially valuable information. Also relates to archival quality NCBI and GenBank • Genbank is the genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information in them. • GenBank is hosted within NCBI • Researchers submit their sequences to GenBank • NCBI provides analysis and retrieval resources for the data in GenBank (and many other NCBI hosted databases). NCBI Databases (http://www.ncbi.nlm.nih.gov/guide/all/#Databases_) • • • • • • • • • • • • • Nucleotide Database EST (dbEST) GSS (dbGSS) Protein Database Structure Database Genome 3D Domains Conserved Domains UniSTS Gene UniGene HomoloGene Reference Sequence (refseq) • • • • • • • • • • • • • • SNP (dbSNP) dbVAR – large scale genomic variation dbGAP – integration of genotype & phenotype PopSet Database Taxonomy Database GEO Profiles GEO Datasets Cancer Chromosomes Epigenomics PubMed Central Journals MeSH Bookshelf OMIM Database Retrieving Data from NCBI using Entrez • Entrez is a text based retrieval system that integrates all the information resources available at the NCBI such as; 1. 2. 3. 4. 5. 6. 7. Scientific literature DNA and protein sequence databases 3D protein structure and protein domain data Population study datasets Expression data Assemblies of complete genomes Taxonomic information http://www.ncbi.nlm.nih.gov/guide/all/#howto_ Create/login to the myNCBI portal Understanding GenBank records Go to http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#ModificationsDateB Click on the links on the left to get a description of what the term means, Copy the description into a word document and after completed, save the document on your drupal web site Entrez Sequences Help http://www.ncbi.nlm.nih.gov/books/NBK44864/