Databases
WHY DO I HAVE TO LISTEN TO THIS?!

Database - what the heck is that?
• A database is a collection of information organized so that it can easily be accessed, managed, and updated
• Various types, from simple to complex: flat-file, relational
• Records are retrieved using a query language

Are you using one??
• Phone directory
• Archive of bills
• Birth registers

Problems with data - why do you need a DB?
• Nowadays obtaining data is no problem
• Having data is not, by itself, a reason to have a database
• Problems with data that call for a DB:
    • Size
    • Ease of updating
    • Accuracy
    • Security
    • Redundancy
    • Importance

DBs - dissection
  Layer                Examples
  Information system   Google, Entrez, SRS, DBget
  Query system         a list you look at, a catalogue, indexed files, SQL, grep
  Storage system       Oracle, MySQL, PostgreSQL, PC binary files, Unix text files, bookshelves
  Data                 GenBank flat file, PDB file, interaction record, title of a book, book

DBs - what are they made of?
• Tables (entities): the basic elements of information to track, e.g. gene, organism, sequence, citation...
• Columns (fields): attributes of tables, e.g. for a citation table: title, journal, volume, author...
• Rows (records): the actual data. Fields describe what data is stored; the rows of a table are where it is stored

Flat-file DBs
• All of the data is stored in one large table
• Txt file, Excel…

Relational DBs
• Contain multiple tables and define the relationships between them
• Relationships can be built between tables and fields

  invoice
  invoice_id  customer  product            price   quantity  total
  1           Elmer     buckshot           $2.00   2         $4.00
  2           Wiley     Acme snow machine  $5.00   1         $5.00
  3           Elmer     shotgun            $25.00  1         $25.00
  4           Bugs      carrots            $0.50   20        $10.00

  customer_table
  name   address            notes
  Elmer  Looney Tunes Dr.   likes hunting and opera
  Wiley  Southwest desert   big mail order customer
  Bugs   Rabbit Hole        likes to cross dress

  product_table
  product            price   notes
  carrots            $0.50
  shotgun            $25.00  oddly flexible
  buckshot           $2.00
  Acme snow machine  $5.00   high defect rate

Relational DBs - even more technical...
Get the info using Structured Query Language (SQL):

  SELECT customer_table.name, customer_table.address
  FROM customer_table, invoice
  WHERE invoice.product = 'Acme snow machine'
    AND invoice.customer = customer_table.name;

Result: Wiley, Southwest desert

Biological DBs
• A lot of them...
• Vary in size, quality, coverage, level of interest
• Is it any good?
    • comprehensiveness
    • accuracy
    • is it up to date
    • good interface
    • batch search/download
    • API (web services, DAS, etc.)

DBs by data types
• Sequence databases
• Sequence analysis
• Functional genomics
• Literature databases
• Structural databases
• Metabolic pathway databases
• Specialised databases

Confused??
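The relational example above (customer_table, invoice, and the SQL query for the Acme snow machine) can be tried out directly. What follows is only a minimal sketch, assuming an in-memory SQLite database via Python's standard sqlite3 module; the column types and the function names build_demo_db and who_bought are illustrative choices, not part of the original slides. The query is written with an explicit JOIN, which is equivalent to the comma-separated FROM clause shown on the slide.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the slide's two example tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer_table (
            name    TEXT PRIMARY KEY,
            address TEXT,
            notes   TEXT
        );
        CREATE TABLE invoice (
            invoice_id INTEGER PRIMARY KEY,
            customer   TEXT REFERENCES customer_table(name),
            product    TEXT,
            price      REAL,
            quantity   INTEGER,
            total      REAL
        );
        """
    )
    # Rows copied from the customer_table on the slide.
    conn.executemany(
        "INSERT INTO customer_table VALUES (?, ?, ?)",
        [
            ("Elmer", "Looney Tunes Dr.", "likes hunting and opera"),
            ("Wiley", "Southwest desert", "big mail order customer"),
            ("Bugs", "Rabbit Hole", "likes to cross dress"),
        ],
    )
    # Rows copied from the invoice table on the slide.
    conn.executemany(
        "INSERT INTO invoice VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "Elmer", "buckshot", 2.00, 2, 4.00),
            (2, "Wiley", "Acme snow machine", 5.00, 1, 5.00),
            (3, "Elmer", "shotgun", 25.00, 1, 25.00),
            (4, "Bugs", "carrots", 0.50, 20, 10.00),
        ],
    )
    return conn


def who_bought(conn: sqlite3.Connection, product: str) -> list:
    """Return (name, address) of every customer invoiced for the product."""
    query = """
        SELECT customer_table.name, customer_table.address
        FROM customer_table
        JOIN invoice ON invoice.customer = customer_table.name
        WHERE invoice.product = ?
    """
    # The ? placeholder lets sqlite3 quote the product name safely.
    return conn.execute(query, (product,)).fetchall()


conn = build_demo_db()
print(who_bought(conn, "Acme snow machine"))  # [('Wiley', 'Southwest desert')]
```

The join only works because invoice.customer stores the same value as customer_table.name; that shared field is the "relationship" between the two tables. Note that the text comparison here is case-sensitive, so the product has to be spelled exactly as it is stored.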
Lists of biological databases:
• http://www.oxfordjournals.org/nar/database/a/
• http://www.expasy.org/links.html

DBs by scope
Comprehensive
• Contain data from many organisms and many different types of sequences
• Nucleotide:
    • GenBank (National Center for Biotechnology Information)
    • EMBL (European Molecular Biology Laboratory)
    • DDBJ (DNA Data Bank of Japan)
    • GenBank, EMBL and DDBJ together form the International Nucleotide Sequence Database Collaboration
• Protein, such as Swiss-Prot
• Protein structure, such as PDB (Protein Data Bank)
• Genomes and maps, such as Entrez Genomes
Specialized
• Contain data from individual organisms, specific categories/functions of sequences, or data generated by specific sequencing technologies
• Examples: Flybase, Wormbase, etc.

DBs by level of curation
Primary databases - archival data
• Repository of information
• Redundant; there may be many sequence records for the same gene, each from a different lab
• Submitters maintain editorial control over their records: what goes in is what comes out
• No controlled vocabulary
• Variation in annotation of biological features
• Examples: GenBank/EMBL/DDBJ, UniProt, PDB, Medline (PubMed)
Secondary (derivative) databases - curated data
• Non-redundant; one record for each gene, or each splice variant
• Each record is intended to encapsulate the current understanding of a gene or protein, similar to a review article
• Records contain value-added information that has been added by one or more experts
• Examples: RefSeq, Taxon, UniProt, OMIM

Literature DBs
PubMed (www.ncbi.nlm.nih.gov/pubmed)
• Focuses on biomedicine
• Integrated with other NCBI DBs and services
• Uses NCBI search syntax (see PubMed help)
Google Scholar (scholar.google.com)
• Standard Google syntax
• Subject areas
• Free pdfs

To do: Stein, L.D. 2003. Integrating biological databases. Nat Rev Genet 4: 337-345.

DBs - how much is in there?
[Figure: growth of GenBank and WGS]

GenBank (www.ncbi.nlm.nih.gov/Genbank/)
• Database of nucleotide sequences from >160,000 organisms
• Started in 1981 (263 entries; 436,710 residues)
• Release 175 - 12/09 (112,910,950 entries; 110,118,557,163 base pairs)
• Release 189 - 04/12 (151,824,421 entries; 139,266,481,398 base pairs)
• Release 201 - 04/14 (171,744,486 entries; 159,813,411,760 base pairs)
• Release 207 - 04/15 (182,188,746 entries; 189,739,230,107 base pairs)
• Divided into 18 divisions
    • Organism specific (primate, rodent, invertebrate, bacterial, viral… 11 divisions)
    • Technology specific:
        • EST - expressed sequence tags
        • PAT - patent sequences
        • STS - sequence tagged sites
        • GSS - genome survey sequences
        • HTG - high-throughput genomic sequences
        • HTC - unfinished high-throughput cDNA sequencing
        • ENV - environmental sampling sequences

GenBank file
[Screenshots: GenBank file - header, features, sequence; the record ends with //]

GenBank - interface
[Screenshots]

NCBI/EBI/GenomeNet
Formats

NCBI DBs
• GenBank: The Nucleotide Sequence Database
• PubMed: The Bibliographic Database
• Macromolecular Structure Databases
• The Taxonomy Project
• The Single Nucleotide Polymorphism Database (dbSNP) of Nucleotide Sequence Variation
• The Gene Expression Omnibus (GEO): A Gene Expression and Hybridization Repository
• Online Mendelian Inheritance in Man (OMIM): A Directory of Human Genes and Genetic Disorders
• The NCBI BookShelf: Searchable Biomedical Books
• PubMed Central (PMC): An Archive for Literature from Life Sciences Journals
• The SKY/CGH Database for Spectral Karyotyping and Comparative Genomic Hybridization Data
• The Major Histocompatibility Complex Database, dbMHC

NCBI - Entrez
http://www.ncbi.nlm.nih.gov/gquery/

General Protein DBs
UniProt (http://www.uniprot.org), formed in 2002 from:
  Source           Content
  SWISS-PROT       Manually curated high-quality annotations, less data
  GenPept/TrEMBL   Translated coding sequences from GenBank/EMBL; few annotations, more up to date
  PIR              Phylogenetic-based annotations
Maintained by:
• European Bioinformatics Institute (EBI)
• Swiss Institute of Bioinformatics (SIB)
• Protein Information Resource (PIR)

Other protein DBs
• Structural DBs (PDB)
    • PDB (Protein Data Bank)
    • MMDB (Molecular Modeling Database)
• Protein domain DBs (Pfam)
    • Pfam
    • SMART (a Simple Modular Architecture Research Tool)
    • CDD (Conserved Domain Database)
• Protein motif DBs
    • Scan Prosite
    • PRINTS

Other DBs
• Ribosomal RNA DBs
    • RDP (Michigan State University, USA)
    • rRNA database (University of Antwerp, Belgium)
    • Silva
• Genome DBs
    • Colibase (E. coli and related species)
    • Flybase (Drosophila)
    • Dictybase (Dictyostelium discoideum)
• Metabolic pathways DBs…

Nutrigenomics related DBs
Gene oriented
• Gene expression:
    • GEO - Gene Expression Omnibus (NCBI)
    • Array Express (EBI)
    • CGED (Cancer Gene Expression Database)
• Variation databases:
    • dbSNP (NCBI)
    • Hapmap (http://hapmap.ncbi.nlm.nih.gov/abouthapmap.html)
    • HGVbase (Human Genome Variation)

OMIM
Online Mendelian Inheritance in Man
• Database that catalogues all the known diseases with a genetic component
• Relationship between phenotype and genotype
• ~20,000 entries

Clinical and Mutation Databases
• HGMD - Human Gene Mutation Database
    • Database of sequences and phenotypes of disease-causing mutations
    • http://www.hgmd.cf.ac.uk/ac/index.php
• General disease DBs
    • SwissVar (http://swissvar.expasy.ch)
    • KEGG Disease (http://www.genome.jp/kegg/disease/)
• Disease-specific mutation databases

Nutrigenomics related DBs
• Nutrigenomics database: microarray data related to nutrition (http://foodfunction.dc.affrc.go.jp/en/)
• NuGO (http://www.nugo.org)
• dbNP - Nutritional Phenotype database
• Biological information in db:
    • genetics
    • transcriptomics
    • proteomics
    • biomarkers
    • metabolomics
    • functional assays
    • food intake and food composition

Nutrition db - myplate.gov
[Screenshots]

Nutrition db - USDA
http://ndb.nal.usda.gov/
[Screenshots]

Nutrition databases
http://nutritiondata.self.com/

Literature DBs
ISI Web of Knowledge (portal.isiknowledge.com)
[Screenshots: WOS, ISI Web of Knowledge, ISI - Citation report]

ISI - to do
• Each group takes one department and checks the publications of its full professors (www.pbf.hr)
• Count all publications and citations for your department
• What is the most cited publication for your department?
• What is the highest h-index in your department?
• Normalize the data...

PubMed Overview
• PubMed is a Web-based retrieval system developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM)
• NLM has been indexing the biomedical literature since 1879
• PubMed is a database of bibliographic information drawn primarily from the life sciences literature
• PubMed contains links to full-text articles at participating publishers' Web sites as well as links to other third-party sites
• PubMed provides access and links to the integrated molecular biology and chemistry databases maintained by NCBI

What's in PubMed?
• Over 23 million records representing articles in the biomedical literature
• Most PubMed records are MEDLINE citations
• MEDLINE® is the National Library of Medicine's premier bibliographic database, containing citations and author abstracts from more than 5,500 biomedical journals
• The scope of MEDLINE includes diverse topics such as microbiology, delivery of health care, nutrition, pharmacology and environmental health

PubMed - author search
• Full names are not available for all authors, so it is smarter to use only initials
[Screenshots: author search results, search results options, article view]

Subject search (simple)
• To search by subject, be as specific as possible
• Do not use punctuation, tags or operators
• Search for articles on the use of aspirin for heart attack prevention. Which query to use?
    a) "aspirin for heart attack prevention"
    b) aspirin heart attack prevention
    c) aspirin AND heart AND attack AND prevention

Advanced PubMed search using MeSH
• MeSH (Medical Subject Headings) is the NLM controlled vocabulary, which gives uniformity and consistency to the indexing and cataloging of biomedical literature
• Similar to keywords on other systems
• Arranged in a hierarchical manner

Even more about MeSH
The MeSH vocabulary includes four types of terms:
• Headings - represent concepts found in the biomedical literature
    • Body Weight
    • Kidney
    • Radioactive Waste
• Subheadings - attached to MeSH headings to describe a specific aspect of a concept
    • Therapy
    • Diagnosis
    • Metabolism
• Supplementary Concept Records
• Publication Types

PubMed search using MeSH - graphic example
[Screenshots: search, results]

MeSH example
We will be looking for papers dealing with medication of adults with nutrition disorders.
1. Go to PubMed advanced search
2. In the builder change "All Fields" to "MeSH Terms" and write nutrition disorders (choose from the dropdown menu)
3. In the next field write "adults", click on "Show index list" and select "adults"
4. Change "All Fields" to "MeSH Subheading" and select "drug therapy" from the index list
5. Click on the Search button

Tasks
• Search for papers looking at vitamin B supplementation and its effects on Alzheimer's disease
• Find all reviews published from 2010 dealing with drug therapies used for Alzheimer's disease. Export all abstracts to a file.

Need the full text article?
• If you are not looking for a specific article, filter your results using the "Free full text" option
• Try searching PubMed Central (PMC), a free archive of biomedical and life sciences journal literature
• Find the paper of interest in PubMed and search Google Scholar to see if free pdfs are available

Using MeSH
• Go to the MeSH homepage: http://www.ncbi.nlm.nih.gov/mesh
• Search for the MeSH term for chewing
• What is it called?
• What subheadings does it have?
• In how many papers is chewing a major topic?
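The advanced-search steps in the MeSH example above can also be written as one query string with PubMed field tags. The sketch below is only an illustration under stated assumptions: the helper names tag, mesh_query and esearch_url are made up for this example, the heading is written as "adult", and the URL targets the public NCBI E-utilities ESearch endpoint with its db, term and retmax parameters. Nothing is sent over the network; the code only builds the strings.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed here as the target service).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tag(term: str, field: str) -> str:
    """Quote a term and attach a PubMed field tag, e.g. "kidney"[MeSH Terms]."""
    return f'"{term}"[{field}]'


def mesh_query(headings, subheadings=()) -> str:
    """Combine MeSH headings and subheadings with AND into one query string."""
    parts = [tag(h, "MeSH Terms") for h in headings]
    parts += [tag(s, "MeSH Subheading") for s in subheadings]
    return " AND ".join(parts)


def esearch_url(query: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch URL for the given PubMed query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# Steps 2-4 of the MeSH example: two headings plus one subheading.
query = mesh_query(["nutrition disorders", "adult"], ["drug therapy"])
print(query)
print(esearch_url(query))
```

Pasting the printed query into the PubMed search box should be equivalent to filling in the builder by hand; the URL form is useful when the same search has to be repeated from a script.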
MeSH - combining queries
• Search for the terms obesity and outbreak
• Merge them into one query

MeSH - using subheadings
• Search for papers dealing with the genetics of obesity

Tasks
• Find out whether there is a genetic basis for vitamin C deficiency in humans
• Find all nutrition disorders indexed in MeSH. To which group of diseases do they belong?
• Find all reviews dealing with prevention and control of nutrition disorders in children

OMIM
Online Mendelian Inheritance in Man
• OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes
• OMIM contains information on all known Mendelian disorders and over 12,000 genes
• OMIM focuses on the relationship between phenotype and genotype

OMIM
• Obesity: http://omim.org/entry/601665
• Phenylketonuria: http://omim.org/entry/261600
• Sections of an entry:
    • Description
    • Clinical features
    • Biochemical features
    • Inheritance
    • Clinical management
    • Population genetics
    • Animal models