Databases

WHY DO I HAVE TO LISTEN TO THIS?!
Database – what the heck is that?


A database is a collection of information that is organized so
that it can easily be accessed, managed, and updated.
 Various types – from simple to complex ones
• Flat-file, relational
 Records retrieved using a query language
 Are you using one??
• Phone directory
• Archive of bills
• Birth registers

Problems with data – why you need a db?
 Nowadays obtaining data is no problem
 Having data is, by itself, no reason to have a database
 Problems with data that require a DB:
 Size
 Ease of updating
 Accuracy
 Security
 Redundancy
 Importance
DBs - dissection
 Information system
• Google, Entrez, SRS, DBget
 Query system
• A list you look at, a catalogue, indexed files, SQL, grep
 Storage system
• Oracle, MySQL, PostgreSQL, PC binary files, Unix text files, bookshelves
 Data
• GenBank flat file, PDB file, interaction record, title of a book, book
DBs – what are they made of?
 Tables (entities)
• basic elements of information to track, e.g., gene, organism,
sequence, citation...
 Columns (fields)
• attributes of tables, e.g. for citation table, title, journal, volume,
author...
 Rows (records)
• actual data
• whereas fields describe what data is stored, the rows of a table
are where the actual data is stored
Flat-File DBs
 All of the data is stored in one large table
 Text file, Excel spreadsheet…
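A minimal sketch of what "one large table" means in practice, assuming a comma-separated text file. The column names and rows are illustrative, and `find_rows` is a hypothetical helper, not part of any library. Note how the customer's address has to be repeated on every row, which is the redundancy problem listed earlier.

```python
import csv
import io

# Illustrative flat file: every fact about an invoice lives in one row,
# so a customer's address is repeated for each of their invoices.
FLAT_FILE = """invoice_id,customer,address,product,price,quantity
1,Elmer,Looney Tunes Dr.,buckshot,2.00,2
2,Wiley,Southwest desert,Acme snow machine,5.00,1
3,Elmer,Looney Tunes Dr.,shotgun,25.00,1
4,Bugs,Rabbit Hole,carrots,0.50,20
"""


def find_rows(text, column, value):
    """Return all rows whose `column` equals `value` (a grep-like scan)."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row[column] == value]


elmer_rows = find_rows(FLAT_FILE, "customer", "Elmer")
addresses = [row["address"] for row in elmer_rows]
print(addresses)  # the same address appears once per invoice
```

Updating Elmer's address here means editing every one of his rows, which is why larger data sets are usually split into related tables.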
Relational DBs
 Contain multiple tables and define the relationships between them
invoice
invoice_id  customer  product            price   quantity  total
1           Elmer     buckshot           $2.00   2         $4.00
2           Wiley     Acme snow machine  $5.00   1         $5.00
3           Elmer     shotgun            $25.00  1         $25.00
4           Bugs      carrots            $0.50   20        $10.00

customer_table
name   address           notes
Elmer  Looney Tunes Dr.  likes hunting and opera
Wiley  Southwest desert  big mail order customer
Bugs   Rabbit Hole       likes to cross dress

product_table
product            price   notes
carrots            $0.50
shotgun            $25.00  oddly flexible
buckshot           $2.00
Acme snow machine  $5.00   high defect rate
Relational DBs
 Relationships can be built between tables and fields
Relational DBs – even more technical...
 Get the info using Structured Query Language (SQL):
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = 'Acme snow machine'
AND invoice.customer = customer_table.name
Result:
Wiley, Southwest desert
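To try the query yourself, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The column types and the choice of SQLite are assumptions made for illustration; the slides do not say which database engine holds these tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk

# Column types are assumed; the slides only give column names.
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (
    invoice_id INTEGER PRIMARY KEY,
    customer   TEXT REFERENCES customer_table(name),
    product    TEXT,
    price      REAL,
    quantity   INTEGER,
    total      REAL
);
""")

conn.executemany(
    "INSERT INTO customer_table VALUES (?, ?, ?)",
    [
        ("Elmer", "Looney Tunes Dr.", "likes hunting and opera"),
        ("Wiley", "Southwest desert", "big mail order customer"),
        ("Bugs", "Rabbit Hole", "likes to cross dress"),
    ],
)
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Elmer", "buckshot", 2.00, 2, 4.00),
        (2, "Wiley", "Acme snow machine", 5.00, 1, 5.00),
        (3, "Elmer", "shotgun", 25.00, 1, 25.00),
        (4, "Bugs", "carrots", 0.50, 20, 10.00),
    ],
)

# The query from the slide: who bought the Acme snow machine, and where do they live?
QUERY = """
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = 'Acme snow machine'
  AND invoice.customer = customer_table.name
"""
rows = conn.execute(QUERY).fetchall()
print(rows)  # [('Wiley', 'Southwest desert')]
```

The address is stored once, in customer_table, and the join on the customer name brings it together with the invoice only when the query asks for it.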
Biological DBs
 A lot of them..
• Vary in size, quality, coverage, level of interest
• Is it any good?
 – comprehensiveness
 – accuracy
 – is up to date
 – good interface
 – batch search/download
 – API (web services, DAS, etc.)
DBs by data types
 Sequence databases
 Sequence analysis
 Functional genomics
 Literature databases
 Structural databases
 Metabolic pathway databases
 Specialised databases
 Confused??


http://www.oxfordjournals.org/nar/database/a/
http://www.expasy.org/links.html
DBs by scope
 Comprehensive

Contain data from many organisms and many different types of
sequences
 Nucleotide
 GenBank (National Center for Biotechnology Information)
 EMBL (European Molecular Biology Laboratory)
 DDBJ (DNA Data Bank of Japan)

GenBank, EMBL & DDBJ comprise the International Nucleotide
Sequence Database Collaboration
 Protein, such as Swiss-Prot
 Protein Structure, such as PDB: Protein Data Bank
 Genomes and Maps, such as Entrez Genomes
DBs by scope
 Specialized

– Contain data from individual organisms, specific
categories/functions of sequences, or data generated by
specific sequencing technologies.

– Example: Flybase, Wormbase, etc.
DBs by level of curation
 Primary databases – Archival data
 Repository of information
 Redundant; might have many sequence records for the same
gene, each from a different lab
 Submitters maintain editorial control over their records: what
goes in is what comes out
 No controlled vocabulary
 Variation in annotation of biological features
 Examples:
• GenBank/EMBL/DDBJ
• UniProt
• PDB
• Medline (PubMed)

DBs by level of curation
 Secondary (derivative) databases – Curated data
 Non-redundant; one record for each gene, or each splice
variant
 Each record is intended to present an encapsulation of the
current understanding of a gene or protein, similar to a review
article
 Records contain value-added information that has been
added by one or more experts
 Examples:
• RefSeq
• Taxon
• UniProt
• OMIM
Literature DBs
 PubMed www.ncbi.nlm.nih.gov/pubmed
• Focuses on biomedicine
• Integrated with other NCBI DBs and services
• Uses NCBI search syntax (PubMed help)
 Google Scholar scholar.google.com
• Standard Google syntax
• Subject areas
• Free pdfs
To do:
Stein, L.D. 2003. Integrating biological databases. Nat Rev Genet 4: 337-345.
DBs - how much is in there?
Growth of GenBank and WGS
GenBank
www.ncbi.nlm.nih.gov/Genbank/
 GenBank
• database of nucleotide sequences from >160,000 organisms
• started in 1981 (263 entries; 436,710 residues)
• Release 175 – 12/09 (112,910,950 entries; 110,118,557,163 base pairs)
• Release 189 – 04/12 (151,824,421 entries; 139,266,481,398 base pairs)
• Release 201 – 04/14 (171,744,486 entries; 159,813,411,760 base pairs)
• Release 207 – 04/15 (182,188,746 entries; 189,739,230,107 base pairs)
 divided into 18 divisions
• Organism specific (primate, rodent, invertebrate, bacterial, viral… 11 divisions)
• Technology specific
 – EST - EST sequences (expressed sequence tags)
 – PAT - patent sequences
 – STS - STS sequences (sequence tagged sites)
 – GSS - GSS sequences (genome survey sequences)
 – HTG - HTG sequences (high-throughput genomic sequences)
 – HTC - unfinished high-throughput cDNA sequencing
 – ENV - environmental sampling sequences
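The release figures listed above can be turned into a rough growth rate. The sketch below assumes each release appeared on the first day of the month given (the slides state only month and year) and that growth was smooth between releases; `annual_growth` is a hypothetical helper name.

```python
from datetime import date

# (release number, assumed release date, base pairs) taken from the list above.
# The day of the month is an assumption; only month/year are given.
RELEASES = [
    (175, date(2009, 12, 1), 110_118_557_163),
    (189, date(2012, 4, 1), 139_266_481_398),
    (201, date(2014, 4, 1), 159_813_411_760),
    (207, date(2015, 4, 1), 189_739_230_107),
]


def annual_growth(first, last):
    """Compound annual growth rate in base pairs between two releases."""
    (_, d0, bp0), (_, d1, bp1) = first, last
    years = (d1 - d0).days / 365.25
    return (bp1 / bp0) ** (1 / years) - 1


overall = annual_growth(RELEASES[0], RELEASES[-1])
print(f"Release 175 to 207: {overall:.1%} per year")

# Growth between consecutive releases, to see whether the pace is steady.
for earlier, later in zip(RELEASES, RELEASES[1:]):
    print(earlier[0], "->", later[0], f"{annual_growth(earlier, later):.1%} per year")
```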
GenBank file
GenBank file - header
GenBank file - features
GenBank file - sequence
//
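The three parts of a GenBank flat file (header, features, sequence) can be separated with a few lines of code. This is a minimal sketch only: the record below is invented for illustration and is not a real GenBank entry, and `split_record` is a hypothetical helper. It relies on the FEATURES and ORIGIN keywords that open the feature table and the sequence, and on the // line that ends a record.

```python
# Invented record, for illustration only (not a real accession).
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
ACCESSION   EXAMPLE01
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="synthetic construct"
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def split_record(text):
    """Split one record into header lines, feature lines and the bare sequence."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # feature table starts here
            section = features
        elif line.startswith("ORIGIN"):  # sequence starts on the next line
            section = sequence
            continue
        section.append(line)
    # Sequence lines carry position numbers and spaces; keep letters only.
    bases = "".join(ch for line in sequence for ch in line if ch.isalpha())
    return header, features, bases


header, features, bases = split_record(SAMPLE_RECORD)
print(len(header), "header lines;", len(features), "feature lines;", bases)
```

For real work a maintained parser (for example the SeqIO module in Biopython) is usually a better choice than hand-written splitting.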
GenBank - interface
NCBI/EBI/GenomeNet Formats
NCBI DBs
 GenBank: The Nucleotide Sequence Database
 PubMed: The Bibliographic Database
 Macromolecular Structure Databases
 The Taxonomy Project
 The Single Nucleotide Polymorphism Database (dbSNP) of Nucleotide
Sequence Variation
 The Gene Expression Omnibus (GEO): A Gene Expression and Hybridization
Repository
 Online Mendelian Inheritance in Man (OMIM): A Directory of Human Genes and
Genetic Disorders
 The NCBI BookShelf: Searchable Biomedical Books
 PubMed Central (PMC): An Archive for Literature from Life Sciences Journals
 The SKY/CGH Database for Spectral Karyotyping and Comparative Genomic
Hybridization Data
 The Major Histocompatibility Complex Database, dbMHC
NCBI - Entrez
http://www.ncbi.nlm.nih.gov/gquery/
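Entrez can also be queried from a script through NCBI's E-utilities web service. The slides do not cover this, so take the sketch below as an assumption about how one might do it: it only builds the URL for an ESearch request (parameters db, term, retmax, retmode) and does not contact the server. `esearch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that asks database `db` for records matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("pubmed", "nutrigenomics")
print(url)
```

Opening the printed URL in a browser returns the matching record IDs; check NCBI's usage guidelines before sending many requests in a row.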
General Protein DBs
 UniProt (http://www.uniprot.org), formed in 2002 by combining:
• SWISS-PROT: manually curated; high-quality annotations, less data
• GenPept/TrEMBL: translated coding sequences from GenBank/EMBL;
few annotations, more up to date
• PIR: phylogenetic-based annotations
 Institutions behind UniProt:
• European Bioinformatics Institute (EBI)
• Swiss Institute of Bioinformatics (SIB)
• Protein Information Resource (PIR)
Other protein DBs
 Structural DBs
• PDB (Protein Databank)
• MMDB (Molecular Modeling database)
 Protein domain DBs
• Pfam
• SMART (a Simple Modular Architecture Research Tool)
• CDD (Conserved Domain Database)
 Protein motif DBs
• Scan Prosite
• PRINTS
Other DBs
 Ribosomal RNA DBs
• RDP (Michigan State University, USA)
• rRNA database (University of Antwerp, Belgium)
• Silva
 Genome DBs
• Colibase (E. coli and related species)
• Flybase (Drosophila)
• Dictybase (Dictyostelium discoideum)
 Metabolic pathways DBs…
Nutrigenomics related DBs
 Gene oriented
 Gene expression:
• GEO - Gene Expression Omnibus (NCBI)
• ArrayExpress (EBI)
• CGED (Cancer Gene Expression Database)
 Variation databases:
• dbSNP (NCBI)
• HapMap http://hapmap.ncbi.nlm.nih.gov/abouthapmap.html
• HGVbase (Human Genome Variation)

OMIM
 Online Mendelian Inheritance in Man
 database that catalogues all the known diseases with
a genetic component
 relationship between phenotype and genotype
 ~ 20 000 entries
Clinical and Mutation Databases
 HGMD – Human Gene Mutation Database
• Database of sequences and phenotypes of disease-causing
mutations
• http://www.hgmd.cf.ac.uk/ac/index.php
 General Disease DBs
• KEGG Disease http://www.genome.jp/kegg/disease/
• SwissVar http://swissvar.expasy.ch
 Disease-specific mutation databases
Nutrigenomics related DBs
 Nutrigenomics database
• microarray data related to nutrition
• http://foodfunction.dc.affrc.go.jp/en/
 NuGO http://www.nugo.org
 dbNP – Nutritional Phenotype database
 Biological information in db:
• genetics
• transcriptomics
• proteomics
• biomarkers
• metabolomics
• functional assays
• food intake and food composition
Nutrition db – myplate.gov
Nutrition db - USDA
http://ndb.nal.usda.gov/
Nutrition databases
http://nutritiondata.self.com/
Literature DBs
ISI Web of knowledge
portal.isiknowledge.com
WOS
ISI – Citation report
ISI to do
 Each group takes one department and checks the
publications of its full professors (www.pbf.hr)
 Count all publications and citations for your
department
 What is the most cited publication for your
department?
 What is the highest h-index (h-factor) in your department?
 Normalize the data...
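For the h-factor question: a researcher has h-index h when h of their papers have been cited at least h times each. A small sketch of that calculation follows; the citation counts are invented for illustration, and in the exercise you would copy the real counts from the citation report.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break      # every later paper has even fewer citations
    return h


# Invented citation counts for one author's seven papers.
example_counts = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_counts))  # 3
```

Here three papers have at least three citations each, but there are not four papers with at least four, so the h-index is 3.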
PubMed Overview
 PubMed is a Web-based retrieval system developed by
the National Center for Biotechnology Information
(NCBI) at the National Library of Medicine (NLM)
 NLM has been indexing the biomedical literature since
1879
 PubMed is a database of bibliographic information
drawn primarily from the life sciences literature
 PubMed contains links to full-text articles at
participating publishers' Web sites as well as links to
other third party sites
 PubMed provides access and links to the integrated
molecular biology and chemistry databases maintained
by NCBI
What’s in PubMed?
 Over 23 million records representing articles in the
biomedical literature
 Most PubMed records are MEDLINE citations
 MEDLINE® is the National Library of Medicine’s
premier bibliographic database; it contains citations
and author abstracts from more than 5,500
biomedical journals
 The scope of MEDLINE includes diverse topics such
as microbiology, delivery of health care, nutrition,
pharmacology and environmental health
PubMed - author search
Full names are not available for all authors – it is
smarter to use only initials
PubMed – author search results
Search results options
Article view
Subject search (simple)
 To search by subject, be as specific as possible
 Do not use punctuation, tags or operators
 Search for articles on the use of aspirin for heart
attack prevention. Which query to use?
a) “aspirin for heart attack prevention”
b) aspirin heart attack prevention
c) aspirin AND heart AND attack AND prevention
Advanced Pubmed search using MeSH
 MeSH (Medical Subject Headings) is the NLM controlled
vocabulary which gives uniformity and consistency to the
indexing and cataloging of biomedical literature
 Similar to keywords on other systems
 Arranged in a hierarchical manner
Even more about MeSH
 MeSH Vocabulary includes four types of terms:
 Headings: represent concepts found in the biomedical
literature
• Body Weight
• Kidney
• Radioactive Waste
 Subheadings: attached to MeSH headings to describe a
specific aspect of a concept
• Therapy
• Diagnosis
• Metabolism
 Supplementary Concept Records
 Publication Types
PubMed Search using MeSH – graphic example
Results
MeSH example
 We will be looking for papers dealing with medication of
adults with nutrition disorders
1. Go to PubMed advanced search
2. In the builder, change All Fields to MeSH Terms and type
nutrition disorders (choose from the dropdown menu)
3. In the next field type “adults” and click on Show index
list; select “adults”
4. Change All Fields to MeSH Subheadings and from the index
list select “drug therapy”
5. Click on the Search button
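The builder steps above end up as a single query string in which each term carries a field tag in square brackets. The sketch below assembles such a string by hand. It assumes the PubMed field tags [MeSH Terms] and [MeSH Subheading], and it spells the age heading as "adult", which is how the MeSH heading is usually written; pick the exact entry from the index list in practice. `tagged` and `and_query` are hypothetical helper names.

```python
def tagged(term, field):
    """Quote a term and attach a PubMed field tag, e.g. "adult"[MeSH Terms]."""
    return f'"{term}"[{field}]'


def and_query(*parts):
    """Combine tagged terms so that every one of them must match."""
    return " AND ".join(parts)


query = and_query(
    tagged("nutrition disorders", "MeSH Terms"),
    tagged("adult", "MeSH Terms"),
    tagged("drug therapy", "MeSH Subheading"),
)
print(query)
```

The printed string can be pasted into the PubMed search box, which should give the same kind of result as clicking through the builder.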
Tasks
 Search for papers looking at vitamin B
supplementation and its effects on Alzheimer’s
disease
 Find all reviews published from 2010 dealing with
drug therapies used for Alzheimer’s disease. Export
all abstracts to a file.
Need the full text article?
 If you are not looking for a specific article, filter your
results using the “Free full text” option
 Try searching PubMed Central (PMC) - a free
archive of biomedical and life sciences journal
literature
 Find the paper of interest in PubMed and search Google
Scholar to see if free pdfs are available
Using MeSH
 Go to MeSH homepage - http://www.ncbi.nlm.nih.gov/mesh
 Search MeSH for the term chewing
 What is it called?
 What subheadings does it have?
 In how many papers is chewing a major topic?
MeSH – combining queries
 Search for terms obesity and outbreak
 Merge them into one query
MeSH – using subheadings
 Search for papers dealing with genetics of obesity
Tasks
 Find out whether there is a genetic basis for vitamin C
deficiency in humans.
 Find all nutrition disorders indexed in MeSH. To
which group of diseases do they belong?
 Find all reviews dealing with prevention and control
of nutrition disorders in children.
OMIM
 Online Mendelian Inheritance in Man
 OMIM is a comprehensive, authoritative, and timely
compendium of human genes and genetic
phenotypes
 OMIM contains information on all known Mendelian
disorders and over 12,000 genes
 OMIM focuses on the relationship between
phenotype and genotype
OMIM
 Obesity http://omim.org/entry/601665
 Phenylketonuria http://omim.org/entry/261600
 Description
 Clinical features
 Biochemical features
 Inheritance
 Clinical management
 Population genetics
 Animal models