A Field Guide to NCBI Resources

NCBI

Created as a part of NLM in 1988








Establish public databases
Research in computational biology
Develop software tools for sequence analysis
Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Human genome (2001)
NCBI Home Page
www.ncbi.nlm.nih.gov
To learn more, visit the “Site Map” and “About NCBI” web pages
Entrez: An Integrated Database Search and Retrieval System
The (ever) Expanding Entrez System
UniGene
PubMed
Nucleotide
Protein
Journals
Structure
CDD
Genome
Entrez
SNP
PopSet
OMIM
3D Domains
Taxonomy
UniSTS
ProbeSet
Books
Literature Databases





PubMed
Books
PubMed Central
Journals
Online Mendelian Inheritance in Man (OMIM)
Molecular Sequence Databases


Sequence Databases
 Nucleotide (GenBank)
 Taxonomy
 PopSet
 Protein
Marker Databases
 Single Nucleotide Polymorphisms (SNPs, dbSNP)
 Sequence Tagged Sites (STSs, dbSTS)
 Expressed Sequence Tags (ESTs, dbEST)
 UniGene
Molecular Databases

Primary Databases



Original submissions by experimentalists
Database staff organize but don’t add additional information
 Example: GenBank
Derivative Databases


Human curated

compilation and correction of data

Example: SWISS-PROT, NCBI RefSeq mRNA
Computationally Derived


Example: UniGene
Combinations

Example: NCBI Genome Assembly
Curators
RefSeq
TATAGCCG
AGCTCCGATA
CCGATGACAA
Labs
Genome
Assembly
TATAGCCG
TATAGCCG
TATAGCCG
TATAGCCG
GenBank
UniGene
Algorithms
The International Nucleotide Sequence
Database Collaboration
NIH
NCBI
ENTREZ
GenBank
NIG
CIB
Get Entry
DDBJ
EMBL
EBI
SRS
EMBL
Entrez Nucleotide
EMBL 9%
RefSeq 1%
PDB 0.01%
DDBJ 19%
GenBank 71%
What is GenBank?
NCBI’s Primary Sequence Database




Nucleotide only sequence database
Archival in nature
GenBank Data
 Direct submissions of individual records (BankIt, Sequin)
 Batch submissions via e-mail (EST, GSS, STS)
 FTP accounts established for sequencing centers
Data shared among three collaborating databases:
 GenBank
 DNA Database of Japan (DDBJ)
 European Molecular Biology Laboratory Database (EMBL)
The Old Way
From Fran Lewitter, Whitehead Institute
GenBank: NCBI’s Primary Sequence Database
Release 136, June 2003
Records: 25,592,865 (June 2002: 18,197,119)
Nucleotides: 32,528,249,295 (June 2002: 22,616,937,182)
Species: 110,000+
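The year-over-year growth implied by the two sets of figures above can be worked out directly. This is only an arithmetic sketch using the numbers quoted on the slide; the helper name `percent_growth` is made up for illustration.

```python
# Record and nucleotide counts quoted above for June 2002 and June 2003 (Release 136).
records_2002, records_2003 = 18_197_119, 25_592_865
bases_2002, bases_2003 = 22_616_937_182, 32_528_249_295


def percent_growth(old, new):
    """Relative increase from old to new, expressed as a percentage."""
    return 100.0 * (new - old) / old


record_growth = percent_growth(records_2002, records_2003)
base_growth = percent_growth(bases_2002, bases_2003)

print(f"records:     +{record_growth:.1f}% in one year")
print(f"nucleotides: +{base_growth:.1f}% in one year")
```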
• full release every two months
• incremental and cumulative updates daily
• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank/
121 Gigabytes of data
GenBank Divisions
Traditional Divisions
BCT  Bacterial/Archaeal
INV  Invertebrate
MAM  Mammalian (excluding ROD/PRI)
PHG  Phage
PLN  Plant/Fungal
PRI  Primate
ROD  Rodent
SYN  Synthetic (cloning vectors)
VRL  Viral
VRT  Other Vertebrate
Bulk Sequence Divisions
EST   Expressed Sequence Tag
STS   Sequence Tagged Site
GSS   Genome Survey Sequence
HTGS  High Throughput Genomic Sequence
HTC   High Throughput cDNA
A Traditional GenBank Record
Locus Field
Molecule Type
Definition Line
GI (GenInfo)
Keywords
Taxonomy
Submission Field
Modification Date
GenBank Division
Feature Table
GenPept Record
Genomic DNA
Sequence
Bulk Sequence Divisions
•Batch submission by e-mail or FTP
•Inaccurate
•Poorly characterized
EST   Expressed Sequence Tag
STS   Sequence Tagged Site
HTGS  High Throughput Genomic Sequence
EST Division: Expressed Sequence Tags
nucleus: 30,000 genes
RNA: gene products
make cDNA library: 80-100,000 unique cDNA clones in library
- isolate unique clones
- sequence once from each end (5’ and 3’)
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTT
TCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAG
AATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAG
TTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAG
CAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGT
ATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNA
TCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGA
TGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
ATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTC
ATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATT
CTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTT
GAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATT
CTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
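EST reads are single-pass sequences, so they often contain ambiguous base calls written as N. The sketch below reads FASTA-formatted text and reports the fraction of ambiguous bases per record. The function names are illustrative, and the example input is only the first line of the 3' read from this slide, not the whole record.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def ambiguous_fraction(seq):
    """Fraction of positions that are not an unambiguous A, C, G or T."""
    if not seq:
        return 0.0
    return sum(base not in "ACGT" for base in seq) / len(seq)


EXAMPLE = """>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
"""

records = parse_fasta(EXAMPLE)
for header, seq in records.items():
    print(header, len(seq), f"{ambiguous_fraction(seq):.3f}")
```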
What is UniGene?
A gene-oriented view of sequence entries
•MegaBlast-based automated sequence clustering
•Nonredundant set of gene-oriented clusters
•Each cluster represents a unique gene
•Provides information on tissue-specific expression and map locations
•Includes well-characterized genes and novel ESTs
•Useful for gene discovery and selection of mapping reagents
EST hits to Homo sapiens
muscle creatine kinase mRNA
Query Sequence
(muscle creatine kinase mRNA)
3’ EST Hits
5’ EST Hits
UniGene Entry for H. sapiens
Muscle Creatine Kinase
STS Division :
Sequence Tagged Sites



Segment of gene, EST, mRNA or genomic DNA of known position (e.g., a microsatellite)
PCR with STS primers gives one product per genome
Basis of Radiation Hybrid Mapping



UniGene
Genome Assembly
Related resource: Electronic PCR
UniSTS:
Database of Mapped Markers
HTG Division: High Throughput Genome
phase 1 HTG: unfinished, may be unordered, with gaps (Acc = AC109609.1)
phase 2 HTG: unfinished, oriented, ordered, may have gaps (Acc = AC109609.6)
phase 3: finished, no gaps (Acc = AC109609.10)
Same accession numbers, different versions
40,000 to > 50,000 bp
ROD
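Because an HTG record keeps its accession while the version number climbs, comparing two identifiers means splitting them at the dot and comparing the version as an integer. A plain string sort would put ".10" before ".6". The sketch below shows this with the accessions from the slide; `split_accession` is an illustrative helper, not an NCBI function.

```python
def split_accession(acc_ver):
    """Split 'ACCESSION.VERSION' into (accession, version as int)."""
    accession, sep, version = acc_ver.partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"no numeric version in {acc_ver!r}")
    return accession, int(version)


versions = ["AC109609.10", "AC109609.1", "AC109609.6"]

# Sort by (accession, numeric version) rather than by raw string.
ordered = sorted(versions, key=split_accession)
latest = ordered[-1]

print(ordered)
print("latest:", latest)
```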
HTG Division:
High Throughput Genome
RefSeq:
NCBI’s Derivative Sequence Database

Curated transcripts and proteins




Human model transcripts and proteins
Assembled Genomic Regions (contigs)



reviewed
human, mouse, rat, fruit fly, zebrafish, arabidopsis
draft human genome
mouse genome
Chromosome records



Microbial
viral
organelle
Reference Sequences
Chromosome:     NC_000000
Gene:           NG_000000
mRNA:           NM_000000
protein:        NP_000000
Contig:         NT_000000, NW_000000
RNA:            NR_000000
Model mRNA:     XM_000000
Model RNA:      XR_000000
Model protein:  XP_000000
Curated
Automated
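The two-letter prefix of a RefSeq accession identifies the kind of record, so a small lookup table is enough to classify one. The sketch below uses only the prefixes listed on this slide; the function names and the wording of the labels are illustrative.

```python
# Prefix -> record type, as listed on the slide.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NG": "gene",
    "NM": "mRNA",
    "NP": "protein",
    "NT": "contig",
    "NW": "contig",
    "NR": "RNA",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}


def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if unrecognized."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


def is_model(accession):
    """True for the model (X-prefixed) record types."""
    return accession.startswith(("XM_", "XR_", "XP_"))


for acc in ["NC_002695", "NM_000000", "XP_000000", "AC109609.1"]:
    print(acc, "->", classify_refseq(acc), "| model:", is_model(acc))
```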
RefSeq
Chromosomes:
NC_
LOCUS       NC_002695            5498450 bp    DNA     circular BCT 02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
REFERENCE   1 (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
            Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
            Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying the
            verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7
            derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
  PUBMED    10734605
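The header of a flat-file record like this one is column-oriented: the keyword sits in the first 12 columns and the value follows, with continuation lines left blank in the keyword columns. A minimal sketch of reading such a header into a dictionary follows. It assumes that 12-column layout and handles only a few fields from the record above; it is not a full GenBank parser.

```python
SAMPLE = """\
LOCUS       NC_002695            5498450 bp    DNA     circular BCT 02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
"""


def parse_header(text):
    """Map each keyword (columns 1-12) to its value, joining continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            current = key
            fields[current] = value
        elif current is not None:
            # Blank keyword columns mean this line continues the previous field.
            fields[current] += " " + value
    return fields


fields = parse_header(SAMPLE)
locus = fields["LOCUS"].split()
division, date = locus[-2], locus[-1]

print(fields["ACCESSION"], fields["VERSION"])
print("division:", division, "| modified:", date)
```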
RefSeq Contig: NT_, NW_
Curated RefSeq Records: NM_, NP_
Alignment Generated Transcripts: XM_, XP_
REFSEQ: Summary
BLAST
a starting point for most bioinformatics-related problems…
BLAST
One BLAST, many flavors
BLAST
databases
Example:
BLASTing protein sequence
BLAST output
BLAST output formatting
BLAST output
BLAST output
low complexity filter
BLAST
•Scores we get from BLAST have an underlying distribution.
•E-value: the number of alignments with a particular score, or a better score, that are expected to occur by chance when comparing two random sequences
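To make the E-value concrete: under the standard Karlin-Altschul statistics, the expected number of chance alignments with bit score S' or better is E = m * n * 2^(-S'), where m and n are the lengths being compared. The sketch below applies that formula. The function name and the example lengths are illustrative, and real BLAST output also applies length corrections that are not modelled here.

```python
def evalue_from_bit_score(bit_score, query_len, subject_len):
    """Expected number of chance alignments scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with m and n the two lengths (the search
    space) and S' the normalized bit score.
    """
    return query_len * subject_len * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 1,000,000 residues.
for score in (20, 40, 60):
    e = evalue_from_bit_score(score, 300, 1_000_000)
    print(f"bit score {score}: E = {e:.3g}")
```

Each extra bit of score halves the E-value, and doubling the size of the search space doubles it, which is why the same alignment looks less significant against a larger database.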
BLAST
Download