A Field Guide to GenBank and NCBI Resources

NCBI Molecular Biology Resources
A Field Guide
NCBI
Nov. 6, 2001
NCBI Resources

About NCBI

NCBI Sequence Databases
• Primary Database – GenBank
• Derivative Databases - RefSeq
Entrez Databases and Text Searching

BLAST Services

Genomic Resources

The National Center for Biotechnology
Information (NCBI)

Created as a part of the National Library of Medicine in 1988
• Establish public databases
• Research in computational biology
• Develop software tools for sequence analysis
• Disseminate biomedical information

Tools: BLAST (1990), Entrez (1992)

GenBank (1992)

Free MEDLINE (PubMed, 1997)

Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq

Molecular Databases

Primary Databases
• Original submissions by experimentalists
• Database staff organize but don’t add additional information
  • Example: GenBank
Derivative Databases
• Human curated
  • compilation and correction of data
  • Example: SWISS-PROT, NCBI RefSeq mRNA
• Computationally Derived
  • Example: UniGene
• Combinations
  • Example: NCBI Genome Assembly
What is GenBank? NCBI’s Primary Sequence Database

Nucleotide-only sequence database

Archival in nature

GenBank Data
• Direct submissions of individual records (BankIt, Sequin)
• Batch submissions via email (EST, GSS, STS)
• FTP accounts for sequencing centers

Data shared nightly among three collaborating databases
• GenBank
• DNA Database of Japan (DDBJ)
• European Molecular Biology Laboratory Database (EMBL) at EBI
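[Figure: data exchange among the three collaborating databases. GenBank (NCBI, NIH; searched through Entrez), EMBL (EBI; searched through SRS), and DDBJ (CIB, NIG; searched through getentry) each receive submissions and updates and share them with the other two.]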
GenBank Release 126 (October 2001)
Records      13,602,262
Nucleotides  14,396,883,064
Species      80,000+

ftp://ncbi.nlm.nih.gov/genbank/
or
ftp://genbank.sdsc.edu/pub/

• full release every two months
• incremental and cumulative updates daily
• available only through the Internet
GenBank on FTP site
ftp> open ftp.ncbi.nlm.nih.gov
.
.
ftp> cd genbank
Release 125: 243 files; 55.23 Gigabytes uncompressed
GenBank Divisions
Bulk Sequence Divisions
PAT  Patent
EST  Expressed Sequence Tags (133 files)
STS  Sequence Tagged Site
GSS  Genome Survey Sequence (41 files)
HTG  High Throughput Genome (25 files)
HTC  High Throughput cDNA
CON  Contig
Traditional Divisions
BCT INV MAM PHG PLN PRI
ROD SYN UNA VRL VRT
EST Division: Expressed Sequence Tags

[Figure: how ESTs are made. About 30,000 genes in the nucleus yield 80-100,000 RNA gene products; these are used to make a cDNA library of 80-100,000 unique cDNA clones; unique clones are isolated and sequenced once from each end (5’ and 3’).]

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
STS Division : Sequence Tagged Sites

Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)

PCR with STS primers gives unique product (one per genome)

Basis of Radiation Hybrid Mapping

Related resource: Electronic PCR
http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi

UniGene
Genome Assembly
RH mapping using STSs
[Figure: STS markers A, B, C, and D along a human chromosome; hybrid cells each retain different chromosome fragments, and the PCR result for each marker in each hybrid is scored + or -.]
ePCR Results Hexokinase 1 EST
SHGC-35892
dbSTS id: 44155, GenBank Accession: G29974
Organism: Homo sapiens
Primer1: CATACGACACGGCTCACAAA
Primer2: CTGTTTGTCTCGTGGGGG
STS location: 30..160 Chromosome: 10
Expected amplicon size: 129, Observed amplicon size: 130
Primers match in forward orientation
Query sequence:
  1 TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA
 61 TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT
121 TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG
181 GCTAGGACTG GTTCCACGGA CACACGATTT TGTGGCATTG ACACACCACG ATGCGATGCC
241 AGGCCACAGT GGGTGCCAGG AGGGGAGGAA GCAGCTAATG CTATGCCCAC ACTCGCCTTC
301 AGCATGTGCC CCGGGAGGAG GCCCGGCAGT GTCTGCTGGT GATAATACAT TTCACACGGG
361 GAGGGGGAAC CAAGGATGAG CTTTGGAGGC CAGAAGGCTG TCAGGTGGTG TG
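The idea behind the Electronic PCR result above can be sketched in a few lines of Python. This is only an illustration of exact primer matching, not the algorithm the NCBI e-PCR service actually uses; the function names are hypothetical, and the template is the first 180 bases of the query sequence from the slide. Primer1 is searched on the forward strand and the reverse complement of Primer2 is searched downstream of it.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def find_amplicon(template: str, primer1: str, primer2: str):
    """Return (1-based start, amplicon size) for an exact forward match.

    Hypothetical helper: real e-PCR tolerates mismatches and checks both
    orientations; this sketch only handles the exact, forward case.
    """
    start = template.find(primer1)
    if start < 0:
        return None
    # The second primer anneals to the opposite strand, so on the forward
    # strand we look for its reverse complement downstream of primer1.
    rc = reverse_complement(primer2)
    rc_start = template.find(rc, start)
    if rc_start < 0:
        return None
    end = rc_start + len(rc)  # 0-based, exclusive
    return start + 1, end - start


# First three rows (bases 1-180) of the query sequence shown on the slide.
QUERY_FIRST_180 = (
    "TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA "
    "TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT "
    "TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG"
).replace(" ", "")

PRIMER1 = "CATACGACACGGCTCACAAA"
PRIMER2 = "CTGTTTGTCTCGTGGGGG"

if __name__ == "__main__":
    print(find_amplicon(QUERY_FIRST_180, PRIMER1, PRIMER2))
```

With these inputs the sketch reports a start at base 30 and an amplicon of 130 bases, which agrees with the STS start position and the observed amplicon size listed in the ePCR result.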
Genome Sequencing
[Figure: shotgun sequencing workflow. Whole BAC insert (or genome) → sonication → cloning and isolating → sequencing → assembly. Labels on the figure: GSS division; Draft Sequence (HTG division).]
GSS Division: Genome Survey Sequences
• Genomic equivalent of ESTs
• BAC and other first pass surveys
  • BAC end sequences
  • Whole Genome Shotgun (some)
• RAPIDS and other anonymous loci
[Figure: genomic clone (BAC) with its SP6 end and T7 end sequences marked.]
HTG Division: High Throughput Genome Records
Phase    Accession  gi       Division
phase 1  AC008701   6601005  HTG
phase 2  AC008701   6671909  HTG
phase 3  AC008701   7328720  PRI
Record length: 40,000 to > 350,000 bp
The GenBank Record
A Simple GenBank Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
GenBank Record, cont.
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
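As a small illustration of how the header and sequence-summary lines of this record relate, the following Python sketch pulls the fields out of the LOCUS line and checks that the BASE COUNT totals agree with the stated length. The two input strings are re-typed from the record above with assumed column spacing, and the function names are hypothetical. Whitespace splitting is enough for this simple linear record; it is not a general flat-file parser.

```python
import re

# Lines re-typed from the AF062069 record; exact column spacing is assumed.
LOCUS_LINE = "LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000"
BASE_COUNT_LINE = "BASE COUNT     1201 a    689 c    782 g   1136 t"


def parse_locus(line: str) -> dict:
    """Split a simple LOCUS line into named fields.

    Assumes the seven-token layout of this record:
    LOCUS, name, length, 'bp', mol-type, division, date.
    """
    fields = line.split()
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "mol_type": fields[4],
        "division": fields[5],
        "date": fields[6],
    }


def parse_base_count(line: str) -> dict:
    """Return {'a': n, 'c': n, 'g': n, 't': n} from a BASE COUNT line."""
    return {base: int(count) for count, base in re.findall(r"(\d+)\s+([acgt])\b", line)}


if __name__ == "__main__":
    locus = parse_locus(LOCUS_LINE)
    counts = parse_base_count(BASE_COUNT_LINE)
    print(locus)
    print(counts, "total:", sum(counts.values()))
```

For this record the four base counts add up to the 3808 bp given on the LOCUS line, which is a quick consistency check when reading a flat file.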
Sequence and Database Identifiers
Locus, accession, gi, version

LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484

Callouts:
• LOCUS: Locus Name (AF062069), length (3808 bp), Sequence mol-type (mRNA), GB Division (INV), Modification Date (02-MAR-2000)
• Sequence mol-type values: mRNA (= cDNA), rRNA, snRNA, DNA
• DEFINITION: DEF line (Title)
• ACCESSION: Accession Number
• VERSION: Accession.version (AF062069.2) and gi number (GI:7144484)
Keywords, Source-organism

KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.

Callouts:
• KEYWORDS: Legacy field; exception: EST, GSS, HTG
• SOURCE: Accepted common name
• ORGANISM: Scientific name, followed by the taxonomic lineage according to GenBank
Citation

REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

Callouts:
• REFERENCE 1: Article
• REFERENCE 2: Submitter Block
• REFERENCE 3: Update history
• COMMENT: Previous version
Feature Table

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain;
                     C-terminal myosin heavy chain head;
                     substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL

Callouts:
• source: Biosource
• CDS: Coding Sequence
• /codon_start: Reading Frame
• /protein_id, /db_xref: GenPept Protein Identifiers

Sequence

BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
<sequence omitted>
     3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata
     3781 aagatacagt aactagggaa aaaaaaaa
//

Callouts:
• ORIGIN: Indicates beginning of sequence data
• //: End of record
NCBI Derivative Sequence Databases: RefSeq
NCBI Reference Sequences
mRNAs and Proteins
NM_123456  Curated mRNA
NP_123456  Curated Protein
XM_123456  Predicted Transcript
XP_123456  Predicted Protein
Assemblies
NT_123456  Contig (Mouse and Human Genomes)
NC_123455  Chromosome (Microbial Genomes)
Gene Records
NG_123456  Reference Genomic Sequence
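The accession prefixes listed for RefSeq lend themselves to a simple lookup. The Python sketch below is illustrative only: the dictionary covers just the seven prefixes named on this slide, and the function name is hypothetical.

```python
# Prefix table transcribed from the slide; RefSeq may define other
# prefixes that are not shown here.
REFSEQ_PREFIXES = {
    "NM_": "curated mRNA",
    "NP_": "curated protein",
    "XM_": "predicted transcript",
    "XP_": "predicted protein",
    "NT_": "contig",
    "NC_": "chromosome",
    "NG_": "reference genomic sequence",
}


def classify_accession(accession: str) -> str:
    """Describe an accession by its first three characters.

    Anything without one of the listed prefixes is reported as not covered
    by the table (for example a GenBank accession such as AF062069).
    """
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "not a RefSeq prefix in this table")


if __name__ == "__main__":
    for acc in ("NM_000492", "NP_000483", "XM_004980", "NT_007935", "AF062069"):
        print(acc, "->", classify_accession(acc))
```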
Curated RefSeq Records: NM_, NP_

RefSeq Nucleotide
LOCUS       NM_000492    6159 bp    mRNA            PRI       26-JUL-1999
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance
            regulator (CFTR) mRNA.
ACCESSION   NM_000492
COMMENT     REFSEQ: This reference sequence was derived from M28668.1,
            M55131.1.
            On Feb 17, 2000 this sequence version replaced gi:4502784.
            Summary: Cystic fibrosis transmembrane conductance regulator is
            member 7 of the ATP-binding cassette sub-family C. The protein
            functions as a chloride channel and controls the regulation of
            other transport pathways. Mutations in this gene cause the
            autosomal recessive disorder, cystic fibrosis (CF) and congenital
            bilateral aplasia of the vas deferens (CBAVD). Alternative splice
            variants have been described, many of which result from mutations
            in the CFTR gene.
            COMPLETENESS: full length.

RefSeq Protein
LOCUS       NP_000483    1480 aa                    PRI       26-JUL-1999
DEFINITION  cystic fibrosis transmembrane conductance regulator.
ACCESSION   NP_000483
PID         g4502785
VERSION     NP_000483.1  GI:4502785
DBSOURCE    REFSEQ: accession NM_000492.1

Record status (shown in COMMENT)
COMMENT     REFSEQ: This reference sequence was derived from M55131.
            PROVISIONAL RefSeq: This is a provisional reference sequence
            record that has not yet been subject to human review. The final
            curated reference sequence record may be somewhat different from
            this one.
Callout: Reviewed
Alignment Generated Transcripts: XM_, XP_
LOCUS       XM_004980    6128 bp    mRNA            PRI       16-NOV-2000
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance regulator,
            ATP-binding cassette (sub-family C, member 7) (CFTR), mRNA.
ACCESSION   XM_004980
VERSION     XM_004980.3  GI:13631444
Callout: mismatch
RefSeq Human Contig: NT_

LOCUS       NT_007935 1888399 bp    DNA             CON       16-NOV-2000
DEFINITION  Homo sapiens chromosome 7 working draft sequence segment,
            complete sequence.
ACCESSION   NT_007935
VERSION     NT_007935.1  GI:11422165
KEYWORDS    HTG.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
            Hominidae; Homo.
REFERENCE   1  (bases 1 to 1888399)
  AUTHORS   International Human Genome Project collaborators.
  TITLE     Toward the complete sequence of the human genome
  JOURNAL   Unpublished
COMMENT     GENOME ANNOTATION REFSEQ: NCBI contigs are derived from
            assembled genomic sequence data. They may include both
            draft and finished sequence.
            COMPLETENESS: not full length.

Gene and mRNA features on the contig:
     gene            complement(1255889..1444587)
                     /gene="CFTR"
                     /note="CF; MRP7; ABC35; ABCC7"
                     /db_xref="LocusID:1080"
     mRNA            complement(join(1255889..1257642,1258986..1259091,
                     1259690..1259862,1271619..1271708,1281957..1282112,
                     1296780..1297028,1309837..1309937,1312742..1312969,
                     1313881..1314031,1317797..1317876,1320768..1321018,
                     1321687..1321724,1329492..1329620,1331893..1332616,
                     1334111..1334197,1336717..1336811,1364895..1365086,
                     1375727..1375909,1382442..1382534,1384204..1384450,
                     1387877..1388002,1389139..1389302,1390185..1390274,
                     1393436..1393651,1415408..1415516,1420187..1420297,
                     1444403..1444587))
                     /partial
                     /gene="CFTR"
                     /product="cystic fibrosis transmembrane conductance
                     regulator, ATP-binding cassette (sub-family C, member 7)"
                     /transcript_id="XM_004980.1"
                     /db_xref="LocusID:1080"
                     /db_xref="MIM:602421"
                     /note="derived by automated computational analysis
                     using gene prediction method: Acembly. Supporting evidence
                     includes similarity to: 9 proteins, 1 mRNAs See details in
                     AceView"

Reordering draft sequence (CONTIG line, as far as shown on the slide):
CONTIG      join(AC073042.3:1155..2680,gap(100),AC074390.2:119526..151445,
            gap(100),AC074390.2:1..5245,gap(100),
            complement(AC074390.2:17705..23645),gap(100),
            AC074390.2:97658..119425,AC073042.3:106479..121155,
            AC074390.2:164226..165036,AC073042.3:70628..79503,gap(100),
            AC073042.3:4627..6382,gap(100),AC073042.3:2781..4526,gap(100),
            complement(AC073042.3:183627..209083),gap(100),
            AC073042.3:79604..88622,gap(100),AC073042.3:139234..160437,
            gap(100),complement(AC073042.3:6483..8319),gap(100),
            complement(AC073042.3:39354..45372),gap(100),
            complement(AC073042.3:21461..24064),gap(100),
            AC074390.2:156347..160294,gap(100),
            complement(AC074390.2:5346..10750),gap(100),
            complement(AC074390.2:153911..156246),gap(100),
            complement(AC074390.2:23746..32402),gap(100),
            complement(AC074390.2:151546..153810),gap(100),
            complement(AC074390.2:57277..75275),gap(100),
            complement(AC074390.2:75376..97557),gap(100),
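Feature locations such as the CFTR mRNA `complement(join(...))` and gene `complement(1255889..1444587)` on this contig can be read mechanically. The Python sketch below handles only plain `start..end` spans in local coordinates, optionally wrapped in `join` and `complement`; it is not a full feature-table location parser (no remote accessions as on the CONTIG line, no partial-end markers), and the function names are hypothetical.

```python
import re


def parse_location(location: str):
    """Return (on_complement_strand, [(start, end), ...]) for a simple location."""
    on_complement = location.strip().startswith("complement(")
    spans = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    return on_complement, spans


def spliced_length(spans) -> int:
    """Total number of bases covered by the spans (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in spans)


if __name__ == "__main__":
    # The gene span and the first three mRNA spans from the NT_007935 slide.
    gene = "complement(1255889..1444587)"
    first_exons = "complement(join(1255889..1257642,1258986..1259091,1259690..1259862))"

    strand, spans = parse_location(gene)
    print("gene on complement strand:", strand, "length:", spliced_length(spans))

    strand, spans = parse_location(first_exons)
    print("first three spans:", spans, "total:", spliced_length(spans))
```

Because both ends of a span are counted, a span `a..b` contributes `b - a + 1` bases; the gene feature here covers 188,699 bases of the contig.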
Map View of RefSeqs
[Figure: map view showing NT_, XM_, and NM_ records.]
RefSeq Genome Records: NG_
RefSeq Chromosomes: NC_
LOCUS       NC_002695 5498450 bp    DNA   circular  BCT       02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
REFERENCE   1  (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
            Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
            Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying the
            verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7
            derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
   PUBMED   10734605
Other NCBI Derivative Databases
UniGene   - gene oriented expressed sequence clusters
LocusLink - central resource and interface for known genes
NCBI Homepage
Mendelian Inheritance in Man
Entrez
Similarity Searching
Using Entrez
An integrated database search and retrieval system
Entrez: Neighboring and Hard Links
[Figure: the Entrez databases joined by hard links, with neighbors computed within each database: PubMed abstracts (word weight), 3-D Structure (VAST), Nucleotide sequences (BLAST), Protein sequences (BLAST), Taxonomy (phylogeny), and Genomes.]

WWW Entrez
• Nucleotide sequences: GenBank, EMBL, DDBJ, RefSeq, PDB
• PubMed abstracts: all of MEDLINE plus others; abstracts; links to online journals
• Protein sequences: GenBank, DDBJ, EMBL translations; PDB, PIR, SWISS-PROT, PRF, RefSeq
• 3-D Structure: NCBI’s MMDB, derived from PDB
• Genomes: reference genomes; graphical views, assembled sequence and mapping data
Database Searching with Entrez
Using limits and field restriction to find mouse GAPD

Linking and neighboring with mouse GAPD
NCBI

Entrez Nucleotides
Search term: Mouse
Document Summaries: Mouse[All Fields]
3 million records. Chicken, not mouse!?
Entrez Nucleotides: Limits: Preview/Index
Search term: Mouse
Entrez Nucleotides: Limits
Field Restriction (search term: Mouse)
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid

Exclude unwanted categories of sequences
Gene Location: Genomic DNA/RNA, Mitochondrion, Chloroplast
Molecule: Genomic DNA/RNA, mRNA, rRNA
Only From: RefSeq, GenBank, EMBL, DDBJ
Entrez Nucleotides: Limits: Organism
Search term: Mouse
Document Summaries: Mouse[Organism]
  2,976,070  Mouse[All Fields]
- 2,921,009  Mouse[Organism]
=    55,061  records matching Mouse[All Fields] but not Mouse[Organism]
Exclude Bulk Sequences, mRNA
Adding Terms: Preview/Index
Search History
Terms added from the index: glyceraldehyde, 3 phosphate dehydrogenase
Field list: same as on the Limits page, plus Volume
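The field-restricted searches on these slides use the `term[Field]` notation, as in Mouse[All Fields] and Mouse[Organism]. A tiny Python sketch of composing such a query string follows. The helper names are hypothetical, the choice of Title Word for the second term is an assumption made for illustration (the slides do not show which field was selected), and nothing here contacts Entrez.

```python
def field_term(term: str, field: str) -> str:
    """Format one search term restricted to a field, e.g. Mouse[Organism]."""
    return f"{term}[{field}]"


def build_query(*terms: str) -> str:
    """Combine terms with the Boolean AND operator."""
    return " AND ".join(terms)


if __name__ == "__main__":
    query = build_query(
        field_term("Mouse", "Organism"),
        # Assumed field: the slide does not say where 'glyceraldehyde' was searched.
        field_term("glyceraldehyde", "Title Word"),
    )
    print(query)

    # The record counts from the slide: restricting to the Organism field
    # removes the hits that only mention mouse somewhere else in the record.
    all_fields, organism = 2_976_070, 2_921_009
    print(all_fields - organism)
```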
Mouse GAPD Records
Displaying Mouse GAPD Records
Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links
Entrez GenBank / GenPept
NCBI
GenPept
FASTA Format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line
>gi|193425|gb|M60978.1|MUSGAPDS
• 193425: gi number
• gb: database identifier
• M60978.1: Accession number
• MUSGAPDS: Locus Name

Database Identifiers
gb   GenBank
emb  EMBL
dbj  DDBJ
sp   SWISS-PROT
pdb  Protein Databank
pir  PIR
prf  PRF
ref  RefSeq
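The FASTA definition line packs several identifiers between `|` separators. The Python sketch below splits the `gi|number|db|accession.version|locus` form shown on the slide and expands the database tag using the identifier table. The function name and the returned dictionary keys are hypothetical, and other definition-line layouts are not handled.

```python
# Database tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


def parse_defline(defline: str) -> dict:
    """Parse a '>gi|N|db|ACC.VER|LOCUS title...' definition line."""
    header, _, title = defline.lstrip(">").partition(" ")
    parts = header.split("|")
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("expected gi|number|db|accession|locus")
    accession, _, version = parts[3].partition(".")
    return {
        "gi": int(parts[1]),
        "database": DB_TAGS.get(parts[2], parts[2]),
        "accession": accession,
        "version": int(version) if version else None,
        "locus": parts[4],
        "title": title,
    }


if __name__ == "__main__":
    line = ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform"
    print(parse_defline(line))
```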
Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate
 dehydrogenase (Gapd-S) mRNA, and translated products" ,
    update-date
      std {
        year 1994 ,
        month 11 ,
        day 9 } ,
    source {
      org {
        taxname "Mus musculus" ,
        common "house mouse" ,
        db {
          {
            db "taxon" ,
            tag
              id 10090 } } ,
[Figure: display formats produced from ASN.1: GenBank and FASTA (nucleotide); GenPept and FASTA (protein).]
NCBI Toolbox
Toolbox Sources
ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools
ftp> open ncbi.nlm.nih.gov
.
.
ftp> cd toolbox
ftp> cd ncbi_tools

/*****************************************************************************
*
*   asn2ff.c
*
*   convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
*
*****************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>

#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
Protein Neighbors-Structure Links
• Related Proteins
• Cn3D GAPD Structure
• Structure Links
Advanced Neighbors: BLink
PubMed Link
Online Books
Entrez Structures
Molecular Modeling Database (MMDB) and Cn3D
MMDB: Molecular Modeling Data Base

Derived from experimentally determined PDB records

Value added to PDB records including:
• Addition of explicit chemical graph information
• Validation
• Inclusion of Taxonomy, Citation, and other information
• Conversion to parseable ASN.1 data description language

Structure neighbors determined by Vector Alignment Search Tool (VAST)
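Searching MMDB
Structure Summary: 1CET
[Figure: structure summary page with links to BLAST neighbors, VAST neighbors, and the Cn3D viewer.]
Cn3D : Displaying Structures
[Figure: Cn3D view with chloroquine labeled.]
Structure Neighbors
Structural Alignments
[Figure: structural alignment with chloroquine and NADH labeled.]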
Why do we need similarity searching?
Identification and annotation
• Incomplete or no annotations (GenBank)
• Incorrectly annotated sequences
Evolutionary relationships
• homologous molecules may have similar functions, but it ain’t necessarily so!
Basic Local Alignment Search Tool

Widely used similarity search tool

Heuristic approach based on the Smith-Waterman algorithm

Finds best local alignments

Provides statistical significance

All combinations (DNA/Protein) of query and database
• DNA vs DNA
• DNA translation vs Protein
• Protein vs Protein
• Protein vs DNA translation
• DNA translation vs DNA translation

WWW, email server, standalone, and network clients
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution
For ungapped alignments:
Expected number with score S or greater
E = K m n e^{-\lambda S}
or
E = m n 2^{-S'}
K = scale for search space
\lambda = scale for scoring system
S' = bit score = (\lambda S - \ln K) / \ln 2
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
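The two forms of the expected-value formula can be checked numerically. In the Python sketch below, m and n are taken to be the query and database lengths, and the values of lambda and K are illustrative numbers chosen only for the demonstration, not parameters quoted by the slides. Substituting the bit score S' = (lambda*S - ln K)/ln 2 into E = m n 2^(-S') gives back E = K m n e^(-lambda*S), which the code confirms.

```python
import math


def e_value(raw_score: float, m: float, n: float, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S) for a raw score S."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, m: float, n: float) -> float:
    """E = m * n * 2 ** (-S') for a bit score S'."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not taken from the slides).
    lam, k = 0.318, 0.134
    m, n = 250, 1_000_000_000  # query length, database length
    s = 100                    # raw alignment score

    bits = bit_score(s, lam, k)
    print("bit score:", round(bits, 1))
    print("E from raw score:", e_value(s, m, n, lam, k))
    print("E from bit score:", e_value_from_bits(bits, m, n))
```

The bit-score form is convenient because it folds lambda and K into the score, so E depends only on the bit score and the search-space size m*n.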
Scoring Systems
• Nucleic acids
  • identity matrix
• Proteins
  • Position Independent Matrices
    • PAM Matrices (Percent Accepted Mutation)
      • Implicit model of evolution
      • Higher PAM numbers all calculated from PAM1
      • PAM250 widely used
    • BLOSUM Matrices (BLOck SUbstitution Matrices)
      • Empirically determined from alignment of conserved blocks
      • Each includes information up to a certain level of identity
      • BLOSUM62 widely used
  • Position Specific Score Matrices (PSSM)
    • PSI and RPS BLAST
BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Callouts:
• Common amino acids have low weights
• Rare amino acids have high weights
• Negative for less likely substitutions
• Positive for more likely substitutions
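To show how a position-independent matrix is used, the Python sketch below loads the lower triangle of BLOSUM62 as transcribed from the slide, mirrors it into a symmetric lookup, and scores an ungapped alignment by summing the pairwise values. The function names are hypothetical, and this is only the scoring step, not a search or alignment algorithm.

```python
# Lower triangle of BLOSUM62 as transcribed from the slide
# (row order: A R N D C Q E G H I L K M F P S T W Y V X).
LOWER_TRIANGLE = """
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1
"""


def load_matrix(text: str) -> dict:
    """Build a symmetric {(a, b): score} lookup from a lower-triangular listing."""
    rows = [line.split() for line in text.strip().splitlines()]
    order = [row[0] for row in rows]
    scores = {}
    for row in rows:
        residue = row[0]
        # Row i holds scores against the first i+1 residues in `order`.
        for other, value in zip(order, row[1:]):
            scores[(residue, other)] = scores[(other, residue)] = int(value)
    return scores


def ungapped_score(seq1: str, seq2: str, scores: dict) -> int:
    """Sum the substitution scores of two equal-length aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(scores[(a, b)] for a, b in zip(seq1, seq2))


BLOSUM62 = load_matrix(LOWER_TRIANGLE)

if __name__ == "__main__":
    print(BLOSUM62[("W", "W")], BLOSUM62[("A", "A")])  # rare vs common residue
    print(ungapped_score("ACD", "ACE", BLOSUM62))
```

The diagonal illustrates the callouts on the slide: matching a rare residue such as tryptophan scores 11, while matching a common one such as alanine scores only 4.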
Position Specific Substitution Rates
[Figure: substitution rates for a typical serine compared with an active site serine.]

Position Specific Score Matrix (PSSM)
Pos Res   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206  D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207  G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208  V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209  I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210  S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211  S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212  C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213  N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214  G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215  D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216  S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217  G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218  G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219  P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220  L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221  N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222  C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223  Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224  A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Callouts:
• Serine scored differently in these two positions
• Active site nucleophile
Gapped Alignments
• Gapping provides more biologically realistic alignments
• Statistical behavior not completely understood for gapped alignments
• Gapped BLAST parameters must be found by simulations for each matrix
• Affine gap costs = -(a+bk)
  a = gap open penalty, b = gap extend penalty, k = gap length
  A gap of length 1 receives the score -(a+b)
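The affine gap cost is easy to tabulate. In the Python sketch below the function name is hypothetical, and the penalty values a = 11 and b = 1 are used purely as an example, not as settings taken from the slides.

```python
def affine_gap_score(length: int, gap_open: int, gap_extend: int) -> int:
    """Score of a gap of `length` residues under the affine model -(a + b*k)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


if __name__ == "__main__":
    a, b = 11, 1  # example penalties (assumed for illustration)
    for k in (1, 2, 5, 10):
        print(k, affine_gap_score(k, a, b))
```

Opening a gap is charged once and each gapped position adds the extension penalty, so one long gap costs less than several short gaps covering the same number of positions.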
Intermission