Molecular biology databases
Based on Chapter 2 of
Post-genome Informatics
by Minoru Kanehisa,
Oxford University Press, 2000
2.1 History
2.2 Information Technology
2.3 New generation databases
Evolution of molecular biology databases

Database category | Data content | Examples
1. Literature database | Bibliographic citations; on-line journals | MEDLINE (1971)
2. Factual database | Nucleic acid sequences | GenBank (1982), EMBL (1982), DDBJ (1984)
 | Amino acid sequences | PIR (1968), PRF (1979), SWISS-PROT (1986)
 | 3D molecular structures | PDB (1971), CSD (1965)
3. Knowledge base | Motif libraries | PROSITE (1988)
 | Molecular classifications | SCOP (1994)
 | Biochemical pathways | KEGG (1995)
The addresses for the major databases

Database | Organization | Address
MEDLINE | National Library of Medicine | www.nlm.nih.gov
GenBank | National Center for Biotechnology Information | www.ncbi.nlm.nih.gov
EMBL | European Bioinformatics Institute | www.ebi.ac.uk
DDBJ | National Institute of Genetics, Japan | www.ddbj.nig.ac.jp
SWISS-PROT | Swiss Institute of Bioinformatics | www.expasy.ch
PIR | National Biomedical Research Foundation | www-nbrf.georgetown.edu
PRF | Protein Research Foundation, Japan | www.prf.or.jp
PDB | Research Collaboratory for Structural Bioinformatics | www.rcsb.org
CSD | Cambridge Crystallographic Data Centre | www.ccdc.cam.ac.uk
New generation of molecular biology databases

Information | Database | Address
Compounds and reactions | LIGAND | www.genome.ad.jp/dbget/ligand.html
 | AAindex | www.genome.ad.jp/dbget/aaindex.html
Protein families and sequence motifs | PROSITE | www.expasy.ch/sprot/prosite.html
 | Blocks | www.blocks.fhcrc.org/
 | PRINTS | www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/
 | Pfam | www.sanger.ac.uk/Pfam/, pfam.wustl.edu/
 | ProDom | protein.toulouse.inra.fr/prodom.html
3D fold classifications | SCOP | scop.mrc-lmb.cam.ac.uk/scop/
 | CATH | www.biochem.ucl.ac.uk/bsm/cath/
Orthologous genes | COG | www.ncbi.nlm.nih.gov/COG/
 | KEGG | www.genome.ad.jp/kegg/
Biochemical pathways | KEGG | www.genome.ad.jp/kegg/
 | WIT | www.mcs.anl.gov/WIT2/
 | EcoCyc | ecocyc.PangeaSystems.com/ecocyc/
 | UM-BBD | www.labmed.umn.edu/umbbd/
Genome diversity | NCBI Taxonomy | www.ncbi.nlm.nih.gov/Taxonomy/
 | OMIM | www.ncbi.nlm.nih.gov/Omim/
[Figure: log-scale plot of Amount (x1000), from 0.001 to 100 000, against Year, 1965 to 2000. Curves: MEDLINE records, MEDLINE G5 MeSH, Transistors / chip, DNA sequences, Mapped human genes, 3-D structures.]
Example of sequence database entry for GenBank

LOCUS       DRODPPC      4001 bp            INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryota; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
………………………….
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//
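The numbers in the GenBank entry can be cross-checked against each other. The following is a minimal sketch in Python, not part of the original slides; the helper name `span_length` is hypothetical, and the values are copied from the CDS feature and the BASE COUNT line of the DRODPPC entry.

```python
import re

def span_length(location: str) -> int:
    """Length in bases of a simple GenBank location such as '1188..2954'."""
    match = re.fullmatch(r"<?(\d+)\.\.>?(\d+)", location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1  # coordinates are 1-based and inclusive

# Values copied from the DRODPPC entry.
cds_bases = span_length("1188..2954")
codons = cds_bases // 3
base_count = {"a": 1170, "c": 1078, "g": 956, "t": 797}

print(cds_bases, codons, sum(base_count.values()))
```

The CDS spans 1767 bases, that is 589 codons: 588 residues plus a stop codon, which agrees with the 588 AA reported for the corresponding SWISS-PROT entry. The four base counts add up to the 4001 bp given on the LOCUS line.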
Example of sequence database entry for SWISS-PROT

ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR
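Because every SWISS-PROT line starts with a two-letter line code, an entry can be read with very little machinery. The sketch below is an illustration in Python, not an official parser; the function names `group_by_line_code` and `cross_references` are hypothetical, and only a handful of lines from the DECA_DROME entry are included.

```python
from collections import defaultdict

# A few lines from the DECA_DROME entry: two-letter line code, then content.
ENTRY = """\
ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
"""

def group_by_line_code(text):
    """Map each two-letter line code to the list of its content strings."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        fields[code].append(content)
    return fields

def cross_references(fields):
    """Turn DR lines into {database name: primary identifier}."""
    refs = {}
    for content in fields["DR"]:
        parts = [p.strip() for p in content.rstrip(".").split(";")]
        refs[parts[0]] = parts[1]
    return refs

fields = group_by_line_code(ENTRY)
refs = cross_references(fields)
print(refs["EMBL"], refs["PROSITE"])
```

The EMBL cross-reference recovered this way, M30116, is the ACCESSION of the GenBank entry for the same gene, which is how the DR lines tie the protein record to the nucleotide record.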
Functional classification of E. coli genes according to Monica Riley

I. Intermediary metabolism
   A. Degradation
   B. Central intermediary metabolism
   C. Respiration (aerobic and anaerobic)
   D. Fermentation
   E. ATP-proton motive force interconversions
   F. Broad regulatory functions
II. Biosynthesis of small molecules
   A. Amino acids
   B. Nucleotides
   C. Sugars and sugar molecules
   D. Cofactors, prosthetic groups, electron carriers
   E. Fatty acids and lipids
   F. Polyamines
III. Macromolecule metabolism
   A. Synthesis and modification
   B. Degradation of macromolecules
IV. Cell structure
   A. Membrane components
   B. Murein sacculus
   C. Surface polysaccharides and antigens
   D. Surface structures
V. Cellular processes
   A. Transport/binding proteins
   B. Cell division
   C. Chemotaxis and mobility
   D. Protein secretion
   E. Osmotic adaptations
VI. Other functions
   A. Cryptic genes
   B. Phage-related functions and prophages
   C. Colicin-related functions
   D. Plasmid-related functions
   E. Drug/analog sensitivity
   F. Radiation sensitivity
   G. DNA sites
   H. Adaptations to atypical conditions
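Relational database. A table (relation) is a set, and the three basic table operations
shown here (SELECT, JOIN and PROJECT) are extensions of the standard set operations.
[Figure: a table of papers (Paper 1, Paper 2, Paper 3, Paper 4, ...) with columns including Pages and MUID; a table of authors (Author 1-1, Author 1-2, Author 2-1, ...) with columns Author and MUID; and a table with columns Author, Pages and MUID. The operations are labelled SELECT, JOIN and PROJECT.]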
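The three operations can be written directly on Python sets of tuples, which makes the "a relation is a set" point concrete. This is a minimal sketch rather than how a database engine works; the column layout of the two toy relations is an assumption, and the MUIDs, page ranges and author names are taken from the references in the SWISS-PROT example entry.

```python
# Two toy relations, each a set of tuples. The column layout is assumed:
# papers = (MUID, pages), authors = (author, MUID).
papers = {
    (87090408, "81-84"),
    (90258853, "2669-2677"),
}
authors = {
    ("Padgett", 87090408),
    ("Gelbart", 87090408),
    ("Hoffmann", 90258853),
}

def select(relation, predicate):
    """SELECT: keep the rows (tuples) that satisfy a condition."""
    return {row for row in relation if predicate(row)}

def project(relation, columns):
    """PROJECT: keep only the given column positions; duplicates collapse."""
    return {tuple(row[i] for i in columns) for row in relation}

def join(left, right, left_col, right_col):
    """JOIN: concatenate pairs of rows whose key columns are equal."""
    return {l + r for l in left for r in right if l[left_col] == r[right_col]}

nature_paper = select(papers, lambda row: row[0] == 87090408)
# Joined rows are (author, MUID, MUID, pages); keep author and pages.
author_pages = project(join(authors, papers, 1, 0), (0, 3))
print(sorted(author_pages))
```

Because the results are again sets of tuples, the operations compose freely, and PROJECT removes duplicate rows automatically, just as a set would.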
A history of database technology development
Object-oriented programming (Kay, 1972)
Object-oriented database (1986)
Relational database (Codd, 1970)
Logic programming (Kowalski, 1972)
Deductive database (1977)
Deductive, object-oriented database (1989)
Multimedia in GenomeNet

Data type | Database | Media
Nucleic acid sequences | GenBank, EMBL | Text
Protein sequences | SWISS-PROT, PIR, PRF | Text
3D molecular structures | PDB | Text, 3D graphics
Sequence motifs | EPD, TRANSFAC, PROSITE | Text, 3D graphics
Chemical reactions | LIGAND/ENZYME | Text
Chemical compounds | LIGAND/COMPOUND | Text, image, 2D graphics
Biochemical pathways | KEGG/PATHWAY | Image, Java applet
Gene catalogues | KEGG/GENES | Text
Genomes | KEGG/GENOME | Text, image, Java applet
Expression profiles | KEGG/EXPRESSION | Image, Java applet
Genetic diseases | OMIM | Text
Amino acid mutations | PMD | Text
Amino acid indices | AAindex | Text
Literature | Medline, LITDB | Text
Database links | LinkDB | Text
Pancreatic trypsin inhibitor PDB: 4PTI
ribbon model and variant with cylinder
for alpha helix (figures from PDB)
The periodic table of chemical elements where the shaded
elements are those normally found in biology.
[Figure: periodic table, elements 1 (H) to 112 (Uub).]
Biologically important classes of organic compounds
derived from the six basic elements

H (hydrogen)
C (carbon): CO2 (carbon dioxide); HCO3- (hydrogen carbonate); CH4 (methane); CH3 (methyl group); COOH (carboxyl group); R (alkyl group); R-COOH (carboxylic acid)
N (nitrogen): NO3- (nitrate); NO2- (nitrite); NH3 (ammonia); NH2 (amino group); R-NH2 (amine); NH2-CHR-COOH (amino acid)
O (oxygen): H2O (water); OH (hydroxyl group); R-OH (alcohol); R-CHO (aldehyde); R-O-R' (ether); R-CO-R' (ketone)
P (phosphorus): PO4^3- (phosphate)
S (sulfur): SO4^2- (sulfate); SO3^2- (sulfite); S2O3^2- (thiosulfate); H2S (hydrogen sulfide); SH (sulfhydryl group); R-SH (thiol); R-S-S-R' (disulfide)

R-COO-R' (carboxylic acid ester such as fats)
HPO3-O-R' (phosphoric acid monoester such as phospholipids)
R-O-PO2-O-R' (phosphodiester bond in nucleic acids)
R-NH-CO-R' (peptide bond in proteins)
R-S-CO-R' (thioester such as acetyl-CoA)
The 20 common amino acids
BLO(ck)SU(bstitution)M(atrix) (Henikoff & Henikoff 1992)
• Derived from a set (2000) of aligned, ungapped regions from protein
families; it emphasizes chemical similarity between residues rather than how
easily one residue mutates into another. BLOSUMx is derived from segments
clustered at x% identity.
BLOSUM62 Matrix, log-odds representation
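A log-odds score compares how often two residues are seen aligned with how often they would be paired by chance. The sketch below illustrates the calculation in Python on a made-up two-letter alphabet; the function name `log_odds_scores` and the toy probabilities are assumptions for illustration, and the default scale of 2 corresponds to half-bit units.

```python
import math

def log_odds_scores(joint, scale=2.0):
    """Scores s(a,b) = scale * log2(q_ab / (p_a * p_b)), rounded to integers.

    `joint` maps ordered residue pairs (a, b) to the probability q_ab of
    seeing them aligned; it is expected to be symmetric and to sum to 1.
    """
    background = {}
    for (a, _b), q in joint.items():
        background[a] = background.get(a, 0.0) + q  # marginal frequency p_a
    return {
        (a, b): round(scale * math.log2(q / (background[a] * background[b])))
        for (a, b), q in joint.items()
    }

# Toy two-letter alphabet; the probabilities are invented for illustration.
toy_joint = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
scores = log_odds_scores(toy_joint)
print(scores[("A", "A")], scores[("A", "B")])
```

Pairs that occur more often than chance get positive scores and pairs that occur less often get negative ones, which is the sign pattern visible in a substitution matrix.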
Substitution/Scoring Matrices
• PAM matrices (Dayhoff et al. 1978): phylogeny-based.
PAM1: expected number of mutations = 1% (one accepted point mutation per 100 residues)
PAM250 matrix, log-odds representation
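Higher-numbered PAM matrices are obtained by applying the PAM1 mutation probabilities repeatedly, that is, by raising the PAM1 matrix to a power. The following Python sketch shows the idea on an invented two-state matrix with a 1% change per step; `toy_pam1` is not the real 20 x 20 Dayhoff matrix, and the helper names are hypothetical.

```python
def mat_mul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """m raised to the n-th power (n >= 1) by repeated squaring."""
    result = None
    base = m
    while n > 0:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# Toy PAM1: each residue has a 1% chance of changing in one step.
toy_pam1 = [[0.99, 0.01], [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250[0])
```

Even after 250 steps a residue is not certain to differ from where it started, because repeated changes at the same position can restore the original residue. The log-odds form of such a matrix is then obtained by dividing each probability by the background frequency of the target residue and taking the logarithm.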
A hidden Markov model for sequence analysis
[Figure: model with states labelled Start, End, m0 to m5, I0 to I4 and d1 to d4.]
m = match state (output), I = insert state (output), d = delete state (no output)
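The wiring of such a model can be generated mechanically. The sketch below, in Python, lists the allowed transitions for a profile HMM with four match columns. It assumes that m0 acts as the start state and m5 as the end state, and that every state at position k may move to the next match state, the next delete state, or the insert state at k; this fully connected topology is an assumption, since the figure itself is not recoverable here.

```python
def profile_hmm_transitions(length):
    """Allowed transitions for a profile HMM with `length` match columns.

    State names follow the figure legend: m0 is taken as the start and
    m{length+1} as the end; I0..I{length} are insert states and
    d1..d{length} are delete states.
    """
    end = length + 1
    transitions = []
    for k in range(end):
        sources = [f"m{k}", f"I{k}"] + ([f"d{k}"] if k >= 1 else [])
        targets = [f"m{k + 1}", f"I{k}"] + ([f"d{k + 1}"] if k + 1 <= length else [])
        transitions += [(s, t) for s in sources for t in targets]
    return transitions

edges = profile_hmm_transitions(4)
states = {state for edge in edges for state in edge}
print(len(states), len(edges))
```

Match and insert states emit a residue each time they are visited, while delete states are silent, so a path through d states corresponds to skipping columns of the alignment.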
Globin fold: α protein, myoglobin (PDB: 1MBN)
β sandwich: β protein, immunoglobulin (PDB: 7FAB)
TIM barrel: α/β protein, Triose phosphate IsoMerase (PDB: 1TIM)
A fold in α + β protein: ribonuclease A (PDB: 7RSA)
434 Cro protein complex (phage) (PDB: 3CRO)
Zinc finger: DNA recognition (Drosophila) (PDB: 2DRP)
..YRCKVCSRVY THISNFCRHY VTSH...
Leucine zipper (yeast) (PDB: 1YSA)
..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...
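Motifs like these are what a motif library stores as patterns. As an illustration, the Python sketch below translates a small subset of PROSITE-style pattern syntax into regular expressions and applies it to the two sequence fragments shown on the slide. The translator `prosite_to_regex` is hypothetical, and the two patterns are written in the style of the PROSITE C2H2 zinc finger and leucine zipper patterns; treat them as assumptions rather than authoritative library entries.

```python
import re

def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE syntax into a Python regex.

    Handles single residues, x, x(n), x(n,m) and [..] classes joined by '-'.
    """
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(
            r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, low, high = match.groups()
        regex = "." if symbol == "x" else symbol
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
leucine_zipper = prosite_to_regex("L-x(6)-L-x(6)-L-x(6)-L")

# Fragments from the slide, with the spaces removed.
finger_fragment = "YRCKVCSRVYTHISNFCRHYVTSH"
zipper_fragment = "RARKLQRMKQLEDKVEELLSKNYHLENEVARL"

print(re.search(zinc_finger, finger_fragment).span())
print(re.search(leucine_zipper, zipper_fragment).span())
```

Both fragments match: the zinc finger pattern picks up the two cysteines and two histidines, and the leucine zipper pattern finds four leucines spaced seven residues apart.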
The orthologue group table for F1-F0 ATP synthase
(upper) and V-type ATP synthase (lower).

Organism | epsilon | beta | gamma | alpha | delta | b | c | a
eco | b3731 | b3732 | b3733 | b3734 | b3735 | b3736 | b3737 | b3738
bsu | atpC | atpD | atpG | atpA | atpH | atpF | atpE | atpB
mtu | Rv1311 | Rv1310 | Rv1309 | Rv1308 | Rv1307 | Rv1306 | Rv1305 | Rv1304
aae | aq_673 | aq_2038 | aq_2041 | aq_679 | aq_1588 | aq_1586, aq_1587 | aq_177 | aq_179
syn | slr1330 | slr1329 | sll1327 | sll1326 | sll1325 | sll1324, sll1323 | ssl2615 | sll1322
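V-type ATP synthase (lower table):

Organism | C | F | A | B | E | K | I | D
bbu |  |  | BB0094 | BB0093 | BB0096 | BB0090 | BB0091 | BB0092
mja | MJ0219 | MJ0218 | MJ0217 | MJ0216 | (remaining cells: MJ0222, MJ0222, MJ0220, MJ0226)
afu | AF1164 | AF1165 | AF1166 | AF1167 | (remaining cells: AF1163, AF1158, AF1159, AF1159, AF1168)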
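An orthologue group table is naturally a mapping from organism to one gene per subunit column. The Python sketch below stores three rows of the F1-F0 table (eco, bsu, mtu) and shows two simple queries; the function names are hypothetical, and the contiguity check rests on the assumption that numbered locus tags follow chromosomal order.

```python
import re

# Rows copied from the F1-F0 ATP synthase table; columns are subunit names.
SUBUNITS = ["epsilon", "beta", "gamma", "alpha", "delta", "b", "c", "a"]
TABLE = {
    "eco": ["b3731", "b3732", "b3733", "b3734", "b3735", "b3736", "b3737", "b3738"],
    "bsu": ["atpC", "atpD", "atpG", "atpA", "atpH", "atpF", "atpE", "atpB"],
    "mtu": ["Rv1311", "Rv1310", "Rv1309", "Rv1308", "Rv1307", "Rv1306", "Rv1305", "Rv1304"],
}

def orthologues(subunit):
    """Genes in one column, i.e. the orthologue group for that subunit."""
    col = SUBUNITS.index(subunit)
    return {organism: genes[col] for organism, genes in TABLE.items()}

def is_contiguous(genes):
    """True if the numbered locus tags form an unbroken run, up or down."""
    numbers = [int(re.search(r"(\d+)$", gene).group(1)) for gene in genes]
    steps = {b - a for a, b in zip(numbers, numbers[1:])}
    return steps in ({1}, {-1})

print(orthologues("beta"))
print(is_contiguous(TABLE["eco"]), is_contiguous(TABLE["mtu"]))
```

Reading down a column gives the orthologues of one subunit across genomes, while reading along a row shows whether the genes sit next to each other, as the consecutive numbers for eco and mtu suggest.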
Reactions and interactions
Note the notion of the Enzyme Commission (EC) number.
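An EC number is a four-level hierarchical code: class, subclass, sub-subclass and serial number. The sketch below, in Python, parses such a code and compares two enzymes at a chosen depth. The helper names are hypothetical, and the class table lists only the six classes of the classic scheme.

```python
import re

# Top-level EC classes of the classic six-class scheme.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}

def parse_ec(ec_number):
    """Split 'EC a.b.c.d' into (class, subclass, sub-subclass, serial)."""
    match = re.fullmatch(r"(?:EC\s*)?(\d+)\.(\d+)\.(\d+)\.(\d+)", ec_number.strip())
    if match is None:
        raise ValueError(f"not a complete EC number: {ec_number!r}")
    return tuple(int(group) for group in match.groups())

def same_reaction_type(ec_a, ec_b, depth=3):
    """True if two EC numbers agree on their first `depth` levels."""
    return parse_ec(ec_a)[:depth] == parse_ec(ec_b)[:depth]

tim = parse_ec("EC 5.3.1.1")  # triose-phosphate isomerase
print(tim, EC_CLASSES[tim[0]])
```

Because the code classifies the reaction rather than the protein, two unrelated sequences can share an EC number, which is why it serves as a link between gene catalogues and reaction or pathway data.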
Biochemical pathways
Genome diversity
The tree of life showing the relationship of archaea, bacteria, and
eukaryotes, as well as the relationship of fungi, plants and animals.
Bacteria
Archaea
Eukaryotes
Protists
Plants
Fungi
Animals