Molecular biology databases

Based on Chapter 2 of Post-genome Informatics by Minoru Kanehisa, Oxford University Press, 2000

2.1 History
2.2 Information Technology
2.3 New generation databases

Evolution of molecular biology databases

Database category        Data content                  Examples
1 Literature database    Bibliographic citations       MEDLINE (1971)
                         On-line journals
2 Factual database       Nucleic acid sequences        GenBank (1982), EMBL (1982), DDBJ (1984)
                         Amino acid sequences          PIR (1968), PRF (1979), SWISS-PROT (1986)
                         3D molecular structures       PDB (1971), CSD (1965)
3 Knowledge base         Motif libraries               PROSITE (1988)
                         Molecular classifications     SCOP (1994)
                         Biochemical pathways          KEGG (1995)

The addresses for the major databases

Database      Organization                                            Address
MEDLINE       National Library of Medicine                            www.nlm.nih.gov
GenBank       National Center for Biotechnology Information           www.ncbi.nlm.nih.gov
EMBL          European Bioinformatics Institute                       www.ebi.ac.uk
DDBJ          National Institute of Genetics, Japan                   www.ddbj.nig.ac.jp
SWISS-PROT    Swiss Institute of Bioinformatics                       www.expasy.ch
PIR           National Biomedical Research Foundation                 www-nbrf.georgetown.edu
PRF           Protein Research Foundation, Japan                      www.prf.or.jp
PDB           Research Collaboratory for Structural Bioinformatics    www.rcsb.org
CSD           Cambridge Crystallographic Data Centre                  www.ccdc.cam.ac.uk

New generation of molecular biology databases

Information                              Database         Address
Compounds and reactions                  LIGAND           www.genome.ad.jp/dbget/ligand.html
                                         AAindex          www.genome.ad.jp/dbget/aaindex.html
Protein families and sequence motifs     PROSITE          www.expasy.ch/sprot/prosite.html
                                         Blocks           www.blocks.fhcrc.org/
                                         PRINTS           www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/
                                         Pfam             www.sanger.ac.uk/Pfam/, pfam.wustl.edu/
                                         ProDom           protein.toulouse.inra.fr/prodom.html
3D fold classifications                  SCOP             scop.mrc-lmb.cam.ac.uk/scop/
                                         CATH             www.biochem.ucl.ac.uk/bsm/cath/
Orthologous genes                        COG              www.ncbi.nlm.nih.gov/COG/
                                         KEGG             www.genome.ad.jp/kegg/
Biochemical pathways                     KEGG             www.genome.ad.jp/kegg/
                                         WIT              www.mcs.anl.gov/WIT2/
                                         EcoCyc           ecocyc.PangeaSystems.com/ecocyc/
                                         UM-BBD           www.labmed.umn.edu/umbbd/
Genome diversity                         NCBI Taxonomy    www.ncbi.nlm.nih.gov/Taxonomy/
                                         OMIM             www.ncbi.nlm.nih.gov/Omim/

[Figure: growth over time on a logarithmic scale. Vertical axis: Amount (x1000), from 0.001 to 100 000. Horizontal axis: Year, 1965 to 2000. Series: MEDLINE records, MEDLINE G5 MeSH, Transistors / chip, DNA sequences, Mapped human genes, 3-D structures.]

Example of sequence database entry for GenBank

LOCUS       DRODPPC      4001 bp            INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
      ………………………….
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//

Example of sequence database entry for SWISS-PROT

ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR

Functional classification of E. coli genes according to Monica Riley

I. Intermediary metabolism
   A. Degradation
   B. Central intermediary metabolism
   C. Respiration (aerobic and anaerobic)
   D. Fermentation
   E. ATP-proton motive force interconversions
   F. Broad regulatory functions
II. Biosynthesis of small molecules
   A. Amino acids
   B. Nucleotides
   C. Sugars and sugar molecules
   D. Cofactors, prosthetic groups, electron carriers
   E. Fatty acids and lipids
   F. Polyamines
III. Macromolecule metabolism
   A. Synthesis and modification
   B. Degradation of macromolecules
IV. Cell structure
   A. Membrane components
   B. Murein sacculus
   C. Surface polysaccharides and antigens
   D. Surface structures
V. Cellular processes
   A. Transport/binding proteins
   B. Cell division
   C. Chemotaxis and mobility
   D. Protein secretion
   E. Osmotic adaptation
VI. Other functions
   A. Cryptic genes
   B. Phage-related functions and prophages
   C. Colicin-related functions
   D. Plasmid-related functions
   E. Drug/analog sensitivity
   F. Radiation sensitivity
   G. DNA sites
   H. Adaptations to atypical conditions

Relational database. A table (relation) is a set and the three basic table operations shown here are extensions of the standard set operations.

[Figure: a table of papers (Paper 1, Paper 2, Paper 3, Paper 4, ....) with columns including Pages and MUID, and a table of authors (Author 1-1, Author 1-2, Author 2-1, Author 2-2, Author 2-3, Author 3-1, ....) with columns Author and MUID. The three operations illustrated are SELECT, PROJECT, and JOIN; the joined table has columns Author, Pages, and MUID.]

A history of database technology development

Relational database (Codd, 1970)
Object-oriented programming (Kay, 1972)
Logic programming (Kowalski, 1972)
Deductive database (1977)
Object-oriented database (1986)
Deductive, object-oriented database (1989)

Multimedia in GenomeNet

Data type                  Database                   Media
Nucleic acid sequences     GenBank, EMBL              Text
Protein sequences          SWISS-PROT, PIR, PRF       Text
3D molecular structures    PDB                        Text, 3D graphics
Sequence motifs            EPD, TRANSFAC, PROSITE     Text, 3D graphics
Chemical reactions         LIGAND/ENZYME              Text
Chemical compounds         LIGAND/COMPOUND            Text, image, 2D graphics
Biochemical pathways       KEGG/PATHWAY               Image, Java applet
Gene catalogues            KEGG/GENES                 Text
Genomes                    KEGG/GENOME                Text, image, Java applet
Expression profiles        KEGG/EXPRESSION            Image, Java applet
Genetic diseases           OMIM                       Text
Amino acid mutations       PMD                        Text
Amino acid indices         AAindex                    Text
Literature                 Medline, LITDB             Text
Database links             LinkDB                     Text

Pancreatic trypsin inhibitor, PDB: 4PTI. Ribbon model, and a variant with a cylinder for the alpha helix (figures from PDB).

The periodic table of chemical elements where the shaded elements are those normally found in biology.
[Periodic table, listed by period as atomic number and symbol. The shading of the biological elements is not reproduced.]

Period 1:     1 H, 2 He
Period 2:     3 Li, 4 Be, 5 B, 6 C, 7 N, 8 O, 9 F, 10 Ne
Period 3:     11 Na, 12 Mg, 13 Al, 14 Si, 15 P, 16 S, 17 Cl, 18 Ar
Period 4:     19 K, 20 Ca, 21 Sc, 22 Ti, 23 V, 24 Cr, 25 Mn, 26 Fe, 27 Co, 28 Ni, 29 Cu, 30 Zn, 31 Ga, 32 Ge, 33 As, 34 Se, 35 Br, 36 Kr
Period 5:     37 Rb, 38 Sr, 39 Y, 40 Zr, 41 Nb, 42 Mo, 43 Tc, 44 Ru, 45 Rh, 46 Pd, 47 Ag, 48 Cd, 49 In, 50 Sn, 51 Sb, 52 Te, 53 I, 54 Xe
Period 6:     55 Cs, 56 Ba, (57-70 lanthanides), 71 Lu, 72 Hf, 73 Ta, 74 W, 75 Re, 76 Os, 77 Ir, 78 Pt, 79 Au, 80 Hg, 81 Tl, 82 Pb, 83 Bi, 84 Po, 85 At, 86 Rn
Period 7:     87 Fr, 88 Ra, (89-102 actinides), 103 Lr, 104 Rf, 105 Db, 106 Sg, 107 Bh, 108 Hs, 109 Mt, 110 Uun, 111 Uuu, 112 Uub
Lanthanides:  57 La, 58 Ce, 59 Pr, 60 Nd, 61 Pm, 62 Sm, 63 Eu, 64 Gd, 65 Tb, 66 Dy, 67 Ho, 68 Er, 69 Tm, 70 Yb
Actinides:    89 Ac, 90 Th, 91 Pa, 92 U, 93 Np, 94 Pu, 95 Am, 96 Cm, 97 Bk, 98 Cf, 99 Es, 100 Fm, 101 Md, 102 No

Biologically important classes of organic compounds derived from the six basic elements

H (hydrogen)
C (carbon):      CO2 (carbon dioxide), HCO3- (hydrogen carbonate), CH4 (methane), CH3 (methyl group), COOH (carboxyl group), R (alkyl group), R-COOH (carboxylic acid)
N (nitrogen):    NO3- (nitrate), NO2- (nitrite), NH3 (ammonia), NH2 (amino group), R-NH2 (amine)
O (oxygen):      H2O (water), OH (hydroxyl group), R-OH (alcohol), R-CHO (aldehyde), R-O-R' (ether), R-CO-R' (ketone)
P (phosphorus):  PO4^3- (phosphate)
S (sulfur):      SO4^2- (sulfate), SO3^2- (sulfite), S2O3^2- (thiosulfate), H2S (hydrogen sulfide), SH (sulfhydryl group), R-SH (thiol), R-S-S-R' (disulfide)

Compounds and linkages combining several elements:
NH2-CHR-COOH     (amino acid)
R-COO-R'         (carboxylic acid ester such as fats)
HPO3-O-R'        (phosphoric acid monoester such as phospholipids)
R-O-PO2-O-R'     (phosphodiester bond in nucleic acids)
R-NH-CO-R'       (peptide bond in proteins)
R-S-CO-R'        (thioester such as acetyl-CoA)

The 20 common amino acids

BLO(ck)SU(bstitution)M(atrix) (Henikoff & Henikoff 1992)
• Derived from a set of about 2000 aligned, ungapped regions (blocks) from protein families. It emphasizes chemical similarity between residues rather than how easily one residue mutates into another. BLOSUMx is derived from the set of segments of x% identity: sequences at least x% identical are clustered and counted as one before the substitution frequencies are tallied.
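As a rough illustration of the log-odds idea behind BLOSUM-style matrices, the Python sketch below counts residue pairs in a few ungapped alignment columns and turns the ratio of observed to expected pair frequencies into rounded half-bit scores. The function name `log_odds_matrix`, the `scale` parameter, and the toy columns are assumptions made for this example. It skips the x% identity clustering step and is not the Henikoff and Henikoff implementation.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(columns, scale=2.0):
    """Return {(a, b): score} from ungapped alignment columns.

    columns: list of strings, one string per alignment column (toy input).
    scale:   2.0 gives half-bit units, the unit used for BLOSUM matrices.
    """
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())

    # Observed frequency q_ab of each unordered pair.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency p_a of each residue, implied by the pairs.
    p = Counter()
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2

    # Score = scale * log2(observed / expected), rounded to an integer.
    scores = {}
    for (a, b), f in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(f / expected))
    return scores


# Hypothetical toy alignment: four columns, four sequences each.
toy_columns = ["AAAA", "AAAL", "LLLL", "LLLA"]
for pair, score in sorted(log_odds_matrix(toy_columns).items()):
    print(pair, score)
# ('A', 'A') 1
# ('A', 'L') -2
# ('L', 'L') 1
```

Positive scores mark pairs seen more often than chance would predict (here the identities), and negative scores mark pairs seen less often (the A/L mismatch), which is how the entries of a log-odds matrix are read.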
BLOSUM62 matrix, log-odds representation

Substitution/Scoring Matrices
• PAM matrices (Dayhoff et al. 1978): phylogeny-based. PAM1: expected number of mutations = 1%, that is, one accepted point mutation per 100 residues.

PAM250 matrix, log-odds representation

A hidden Markov model for sequence analysis

[Figure: profile HMM with match states m0 (Start), m1 to m4, and m5 (End); insert states I0 to I4; delete states d1 to d4.]
m = match state (output), I = insert state (output), d = delete state (no output)

Globin fold: α protein, myoglobin (PDB: 1MBN)
β sandwich: β protein, immunoglobulin (PDB: 7FAB)
TIM barrel: α/β protein, Triose phosphate IsoMerase (PDB: 1TIM)
A fold in an α+β protein: ribonuclease A (PDB: 7RSA)

434 Cro protein complex (phage), PDB: 3CRO
Zinc finger DNA recognition (Drosophila), PDB: 2DRP
    ..YRCKVCSRVY THISNFCRHY VTSH...
Leucine zipper (yeast), PDB: 1YSA
    ..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...

The orthologue group table for F1-F0 ATP synthase (upper) and V-type ATP synthase (lower).

F1-F0 ATP synthase
Organism  epsilon  beta     gamma    alpha    delta    b        c        a
eco       b3731    b3732    b3733    b3734    b3735    b3736    b3737    b3738
bsu       atpC     atpD     atpG     atpA     atpH     atpF     atpE     atpB
mtu       Rv1311   Rv1310   Rv1309   Rv1308   Rv1307   Rv1306   Rv1305   Rv1304
aae       aq_673, aq_2038, aq_2041, aq_679, aq_1588, aq_1586, aq_1587, aq_177, aq_179
syn       slr1330, slr1329, sll1327, sll1326, sll1325, sll1324, sll1323, ssl2615, sll1322
[The aae and syn rows each list nine genes for eight subunit columns; their column assignment is not recoverable here.]

V-type ATP synthase
Subunits  C, F, A, B, E, K, I, D
bbu       A BB0094, B BB0093, E BB0096, K BB0090, I BB0091, D BB0092
mja       MJ0219, MJ0218, MJ0217, MJ0216, MJ0222, MJ0222, MJ0220, MJ0226
afu       AF1164, AF1165, AF1166, AF1167, AF1163, AF1158, AF1159, AF1159, AF1168
[The column assignment of the mja and afu genes is not recoverable here.]

Reactions and interactions
Note the notion of the Enzyme Commission (EC) number.

Biochemical pathways

Genome diversity
The tree of life showing the relationship of archaea, bacteria, and eukaryotes, as well as the relationship of fungi, plants and animals.
[Figure labels: Bacteria, Archaea, Eukaryotes (Protists, Plants, Fungi, Animals)]