Guest Lecture
Graduate level course MCB221b - Mechanistic Enzymology
Tobias Kind – November 2007
• Database concepts - what is a “good” database (DB)
• How data is stored, queried, and curated
• Enzyme DBs, Protein and peptide DBs, small molecule DBs
This document is hyperlinked (pictures and green text).
To use WWW links in this PPT switch to slide show mode.
1
(*)
Database interface – what you see
Database queries – what you ask the database
Database objects – where the data is stored (indexes and tables)
Database types – relational databases, object-oriented databases, flat-file DBs
Database brands – Oracle, MySQL, Apache Derby, IBM DB2, PostgreSQL, MS SQL
Database query language – how a database can be programmed (SQL)
Database dump file – the whole database in a single (*.dmp) file
Database ontology – the database vocabulary and the relationships it uses
Database semantics – capturing meaning through grammar or logical analysis
(*) you can study this for several years and get a PhD in computer and database sciences.
2
Source: wikimedia.org
Good DB:
• allows multiple input queries
• exports in multiple output formats
• connects to other DBs
• is curated (checked for errors by humans or machines)
• is regularly updated (anywhere from daily to yearly)
• costs money (your money or taxpayers' money) or time
• allows bulk download (millions of data sets can be downloaded)
• has open interfaces (APIs) for query requests
Bad DB:
• allows only single requests (which have to be typed manually)
• is not a database but just a list or table
• has no link-out and no link-in
• allows no bulk download
• is not curated
• …
3
XML format – general-purpose data format (CML for storing chemical data)
Example (methane):
<?xml version="1.0" ?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264" />
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>
BioPax format – used for representing pathway data (data exchange format)
SBML format – representing models of biochemical reaction networks
SDF format – general purpose chemical structure format (small molecules)
RDF format (MDL RDfile) – format for storing chemical reactions (small molecules)
PDB format – general purpose chemical structure format (proteins)
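The CML snippet shown for methane can be read with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a namespace-free document like the one on this slide. The function name read_atoms is hypothetical, and real CML files usually declare an XML namespace that would have to be added to the tag names.

```python
import xml.etree.ElementTree as ET

# The CML example from the slide (methane, one carbon atom, implicit hydrogens).
CML = """<?xml version="1.0" ?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264" />
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>"""


def read_atoms(cml_text):
    """Return one dict per <atom> element with its id, element and 2D coordinates."""
    root = ET.fromstring(cml_text)
    atoms = []
    for atom in root.iter("atom"):
        atoms.append({
            "id": atom.get("id"),
            "element": atom.get("elementType"),
            "x": float(atom.get("x2")),
            "y": float(atom.get("y2")),
        })
    return atoms


if __name__ == "__main__":
    for entry in read_atoms(CML):
        print(entry)
```

Because the format is plain XML, the same few lines work for any file that follows this layout, which is what makes such formats easy to exchange between programs.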
4
Source: Akira Funahashi – Cell Designer Tutorial
• List of supported SBML programs (more than 200) from sbml.org
• List of curated and published SBML models (around 200) from biomodels DB
5
• Application programming interfaces (APIs) are important for connecting local programs to databases and automating the data exchange between them.
Example: NCBI SOAP or PubChem PUG (Power User Gateway) can be used to download selected data via the web to another service or to a local program
• Mashups and integration services use new web technology (RDF, Yahoo Pipes) to combine data sources and create new knowledge or enhance usage
• SQL is used for programming databases
Large database table (nobel):
yr    subject    winner
1901  Chemistry  Jacobus H. van 't Hoff
1902  Chemistry  Emil Fischer
1903  Chemistry  Svante Arrhenius
1904  Chemistry  Sir William Ramsay
1905  Chemistry  Adolf von Baeyer
1906  Chemistry  Henri Moissan
1907  Chemistry  Eduard Buchner
1908  Chemistry  Ernest Rutherford
1909  Chemistry  Wilhelm Ostwald
1910  Chemistry  Otto Wallach
1913  …
SQL query:
SELECT yr, subject, winner
FROM nobel
WHERE yr = 1909 AND subject = 'Chemistry'
Result:
yr    subject    winner
1909  Chemistry  Wilhelm Ostwald
Visit the SQL Zoo
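The query on this slide can be tried locally without installing a database server. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory table filled with the rows shown on the slide. The function name query_nobel is hypothetical, and SQLite is a stand-in here, not the engine used by SQL Zoo.

```python
import sqlite3

# Rows copied from the slide's "nobel" table.
ROWS = [
    (1901, "Chemistry", "Jacobus H. van 't Hoff"),
    (1902, "Chemistry", "Emil Fischer"),
    (1903, "Chemistry", "Svante Arrhenius"),
    (1904, "Chemistry", "Sir William Ramsay"),
    (1905, "Chemistry", "Adolf von Baeyer"),
    (1906, "Chemistry", "Henri Moissan"),
    (1907, "Chemistry", "Eduard Buchner"),
    (1908, "Chemistry", "Ernest Rutherford"),
    (1909, "Chemistry", "Wilhelm Ostwald"),
    (1910, "Chemistry", "Otto Wallach"),
]


def query_nobel(year, subject):
    """Build the table in memory and return all rows matching year and subject."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")
        conn.executemany("INSERT INTO nobel VALUES (?, ?, ?)", ROWS)
        cursor = conn.execute(
            "SELECT yr, subject, winner FROM nobel WHERE yr = ? AND subject = ?",
            (year, subject),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(query_nobel(1909, "Chemistry"))
```

Note that SQLite compares text case-sensitively by default, so the subject has to be spelled exactly as stored ('Chemistry'); other database engines may be more lenient.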
6
Enhanced NCI Database Browser Release 2 (CACTVS DB)
• Small molecule DB with revolutionary web-front-end (2001)
• Multiple input and output (export) methods
• Allows matching of molecule lists against DB (as SMILES, CAS, NCI number)
• Links to other services
• Visualization modes (2D, 3D)
• 20 different molecular output formats (SDF, CML, SMILES)
• Export to other (computational) services
• 30 different query modes
7
• Visualize complex networks; uses plug-in technology from different sources
• Map your own compound data (proteins, genes, molecules) onto networks
• Perform literature search with enzymes, genes, small molecules
Source: Cytoscape.org
Start Cytoscape via Java Web Start
8
9
Source: http://gaggle.systemsbiology.org/docs/geese/
• Frameworks
• Portals
• Mashups
10
Integration of tools and database services
ListLink
The Gaggle: an open-source software system for integrating bioinformatics software and data sources.
Shannon PT, Reiss DJ, Bonneau R, Baliga NS.
BMC Bioinformatics. 2006 Mar 28;7:176.
Use Gaggle
11
Example: LipidMaps DB with Instant-JChem
• Download the whole LipidMaps DB (10,000 lipids) as SDF file [ LINK ]
• Use Instant-JChem as data DB, molecule DB, reaction DB [ LINK ]
• Perform data and molecule queries on your laptop (PC, LINUX, MAC)
(…also works with KEGG/Biometa DB )
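Once a database has been downloaded as an SDF file, simple queries do not strictly need a GUI tool. The following is a minimal sketch in plain Python, assuming only the general SDF layout: records end with a "$$$$" line and data items are written as "> <NAME>" followed by value lines and a blank line. The sample records, the field names EXAMPLE_ID and EXAMPLE_CLASS, and the function names are made up for illustration and are not actual LipidMaps fields.

```python
# Two hypothetical, structure-free records; real files carry atom and bond blocks.
SAMPLE_SDF = """record-1
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <EXAMPLE_ID>
X0001

> <EXAMPLE_CLASS>
class A

$$$$
record-2
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <EXAMPLE_ID>
X0002

> <EXAMPLE_CLASS>
class B

$$$$
"""


def _parse_record(lines):
    """Turn the lines of one SDF record into a title plus a dict of data items."""
    record = {"title": lines[0].strip() if lines else "", "fields": {}}
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line:
            name = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            values = []
            # Value lines run until the next blank line.
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            record["fields"][name] = "\n".join(values)
        else:
            i += 1
    return record


def parse_sdf(text):
    """Split SDF text at the '$$$$' delimiter lines and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    return records


def find_records(records, field, value):
    """Return the records whose data item `field` equals `value`."""
    return [r for r in records if r["fields"].get(field) == value]


if __name__ == "__main__":
    hits = find_records(parse_sdf(SAMPLE_SDF), "EXAMPLE_CLASS", "class B")
    print([r["fields"]["EXAMPLE_ID"] for r in hits])
```

This only covers data-field queries; structure and substructure searches need a chemistry toolkit or a tool such as Instant-JChem.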
12
ChemBioGrid – collection of most chemistry databases (current number: ~156)
Pathguide.org – collection of pathway, enzyme, metabolite DBs (current number: ~231)
Chemistry related (big players):
PubChem, CAS (subscription), Beilstein (subscription), ChemSpider (fast growing)
Important for chemistry/metabolomics:
Spectral databases (NMR, mass spectral databases), compound property DBs
Pathway, enzyme related:
KEGG, BRENDA, Reactome, ExPASy, MetaCyc
13
Pathguide is a meta-database:
Comprehensive collection of pathway, small molecule, enzyme, protein interaction databases
14
KDBI - Kinetic Data of Bio-molecular Interactions database http://bidd.nus.edu.sg/group/kdbi/
SABIO-RK - SABIO-Reaction Kinetics Database http://sabio.villa-bosch.de/SABIORK/
BRENDA - Comprehensive Enzyme Information System http://www.brenda.uni-koeln.de/
EMP - Enzymes and Metabolic Pathways Database http://www.empproject.com/
ENZYME - Enzyme nomenclature database (EXPASY) http://www.expasy.ch/enzyme/
IntEnz - Integrated relational Enzyme database http://www.ebi.ac.uk/intenz/index.html
TECR - Thermodynamics of Enzyme-Catalyzed Reactions http://xpdb.nist.gov/enzyme_thermodynamics/
REBASE - Restriction Enzyme Database http://rebase.neb.com/
Precise - Predicted and Consensus Interaction Sites in Enzymes http://precise.bu.edu/
Source: Pathguide; Own search
15
• Most important small molecule DB
• There was no large open chemistry DB until 10 years ago (!)
• All records can be downloaded via FTP
• All other small-molecule DBs link to PubChem
• PubChem Compounds (true chemicals)
• PubChem Substances (formulations, mixtures)
• Substructure search and multiple other options
Goto PubChem
16
• 33 million molecules and 60 million peptides/proteins
• Largest reaction DB (14 million reactions) and literature DB
• A must for chemists and biochemists/biologists
• No bulk download, no good import/export, no link-outs
• Only a proprietary Windows interface (no plug-ins)
• No text mining (requires ANAVIST)
Download SciFinder
17
BRENDA - Comprehensive Enzyme Information System
18
Example: BRENDA connection to the RCSB Protein Data Bank
Visit BRENDA
19
KEGG ID: C00002 (ATP)
KEGG pathway map ID: map00195 (Photosynthesis)
KEGG reaction ID: R05668 (ATP + NAD reaction)
Visit KEGG
20
Example: SkyPainter maps your KEGG IDs onto pathways
Visit Reactome
21
• Curation, curation, curation (costs money)
• Internalize the good DB / bad DB scheme and apply it whenever you enter a DB portal
• Learn some basic database programming (Ruby on Rails, Java, SQL); using bioinformatics and chemoinformatics approaches is crucial for research
• Learn how to import, store, and handle database search results on your local computer (simple: parse the important data with regular expressions)
• Don’t be overwhelmed by the database jungle; take some time to play around.
In the end, automation and clever use of DB tools will drive innovation in your research
• The problem of multiple unique identifiers (KEGG ID, PubChem ID, CAS number) and the biological naming problem still exist
• The systems biology and chemistry database worlds still differ in terms of re-use. Most published chemistry data (including molecules) is not machine readable and hence can’t be harvested automatically by software robots.
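Parsing search results with regular expressions, as suggested above, can be sketched in a few lines. The following is a minimal example, assuming Python's standard re module and using the three KEGG identifiers quoted earlier in this lecture as input text. The pattern set and the function name extract_ids are illustrative assumptions based only on the shape of those IDs, not an official KEGG grammar.

```python
import re

# Text as it might be copied from a search result page (IDs from the KEGG slide).
RESULT_TEXT = """KEGG ID: C00002 (ATP)
KEGG pathway map ID: map00195 (Photosynthesis)
KEGG reaction ID: R05668 (ATP + NAD reaction)"""

# Assumed identifier shapes: a prefix followed by five digits.
PATTERNS = {
    "compound": re.compile(r"\bC\d{5}\b"),
    "pathway": re.compile(r"\bmap\d{5}\b"),
    "reaction": re.compile(r"\bR\d{5}\b"),
}


def extract_ids(text):
    """Return a dict mapping each identifier type to the IDs found in the text."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}


if __name__ == "__main__":
    for kind, ids in extract_ids(RESULT_TEXT).items():
        print(kind, ids)
```

Collecting identifiers this way is also a first step toward the identifier problem: once IDs are extracted, they can be stored locally next to the corresponding PubChem or CAS identifiers.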
22
The Gaggle: An open-source software system for integrating bioinformatics software and data sources
Correcting ligands, metabolites, and pathways
Large-Scale Annotation of Small-Molecule Libraries Using Public Databases
23
1) Find three bad or evil databases in the biochemistry/chemistry world; please give a reason for each in a short sentence.
2) Find the year in which most papers about “enzyme kinetics” were published, using SciFinder (use Explore, enter the search term, then Analyze year)
3) Find the molecules that were analyzed most in papers regarding “enzyme kinetics” and “crickets” using SciFinder (use Explore, then Analyze CAS Number)
4) Find the price for 1 g ATP from Pfaltz & Bauer (in SciFinder use Locate Substance, then use the Erlenmeyer icon for price info)
5) Go to Brenda and find out how many coronavirus types are in the DB (use TaxExplorer and query)
6) Go to Brenda and find out how many enzymes are listed as resistant against perchloric acid; report the publication title (go to Brenda, Advanced search)
7) Go to the KEGG Ligand DB and find the KEGG numbers for D-Hexose and ATP
8) Go to KEGG Reaction Prediction (e-zyme): how many similar reactions occur between D-Hexose and ATP? (Enter the above KEGG IDs, press view structures, then press compute)
9) Go to PubChem: what are the PubChem compound ID (CID) and the topological polar surface area for Tobias acid?
24
Pathways and enzymes http://www.biocarta.com/pathfiles/h_etcPathway.asp#
SQL learning http://sqlzoo.net/
Databases http://www.google.com/search?hl=en&q=enzyme+kinetics+database&btnG=Google+Search
SQL biologists
I’m a biologist Jim, not a programmer
SQL biologists
SciView part 5: interview with Alexei Drummond
25