Bio-Chemical databases - Metabolomics Fiehn Lab

Bio-Chemical databases

Guest Lecture

Graduate level course MCB221b - Mechanistic Enzymology

Tobias Kind – November 2007

Database concepts - what is a “good” database (DB)

How is data stored and queried and curated

Enzyme DBs, Protein and peptide DBs, small molecule DBs

This document is hyperlinked (pictures and green text).

To use WWW links in this PPT switch to slide show mode.

1

Databases – very short primer (*)

Database interface – is what you see

Database queries – what you ask the database

Database objects – where the data is stored (index and tables)

Database types – relational databases, object oriented databases, flat file DBs

Database brands – Oracle, MySQL, Apache Derby, IBM DB2, PostgreSQL, MS SQL Server

Database query language – how a database can be programmed (SQL)

Database dump file – the whole database in a single (*.dmp) file

Database Ontology – database vocabulary and used relationships

Database Semantics – capture meaning by grammar or logical analysis

(*) you can study this for several years and get a PhD in computer and database sciences.
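The primer terms above can be tried out in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module (SQLite is not one of the brands listed above and is used here only because it needs no server); the table name `enzyme` and index name `idx_enzyme_name` are made up for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Database objects: one table plus an index on the column we search most.
conn.execute("CREATE TABLE enzyme (ec_number TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_enzyme_name ON enzyme (name)")

# Two illustrative rows (EC numbers from the enzyme nomenclature).
conn.executemany(
    "INSERT INTO enzyme VALUES (?, ?)",
    [("1.1.1.1", "alcohol dehydrogenase"), ("2.7.1.1", "hexokinase")],
)
conn.commit()

# Database query: what you ask the database, written in SQL.
rows = conn.execute(
    "SELECT ec_number FROM enzyme WHERE name = ?", ("hexokinase",)
).fetchall()

# Database dump: the whole database as a sequence of SQL statements.
dump_lines = list(conn.iterdump())

print(rows)
print("\n".join(dump_lines))
```

The dump is plain SQL text, so it can be saved to a single file and replayed to rebuild the database elsewhere.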

2

What is a good database?

As in normal life, it's important to distinguish between good and evil

Source: wikimedia.org

Good DB:

allows multiple input queries

exports in multiple output formats

connects to other DBs

is curated (means checked for errors by humans or machines)

is regularly updated (daily, yearly)

costs money (your money or taxpayers' money) or time

allows bulk download (millions of data sets can be downloaded)

has open interfaces (APIs) for query requests

Bad DB:

allows only single requests (which have to be typed manually)

is not a database but just a list or table

has no link-out and no link-in

allows no bulk download

is not curated

3

Exchange formats – SBML, XML, BioPax

XML format – general purpose data format (CML for storing chemical data)

<?xml version="1.0" ?>

<molecule id="m1">

<atomArray>

<atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264" />

</atomArray>

<bondArray>

</bondArray>

</molecule>

Methane
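Because CML is ordinary XML, any XML parser can read it. The sketch below parses the fragment from the slide with Python's standard library. It assumes the fragment has no XML namespace, as shown here; real CML files often declare one, in which case the tag names would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# The CML fragment from the slide (one carbon atom, no explicit bonds).
CML = """<?xml version="1.0" ?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264" />
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>"""

root = ET.fromstring(CML)

# Collect (atom id, element, 2D x, 2D y) for every <atom> element.
atoms = [
    (a.get("id"), a.get("elementType"), float(a.get("x2")), float(a.get("y2")))
    for a in root.iter("atom")
]

# The bond array is empty: methane is drawn as a single carbon atom
# with implicit hydrogens.
bonds = list(root.iter("bond"))

print(root.get("id"), atoms, len(bonds))
```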

BioPax format – used for representing pathway data (data exchange format)

SBML format – representing models of biochemical reaction networks

SDF format – general purpose chemical structure format (small molecules)

RDF format (MDL RDfile) – format for storing chemical reactions (small molecules)

PDB format – general purpose chemical structure format (proteins)

4

SBML (Systems Biology Markup Language)

Source: Akira Funahashi – Cell Designer Tutorial

List of supported SBML programs (more than 200) from sbml.org

List of curated and published SBML models (around 200) from biomodels DB

5

APIs, Mashups, SQL

• Application programming interfaces (API) are important to connect and automate data exchange between local programs and databases;

Example: NCBI SOAP or PubChem PUG (Power User Gateway) can be used to download certain data via the web to another service or to a local program

• Mashups and integration services use new web technology (Resource Description Framework (RDF), Yahoo Pipes) to combine data sources and create new knowledge or enhance usage

• SQL is used for querying and programming databases

Large database table (nobel):

yr    subject    winner
1901  Chemistry  Jacobus H. van 't Hoff
1902  Chemistry  Emil Fischer
1903  Chemistry  Svante Arrhenius
1904  Chemistry  Sir William Ramsay
1905  Chemistry  Adolf von Baeyer
1906  Chemistry  Henri Moissan
1907  Chemistry  Eduard Buchner
1908  Chemistry  Ernest Rutherford
1909  Chemistry  Wilhelm Ostwald
1910  Chemistry  Otto Wallach
1913  …

SQL query:

SELECT yr, subject, winner
FROM nobel
WHERE yr = 1909 AND subject = 'Chemistry'

Result:

yr    subject    winner
1909  Chemistry  Wilhelm Ostwald

Visit the SQL Zoo
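The Nobel query can also be run locally. This is a minimal sketch that loads the ten complete rows from the slide into an in-memory SQLite database (an assumption; the slide does not say which database engine holds the table). SQLite compares strings case-sensitively by default, so the subject has to be spelled exactly as stored, 'Chemistry'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")

# The complete rows shown on the slide.
laureates = [
    (1901, "Chemistry", "Jacobus H. van 't Hoff"),
    (1902, "Chemistry", "Emil Fischer"),
    (1903, "Chemistry", "Svante Arrhenius"),
    (1904, "Chemistry", "Sir William Ramsay"),
    (1905, "Chemistry", "Adolf von Baeyer"),
    (1906, "Chemistry", "Henri Moissan"),
    (1907, "Chemistry", "Eduard Buchner"),
    (1908, "Chemistry", "Ernest Rutherford"),
    (1909, "Chemistry", "Wilhelm Ostwald"),
    (1910, "Chemistry", "Otto Wallach"),
]
conn.executemany("INSERT INTO nobel VALUES (?, ?, ?)", laureates)

# Placeholders (?) keep the values separate from the SQL text.
result = conn.execute(
    "SELECT yr, subject, winner FROM nobel WHERE yr = ? AND subject = ?",
    (1909, "Chemistry"),
).fetchall()

print(result)
```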

6

Database front-ends (a good one)

Enhanced NCI Database Browser Release 2 (CACTVS DB)

• Small molecule DB with revolutionary web-front-end (2001)

• Multiple input and output (export) methods

• Allows matching of molecule lists against DB (as SMILES, CAS, NCI number)

• Links to other services

• Visualization modes (2D, 3D)

• 20 different molecular output formats (SDF, CML, SMILES)

• Export to other (computational) services

• 30 different query modes

7

Database visualization

Visualize complex networks; uses plug-in-technology from different sources

Map your own compound data (proteins, genes, molecules) onto networks

Perform literature search with enzymes, genes, small molecules

Source: Cytoscape.org

Start Cytoscape via Java Web Start

8

Uber-portals (NCBI Entrez)

9

Database and tools integration

Gaggle

Source: http://gaggle.systemsbiology.org/docs/geese/

Frameworks

• Portals

• Mashups

10

Gaggle

Integration of tools and database services

ListLink

The Gaggle: an open-source software system for integrating bioinformatics software and data sources.

Shannon PT, Reiss DJ, Bonneau R, Baliga NS.

BMC Bioinformatics. 2006 Mar 28;7:176.

Use Gaggle

11

Use or build your own local database

Example: LipidMaps DB with Instant-JChem

• Download the whole LipidMaps DB (10,000 lipids) as an SDF file

• Use Instant-JChem as data DB, molecule DB, reaction DB

• Perform data and molecule queries on your laptop (PC, LINUX, MAC)

(…also works with KEGG/Biometa DB )
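Instant-JChem does the import for you, but the same idea can be sketched by hand: read the data fields of each SDF record and load them into a local table. The sketch below uses only the Python standard library and SQLite rather than Instant-JChem. The two records, the field names `DEMO_ID` and `COMMON_NAME`, and the table name `lipid` are hypothetical placeholders, not real LipidMaps entries, and the structure block of each record is skipped.

```python
import re
import sqlite3

# Two hypothetical SDF records: an (empty) structure block, then data fields.
SAMPLE_SDF = """example lipid A
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <DEMO_ID>
DEMO0001

> <COMMON_NAME>
example lipid A

$$$$
example lipid B
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <DEMO_ID>
DEMO0002

> <COMMON_NAME>
example lipid B

$$$$
"""

FIELD_HEADER = re.compile(r">\s*<([^>]+)>")


def read_record(lines):
    """Return the title line and all '> <FIELD>' data items of one record."""
    record = {"title": lines[0].strip()}
    field = None
    for line in lines:
        header = FIELD_HEADER.match(line)
        if header:
            field = header.group(1)
            record[field] = ""
        elif field is not None:
            if line.strip():
                record[field] = (record[field] + "\n" + line).strip()
            else:
                field = None  # a blank line ends the data item
    return record


def parse_sdf(text):
    """Split SDF text on the '$$$$' record separator."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(read_record(current))
            current = []
        else:
            current.append(line)
    return records


records = parse_sdf(SAMPLE_SDF)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lipid (demo_id TEXT PRIMARY KEY, common_name TEXT)")
conn.executemany(
    "INSERT INTO lipid VALUES (?, ?)",
    [(r["DEMO_ID"], r["COMMON_NAME"]) for r in records],
)

hits = conn.execute(
    "SELECT demo_id FROM lipid WHERE common_name LIKE ?", ("%lipid B%",)
).fetchall()
print(hits)
```

Structure searches (substructure, similarity) need a chemistry toolkit on top of this; the sketch covers only the data-query half.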

12

Welcome to the (database) jungle!

ChemBioGrid – collection of most chemistry databases (current number ~156)

Pathguide.org – collection of pathway, enzyme, metabolite DBs (current number ~231)

Chemistry related (big players):

PubChem , CAS (subscription), Beilstein (subscription), Chemspider (fast growing)

Important for chemistry/metabolomics:

Spectral databases (NMR, mass spectral databases), compound property DBs

Pathway, Enzyme related:

KEGG, Brenda, Reactome, Expasy, MetaCyc

13

Pathguide.org

Pathguide is a meta-database:

Comprehensive collection of pathway, small molecule, enzyme, protein interaction databases

14

Enzyme and kinetics related databases

KDBI - Kinetic Data of Bio-molecular Interactions database http://bidd.nus.edu.sg/group/kdbi/

SABIO-RK - SABIO-Reaction Kinetics Database http://sabio.villa-bosch.de/SABIORK/

BRENDA - Comprehensive Enzyme Information System http://www.brenda.uni-koeln.de/

EMP - Enzymes and Metabolic Pathways Database http://www.empproject.com/

ENZYME - Enzyme nomenclature database (EXPASY) http://www.expasy.ch/enzyme/

IntEnz - Integrated relational Enzyme database http://www.ebi.ac.uk/intenz/index.html

TECR - Thermodynamics of Enzyme-Catalyzed Reaction http://xpdb.nist.gov/enzyme_thermodynamics/

REBASE - Restriction Enzyme Database http://rebase.neb.com/

Precise - Predicted and Consensus Interaction Sites in Enzymes http://precise.bu.edu/

Source: Pathguide; Own search

15

PubChem

Most important small molecule DB

There was no large open chemistry DB until 10 years ago (!)

All records can be downloaded via FTP

All other small-molecule DBs link to PubChem

PubChem Compounds

(true chemicals)

PubChem Substances

(formulations, mixtures)

substructure search and multiple other options

Go to PubChem
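An open interface means a program can ask PubChem the same questions a person asks through the web page. The sketch below builds a request URL for PubChem's PUG REST interface, which is a newer interface than the PUG and SOAP services mentioned in this lecture. The helper names `property_url` and `fetch_properties` are made up for illustration, and the request itself is only sent if you call `fetch_properties`, which needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties):
    """Build a PUG REST URL that looks up a compound by name."""
    return "{}/compound/name/{}/property/{}/JSON".format(
        PUG_REST, quote(name, safe=""), ",".join(properties)
    )


def fetch_properties(name, properties):
    """Send the request and return the decoded JSON (needs network access)."""
    with urlopen(property_url(name, properties), timeout=30) as response:
        return json.load(response)


url = property_url("adenosine triphosphate", ["MolecularFormula", "TPSA"])
print(url)
```

Building the URL separately from sending it makes the query easy to inspect, log, or paste into a browser before automating it.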

16

CAS SciFinder

33 million molecules and 60 million peptides/proteins

Largest reaction DB (14 million reactions) and literature DB

A must for chemists and biochemists/biologists

no bulk download, no good Import/ Export, no Linkouts

only proprietary Windows interface (no plugins)

no text mining (requires ANAVIST)

Download Scifinder

17

BRENDA - Comprehensive Enzyme Information System

18

Brenda 3D model output with JMOL

Example: Brenda connection to RCSB Protein Data Bank

Visit Brenda

19

KEGG – Pathway DB

KEGG ID: C00002 (ATP)

KEGG pathway map ID: map00195 (Photosynthesis)

KEGG reaction ID: R05668 (ATP + NAD reaction)

Visit KEGG

20

Reactome – curated pathway maps

Example: Skypainter, map your given KEGG IDs to pathways

Visit Reactome

21

Outlook for the database lesson

Curation, Curation, Curation (costs money)

Inhale the good DB / bad DB scheme and apply it whenever you enter a DB portal

Learn some basic database programming (Ruby on Rails, Java, SQL); using bioinformatics and chemoinformatics approaches is crucial for research

Learn how to import and store and handle database search results on your local computer (simple: parse important data with regular expressions)

Don’t be overwhelmed by the database jungle; take some time to play around

Finally, automation and clever use of DB tools will advance your research

The multiple unique identifier problem (KEGG ID, PubChem ID, CAS number) and the biological naming problem still exist

The systems biology and chemistry database world is still different in terms of re-use. Most of the chemistry data published (including molecules) is not machine readable, hence can’t be automatically harvested by software robots.
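The regular-expression advice and the multiple-identifier problem above can be combined in one small sketch: pull KEGG IDs and CAS numbers out of free text, and use the CAS check digit to reject strings that only look like CAS numbers. The KEGG patterns are an assumption based on the ID shapes shown in this lecture (C, R or map followed by five digits); the sample sentence is made up from the IDs on the KEGG slide.

```python
import re

# KEGG compound (C), reaction (R) and pathway map IDs: prefix + five digits.
KEGG_ID = re.compile(r"\b(?:[CR]\d{5}|map\d{5})\b")

# CAS Registry Number: 2-7 digits, 2 digits, 1 check digit.
CAS_RN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def cas_is_valid(cas):
    """Check the CAS check digit.

    Counting from the right (check digit excluded), each digit is multiplied
    by its position; the sum modulo 10 must equal the check digit.
    """
    match = CAS_RN.fullmatch(cas)
    if not match:
        return False
    digits = (match.group(1) + match.group(2))[::-1]
    total = sum(int(d) * pos for pos, d in enumerate(digits, start=1))
    return total % 10 == int(match.group(3))


text = (
    "ATP (KEGG C00002, CAS 56-65-5) takes part in reaction R05668 "
    "and appears on pathway map00195."
)

kegg_ids = KEGG_ID.findall(text)
cas_numbers = [
    m.group(0) for m in CAS_RN.finditer(text) if cas_is_valid(m.group(0))
]

print(kegg_ids)
print(cas_numbers)
```

A check like this catches typing errors in CAS numbers, but it cannot tell you whether two different identifiers refer to the same molecule; that mapping still has to come from a database.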

22

Reading List databases

The Gaggle: An open-source software system for integrating bioinformatics software and data sources

Correcting ligands, metabolites, and pathways

Large-Scale Annotation of Small-Molecule Libraries Using Public Databases

23

Homework for homework discussion III (30 min)

1) Find three bad or evil databases in the biochemistry/chemistry world; please give a reason for each in a short sentence.

2) Find the year in which most papers about “enzyme kinetics” were published using SciFinder (use Explore, enter the search term, then Analyze year)

3) Find the molecules which were analyzed most in papers regarding “enzyme kinetics” and “crickets” using SciFinder (use Explore, then Analyze CAS Number)

4) Find the price for 1 g ATP from Pfaltz & Bauer (in SciFinder use locate substance, then use the Erlenmeyer icon for price info)

5) Go to Brenda and find out how many coronavirus types are in the DB (use TaxExplorer and query)

6) Go to Brenda and find out how many enzymes are listed as resistant against perchloric acid; report the publication title (go to Brenda, Advanced search)

7) Go to KEGG Ligand DB and find the KEGG numbers for D-Hexose and ATP

8) Go to KEGG Reaction Prediction (e-zyme): how many similar reactions occur between D-Hexose and ATP? (Enter the above KEGG IDs, press view structures, press compute)

9) Go to PubChem: what is the PubChem compound ID (CID) and the topological polar surface area (TPSA) for Tobias acid?

24

Pathways and enzymes http://www.biocarta.com/pathfiles/h_etcPathway.asp#

SQL learning http://sqlzoo.net/

Databases http://www.google.com/search?hl=en&q=enzyme+kinetics+database&btnG=Google+Search

SQL biologists

I’m a biologist Jim, not a programmer

SQL biologists

SciView part 5: interview with Alexei Drummond

Thank you!

Thanks to all Wikimedia.org contributors for pictures!

Thanks to Dinesh Kumar (FiehnLab) for discussions.

25
