Overview of Genome Databases
Peter D. Karp, Ph.D.
SRI International
pkarp@ai.sri.com
www-db.stanford.edu/dbseminar/seminar.html
Talk Overview
 Definition of bioinformatics
 Motivations for genome databases
 Issues in building genome databases
Definition of Bioinformatics
 Computational techniques for management and analysis of biological data and knowledge
   Methods for disseminating, archiving, interpreting, and mining scientific information
 Computational theories of biology
 Genome Databases is a subfield of bioinformatics
Motivations for Bioinformatics
Growth in molecular-biology knowledge (literature)
Genomics
  1. Study of genomes through DNA sequencing
  2. Industrial Biology
Example Genomics Datatypes
 Genome sequences
   DOE Joint Genome Institute
     511M bases in Dec 2001
     11.97G bases since Mar 1999
 Gene and protein expression data
 Protein-protein interaction data
 Protein 3-D structures
Genome Databases
Experimental data
   Archive experimental datasets
   Retrieving past experimental results should be faster than repeating the experiment
   Capture alternative analyses
   Lots of data, simpler semantics
Computational symbolic theories
   Complex theories become too large to be grasped by a single mind
   The database is the theory
   Biology is very much concerned with qualitative relationships
   Less data, more complex semantics
Bioinformatics
Distinct intellectual field at the intersection of CS and molecular biology
Distinct field because researchers in the field must know CS, biology, and bioinformatics
Spectrum from CS research to biology service
Rich source of challenging CS problems
Large, noisy, complex data-sets and knowledge-sets
Biologists and funding agencies demand working solutions
Bioinformatics Research
 algorithms + data structures = programs
 algorithms + databases = discoveries
 Combine sophisticated algorithms with the right content:
   Properly structured
   Carefully curated
   Relevant data fields
   Proper amount of data
Reference on Major Genome Databases
 Nucleic Acids Research Database Issue
   http://nar.oupjournals.org/content/vol30/issue1/
   112 databases
Questions to Ask of a New Genome Database
What are Database Goals and Requirements?
 What problems will database be used to solve?
 Who are the users and what is their expertise?
What is its Organizing Principle?
 Different DBs partition the space of genome information in different dimensions
   Experimental methods (Genbank, PDB)
   Organism (EcoCyc, Flybase)
What is its Level of Interpretation?
 Laboratory data
 Primary literature (Genbank)
 Review (SwissProt, MetaCyc)
 Does DB model disagreement?
What are its Semantics and Content?
 What entities and relationships does it model?
 How does its content overlap with similar DBs?
 How many entities of each type are present?
 Sparseness of attributes and statistics on attribute values
What are Sources of its Data?
 Potential information sources
   Laboratory instruments
   Scientific literature
     Manual entry
     Natural-language text mining
   Direct submission from the scientific community
     Genbank
 Modification policy
   DB staff only
   Submission of new entries by scientific community
   Update access by scientific community
What DBMS is Employed?
 None
 Relational
 Object oriented
 Frame knowledge representation system
Distribution / User Access
 Multiple distribution forms enhance access
   Browsing access with visualization tools
   API
   Portability
What Validation Approaches are Employed?
 None
 Declarative consistency constraints
 Programmatic consistency checking
 Internal vs external consistency checking
 What types of systematic errors might DB contain?
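The difference between declarative consistency constraints and programmatic consistency checking can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module; the gene and protein tables, their columns, and the length threshold are hypothetical examples chosen for illustration, not the schema of any database named in this talk.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database with declarative constraints.

    The tables and columns are hypothetical, for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("""
        CREATE TABLE gene (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            start_pos INTEGER NOT NULL CHECK (start_pos >= 1),
            end_pos   INTEGER NOT NULL CHECK (end_pos >= 1)
        )""")
    conn.execute("""
        CREATE TABLE protein (
            id      INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            gene_id INTEGER NOT NULL REFERENCES gene(id)
        )""")
    return conn

def try_insert(conn, sql, params):
    """Declarative checking: the DBMS itself accepts or rejects the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

def genes_with_suspicious_length(conn, max_len):
    """Programmatic checking: a query written by the curators flags
    rows that are legal for the schema but biologically doubtful."""
    rows = conn.execute(
        "SELECT name FROM gene "
        "WHERE abs(end_pos - start_pos) + 1 > ? ORDER BY name",
        (max_len,))
    return [r[0] for r in rows]
```

In this sketch the constraints live in the schema and are applied to every insert, while the programmatic check is run on demand and encodes a rule that is harder to state as a constraint.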
Database Documentation
 Schema and its semantics
 Format
 API
 Data acquisition techniques
 Validation techniques
 Size of different classes
 Coverage of subject matter
 Sparseness of attributes
 Error rates
 Update frequency
Relationship of Database Field to Bioinformatics
 Scientists generally unaware of basic DB principles
   Complex queries vs click-at-a-time access
   Data model
   Defined semantics for DB fields
   Controlled vocabularies
   Regular syntax for flatfiles
   Automated consistency checking
 Most biologists take one programming class
 Evolution of typical genome database
 Finer points of DB research off their radar screen
 Handful of DB researchers work in bioinformatics
Database Field
 For many years, the majority of bioinformatics DBs did not employ a DBMS
   Flatfiles were the rule
   Scientists want to see the data directly
   Commercial DBMSs too expensive, too complex
   DBAs too expensive
 Most scientists do not understand
   Differences between BA, MS, PhD in CS
   CS research vs applications
   Implications for project planning, funding, bioinformatics research
Recommendation
 Teaching scientists programming is not enough
 Teaching scientists how to build a DBMS is irrelevant
 Teach scientists basic aspects of databases and symbolic computing
   Database requirements analysis
   Data models, schema design
   Knowledge representation, ontologies
   Formal grammars
   Complex queries
   Database interoperability
BioSPICE Bioinformatics Database Warehouse
Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal Sonmez
SRI International
http://www.BioSPICE.org/
Project Goal
 Create a toolkit for constructing bioinformatics database warehouses that collect together a set of bioinformatics databases into one physical DBMS
Motivations
Important bioinformatics problems require access to multiple bioinformatics databases
Hundreds of bioinformatics databases exist
   Nucleic Acids Research 30(1) 2002 (DB issue)
   Nucleic Acids Research DB list: 350 DBs at http://www3.oup.co.uk/nar/database/a/
Different problems require different sets of databases
Motivations
 Combining multiple databases allows for data verification and complementation
 Simulation problems require access to data on pathways, enzymes, reactions, genetic regulation
Why is the Multidatabase Approach Not Sufficient?
Multidatabase query approaches assume databases are in a DBMS
Internet bandwidth limits query throughput
Most sites that do operate DBMSs do not allow remote SQL access because of security and loading concerns
Control data stability
Need to capture, integrate and publish locally produced data of different types
Multidatabase and Warehouse approaches complementary
Scenario 1
 BioSPICE scientist wants to model multiple metabolic pathways in a given organism
   Enumerate pathways and reactions
   What enzymes catalyze each reaction?
   What genes code for each enzyme?
   What control regions regulate each gene?
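Once the relevant databases sit in one relational store, the chain of questions in this scenario becomes a single join. The following is a minimal sketch using Python's built-in sqlite3 module. Every table name, column name, and data row here is a hypothetical placeholder invented for illustration; none of it is the actual warehouse schema or real biological data.

```python
import sqlite3

# Hypothetical, heavily simplified schema (an assumption of this sketch).
SCHEMA = """
CREATE TABLE pathway  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reaction (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene     (id INTEGER PRIMARY KEY, name TEXT, control_region TEXT);
CREATE TABLE enzyme   (id INTEGER PRIMARY KEY, name TEXT,
                       gene_id INTEGER REFERENCES gene(id));
CREATE TABLE pathway_reaction (pathway_id INTEGER, reaction_id INTEGER);
CREATE TABLE catalysis        (enzyme_id INTEGER, reaction_id INTEGER);
"""

def make_demo_warehouse():
    """Build an in-memory database filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO pathway VALUES (1, 'pathway-1')")
    conn.executemany("INSERT INTO reaction VALUES (?, ?)",
                     [(10, "reaction-A"), (11, "reaction-B")])
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                     [(100, "gene-x", "promoter-x"),
                      (101, "gene-y", "promoter-y")])
    conn.executemany("INSERT INTO enzyme VALUES (?, ?, ?)",
                     [(1000, "enzyme-X", 100), (1001, "enzyme-Y", 101)])
    conn.executemany("INSERT INTO pathway_reaction VALUES (?, ?)",
                     [(1, 10), (1, 11)])
    conn.executemany("INSERT INTO catalysis VALUES (?, ?)",
                     [(1000, 10), (1001, 11)])
    conn.commit()
    return conn

def trace_pathway(conn, pathway_name):
    """Return (reaction, enzyme, gene, control region) rows for one pathway."""
    sql = """
        SELECT r.name, e.name, g.name, g.control_region
        FROM pathway p
        JOIN pathway_reaction pr ON pr.pathway_id = p.id
        JOIN reaction r          ON r.id = pr.reaction_id
        JOIN catalysis c         ON c.reaction_id = r.id
        JOIN enzyme e            ON e.id = c.enzyme_id
        JOIN gene g              ON g.id = e.gene_id
        WHERE p.name = ?
        ORDER BY r.name, e.name
    """
    return conn.execute(sql, (pathway_name,)).fetchall()
```

Calling `trace_pathway(make_demo_warehouse(), "pathway-1")` walks pathway to reaction to enzyme to gene in one query, which is the kind of access that is hard to get when each database lives at a different site.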
Approach
Oracle and MySQL implementations
Warehouse schema defines many bioinformatics datatypes
Create loaders for public bioinformatics DBs
   Parse file format for the DB
   Semantic transformations
   Insert database into warehouse tables
Warehouse query access mechanisms
   SQL queries via Perl, ODBC, OAA
Example: Swiss-Prot DB
Version 40.0 describes 101K proteins in a 320MB file
Each protein described as one block of records (an entry) in a large text file
Loader tool parses file one entry at a time
Creates new entries in a set of warehouse tables
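Parsing one entry at a time keeps memory use flat even for a file of several hundred megabytes. The sketch below shows the idea in Python. It relies on two properties of the Swiss-Prot flat-file format: each line starts with a two-letter line code, and each entry ends with a `//` line. The function names, the summary fields, and the second sample entry (`DEMO_ENTRY`) are made up for this illustration; this is not the project's actual loader.

```python
import io

# Two short sample entries; the second one is an invented placeholder.
SAMPLE = (
    "ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.\n"
    "AC   P23599;\n"
    "GN   ACS1 OR ACCW.\n"
    "//\n"
    "ID   DEMO_ENTRY     STANDARD;      PRT;   10 AA.\n"
    "AC   X00000; X00001;\n"
    "DE   MADE-UP PLACEHOLDER PROTEIN.\n"
    "//\n"
)

def iter_entries(handle):
    """Yield one entry at a time as a list of (line_code, text) pairs."""
    entry = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("//"):        # entry terminator
            if entry:
                yield entry
            entry = []
        elif line.strip():
            # Columns 1-2 hold the line code; the data starts at column 6.
            entry.append((line[:2], line[5:].strip()))
    if entry:                            # tolerate a missing final terminator
        yield entry

def entry_summary(entry):
    """Collect a few fields from one parsed entry."""
    fields = {}
    for code, text in entry:
        fields.setdefault(code, []).append(text)
    accessions = " ".join(fields.get("AC", [])).split(";")
    return {
        "id": fields["ID"][0].split()[0],
        "accessions": [a.strip() for a in accessions if a.strip()],
        "description": " ".join(fields.get("DE", [])),
    }
```

Because `iter_entries` is a generator, a caller can write `for entry in iter_entries(open(path)):` and hand each entry to the insertion step before the next one is read.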
Warehouse Schema
Manages many bioinformatics datatypes simultaneously
   Pathways, Reactions, Chemicals
   Proteins, Genes, Replicons
   Citations, Organisms
   Links to external databases
Each type of warehouse object implemented through one or more relational tables (currently 43)
Warehouse Schema

Databases on our wish list:
 Genbank (nucleotide sequences)
 Protein expression database
 Protein-protein interactions database
 Gene expression database
 NCBI Taxonomy database
 Gene Ontology
 CMR
Warehouse Schema
Manages multiple datasets simultaneously
   Dataset = Single version of a database
Support alternative measurements and viewpoints
Version comparison
Multiple software tools or experiments that require access to different versions
Each dataset is a warehouse entity
Every warehouse object is registered in a dataset
Warehouse Schema
Different databases storing the same biological types are coerced into same warehouse tables
Design of most datatypes inspired by multiple databases
Representational tricks to decrease schema bloat
   Single space of primary keys
   Single set of satellite tables such as for synonyms, citations, comments, etc.
Warehouse Schema
 Examples
   Protein data from Swiss-Prot, TrEMBL, KEGG, and EcoCyc all loaded into same relational tables
   Pathway data from MetaCyc and KEGG are loaded into the same relational tables
Example: Swiss-Prot DB
ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE   SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.
How Swiss-Prot is Loaded into The Warehouse
 Register Swiss-Prot in Datasets table
 Create entry in Entry and Protein tables for each Swiss-Prot protein
 Satellite tables store
   Protein synonyms, citations, comments, accession numbers, organism, sequence features, subunits/complexes, DB links
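The loading steps above can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions rather than the project's loader: the table layouts are heavily simplified, and names such as `SynonymTable`, `OtherWID`, `Version`, and the `Loader` class are hypothetical. The one design point it tries to show is that a single counter issues every warehouse ID, so datasets, entries, and satellite rows share one key space.

```python
import itertools
import sqlite3

# Simplified, hypothetical table layouts (assumptions of this sketch).
SCHEMA = """
CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
CREATE TABLE Entry   (OtherWID INTEGER PRIMARY KEY,
                      DataSetWID INTEGER REFERENCES DataSet(WID));
CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT,
                      DataSetWID INTEGER REFERENCES DataSet(WID));
CREATE TABLE SynonymTable (OtherWID INTEGER, Syn TEXT);
"""

class Loader:
    """Toy loader that registers a dataset and then loads proteins into it."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(SCHEMA)
        self._wids = itertools.count(1)  # one key space shared by all tables

    def register_dataset(self, name, version):
        """Step 1: record the database version being loaded."""
        wid = next(self._wids)
        with self.conn:
            self.conn.execute("INSERT INTO DataSet VALUES (?, ?, ?)",
                              (wid, name, version))
        return wid

    def load_protein(self, dataset_wid, name, synonyms=()):
        """Steps 2 and 3: core rows first, then satellite rows, atomically."""
        wid = next(self._wids)
        with self.conn:  # one transaction per protein entry
            self.conn.execute("INSERT INTO Entry VALUES (?, ?)",
                              (wid, dataset_wid))
            self.conn.execute("INSERT INTO Protein VALUES (?, ?, ?)",
                              (wid, name, dataset_wid))
            self.conn.executemany("INSERT INTO SynonymTable VALUES (?, ?)",
                                  [(wid, s) for s in synonyms])
        return wid
```

Because the satellite table is keyed only by a warehouse ID, the same synonym table could serve proteins, genes, or pathways without one synonym table per datatype.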
Protein Table
CREATE TABLE Protein
(
  WID                 NUMBER,          --The warehouse ID of this protein
  Name                VARCHAR2(500),   --Common name of the protein
  AASequence          VARCHAR2(4000),  --Amino-acid sequence for this prote
  Charge              NUMBER,          --Charge of the chemical
  Fragment            CHAR(1),         --Is this protein a fragment or not,
  MolecularWeightCalc NUMBER,          --Molecular weight calculated from s
  MolecularWeightExp  NUMBER,          --Molecular Weight determined throug
  PICalc              VARCHAR2(50),    --pI calculated from its sequence.
  PIExp               VARCHAR2(50),    --pI value determined through experi
  DataSetWID          NUMBER           --Reference to the data set from whi
);
Database Loaders
Loader tool defined for each DB to be loaded into Warehouse
Example loaders available in several languages
Loaders
   KEGG (C)
   BioCyc collection of 15 pathway DBs (C)
   Swiss-Prot (Java)
   ENZYME (Java)
Terminology
 Model Organism Database (MOD): DB describing genome and other information about an organism
 Pathway/Genome Database (PGDB): MOD that combines information about
   Pathways, reactions, substrates
   Enzymes, transporters
   Genes, replicons
   Transcription factors, promoters, operons, DNA binding sites
 BioCyc: Collection of 15 PGDBs at BioCyc.org
   EcoCyc, AgroCyc, YeastCyc
Loader Architecture
[Diagram]
Grammar for Swiss-Prot -> ANTLR Parser Generator -> Parser for SwissProt
Swiss-Prot Datafile -> Parser for SwissProt -> SQL Insert Commands, Oracle Loadable File
Current Warehouse Contents
                        KEGG   ENZYME  SwissProt  BsubCyc  Warehouse Total
Chemicals              7,284    2,952          0      576           10,812
Genes                  5,714        0     88,605    4,221           98,540
Organisms                 60        0    103,807        1          103,868
Proteins               3,829    3,870    101,602    4,150          113,451
Enzymatic Reactions    3,509        0          0      717            4,226
Pathways               4,517        0          0      138            4,655
Pathway Reactions     36,271        0          0      530           36,801
Example Warehouse Uses
 Check completeness of data sources
Count reactions in ENZYME database with (and without) associated protein sequences in SWISS-PROT database:
  3870 reactions in ENZYME
  1662 reactions (43%) with a sequence in SWISS-PROT
  2208 reactions (57%) without a sequence in SWISS-PROT
Count number of distinct non-partial EC numbers in SWISS-PROT:
  1554 distinct EC numbers in SWISS-PROT (non-partial)
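A completeness check of this kind reduces to a pair of SQL counts once both sources are in the same store. The sketch below uses Python's built-in sqlite3 module. The table names `enzyme_reaction` and `protein_ec`, their columns, and all rows in `make_toy_db` are assumptions made up for illustration, not the warehouse schema or real data. Partial EC numbers are taken to be those containing a `-` placeholder, as in `3.1.-.-`.

```python
import sqlite3

def make_toy_db():
    """In-memory database with a handful of invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE enzyme_reaction (ec_number TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE protein_ec (protein_id TEXT, ec_number TEXT)")
    conn.executemany("INSERT INTO enzyme_reaction VALUES (?)",
                     [("1.1.1.1",), ("1.1.1.2",), ("2.7.1.1",), ("4.4.1.14",)])
    conn.executemany("INSERT INTO protein_ec VALUES (?, ?)",
                     [("P1", "1.1.1.1"), ("P2", "1.1.1.1"),
                      ("P3", "4.4.1.14"), ("P4", "3.1.-.-")])
    conn.commit()
    return conn

def completeness(conn):
    """Return (total reactions, with a sequence, without a sequence)."""
    total = conn.execute(
        "SELECT COUNT(*) FROM enzyme_reaction").fetchone()[0]
    with_seq = conn.execute("""
        SELECT COUNT(*) FROM enzyme_reaction r
        WHERE EXISTS (SELECT 1 FROM protein_ec p
                      WHERE p.ec_number = r.ec_number)
    """).fetchone()[0]
    return total, with_seq, total - with_seq

def distinct_full_ec_numbers(conn):
    """Count distinct EC numbers on proteins, skipping partial ones."""
    return conn.execute("""
        SELECT COUNT(DISTINCT ec_number) FROM protein_ec
        WHERE ec_number NOT LIKE '%-%'
    """).fetchone()[0]

def percent(part, whole):
    """Rounded percentage, as used when reporting the counts."""
    return round(100 * part / whole) if whole else 0
```

Applied to the counts reported on this slide, `percent(1662, 3870)` gives 43 and `percent(2208, 3870)` gives 57.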