Contents of this Talk

 [Used as intro to Genome Databases Seminar, 2002]
 Overview of bioinformatics
 Motivations for genome databases
 Analogy of virus reverse-eng to genome analysis
 Questions to ask of a genome DB
Overview of Genome Databases
Peter D. Karp, Ph.D.
SRI International
pkarp@ai.sri.com
www-db.stanford.edu/dbseminar/seminar.html
Talk Overview
 Definition of bioinformatics
 Motivations for genome databases
 Computer virus analogy
 Issues in building genome databases
Definition of Bioinformatics
 Computational techniques for management and analysis of biological data and knowledge
 Methods for disseminating, archiving, interpreting, and mining scientific information
 Computational theories of biology
 Genome Databases is a subfield of bioinformatics
Motivations for Bioinformatics

Growth in molecular-biology knowledge (literature)

Genomics
1. Study of genomes through DNA sequencing
2. Industrial Biology
Example Genomics Datatypes
 Genome sequences
 DOE Joint Genome Institute: 511M bases in Dec 2001; 11.97G bases since Mar 1999
 Gene and protein expression data
 Protein-protein interaction data
 Protein 3-D structures
Genome Databases

Experimental data
 Archive experimental datasets
 Retrieving past experimental results should be faster than repeating the
experiment
 Capture alternative analyses
 Lots of data, simpler semantics

Computational symbolic theories
 Complex theories become too large to be grasped by a single mind
 The database is the theory
 Biology is very much concerned with qualitative relationships
 Less data, more complex semantics
Bioinformatics


Distinct intellectual field at the intersection of CS and
molecular biology
Distinct field because researchers in the field should know
CS, biology, and bioinformatics

Spectrum from CS research to biology service

Rich source of challenging CS problems

Large, noisy, complex data-sets and knowledge-sets

Biologists and funding agencies demand working solutions
Bioinformatics Research
 algorithms + data structures = programs
 algorithms + databases = discoveries
 Combine sophisticated algorithms with the right content:
 Properly structured
 Carefully curated
 Relevant data fields
 Proper amount of data
Goals of Systems Biology
 Catalog the molecular parts lists of cells
 Understand the function(s) of each part
 Understand how those parts interact to produce the behavior of a cell or organism
 Understand the evolution of those molecular parts
Analogy: Genome Analysis and
Virus Analysis

Given: Virus binary executable file for known machine
architecture

Reverse engineer the program
 Procedures
 Call graph
 Specifications for I/O behavior of the program and all procedures

Capture and publish an annotated analysis of the virus

Comparative analysis of related viruses
Genome Analysis
 Example: M. tuberculosis genome
 Given: 4.4Mbp of DNA (genome)
 Infer: Molecular parts list of Mtb
 A model of the biochemical machinery of Mtb cell

 DNA is a blueprint for the program of life
Start
4.4Mbyte binary program
4.4Mbp DNA sequence
Step 1
Distinguish code from data segments
Find procedure boundaries
Distinguish coding from non-coding regions –
Gene Finding
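As a toy illustration of distinguishing coding from non-coding regions, the sketch below scans the forward strand for open reading frames (an ATG start codon followed in-frame by a stop codon). This is a deliberately naive stand-in, not the method used by real gene finders, which rely on statistical models; the function name `find_orfs` and the `min_codons` threshold are assumptions made for the example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) index pairs of forward-strand ORFs.

    start is the index of the ATG; end is exclusive and includes the
    stop codon. min_codons counts codons from the ATG up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                j = i + 3
                # Walk codon by codon until an in-frame stop or the end.
                while j + 3 <= len(seq):
                    if seq[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        break
                    j += 3
                i = j + 3  # resume scanning after this candidate
            else:
                i += 3
    return sorted(orfs)

# Made-up sequence: ATG at index 2, stop codon TAA at index 14.
example = "CCATGAAACCCGGGTAACC"
orfs = find_orfs(example)
```

For the made-up sequence, the only ORF found spans indices 2 to 17, i.e. `ATGAAACCCGGGTAA`. A real pipeline would also scan the reverse strand and score candidates rather than accept every ORF.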
Step 2
Predict semantics of procedures
Predict gene functions
[Figure: units labeled A, B, C, D]
Step 3
Predict procedure call graph
Predict biochemical and gene networks
[Figure: graphs connecting nodes A, B, C, D]
Step 4
Predict conditions under which procedures are invoked
Predict expression of network fragments
[Figure: graph of nodes A, B, C, D with additional labels Q, R, S]
Step 5
Infer complete program specification
Formulate dynamic cellular simulation
Step 6
Internet publishing of structured program
annotation with explanations, references,
commentary
Internet publishing of structured genome
annotation with explanations, references,
commentary
Step 7
Comparative analysis of viruses
Evolutionary relationships among viruses
Comparative analysis of genomes
Evolutionary relationships among genomes
Step 8
Identify measures to disable virus or prevent its spread
Identify target proteins for anti-microbial drug discovery
[Figure: graph of nodes A, B, C, D with additional labels Q, R, S]
Database of Viruses
 Create a database that stores
 Binaries for all viruses
 All annotation of virus programs by different investigators
 Comparative analyses
 Support
 Remote API access
 Click-at-a-time browsing

Reference on Major Genome
Databases
 Nucleic Acids Research Database Issue
 http://nar.oupjournals.org/content/vol30/issue1/
 112 databases
Questions to Ask of a New
Genome Database
What are Database Goals and
Requirements?
 How many users?
 What expertise do users have?
 What problems will the database be used to solve?
What is its Organizing Principle?
 Different DBs partition the space of genome information in different dimensions
 Experimental methods (Genbank, PDB)
 Organism (EcoCyc, Flybase)
What is its Level of Interpretation?
 Laboratory data
 Primary literature (Genbank)
 Review (SwissProt, MetaCyc)
 Does DB model disagreement?
What are its Semantics and Content?
 What entities and relationships does it model?
 How does its content overlap with similar DBs?
 How many entities of each type are present?
 Sparseness of attributes and statistics on attribute values
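The last two questions above can be answered mechanically once records are in hand. The following is a minimal sketch, assuming records are available as Python dicts with a `type` key; the record layout and the `profile` function are hypothetical and not taken from any particular genome database.

```python
from collections import Counter

def profile(records):
    """Summarize a list of dict records, each carrying a 'type' key.

    Returns (counts, sparseness): counts maps entity type to number of
    records; sparseness maps (type, attribute) to the fraction of records
    of that type whose attribute is missing, None, or an empty string.
    """
    counts = Counter(r["type"] for r in records)

    # Collect every attribute name seen for each entity type.
    attrs = {}
    for r in records:
        attrs.setdefault(r["type"], set()).update(k for k in r if k != "type")

    sparseness = {}
    for t, names in attrs.items():
        for a in names:
            missing = sum(
                1 for r in records
                if r["type"] == t and r.get(a) in (None, "")
            )
            sparseness[(t, a)] = missing / counts[t]
    return counts, sparseness

# Hypothetical records for illustration only.
records = [
    {"type": "gene", "name": "g1", "product": "p1"},
    {"type": "gene", "name": "g2", "product": None},
    {"type": "protein", "name": "p1"},
]
counts, sparseness = profile(records)
```

Here `sparseness[("gene", "product")]` is 0.5 because one of the two gene records has no product. A high fraction for an attribute suggests that queries depending on it will miss many entities.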
What are Sources of its Data?
 Potential information sources
 Laboratory instruments
 Scientific literature: manual entry, natural-language text mining
 Direct submission from the scientific community (Genbank)
 Modification policy
 DB staff only
 Submission of new entries by scientific community
 Update access by scientific community
What DBMS is Employed?
 None
 Relational
 Object oriented
 Frame knowledge representation system
Distribution / User Access
 Multiple distribution forms enhance access
 Browsing access with visualization tools
 API
 Portability
What Validation Approaches are
Employed?
 None
 Declarative consistency constraints
 Programmatic consistency checking
 Internal vs external consistency checking
 What types of systematic errors might DB contain?
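To make the difference between declarative constraints and programmatic checking concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `gene` and `protein` tables, their columns, and the overlap rule are assumptions invented for illustration. The declarative constraints live in the schema and are enforced by the DBMS on every insert; the programmatic check is ordinary code run over existing data.

```python
import sqlite3

def build_db():
    """Create an in-memory database with declarative constraints."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    con.executescript("""
        CREATE TABLE gene (
            id        TEXT PRIMARY KEY,
            start_pos INTEGER NOT NULL CHECK (start_pos >= 1),
            end_pos   INTEGER NOT NULL,
            CHECK (end_pos > start_pos)
        );
        CREATE TABLE protein (
            id      TEXT PRIMARY KEY,
            gene_id TEXT NOT NULL REFERENCES gene(id)
        );
    """)
    return con

def try_insert(con, sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with con:  # commits on success, rolls back on error
            con.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

def overlapping_genes(con):
    """Programmatic check (hypothetical rule): list gene pairs whose
    coordinate ranges overlap, so a curator can review them."""
    return con.execute("""
        SELECT a.id, b.id
        FROM gene a JOIN gene b
          ON a.id < b.id
         AND a.start_pos <= b.end_pos
         AND b.start_pos <= a.end_pos
        ORDER BY a.id, b.id
    """).fetchall()

con = build_db()
ok_gene = try_insert(con, "INSERT INTO gene VALUES (?, ?, ?)", ("g1", 100, 400))
bad_coords = try_insert(con, "INSERT INTO gene VALUES (?, ?, ?)", ("g2", 500, 450))
bad_ref = try_insert(con, "INSERT INTO protein VALUES (?, ?)", ("p1", "missing"))
ok_gene3 = try_insert(con, "INSERT INTO gene VALUES (?, ?, ?)", ("g3", 300, 600))
flagged = overlapping_genes(con)
```

The second and third inserts are rejected by the schema itself, with no checking code written by the database builder. The overlap check only flags pairs for review, since a rule of that kind may have legitimate exceptions.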
Database Documentation
 Schema and its semantics
 Format
 API
 Data acquisition techniques
 Validation techniques
 Size of different classes
 Coverage of subject matter
 Sparseness of attributes
 Error rates
Relationship of Database Field to
Bioinformatics
 Scientists generally ignorant of basic DB principles
 Complex queries vs click-at-a-time access
 Data model
 Defined semantics for DB fields
 Controlled vocabularies
 Regular syntax for flatfiles
 Automated consistency checking
 Most biologists take one programming class
 Evolution of typical genome database
 Finer points of DB research off their radar screen
 Handful of DB researchers work in bioinformatics
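As an illustration of complex queries versus click-at-a-time access, the sketch below answers "which genes of a given organism encode enzymes for a given pathway?" in a single query. It uses Python's standard `sqlite3` module; the three tables, their columns, and all identifiers (`gA`, `org1`, `pw1`, and so on) are hypothetical and do not reflect the schema of any real genome database.

```python
import sqlite3

def demo_db():
    """Build a tiny in-memory database with made-up rows."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gene (id TEXT PRIMARY KEY, organism TEXT);
        CREATE TABLE enzyme (gene_id TEXT, reaction_id TEXT);
        CREATE TABLE pathway_reaction (pathway TEXT, reaction_id TEXT);
        INSERT INTO gene VALUES ('gA', 'org1'), ('gB', 'org1'), ('gC', 'org2');
        INSERT INTO enzyme VALUES ('gA', 'r1'), ('gB', 'r2'), ('gC', 'r1');
        INSERT INTO pathway_reaction VALUES ('pw1', 'r1'), ('pw1', 'r2'), ('pw2', 'r2');
    """)
    return con

def genes_in_pathway(con, organism, pathway):
    """One declarative query replaces visiting each pathway, reaction,
    and enzyme page by hand."""
    rows = con.execute("""
        SELECT DISTINCT g.id
        FROM gene g
        JOIN enzyme e           ON e.gene_id = g.id
        JOIN pathway_reaction p ON p.reaction_id = e.reaction_id
        WHERE g.organism = ? AND p.pathway = ?
        ORDER BY g.id
    """, (organism, pathway)).fetchall()
    return [r[0] for r in rows]

con = demo_db()
org1_pw1 = genes_in_pathway(con, "org1", "pw1")
org1_pw2 = genes_in_pathway(con, "org1", "pw2")
org2_pw2 = genes_in_pathway(con, "org2", "pw2")
```

With click-at-a-time browsing, the same answer requires opening the pathway, then each reaction, then each enzyme, and filtering by organism manually.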
Database Field
 For many years, the majority of bioinformatics DBs did not employ a DBMS
 Flatfiles were the rule
 Scientists want to see the data directly
 Commercial DBMSs too expensive, too complex
 DBAs too expensive
 Most scientists do not understand
 Differences between BA, MS, PhD in CS
 CS research vs applications
 Implications for project planning, funding, bioinformatics research
Recommendation
 Teaching scientists programming is not enough
 Teaching scientists how to build a DBMS is irrelevant
 Teach scientists basic aspects of databases and symbolic computing
 Database requirements analysis
 Data models, schema design
 Knowledge representation, ontologies
 Formal grammars
 Complex queries
 Database interoperability
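To show why a regular, grammar-defined flatfile syntax matters, here is a minimal sketch of a parser for a hypothetical flatfile format: each line is a two-letter uppercase code, three spaces, and a value, and each record ends with a line containing only `//`. The format and the `parse_flatfile` function are assumptions for illustration. Because the syntax is regular, a one-line pattern both parses valid files and rejects malformed ones.

```python
import re

# Grammar for one data line: two uppercase letters, three spaces, a value.
LINE = re.compile(r"^([A-Z]{2})   (.*)$")

def parse_flatfile(text):
    """Parse records made of 'XX   value' lines, each terminated by '//'.

    Returns a list of dicts mapping line code to the list of its values.
    Raises ValueError on any line that violates the grammar.
    """
    records, current = [], {}
    for n, line in enumerate(text.splitlines(), 1):
        if line == "//":
            records.append(current)
            current = {}
            continue
        m = LINE.match(line)
        if m is None:
            raise ValueError(f"line {n}: does not match grammar: {line!r}")
        current.setdefault(m.group(1), []).append(m.group(2))
    if current:
        raise ValueError("unterminated final record")
    return records

# Made-up input for illustration only.
sample = "ID   gene1\nDE   first line\nDE   second line\n//\nID   gene2\n//\n"
parsed = parse_flatfile(sample)
```

A file with free-form or inconsistent syntax would instead need ad hoc special cases for every variation, which is where many flatfile-based databases accumulate errors.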