Pathway/Genome Databases and Software Tools
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
pkarp@ai.sri.com
http://ecocyc.DoubleTwist.com/ecocyc/

Overview
• Overview of bioinformatics
• Motivations for the EcoCyc project
• EcoCyc demo
• Description of EcoCyc database and Pathway Tools software
• Underlying technologies
   • Ocelot object database
   • GKB Editor
   • X-windows to WWW translator

Definition of Bioinformatics
• Computational techniques for management and analysis of biological data and knowledge
   • Methods for disseminating, archiving, interpreting, and mining scientific information

Motivations for Bioinformatics
• Growth in molecular-biology knowledge
• Industrialization of biological experimentation
• High-throughput biology
   • Genome sequences
   • Gene and protein expression data
   • Protein-protein interaction data
   • Protein 3-D structures
   • …

Motivations for EcoCyc: E. coli Encyclopedia
• Integrate E. coli information dispersed in the literature
   • New paradigm of scientific publishing
• Model the full metabolic network of an organism
• Integrate genomic data with functional data
• Develop algorithms for computing with function
• Provide a challenging domain for computer-science research

Definitions
• A chemical reaction interconverts chemical compounds
   A + B = C + D
• An enzyme is a protein that accelerates chemical reactions
• A pathway is a linked set of reactions
   • A conceptual unit of the cell’s biochemical machine

Organism-Specific Pathway/Genome Databases
• Layer functional information above the genome
• Rich ontology to encode biological information with high fidelity
   • Chromosomes, genes, operons, gene products, reactions, pathways
• Curated by experts for that organism
   • Integrate literature and computational predictions

Pathway Tools Software
• Pathway/Genome Navigator
   • WWW publishing of Pathway/Genome Databases (PGDBs)
   • Graphic depictions of pathways, chromosomes, operons
   • Pathway visualization of gene-expression data
• Pathway/Genome Editors
   • Distributed curation of genome annotations
   • Distributed object database system
   • Interactive editing tools
• PathoLogic
   • Prediction of metabolic network from genome

EcoCyc = E. coli Dataset + Pathway/Genome Navigator
Metabolic Network
Pathways: 158
Reactions: 1,117
Compounds: 1,887
Gene Products: 4,393
Genes: 4,393
Operons: 375
http://ecocyc.DoubleTwist.com/ecocyc/

EcoCyc
• Collaborative development via the Internet
   • Karp -- Bioinformatics architect
   • Riley -- Metabolic pathways, signal transduction
   • Saier and Paulsen -- Transport
   • Collado -- Regulation of gene expression
• Ontology of 1000 biological classes
   • 14,000 instances
• Over 2,600 registered users

Pathway Tools Software
[Diagram of components: Pathway/Genome Navigator, PathoLogic Pathway Predictor, Pathway/Genome Databases, Pathway/Genome Editors]

Creation of the Overview Graph
• Run layout algorithms on individual pathway graphs
   • Automatically determine topology of pathway graph
   • Apply associated layout algorithm (linear, circular, tidy tree)
• Use superpathways to create hierarchical layouts
   • Treat each individual pathway as a single node
   • Pathway connections are edges
   • Run appropriate layout algorithm
• Manually position the resulting pathway clusters
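
The first step on this slide, determining the topology of a pathway graph so that a matching layout can be chosen, might be sketched as follows. This is a minimal Python illustration, not the Pathway Tools code (which the talk describes as written in Common Lisp). The function name `classify_topology`, the edge-list representation, and the classification rules are all assumptions made for this example; only the three layout names come from the slide.

```python
from collections import defaultdict

# Layout names come from the slide; the mapping itself is illustrative.
LAYOUT_FOR_TOPOLOGY = {"linear": "linear", "circular": "circular", "tree": "tidy tree"}


def classify_topology(edges):
    """Classify a directed pathway graph given as (from, to) pairs.

    Returns "linear", "circular", "tree", or "other". The rules below are
    simplifying assumptions for this sketch.
    """
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for a, b in edges:
        succ[a].add(b)
        pred[b].add(a)
        nodes.update((a, b))
    if not nodes:
        return "other"

    # Circular: every node has one predecessor and one successor, and
    # following successors from any node visits every node.
    if all(len(pred[n]) == 1 and len(succ[n]) == 1 for n in nodes):
        seen, cur = set(), next(iter(nodes))
        while cur not in seen:
            seen.add(cur)
            cur = next(iter(succ[cur]))
        return "circular" if seen == nodes else "other"

    # Linear chains and trees need exactly one root and no merging branches.
    roots = [n for n in nodes if not pred[n]]
    if len(roots) != 1 or any(len(pred[n]) > 1 for n in nodes):
        return "other"

    # Every node must be reachable from the root.
    seen, stack = set(), [roots[0]]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succ[n])
    if seen != nodes:
        return "other"
    return "linear" if all(len(succ[n]) <= 1 for n in nodes) else "tree"
```

A caller could then look up the layout with `LAYOUT_FOR_TOPOLOGY.get(classify_topology(edges))`. Graphs classified as "other" return `None` and would need a fallback that the slide does not describe.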

Inference of Metabolic Pathways
[Diagram: PathoLogic takes an annotated genome (structured ASCII text file: list of gene products, list of genes/ORFs, DNA sequence) together with MetaCyc, and produces a Pathway/Genome Database (metabolic network of pathways, reactions, and compounds; gene products; genes; genomic map) plus reports]

Summary of H. pylori Analysis
• For 121 E. coli pathways, what is the evidence that each pathway occurs in H. pylori?
   • Strong evidence: 41
   • Medium evidence: 29
   • Little or no evidence: 51
   • 31 reactions catalyzed by H. pylori but not by E. coli
• H. pylori has partial abilities to synthesize cofactors and amino acids, extremely limited carbohydrate catabolism, some amino acid utilization, and a reductive citric-acid pathway
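
The talk does not state how pathways were assigned to the three evidence categories. One plausible approach, sketched here purely as an assumption, is to bin each pathway by the fraction of its reactions that have a catalyzing enzyme in the target genome. The function name `pathway_evidence` and both thresholds are hypothetical; only the three reported counts come from the slide.

```python
def pathway_evidence(pathway_reactions, organism_reactions,
                     strong=0.75, medium=0.4):
    """Bin a pathway by the fraction of its reactions found in the organism.

    The thresholds are hypothetical placeholders, not values from the talk.
    """
    reactions = set(pathway_reactions)
    if not reactions:
        return "little or none"
    fraction = len(reactions & set(organism_reactions)) / len(reactions)
    if fraction >= strong:
        return "strong"
    if fraction >= medium:
        return "medium"
    return "little or none"


# The three counts reported on the slide account for all 121 pathways.
reported = {"strong": 41, "medium": 29, "little or none": 51}
assert sum(reported.values()) == 121
```

Under this sketch, a four-reaction pathway with three reactions present in the organism would count as strong evidence, while one with none present would fall in the last bin.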

Microbial Pathway/Genome DBs
Literature-based Datasets:
   • MetaCyc
   • Escherichia coli
PathoLogic-based Datasets:
   • Bacillus subtilis
   • Mycobacterium tuberculosis
   • Helicobacter pylori
   • Haemophilus influenzae
   • Mycoplasma pneumoniae
   • Treponema pallidum
   • Chlamydia trachomatis
   • Saccharomyces cerevisiae

Pathway Tools Software Architecture
• Implemented in Common Lisp
• WWW server runs as a single Unix process with a separate thread to service each query
• Grasper-CL graph manager
• Ocelot object database
• GKB Editor schema-driven editor

EcoCyc WWW Server

Pathway Tools Architecture: Development Configuration
[Diagram components: WWW Server, X-Windows Graphics, Pathway/Genome Navigator, Object Editor, Pathway Editor, Reaction Editor, GFP API, Ocelot DBMS, Oracle]

Ocelot Database System
• Object Database Manager
   • Persistence via filesystem or relational DBMS
• Demand and background faulting of objects from RDBMS
   • Two-level object caching
   • Extensive bioinformatics schema
• Stored transaction history
   • Inspect object history

Ocelot Knowledge Server Architecture
• Frame data model
• Persistent storage via
   • Disk files
   • Oracle DBMS
• Optimistic concurrency-control protocol
• Schema evolution
• Logging facility

The Frame Data Model
• Frames are of two types: classes, instances
• Frames have slots that define their properties, attributes, relationships
• A slot has one or more values
• Each value can be any Lisp datatype
• Slotunits define metadata about slots:
   • Domain, range, inverse
   • Collection type, number of values, value constraints
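
The frame data model on this slide can be illustrated with a small Python sketch. It is not Ocelot's implementation: the class names `Frame` and `SlotUnit`, the helper `put_slot_values`, the field names, and the sample map position are all assumptions chosen for this example. The sketch covers only domain, range, inverse, and number of values from the slotunit metadata listed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Frame:
    name: str
    is_class: bool = False                       # class frame or instance frame
    parents: list = field(default_factory=list)  # class frames this frame belongs to
    slots: dict = field(default_factory=dict)    # slot name -> list of values


@dataclass
class SlotUnit:
    """Metadata about one slot (hypothetical field names)."""
    name: str
    domain: str                        # class whose frames may carry the slot
    value_type: type                   # range: required type of each value
    inverse: Optional[str] = None      # name of the inverse slot, if any
    max_values: Optional[int] = None   # None means any number of values


def put_slot_values(frame, slotunit, values):
    """Store values on a frame after checking them against the slotunit."""
    values = list(values)
    if not any(p.name == slotunit.domain for p in frame.parents):
        raise ValueError(f"{frame.name} is outside the domain of {slotunit.name}")
    if slotunit.max_values is not None and len(values) > slotunit.max_values:
        raise ValueError(f"{slotunit.name} allows at most {slotunit.max_values} value(s)")
    for v in values:
        if not isinstance(v, slotunit.value_type):
            raise TypeError(f"{v!r} is outside the range of {slotunit.name}")
    frame.slots[slotunit.name] = values


genes = Frame("Genes", is_class=True)
trpA = Frame("trpA", parents=[genes])
map_position = SlotUnit("map-position", domain="Genes", value_type=float, max_values=1)
put_slot_values(trpA, map_position, [28.3])  # illustrative value only
```

Keeping the constraints in a separate `SlotUnit` object mirrors the slide's point that slot metadata is itself data, so an editor or validator can read it without hard-coded knowledge of each slot.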

Inference Capabilities
• Inheritance of defaults
• Slot values computed via attached procedures
• Maintenance of inverse relationships
• Constraint system
   • Deferred evaluation
   • Tolerant of nonconformant data
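
Two of the capabilities listed on this slide, inheritance of defaults and maintenance of inverse relationships, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Ocelot's mechanism: the `Frame` class, the `add_value` helper, the `INVERSES` table, and the slot names `catalyzes`, `catalyzed-by`, and `location` are all hypothetical.

```python
class Frame:
    """Minimal frame with a single parent, enough to show inheritance."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.slots = {}  # slot name -> list of values

    def get(self, slot):
        """Return local values, else the nearest ancestor's default values."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        return []


# Hypothetical inverse declarations; a real system would read these from slotunits.
INVERSES = {"catalyzes": "catalyzed-by", "catalyzed-by": "catalyzes"}


def add_value(frame, slot, value):
    """Add a value and keep the inverse slot on the other frame in step."""
    values = frame.slots.setdefault(slot, [])
    if value not in values:
        values.append(value)
    inverse = INVERSES.get(slot)
    if inverse is not None and isinstance(value, Frame):
        back = value.slots.setdefault(inverse, [])
        if frame not in back:
            back.append(frame)


proteins = Frame("Proteins")
proteins.slots["location"] = ["cytoplasm"]        # class-level default (illustrative)
enzyme = Frame("example-enzyme", parent=proteins)
reaction = Frame("example-reaction")
add_value(enzyme, "catalyzes", reaction)
```

After these calls, `enzyme.get("location")` falls back to the default stored on `Proteins`, and `reaction.slots["catalyzed-by"]` contains `enzyme` even though only the forward link was asserted.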

Storage System Architecture
• Oracle KBs
• DBMS is submerged within the frame representation system (FRS)
   • Relational schema is domain independent, supports multiple KBs simultaneously
• Frames transferred from DBMS to Ocelot
   • On demand
   • By background prefetcher
   • Memory cache
   • Persistent disk cache to speed performance via Internet

Frame Faulting
(get-slot-value gene 'map-position)
• Gene present in in-memory object cache?
   • Gene present in cache on local disk?
   • Query Oracle DBMS
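
The lookup order on this slide (memory cache, then local disk cache, then the DBMS) can be sketched as a three-tier fetch. The Python below is an assumption-laden illustration: `FaultingStore` and its methods are hypothetical names, plain dictionaries stand in for the disk cache and the Oracle DBMS, and the `trace` list exists only to show which tier answered.

```python
class FaultingStore:
    """Three-tier frame lookup: memory cache, disk cache, then DBMS."""

    def __init__(self, disk_cache, dbms):
        self.memory = {}        # in-memory object cache
        self.disk = disk_cache  # dict standing in for the local disk cache
        self.dbms = dbms        # dict standing in for the Oracle DBMS
        self.trace = []         # which tier served each fetch

    def fetch_frame(self, frame_id):
        if frame_id in self.memory:
            self.trace.append("memory")
            return self.memory[frame_id]
        if frame_id in self.disk:
            self.trace.append("disk")
            frame = self.disk[frame_id]
        else:
            self.trace.append("dbms")
            frame = self.dbms[frame_id]   # raises KeyError for an unknown frame
            self.disk[frame_id] = frame   # populate the persistent cache
        self.memory[frame_id] = frame     # populate the memory cache
        return frame

    def get_slot_value(self, frame_id, slot):
        """Rough analogue of (get-slot-value gene 'map-position)."""
        return self.fetch_frame(frame_id)[slot]


store = FaultingStore(disk_cache={}, dbms={"trpA": {"map-position": 28.3}})
store.get_slot_value("trpA", "map-position")  # first access faults from the DBMS
store.get_slot_value("trpA", "map-position")  # second access hits the memory cache
```

Because the disk cache outlives the process in the real system, a later session would start with an empty memory cache but could still avoid the round trip to the DBMS.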

Logging
• Oracle DBMS stores:
   • The latest version of each frame
   • A history of all OKBC operations applied to KB
• Reconstruct earlier versions of KB
   • View history of changes to an object
   • Update replicates
   • Concurrency control
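
Storing a history of operations is what makes it possible to reconstruct earlier versions of a KB: replaying a prefix of the log yields the state at that point. The Python sketch below illustrates the idea under simplifying assumptions. The tuple layout, the three operation names, and the functions `replay` and `frame_history` are stand-ins chosen for this example, not the actual log format.

```python
def replay(log, upto=None):
    """Rebuild KB state by applying logged operations in order.

    log:  list of (operation, frame, slot, value) tuples
    upto: number of operations to apply; None applies the whole log
    """
    kb = {}
    for op, frame, slot, value in log[:upto]:
        if op == "create-frame":
            kb[frame] = {}
        elif op == "put-slot-value":
            kb[frame][slot] = value
        elif op == "delete-frame":
            del kb[frame]
        else:
            raise ValueError(f"unknown operation: {op}")
    return kb


def frame_history(log, frame):
    """Return every logged operation that touched one frame."""
    return [entry for entry in log if entry[1] == frame]


# Illustrative log; the values are made up for the example.
log = [
    ("create-frame", "trpA", None, None),
    ("put-slot-value", "trpA", "map-position", 28.0),
    ("put-slot-value", "trpA", "map-position", 28.3),
]
current = replay(log)           # state after all operations
earlier = replay(log, upto=2)   # state before the last edit
```

The same log could also be replayed against a replica to bring it up to date, which is one reading of the "update replicates" bullet.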

Schema Management
• FRSs store and process class and instance information similarly
• Applications can query schema information as easily as they can query instances

GKB Editor
• Browser and editor for KBs and ontologies
• Four editing tools
• GKB Editor reusable with multiple FRSs
   • All database queries via OKBC/GFP API
   • Interoperability achieved with Ocelot, LOOM, Ontolingua
• All operations are schema driven
   • http://www.ai.sri.com/~gkb/overview.html

Editors
• Taxonomy editor
• Frame editor
• Relationships editor
• Spreadsheet editor

Results
• Ocelot in use in the EcoCyc project for 5 years
• Supports collaborative development of EcoCyc by four groups in North America
   • Distributed architecture
   • GKB Editor in active use
• Supports development of 8 Pathway/Genome Databases

Summary
• Pathway/Genome Databases
• Pathway Tools software
   • Extract pathways from genomes
   • Distributed curation tools
   • Query, visualization, WWW publishing
   • Analysis algorithms

Computer Science Results
• Extend scalability and multiuser access for knowledge representation systems
• Reusable, schema-driven KB editor
• Hierarchical graph layout algorithms
• Dynamic translation from X-windows to HTML+GIF
• Importance of ontologies and of content:
• Discovery = Algorithm + Database

Problem Solving Depends on Algorithms and Content
[Diagram labels: Compute Time, Algorithm Quality, Database Size and Quality, Solution Quality]

Bioinformatics Results: Content
• The EcoCyc database describes the full metabolic map of an organism
• The MetaCyc database describes over 300 metabolic pathways
• Ontology spans genome to pathway information

Bioinformatics Results: Algorithms
• Software environment for genome and pathway information
   • Query and visualization
   • Distributed database development
• PathoLogic algorithm predicts the metabolic network of an organism from its genome
• Algorithms under development for qualitative modeling of the cell

Acknowledgements
• Funding sources:
   • NIH National Center for Research Resources
• Collaborators:
   • Monica Riley, Marine Biological Laboratory
   • Milton Saier, UC San Diego
   • Julio Collado, UNAM
   • Christos Ouzounis, European Bioinformatics Institute

Peter D. Karp, Ph.D.
http://www.ai.sri.com/pkarp/