Computational Exploration
of Metabolic Networks with
Pathway Tools
Part 1: Overview &
Representations
Suzanne Paley
Bioinformatics Research Group
SRI International
paley@ai.sri.com
http://BioCyc.org/
Motivation: Theories of Cellular Function Too Large for One Mind to Grasp
SRI International
Bioinformatics




Example: E. coli metabolic network
 160 pathways involving 744 reactions and 791 substrates
Example: E. coli genetic network
 Control by 97 transcription factors of 1174 genes in 630 transcription units
Past solutions:
 Partition theories across multiple minds
 Encode theories in natural-language text
We cannot compute with theories in those forms
 Evaluate theories for consistency with new data: microarrays
 Refine theories with respect to new data
 Compare theories describing different organisms
Solution: Biological Knowledge Bases
Store biological knowledge and theories in computers in a declarative form
 Amenable to computational analysis and generative user interfaces
Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
A high-quality, comprehensive knowledge base enables us to ask and answer important new questions
Terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism
Pathway/Genome Database (PGDB) – MOD that combines information about
 Pathways, reactions, substrates
 Enzymes, transporters
 Genes, replicons
 Transcription factors, promoters, operons, DNA binding sites
BioCyc – Collection of 15 PGDBs at BioCyc.org
 EcoCyc, AgroCyc, HumanCyc
Pathway Tools Software

PathoLogic
 Prediction of metabolic network from genome
 Computational creation of new Pathway/Genome Databases

Pathway/Genome Editors
 Distributed curation of genome annotations
 Distributed object database system
 Interactive editing tools

Pathway/Genome Navigator
 WWW publishing of PGDBs
 Graphic depictions of pathways, chromosomes, operons
 Analysis operations


Pathway visualization of gene-expression data
Global comparisons of metabolic networks
Pathway Tools Software
[Figure: architecture diagram with components PathoLogic Pathway Predictor, Pathway/Genome Databases, Pathway/Genome Navigator, and Pathway/Genome Editors]
Pathway/Genome Database
[Figure: contents of a PGDB describing a cell: Pathways; Reactions; Compounds; Proteins; Genes; Operons, Promoters, DNA Binding Sites; Chromosomes, Plasmids]
Pathway Tools Algorithms
Visualization and editing tools for following datatypes
Full Metabolic Map
 Paint gene expression data on metabolic network; compare metabolic networks
Pathways
 Pathway prediction
Reactions
 Balance checker
Compounds
 Chemical substructure comparison
Enzymes, Transporters, Transcription Factors
Genes
Chromosomes
Operons
 Operon prediction; visualize genetic network
Definitions

Chemical reactions interconvert chemical compounds
 A + B → C + D

An enzyme is a protein that accelerates chemical reactions

A pathway is a linked set of reactions
 Often regulated as a unit
 A conceptual unit of cell’s biochemical machine
[Figure: pathway diagram with labeled nodes A, C, E]
Operations of the Metabolic Overview
Find pathways, compounds
Find reactions
 By enzyme name, EC number, substrates, modulation
 All with isozymes
 All occurring in multiple pathways
 By EC class, pathway class
Find genes
 By name, gene class
 All regulated by transcriptional regulator protein
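Two of the reaction queries above ("all with isozymes", "all occurring in multiple pathways") can be sketched over a toy data structure. This is only an illustration: the dictionary layout and every identifier in it (rxn-1, enzA, pwy-1, ...) are hypothetical placeholders, not the Pathway Tools API or real database frames.

```python
# Hypothetical toy data; the identifiers are placeholders, not real database frames.
REACTIONS = {
    "rxn-1": {"enzymes": ["enzA", "enzB"], "pathways": ["pwy-1"]},
    "rxn-2": {"enzymes": ["enzC"], "pathways": ["pwy-1", "pwy-2"]},
    "rxn-3": {"enzymes": ["enzD", "enzE"], "pathways": ["pwy-2", "pwy-3"]},
}


def reactions_with_isozymes(reactions):
    """Return ids of reactions catalyzed by more than one enzyme."""
    return sorted(rxn for rxn, data in reactions.items() if len(data["enzymes"]) > 1)


def reactions_in_multiple_pathways(reactions):
    """Return ids of reactions that occur in more than one pathway."""
    return sorted(rxn for rxn, data in reactions.items() if len(data["pathways"]) > 1)


print(reactions_with_isozymes(REACTIONS))
print(reactions_in_multiple_pathways(REACTIONS))
```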
Metabolic Overview Queries
Species comparison
 Highlight reactions that are shared/not-shared with any-one/all-of a specified set of species
Overlay expression data
 Colors reflect expression level and are user-configurable
 Can show single experiment or animated time series
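The species comparison amounts to set operations over reaction sets. The sketch below is one possible reading of the four combinations (shared or not-shared, any-one or all-of); the function name, its parameters, and the reaction ids are assumptions made for illustration, not part of Pathway Tools.

```python
def highlight(reference, others, shared=True, quantifier="any"):
    """Select reactions of `reference` to highlight.

    shared=True,  quantifier="any": present in at least one other species
    shared=True,  quantifier="all": present in every other species
    shared=False, quantifier="any": absent from at least one other species
    shared=False, quantifier="all": absent from every other species
    """
    combine = any if quantifier == "any" else all
    return {
        rxn
        for rxn in reference
        if combine((rxn in rxns) == shared for rxns in others)
    }


# Hypothetical reaction sets for a reference organism and two comparison species.
reference = {"r1", "r2", "r3"}
others = [{"r1", "r2"}, {"r1", "r4"}]

print(sorted(highlight(reference, others, shared=True, quantifier="all")))
print(sorted(highlight(reference, others, shared=False, quantifier="all")))
```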
EcoCyc Project

E. coli Encyclopedia
 Model-Organism Database for E. coli
 Began in 1992 as collaboration between Karp and Riley
 Over 3500 literature citations

Collaborative development via Internet
 Karp (SRI) -- Bioinformatics architect
 John Ingraham -- Advisor
 (SRI) Metabolic pathways
 Saier (UCSD) and Paulsen (TIGR)-- Transport
 Collado (UNAM)-- Regulation of gene expression

Ontology: 1000 biological classes
Database content: 17,700 instances

EcoCyc = E. coli Dataset + Pathway/Genome Navigator
Pathways: 165
Reactions: 2,760
Enzymes: 914
Transporters: 162
Proteins: 4,273
Promoters: 812
TransFac Sites: 956
Citations: 3,508
Compounds: 774
Genes: 4,393
Transcription Units: 724
Transcription Factors: 110
http://BioCyc.org/
MetaCyc: Metabolic Encyclopedia







Nonredundant metabolic pathway database
Describe a representative sample of every experimentally determined metabolic pathway
Literature-based DB with extensive references and commentary
Pathways, reactions, enzymes, substrates
460 pathways, 1267 enzymes, 4294 reactions
 172 E. coli pathways, 2735 citations
Nucleic Acids Research 30:59-61 2002.
Jointly developed by SRI and Carnegie Institution
 New focus on plant pathways
MetaCyc Data
MetaCyc contains one DB object for each distinct pathway
 Distinct in terms of reaction steps
 Each pathway labeled with species it occurs in
MetaCyc pathways are experimentally determined
4218 reactions in MetaCyc
 401 lack EC numbers
MetaCyc Enzyme Data
 Reaction(s) catalyzed
 Alternative substrates
 Cofactors / prosthetic groups
 Activators and inhibitors
 Subunit structure
 Molecular weight, pI
 Comment, literature citations
 Species
MetaCyc Frequent Organisms
Escherichia coli          156
Arabidopsis thaliana       47
Homo sapiens               30
Pseudomonas                21
Bacillus subtilis          20
Salmonella typhimurium     20
Sulfolobus solfataricus    18
Pseudomonas putida         14
Saccharomyces cerevisiae   14
Haemophilus influenzae     13
Glycine max                11
Deinococcus radiodurans    10
EcoCyc and MetaCyc
Review-level databases
 Data derived primarily from biomedical literature
 Manual entry by staff curators
 Updates by staff curators only
 Data validation
 Consistency constraints
 Lisp programs that verify other semantic relationships (e.g., unbalanced chemical reactions)
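A check for unbalanced chemical reactions can be sketched as an atom count on each side of the reaction. The slides say the actual checks are Lisp programs; the Python below is only an illustrative stand-in. It assumes simple formulas without parentheses or charges, and all function names are hypothetical.

```python
import re
from collections import Counter


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(side):
    """Sum atom counts over a list of (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(left, right):
    """True if both sides of the reaction contain the same atoms."""
    return side_counts(left) == side_counts(right)


# C6H12O6 + 6 O2 = 6 CO2 + 6 H2O
print(is_balanced([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (6, "H2O")]))
# H2 + O2 = H2O (oxygen does not balance)
print(is_balanced([(1, "H2"), (1, "O2")], [(1, "H2O")]))
```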
Computationally-Derived PGDBs
[Figure: PathoLogic Software integrates genome and pathway data to identify putative metabolic networks. Inputs: Annotated Genomic Sequence (Genes/ORFs, Gene Products, DNA Sequences) and the Multi-organism Pathway Database, MetaCyc (Pathways, Reactions, Compounds). Output: Pathway/Genome Database (Pathways, Reactions, Compounds, Gene Products, Genes, Genomic Map)]
PathoLogic Input/Output
Inputs:
 File listing genetic elements: http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
 Files containing DNA sequence for each genetic element
 Files containing annotation for each genetic element
 MetaCyc database
Output:
 Pathway/genome database for the subject organism
 Directory tree for the subject organism
 Reports that summarize: evidence contained in the input genome for the presence of reference pathways; reactions missing from inferred pathways
PathoLogic Functionality
Initialize schema for new PGDB
Transform existing genome to PGDB form
Infer metabolic pathways and store in PGDB
Infer operons and store in PGDB
Assist user with manual tasks
 Assign enzymes to reactions they catalyze
 Identify false-positive pathway predictions
 Build protein complexes from monomers
 Assemble Overview diagram
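The pathway-inference step and the "reactions missing from inferred pathways" report can be illustrated with a deliberately simplified sketch. The slides do not describe PathoLogic's actual scoring rules, so the evidence fraction and the threshold used here are assumptions, and all identifiers are placeholders.

```python
def score_pathways(reference_pathways, annotated_reactions):
    """For each reference pathway, list reactions with and without enzyme evidence."""
    report = {}
    for pathway, reactions in reference_pathways.items():
        present = [r for r in reactions if r in annotated_reactions]
        missing = [r for r in reactions if r not in annotated_reactions]
        report[pathway] = {
            "present": present,
            "missing": missing,
            "fraction": len(present) / len(reactions),
        }
    return report


def predict(report, threshold=0.5):
    """Keep pathways whose evidence fraction reaches a (hypothetical) threshold."""
    return sorted(p for p, entry in report.items() if entry["fraction"] >= threshold)


# Reference pathways (as would come from MetaCyc) and reactions for which the
# annotated genome has a candidate enzyme; both are made-up placeholders.
reference = {"pwy-A": ["r1", "r2", "r3"], "pwy-B": ["r4", "r5"]}
annotated = {"r1", "r2", "r5"}

report = score_pathways(reference, annotated)
print(predict(report, threshold=0.6))
print(report["pwy-A"]["missing"])
```

Pathways that pass such a filter would still need the manual review listed above, for example to identify false-positive predictions.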
BioCyc Collection of Pathway/Genome DBs
Literature-based datasets:
 Escherichia coli (EcoCyc)
 MetaCyc
Computationally-derived datasets:
 Agrobacterium tumefaciens
 Caulobacter crescentus
 Chlamydia trachomatis
 Bacillus subtilis
 Helicobacter pylori
 Haemophilus influenzae
 Homo sapiens
 Mycobacterium tuberculosis H37Rv
 Mycobacterium tuberculosis CDC1551
 Mycoplasma pneumoniae
 Pseudomonas aeruginosa
 Treponema pallidum
 Vibrio cholerae
PGDBs at other sites:
 Arabidopsis thaliana (TAIR)
 Methanococcus jannaschii (EBI)
 Saccharomyces cerevisiae (SGD)
 Synechocystis PCC6803
Legend: Yellow = Open Database
http://BioCyc.org/
HumanCyc: Human Metabolic Pathway Database
PGDB of human metabolic pathways built using PathoLogic
Contains information on 28,700 genes, their products, and the metabolic reactions and pathways they catalyze (no signalling pathways)
Chromosome and contigs from Ensembl
Human genetic loci from LocusLink
Mitochondrion data from GenBank
Ensembl and LocusLink gene entries were merged to eliminate redundancies where possible.
Contains links to human genome web sites
Plan to hire one curator to refine and curate with respect to literature over a 2-year period
 Remove false-positive predictions
 Insert known pathways missed by PathoLogic
 Add comments and citations from pathways and enzymes to the literature
 Add enzyme activators, inhibitors, cofactors, tissue information
Funded by commercial consortium
BioCyc and Pathway Tools Availability
WWW BioCyc freely available to all
 BioCyc.org
Six BioCyc DBs openly available to all
BioCyc DBs freely available to non-profits
 Flatfiles downloadable from BioCyc.org
Pathway Tools freely available to non-profits
 Binary executable: Sun UltraSparc-170 w/ 64MB memory; PC, 400MHz CPU, 64MB memory, Windows-98 or newer
 PerlCyc API
Information Sources

Pathway Tools User’s Guide
 aic-export/ecocyc/genopath/released/doc/userguide1.pdf



Pathway/Genome Navigator
Appendix A: Guide to the Pathway Tools Schema
aic-export/ecocyc/genopath/released/doc/userguide2.pdf

PathoLogic, Editing Tools

Pathway Tools Web Site
 http://bioinformatics.ai.sri.com/ptools/
 Publications, programming examples, etc.

Pathway Tools Tutorial
 http://bioinformatics.ai.sri.com/ptools/tutorial/
Pathway Tools Implementation Details
Allegro Common Lisp
 Sun and PC platforms
Ocelot object database
250,000 lines of code
Lisp-based WWW server at BioCyc.org
 Manages 15 PGDBs
Frame Data Model
Frame Data Model -- organizational structure for a PGDB
 Knowledge base (KB, Database, DB)
 Frames
 Slots
Knowledge Base
Collection of frames and their associated slots, values, facets, and annotations
 AKA: Database, PGDB
Can be stored within
 An Oracle DB
 A disk file
 A Pathway Tools binary program
Frames
Entities with which facts are associated

Kinds of frames:
 Classes: Genes, Pathways, Biosynthetic Pathways
 Instances (objects): trpA, TCA cycle

Classes:
 Superclass(es)
 Subclass(es)
 Instance(s)

A symbolic frame name (id, key) uniquely identifies each frame
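The class/instance distinction and the unique frame name can be sketched with a small in-memory knowledge base. This is a minimal illustration in Python, not the Ocelot implementation. The class and method names are assumptions, and the instance name trp-biosynthesis is a made-up example.

```python
class KnowledgeBase:
    """Toy frame store: every frame is identified by a unique symbolic name."""

    def __init__(self):
        self.frames = {}  # frame name -> {"kind", "parents", "slots"}

    def _add(self, name, kind, parents):
        if name in self.frames:
            raise ValueError(f"duplicate frame name: {name}")
        for parent in parents:
            if self.frames[parent]["kind"] != "class":
                raise ValueError(f"parent must be a class: {parent}")
        self.frames[name] = {"kind": kind, "parents": list(parents), "slots": {}}

    def add_class(self, name, superclasses=()):
        self._add(name, "class", superclasses)

    def add_instance(self, name, classes):
        self._add(name, "instance", classes)

    def is_a(self, name, cls):
        """True if frame `name` is `cls` or descends from it."""
        if name == cls:
            return True
        return any(self.is_a(p, cls) for p in self.frames[name]["parents"])

    def instances_of(self, cls):
        """All instances of `cls`, including instances of its subclasses."""
        return sorted(
            name
            for name, frame in self.frames.items()
            if frame["kind"] == "instance" and self.is_a(name, cls)
        )


kb = KnowledgeBase()
kb.add_class("Genes")
kb.add_class("Pathways")
kb.add_class("Biosynthetic-Pathways", superclasses=["Pathways"])
kb.add_instance("trpA", ["Genes"])
kb.add_instance("TCA-cycle", ["Pathways"])
kb.add_instance("trp-biosynthesis", ["Biosynthetic-Pathways"])  # hypothetical name

print(kb.instances_of("Pathways"))
```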
Slots
Encode attributes/properties of a frame
 Integer, real number, string
Represent relationships between frames
 The value of a slot is the identifier of another frame
Every slot is described by a “slot frame” in a KB that defines meta information about that slot
Properties of Slots
Number of values
 Single valued
 Multivalued: sets, bags
Slot values
 Any LISP object: Integer, real, string, symbol (frame name)
Slotunits define properties of slots: datatypes, classes, constraints
Two slots are inverses if they encode opposite relationships
 Slot Product in class Genes
 Slot Gene in class Polypeptides
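Inverse slots can be kept consistent by writing both directions whenever one is set. The sketch below assumes a plain dictionary as the KB and a hypothetical helper `put_slot_value`; the polypeptide frame name is also made up for illustration.

```python
# Declared inverse pairs: setting one side also records the other.
INVERSES = {"Product": "Gene", "Gene": "Product"}


def put_slot_value(kb, frame, slot, value):
    """Add `value` to a multivalued slot and maintain the inverse slot, if any."""
    values = kb.setdefault(frame, {}).setdefault(slot, [])
    if value not in values:
        values.append(value)
    inverse = INVERSES.get(slot)
    if inverse is not None:
        inverse_values = kb.setdefault(value, {}).setdefault(inverse, [])
        if frame not in inverse_values:
            inverse_values.append(frame)


kb = {}
put_slot_value(kb, "trpA", "Product", "TrpA-polypeptide")  # hypothetical frame name
print(kb["TrpA-polypeptide"]["Gene"])
```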
Pathway Tools Ontology

1064 classes
 Main classes such as:


Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons,
DNA-Segments (Genes, Operons, Promoters)
Taxonomies for Pathways, Reactions, Compounds

205 slots
 Meta-data: Creator, Creation-Date
 Comment, Citations, Common-Name, Synonyms
 Attributes: Molecular-Weight, DNA-Footprint-Size
 Relationships: Catalyzes, Component-Of, Product

Classes, instances, slots all stored side by side in DBMS,
share a single namespace
Slot Links from Gene to Pathway Frame
[Figure: chain of frames linked by slots. Genes sdhA, sdhB, sdhC, sdhD (on Chrom) link by product to polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; these link by component-of to Succinate dehydrogenase, which links by catalyzes to an Enzymatic-reaction, then by reaction to succinate + FAD = fumarate + FADH2 (left: succinate, FAD; right: fumarate, FADH2), then by in-pathway to TCA Cycle]
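The chain of slot links from a gene to its pathway can be followed programmatically. The sketch below uses the slot names shown on the slide; the dictionary layout, the frame name enzrxn-sdh, and the pairing of sdhA with Sdh-flavo are assumptions made for illustration.

```python
# Toy KB using the slot names from the slide; "enzrxn-sdh" is a hypothetical frame name.
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["enzrxn-sdh"]},
    "enzrxn-sdh": {"reaction": ["succinate + FAD = fumarate + FADH2"]},
    "succinate + FAD = fumarate + FADH2": {
        "left": ["succinate", "FAD"],
        "right": ["fumarate", "FADH2"],
        "in-pathway": ["TCA Cycle"],
    },
}


def follow(kb, start, slots):
    """Follow a sequence of slots from one frame, collecting the frames reached."""
    frontier = [start]
    for slot in slots:
        frontier = [value for frame in frontier for value in kb.get(frame, {}).get(slot, [])]
    return frontier


def pathways_of_gene(kb, gene):
    """Gene -> polypeptide -> complex -> enzymatic reaction -> reaction -> pathway."""
    return follow(kb, gene, ["product", "component-of", "catalyzes", "reaction", "in-pathway"])


print(pathways_of_gene(KB, "sdhA"))
```

This fixed chain fits only an enzyme complex; a monomeric enzyme would skip the component-of step.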
Enzymatic-reaction frame stores properties of pairing between enzyme and reaction
[Figure: the same chain (TCA Cycle; Succinate + FAD = fumarate + FADH2; Enzymatic-reaction; Succinate dehydrogenase; Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; sdhA, sdhB, sdhC, sdhD) annotated with example properties: EC#, Keq, Cofactors, Inhibitors, Molecular wt, pI, Left-end-position]
Monofunctional Monomer
[Figure: Pathway; Reaction; Enzymatic-reaction; Monomer; Gene]
Bifunctional Monomer
[Figure: Pathway; two Reactions; two Enzymatic-reactions; one Monomer; one Gene]
Monofunctional Multimer
[Figure: Pathway; Reaction; Enzymatic-reaction; Multimer; four Monomers; four Genes]
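The three cases above differ only in how many enzymatic-reactions a protein has and whether it has component monomers. A small classifier makes that explicit. The slot names "catalyzes" and "components" follow the slides, but the function, the labels it returns, and the example frames are assumptions for illustration.

```python
def classify(kb, protein):
    """Label a protein by number of enzymatic-reactions and subunit structure."""
    n_reactions = len(kb[protein].get("catalyzes", []))
    structure = "multimer" if kb[protein].get("components") else "monomer"
    if n_reactions == 0:
        function = "non-enzymatic"
    elif n_reactions == 1:
        function = "monofunctional"
    elif n_reactions == 2:
        function = "bifunctional"
    else:
        function = "multifunctional"
    return f"{function} {structure}"


# Hypothetical proteins mirroring the three diagrams.
KB = {
    "prot-1": {"catalyzes": ["enzrxn-1"]},
    "prot-2": {"catalyzes": ["enzrxn-2", "enzrxn-3"]},
    "prot-3": {"catalyzes": ["enzrxn-4"], "components": ["m1", "m2", "m3", "m4"]},
}

for name in KB:
    print(name, classify(KB, name))
```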
Pathway and Substrates
[Figure: a Pathway frame linked by in-pathway to several Reaction frames; one Reaction is shown with slot left holding Reactant-1 and Reactant-2 and slot right holding Product-1 and Product-2]
Genetic Network Representation
Describe biological entities involved in control of transcription initiation
 Promoters, operators, transcription factors, operons, terminators
Describe molecular interactions among these entities
 Modulation of transcription factor activity
 Binding of transcription factors to DNA binding sites
 Effects on transcription initiation
Ontology for Transcriptional Regulation
One DB object defined for each biological entity and for each molecular interaction
[Figure: trp operon example with objects trp, apoTrpR, Complexation reaction, TrpR*trp, RpoSig70, site001, pro001, Int001, Int002, transcription unit trpLEDCBA, and genes trpL, trpE, trpD, trpC, trpB, trpA]
Int001 (binding of TrpR*trp to site001) inhibits Int002 (binding of RNA Polymerase to promoter) and consequently prevents transcription of genes in transcription unit.
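Treating each interaction as its own object makes the inhibition rule easy to evaluate. The sketch below reuses the object names from the slide, but the record fields, the association of RpoSig70 and pro001 with Int002, and the single-pass evaluation are simplifying assumptions.

```python
# One record per molecular interaction, named as on the slide.
INTERACTIONS = {
    "Int001": {"binder": "TrpR*trp", "target": "site001", "inhibits": ["Int002"]},
    "Int002": {"binder": "RpoSig70", "target": "pro001", "inhibits": []},
}


def occurring_interactions(interactions, available):
    """Interactions whose binder is available and that no other candidate inhibits.

    Single pass only: an inhibitor that is itself inhibited is not handled.
    """
    candidates = {name for name, i in interactions.items() if i["binder"] in available}
    blocked = {t for name in candidates for t in interactions[name]["inhibits"]}
    return candidates - blocked


def is_transcribed(interactions, available, polymerase_binding="Int002"):
    """The transcription unit is transcribed if polymerase binding can occur."""
    return polymerase_binding in occurring_interactions(interactions, available)


# Repressor complex present: transcription is prevented.
print(is_transcribed(INTERACTIONS, {"TrpR*trp", "RpoSig70"}))
# Repressor complex absent: polymerase can bind the promoter.
print(is_transcribed(INTERACTIONS, {"RpoSig70"}))
```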
Principal Classes
Class names are capitalized, plural
Genetic-Elements, with subclasses:
 Chromosomes
 Plasmids
Genes
Transcription-Units
RNAs
Proteins, with subclasses:
 Polypeptides
 Protein-Complexes
Principal Classes
Reactions, with subclasses:
 Transport-Reactions
 Enzymatic-Reactions
 Pathways
 Compounds-And-Elements
Slots in Multiple Classes
 Common-Name
 Synonyms
 Names (computed as union of Common-Name, Synonyms)
 Comment
 Citations
 DB-Links
Genes Slots
 Chromosome
 Left-End-Position
 Right-End-Position
 Centisome-Position
 Transcription-Direction
 Product
Proteins Slots
 Molecular-Weight-Seq
 Molecular-Weight-Exp
 pI
 Locations
 Modified-Form
 Unmodified-Form
 Component-Of
Polypeptides Slots
 Gene
Protein-Complexes Slots
 Components
Reactions Slots
 EC-Number
 Left, Right
 Substrates (computed as union of Left, Right)
 Enzymatic-Reaction
 DeltaG0
 Spontaneous?
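Computed slots such as Substrates (union of Left and Right) and Names (union of Common-Name and Synonyms) can be derived on demand instead of being stored. A minimal sketch, assuming frames are plain dictionaries; the helper names and the synonym value are hypothetical.

```python
def union(*value_lists):
    """Order-preserving union of several lists of slot values."""
    seen = []
    for values in value_lists:
        for value in values:
            if value not in seen:
                seen.append(value)
    return seen


def substrates(reaction):
    """Substrates slot, computed as the union of Left and Right."""
    return union(reaction.get("Left", []), reaction.get("Right", []))


def names(frame):
    """Names slot, computed as the union of Common-Name and Synonyms."""
    common = frame.get("Common-Name")
    return union([common] if common else [], frame.get("Synonyms", []))


reaction = {"Left": ["succinate", "FAD"], "Right": ["fumarate", "FADH2"]}
enzyme = {"Common-Name": "Succinate dehydrogenase", "Synonyms": ["SDH"]}  # synonym is made up

print(substrates(reaction))
print(names(enzyme))
```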
Enzymatic-Reactions Slots
 Enzyme
 Reaction
 Activators
 Inhibitors
 Physiologically-Relevant
 Cofactors
 Prosthetic-Groups
 Alternative-Substrates
 Alternative-Cofactors
 Reaction-direction
Pathways Slots
 Reaction-List
 Predecessors
 Primaries