PathoLogic - Bioinformatics Research Group at SRI International

advertisement
PathoLogic Pathway Predictor
SRI International
Bioinformatics
Inference of Metabolic Pathways
Annotated Genomic
Sequence
Pathway/Genome
Database
Gene Products
Pathways
Genes/ORFs
DNA Sequences
Multi-organism Pathway
Database (MetaCyc)
Pathways
Reactions
PathoLogic
Software
Integrates genome and
pathway data to identify
putative metabolic
networks
Compounds
Gene Products
Genes
Reactions
Genomic Map
Compounds
PathoLogic Functionality
 Initialize
SRI International
Bioinformatics
schema for new PGDB
 Transform existing genome to PGDB form
 Infer metabolic pathways and store in PGDB
 Infer operons and store in PGDB
 Assemble Overview diagram
 Assist user with manual tasks
 Assign enzymes to reactions they catalyze
 Identify false-positive pathway predictions
 Build protein complexes from monomers
 Infer transport reactions
SRI International
Bioinformatics
PathoLogic Input/Output

Inputs:
 File listing genetic elements





http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
Files containing DNA sequence for each genetic element
Files containing annotation for each genetic element
MetaCyc database
Output:
 Pathway/genome database for the subject organism
 Reports that summarize:


Evidence contained in the input genome for the presence of reference
pathways
Reactions missing from inferred pathways
PathoLogic Analysis Phases




SRI International
Bioinformatics
Trial parsing of input data files [few days]
Initialize schema of new PGDB [3 min]
Create DB objects for replicons, genes, proteins [5 min]
Assign enzymes to reactions they catalyze
 ferrochelatase
[10 min / 1 week]
 glutamate 1-semialdehyde 2,1-aminomutase
 porphobilinogen deaminase
A
E1
B
E2
C
D
E
G
F
PathoLogic Analysis Phases
SRI International
Bioinformatics
 From
assigned reactions, infer what pathways are
present
[5 min / few days]
 Define
metabolic overview diagram
 Define
protein complexes
[30 min]
[few days]
genetic-elements.dat
ID
TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR?
N
ANNOT-FILE
chrom1.pf
SEQ-FILE
chrom1.fsa
//
ID
TEST-CHROM-2
NAME Chromosome 2
CIRCULAR?
N
ANNOT-FILE
/mydata/chrom2.gbk
SEQ-FILE
/mydata/chrom2.fna
//
SRI International
Bioinformatics
SRI International
Bioinformatics
File Naming Conventions
 One
pair of sequence and annotation files for
each genetic element
 Sequence
files: FASTA format
 suffix fsa or fna
 Annotation
file:
 Genbank format: suffix .gbk
 PathoLogic format: suffix .pf
SRI International
Bioinformatics
Typical Problems Using Genbank
Files With PathoLogic
 Wrong
qualifier names used: read PathoLogic
documentation!
 Extraneous
 Check
information in a given qualifier
results of trial parse carefully
GenBank File Format

Accepted feature types:
 CDS, tRNA, rRNA, misc_RNA

Accepted qualifiers:
 /locus_tag
 /gene
 /product
 /EC_number
 /product_comment
 /gene_comment
 /alt_name
 /pseudo

SRI International
Bioinformatics
Unique ID [recm]
Gene name [req]
[req]
[recm]
[opt]
[opt]
Synonyms [opt]
Gene is a pseudogene [opt]
For multifunctional proteins, put each function in a separate
/product line
PathoLogic File Format



SRI International
Bioinformatics
Each record starts with line containing an ID attribute
Tab delimited
Each record ends with a line containing //

One attribute-value pair is allowed per line
 Use multiple FUNCTION lines for multifunctional proteins

Lines starting with ‘;’ are comment lines

Valid attributes are:
 ID, NAME, SYNONYM
 STARTBASE, ENDBASE, GENE-COMMENT
 FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
 DBLINK
 INTRON
PathoLogic File Format
SRI International
Bioinformatics
ID
TP0734
NAME
deoD
STARTBASE
799084
ENDBASE
799785
FUNCTION
purine nucleoside phosphorylase
DBLINK
PID:g3323039
PRODUCT-TYPE
P
GENE-COMMENT
similar to GP:1638807 percent identity:
57.51; identified by sequence similarity; putative
//
ID
TP0735
NAME
gltA
STARTBASE
799867
ENDBASE
801423
FUNCTION
glutamate synthase
DBLINK
PID:g3323040
PRODUCT-TYPE
P
SRI International
Bioinformatics
Before you start:
What to do when an error occurs
Navigator errors are automatically trapped –
debugging information is saved to error.tmp file.
 All other errors (including most PathoLogic
errors) will cause software to drop into the Lisp
debugger
 Unix: error message will show up in the original terminal
window from which you started Pathway Tools.
 Windows: Error message will show up in the Lisp console.
The Lisp console usually starts out iconified – its icon is a
blue bust of Franz Liszt
 2 goals when an error occurs:
 Try to continue working
 Obtain enough information for a bug report to send to
pathway-tools support team.
 Most
The Lisp Debugger

SRI International
Bioinformatics
Sample error (details and number of restart actions differ
for each case)
Error: Received signal number 2 (Keyboard interrupt)
Restart actions (select using :continue):
0: continue computation
1: Return to command level
2: Pathway Tools version 10.0 top level
3: Exit Pathway Tools version 10.0
[1c] EC(2):

To generate debugging information (stack backtrace):
:zoom :count :all

To continue from error, find a restart that takes you to the
top level – in this case, number 2
:cont 2

To exit Pathway Tools:
:exit
How to report an error
 Determine
SRI International
Bioinformatics
if problem is reproducible, and how to
reproduce it (make sure you have all the latest
patches installed)
 Send email to ptools-support@ai.sri.com
containing:
 Pathway Tools version number and platform
 Description of exactly what you were doing (which command
you invoked, what you typed, etc.) or instructions for how to
reproduce the problem
 error.tmp file, if one was generated
 If software breaks into the lisp debugger, the complete error
message and stack backtrace (obtained using the command
:zoom :count :all, as described on previous slide)
SRI International
Bioinformatics
Using the PPP GUI to Create a
Pathway/Genome Database
 Input
Project Information
 Organism -> Create New
SRI International
Bioinformatics
Input Project Information
Next Steps
 Trial
Parse
 Build -> Trial Parse
 Fix any errors in input files
 Build pathway/genome database
 Build -> Automated Build
SRI International
Bioinformatics
SRI International
Bioinformatics
PathoLogic Parser Output
SRI International
Bioinformatics
Assign Enzymes to Reactions
5.1.3.2
Gene
product
MetaCyc
UDP-glucose-4epimerase
Match
yes
no
Probable enzyme
-ase
no
yes
Not a metabolic
enzyme
Assign
UDP-D-glucose  UDP-galactose
Manually
search
no
Can’t Assign
yes
Assign
Enzyme Name Matcher
 Matches
SRI International
Bioinformatics
on full enzyme name
 Match is case-insensitive and removes the
punctuation characters “ -_(){}',:”
 Also matches after removal of prefixes and
suffixes such as:
 “Putative”, “Hypothetical”, etc
 alpha|beta|…|catalytic|inducible chain|subunit|component
 Parenthetical gene name
Enzyme Name Matcher
 For
SRI International
Bioinformatics
names that do not match, software identifies
probable metabolic enzymes as those
 Containing “ase”
 Not containing keywords such as





“sensor kinase”
“topoisomerase”
“protein kinase”
“peptidase”
Etc
 Research
unknown enzymes
 MetaCyc, Swiss-Prot, PubMed
SRI International
Bioinformatics
Enzyme Name to Reaction Mapping
See also file PTools Tutorial/PathoLogic Reports/name-matching-report.txt
SRI International
Bioinformatics
Manual Polishing

Refine -> Assign Probable Enzymes
 Do this first

Refine -> Rescore Pathways
 Redo after assigning enzymes

Refine -> Create Protein Complexes
 Can be done at any time

Refine -> Assign Modified Proteins
 Can be done at any time

Refine -> Transport Identification Parser  Can be done at any time

Refine -> Pathway Hole Filler

Refine -> Predict Transcription Units

Refine -> Update Overview  Do this last, and repeat after any material
changes to PGDB
Assign Probable Enzymes
SRI International
Bioinformatics
SRI International
Bioinformatics
How to find reactions for probable
enzymes
 First,
verify that enzyme name describes a
specific, metabolic function
 Search for fragment of name in MetaCyc – you
may be able to find a match that PathoLogic
missed
 Look up protein in SwissProt or other DBs
 Search for gene name in PGDB for related
organism (bear in mind that gene names are not
reliable indicators of function, so check carefully)
 Search for function name in PubMed
 Other…
Manual Polishing

Refine -> Assign Probable Enzymes

Refine -> Rescore Pathways

Refine -> Create Protein Complexes

Refine -> Assign Modified Proteins

Refine -> Transport Identification Parser

Refine -> Pathway Hole Filler

Refine -> Predict Transcription Units

Refine -> Run Consistency Checker

Refine -> Update Overview
SRI International
Bioinformatics
Automated Pathway Inference
 All
SRI International
Bioinformatics
pathways in MetaCyc for which there is at
least one enzyme identified in the target organism
are considered for possible inclusion.
 Algorithm errs on side of inclusivity – easier to
manually delete a pathway from an organism than
to find a pathway that should have been predicted
but wasn’t.
SRI International
Bioinformatics
Considerations taken into account when
deciding whether or not a pathway should
be inferred:




Is there a unique enzyme – an enzyme not involved in any
other pathway?
Does the organism fall in the expected taxonomic domain of
the pathway?
Is this pathway part of a variant set, and, if so, is there more
evidence for some other variant?
If there is no unique enzyme:
 Is there evidence for more than one enzyme?
 If a biosynthetic pathway, is there evidence for final reaction(s)?
 If a degradation pathway, is there evidence for initial reaction(s)?
 If an energy metabolism pathway, is there evidence for more than half the
reactions?
SRI International
Bioinformatics
Assigning Evidence Scores to
Predicted Pathways
 X|Y|Z
denotes score for P in O
 where:



X = total number of reactions in P
Y = enzymes catalyzing number of reactions for which there is
evidence in O
Z = number of Y reactions that are used in other pathways in O
Manual Pruning of Pathways
SRI International
Bioinformatics

Use pathway evidence report
 Coloring scheme aids in assessing pathway evidence

Phase I: Prune extra variant pathways

Rescore pathways, re-generate pathway evidence report

Phase II: Prune pathways unlikely to be present
 No/few unique enzymes
 Most pathway steps present because they are used in another pathway
 Pathway very unlikely to be present in this organism
 Nonspecific enzyme name assigned to a pathway step
Caveats
 Cannot
predict pathways not present in MetaCyc
 Evidence
 Since
SRI International
Bioinformatics
for short pathways is hard to interpret
many reactions occur in multiple pathways,
some false positives
Output from PPP
 Pathway/genome
SRI International
Bioinformatics
database
 Summary
pages
 Pathway evidence page


Click “Summary of Organisms”, then click organism name, then click
“Pathway Evidence”, then click “Save Pathway Report”
Missing enzymes report
 Directory
etc.
tree containing sequence files, reports,
SRI International
Bioinformatics
Resulting Directory Structure

ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
 input






reports



ORGIDbase.ocelot
data


name-matching-report.txt
trial-parse-report.txt
kb


organism.dat
organism-init.dat
genetic-elements.dat
annotation files
sequence files
overview.graph
released -> VERSION
Manual Polishing

Refine -> Assign Probable Enzymes

Refine -> Rescore Pathways

Refine -> Create Protein Complexes

Refine -> Assign Modified Proteins

Refine -> Transport Identification Parser

Refine -> Pathway Hole Filler

Refine -> Predict Transcription Units

Refine -> Run Consistency Checker

Refine -> Update Overview
SRI International
Bioinformatics
SRI International
Bioinformatics
Creating Protein Complexes
SRI International
Bioinformatics
Complex Subunits Stoichiometries
Manual Polishing

Refine -> Assign Probable Enzymes

Refine -> Re-run Name Matcher

Refine -> Create Protein Complexes

Refine -> Assign Modified Proteins

Refine -> Transport Identification Parser

Refine -> Pathway Hole Filler

Refine -> Predict Transcription Units

Refine -> Run Consistency Checker

Refine -> Update Overview
SRI International
Bioinformatics
SRI International
Bioinformatics
Proteins as Reaction Substrates
Manual polishing

Refine -> Assign Probable Enzymes

Refine -> Rescore Pathways

Refine -> Create Protein Complexes

Refine -> Assign Modified Proteins

Refine -> Transport Identification Parser

Refine -> Pathway Hole Filler

Refine -> Predict Transcription Units

Refine -> Run Consistency Checker

Refine -> Update Overview
SRI International
Bioinformatics
Nomenclature
SRI International
Bioinformatics
• WO pair = pair of genes within an operon
• TUB pair = pair of genes at a transcription unit boundary
(delineate operons)
SRI International
Bioinformatics
Operation of the operon predictor


For each contiguous gene pair, predict whether gene pairs
are within the same operon or at a transcription unit
boundary
Use pairwise predictions to identify potential operons
AB = TUB pair
BC = WO pair
CD = WO pair
DE = TUB pair
A
operon = BCD
B
C
D
E
Operon predictor
SRI International
Bioinformatics

Predicts operon gene pairs based on:
 intergenic distance between genes
 genes in the same functional class

Typically used for operon prediction
 We use method from Salgado et al, PNAS (2000) as a
starting point.


Uses E. coli experimentally verified data as a training set.
Compute log likelihood of two genes being WO or TUB pair based
on intergenic distance.
Operon predictor
SRI International
Bioinformatics
Additional features easily computed from a PGDB
1.
both genes products enzymes in the same metabolic
pathway
2.
both gene products monomers in the same protein
complex
3.
one gene product transports a substrate for a metabolic
pathway in which the other gene product is involved as an
enzyme
4.
a gene upstream or downstream from the gene pair (and
within the same directon) is related to either one of the
genes in the pair as per features 1, 2 and 3 above.
Download