PathoLogic Pathway Predictor - Bioinformatics Research Group at

PathoLogic Pathway Predictor
SRI International
Bioinformatics
Inference of Metabolic Pathways
[Diagram: an Annotated Genomic Sequence (genes/ORFs, DNA sequences, gene products) and the Multi-organism Pathway Database, MetaCyc (pathways, reactions, compounds), feed into the PathoLogic software, which integrates genome and pathway data to identify putative metabolic networks. The result is a Pathway/Genome Database containing pathways, reactions, compounds, gene products, genes, and a genomic map.]
PathoLogic Functionality
- Initialize schema for new PGDB
- Transform existing genome to PGDB form
- Infer metabolic pathways and store in PGDB
- Infer operons and store in PGDB
- Assemble Overview diagram
- Assist user with manual tasks
- Assign enzymes to reactions they catalyze
- Identify false-positive pathway predictions
- Build protein complexes from monomers
- Infer transport reactions
- Fill pathway holes
PathoLogic Input/Output

Inputs:
- List of all genetic elements
  - Enter using GUI or provide a file
- Files containing annotation for each genetic element
- Files containing DNA sequence for each genetic element
- MetaCyc database
Output:
- Pathway/genome database for the subject organism
- Reports that summarize:
  - Evidence in the input genome for the presence of reference pathways
  - Reactions missing from inferred pathways
File Naming Conventions
- One pair of sequence and annotation files for each genetic element
- Sequence files: FASTA format
  - suffix .fsa or .fna
- Annotation file:
  - GenBank format: suffix .gbk
  - PathoLogic format: suffix .pf
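A minimal sketch of how the naming convention could be checked before a build. It assumes that the two files for a genetic element share a file stem (as in chrom1.pf and chrom1.fsa); that pairing rule and the function name `pair_input_files` are illustrative assumptions, not part of PathoLogic.

```python
from pathlib import Path

SEQUENCE_SUFFIXES = {".fsa", ".fna"}    # FASTA sequence files
ANNOTATION_SUFFIXES = {".gbk", ".pf"}   # GenBank or PathoLogic annotation files


def pair_input_files(filenames):
    """Group input files by stem (assumed pairing rule) into annotation/sequence pairs."""
    pairs = {}
    for name in filenames:
        path = Path(name)
        suffix = path.suffix.lower()
        if suffix in ANNOTATION_SUFFIXES:
            kind = "annotation"
        elif suffix in SEQUENCE_SUFFIXES:
            kind = "sequence"
        else:
            continue  # ignore files that follow neither convention
        entry = pairs.setdefault(path.stem, {"annotation": None, "sequence": None})
        entry[kind] = path.name
    return pairs


if __name__ == "__main__":
    for stem, entry in pair_input_files(["chrom1.pf", "chrom1.fsa", "chrom2.gbk"]).items():
        print(stem, entry)
```

An entry whose "annotation" value is None points at a sequence file with no matching annotation file, which is worth fixing before running a trial parse.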
Typical Problems Using GenBank Files With PathoLogic
- Wrong qualifier names used: read PathoLogic documentation!
- Extraneous information in a given qualifier
- Check results of trial parse carefully
GenBank File Format



Accepted feature types:
- CDS, tRNA, rRNA, misc_RNA
Accepted qualifiers:
  /locus_tag                                Unique ID              [recm]
  /gene                                     Gene name              [req]
  /product                                                         [req]
  /EC_number                                                       [recm]
  /product_comment                                                 [opt]
  /gene_comment                                                    [opt]
  /alt_name                                 Synonyms               [opt]
  /pseudo                                   Gene is a pseudogene   [opt]
  /db_xref                                  DB:AccessionID         [opt]
  /go_component, /go_function, /go_process  GO terms               [opt]
For multifunctional proteins, put each function in a separate /product line
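The qualifier list lends itself to a quick pre-flight check. The sketch below reads [req], [recm] and [opt] as required, recommended and optional, which is an interpretation of the slide's abbreviations. The function `check_feature` is hypothetical, and how PathoLogic itself treats unrecognized qualifiers is not stated here, so the sketch only reports them.

```python
ACCEPTED_FEATURES = {"CDS", "tRNA", "rRNA", "misc_RNA"}
REQUIRED = {"gene", "product"}               # marked [req]
RECOMMENDED = {"locus_tag", "EC_number"}     # marked [recm]
OPTIONAL = {                                 # marked [opt]
    "product_comment", "gene_comment", "alt_name", "pseudo", "db_xref",
    "go_component", "go_function", "go_process",
}


def check_feature(feature_type, qualifiers):
    """Return a list of messages describing problems with one GenBank feature.

    `qualifiers` maps qualifier names (without the leading slash) to values.
    """
    if feature_type not in ACCEPTED_FEATURES:
        return [f"feature type {feature_type!r} is not accepted"]
    present = set(qualifiers)
    messages = []
    for name in sorted(REQUIRED - present):
        messages.append(f"missing required qualifier /{name}")
    for name in sorted(RECOMMENDED - present):
        messages.append(f"missing recommended qualifier /{name}")
    for name in sorted(present - REQUIRED - RECOMMENDED - OPTIONAL):
        messages.append(f"unrecognized qualifier /{name}")
    return messages


if __name__ == "__main__":
    feature = {"gene": "deoD", "product": "purine nucleoside phosphorylase", "note": "x"}
    for message in check_feature("CDS", feature):
        print(message)
```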
PathoLogic File Format



- Each record starts with line containing an ID attribute
- Tab delimited
- Each record ends with a line containing //
- One attribute-value pair is allowed per line
  - Use multiple FUNCTION lines for multifunctional proteins
- Lines starting with ‘;’ are comment lines
- Valid attributes are:
  - ID, NAME, SYNONYM
  - STARTBASE, ENDBASE, GENE-COMMENT
  - FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  - DBLINK
  - GO
  - INTRON
PathoLogic File Format
ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
PRODUCT-TYPE	P
GO	glutamate synthase (NADPH) activity [goid 0004355] [evidence IDA] [pmid 4565085]
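The format rules (tab-delimited attribute-value pairs, `//` record terminators, `;` comment lines, repeated FUNCTION lines) can be illustrated with a small reader. This is a sketch for inspecting your own files, not PathoLogic's parser; the function name `parse_pf` and the choice to tolerate a missing final `//` are assumptions.

```python
def parse_pf(text):
    """Parse PathoLogic-format text into a list of records.

    Each record is a dict mapping an attribute name to the list of its values,
    so repeated attributes such as FUNCTION are preserved.
    """
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue  # skip blank lines and comment lines
        if line.strip() == "//":
            if current:
                records.append(current)
            current = {}
            continue
        # Split on the first tab only; the value may itself contain spaces.
        attribute, _, value = line.partition("\t")
        current.setdefault(attribute.strip(), []).append(value.strip())
    if current:  # assumption: accept a last record that lacks its closing //
        records.append(current)
    return records


if __name__ == "__main__":
    sample = (
        "ID\tTP0734\n"
        "NAME\tdeoD\n"
        "FUNCTION\tpurine nucleoside phosphorylase\n"
        "//\n"
    )
    for record in parse_pf(sample):
        print(record["ID"][0], record["FUNCTION"])
```

Keeping every value in a list avoids silently dropping the second FUNCTION line of a multifunctional protein.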
Before you start: What to do when an error occurs
- Most Navigator errors are automatically trapped; debugging information is saved to error.tmp file.
- All other errors (including most PathoLogic errors) will cause software to drop into the Lisp debugger
  - Unix: error message will show up in the original terminal window from which you started Pathway Tools.
  - Windows: Error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt
- 2 goals when an error occurs:
  - Try to continue working
  - Obtain enough information for a bug report to send to pathway-tools support team.
The Lisp Debugger

Sample error (details and number of restart actions differ
for each case)
Error: Received signal number 2 (Keyboard interrupt)
Restart actions (select using :continue):
0: continue computation
1: Return to command level
2: Pathway Tools version 10.0 top level
3: Exit Pathway Tools version 10.0
[1c] EC(2):

To generate debugging information (stack backtrace):
:zoom :count :all

To continue from error, find a restart that takes you to the
top level – in this case, number 2
:cont 2

To exit Pathway Tools:
:exit
How to report an error
- Determine if problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
- Send email to ptools-support@ai.sri.com containing:
  - Pathway Tools version number and platform
  - Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
  - error.tmp file, if one was generated
  - If software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on previous slide)
Using the PPP GUI to Create a Pathway/Genome Database
- Input Project Information
  - Organism -> Create New
  - Creates directory structure for new PGDB
  - Creates and saves empty PGDB, populated only with objects common to all PGDBs (schema classes, elements, etc.) and data you entered in the form.
  - Offers to invoke Replicon Editor
Input Project Information
Enter Replicon Information
- For each replicon
  - Name
  - Type: chromosome, plasmid, etc.
  - Circular?
  - Annotation file
  - Sequence file (optional)
  - Contigs (optional)
  - Links to other DBs (optional)
- GUI-Based entry
  - Build -> Specify Replicons
- File-Based Entry
  - Create genetic-elements.dat file using template provided
GUI-Based Replicon Entry
Batch Entry of Replicon Info
File /<orgid>cyc/<version>/input/genetic-elements.dat:
ID	TEST-CHROM-1
NAME	Chromosome 1
TYPE	:CHRSM
CIRCULAR?	N
ANNOT-FILE	chrom1.pf
SEQ-FILE	chrom1.fsa
//
ID	TEST-CHROM-2
NAME	Chromosome 2
CIRCULAR?	N
ANNOT-FILE	/mydata/chrom2.gbk
SEQ-FILE	/mydata/chrom2.fna
//
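When there are many replicons, the file can be generated rather than typed. The sketch below assumes genetic-elements.dat uses the same tab-delimited layout as the PathoLogic annotation format and uses only the attribute names shown in the sample; `format_genetic_elements` is a hypothetical helper, and the template shipped with the software remains the reference for the full attribute list.

```python
def format_genetic_elements(elements):
    """Render replicon descriptions as attribute/value records ending in //.

    `elements` is a list of dicts; keys are attribute names and are written
    in insertion order, separated from their value by a tab (assumed delimiter).
    """
    lines = []
    for element in elements:
        for attribute, value in element.items():
            lines.append(f"{attribute}\t{value}")
        lines.append("//")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    replicons = [
        {"ID": "TEST-CHROM-1", "NAME": "Chromosome 1", "TYPE": ":CHRSM",
         "CIRCULAR?": "N", "ANNOT-FILE": "chrom1.pf", "SEQ-FILE": "chrom1.fsa"},
        {"ID": "TEST-CHROM-2", "NAME": "Chromosome 2",
         "CIRCULAR?": "N", "ANNOT-FILE": "/mydata/chrom2.gbk",
         "SEQ-FILE": "/mydata/chrom2.fna"},
    ]
    print(format_genetic_elements(replicons), end="")
```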
Specify Reference PGDB(s)
- This step is optional, and most users will omit it
- MetaCyc is always the primary reference PGDB
- Specify additional reference PGDB if you have your own curated PGDB which has:
  - Pathways and/or reactions that are not in MetaCyc
  - Manual functional assignments, with names similar to current genome
- There is no point specifying any of our PGDBs as references, only your own curated PGDBs.
Building the PGDB
- Trial Parse
  - Build -> Trial Parse
  - Check output to ensure numbers “look right”
    - Same number of gene start positions, end positions, names
    - Did my file contain EC numbers? Were they detected?
    - Did my file contain RNAs? Were they detected?
  - Fix any errors in input files
- Build pathway/genome database
  - Build -> Automated Build
PathoLogic Parser Output
Automated Build
- Parses input files
- Creates objects for every gene and gene product
- Uses EC numbers, GO annotations and name matcher to match enzymes to reactions in MetaCyc
- Imports catalyzed reactions and compounds from MetaCyc
- Generates list of likely enzymes that couldn’t be assigned
- Infers pathways likely to be present
- Generates Cellular Overview Diagram (first pass)
- Generates reports
Matching Enzymes to Reactions
- Matches on full EC number (partial ECs ignored)
- Matches on Molecular Function GO terms
  - If definition of GO term includes cross-reference either to an EC number or to a MetaCyc reaction.
- Matches on full enzyme name
  - Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
  - Also matches after removal of prefixes and suffixes such as:
    - “Putative”, “Hypothetical”, etc
    - alpha|beta|…|catalytic|inducible chain|subunit|component
    - Parenthetical gene name
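The name-matching rules can be sketched as a normalization step that produces comparison keys. This is an illustration under stated assumptions, not the actual matcher: only the prefixes and qualifiers named on the slide are handled (the slide's "etc" and "…" are not filled in), and the names `squash` and `candidate_keys` are hypothetical.

```python
import re

PUNCTUATION = " -_(){}',:"                    # characters removed before comparing
PREFIXES = ("putative", "hypothetical")       # only the prefixes named on the slide
# Optional qualifier followed by chain/subunit/component at the end of the name.
SUFFIX_RE = re.compile(
    r"[\s,]*\b(?:(?:alpha|beta|catalytic|inducible)\s+)?(?:chain|subunit|component)\s*$",
    re.IGNORECASE,
)
PAREN_RE = re.compile(r"\s*\([^()]*\)\s*$")   # trailing parenthetical, e.g. a gene name


def squash(name):
    """Lower-case a name and delete the listed punctuation characters."""
    return name.lower().translate(str.maketrans("", "", PUNCTUATION))


def candidate_keys(name):
    """Yield the plain key first, then the key after prefix/suffix removal."""
    yield squash(name)
    stripped = PAREN_RE.sub("", name)
    stripped = SUFFIX_RE.sub("", stripped)
    words = stripped.split()
    while words and words[0].lower().strip(",") in PREFIXES:
        words.pop(0)
    yield squash(" ".join(words))


if __name__ == "__main__":
    for key in candidate_keys("Putative purine-nucleoside phosphorylase, beta subunit (deoD)"):
        print(key)
```

Two names would then be treated as matching when any of their keys coincide; comparing keys instead of raw strings is what makes the match insensitive to case and punctuation.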
Enzyme Name Matcher
- For names that do not match, software identifies probable metabolic enzymes as those
  - Containing “ase”
  - Not containing keywords such as
    - “sensor kinase”
    - “topoisomerase”
    - “protein kinase”
    - “peptidase”
    - Etc
- User should research unknown enzymes
  - MetaCyc, Swiss-Prot, PubMed
Stored in ORGIDcyc/VERSION/reports/name-matching-report.txt
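The "probable enzyme" heuristic is simple enough to sketch directly. The keyword list below contains only the four examples given on the slide (the real list is longer, per the "Etc"), and case-insensitive comparison is an assumption; `is_probable_metabolic_enzyme` is a hypothetical name.

```python
# Only the exclusion keywords quoted on the slide; the full list is not given.
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_metabolic_enzyme(name):
    """Return True if an unmatched product name looks like a metabolic enzyme."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)


if __name__ == "__main__":
    for name in ("purine nucleoside phosphorylase", "DNA topoisomerase I",
                 "hypothetical protein"):
        print(name, "->", is_probable_metabolic_enzyme(name))
```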
Automated Pathway Inference
- All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion.
- Algorithm errs on side of inclusivity: easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn’t.
Considerations taken into account when deciding whether or not a pathway should be inferred:
- Is there a unique enzyme, that is, an enzyme not involved in any other pathway?
- Does the organism fall in the expected taxonomic domain of the pathway?
- Is this pathway part of a variant set, and, if so, is there more evidence for some other variant?
- If there is no unique enzyme:
  - Is there evidence for more than one enzyme?
  - If a biosynthetic pathway, is there evidence for final reaction(s)?
  - If a degradation pathway, is there evidence for initial reaction(s)?
  - If an energy metabolism pathway, is there evidence for more than half the reactions?
Assigning Evidence Scores to Predicted Pathways
- X|Y|Z denotes score for pathway P in organism O
- where:
  - X = total number of reactions in P
  - Y = number of reactions in P for which there is evidence of a catalyzing enzyme in O
  - Z = number of Y reactions that are used in other pathways in O
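A short worked sketch of the X|Y|Z score. The data structures are assumptions made for illustration: each pathway is represented as a set of reaction identifiers, and enzyme evidence as a set of reactions with at least one assigned enzyme. The function name `evidence_score` is hypothetical.

```python
def evidence_score(pathway, reactions_with_enzymes, pathway_reactions):
    """Return the (X, Y, Z) score for one pathway.

    pathway: name of the pathway P being scored
    reactions_with_enzymes: set of reactions with enzyme evidence in organism O
    pathway_reactions: dict mapping each pathway considered in O to its reaction set
    """
    reactions = pathway_reactions[pathway]
    supported = reactions & reactions_with_enzymes
    # Reactions that belong to any pathway other than P.
    elsewhere = set()
    for other, other_reactions in pathway_reactions.items():
        if other != pathway:
            elsewhere |= other_reactions
    x = len(reactions)               # total reactions in P
    y = len(supported)               # reactions in P with enzyme evidence
    z = len(supported & elsewhere)   # of those, reactions shared with other pathways
    return x, y, z


if __name__ == "__main__":
    pathways = {"P1": {"r1", "r2", "r3"}, "P2": {"r3", "r4"}}
    evidence = {"r1", "r3"}
    print("|".join(str(n) for n in evidence_score("P1", evidence, pathways)))
```

In this toy input P1 has three reactions, two with enzyme evidence, one of which (r3) also occurs in P2. A pathway where Y equals Z has no evidence that is unique to it.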
Pathway Evidence Report
- On Organism Summary Page in Navigator, button “Generate Pathway Evidence Report”
- Report saved as HTML file, view in browser
- Hierarchical listing of all inferred pathways
- “Pathway Glyph” shows evidence graphically
  - Steps with/without enzymes (green/black)
  - Steps that are unique to pathway (orange)
  - Steps filled by Pathway Hole Filler (blue)
- Counts reactions in pathway, with evidence, in other pathways
- Lists other pathways that share reactions
- Link to pathway in MetaCyc
Manual Pruning of Pathways
- Use pathway evidence report
  - Coloring scheme aids in assessing pathway evidence
- Phase I: Prune extra variant pathways
- Rescore pathways, re-generate pathway evidence report
- Phase II: Prune pathways unlikely to be present
  - No/few unique enzymes
  - Most pathway steps present because they are used in another pathway
  - Pathway very unlikely to be present in this organism
  - Nonspecific enzyme name assigned to a pathway step
Caveats
- Cannot predict pathways not present in MetaCyc
- Evidence for short pathways is hard to interpret
- Since many reactions occur in multiple pathways, some false positives
- Next generation pathway inference algorithm is work currently in progress!
Output from PPP
- Pathway/genome database
- Summary pages
  - Pathway evidence page
    - Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
  - Missing enzymes report
- Directory tree containing sequence files, reports, etc.
Resulting Directory Structure

ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
  input
    organism.dat
    organism-init.dat
    genetic-elements.dat
    annotation files
    sequence files
  reports
    name-matching-report.txt
    trial-parse-report.txt
  kb
    ORGIDbase.ocelot
  data
    overview.graph
released -> VERSION
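After a build it can be useful to confirm that the expected files exist. The sketch below assumes the following placement under the VERSION directory: organism.dat, organism-init.dat and genetic-elements.dat in input, the two report files in reports, and overview.graph in data. The reports location of name-matching-report.txt is stated elsewhere in these slides; the other placements are an assumed reading of the slide, so adjust `EXPECTED` to your installation. `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed layout relative to ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
EXPECTED = [
    "input/organism.dat",
    "input/organism-init.dat",
    "input/genetic-elements.dat",
    "reports/name-matching-report.txt",
    "reports/trial-parse-report.txt",
    "data/overview.graph",
]


def missing_files(version_dir):
    """Return the expected relative paths that do not exist under version_dir."""
    root = Path(version_dir)
    return [relative for relative in EXPECTED if not (root / relative).exists()]


if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for relative in missing_files(target):
        print("missing:", relative)
```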
Manual Polishing

- Refine -> Assign Probable Enzymes
  - Do this first
- Refine -> Rescore Pathways
  - Redo after assigning enzymes
- Refine -> Create Protein Complexes
  - Can be done at any time
- Refine -> Assign Modified Proteins
  - Can be done at any time
- Refine -> Transport Identification Parser
  - Can be done at any time
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Update Overview
  - Do this last, and repeat after any material changes to PGDB
Assign Probable Enzymes
How to find reactions for probable enzymes
- First, verify that enzyme name describes a specific, metabolic function
- Search for fragment of name in MetaCyc; you may be able to find a match that PathoLogic missed
- Look up protein in UniProt or other DBs
- Search for gene name in PGDB for related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
- Search for function name in PubMed
- Other…