PathoLogic Pathway Predictor
SRI International Bioinformatics

Inference of Metabolic Pathways
[Diagram: an annotated genomic sequence (genes/ORFs, gene products, DNA sequences) and the multi-organism pathway database MetaCyc (pathways, reactions, compounds) feed into the PathoLogic software, which integrates genome and pathway data to identify putative metabolic networks and produces a Pathway/Genome Database (pathways, reactions, compounds, gene products, genes, genomic map).]

PathoLogic Functionality
- Initialize schema for new PGDB
- Transform existing genome to PGDB form
- Infer metabolic pathways and store in PGDB
- Infer operons and store in PGDB
- Assemble Overview diagram
- Assist user with manual tasks
  - Assign enzymes to reactions they catalyze
  - Identify false-positive pathway predictions
  - Build protein complexes from monomers
  - Infer transport reactions
  - Fill pathway holes

PathoLogic Input/Output
- Inputs:
  - List of all genetic elements
    - Enter using GUI or provide a file
  - Files containing annotation for each genetic element
  - Files containing DNA sequence for each genetic element
  - MetaCyc database
- Output:
  - Pathway/genome database for the subject organism
  - Reports that summarize:
    - Evidence in the input genome for the presence of reference pathways
    - Reactions missing from inferred pathways

File Naming Conventions
- One pair of sequence and annotation files for each genetic element
- Sequence files: FASTA format, suffix .fsa or .fna
- Annotation files:
  - GenBank format: suffix .gbk
  - PathoLogic format: suffix .pf

Typical Problems Using GenBank Files With PathoLogic
- Wrong qualifier names used: read PathoLogic documentation!
- Extraneous information in a given qualifier
- Check results of trial parse carefully

GenBank File Format
- Accepted feature types: CDS, tRNA, rRNA, misc_RNA
- Accepted qualifiers ([req] = required, [recm] = recommended, [opt] = optional):
    /locus_tag                                 Unique ID              [recm]
    /gene                                      Gene name              [req]
    /product                                                          [req]
    /EC_number                                                        [recm]
    /product_comment                                                  [opt]
    /gene_comment                                                     [opt]
    /alt_name                                  Synonyms               [opt]
    /pseudo                                    Gene is a pseudogene   [opt]
    /db_xref                                   DB:AccessionID         [opt]
    /go_component, /go_function, /go_process   GO terms               [opt]
- For multifunctional proteins, put each function in a separate /product line

PathoLogic File Format
- Each record starts with a line containing an ID attribute
- Tab delimited
- Each record ends with a line containing //
- One attribute-value pair is allowed per line
- Use multiple FUNCTION lines for multifunctional proteins
- Lines starting with ';' are comment lines
- Valid attributes are:
  - ID, NAME, SYNONYM
  - STARTBASE, ENDBASE, GENE-COMMENT
  - FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  - DBLINK
  - GO
  - INTRON

PathoLogic File Format (example)
    ID	TP0734
    NAME	deoD
    STARTBASE	799084
    ENDBASE	799785
    FUNCTION	purine nucleoside phosphorylase
    DBLINK	PID:g3323039
    PRODUCT-TYPE	P
    GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
    //
    ID	TP0735
    NAME	gltA
    STARTBASE	799867
    ENDBASE	801423
    FUNCTION	glutamate synthase
    DBLINK	PID:g3323040
    PRODUCT-TYPE	P
    GO	glutamate synthase (NADPH) activity [goid 0004355] [evidence IDA] [pmid 4565085]
    //

Before you start: What to do when an error occurs
- Navigator errors are automatically trapped; debugging information is saved to the error.tmp file.
- All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger
  - Unix: error message will show up in the original terminal window from which you started Pathway Tools.
  - Windows: error message will show up in the Lisp console.
    - The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt
- Two goals when an error occurs:
  - Try to continue working
  - Obtain enough information for a bug report to send to the Pathway Tools support team.

The Lisp Debugger
- Sample error (details and number of restart actions differ for each case):
    Error: Received signal number 2 (Keyboard interrupt)
    Restart actions (select using :continue):
     0: continue computation
     1: Return to command level
     2: Pathway Tools version 10.0 top level
     3: Exit Pathway Tools version 10.0
    [1c] EC(2):
- To generate debugging information (stack backtrace):
    :zoom :count :all
- To continue from error, find a restart that takes you to the top level; in this case, number 2:
    :cont 2
- To exit Pathway Tools:
    :exit

How to report an error
- Determine if problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
- Send email to ptools-support@ai.sri.com containing:
  - Pathway Tools version number and platform
  - Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
  - error.tmp file, if one was generated
  - If software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on the previous slide)

Using the PPP GUI to Create a Pathway/Genome Database

Input Project Information
- Organism -> Create New
- Creates directory structure for new PGDB
- Creates and saves empty PGDB, populated only with objects common to all PGDBs (schema classes, elements, etc.) and data you entered in the form.
- Offers to invoke Replicon Editor

Enter Replicon Information
- For each replicon:
  - Name
  - Type: chromosome, plasmid, etc.
  - Circular?
  - Annotation file
  - Sequence file (optional)
  - Contigs (optional)
  - Links to other DBs (optional)
- GUI-based entry
  - Build -> Specify Replicons
- File-based entry
  - Create genetic-elements.dat file using template provided

[Screenshot: GUI-Based Replicon Entry]

Batch Entry of Replicon Info
File /<orgid>cyc/<version>/input/genetic-elements.dat:
    ID	TEST-CHROM-1
    NAME	Chromosome 1
    TYPE	:CHRSM
    CIRCULAR?	N
    ANNOT-FILE	chrom1.pf
    SEQ-FILE	chrom1.fsa
    //
    ID	TEST-CHROM-2
    NAME	Chromosome 2
    CIRCULAR?	N
    ANNOT-FILE	/mydata/chrom2.gbk
    SEQ-FILE	/mydata/chrom2.fna
    //

Specify Reference PGDB(s)
- This step is optional, and most users will omit it
- MetaCyc is always the primary reference PGDB
- Specify an additional reference PGDB if you have your own curated PGDB which has:
  - Pathways and/or reactions that are not in MetaCyc
  - Manual functional assignments, with names similar to current genome
- There is no point specifying any of our PGDBs as references, only your own curated PGDBs.

Building the PGDB
- Trial Parse
  - Build -> Trial Parse
  - Check output to ensure numbers "look right"
    - Same number of gene start positions, end positions, names
    - Did my file contain EC numbers? Were they detected?
    - Did my file contain RNAs? Were they detected?
  - Fix any errors in input files
- Build pathway/genome database
  - Build -> Automated Build

[Screenshot: PathoLogic Parser Output]

Automated Build
- Parses input files
- Creates objects for every gene and gene product
- Uses EC numbers, GO annotations and name matcher to match enzymes to reactions in MetaCyc
- Imports catalyzed reactions and compounds from MetaCyc
- Generates list of likely enzymes that couldn't be assigned
- Infers pathways likely to be present
- Generates Cellular Overview Diagram (first pass)
- Generates reports

Matching Enzymes to Reactions
- Matches on full EC number (partial ECs ignored)
- Matches on Molecular Function GO terms
  - Only if the definition of the GO term includes a cross-reference either to an EC number or to a MetaCyc reaction.
- Matches on full enzyme name
  - Match is case-insensitive and removes the punctuation characters " -_(){}',:"
  - Also matches after removal of prefixes and suffixes such as:
    - "Putative", "Hypothetical", etc.
    - alpha|beta|…|catalytic|inducible followed by chain|subunit|component
    - Parenthetical gene name

Enzyme Name Matcher
- For names that do not match, software identifies probable metabolic enzymes as those
  - Containing "ase"
  - Not containing keywords such as "sensor kinase", "topoisomerase", "protein kinase", "peptidase", etc.
- User should research unknown enzymes
  - MetaCyc, Swiss-Prot, PubMed
- Stored in ORGIDcyc/VERSION/reports/name-matching-report.txt

Automated Pathway Inference
- All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion.
- Inference errs on the side of inclusivity: it is easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn't.
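The name-matching rules from the "Matching Enzymes to Reactions" and "Enzyme Name Matcher" slides can be sketched in Python as follows. This is an illustration only, not PathoLogic's actual implementation. The prefix list, the "gamma" qualifier, the gene-name pattern, and the reaction IDs RXN-A and RXN-B are assumptions made for the example. The punctuation set and the excluded keywords are taken from the slides.

```python
import re

# Punctuation characters the slides say are removed before comparison.
PUNCTUATION = " -_(){}',:"

# Assumed prefix list: the slides name only "Putative" and "Hypothetical".
PREFIXES = ("putative", "hypothetical", "probable")

# Trailing qualifiers such as ", alpha subunit" or " catalytic chain".
# "gamma" is an assumption standing in for the elided "…" on the slide.
SUFFIX_RE = re.compile(
    r"[\s,]+(?:(?:alpha|beta|gamma|catalytic|inducible)\s+)?"
    r"(?:chain|subunit|component)$"
)

# Assumed shape of a trailing parenthetical gene name, e.g. "(deoD)".
GENE_NAME_RE = re.compile(r"\s*\([a-z]{3}[a-z0-9]?\)$")

# Keywords from the slides that mark a name as not a metabolic enzyme.
EXCLUDED = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def normalize(name: str) -> str:
    """Reduce an enzyme name to a comparison key."""
    s = name.lower().strip()
    s = GENE_NAME_RE.sub("", s)           # drop parenthetical gene name
    for prefix in PREFIXES:               # drop one leading qualifier word
        if s.startswith(prefix + " "):
            s = s[len(prefix) + 1:]
            break
    s = SUFFIX_RE.sub("", s)              # drop chain/subunit/component suffix
    return "".join(ch for ch in s if ch not in PUNCTUATION)


def match_enzyme(name: str, reference: dict) -> "str | None":
    """Return the reaction ID whose reference enzyme name has the same key."""
    index = {normalize(ref_name): rxn for ref_name, rxn in reference.items()}
    return index.get(normalize(name))


def is_probable_enzyme(name: str) -> bool:
    """Heuristic for unmatched names: contains 'ase', no excluded keyword."""
    s = name.lower()
    return "ase" in s and not any(keyword in s for keyword in EXCLUDED)


# Hypothetical reference names and reaction IDs, for illustration only.
reference = {
    "Purine nucleoside phosphorylase": "RXN-A",
    "Glutamate synthase (NADPH)": "RXN-B",
}
print(match_enzyme("putative purine-nucleoside phosphorylase (deoD)", reference))
print(is_probable_enzyme("DNA topoisomerase I"))
```

Names that fail `match_enzyme` but pass `is_probable_enzyme` correspond to the "probable enzymes" that the slides ask the user to research manually.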
Algorithm
- Considerations taken into account when deciding whether or not a pathway should be inferred:
  - Is there a unique enzyme (an enzyme not involved in any other pathway)?
  - Does the organism fall in the expected taxonomic domain of the pathway?
  - Is this pathway part of a variant set, and, if so, is there more evidence for some other variant?
- If there is no unique enzyme:
  - Is there evidence for more than one enzyme?
  - If a biosynthetic pathway, is there evidence for final reaction(s)?
  - If a degradation pathway, is there evidence for initial reaction(s)?
  - If an energy metabolism pathway, is there evidence for more than half the reactions?

Assigning Evidence Scores to Predicted Pathways
- X|Y|Z denotes the score for pathway P in organism O, where:
  - X = total number of reactions in P
  - Y = number of reactions in P for which there is evidence in O of catalyzing enzymes
  - Z = number of Y reactions that are used in other pathways in O

Pathway Evidence Report
- On Organism Summary Page in Navigator, button "Generate Pathway Evidence Report"
- Report saved as HTML file, view in browser
- Hierarchical listing of all inferred pathways
- "Pathway Glyph" shows evidence graphically
  - Steps with/without enzymes (green/black)
  - Steps that are unique to pathway (orange)
  - Steps filled by Pathway Hole Filler (blue)
- Counts reactions in pathway, with evidence, in other pathways
- Lists other pathways that share reactions
- Link to pathway in MetaCyc

Manual Pruning of Pathways
- Use pathway evidence report
  - Coloring scheme aids in assessing pathway evidence
- Phase I: Prune extra variant pathways
  - Rescore pathways, re-generate pathway evidence report
- Phase II: Prune pathways unlikely to be present
  - No/few unique enzymes
  - Most pathway steps present because they are used in another pathway
  - Pathway very unlikely to be present in this organism
  - Nonspecific enzyme name assigned to a pathway step
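As a worked illustration of the X|Y|Z score from the "Assigning Evidence Scores to Predicted Pathways" slide, the sketch below computes the three counts for a toy pathway. The `Pathway` class, the function names, and the reaction IDs are hypothetical, and the sketch assumes that "evidence" simply means a reaction has at least one enzyme assigned in the organism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """Hypothetical minimal pathway record: a name and its reaction IDs."""
    name: str
    reactions: tuple


def evidence_score(pathway, reactions_with_enzymes, organism_pathways):
    """Return (X, Y, Z) for `pathway` as defined on the slide.

    X: total number of reactions in the pathway.
    Y: reactions in the pathway that have enzyme evidence in the organism.
    Z: those Y reactions that are also used in other pathways in the organism.
    """
    x = len(pathway.reactions)
    evidenced = [r for r in pathway.reactions if r in reactions_with_enzymes]
    y = len(evidenced)
    # Collect every reaction used by any *other* pathway in the organism.
    used_elsewhere = set()
    for other in organism_pathways:
        if other.name != pathway.name:
            used_elsewhere.update(other.reactions)
    z = sum(1 for r in evidenced if r in used_elsewhere)
    return x, y, z


def format_score(score):
    """Render a score tuple in the X|Y|Z notation."""
    return "|".join(str(n) for n in score)


# Toy data: r2 is shared between P and Q; r3 and r4 have no enzyme.
p = Pathway("P", ("r1", "r2", "r3", "r4"))
q = Pathway("Q", ("r2", "r5"))
evidence = {"r1", "r2", "r5"}

print(format_score(evidence_score(p, evidence, [p, q])))  # 4|2|1
```

In this toy case Y - Z = 1, so one evidenced reaction (r1) occurs in no other pathway in the organism. That is the kind of pathway-specific evidence the pruning phases above look for.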
Caveats
- Cannot predict pathways not present in MetaCyc
- Evidence for short pathways is hard to interpret
- Since many reactions occur in multiple pathways, some false positives
- Next-generation pathway inference algorithm is currently work in progress!

Output from PPP
- Pathway/genome database
- Summary pages
- Pathway evidence page
  - Click "Summary of Organisms", then click organism name, then click "Pathway Evidence", then click "Save Pathway Report"
- Missing enzymes report
- Directory tree containing sequence files, reports, etc.

Resulting Directory Structure
    ROOT/ptools-local/pgdbs/user/ORGIDcyc/
      released -> VERSION
      VERSION/
        input/
          organism.dat
          organism-init.dat
          genetic-elements.dat
          annotation files
          sequence files
        kb/
          ORGIDbase.ocelot
        reports/
          name-matching-report.txt
          trial-parse-report.txt
        data/
          overview.graph

Manual Polishing
- Refine -> Assign Probable Enzymes: do this first
- Refine -> Rescore Pathways: redo after assigning enzymes
- Refine -> Create Protein Complexes: can be done at any time
- Refine -> Assign Modified Proteins: can be done at any time
- Refine -> Transport Identification Parser: can be done at any time
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Update Overview: do this last, and repeat after any material changes to PGDB

[Screenshot: Assign Probable Enzymes]

How to find reactions for probable enzymes
- First, verify that enzyme name describes a specific, metabolic function
- Search for fragment of name in MetaCyc; you may be able to find a match that PathoLogic missed
- Look up protein in UniProt or other DBs
- Search for gene name in PGDB for related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
- Search for function name in PubMed
- Other…