Analysis of environmental genomes using Pathway Tools
Steven Hallam | University of British Columbia
SRI International, 2013

Overview
• Through the looking glass…
• Environmental Pathway/Genome Databases
• MetaPathways Pipeline Development

Metabolism
Vertex = chemical [substrate, product]
Edge = enzyme
• Metabolism, or the synthesis and decomposition of chemicals in a cell, can be organized into pathways represented by graphs.

Cellular Pathways
Genome Management Information System, Oak Ridge National Laboratory
• Our genetic and biochemical understanding of metabolism is based largely on the study of complete pathways within cells.

Distributed Pathways
• However, microbial communities form distributed metabolic pathways directing matter and energy exchange.

Community Metabolism
• The goal is to predict and compare distributed pathways to better understand biogeochemical cycling and community metabolism in the environment.

Predicting Community Metabolism
[Diagram labels: Plurality Sequencing; Single-Cell Sequencing; Fragment Recruitment, SOM, PCA; Environmental PGDB (ePGDB) with Taxonomic Binning; Simulated ePGDB]

From Genomes to Biomes
Falkowski et al. (2008) Science 320, 1034-1038
[Diagram labels: Metagenome; Distributed Pathways; Biogeochemical Cycles]
• “The regulation of the pools and fluxes in biogeochemical cycles have their origins in the genetic inventory of individual microbes, and the regulation of these genes within the organism is determined by the environment. As such, one can look at the microbial food web as a collection of genomes whose expression and replication is coordinated through complex feedback loops at the organismal, population, and ecosystem level.” Chisholm

Foundational Questions
• What is the taxonomic and functional structure of the ecosystem?
• How does this structure change in response to environmental perturbation?
• What are the ecological consequences of this change?
• What are relevant units of selection, conservation or utilization for ecological genomic resources?

Overview
• Through the looking glass…
• Environmental Pathway/Genome Databases
• MetaPathways Pipeline Development

Inference of Metabolic Pathways
[Diagram labels: Organisms; PGDB Navigator; Genomic Map; Genes/ORFs; Gene Products; Reactions; Compounds; Pathways; PathoLogic*; PGDB]
* Integrates genome and pathway data to identify putative metabolic networks

Pathway/Genome Navigator
[Screenshot labels: Pathway Viewer; Homepage; Evidence Glyph; Metabolite; Enzyme Found; Unique Enzyme; PGDB*; Pathway Information; Gene Information]
* http://ecocyc.org/META/new-image?type=PATHWAY&object=GLYCOLYSIS

Environmental PGDB
[Diagram labels: ePGDB ???; Cellular Overview; Metabolic Pathway; Reaction; Open Reading Frame; Genomic Map; Genes/ORFs; Gene Products; Reactions; Compounds; Pathways; PathoLogic*; ePGDB]
* Integrates genome and pathway data to identify putative distributed metabolic networks

ePGDB Navigation
[Screenshot labels: ePGDB; Cellular Overview; Metabolic Pathway; Reaction; Open Reading Frame]

http://engcyc.org/
[Figure labels: BioCyc PGDBs; Tier-1 Highly Curated; EcoCyc; Tier-2 Moderately Curated; Tier-3 Automatically Curated; EngCyc]

Overview
• Through the looking glass…
• Environmental Pathway/Genome Databases
• MetaPathways Pipeline Development

MetaPathways
• A modular pipeline for constructing Pathway/Genome Databases from environmental sequence information
• MetaPathways currently supports four “data products”: i) GenBank submission, ii) LCA, iii) MLTreeMap, and iv) ePGDBs with associated feature summary tables and GFF files
• MetaPathways externalizes compute-intensive processes onto a user-defined cluster using Sun Grid Engine or the Amazon elastic cloud

MetaPathways
• ePGDBs facilitate pathway-centric exploration of environmental sequence information using Pathway Tools and the MetaCyc web interface
• Provides an inference-based approach to metabolic reconstruction, using explicit computational rules to predict the presence or absence of distributed metabolic networks
• MetaPathways can be used with multi-molecular data sets (DNA, RNA or protein) sourced from cultured isolates, single cells, and natural or human-engineered ecosystems
http://www.github.com/hallamlab/MetaPathways
http://hallam.microbiology.ubc.ca/MetaPathways

ePGDB Navigation
[Screenshot labels: ePGDB; Cellular Overview; Metabolic Pathway; Reaction; Open Reading Frame]

ePGDB Validation

EcoCyc Pathways
• The number of E. coli pathways identified using the MetaCyc BLAST database decreases with increasing BLAST score ratio (BSR) cut-off, while the others stay relatively constant. From this, an optimal BSR between 0.4 and 0.6 can be inferred.

MetaSim Pathways
[Figure, panels (a) to (d): simulated metagenomes Sim1 and Sim2. Recoverable labels: Taxa; Copy Number; Predicted Pathways (legend: Sequential, KEGG, MetaCyc+RefSeq, MetaCyc); Sensitivity; Precision; Sequencing (% Unique-Gm)]
Taxa: Vibrio cholerae str. N16961; Synechococcus elongatus PCC 7942; Mycobacterium tuberculosis H37Rv; Mycobacterium tuberculosis CDC1551; Helicobacter pylori 26695; Caulobacter crescentus NA1000; Caulobacter crescentus CB15; Bacillus subtilis 168; Aurantimonas manganoxydans SI85-9A1; Agrobacterium tumefaciens C58

Synthetic Ecology
[Figure: S-adenosyl-L-methionine cycle II, with each enzyme (EC 2.1.1.14, 3.3.1.1, 2.5.1.6, 2.1.1.-) labeled by locus tag (AAB_8041, AAB_5400, AAB_3597, AAB_7188, AAB_3549) and contributing genome (a, b, or a + b)]
• The pathway (S-adenosyl-L-methionine cycle II) was identified by Pathway Tools in the simulated metagenome based on the combined contribution of two genomes (a + b).

Inferring Trophic Interactions
[Figure: pathway diagrams for Chorismate biosynthesis I, Tryptophan biosynthesis, Lysine biosynthesis I, and Arginine biosynthesis IV & Uridine-5'-phosphate biosynthesis, with each enzyme colored by source. Legend: Tremblaya; Moranella; Both; Neither]
• An ePGDB constructed for the mealybug symbionts Tremblaya princeps and Moranella endobia predicted interpathway complementarity in essential amino acid biosynthetic pathways.
McCutcheon, J.P. and von Dohlen, C.D. “An interdependent metabolic patchwork in the nested symbiosis of mealybugs.” Current Biology, 2011, DOI: 10.1016/j.cub.2011.06.051

Hawaii Ocean Time Series (HOT)
DeLong et al. (2006) Community Genomics Among Stratified Assemblages in the Ocean’s Interior. Science 311.
Danhorn, T., Young, C. R. and DeLong, E. F. (2012) Comparison of large-insert, small-insert and pyrosequencing libraries for metagenomic analysis. ISME J, doi:10.1038/ismej.2012.35.

Environmental Sequence Information
HOT sample depth (m)  Description      Information  Sequencing platform  Number of sequences  Average sequence length  Protein coding sequences  Annotated coding sequences  MetaCyc reactions  MetaCyc pathways
25                    upper euphotic   DNA          Roche 454            623559               257                      405613                    214149                      4138               864
75                    upper euphotic   DNA          Roche 454            673674               244                      430689                    222572                      4052               854
110                   chlorophyll max  DNA          Roche 454            473166               270                      336035                    165775                      4133               860
500                   mesopelagic      DNA          Roche 454            995747               276                      714743                    361193                      4464               949
25                    upper euphotic   RNA          Roche 454            561821               248                      234404                    85781                       3433               723
75                    upper euphotic   RNA          Roche 454            557718               239                      203359                    66855                       3208               669
110                   chlorophyll max  RNA          Roche 454            398436               228                      135107                    36912                       2549               532
500                   mesopelagic      RNA          Roche 454            479661               266                      207465                    71400                       3034               641
• ePGDBs were generated for environmental sequence information (DNA and RNA) sourced from the HOT water column.
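
The HOT sample table above supports a quick derived summary. As a minimal sketch (not part of MetaPathways), the Python below transcribes the protein coding and annotated coding sequence counts from that table and computes, per sample, the fraction of predicted protein coding sequences that received an annotation. The names HOT_SAMPLES, annotated_fraction and summarize are made up for this illustration.

```python
# Counts transcribed from the HOT sample table above.
# Tuple layout: (depth in m, molecule, protein coding sequences, annotated coding sequences)
HOT_SAMPLES = [
    (25, "DNA", 405613, 214149),
    (75, "DNA", 430689, 222572),
    (110, "DNA", 336035, 165775),
    (500, "DNA", 714743, 361193),
    (25, "RNA", 234404, 85781),
    (75, "RNA", 203359, 66855),
    (110, "RNA", 135107, 36912),
    (500, "RNA", 207465, 71400),
]


def annotated_fraction(coding, annotated):
    """Fraction of predicted protein coding sequences that were annotated."""
    return annotated / coding


def summarize(samples):
    """Map (depth, molecule) to the annotated fraction for that sample."""
    return {
        (depth, molecule): annotated_fraction(coding, annotated)
        for depth, molecule, coding, annotated in samples
    }


if __name__ == "__main__":
    for (depth, molecule), fraction in summarize(HOT_SAMPLES).items():
        print(f"{depth:>4} m {molecule}: {fraction:.1%} of coding sequences annotated")
```

With the table's numbers, the annotated fraction is lower for each RNA sample than for the DNA sample from the same depth. That is a property of these counts only and says nothing on its own about why the two differ.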
Core Pathways
[Figure: bar charts of the top 50 MetaCyc pathways shared across HOT 25m, 75m, 110m and 500m (RNA/DNA). Panels: DNA 25m; DNA 75m; DNA 110m; DNA 500m; RNA 25m; RNA 75m; RNA 110m; RNA 500m. Axis: Normalized ORF Counts (-2000 to 10000). Pathway categories: Energy Metabolism; Degradation; Biosynthesis; Transport]
Pathways listed on the figure, in extraction order (category boundaries are not recoverable): ammonium transport; aerobic respiration (cytochrome c); TCA cycle VI (obligate autotrophs); TCA cycle V (2-oxoglutarate:ferredoxin oxidoreductase); TCA cycle IV (2-oxoglutarate decarboxylase); mixed acid fermentation; heterolactic fermentation; NADH to cytochrome bd oxidase electron transfer; NADH to cytochrome bo oxidase electron transfer; respiration (anaerobic); glycolysis I; TCA cycle I (prokaryotic); glycolysis III (glucokinase); Rubisco shunt; glycolysis IV (plant cytosol); TCA cycle II (eukaryotic); pyruvate fermentation to butanol I; TCA cycle III (helicobacter); pyruvate fermentation to butanoate; pentose phosphate pathway (non-oxidative branch); methylaspartate cycle; succinate fermentation to butyrate; formate oxidation to CO2; photosynthesis light reactions; reductive TCA cycle II; 3-hydroxypropionate/4-hydroxybutyrate cycle; Calvin-Benson-Bassham cycle; fatty acid β-oxidation I; incomplete reductive TCA cycle; nitrate reduction VI (assimilatory); formaldehyde assimilation I (serine pathway); glycine cleavage complex; fatty acid β-oxidation II (core pathway); reductive TCA cycle I; purine nucleotides degradation IV (anaerobic); formaldehyde assimilation II (RuMP Cycle); glycine betaine degradation; glutaryl-CoA degradation; isoleucine degradation I; gallate degradation III (anaerobic); creatinine degradation II; purine nucleotides degradation III (anaerobic); phenylacetate degradation I (aerobic); ammonia assimilation cycle II; octane oxidation; ammonia assimilation cycle I; lysine fermentation to acetate and butyrate; nitrate reduction II (assimilatory); 4-aminobutyrate degradation V; glutamate degradation V (via hydroxyglutarate); formaldehyde assimilation III (dihydroxyacetone cycle); 4-hydroxyphenylacetate degradation; nitrate reduction I (denitrification); alkylnitronates degradation; tRNA charging; adenosine nucleotides de novo biosynthesis; NAD/NADH phosphorylation and dephosphorylation; gluconeogenesis I; glutamine biosynthesis III; arginine biosynthesis II (acetyl cycle); guanosine nucleotides de novo biosynthesis; pyrimidine deoxyribonucleotides de novo biosynthesis II; uridine-5'-phosphate biosynthesis; pyrimidine deoxyribonucleotides de novo biosynthesis I; isoleucine biosynthesis I (from threonine); citrulline biosynthesis; isoleucine biosynthesis II; valine biosynthesis; arginine biosynthesis III; 5-aminoimidazole ribonucleotide biosynthesis I; sucrose biosynthesis; formylTHF biosynthesis I; leucine biosynthesis; lysine biosynthesis I; methylerythritol phosphate pathway; folate transformations II; folate transformations I; UDP-N-acetylmuramoyl-pentapeptide biosynthesis III; lysine biosynthesis VI; mycolate biosynthesis; 4-hydroxybenzoate biosynthesis V; tetrapyrrole biosynthesis I; 5-aminoimidazole ribonucleotide biosynthesis II; cis-vaccenate biosynthesis; jasmonic acid biosynthesis; isoleucine biosynthesis IV; seleno-amino acid biosynthesis; cysteine biosynthesis I; isoleucine biosynthesis III

Cellular Overview
• Comparison of DNA (blue) and RNA + DNA (red) pathway predictions

Pathway Partitioning
• Comparison of genetic potential and gene expression data in photic and dark ocean waters

Diagnostic Pathways
[Figure: bar charts of pathways unique to 25m, 75m, and 110m (DNA/RNA) and pathways unique to 500m (DNA/RNA). Panels: DNA 25m; DNA 75m; DNA 110m; DNA 500m; RNA 25m; RNA 75m; RNA 110m; RNA 500m. Axis: Logged Normalized ORF Counts (0 to 20). Pathway categories: Energy Metabolism; Degradation; Biosynthesis]
Pathways listed on the figure, in extraction order (the split between the two depth groups is not recoverable): photosynthesis light reactions; hydrogen production VIII; (S)-acetoin biosynthesis; ribitol degradation; sorbitol degradation I; ammonia oxidation I (aerobic); intra-aerobic nitrite reduction; nitrate reduction IV (dissimilatory); guanosine nucleotides degradation II; L-rhamnose degradation II; D-mannose degradation; 2-methylcitrate cycle II; acetate formation from acetyl-CoA II; citrate degradation; reductive monocarboxylic acid cycle; methane oxidation to methanol I; threonine degradation II; threonine degradation III (to methylglyoxal); methionine degradation II; flavonoid biosynthesis; salidroside biosynthesis; diplopterol and cycloartenol biosynthesis; heme biosynthesis from uroporphyrinogen-III I; adenosylcobalamin biosynthesis from cobyrinate I; adenosylcobalamin biosynthesis from cobyrinate II; lipoate biosynthesis and incorporation I; thiamin diphosphate biosynthesis II (Bacillus); thiamin diphosphate biosynthesis I (E. coli); glutathione biosynthesis; phosphopantothenate biosynthesis III; biotin biosynthesis from 7-keto-8-aminopelargonate; thiamin diphosphate biosynthesis IV (eukaryotes); trans, trans-farnesyl diphosphate biosynthesis; menaquinol-8 biosynthesis; 5,6-dimethylbenzimidazole biosynthesis; coenzyme M biosynthesis I; mycothiol biosynthesis; coenzyme B/coenzyme M regeneration; pyridoxal 5'-phosphate biosynthesis II; UDP-N-acetyl-D-galactosamine biosynthesis II; glycogen biosynthesis I (from ADP-D-Glucose); CMP-N-acetylneuraminate biosynthesis I (eukaryotes); ADP-L-glycero-beta-D-manno-heptose biosynthesis; homocysteine and cysteine interconversion; selenocysteine biosynthesis II (archaea and eukaryotes); glycine biosynthesis IV

Cryptic Pathways
• For each depth interval, a small number of cryptic pathways were predicted in the RNA data sets but not in the DNA data sets
• These pathways showed depth distributions consistent with niche partitioning between sunlit and dark ocean waters

Known Hazards
• Missing ATP citrate lyase indicates a false positive for rTCA

Things to Keep in Mind…
• PathoLogic cannot predict pathways that are not present in MetaCyc
• Evidence for short pathways is hard to interpret
• False positives arise from enzymes shared among multiple pathways and from incorrect annotations
• Currently no taxonomic assignment or coverage information is mapped onto identified pathways
• Limited functional validation for pathways in metagenomes

“One gene is many hypotheses” Anonymous

University of British Columbia: Maya Bhatia, Monica Torres Beltran, Annie Cox, Evan Durno, Diane Fairly, Esther Geis, Alyse Hawley, Aria Hahn, Niels Hansen, Sam Kheirandish, Kishori Konwar, Keith Mewis, Antoine Page, Melanie Scofield, Young Song, Nicole Sukdeo, Jody Wright, Elena Zaikova
SRI: Peter Karp, Tomer Altman
Institute for Ocean Sciences, Joint Genome Institute, Pacific Northwest National Laboratory: Marie Robert, Robin Brown, Susannah Tringe, Tijana Glavina del Rio, Angela Norbeck, Ljiljana Pasa-Tolic, Heather Brewer
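
The Known Hazards slide gives one concrete check: an rTCA prediction that lacks ATP citrate lyase is a likely false positive, because the remaining enzymes can be shared with other pathways. A hypothetical post-filter built on that idea is sketched below in Python. It is an assumption about how one might screen predictions, not a description of how PathoLogic or MetaPathways works; the PathwayPrediction structure, the function names, and the enzyme sets in the example are all invented for illustration and are not data from the HOT samples.

```python
from dataclasses import dataclass, field


@dataclass
class PathwayPrediction:
    """A predicted pathway and the enzymes found for it (hypothetical structure)."""
    name: str
    enzymes_found: set
    # Enzymes treated as diagnostic for the pathway; chosen by the analyst.
    key_enzymes: set = field(default_factory=set)


def missing_key_enzymes(prediction):
    """Return the diagnostic enzymes that were not found for this prediction."""
    return prediction.key_enzymes - prediction.enzymes_found


def flag_suspect(predictions):
    """Names of predictions that lack at least one diagnostic enzyme."""
    return [p.name for p in predictions if missing_key_enzymes(p)]


# Illustrative inputs only: an rTCA call supported solely by enzymes that
# also occur in the oxidative TCA cycle, and a glycolysis call that
# includes its chosen diagnostic enzyme.
rtca = PathwayPrediction(
    name="reductive TCA cycle I",
    enzymes_found={"malate dehydrogenase", "fumarate hydratase",
                   "aconitase", "isocitrate dehydrogenase"},
    key_enzymes={"ATP citrate lyase"},
)
glycolysis = PathwayPrediction(
    name="glycolysis I",
    enzymes_found={"6-phosphofructokinase", "pyruvate kinase"},
    key_enzymes={"6-phosphofructokinase"},
)

if __name__ == "__main__":
    for name in flag_suspect([rtca, glycolysis]):
        print(f"possible false positive: {name}")
```

A flag from a filter like this is a prompt for manual inspection rather than a verdict, since which enzymes count as diagnostic is itself a curation decision.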