Principles of comparative genomics

● Scale of the ‘unknown’ gene problem
● Shared plant-prokaryote genes
● Comparative genomics
  - When Blast tells you nothing…
  - The ‘guilt by association’ principle
  - ‘Two-dimensional’ gene annotation
  - SEED subsystems
● Plant-prokaryote examples
  - Filling ‘pathway holes’ – FolQ
  - Linking new functions to known systems – COG0354

Whole genome sequencing progress

[Figure: number of genomes (axis 0 to 10000), complete and ongoing projects. Source: www.genomesonline.org]

● Functional annotation of genes has nowhere near kept pace
● Functional annotations are often absent, vague, or wrong

Orphan genes
● 20-60% of genes in any given genome have no known function or only a vague one (‘esterase’ etc.)

Orphan enzymes
● 1437/3736 enzymes (38%) with EC numbers have no associated genes

The unknown protein problem in various groups

[Figure: percentage of known and unknown proteins encoded by diverse genomes, for Bacteria, Archaea and Eukarya. Data from The SEED, http://theseed.uchicago.edu/]

Plants & prokaryotes share many (unknown) genes

● Estimates for Arabidopsis vary, but all are many thousands
● Functions of most shared genes are metabolic

Source of genes     Number of genes   % of genome
Cyanobacteria       5470              21.0
Proteobacteria      1170              4.6
Gram+ bacteria      2280              9.1
Other bacteria      1160              4.6
Archaea             1090              4.4
Total               11170             43.4

From de Crecy-Lagard & Hanson, Trends Microbiol 15: 563 (2007)

● Shared genes identifiably from various groups
● Plants are conglomerates of microbial metabolic genes
● Many opportunities for comparative genomics

The power of comparative genomics

● Suppose you have an unknown plant protein:
  - BlastP search gives various prokaryote hits
  - None of them have clear functions
● Dead end?
● No! This is the beginning of comparative genomics
● Predicts functions via ‘guilt by association’ principle
● Genes of related function are associated in various ways, e.g. enzymes in a pathway, proteins in a complex
● Whatever a gene’s associates do, it probably does too

[Figure: association evidence (genomic and post-genomic) leads to predictions, which lead to testing (genetics, biochemistry). Panel labels: gene clustering, gene fusion, shared regulatory sites, phylogenetic occurrence, co-expression, protein-protein interactions, organelle proteomes, essentiality & other phenome data, structures]

Two-dimensional gene annotation

● ‘Dimensions’ are:
  - Molecular function (e.g., an enzyme activity with EC no.)
  - Functional context (e.g., other enzymes of a pathway)
● ‘2-Dimensions good, 1-dimension bad’
  - Even an EC no. function may be wrong if the pathway is not there
  - Pathway context may be wrong if certain enzymes are missing
● GenBank etc. annotations are 1-dimensional (mol. function)

SEED subsystems

● Subsystems (SSs): sets of molecular functions (e.g. enzymes) that together implement a specific biological process (e.g. a pathway)
● SSs cover many genomes, capture both annotation dimensions
● SSs have form of spreadsheet:
  - Columns are molecular functions
  - Rows are genomes
  - Each cell identifies the genes for proteins with the specific molecular functional role in the designated genome

[Figure: folate biosynthesis subsystem spreadsheet, with a pathway hole marked]

Plant – prokaryote examples

● Prokaryote association evidence is mainly genomic
● Plant association evidence is mainly post-genomic
● Post-genomic evidence is noisier but very useful
● Superb plant post-genomic resources:
  - Microarrays, RNAseq (organ- and environment-specific)
  - Organellar targeting prediction, proteomics (location can rule out function)
  - Phenome databases (chlorosis, lethality can support function)
  - Vast plant metabolism bibliome

FolQ – Filling a pathway hole

[Figure: folate synthesis pathway.
GTP -(FolE)-> DHN-P3 -(FolQ)-> DHN-P -([P-ase])-> DHN -(FolB)-> HMDHP -(FolK)-> HMDHP-P2 -(FolP, + pABA)-> DHP -(FolC, + Glu)-> DHF -(FolA)-> THF
Chorismate -(PabAB)-> ADC -(PabC)-> pABA]

● FolQ universally missing (prokaryotes, plants, fungi, protists)
● Missing step known to be a pyrophosphohydrolase, ~17 kDa
● Search genomes for small hydrolase clustered with fol genes
● YlgG candidate in Firmicutes, Nudix hydrolase family, 19 kDa
● YlgG has a plant homolog: At1g68760

[Figure: Lactococcus lactis folate gene cluster: folEK, folP, ylgG, folC]

FolQ – Experimental tests

● ylgG KO accumulates DHN-P3
● YlgG & At1g68760 act on DHN-P3
  - Recombinant proteins release DHN-P + PPi

[Figure: folate synthesis pathway as on the previous slide; fluorescence traces over time (minutes) for WT and KO, with the DHN-P3 peak marked; product formation (nmol/assay) of DHNP, Pi and PPi by YlgG and At1g68760]

COG0354 – Linking a new function to a known system

A folate protein for Fe/S cluster repair in oxidative stress

● In all kingdoms of life
  - Bacteria
  - Archaea
  - Fungi
  - Animals
  - Plants
● 2 plant proteins
  - 1 related to rickettsias (mitochondria)
  - 1 related to cyanobacteria (plastids)
● Homolog of GcvT protein
  - Folate-dependent
  - But clearly a distinct clade

[Figure: phylogenetic tree of COG0354 proteins. Tip labels: Mouse, Fly, Yeast, Leishmania, At4g12130, Rickettsia, Ehrlichia, Anaplasma, Bradyrhizobium, Burkholderia, Neisseria, Xanthomonas, Psychrobacter, E. coli, Shewanella, Thermus, Deinococcus, Synechocystis, At1g60990, Synechococcus, Nostoc, Haloarcula, Natronomonas, Corynebacterium, Streptomyces, Solibacter, Blastopirellula, Pirellula. Separate GcvT clade: yeast, mouse, Arabidopsis and rice GcvT]

COG0354 – Comparative genomics & post-genomic data

● Co-expression in Arabidopsis
  - Mitochondrial COG0354 expression correlates with frataxin (Fe/S assembly)
  - And with ferritin 2 (Fe storage)

[Figure: Arabidopsis Transcriptome DB (Max Planck Institute, Golm), developmental series: expression of mitochondrial COG0354, mitochondrial frataxin and ferritin 2]

● Clusters with Fe/S proteins
  - Nif cluster in Methylococcus capsulatus: 0354 nifQ fd nifX nifN nifE fd nifK nifD nifH
  - Suf cluster in Rubrobacter xylanophilus: 0354 sufC sufB sufD sufS thiC
  - Sdh operon in Stenotrophomonas maltophilia: 0354 sdhC sdhD sdhB sdhA
  - NAD synthesis cluster in Pelagibacter ubique: 0354 nadA nadC
  - MiaB (Radical SAM) in Buchnera aphidicola: 0354 MiaB

[Figure: gene cluster diagrams, with COG0354, Fe/S protein and Fe/S partner genes marked]

● Only occurs if IscA is present
  - IscA proteins are scaffolds in Fe/S cluster assembly

[Figure: phylogenetic occurrence (gene present / gene absent) across taxa.
Bacteria: Firmicutes, Clostridiales, Mollicutes, Lactobacillales, Staphylococcaceae, Listeriaceae, Bacillaceae, Fusobacteria, Actinobacteria, Bifidobacterium, Cyanobacteria, Acidobacteria, δ/ε-Proteobacteria, α-Proteobacteria, β-Proteobacteria, γ-Proteobacteria, Magnetococcus, Spirochaetes, Planctomycetes, Chlamydiales, Chlorobi, Bacteroidetes, Campylobacterales, Bdellovibrionales, Desulfobacterales, Desulfovibrionales, Desulfuromonadales, Myxococcales, Syntrophobacterales, Bacteroidales, Flavobacteria, Sphingobacteria, Deinococcus/Thermus, Chloroflexi, Thermotogae.
Archaea: Nanoarchaeota, Crenarchaeota, Euryarchaeota, Archaeoglobi, Halobacteria, Methanobacteria, Methanococci, Methanomicrobia, Methanopyri, Thermococci, Thermoplasmata]

● Associated with aerobic lifestyle
● H2O2-induced in E. coli
● High-throughput screens
  - Essentiality & phenomics
  - Proteomics

Evidence from the high-throughput screens:
● Essential gene in:
  - Mycobacterium tuberculosis
  - Haemophilus influenzae
  - Pseudomonas aeruginosa
● Important gene in:
  - E. coli (slow growth)
  - Yeast (petite)
● Plant proteins both expressed
● Cyano-like protein in plastids
● E. coli protein has folate site

COG0354 – Predictions & Experimental Validation

● Prediction: is a folate-dependent enzyme
  - Validation: folate mutations abolish activity
● Prediction: combats oxidative stress
  - Validation: mutant is oxidative stress-sensitive
● Prediction: helps make/repair Fe/S clusters
  - Validation: mutant has many Fe/S enzyme defects
● Prediction: function is ancient & ubiquitous (like Fe/S proteins themselves)
  - Validation: complementation by genes from all kingdoms

[Figure: complementation tests on LB + plumbagin (oxidative stress). Panels: controls (vector, E. coli); plant & mammal; fungi, protist, Archaea]

The power of comparative genomics

“The facts are known but they are insulated and unconnected…. The pearls are there but they will not hang together until some one provides the string”

The string: a hypothesis that connects and unifies observations

William Whewell (1794-1866)
English scientist, philosopher, Anglican priest
● An early influence on Charles Darwin
● Coined the term “scientist”
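
Worked sketch: ‘guilt by association’ in code

Two of the association tests used in the COG0354 example, phylogenetic occurrence (‘only occurs if IscA is present’) and co-expression, can be illustrated with a few lines of Python. This is a minimal sketch, not the method used in the slides: the function names, genome labels, presence/absence profiles and expression values are all hypothetical toy data chosen for illustration.

```python
from math import sqrt


def occurs_only_with(query, partner):
    """True if every genome that has the query gene also has the partner gene."""
    return all(p for q, p in zip(query, partner) if q)


def jaccard(profile_a, profile_b):
    """Share of genomes with both genes among genomes with at least one of them."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def pearson(xs, ys):
    """Pearson correlation of two equally long expression series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no co-expression signal
    return cov / sqrt(var_x * var_y)


# Hypothetical presence (1) / absence (0) profiles over six made-up genomes.
genomes = ["G1", "G2", "G3", "G4", "G5", "G6"]
query_gene = [1, 1, 0, 1, 0, 0]      # stands in for the unknown gene
partner_gene = [1, 1, 1, 1, 0, 0]    # stands in for a known partner (e.g. a scaffold)
unrelated_gene = [0, 1, 1, 0, 1, 0]  # a gene with no particular association

# Hypothetical expression values over four made-up developmental stages.
query_expr = [1.0, 2.0, 3.0, 4.0]
partner_expr = [2.0, 4.0, 6.0, 8.0]

if __name__ == "__main__":
    print("query only with partner:  ", occurs_only_with(query_gene, partner_gene))
    print("query only with unrelated:", occurs_only_with(query_gene, unrelated_gene))
    print("Jaccard query/partner:    ", jaccard(query_gene, partner_gene))
    print("co-expression (Pearson):  ", pearson(query_expr, partner_expr))
```

A strict ‘only occurs if’ test is deliberately asymmetric: the partner may be present without the query gene, as in the slide’s claim about IscA. The Jaccard score is a softer, symmetric alternative that tolerates a few annotation errors. Real analyses would use many more genomes and noisier data.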