Principles of comparative genomics
• Scale of the ‘unknown’ gene problem
• Shared plant-prokaryote genes
• Comparative genomics
- When Blast tells you nothing…
- The ‘guilt by association’ principle
- ‘Two-dimensional’ gene annotation
- SEED subsystems
• Plant-prokaryote examples
- Filling ‘pathway holes’ – FolQ
- Linking new functions to known systems – COG0354
Whole genome sequencing progress
[Figure: number of genomes (0 to 10,000), complete and ongoing sequencing projects. Source: www.genomesonline.org]
● Functional annotation of genes has nowhere near kept pace
● Functional annotations are often absent, vague, or wrong
Orphan genes
• 20-60% of genes in any given genome have no known function or only a vague one (‘esterase’ etc.)
Orphan enzymes
• 1437/3736 enzymes (38%) with EC numbers have no associated genes
The unknown protein problem in various groups
[Figure: percentage of unknown proteins encoded by diverse genomes; known vs. unknown proteins (percent of proteins, 0 to 100) for Bacteria, Archaea, and Eukarya. Data from The SEED http://theseed.uchicago.edu/]
Plants & prokaryotes share many (unknown) genes
● Estimates for Arabidopsis vary – but all are many thousands
● Functions of most shared genes are metabolic
Source of genes     Number of genes   % of genome
Cyanobacteria       5470              21.0
Proteobacteria      1170              4.6
Gram+ bacteria      2280              9.1
Other bacteria      1160              4.6
Archaea             1090              4.4
Total               11170             43.4
From de Crecy-Lagard & Hanson Trends Microbiol 15: 563 (2007)
● Shared genes identifiably derive from various groups
● Plants are conglomerates of microbial metabolic genes
● So there are many opportunities for comparative genomics
The power of comparative genomics
● Suppose you have an unknown plant protein:
● BlastP search gives various prokaryote hits
● None of them has a clear function. Dead end?
● No! This is the beginning of comparative genomics
● Predicts functions via ‘guilt by association’ principle
● Genes of related function are associated in various ways
● e.g. Enzymes in a pathway, proteins in a complex
● Whatever a gene’s associates do, it probably does too
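The ‘guilt by association’ idea can be sketched as a simple vote: look up what an unknown gene's associates are annotated to do and take the most common answer. This is only an illustration under stated assumptions. The gene names, annotations, and the function `predict_by_association` are hypothetical, and real analyses weigh each type of evidence rather than counting votes equally.

```python
from collections import Counter

def predict_by_association(associates, annotations):
    """Return (function, support) for the most common function among
    annotated associates, or None if no associate is annotated."""
    votes = Counter(
        annotations[g] for g in associates if annotations.get(g)
    )
    if not votes:
        return None
    function, count = votes.most_common(1)[0]
    # Support = fraction of annotated associates that share this function.
    return function, count / sum(votes.values())

# Hypothetical annotations for genes associated with an unknown gene
# (e.g. by clustering, fusion, or co-expression).
annotations = {
    "geneA": "folate synthesis",
    "geneB": "folate synthesis",
    "geneC": "Fe/S cluster assembly",
    "geneD": None,  # associate with no known function
}
associates_of_unknown = ["geneA", "geneB", "geneC", "geneD"]

prediction = predict_by_association(associates_of_unknown, annotations)
print(prediction)
```

The output is a prediction to test by genetics or biochemistry, not a conclusion.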
[Figure: types of association evidence. Genomic evidence and post-genomic evidence (gene clustering, gene fusion, shared regulatory sites, phylogenetic occurrence, co-expression, protein-protein interactions, organelle proteomes, essentiality & other phenome data, structures) combine into association evidence, which leads to predictions and then testing (genetics, biochemistry)]
Two-dimensional gene annotation
• ‘Dimensions’ are:
- Molecular function (e.g., an enzyme activity with EC no.)
- Functional context (e.g., other enzymes of a pathway)
• ‘2-Dimensions good, 1-dimension bad’
- Even an EC no. function may be wrong if pathway not there
- Pathway context may be wrong if certain enzymes missing
• GenBank etc. annotations are 1-dimensional (mol. function)
SEED subsystems
• Subsystems (SSs) capture both annotation dimensions
• Sets of molecular functions (e.g. enzymes) that together implement a specific biological process (e.g. a pathway)
• SSs cover many genomes, have form of spreadsheet:
- Columns are molecular functions
- Rows are genomes
- Each cell identifies the genes for proteins with the specific molecular functional role in the designated genome
Folate biosynthesis subsystem
[Figure: folate biosynthesis subsystem, with a ‘pathway hole’ marked]
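A subsystem spreadsheet (rows are genomes, columns are molecular functions, cells hold gene IDs) maps naturally onto a nested dictionary, and a ‘pathway hole’ is then a role with an empty cell in a genome that otherwise has most of the pathway. The sketch below assumes this layout. The genome names, gene IDs, and the 50% threshold are hypothetical and are not taken from The SEED.

```python
# Columns of the spreadsheet: functional roles of the folate pathway.
roles = ["FolE", "FolQ", "FolB", "FolK", "FolP", "FolC", "FolA"]

# Rows: genomes. Cells: gene IDs filling each role (all IDs hypothetical).
subsystem = {
    "genome_1": {"FolE": ["g1_001"], "FolB": ["g1_002"], "FolK": ["g1_003"],
                 "FolP": ["g1_004"], "FolC": ["g1_005"], "FolA": ["g1_006"]},
    "genome_2": {"FolE": ["g2_101"], "FolB": ["g2_102"], "FolK": ["g2_103"],
                 "FolP": ["g2_104"], "FolC": ["g2_105"], "FolA": ["g2_106"]},
    "genome_3": {"FolA": ["g3_201"]},  # pathway largely absent
}

def pathway_holes(subsystem, roles, min_fraction=0.5):
    """Report missing roles only for genomes that have most of the pathway.

    A genome with very few roles filled probably lacks the pathway,
    so its empty cells are not treated as holes.
    """
    holes = {}
    for genome, cells in subsystem.items():
        present = [r for r in roles if cells.get(r)]
        missing = [r for r in roles if not cells.get(r)]
        if missing and len(present) / len(roles) >= min_fraction:
            holes[genome] = missing
    return holes

holes = pathway_holes(subsystem, roles)
print(holes)
```

Using both dimensions at once is what separates a true hole (genome_1, genome_2) from a genome where the functional context is simply absent (genome_3).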
Plant – prokaryote examples
• Prokaryote association evidence is mainly genomic
• Plant association evidence is mainly post-genomic
• Post-genomic evidence is noisier but very useful
• Superb plant post-genomic resources:
- Microarrays, RNAseq (organ- and environment-specific)
- Organellar targeting prediction, proteomics (location can rule out function)
- Phenome databases (chlorosis, lethality can support function)
- Vast plant metabolism bibliome
FolQ – Filling a pathway hole
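Folate synthesis pathway
[Figure: GTP → (FolE) → DHN-P3 → (FolQ) → DHN-P → ([P-ase]) → DHN → (FolB) → HMDHP → (FolK) → HMDHP-P2 → (FolP, with pABA) → DHP → (FolC, with Glu) → DHF → (FolA) → THF. pABA branch: Chorismate → (PabAB) → ADC → (PabC) → pABA]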
• FolQ universally missing (prokaryotes, plants, fungi, protists)
• Missing step known to be a pyrophosphohydrolase, ~17 kDa
• Search genomes for small hydrolase clustered with fol genes
• YlgG candidate in Firmicutes, Nudix hydrolase family, 19 kDa
• YlgG has a plant homolog – At1g68760
[Figure: Lactococcus lactis folate gene cluster, showing folEK, folP, ylgG, and folC]
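The search for a small hydrolase clustered with fol genes can be sketched as a three-way filter over a gene list: size, annotation keyword, and chromosomal proximity to known fol genes. This is a minimal sketch, not the actual procedure used. The `Gene` record, the gene positions, the 25 kDa cutoff, the three-gene window, and all masses other than the 19 kDa of YlgG are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    gene_id: str
    index: int        # position in gene order along the chromosome
    mass_kda: float   # predicted protein mass
    annotation: str

def folq_candidates(genes, fol_ids, max_mass_kda=25.0, window=3):
    """Return IDs of small hydrolases lying within `window` genes of a fol gene."""
    fol_positions = [g.index for g in genes if g.gene_id in fol_ids]
    hits = []
    for g in genes:
        if g.gene_id in fol_ids:
            continue
        small = g.mass_kda <= max_mass_kda
        hydrolase = "hydrolase" in g.annotation.lower()
        clustered = any(abs(g.index - p) <= window for p in fol_positions)
        if small and hydrolase and clustered:
            hits.append(g.gene_id)
    return hits

# Toy genome; positions and most masses are made up for illustration.
genes = [
    Gene("folEK", 10, 40.0, "folate biosynthesis"),
    Gene("folP", 11, 30.0, "dihydropteroate synthase"),
    Gene("ylgG", 12, 19.0, "Nudix hydrolase family"),
    Gene("folC", 13, 47.0, "folylpolyglutamate synthase"),
    Gene("bigH", 14, 60.0, "hydrolase"),   # clustered but too large
    Gene("farH", 200, 18.0, "hydrolase"),  # small but not clustered
]
fol_ids = {"folEK", "folP", "folC"}

candidates = folq_candidates(genes, fol_ids)
print(candidates)
```

Any candidate that passes such a filter still needs an experimental test before it can be called FolQ.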
FolQ – Experimental tests
[Figure: folate synthesis pathway, repeated from the previous slide]
• ylgG KO accumulates DHN-P3
[Figure: fluorescence traces (0 to 6 minutes) for WT and KO, with the DHN-P3 peak labeled]
• YlgG & At1g68760 act on DHN-P3
- Recombinant proteins release DHN-P + PPi
[Figure: product formation (nmol/assay) of DHNP, Pi, and PPi from DHN-P3 by YlgG and At1g68760]
COG0354 – Linking a new function to known system
COG0354 – A folate protein for Fe/S cluster repair in oxidative stress
• In all kingdoms of life
- Bacteria
- Archaea
- Fungi
- Animals
- Plants
• 2 plant proteins
- 1 related to rickettsias (mitochondria)
- 1 related to cyanobacteria (plastids)
• Homolog of GcvT protein
- But clearly a distinct clade
[Figure: phylogenetic tree of COG0354 proteins. Labels, in order: Mouse, Fly, Yeast, Leishmania, At4g12130, Rickettsia, Ehrlichia, Anaplasma, Bradyrhizobium, Burkholderia, Neisseria, Xanthomonas, Psychrobacter, E. coli, Shewanella, Thermus, Deinococcus, Synechocystis, At1g60990, Synechococcus, Nostoc, Haloarcula, Natronomonas, Corynebacterium, Streptomyces, Solibacter, Blastopirellula, Pirellula. Separate folate-dependent GcvT clade: Yeast GcvT, Mouse GcvT, Arabidopsis GcvT, Rice GcvT]
COG0354 – Comparative genomics & post-genomic data
[Figure: Arabidopsis Transcriptome DB (Max Planck Institute, Golm), developmental series: expression of mitochondrial COG0354, mitochondrial frataxin, and ferritin 2]
• Co-expression in Arabidopsis
- Mitochondrial COG0354 expression correlates with frataxin (Fe/S assembly)
- And with ferritin 2 (Fe storage)
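Co-expression of this kind is usually quantified with a correlation coefficient across samples. The sketch below computes a plain Pearson correlation between expression profiles. The expression values are invented for illustration and are not taken from the Arabidopsis Transcriptome DB, and the choice of Pearson correlation is an assumption about the method.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression levels across five developmental stages.
cog0354   = [1.0, 2.0, 3.0, 4.0, 5.0]
frataxin  = [2.1, 3.9, 6.2, 8.0, 9.8]   # rises with COG0354
unrelated = [5.0, 1.0, 4.0, 2.0, 3.0]   # no consistent trend

r_frataxin = pearson(cog0354, frataxin)
r_unrelated = pearson(cog0354, unrelated)
print(round(r_frataxin, 3), round(r_unrelated, 3))
```

A high coefficient flags a candidate association. Because post-genomic evidence is noisy, it is best read alongside the other lines of evidence.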
COG0354 – Comparative genomics & post-genomic data
[Figure legend: COG0354; Fe/S protein; Fe/S partner]
● Nif cluster in Methylococcus capsulatus: 0354, nifQ, fd, nifX, nifN, nifE, fd, nifK, nifD, nifH
● Suf cluster in Rubrobacter xylanophilus: 0354, sufC, sufB, sufD, sufS, thiC
● Sdh operon in Stenotrophomonas maltophilia: 0354, sdhC, sdhD, sdhB, sdhA
● NAD synthesis cluster in Pelagibacter ubique: 0354, nadA, nadC
● MiaB (Radical SAM) in Buchnera aphidicola: 0354, MiaB
• Co-expression in Arabidopsis
• Clusters with Fe/S proteins
COG0354 – Comparative genomics & post-genomic data
• Co-expression in Arabidopsis
• Clusters with Fe/S proteins
• Only occurs if IscA is present
- IscA proteins are scaffolds in Fe/S cluster assembly
[Figure: phylogenetic distribution (gene present / gene absent) across taxa. Bacteria: Firmicutes, Clostridiales, Mollicutes, Lactobacillales, Staphylococcaceae, Listeriaceae, Bacillaceae, Fusobacteria, Actinobacteria, Bifidobacterium, Cyanobacteria, Acidobacteria, δ/ε-Proteobacteria, α-Proteobacteria, β-Proteobacteria, γ-Proteobacteria, Magnetococcus, Spirochaetes, Planctomycetes, Chlamydiales, Chlorobi, Bacteroidetes, Campylobacterales, Bdellovibrionales, Desulfobacterales, Desulfovibrionales, Desulfuromonadales, Myxococcales, Syntrophobacterales, Bacteroidales, Flavobacteria, Sphingobacteria, Deinococcus/Thermus, Chloroflexi, Thermotogae. Archaea: Nanoarchaeota, Crenarchaeota, Euryarchaeota, Archaeoglobi, Halobacteria, Methanobacteria, Methanococci, Methanomicrobia, Methanopyri, Thermococci, Thermoplasmata]
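A statement such as ‘only occurs if IscA is present’ is a claim about phylogenetic occurrence profiles: every genome that has the gene also has its partner. A minimal check might look like the sketch below. The genome names and presence/absence sets are hypothetical, and `occurs_only_with` is an illustrative helper, not part of any existing tool.

```python
def occurs_only_with(profiles, gene, partner):
    """True if every genome containing `gene` also contains `partner`."""
    return all(
        partner in genes for genes in profiles.values() if gene in genes
    )

def exceptions(profiles, gene, partner):
    """Genomes that have `gene` but lack `partner`."""
    return sorted(
        name for name, genes in profiles.items()
        if gene in genes and partner not in genes
    )

# Hypothetical presence/absence profiles (genome -> set of gene families).
profiles = {
    "genome_a": {"COG0354", "IscA"},
    "genome_b": {"IscA"},            # partner alone is allowed
    "genome_c": set(),               # neither gene
    "genome_d": {"COG0354", "IscA"},
}

dependent = occurs_only_with(profiles, "COG0354", "IscA")
reverse = occurs_only_with(profiles, "IscA", "COG0354")
print(dependent, reverse, exceptions(profiles, "IscA", "COG0354"))
```

The relationship is one-directional: the partner may occur alone, so the test is a subset check rather than a requirement for identical profiles.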
COG0354 – Comparative genomics & post-genomic data
• Co-expression in Arabidopsis
• Clusters with Fe/S proteins
• Only occurs if IscA is present
• Associated with aerobic lifestyle
COG0354 – Comparative genomics & post-genomic data
● Essential gene in:
– Mycobacterium tuberculosis
– Haemophilus influenzae
– Pseudomonas aeruginosa
● Important gene in:
– E. coli (slow growth)
– Yeast (petite)
● Plant proteins both expressed
● Cyano-like protein in plastids
● E. coli protein has folate site
• Co-expression in Arabidopsis
• Clusters with Fe/S proteins
• Only occurs if IscA is present
• Associated with aerobic lifestyle
• H2O2-induced in E. coli
• High-throughput screens
- Essentiality & phenomics
- Proteomics
COG0354 – Predictions & Experimental Validation
COG0354 PREDICTIONS (each followed by its experimental validation)
● Is a folate-dependent enzyme
- Folate mutations abolish activity
● Combats oxidative stress
- Mutant is oxidative stress-sensitive
● Helps make/repair Fe/S clusters
- Mutant has many Fe/S enzyme defects
● Function is ancient & ubiquitous (like Fe/S proteins themselves)
- Complementation by all kingdoms
[Figure: complementation tests on LB + plumbagin (oxidative stress). Labels: controls (vector, E. coli), plant (M and C), mammal, protist, fungi, yeast, Archaea]
The power of comparative genomics
“The facts are known but they are insulated and unconnected…. The pearls are there but they will not hang together until some one provides the string”
Hypothesis that connects and unifies observations
William Whewell (1794-1866)
English Scientist, Philosopher, Anglican priest
An early influence on Charles Darwin
Coined the term “scientist”