Genomic Firsts
1976: RNA virus – Phage MS2 (3 kbp)
1977: DNA virus – Phage Φ-X174 (6 kbp)
1995: Bacteria – Haemophilus influenzae (1.8 Mbp)
1996: Eukarya – Saccharomyces cerevisiae (12 Mbp)
1996: Archaea – Methanococcus jannaschii (1.6 Mbp)
2000: draft human genome – J. Craig Venter (3 Gbp)

Genome Sequencing Explosion

Three domains of life
16S rRNA sequences, Woese (1987)

Global phylogeny of 191 organisms derived from 31 conserved protein genes. Tree is fairly well resolved and agrees mostly with rRNA tree.
Ciccarelli et al (2006) Science

Genomic streamlining in prokaryotes
Proteobacteria (from Higgs & Attwood)
~1000 bp/gene
short intergenic regions

Efficiency in the Genome
Small organisms are under pressure to keep DNA replication time short.
No wasted space
High coding density (85-90%)
1 gene per 1000 bases in prokaryotes
Haemophilus influenzae: 1762 genes in 1.8 Mb (about 1 gene per 1.0 kb)
Human: 23000 genes in 3080 Mb (about 1 gene per 134 kb)
Eukaryotic genomes have lots of transposons and repetitive sequences.

Hou and Lin (2009) PLoS ONE
The larger organelle genomes also have a greater fraction of non-coding sequence, but small animal mitochondria fit the trend of the bacteria.

Large variation in genome size between bacteria
Sorangium cellulosum (14000 kb), 11599 coding sequences. Soil bacterium.
Tremblaya princeps (140 kb), 121 coding sequences. Endosymbiont in insect cells.
McCutcheon and Moran Nature Reviews (2012)

Reduced-size genomes evolve independently in different lineages. They are usually on long branches = fast sequence evolution.

Subdivisions of proteobacteria identified using 16S rRNA originally
α-proteobacteria
- Agrobacterium tumefaciens – genetic engineering
- Rickettsia conorii – ticks – spotted fever
- Rickettsia prowazekii – lice – typhus
β-proteobacteria
- Neisseria meningitidis
- N. gonorrhoeae
γ-proteobacteria
- Escherichia coli – commensal – lab study
- Yersinia pestis – plague
- Haemophilus influenzae – respiratory pathogen.
(First bacterial genome)
- Xanthomonas / Xylella – plant pathogens
ε-proteobacteria
- Helicobacter pylori – stomach infections

Considerable change in GC content among related genomes.
Short genomes are derived from longer genomes: lots of deletions in intracellular parasites and endosymbionts.

Pathogens and intracellular bacteria have low GC content. This may be a result of the metabolic cost of synthesizing G and C being higher (Rocha and Danchin, 2002).
These genomes are also small: use it or lose it! This may explain the correlation of GC content with genome size.

It has also been argued that there is a general mutation bias towards AT, and that selection for GC keeps this from going to very low GC in most organisms. This stabilizing selection might be weaker in smaller intracellular organisms. Therefore smaller genomes have more AT.
...However, two extremely small genomes break the trend. Maybe these have a mutation bias in the other direction (towards GC); this has not yet been measured.

Circular representation of the R. conorii genome (strain Malish 7). The outermost circle indicates the nucleotide positions. The second and third circles locate the ORFs on the plus and minus strands, respectively. Function categories are color-coded. The fourth and fifth circles locate tRNAs. The locations of three rRNAs are indicated by black arrows. The sixth and seventh circles indicate the locations of repeats. The eighth circle shows the G-C skew, (G-C)/(G+C), with a window size of 10 kb. The region locally breaking the genome colinearity with R. prowazekii is indicated by a shaded sector. The four major genomic segments involved in this rearrangement are colored in blue, yellow, green, and red.
Ogata et al (2001) Science

Illustration of the colinearity. Three distinct segments from the R. conorii genome aligned with the homologous segments from the R. prowazekii genome are shown.
These segments were chosen to show three types of gene alteration: split genes in R. prowazekii (top), a split gene in R. conorii (middle), and a gene remnant in R. prowazekii (bottom).

Comparison of genomes of related organisms shows synteny, but relatively rapid evolution of gene order
Mycoplasma genitalium and M. pneumoniae
Each dot shows a high-scoring BLAST match between a gene of one species and a gene of the other species.

Gene gain via Horizontal Gene Transfer (mostly prokaryotes)
Gene gain via Gene Duplication (mostly eukaryotes)

Genomic streamlining in symbionts and pathogens
McCutcheon & Moran (2012)
Free-living bacteria
• Selection to maintain a reasonably large set of functional genes.
• Gene acquisition balances gene loss.
• HGT mediated by viruses and plasmids → gain of functions.
• Some cells are competent for DNA uptake (transformation).
• Homologous recombination can eliminate some deleterious mutations.
Host-restricted parasites and endosymbionts
• Fewer essential genes because of the environment provided by the host.
• Smaller effective population size (bottlenecks).
• Reduced selection against slightly deleterious mutations and reduced opportunity for homologous recombination → faster sequence evolution, reduced functionality and stability of proteins (need for high levels of chaperones).
• Reduced selection against the deletion of slightly beneficial genes, an inherent bias toward deletions, and reduced opportunity to acquire genes horizontally → gene loss much faster than gene gain.

Balance between selection and mutation in a large population
n_k = number of individuals with k deleterious mutations
N = total population size
U = number of deleterious mutations per genome per generation
s = fitness cost of each deleterious mutation
Assume no advantageous mutations. Back-mutations are very rare.
Fitness: $w = (1-s)^k$
For a very large population, selection balances mutation. There is a stationary state:
$n_k = N \, \frac{(U/s)^k}{k!} \, \exp(-U/s)$

Muller’s Ratchet: Accumulation of deleterious mutations in asexual species with small populations
If N is fairly small, then the number of individuals in the fittest class, n_0, can be very small. This number fluctuates, and eventually goes to zero. If there are no back-mutations, the fittest class is gone forever. This is one click of the ratchet.
More and more deleterious mutations accumulate with time until “mutational meltdown” kills the species.

Muller’s Ratchet is stopped by recombination
[Figure labels: initial population, mutation, recombination; axis: fitness]
After one click of the ratchet, every chromosome has at least one deleterious mutation, but they don’t all have the same one.
Cross-over can recreate the fittest class. This is much more likely than back-mutation in sexual species.

Muller’s Ratchet and the Evolution of Sex
• Two-fold cost of males in sexual species: there must be a big benefit of sex to outweigh this cost.
• A few parthenogenetic species are derived from sexual ancestors. These do not do well in the long term.
• The ability of recombination to stop Muller’s ratchet is one large advantage of sex, and is one possible reason for the prevalence of sexual species.
• Host-parasite co-evolution is probably another important reason.
• Maybe most free-living bacteria should be thought of as sexual, not asexual.
• Uptake of fragments of DNA from similar cells gives the possibility of homologous recombination. This functions like sex in eukaryotes. It can remove deleterious mutations.
• Uptake of DNA from distantly related organisms (Horizontal Gene Transfer) can lead to the spread of beneficial genes.
• When bacteria become obligate parasites or endosymbionts, they become truly asexual.
• Consequences are gene loss and accumulation of deleterious mutations.

Global phylogeny of 191 organisms derived from 31 conserved protein genes. Tree is fairly well resolved and agrees mostly with rRNA tree.
Ciccarelli et al (2006) Science
Need to consider Eukaryotes separately for 2 reasons.
(i) Almost everyone believes there is a tree for Eukaryotes.
(ii) Origin of Eukaryotes is a later unique event that is very likely not tree-like.
Do prokaryotic taxa mean anything? Proteobacteria? Enterobacteriaceae? E. coli?

Criticisms of the Prokaryotic Tree of Life (Bapteste et al. 2009)
“Belief in the universal tree of life is stronger than the evidence from genomes that supports it.”
1. Circularity of tree methods: phylogenetic methods always produce a tree of some kind.
2. Statistical problems: weak signals from many individual genes. Failure to reject the consensus tree is not necessarily support for it.
3. Systematic biases in phylogenetic methods.
4. Large-scale exclusion of conflicting data. Core genes are not necessarily representative of a species tree.
5. Closely related species may exchange genes more frequently.
6. Unrelated species in similar niches may exchange genes more frequently. Convergent evolution?
This is an interesting paper, but take it with a pinch of salt.

Spectrum of Opinions
1. The tree of rRNA and translational genes is the species tree. Other genes appear to give different trees just because of noise and phylogenetic errors. HGT is unimportant.
2. The tree of rRNA and translational genes is the best information we have about the tree of cell divisions and speciations. Most genes follow this tree most of the time, even if most genes may have been horizontally transferred at some point in their history.
3. The tree of rRNA and translational genes tells us only about the history of these genes, and is therefore not particularly important. There are other essential groups of genes that follow other evolutionary paths. We need a network representation, not a single tree.
4. HGT is so frequent that all genes follow different histories. Therefore tree-building is a waste of time. We only get results that look like trees because our methods are designed to produce trees.

Gene Content Variation among E. coli genomes.
Evidence for horizontal transfer: Welch et al (2002).
Core genome = intersection of sets
Pangenome = union of sets

Core and Pan-genome of E. coli
[Figure panels: Core genome, Pan-genome]
Rasko et al (2008) J. Bacteriol.

Rapid Gain and Loss of genes among closely related genomes of Bacillus
Hao and Golding (2006) Genome Research
• Assumes a tree to begin with (many conserved genes)
• Only two of the patterns shown require more than one character change
• Does not distinguish HGT from innovation

Tree of Archaea based on signature genes
Gao and Gupta (2007) BMC Genomics
• Signature genes are those that are shared by all members of a group and are not possessed by any other species.
• Can the tree be constructed from gene content alone?
• Does not show events that do not fit the hierarchical tree.
• What about transfers within niches? Groups of genes confer metabolic activity.

Phylogeny of three domains of life based on shared gene content
SHOT: Korbel et al (2002)
S = fraction of genes that are orthologues between two species
$d = -\ln S$
Input d to the NJ (neighbour-joining) method
Major domains and groups of bacteria are obtained, the same as for rRNA.
Does not work for very reduced genomes of parasites and symbionts.

It is always possible to explain a presence/absence pattern by either multiple deletions or by horizontal transfer.
Examples from Dagan et al (2007): (a) Loss only, (b) Single origin, (c) Origin + 1 HGT, (d) Origin + 2 HGTs
The problem is, we don’t know the ratio of HGT to deletions...

Reconstructing ancestral genomes using parsimony (Dagan et al 2007)
If HGT is disallowed or penalized too much, then ancestral genomes must have been far larger than any current genomes. If HGT is too frequent then ancestral genomes are apparently too small. This helps to find a moderate value for the ratio of HGT to deletions.
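
The set definitions of core genome and pangenome, and the SHOT gene-content distance d = -ln S, can be sketched in a few lines of Python. This is only an illustrative sketch: the strain names and gene labels are made up, the function names are hypothetical, and the choice to normalise S by the smaller genome is an assumption (the slides only say "fraction of genes").

```python
import math

def core_genome(genomes):
    """Genes present in every genome (intersection of the gene sets)."""
    return set.intersection(*genomes.values())

def pangenome(genomes):
    """Genes present in at least one genome (union of the gene sets)."""
    return set.union(*genomes.values())

def shot_distance(genes_a, genes_b):
    """Gene-content distance d = -ln(S).

    Assumption: S = shared genes / size of the smaller gene set.
    """
    shared = len(genes_a & genes_b)
    if shared == 0:
        return math.inf  # no shared genes: -ln(0) diverges
    return -math.log(shared / min(len(genes_a), len(genes_b)))

# Hypothetical toy data: arbitrary gene-family labels, not real annotations.
genomes = {
    "strain_1": {"geneA", "geneB", "geneC", "geneD"},
    "strain_2": {"geneA", "geneB", "geneC", "geneE"},
    "strain_3": {"geneA", "geneB", "geneC", "geneD", "geneF"},
}

core = core_genome(genomes)   # genes shared by all three strains
pan = pangenome(genomes)      # every gene seen in any strain
d12 = shot_distance(genomes["strain_1"], genomes["strain_2"])
print(sorted(core), len(pan), round(d12, 3))
```

In a real analysis the gene sets would be clusters of orthologues rather than labels, and the matrix of pairwise d values would be passed to a neighbour-joining implementation.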
Method of Collins & Higgs (2012)
1. Collect genomes from NCBI
2. All-vs-All BLASTP
3. Single-linkage clustering
4. Identification of universal single-copy clusters
5. Global amino acid alignment
6. Concatenation of alignments
7. Phylogenetic reconstruction using Maximum Likelihood

Core and Pangenomes
Closed: pangenome size tends to a maximum as the number of genomes increases
Open: pangenome keeps increasing as you add new genomes
Fitting the data suggests that the pangenome is open for most groups of bacteria and that G_pan(n) increases in proportion to ln(n). This is expected on a tree like a coalescent (a). On a star tree (b), it would increase linearly with n.

Gene Frequency Spectra
9 Prochlorococcus genomes, Baumdicker et al (2009)
G(k) is the number of genes found in k genomes from a group of n. There is a U-shape: many genes found in only 1 or 2 genomes, a certain number of core genes in (almost) all n, and fewer genes in between.
The U-shape applies at all scales from species to the full bacterial domain.
293 Bacterial genomes, Lapierre and Gogarten (2009)
Collins and Higgs (2012)

Core, Shell and Cloud genes (Koonin and Wolf, 2012)

The role of gene duplication: Gene family size distributions
Collins and Higgs (2011)
Modelling duplication and deletion of genes
[Diagram: transitions between gene family sizes 0, 1, 2, 3, 4, etc.; rate labels (including u) not recoverable]

Origin of Mitochondria
Sequence similarity to Rickettsia, within α-proteobacteria
Also conserved gene order between Rickettsia and the mitochondrial genome of the protist Reclinomonas (one of the largest mitochondrial genomes).

Gene order and phylogeny for Hodgkinia (very small endosymbiont, see assignment 3)
Shows it has evolved independently of the lineage leading to Rickettsia and mitochondria
Derived change in Rickettsia not shared with Hodgkinia
Hodgkinia placed within Rhizobiales: raises questions of GC content bias and long branch attraction

Long Branch Attraction: An artefact of phylogenetic methods that tends to put unrelated species with rapid evolution together.
It can also draw long-branch species closer to the root, because they are attracted to the outgroup.

Rooting the tree of life using ancient gene duplications
Long Branch attraction and the tree of rRNA (Gribaldo and Philippe 2002)
Typical tree in older papers shows many lineages on long branches close to the roots of Bacteria and Eukarya
Are there any eukaryotes that never had mitochondria?
Were ancestral organisms hyperthermophiles?
Root is usually inferred from ancient gene duplications, e.g. EF-Tu and EF-G

After correcting for long branch attraction...
Microsporidia are now placed as relatives of fungi. They have small genomes with lots of gene loss and rapid sequence evolution.
Current thought says there may never have been eukaryotes without mitochondria. Eukaryotes evolved by fusion of a proteobacterium with an archaeon. The event that created the mitochondria also created the nucleus.
Phylogeny of major bacterial groups is still uncertain.
Deduction of temperature at the base of the tree is difficult. Most papers still argue for hyperthermophiles at the common ancestor of archaea and bacteria.
Seems strange! This would make prokaryotes monophyletic after all
Root is still most likely here, although this paper questions it.

Growth temperature mapped onto the rRNA tree
Or was there a mesophilic origin after all?

Competing hypotheses for the origin of the eukaryotic host cell.
TA Williams, et al. Nature 504, 231-236 (2013) doi:10.1038/nature12779
Standard picture:
- The root is on the bacterial branch
- There is a common ancestor of archaea and eukaryotes
Eocyte hypothesis:
- The root is (still) on the bacterial branch
- Eukaryotes fall within the archaea. They have a common ancestor with Eocytes/Crenarchaeota.
- Only Two Domains!

Maybe Giant Viruses are a Fourth Domain?
RNA polymerase sequences from Global Ocean Survey (GOS)
Wu et al. (2011)
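
Worked example for the mutation-selection balance behind Muller's ratchet, covered earlier in these notes. The stationary state there is a Poisson distribution with mean U/s, so the expected size of the fittest class is n_0 = N exp(-U/s). The sketch below simply evaluates that formula; the function names and the parameter values (U = 0.1, s = 0.02) are illustrative assumptions, not values from the lecture.

```python
import math

def stationary_classes(N, U, s, k_max):
    """Expected n_k = N (U/s)^k / k! * exp(-U/s) for k = 0..k_max."""
    lam = U / s  # mean number of deleterious mutations per individual
    return [N * lam**k / math.factorial(k) * math.exp(-lam)
            for k in range(k_max + 1)]

def fittest_class_size(N, U, s):
    """Expected number of mutation-free individuals, n_0 = N exp(-U/s)."""
    return N * math.exp(-U / s)

# Illustrative parameters (assumptions): U/s = 5 mutations on average.
U, s = 0.1, 0.02
for N in (10**3, 10**6):
    n0 = fittest_class_size(N, U, s)
    print(f"N = {N}: expected fittest class n_0 = {n0:.1f}")
```

With U/s = 5 the fittest class is under 1% of the population, so for N = 1000 it holds only a handful of individuals and can easily be lost by chance; that loss is one click of the ratchet. For N = 10^6 the same fraction corresponds to thousands of individuals, and loss by drift is far less likely.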