Phylogenetics workshop: Protein sequence phylogeny, week 2
Darren Soanes

• Species trees
• Interpretation of trees
• Taxon sampling
• Tools
• Lateral (horizontal) gene transfer
• Fast evolving genes

Using DNA sequence to construct trees
[Figure: tree labelled with the sequences TGCTATT and TGCTTTT. TGCTATT = ancestral DNA sequence; TGCTTTT = sequence change due to mutation.]

Reversals can confuse phylogenies
[Figure: the same tree with a reversal, in which a lineage carrying TGCTTTT mutates back to TGCTATT. TGCTATT = ancestral DNA sequence; TGCTTTT = sequence change.]

To minimise the effect of reversals
• Use DNA sequences that are evolving slowly, so that mutations happen rarely.
• Use long stretches of DNA.
• Align the sequences and use the parts of the alignment that show a high degree of conservation.
• rDNA sequences (genes that encode ribosomal RNA) are often used.

Species tree constructed using ribosomal DNA (rDNA) sequence
[Figure: species tree.]

Using protein sequences to create species trees
• Advantages
– Protein sequences evolve more slowly than DNA sequences (many DNA mutations are neutral: they do not change the amino acid sequence).
– Reversals are less common than in DNA.
• Single-copy protein-encoding genes are identified.
• The protein sequences are joined together to create one concatenated multi-protein sequence for each species.
• The concatenated sequences are aligned.
• Disadvantage
– Sequenced genomes are needed.

Fungal species trees: more proteins = better resolution
[Figure: trees built from 30 proteins and from 60 proteins. Labels: oomycete (not fungi), plant, microsporidia, zygomycete, basidiomycetes, ascomycetes, yeasts, filamentous ascomycetes.]

Fungal Species Tree (based on 153 concatenated protein sequences)
[Figure: species tree.]

Clades
A clade consists of an ancestor organism and all its descendants.

Gene trees
• The evolutionary history of genes can be represented as phylogenetic trees based on alignment of protein sequences.
• Gene duplication and loss can be inferred from phylogenetic trees.
• Protein sequences evolve more slowly than DNA sequences (due to redundancy in the genetic code).

Gene duplication
• Gene duplication due to unequal crossing over during meiosis can create gene families.
• The sequence and function of different members of a gene family can diverge.

Gene duplication
[Figure: gene duplication.]

Sequence homology (1)
• Genes are said to be homologous if they share a common evolutionary ancestor.
• Orthologues are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologues retain the same function in the course of evolution (e.g. myoglobin in mammals).

Sequence homology (2)
• Paralogous genes are related by duplication within a genome. Paralogues often evolve new functions, even if these are related to the original one.
• In-paralogues: paralogues that arose by duplication after a speciation event and are therefore in the same species.
• Out-paralogues: paralogues that arose by duplication before a speciation event. They are not necessarily in the same species.

Orthology and paralogy
[Figure: A, B and C are different species; α and β are different paralogues of the same gene. Labels: paralogues, out-paralogues, in-paralogues.]

Evolution of globin superfamily in human lineage
[Figure: globin gene tree.]

TOR gene duplication events in fungi
TOR: protein kinase, subunit of a complex that regulates cell growth in response to nutrient availability and cellular stresses.

Taxon sampling methods
• BLAST: easiest, though subjective
• Occurrence of a Pfam (protein family) motif
• Clustering, e.g.
– INPARANOID http://inparanoid.sbc.su.se/cgibin/index.cgi
– orthoMCL http://www.orthomcl.org/cgibin/OrthoMclWeb.cgi

Minimum bootstrap
• A bootstrap value of 70% is thought to be broadly similar to a P-value of 0.05.
• The minimum bootstrap value used depends on the study.
• To improve bootstrap support
– Remove poorly aligned sequences if possible; poor alignment can be due to mis-annotation of genomes.
– Change the taxon sampling.

Collapse branches with bootstrap less than defined value
[Figure: tree with poorly supported branches collapsed.]

Lateral gene transfer (purine-cytosine permease)
[Figure labels: oomycete, fungi, Eukaryotic Tree of Life, Phytophthora sojae, Aspergillus oryzae.]

Genes that evolve quickly (1)
• Synonymous substitution: a change in DNA sequence that does not affect the amino acid sequence, often in the third position of a codon, e.g. CCG (Pro)→CCA (Pro).
• Non-synonymous substitution: a change in DNA sequence that does affect the amino acid sequence, often in the first or second position of a codon, e.g. CCG (Pro)→CAG (Gln).

Genes that evolve quickly (2)
• For a given protein-encoding gene (comparing orthologues from more than one species):
• dN = number of non-synonymous substitutions per non-synonymous site
• dS = number of synonymous substitutions per synonymous site
• We can calculate the ratio dN/dS.
• For most genes this is < 1.
• For genes under evolutionary pressure to change protein sequence (diversify), dN/dS > 1.

Genes that evolve quickly (3)
• CodeML (part of the PAML package) will calculate dN/dS for a set of orthologues from different (closely related) species.
• Human vs chimpanzee: rapidly evolving genes are involved in immunity, reproduction and olfaction (smell).
• Genes with very low dN/dS (under purifying selection) are involved in metabolism, intracellular signalling and nerve/brain function.
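
Worked example: classifying codon changes
The distinction between synonymous and non-synonymous substitutions can be illustrated with a short Python sketch. It is not part of the original slides and is not how CodeML works: it only counts codon positions that differ between two aligned, gap-free coding sequences, using the standard genetic code, and does not normalise by the number of synonymous and non-synonymous sites or correct for multiple hits. The function names classify_codon_change and count_codon_changes are hypothetical, chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(codon_a, codon_b):
    """Return 'identical', 'synonymous' or 'non-synonymous' for two codons."""
    codon_a, codon_b = codon_a.upper(), codon_b.upper()
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "non-synonymous"


def count_codon_changes(seq_a, seq_b):
    """Count (synonymous, non-synonymous) codon differences.

    Assumes the two coding sequences are already aligned in frame,
    contain no gaps, and have the same length (a multiple of 3).
    These are raw counts, not the per-site rates dN and dS.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be the same length, a multiple of 3")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        change = classify_codon_change(seq_a[i:i + 3], seq_b[i:i + 3])
        if change == "synonymous":
            synonymous += 1
        elif change == "non-synonymous":
            non_synonymous += 1
    return synonymous, non_synonymous


# The two examples from the slide:
print(classify_codon_change("CCG", "CCA"))  # Pro -> Pro
print(classify_codon_change("CCG", "CAG"))  # Pro -> Gln
```

For real analyses, a dedicated tool such as CodeML should be used, since it estimates dN and dS per site under a codon substitution model rather than from raw counts.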