Phylogenetic analysis
A brief introduction in 2 x 4 hours
brigitte.boeckmann@isb-sib.ch

What you can learn today
• Understand trees
• Different types of gene relationships
• The difference between a cladogram and a phylogram
• Phylogenetic analysis methods
• Steps performed during a phylogenetic analysis
• Search strategies for tree topologies
• Measures for tree robustness
• Gene relationships and function prediction

© 2009 SIB

Outline
• Introduction to phylogenetic analysis
• Application: Protein function prediction
• Databases, servers and software
• TP5

Introduction
Phylogeny is the study of evolutionary relationships.
Phylogenetic analysis is the means of inferring evolutionary relationships.
[Figure: an ancestral genome giving rise to the genomes of species 1 and species 2. Labelled events: polymorphisms, CNV, gene duplication, gene loss, gene fusion, gene fission, exon shuffling, retroposition, mobile elements, de novo gene origination, HGT]

Trees
[Figure: two example trees with leaves A to G, labelling end nodes, internal nodes, branches and roots]

Phylogenetic trees
• Cladogram
• Phylogram: the branch length represents the number of character changes
• Molecular clock

Phylogenetic trees
• A phylogenetic tree is a model of the evolutionary relationships between operational taxonomic units (OTUs), based on homologous characters.
• But not all trees are phylogenetic trees
  – Dendrogram: general term for a branching diagram
  – Cladogram: branching diagram without branch length estimates
  – Phylogram or phylogenetic tree: branching diagram with branch length estimates
Please note: Guide trees produced during multiple sequence alignment have no phylogenetic meaning. These dendrograms are based on distances derived from pairwise alignments; they are used only to determine the order in which sequences are aligned during construction of the MSA.

Rooted and unrooted trees
[Figure: rooted and unrooted trees; outgroup]

How many distinct trees?
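As a rough illustration of the question "How many distinct trees?": for n labelled taxa, the number of strictly bifurcating topologies is (2n-5)!! for unrooted trees and (2n-3)!! for rooted trees, where !! is the double factorial over odd numbers. The sketch below computes both counts; the function names are hypothetical and chosen for this example only.

```python
def odd_double_factorial(k: int) -> int:
    """Return k * (k-2) * ... * 3 * 1 for odd k; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating topologies for n_taxa labelled leaves."""
    if n_taxa < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 taxa")
    return odd_double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating topologies for n_taxa labelled leaves."""
    if n_taxa < 2:
        raise ValueError("a rooted bifurcating tree needs at least 2 taxa")
    return odd_double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
    # 3 1 3
    # 4 3 15
    # 5 15 105
    # 10 2027025 34459425
```

With only 10 taxa there are already more than two million unrooted topologies, which is why the search strategies discussed later are mostly heuristic.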
Resolved (bifurcating) and unresolved (multifurcating) trees
[Figure: two trees with leaves A to G, one resolved and one unresolved]

Speciation and gene duplication
[Figure: trees with leaves A1, B1, C1, A2, B2, C2 and C, D, E, F; gene duplication events are marked]

Relationships within homologs
[Figure: gene tree rooted at an ancestral gene, with leaves frog gene 1, human gene 1, mouse gene 1, mouse gene 2, human gene 2, frog gene 2 and Drosophila gene; a gene duplication is marked; labels: homologs, orthologs, paralogs]

Relationships between orthologs and paralogs
[Figure: the same gene tree, with labels: Orthologs (Group 1), Orthologs (Group 2), Inparalogs of Group 2, Outparalogs of Group 1, Co-orthologs of the Drosophila gene]

Gene trees versus species trees …

Gene relationships
Homologs = Genes of common origin
Orthologs = 1. Genes resulting from a speciation event, 2. Genes originating from an ancestral gene in the last common ancestor of the compared genomes
Co-orthologs = Orthologs that have undergone lineage-specific gene duplications subsequent to a particular speciation event
Paralogs = Genes resulting from gene duplication
Inparalogs = Paralogs resulting from lineage-specific duplication(s) subsequent to a particular speciation event
Outparalogs = Paralogs resulting from gene duplication(s) preceding a particular speciation event
One-to-one (1:1) orthologs = Orthologs with no (known) lineage-specific gene duplications subsequent to a particular speciation event
One-to-many (1:n) orthologs = Orthologs of which at least one, and at most all but one, has undergone lineage-specific gene duplication subsequent to a particular speciation event
Many-to-many (n:n) orthologs = Orthologs which have undergone lineage-specific gene duplications subsequent to a particular speciation event
Pseudo-orthologs = Paralogs with lineage-specific gene loss of orthologs
Xenologs = Orthologs derived by horizontal gene transfer from another lineage

Phylogenetic analysis – an approach I
Sequence data of actin-related protein 2
Species are: Caenorhabditis briggsae, Drosophila melanogaster, Homo sapiens, Mus musculus, Schizosaccharomyces pombe

>Species A - RecName: Full=Actin-related protein 2;
MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE
ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR
EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR
RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL
VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH
IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV
LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR
>Species B - RecName: Full=Actin-related protein 2;
MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE
ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR
EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR
RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL
VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH
IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV
LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR
….

Multiple sequence alignment
ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_C  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_E  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_A  AEAVRSLLQVKYPMENGIIRDFEEMNQLWDYTF-FEKLKIDPRGRKILLTEPPMNPVANR
ARP2_B  CSQLRQMLDINYPMDNGIVRNWDDMAHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNR
ARP2_C  ASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNR
ARP2_D  ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
ARP2_E  ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR

ARP2_A  EKMCETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVG
ARP2_B  EKMFQVMFEQYGFNSIYVAVQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTR
ARP2_C  EKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTR
ARP2_D  EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
ARP2_E  EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR

ARP2_A  RLDVAGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVL
ARP2_B  RLDIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVL
ARP2_C  RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVL
ARP2_D  RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
ARP2_E  RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL

ARP2_A  MRNYTLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRA
ARP2_B  SQQYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKH
ARP2_C  VESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKH
ARP2_D  VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
ARP2_E  VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH

ARP2_A  IVLSGGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAV
ARP2_B  IVLSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAV
ARP2_C  IVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAV
ARP2_D  IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
ARP2_E  IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV

ARP2_A  LADIMAQND-HMWVSKAEWEEYGV-RALDKLGPRTT
ARP2_B ARP2_C  LANLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKALAEVTKDRD-GFWMSKQEYQEQGL-KVLQKLQKISH
ARP2_D  LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR
ARP2_E  LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR

Which sequence is likely to correspond to which species?
Species are: Caenorhabditis briggsae, Drosophila melanogaster, Homo sapiens, Mus musculus, Schizosaccharomyces pombe

Pairwise comparisons
ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE

ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_C  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_A  MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_E  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE

ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_C  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_B  MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_E  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE

ARP2_C  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_C  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_E  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE

ARP2_D  MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_E  MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE

Distance matrix
     A    B    C    D    E
A    0    -    -    -    -
B  158    0    -    -    -
C  143  107    0    -    -
D  139   97   73    0    -
E  139   97   73    0    0

Expected species tree for …
• Caenorhabditis briggsae
• Drosophila melanogaster
• Homo sapiens
• Mus musculus
• Schizosaccharomyces pombe

Phylogenetic analysis
1. Data selection
2. Data comparison
3. Selection of a data model
4. Selection of an evolutionary model
5. Tree-building
6. Tree evaluation

What data types can be used to infer phylogenies?
• Morphological characters
• Physiological characters
• Gene order
• Sequence data (nucleotide sequences, amino acid sequences)
• Mixed characters
• ….

Data selection
• To be considered:
  – Input data must be homologous!
  – Taxonomic range and taxonomic distribution (balanced sampling, avoid long branches)
  – Content of phylogenetic information
  – Number of character states
  – Size of the dataset
  – etc.

Data comparison
• To be considered:
  – Prediction of characters that are derived from a common ancestor
  – Choose a suitable alignment method
  – Highly diverged sequences
    • Domain/family predictions
    • Structures

Alignment
• Pairwise alignment versus MSA
• MSA methods
  – ClustalW (very fast)
  – MUSCLE (very fast)
  – MAFFT (fast)
  – ProbCons
  – T-Coffee
  – …
• When to use which method and why?

Selection of a data model
• Characters to be selected for the analysis
• To be considered:
  – Each position in the alignment should be homologous!
  – Missing data (in some OTUs)
  – Number of characters
  – etc.
• Common methods
  – Gap removal
  – Gblocks

Evolutionary models
• Phylogenetic tree-building presumes particular evolutionary models
• The model chosen influences the outcome of the analysis and should be considered in the interpretation of the analysis results

Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
   [Figure: http://www.russell.embl-heidelberg.de/aas/other_images/lb3.gif]
   • Substitution matrices
     – Empirically derived from alignment datasets
       • PAM (Dayhoff, 1968)
       • JTT (Jones, Taylor, Thornton, 1992)
       • Gonnet et al. (1992)
       • WAG (Whelan, Goldman, 2001)
       • mtREV (Adachi, Hasegawa, 1996; specific for mitochondrial data)
     – Estimated rate matrix -> series of replacement probability matrices (e.g. PAM1 … PAM250)
2. Change of aa frequencies during evolution. Why?
   • GC content
     – Differs between species (20-72%)
     – Differs within a genome (isochores)
     – Biased recombination-associated DNA repair
     – Temperature
   • An exchangeability matrix can be built for a particular dataset
   • JTT + F
3. Between-site rate variation, or among-site substitution rate heterogeneity
   [Figure: alignment]
   • Variation in substitution rates among different positions
   • Mostly the discrete gamma model
     – The gamma distribution is a continuous probability distribution with an alpha parameter and a scaling factor
     – Infinitely large alpha value: the rate is the same for all sites
     – alpha = 1: extensive rate variation
     – alpha < 1: many invariable sites
     [Figure: probability density against relative evolutionary rate; http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Gamma_distribution_pdf.png]
   • Select the number of categories (4/8)
4. Presence of invariable sites

Evolutionary models
Notation, e.g.
  JTT
  JTT + F
  JTT + F + gamma (4)
  JTT + F + gamma (8) + I (under discussion)
  JTT + F + I
• It is not always the most complex model that produces the best result.
• The more complex the model, the more complex the explanation of the results.
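To make the discrete gamma model ("gamma (4)" or "gamma (8)" in the notation above) more concrete, the sketch below approximates the category rates for a given alpha by simulation: it draws rates from a gamma distribution with mean 1, sorts them, splits them into equally sized categories and reports the mean rate of each category. This is a Monte Carlo approximation written for illustration; it is an assumption of this example and not the exact quantile-based calculation that tree-building programs perform. The function name and its parameters are hypothetical.

```python
import random
from typing import List


def discrete_gamma_rates(alpha: float, n_categories: int = 4,
                         n_samples: int = 100_000, seed: int = 0) -> List[float]:
    """Approximate mean relative rates of equal-probability gamma categories.

    Rates are drawn from Gamma(shape=alpha, scale=1/alpha), which has mean 1.
    n_samples should be a multiple of n_categories; leftover draws are ignored.
    """
    if alpha <= 0 or n_categories < 1:
        raise ValueError("alpha must be > 0 and n_categories >= 1")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    size = n_samples // n_categories
    rates = []
    for i in range(n_categories):
        chunk = draws[i * size:(i + 1) * size]
        rates.append(sum(chunk) / len(chunk))
    return rates


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 5.0):
        rounded = [round(r, 3) for r in discrete_gamma_rates(alpha)]
        print("alpha =", alpha, "category rates:", rounded)
```

With a small alpha the lowest category is close to zero (nearly invariable sites) and the highest is several times the average rate; with a large alpha all categories approach 1.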
Evolutionary models
• Statistical selection of best-fit models of evolution
  – ProtTest
    • AIC (Akaike Information Criterion): simple relationship between the likelihood and the number of parameters, used to estimate the distance of a model from the truth
    • BIC (Bayesian Information Criterion): includes a penalty for the number of parameters to avoid overfitting of the selected model

Tree-building methods
• Distance (matrix) methods
  1. Calculate distances for all pairs of taxa based on the sequence alignment
  2. Construct a phylogenetic tree based on the distance matrix
• Character-based (sequence) methods
  1. Construct a phylogenetic tree based on the sequence alignment

Step 1: Compute distances
Simple measure for the extent of sequence divergence:
p distance: $\hat{p} = n_d / n$
  p = proportion (p distance)
  n_d = number of aa differences
  n = number of aa used

Step 1: Compute distances
• Relationship of p with t (time)
[Figure: number of substitutions per site (0.5, 1.0) against time in million years (25, 50, 75)]

Step 1: Compute distances
• Nonlinear relationship of p with t (time)
• Estimate the true number of amino acid substitutions between sequence pairs
  – Poisson correction (PC distance)
  – Gamma correction (Gamma distance)

Step 2: Tree-building
Common distance methods
• Neighbor Joining (NJ)
• (Un)-Weighted pair-group method using arithmetic averages (UPGMA / WPGMA)
• Least Squares (LS)
• Minimum Evolution (ME)

Neighbor Joining (NJ)
• Saitou, Nei (1987), Mol Biol Evol. 1987 Jul;4(4):406-25.
• Principle
  – Bottom-up clustering method
  – Neighbours are defined as taxa connected by a single node in an unrooted tree; closest neighbours are successively joined by a new node until the tree is resolved.
  – Result: A single, unrooted tree with branch length estimates
• Advantage
  – Very efficient, also for large datasets
• Disadvantage
  – Does not examine all possible topologies

Character- (Sequence-) based methods
Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference

Maximum Parsimony (MP)
• Hennig, 1966
• Originally developed for morphological characters
• William of Ockham (1285-1349, Franciscan friar): the best hypothesis is the one that requires the smallest number of assumptions
• The topology of the result tree is the one that requires the smallest number of evolutionary changes
• Group of related methods

Maximum Parsimony (MP)
• Principle:
  – Estimate the minimum number of substitutions for a given topology
  – Parsimony-informative sites (shared-derived characters; exclude invariable sites and singletons)
  – Searching MP trees
    • Exhaustive search
    • Branch-and-bound (Hendy-Penny, 1982): good but time-consuming, if m>20
    • Heuristic search: result tree might not be the most parsimonious tree
  – Result
    • Multiple result trees are possible (consensus tree)
    • Most parsimonious tree vs true tree
    • Unrooted result trees

Maximum Parsimony (MP)
• Advantages
  – Free from assumptions (model-free)
• Disadvantages
  – Generally produces multiple result trees
  – Does not take homoplasy into account
  – Long-branch attraction (LBA): creates wrong topologies if the substitution rate varies extensively between lineages

Maximum Likelihood (ML)
• Cavalli-Sforza, Edwards (1967), gene frequency data
• Felsenstein (1981), nucleotide sequences
• Kishino (1990), proteins
• Principle
  – Calculates likelihoods for each position in the alignment and for all possible topologies (gaps generally removed)
  – Result = tree with the highest likelihood
  – Maximizes the likelihood of observing the sequence data for a specific model of character state changes
  – Maximized to estimate branch lengths, not topologies
• Search strategies: rarely exhaustive, mostly heuristic
  – NNI (Nearest neighbor interchanges)
  – TBR (Tree bisection-reconnection)
  – SPR (Subtree pruning and regrafting)
[Table: number of possible trees by number of leaves, unrooted and rooted]

Maximum Likelihood (ML)
• Software
  – PhyML (fast)
  – ProML (Phylip)
  – ProtML
  – RAxML (very fast)
  – …

Bayesian estimation of phylogenies
[Figure: probability (up to 1.0) of tree topologies 1, 2 and 3 under the prior distribution and, given the data (observations), under the posterior distribution]
• Very time-intensive
• Programs: MrBayes, PhyloBayes

Tree evaluation
Analyze how well the data support the result tree
Tests
1. Topology
   • Tree reconciliation (comparison of the gene tree with the species tree)
   • Robustness, e.g. bootstrap, aLRT (PhyML)
2. Branch length tests

Bootstrap
• by Bradley Efron (1979)
• Felsenstein (1985)
• Used to test the robustness of a tree topology
• Principle:
  – New MSA datasets are created by randomly choosing N columns, with replacement, from the original MSA, where N is the length of the original MSA
  – Phylogenetic analysis is then performed on all bootstrap replicates
  – The consensus tree indicates bootstrap support for each node
    • Mostly 1000 replicates (100 replicates for large datasets)
    • Bootstrap support values: min. 98% (strict), min. 95% (accepted)

Create a bootstrap replicate
Seq_1  ILKAEEK
Seq_2  IVRSTQR
Seq_3  IIRSSTK
Seq_4  IIRSTTK
Seq_5  LLKTTSR

Bootstrap and Bayesian support values.
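The bootstrap principle described above (drawing N alignment columns at random from an MSA of length N) can be sketched as follows, using the five short sequences from the "Create a bootstrap replicate" slide. The function name is hypothetical; columns are sampled with replacement, so some columns appear several times in a replicate and others not at all. Tree-building on each replicate is not shown.

```python
import random
from typing import Dict


def bootstrap_replicate(msa: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Return one bootstrap replicate of an alignment.

    msa maps sequence names to aligned sequences of equal length N.
    N column indices are drawn with replacement and applied to every sequence,
    so each replicate column is an intact column of the original alignment.
    """
    lengths = {len(seq) for seq in msa.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    picked = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(seq[i] for i in picked) for name, seq in msa.items()}


if __name__ == "__main__":
    alignment = {
        "Seq_1": "ILKAEEK",
        "Seq_2": "IVRSTQR",
        "Seq_3": "IIRSSTK",
        "Seq_4": "IIRSTTK",
        "Seq_5": "LLKTTSR",
    }
    rng = random.Random(42)  # seeded only to make the example reproducible
    for name, seq in bootstrap_replicate(alignment, rng).items():
        print(name, seq)
```

In a full analysis this step would be repeated for every replicate (for example 1000 times), with a tree inferred from each replicate and the support per node read from the consensus tree.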
PhyML aLRT
• approximate Likelihood-Ratio Test (aLRT)
• aLRT is a statistical test to compute branch supports: it uses the likelihood score of a branch to calculate the approximate probability that a particular branch really exists in the true tree. It is much faster than bootstrapping.
  1. aLRT
  2. Chi2: parametric branch support
  3. aLRT-SH: non-parametric branch support based on a Shimodaira-Hasegawa-like procedure
  4. aLRT Chi2 and SH: calculates parametric and non-parametric branch support; the result is the minimum support of both methods

Application
Phylogenetic analysis for function prediction

Gene duplication
• Prokaryotes: at least 50%
• Eukaryotes: >90%

After gene duplication
• Coexistence (normally only for a short while)
• Mostly, only one copy is retained; the other copy
  – becomes nonfunctional (non-functionalization),
  – becomes a pseudogene (pseudogenization)
  – is lost
• Both copies are retained
  – Distinct expression pattern
  – Distinct subcellular location (rare)
  – One copy keeps the original function, the other copy acquires a new function (neofunctionalization)
  – Deleterious mutations in both copies (subfunctionalization)

After gene duplication
• Synfunctionalization
  1. Functional divergence of the paralogs (e.g. expression)
  2. One paralog takes over (in part / fully) the function of the other paralog, which leads either to
     1. Orthologs that are not functionally equivalent
     2. Gene loss

Gene duplication followed by lineage-specific a) gene loss, b) function shuffling

Steroid hormone receptors
The cephalochordate Branchiostoma floridae diverged from other chordates after duplication of the ancestral SR gene. BfER, the ortholog of vertebrate estrogen receptors, negatively regulates BfSR. BfSR is specifically activated by estrogens and recognizes estrogen response elements.
Bridgham JT, et al. PLoS Genet. 2008 Sep 12;4(9).

Phylogenomic databases
Some phylogenomic databases
• COG/KOG
• eggNOG
• Ensembl (Compara)
• HOGENOM
• InParanoid
• OMA browser
• OrthoDB
• OrthoMCL
• PhylomeDB
Phylogenomic databases differ in their
• Goals
• Methodologies
• Number of species
• Taxonomic range
• Hierarchies
• Result presentation
• Update frequencies

Software for phylogenetic analysis
Examples of software packages
• Phylip
• BioNJ
• PhyML
• PAML
• MEGA
• PAUP
• TREE-PUZZLE
• MrBayes

Servers for phylogenetic analysis
• http://www.phylogeny.fr/
• http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
• http://atgc.lirmm.fr/phyml/
• http://phylobench.vital-it.ch/raxml-bb/
• http://power.nhri.org.tw/power/home.htm

Take home
• Phylogenetic trees are models, not knowledge
• Data selection is a very important step and can greatly facilitate phylogenetic analysis
• It is not always the most complex evolutionary model that leads to the best results, and complex models make the interpretation of the results more difficult!
• The most applied tree-building method is ML
• Tree evaluation is the major step in phylogenetic analysis
• Orthology prediction is helpful for function assignment, but the function is only known when confirmed by wet lab experiments.

Further reading …
• Masatoshi Nei, Sudhir Kumar. Molecular Evolution and Phylogenetics. Oxford University Press 2000.
• Dan Graur and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Sinauer Associates, Massachusetts.

TP5 1/2: Phylogenetic analysis
http://education.expasy.org/cours/phylo/MPB10_phylo_TP5.html

TP5 2/2 - Analysis Refinement, Interpretation of Results
Preparation for the next course (if you wish, you can work in groups of up to 5)
Phylogenetic analysis of the X,K-ATPase beta subunit family
• Collect homologs from chordates (human, macaque, mouse, rat, chicken, zebra finch, frog, zebrafish, fugu, Ciona intestinalis, C. savignyi); outgroup: Drosophila, Caenorhabditis elegans
• Perform a multiple sequence alignment
• Construct a data model
• Reconstruct a phylogenetic tree using ML
• Create one or more slides to present your analysis results to your colleagues next Monday. It is easiest to paste links to the analysis servers for alignments, trees, etc.
• Please send me your slides by Friday morning; if you work in groups, please indicate the names of your colleagues

Thank You
Remember: Monday 17 December 2012
The course and practicals will take place in the computer room located at 10-12 Passage Baud-Bovy, behind UniMail