more sequence or more individuals, to combine or not?

Course schedule
14.4. Tue  Introduction to models (Jarno)
16.4. Thu  Distance-based methods (Jarno)
17.4. Fri  ML analyses (Jarno)
Remaining sessions: 20.4. Mon, 21.4. Tue (Jarno), 23.4. Thu, 24.4. Fri
Remaining topics: Assessing hypotheses (Jarno); Problems with molecular data; Problems with molecular data (Jarno); Phylogenomics; Search algorithms, visualization, and other computational aspects (Jarno)

The trivial truth
◦ All extant species
◦ The whole genome
Impractical? Well, then
◦ As many species as possible
◦ As much data as possible
Finite constraints on resources (time, money)
◦ Know your group: which taxa are the most relevant for your study?
◦ Know what gene sequences are available from previous studies

The days of single gene datasets are over
Mitochondrial and chloroplast DNA have been popular because they are easy to amplify and sequence
It is worth increasing the number of nuclear genes
One should aim for at least 3 genes, preferably more (maybe 10?)

It is now possible to increase the number of genes being sequenced significantly
Whole genome analyses will allow us to understand:
◦ Intron-exon boundary dynamics
◦ Gene duplication-deletion dynamics
◦ Gene transfer dynamics
Soon we will have a good understanding of the regions of the genome that are most suitable for systematics

Sometimes not all genes amplify from all samples
◦ Should these samples be discarded?
Increased taxon sampling, despite missing data, increases resolution
All possible data should be used!

Can separate independent data sets be combined for analysis?
How can we assess the possibility of conflict between different data?
What does the potential conflict then mean?
For instance
◦ Different genes may have different phylogenetic signal (different history?)
◦ If both genes have equally strong signal
◦ If one gene has a stronger signal than the other

Never combine
Combine sometimes
Always combine

The different data sets may represent different evolutionary histories (e.g. different selection pressures)
Big data sets dominate small data sets
When analyzed separately, the different data sets can serve as tests of each other's phylogenetic hypotheses
[Figure: tree from data set A + tree from data set B = their consensus; "My own experience", with taxa A to H]

It would be fantastic to get genealogical histories of individual genes
But!
◦ Single genes are generally short (1000-2000 bases)
◦ Lots of homoplasy
◦ Unreliable phylogenies

If the data sets are congruent, combine them
If the data sets are incongruent, don't combine them
One can use the ILD test to decide whether data sets are incongruent

If there is no conflict between data sets:
◦ The length of the most parsimonious (MP) tree from the combined data [L(x+y)] is equal to the sum of the lengths of the MP trees from the separately analyzed data [L(x) + L(y)]
Dxy = L(x+y) - (L(x) + L(y))
With no conflict, Dxy = 0 (Farris et al 1994)

Combining conflicting data sets leads to increased homoplasy (Dxy > 0)
But is it statistically significant?
Can be tested with the Mann-Whitney U test, where the null hypothesis is that the data sets are combinable

ILD test procedure
◦ Combine the original data sets x and y into one data set x + y
◦ Sample randomly from the combined data to get data sets p and q that are as large as the original data sets
◦ Search for MP trees and calculate Dpq values
◦ Repeat many times (e.g. 1000), which gives a distribution for the value of D
◦ Compare whether Dxy differs from the random distribution at P < 0.05
However:
◦ The ILD test is sensitive to the relative sizes of the compared data sets and to the evolutionary history of the different data sets

Combining all available data leads to more resolved trees = the combined data has higher explanatory power
"Hidden support" can only be detected through combined analysis
Conflicts at different nodes can only be discovered in a combined analysis framework

The effects of combined analysis can be investigated using indices related to Bremer support
Partitioned Bremer Support (PBS)
◦ Baker & DeSalle 1997: Syst Biol 46:654
Partition Congruence Index (PCI)
◦ Brower 2006: Cladistics 22:378
Hidden Bremer Support (HBS)
◦ Gatesy et al 1999: Cladistics 15:271

The different data partitions in a data set contribute to the Bremer support (BS) in an additive way
For each node:
◦ A negative Partitioned Bremer support value indicates conflict
◦ A positive Partitioned Bremer support value indicates congruence
[Figure: tree with Bremer support and partitioned values at nodes (values shown: 7; 3,4; 7; -6,13); partitions: Morphology, COI, EF1a, Wgl]

PCI tells us about the magnitude of conflict between data partitions in a combined analysis
PCI is always equal to or less than BS for a given branch
PCI = BS when there is no conflict
PCI is negative when there is low BS because of strong conflicts between data partitions
Brower 2006: Cladistics 22:378-386

Underlying phylogenetic signal can be confounded by homoplasy in separate analyses
Combining datasets can bring out this signal, as homoplasy is largely random noise
Hidden support can be measured using HBS and Partitioned HBS
Hidden support can be defined as increased support for the node of interest in the simultaneous analysis of all data partitions relative to the sum of support for that node in the separate analyses of each partition
For a particular combined data set and a particular node, HBS is the difference between BS for that node in the combined analysis and the sum of BS values for that node from each data partition

With a small dataset, it is probably always best to combine everything
With large datasets (10 or 20 gene regions?) one can find sets of congruent genes and combine them
But!
◦ Is there a biological reason for incongruence, or is it just a property of the data?

Niklas Wahlberg

Saturation
Bias in nucleotide composition
Orthology vs paralogy
Lineage sorting
Lateral Gene Transfer

Saturation
Saturation is due to multiple changes at the same site subsequent to lineage splitting
Models of evolution attempt to infer the missing information through correcting for "multiple hits"
Most data will contain some fast evolving sites which are potentially saturated (e.g. in protein-coding genes, often codon position 3)
In severe cases the data become essentially random and all information about relationships can be lost

Multiple changes at a single site - hidden changes
[Figure: ancestral sequence GGCGCG, Seq 1 AGCGAG and Seq 2 GCGGAC, with the number of changes at each site]
[Figure: pairwise distance calculated from sequences, plotted against time since divergence]

Homoplasy is a problem with molecular data
Elevated rates of molecular evolution in unrelated lineages
Sparse taxon sampling leading to long branches

The classical long-branch attraction example
Based on one gene (18S)
Nardi et al. 2003: Science 299: 1887-1889

Taxon sampling is important
For divergent taxa with few extant species, this can be a problem
More data from different sources
◦ Could be that molecular data are not able to resolve the position of some taxa
◦ Morphological data!

Biased base composition
Do sequences manifest biased base compositions (e.g. thermophilic convergence) or biased codon usage patterns which may obscure phylogenetic signal?
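A quick way to screen for the biased base composition asked about above is to compute the percentage of G + C for each sequence and compare taxa. The following is a minimal sketch, not a definitive implementation. The function name gc_percent is hypothetical, the choice to ignore gaps and ambiguous bases is an assumption, and the toy sequences are short stand-ins borrowed from the saturation example, not real 16S rRNA data.

```python
def gc_percent(seq: str) -> float:
    """Percent G+C among unambiguous bases; gaps, N and other symbols are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    if gc + at == 0:
        raise ValueError("no unambiguous bases in sequence")
    return 100.0 * gc / (gc + at)


# Toy sequences (hypothetical stand-ins, not real 16S rRNA data)
toy = {"Ancestor": "GGCGCG", "Seq 1": "AGCGAG", "Seq 2": "GCGGAC"}
gc = {name: gc_percent(s) for name, s in toy.items()}
for name, value in gc.items():
    print(f"{name}: {value:.1f}% GC")

# A large spread between taxa is a warning sign of compositional bias
spread = max(gc.values()) - min(gc.values())
print(f"spread: {spread:.1f} percentage points")
```

In practice one would run this per taxon over an alignment, ideally separately for all sites and for variable sites, since bias can be stronger at the sites that actually carry the signal.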
% Guanine + Cytosine in 16S rRNA genes
%GC given as: all sites, variable sites, parsimony sites
Thermophiles:
◦ Thermotoga maritima: 62, 72, 73
◦ Thermus thermophilus: 64, 72, 70
◦ Aquifex pyrophilus: 65, 73, 71
Mesophiles:
◦ Deinococcus radiodurans: 55, 52, 48
◦ Bacillus subtilis: 55, 50, 38

A case study in phylogenetic analysis: Deinococcus and Thermus
Deinococcus are radiation-resistant bacteria
Thermus are thermophilic bacteria
BUT:
◦ Both have the same very unusual cell wall based upon ornithine
◦ Both have the same menaquinones (Mk 9)
◦ Both have the same unusual polar lipids
Congruence between these complex characters supports a phylogenetic relationship between Deinococcus and Thermus

An appropriate method can correct for GC bias
[Figure: three trees of Aquifex, Thermotoga, Thermus, Deinococcus and Bacillus: Jukes & Cantor tree, parsimony tree, LogDet tree]

Orthology and paralogy
Are the sequences being generated from different species the same (homologous)?
Gene duplication
◦ duplicate gene degenerates
◦ duplicate gene acquires new function
A problem that is particularly acute currently, as we search for new genes

Orthology: gene trees and species trees
[Figure: gene phylogeny (a, b, c) alongside the organism phylogeny (A, B, C): ORTHOLOGY]
Darwin's theory reinterpreted homology as common ancestry.
[Figure: an ancestral sequence (ATCGGCCACTTTCGCGATCA) diverging through successive substitutions into two homologous sequences, ACCGGCCACCTTCGCGATCG and ATAGGGCAGTCTCGCGATTA]

Orthologs arise by speciation
[Figure: speciation event; the sequence in the ancestral organism (ATCGGCCACTTTCGCGATCA) gives rise to orthologous sequences (ATAGGGCAGTCTCGCGATTA and ACCGGCCACCTTCGCGATCG) in modern species A and B]
Orthologs are "evolutionary counterparts" (Koonin 2001)

Paralogs arise by duplications
[Figure: duplication event; the sequence in the ancestral organism (ATCGGCCACTTTCGCGATCA) gives rise to paralogous sequences (ATAGGGCAGTCTCGCGATTA and ACCGGCCACCTTCGCGATCG) as modern duplicates A and B]

An evolutionary tale…
[Figure: gene tree with a duplication of A in human and a duplication of A in worm]
Sonnhammer & Koonin (2002) TIGs 18: 619-620

Evolutionary Relationships
The yeast gene is orthologous to all worm and human genes, which are all co-orthologous to the yeast gene
All genes in the HA* set are co-orthologous to all genes in the WA* set
The genes HA* are hence 'inparalogs' to each other when comparing human to worm.
By contrast, the genes HB and HA* are 'outparalogs' when comparing human with worm
HB and HA*, and WB and WA*, are inparalogs when comparing with yeast, because the animal-yeast split pre-dates the HA*-HB duplication
[Figure: gene tree with duplication and speciation events marked]

Paralogy can produce misleading trees
[Figure: gene phylogenies and organism phylogeny (A, B, C); after a gene duplication each species carries copies 1 and 2 (a1, a2, b1, b2, c1, c2); a tree built from the starred copies a1*, b2*, c1* is labelled "Misleading tree": PARALOGY]

Ancient gene duplications can be used to root the tree of life
[Figure: ancestral elongation factor gene; gene duplication prior to the split into 3 domains of life, giving EF-Tu/1-alpha and EF-2/EF-G, which are paralogues of each other]
Sequences from one paralogue can be used to root a tree formed using sequences from the other and vice versa

Lineage sorting
Gene trees may not be the same as species trees
Extant populations may retain ancestral polymorphisms
Species level phylogenies should never sample single individuals of different species
Implicit assumption in many studies using mtDNA
The mode of speciation can now be studied using DNA sequences
Theoretical studies predict that DNA lineages pass through several phases in a species

The assumption: monophyly
[Figure: lineages A and B descending through time from an ancestral gene pool, each monophyletic]

Paraphyly can occur when one population in a set of locally panmictic populations speciates
Polyphyly occurs when a highly polymorphic population is subdivided
Can be highly informative of the history of divergence
[Figure: paraphyly of lineages A and B descending through time from an ancestral gene pool]
[Figure: polyphyly of lineages A and B through time]

An empirical example: Phyciodes butterflies
Wahlberg et al. 2003. Syst Ent 28:257-273
[Figure: tree of Phyciodes individuals with support values at nodes; tip labels give species and subspecies (vesta, picta, pallescens, orseis, pallida, mylitta, phaon, pulchella, batesii, cocyta, tharos), specimen codes and localities]

Paraphyly of a species can be due to incomplete lineage sorting and/or secondary gene flow
[Figure: lineage sorting over time; G = generations, starting with ten unrelated females at G = 0]

Lateral gene transfer
Widespread in single-celled organisms
◦ Even between distantly related lineages
In multi-celled organisms more a problem in closely related species
◦ hybridization
Is the Tree of Life really a Web of Life?

These "problems" are highly interesting phenomena in themselves!
When the different factors are taken into account, they can be informative about evolutionary history
"When in doubt, get more data" - Brooks and McLennan 2002