Introduction to MEGA

Download at: http://www.megasoftware.net/index.html
Manual at: www.megasoftware.net/mega4

Thomas Randall, PhD
tarandal@email.unc.edu

Use of phylogenetic analysis software tools

Bioinformatics software for biologists in the genomics era. Sudhir Kumar and Joel Dudley. Bioinformatics 23: 1713-1717

Fig 1(B) Relative impacts of evolutionary analysis software packages over the last 10 years. Only non-commercial software packages available on-line (without fee) are included, except for two available for a nominal fee (shown with dashed line). Data for both panels were obtained from the Web of Science (February 2007 edition). For panel B, the numbers of new citations were generated using the ‘Cited References’ facility with the search arguments for author name, cited work and citation year kindly provided by Joe Felsenstein for MEGA (www.megasoftware.net), PAUP (paup.csit.fsu.edu), PHYLIP (evolution.genetics.washington.edu/phylip.html), MrBayes (mrbayes.csit.fsu.edu), Puzzle (www.tree-puzzle.de), PhyML (atgc.lirmm.fr/phyml) and PAML (abacus.gene.ucl.ac.uk/software/paml.html).

MEGA contains all elements necessary for building a tree
• Import and editing of sequences/chromatograms
• Clustalw for alignment
• Various options for constructing a phylogeny
• Several options for assessing statistical significance
• Tree viewing function

Basic steps to build a phylogeny
1. Import and align sequences
2. Select tree building option
3. Select distance matrix
4. Choose type of bootstrapping
5. Manipulate tree with tree viewer

Phylogeny options in MEGA4
• UPGMA (distance method)
• Neighbor joining (distance method)
• Minimum evolution (distance method)
• Maximum parsimony

General rules
Build the tree with two independent methodologies for confirmation; in MEGA, use one distance method plus parsimony.
Maximum parsimony is less effective for more distantly related sequences due to homoplasy (multiple substitutions at the same site can accumulate over time).

WARNING: “Phylogenetics has a long history of heated arguments about the relative merits of different methods – researchers in the field seem preadapted for ideological warfare –” Huelsenbeck et al., Syst. Biol. 51: 673

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
• UPGMA employs a sequential clustering algorithm, in which pairwise distances between sequences are computed and the phylogenetic tree is built in a stepwise manner. We first identify from among all the sequences the two that are most similar to each other and then treat these as a new single branch. We then recompute the distances from this new cluster to the remaining sequences as averages, identify the next most similar pair (of sequences or clusters), and so on.
• Assumes equal evolutionary rates (a clock)

Neighbor-Joining
An algorithm for constructing phylogenetic trees using distance data. Once a distance measurement between a set of sequences has been determined, a neighbor joining algorithm finds a pair of neighbors, groups them, then looks for the next pair until all sequences are fit into a tree. At each step the pair joined is the one that gives the smallest total branch length for the tree, which is not necessarily the two closest sequences. Different algorithms for doing this have been written that either do or do not consider evolutionary distance.
Examples: clustalw, neighbor (PHYLIP)
Difference between this and UPGMA: neighbor joining does not assume a constant evolutionary rate in all lineages.

Minimum evolution
All possible trees are produced; the tree with the smallest total branch length is chosen as the best tree.
Branch length is proportional to the distance between sequences.

Maximum Parsimony
– The selection of the phylogenetic tree requiring the least number of substitutions from among all possible phylogenetic trees as the most likely to be the true phylogenetic tree.
– Usefulness declines with increasing evolutionary distance

Johns Hopkins University - Fall 2003, Phylogenetics & Computational Genomics - 410.640.71

Informative sites in parsimony

Site    OTU 1   OTU 2   OTU 3   OTU 4
1       T       T       T       T
2       C       T       T       T
3       A       A       C       C
4       G       G       G       T
5       A       A       A       A
6       T       A       T       A
7       C       C       C       G
8       T       T       G       G
9       A       A       A       A
10      G       G       G       C

• Invariant sites are not used in parsimony (they yield no information on character state changes).
• Informative sites (at least two different kinds of residues, each present at least two times) are used by parsimony because they discriminate between topologies, i.e. different topologies require different numbers of changes between residues.
• Singleton sites cannot be used to discriminate between topologies (they require 1 change for all topologies).

Maximum parsimony (MP) options
• Exhaustive
Not an option here. All possible trees are searched; in practice this takes too much time, so various shortcuts (branch and bound, heuristic) have been developed.
• Branch and bound
This is a method of searching through tree space in order to find optimal trees. It is not exhaustive: trees with a total length longer than those already examined are not considered, reducing the complexity of the search. Guaranteed to find all MP trees. Becomes time consuming if more than 20 sequences are considered.
• Heuristic
An approximate search, still using a branch and bound approach but making more assumptions. More useful for larger trees, but there is no guarantee of finding the MP tree with the shortest length.
• CNI (Close-Neighbor-Interchange)
In any method, examining all possible topologies is very time consuming.
This algorithm reduces the time spent searching by first producing a temporary tree, and then examining all of the topologies that are different from this temporary tree by a topological distance of dT = 2 and 4. If this is repeated many times, and all the topologies previously examined are avoided, one can usually obtain the tree being sought. *

Statistical tests of significance

Bootstrapping *
This is a method of attempting to estimate confidence levels of inferred relationships. The bootstrap proceeds by resampling the original data matrix with replacement of the characters. It is analogous to cutting the data matrix into individual columns of data and throwing the characters into a hat. A character is then drawn at random from this hat and it becomes the first character of the new data matrix. The character is then replaced in the hat, the hat is shaken, and another character is drawn from the hat. This process is repeated until the new pseudoreplicate is the same size as the original. Some characters will be sampled more than once and some will not be sampled at all. This process is repeated many times (say, 100-1,000) and a phylogeny is reconstructed each time. After the bootstrap procedure is finished, a majority-rule consensus tree is constructed from the optimal tree from each bootstrap sample. The bootstrap support for any internal branch is the number of times it was recovered during the bootstrapping procedure.

Interior Branch Test
Similar to bootstrapping, but unwieldy with a large number of taxa. A t-test, which is computed using the bootstrap procedure, is constructed based on the interior branch length and its standard error; it is available only for the NJ and Minimum Evolution trees. MEGA shows the confidence probability in the Tree Explorer; if this value is greater than 95% for a given branch, then the inferred length for that branch is considered significantly positive.
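The "characters in a hat" description of bootstrapping can be sketched in a few lines of Python. This is only an illustration of the resampling step, not MEGA's implementation; the function name bootstrap_alignment and the toy alignment are made up for this example, and the tree-building step that would follow each pseudoreplicate is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one pseudoreplicate by drawing alignment columns with replacement."""
    n_sites = len(alignment[0])
    # Draw n_sites column indices; the same column may be drawn more than once.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    replicate = ["".join(seq[i] for i in picks) for seq in alignment]
    return replicate, picks

# Toy alignment (illustrative only): 4 sequences, 10 sites.
toy = ["TCAGATCTAG", "TTAGAACTAG", "TTCGATCGAG", "TTCTAAGGAC"]
rng = random.Random(1)  # fixed seed so the example is repeatable
replicate, picks = bootstrap_alignment(toy, rng)

# Columns that were never drawn are simply absent from this pseudoreplicate.
unsampled = sorted(set(range(len(toy[0]))) - set(picks))
print(replicate)
print("columns never drawn:", unsampled)
```

In a real analysis this step would be repeated 100-1,000 times, a tree would be built from each pseudoreplicate, and the support for a branch would be the number of those trees in which it appears.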
Other phylogeny software
• PHYLIP
• MrBayes: Bayesian Inference of Phylogeny
• TREE-PUZZLE 5.2: Maximum likelihood analysis
MEGA4 has no ability to do either maximum likelihood analysis or Bayesian inference. These are more sophisticated and computationally intensive (and can be more accurate for distantly related sequences).

Distance
Distance is a phylogenetic method that considers the additive differences between either nucleotides or amino acids along the entire length of sequence. A distance measurement is made considering each type of substitution (either transversion or transition) weighted differently, depending on the distance algorithm and weighting matrix used. As distances are re-computed for all possible pairs of sequences during each step of the assembly, this can be computationally intensive.

Example alignment (PHYLIP format: 12 sequences, 898 sites; first 50 sites shown)

12 898
Homo_sapie  AAGCTTCACC GGCGCAGTCA TTCTCATAAT CGCCCACGGG CTTACATCCT
Pan         AAGCTTCACC GGCGCAATTA TCCTCATAAT CGCCCACGGA CTTACATCCT
Gorilla     AAGCTTCACC GGCGCAGTTG TTCTTATAAT TGCCCACGGA CTTACATCAT
Pongo       AAGCTTCACC GGCGCAACCA CCCTCATGAT TGCCCATGGA CTCACATCCT
Hylobates   AAGCTTTACA GGTGCAACCG TCCTCATAAT CGCCCACGGA CTAACCTCTT
Macaca_fus  AAGCTTTTCC GGCGCAACCA TCCTTATGAT CGCTCACGGA CTCACCTCTT
M_mulatta   AAGCTTTTCT GGCGCAACCA TCCTCATGAT TGCTCACGGA CTCACCTCTT
M_fascicul  AAGCTTCTCC GGCGCAACCA CCCTTATAAT CGCCCACGGG CTCACCTCTT
M_sylvanus  AAGCTTCTCC GGTGCAACTA TCCTTATAGT TGCCCATGGA CTCACCTCTT
Saimiri_sc  sAAGCTTCAC CGGCGCAATG ATCCTAATAA TCGCTCACGG GTTTACTTCG
Tarsius_sy  aAAGTTTCAT TGGAGCCACC ACTCTTATAA TTGCCCATGG CCTCACCTCC
Lemur_catt  AAGCTTCATA GGAGCAACCA TTCTAATAAT CGCACATGGC CTTACATCAT

[Table: 12 x 12 distance matrix for the taxa above (Homo_sapie through Lemur_catt); a matrix listing all pairwise differences]

DNA Distance matrices
[Figure: substitutions among the four nucleotides A, G, C, T]

Jukes-Cantor distance
In the Jukes-Cantor model, the rate of nucleotide substitution is the same for all pairs of the four nucleotides A, T, C, and G.
Many more models exist, with increasing complexity.

Distance matrices in MEGA

Kimura 2-parameter distance
Kimura’s two parameter model corrects for different substitution rates between transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or the reverse).

Tamura-Nei distance
The Tamura-Nei model (1993) corrects for multiple hits, taking into account the differences in substitution rate between nucleotides and the inequality of nucleotide frequencies. It distinguishes the transitional substitution rate between purines from the transitional substitution rate between pyrimidines, and treats transversions separately. It also assumes equality of substitution rates among sites (see related gamma model).
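To make the difference between an uncorrected distance and a model-corrected distance concrete, here is a minimal Python sketch that computes the p-distance, the Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3), and the Kimura 2-parameter distance d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. The function names are chosen for this example and are not part of MEGA; the two sequences are the first 50 sites of Homo_sapie and Pan from the example alignment, so the numbers will not match a distance computed over all 898 sites.

```python
import math

PURINES = {"A", "G"}

def count_differences(seq1, seq2):
    """Return (sites compared, transitions, transversions) for two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    """
    n = ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        n += 1
        if a == b:
            continue
        # Same chemical class (both purines or both pyrimidines) = transition.
        if (a in PURINES) == (b in PURINES):
            ts += 1
        else:
            tv += 1
    return n, ts, tv

def p_distance(seq1, seq2):
    """Proportion of compared sites that differ (no correction)."""
    n, ts, tv = count_differences(seq1, seq2)
    return (ts + tv) / n

def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: corrects p for multiple hits, equal rates."""
    p = p_distance(seq1, seq2)
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(seq1, seq2):
    """Kimura 2-parameter distance: separate transition/transversion rates."""
    n, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / n, tv / n
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

# First 50 sites of two sequences from the example alignment.
homo = "AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT"
pan = "AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT"

print("sites, transitions, transversions:", count_differences(homo, pan))
print("p-distance:  ", round(p_distance(homo, pan), 4))
print("Jukes-Cantor:", round(jukes_cantor(homo, pan), 4))
print("Kimura 2-p:  ", round(kimura_2p(homo, pan), 4))
```

For closely related sequences like these the three values are close to each other; the corrections matter more as the sequences diverge.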
Also:
• # differences
• Tamura 3-parameter
• LogDet

Which DNA distance matrix is appropriate? *
– When the Jukes-Cantor estimate of the number of nucleotide substitutions per site (d) between different sequences is about 0.05 or less (d < 0.05), use the Jukes-Cantor distance whether there is a transition/transversion bias or not, and whether the substitution rate (λ) varies with nucleotide site or not. In this case, the Kimura distance or the gamma distance gives essentially the same value as the Jukes-Cantor distance. One may also use the p-distance for constructing a topology.
– When 0.05 < d < 0.3, use the Jukes-Cantor distance unless the transition/transversion ratio (R) is high, say R > 5. When this ratio is high and the number of nucleotides examined is large (>10K), use the Kimura distance or the gamma distances for Kimura's 2-parameter model.
– When 0.3 < d < 1 and there is evidence that λ varies extensively with site, use gamma distances. In general, one may choose different gamma distances, estimating the gamma parameter a from the data.
– When 0.3 < d < 1 and the frequencies of the four nucleotides (A, T, C, G) deviate substantially from equality but there is no strong transition/transversion bias, use the Tajima-Nei distance. When there are strong transition/transversion and G+C content biases, use the Tamura or Tamura-Nei distance.
– When d > 1 for many pairs of sequences, the phylogenetic tree estimated is not reliable for a number of reasons (e.g., large standard errors of d's and sequence alignment errors). We therefore suggest that these sets of data should not be used.

Protein Distance matrices in MEGA
• p-distance
This distance is the proportion (p) of amino acid sites at which the two sequences to be compared are different. It is obtained by dividing the number of amino acid differences by the total number of sites compared. It does not make any correction for multiple substitutions at the same site or differences in evolutionary rates among sites.
• Equal Input Model (Amino acids)
In real data, frequencies usually vary among different kinds of amino acids. In this case, the correction based on the equal input model gives a better estimate of the number of amino acid substitutions than the Poisson correction distance. Note that this assumes an equality of substitution rates among sites and the homogeneity of substitution patterns between lineages.
• Poisson correction
The Poisson correction distance assumes equality of substitution rates among sites and equal amino acid frequencies while correcting for multiple substitutions at the same site.
• PAM & JTT
The PAM and JTT distances correct for multiple substitutions based on a model of amino acid substitution described as substitution-rate matrices.

* ModelTest does a likelihood analysis on your data to determine the most appropriate DNA substitution matrix.
WARNING: only for advanced users; also requires PAUP for an input.

FindModel – web based version of ModelTest
Input is a concatenated fasta file
http://hcv.lanl.gov/content/hcv-db/findmodel/findmodel.html

FindModel output

Result:
MODEL CONSIDERED:
JC : Jukes-Cantor (model 1) AIC1 = 27875.89594 lnL = -13937.947970
JC+G : Jukes-Cantor plus Gamma (model 3) AIC3 = 27877.899848 lnL = -13937.949924
F81 : Felsenstein 1981 (model 5) AIC5 = 27352.654274 lnL = -13673.327137
F81+G : Felsenstein 1981 plus Gamma (model 7) AIC7 = 27354.660556 lnL = -13673.330278
K80 : Kimura 2-parameter (model 9) AIC9 = 27871.085794 lnL = -13934.542897
K80+G : Kimura 2-parameter plus Gamma (model 11) AIC11 = 27872.977786 lnL = -13934.488893
HKY : Hasegawa-Kishino-Yano (model 13) AIC13 = 27336.418362 lnL = -13664.209181
HKY+G : Hasegawa-Kishino-Yano plus Gamma (model 15) AIC15 = 27338.425764 lnL = -13664.212882
TrN : Tamura-Nei (model 21) AIC21 = 27338.336148 lnL = -13664.168074
TrN+G : Tamura-Nei plus Gamma (model 23) AIC23 = 27340.335138 lnL = -13664.167569
GTR : General Time Reversible (model 53) AIC53 = 27342.287716 lnL = -13663.143858
GTR+G : General Time Reversible plus Gamma (model 55) AIC55 = 27344.30355 lnL = -13663.151775

AIC-SELECTED MODEL: HKY : Hasegawa-Kishino-Yano (model 13) lnL = -13664.209181 AIC = 27336.418362

AIC = Akaike Information Criterion
lnL = log of the maximum likelihood
AICi = -2 ln Li + 2 ki, where ki is the number of free parameters in model i
Model favored is the one with the lowest AIC

DNA Substitution models in FindModel
Reduced set:
JC : Jukes-Cantor (model 1)
JC+G : Jukes-Cantor plus Gamma (model 3)
F81 : Felsenstein 1981 (model 5)
F81+G : Felsenstein 1981 plus Gamma (model 7)
K80 : Kimura 2-parameter (model 9)
K80+G : Kimura 2-parameter plus Gamma (model 11)
HKY : Hasegawa-Kishino-Yano (model 13)
HKY+G : Hasegawa-Kishino-Yano plus Gamma (model 15)
TrN : Tamura-Nei (model 21)
TrN+G : Tamura-Nei plus Gamma (model 23)
GTR : General Time Reversible (model 53)
GTR+G : General Time Reversible plus Gamma (model 55)
Red indicates models available in MEGA.
If a model in black is suggested, use the one immediately below.
If GTR is suggested, use LogDet.

parallelized clustalw: http://cbsuapps.tc.cornell.edu/clustalw.aspx
non parallelized clustalw: http://inquiry.unc.edu/inquiry/

Many MSA algorithms
PLOS Comp. Biol. 3: e123

Alternative alignment tools
FACT: in published comparisons between alignment tools, clustalw usually comes out close to the bottom.
• T Coffee – better, more computationally intensive
• Muscle – better, less intensive than T Coffee
• Promals – designed to optimize alignment for distantly related sequences
Outputs for the above need to be put in an appropriate format (.aln, .phy, .nex).
http://prodata.swmed.edu/promals/promals.php
http://www.drive5.com/muscle/
http://cbsuapps.tc.cornell.edu/t_coffee.aspx

Displaying extensions on a PC
• My Computer > Tools > Folder Options > View > uncheck “Hide Extensions…”
• Also, Control Panels > Folder Options > View > uncheck “Hide Extensions…”

Test data sets
Nature 442: 37
Science 320: 499

Computing d
1) Compute Jukes-Cantor distance; examine distance matrix.
If d < 0.05, stop and use the Jukes-Cantor substitution model.
2) If 0.05 < d < 0.3, check R also: use the Kimura 2-parameter option for computing d, change the “Substitutions to Include” option from d: transitions + transversions to R = s/v, and calculate.
3) Choose the model based on the guide on the previous page.

Analysis Preferences: Setting up an analysis
User defined options

Analysis Preferences (Distance Computation)
Substitution Model - In this set of options, you choose the various attributes of the substitution models.
• Model - Here you select a stochastic model for estimating evolutionary distance by clicking on the ellipses to the right of the currently selected model (click on the lime square to select this row first). This will reveal a menu containing many different distance methods and models.
• Substitutions to Include - Depending on the distance model or method selected, the evolutionary distance can be teased into two or more components. By clicking on the drop-down button (first click on the lime square to select this row), you will be provided with a list of components relevant to the chosen model.
• Transition/Transversion Ratio - This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion ratio (R).
• Pattern among Lineages - This option becomes available if the selected model has formulas that allow the relaxation of the assumption of homogeneity of substitution patterns among lineages.
• Rates among Sites - This option becomes available if the selected distance model has formulas that allow rate variation among sites. If you choose gamma-distributed rates, then the Gamma parameter option becomes visible.

Treatment of gaps
Gaps often are inserted during the alignment of homologous regions of sequences and represent deletions or insertions (indels). They introduce some complications in distance estimation.
Furthermore, sites with missing information sometimes result from experimental difficulties; they present the same alignment problems as gaps. In the following discussion, both of these situations are treated in the same way.

In MEGA, there are two ways to treat gaps. One is to delete all of these sites from the data analysis. This option, called Complete-Deletion, is generally desirable because different regions of DNA or amino acid sequences evolve under different evolutionary forces. The second method is relevant if the number of nucleotides involved in a gap is small and if the gaps are distributed more or less randomly. In that case it may be possible to compute a distance for each pair of sequences, ignoring only those gaps that are involved in the comparison; this option is called Pairwise-Deletion. The following table illustrates the effect of these options on distance estimation with the following three sequences:
• Complete-Deletion *
• Pairwise-Deletion

Uniform Rates vs. Gamma distribution
Ignore this option, as MEGA has no way to calculate “a”, the shape parameter of the gamma distribution.
A gamma distribution models variation in substitution rate among sites: when a = 1, rate variation among sites is very high; when a = infinity, all sites evolve at the same rate.

Tree Explorer
• Save tree as .emf file (for ppt or word)
[Figure: example tree of H5N1 isolates with support values at the nodes]
• Save tree as .nwk file (for opening in other tree viewers)
((((northNigeria:0.00240616,((turkey_Turkey2005:0.00314559,swan_Czech2006:0.00255065)0.96:0.00324648,(mallard_Bavaria2006:0.00405158,swan_Mongolia2005:0.00164881)0.27:0.00003413)0.27:0.00003360)0.57:0.00084134,swan_Astrakhan2005:0.00402428)0.49:0.00081094,turkey_Suzdalka2005:0.00717544)0.37:0.00028133,swan_Iran2006:0.00464073,mallard_Italy2005:0.09386135);
• Save tree as .mts file (for opening in MEGA)

[Figure: trees inferred from the same dataset with four programs. Dataset from Nature 442: 37, Multiple introductions of H5N1 in Nigeria.]

Program          Analysis                              Run time
Mr Bayes         1,000,000 generations                 1.5 hrs, cluster
MEGA             NJ, bootstrapping                     <1 min, laptop 1G RAM
PHYLIP dnapars   bootstrapping                         30 min, laptop 1G RAM
Tree-Puzzle      maximum likelihood, 10,000 steps      <1 min, laptop 1G RAM

Tree Explorer
• Condensed Trees
When several interior branches of a phylogenetic tree have low statistical support (PC or PB) values, it often is useful to produce a multifurcating tree by assuming that all such interior branches have a branch length equal to 0. We call this multifurcating tree a condensed tree. In MEGA, condensed trees can be produced for any level of PC or PB value. For example, if there are several branches with PC or PB values of less than 50%, a condensed tree with the 50% PC or PB level will be a multifurcating tree with the lengths of all those branches reduced to 0.
• Consensus Tree
The MP method often produces many equally parsimonious trees. Choosing this command produces a composite tree that is a consensus among all such trees, either as a strict consensus, in which all conflicting branching patterns among the trees are resolved by making those nodes multifurcating, or as a Majority-Rule consensus, in which conflicting branching patterns are resolved by selecting the pattern seen in more than 50% of the trees.

Importing trees from other phylogenetic tools
Work – outtrees from phylip, .dnd and .phb files from clustalw, TreePuzzle, Mr Bayes (.con file needs a little processing)

MEGA4 Caption View
The Caption function gives a publication quality summary of the analysis, and suggested references for publication.

About authors
• Gene Duplication and Nucleotide Substitution in Evolution. Masatoshi Nei. Nature 221: 40
• Evolution by the Birth-and-Death Process in Multigene Families of the Vertebrate Immune System. Nei, M., et al. Proc. Natl. Acad. Sci. USA 94: 7799
• MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Sudhir Kumar, K Tamura, and M Nei. Briefings in Bioinformatics 5: 150-163
• The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Naruya Saitou and Masatoshi Nei. Mol. Biol. Evol. 4: 406

Much of the material in this handout is derived from:
Molecular Evolution and Phylogenetics. M Nei, S Kumar. Oxford Univ. Press, New York, 2000