PHYLIP (Phylogeny Inference Package)
Version 3.57c

by Joseph Felsenstein

July, 1995


COPYRIGHT NOTICE

(c) Copyright 1986-1995 by Joseph Felsenstein and the University of Washington. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.


CONTENTS OF THIS DOCUMENT

Copyright notice
Contents of this document
General description of PHYLIP
Contents of this package
What the programs do
Overview of the input and output formats
   Input File Format
   The Options Menu
   The Output File
   The Tree File
The Options and How to Invoke Them
   Options Information in the Input File
   Common Options in the Menu
      The U (User Tree) option
      The G (Global) option
      The J (Jumble) option
      The O (Outgroup) option
      The T (Threshold) option
      The M (multiple data sets) option
      The option to write out the trees into a tree file
      The (0) terminal type option
   Common Options Requiring Information in the Input File
      The Weights option
The Algorithm for Constructing Trees
   Local Rearrangements
   Global Rearrangements
   Multiple Jumbles
   Strategy for Finding the Best Tree
A Warning on Interpreting Results
Relative Speed of Different Programs and Machines
   Relative speed of the different programs
   Speed with different numbers of species
   Relative speed of different machines
   Published benchmarks
Endorsements
General Comments on Adapting the Package to Different Computer Systems
   Compiling the programs
      Using "make"
      Getting PHYLIP onto your microcomputer
      Microsoft Quick C and Microsoft C
      Turbo C++ for PCDOS
      Waterloo C/386
      Think C for Macintosh
      Unix
      VMS VAX systems
      OpenVMS DEC Alpha systems
      Cray
      IBM mainframes running CMS
      Other Computer Systems
Frequently Asked Questions
   "If I copied PHYLIP from a friend without you knowing, ...?"
   "How do I make a citation to the PHYLIP package ...?"
   "How do I bootstrap? Why has DNABOOT disappeared?"
   "How do I specify a multi-species outgroup? ..."
   "How do I force certain groups to remain monophyletic ...?"
   "How can I reroot one of the trees written out by PHYLIP?"
   "Why doesn't NEIGHBOR read my DNA sequences correctly?"
   "What do I do about deletions and insertions in my sequences?"
   "Why don't your parsimony programs print out branch lengths?"
   "Why can't your programs handle unordered multistate characters?"
   "Where can I get a printed version of the PHYLIP documents?"
   "Why have I been dropped from your newsletter mailing list?"
   "How many copies of PHYLIP have been distributed?"
Additional Frequently Asked Questions, or: "Why didn't it occur to you to..."
   "... write these programs in Pascal?"
   "... forget about all those inferior systems and just develop PHYLIP for Unix?"
   "... write these programs in PROLOG (or Ada, or Modula-2, or Simula, or ...)?"
   "... include in the package a program to do the Distance Wagner method ... ?"
   "... include in the package ordination methods and more clustering algorithms?"
   "... include in the package a program to do nucleotide sequence alignment ...?"
   "... send me the programs over the electronic network I use, BUTTERFLYNET?"
   "... let me log in to your computer in Seattle and copy the files ....?"
   "... send me a listing of your program?"
   "... write a magnetic tape in our computer center's favorite format ....?"
   "... give us a version of these in FORTRAN?"
New Features in Recent Versions
Coming Attractions, Future Plans
References for the Documentation Files
Credits
Other phylogeny programs available elsewhere
   PAUP
   MacClade
   Hennig86
   Random Cladistics
   RNA
   ClaDOS
   MEGA
   TREECON
   MOLPHY
   fastDNAml
   PAML
   PEEWEE/NONA
   ODEN
   MacT
   Vostorg
   Wetzel/Huson programs
   Evomony
   Molevol
   PARBOOT
   Zharkikh programs
   Turbotree/Hadtree
   TreeAlign
   ClustalW
   MALIGN
   COMPONENT
   CAIC
   ABLE
   CLINCH
   COMPROB
   MARKOV
   RAPDistance
   MULTICOMP
   RSVP
   PHYSYS
   SINCAIDEN
   MUST
   GDE
   TreeTool
   NJPlot
How You Can Help Me
In case of trouble


PHYLIP - Phylogeny Inference Package (version 3.5)

This is a FREE package of programs for inferring phylogenies and carrying out certain related tasks.
At present it contains 30 programs, which carry out different algorithms on different kinds of data. The programs in the package are:

---------- Programs for molecular sequence data ----------
PROTPARS   Protein parsimony
DNAPARS    Parsimony method for DNA
DNAMOVE    Interactive DNA parsimony
DNAPENNY   Branch and bound for DNA
DNACOMP    Compatibility for DNA
DNAINVAR   Phylogenetic invariants
DNAML      Maximum likelihood method
DNAMLK     DNA ML with molecular clock
DNADIST    Distances from sequences
PROTDIST   Distances from proteins
RESTML     ML for restriction sites
SEQBOOT    Bootstraps sequence data sets

----------- Programs for distance matrix data ------------
FITCH      Fitch-Margoliash and least-squares methods
KITSCH     Fitch-Margoliash and least squares methods with evolutionary clock
NEIGHBOR   Neighbor-joining and UPGMA methods

-------- Programs for gene frequencies and continuous characters ------
CONTML     Maximum likelihood method
GENDIST    Computes genetic distances
CONTRAST   Computes contrasts and correlations for comparative method studies

------------- Programs for 0-1 discrete state data -----------
MIX        Wagner, Camin-Sokal, and mixed parsimony criteria
MOVE       Interactive Wagner, C-S, mixed parsimony program
PENNY      Finds all most parsimonious trees by branch-and-bound
DOLLOP, DOLMOVE, DOLPENNY
           same as preceding three programs, but for the Dollo and polymorphism parsimony criteria
CLIQUE     Compatibility method
FACTOR     recode multistate characters

---------- Programs for plotting trees and consensus trees -------
DRAWGRAM   Draws cladograms and phenograms on screens, plotters and printers
DRAWTREE   Draws unrooted phylogenies on screens, plotters and printers
CONSENSE   Majority-rule and strict consensus trees
RETREE     Reroots, changes names and branch lengths, and flips trees

There is also an Unsupported Division containing two programs, makeinf and ProtML, which were contributed by others and are maintained by their authors.
The package includes extensive documentation files that provide the information necessary to use and modify the programs.

The programs are written in a very standard subset of C, a language that is available on most computers (including microcomputers). The programs require no modifications to run on most machines: for example they work without modification with Microsoft C, Turbo C, Think C, and on the C compilers available on Unix and VAX VMS systems.

C source code is distributed in the regular version of PHYLIP. To use it, you must have a C compiler. A Pascal version can also be supplied on request. Precompiled executables are available for PCDOS, 386 PCDOS, 386 Windows, PowerMacs, and Macintoshes as described below.

NETWORK DISTRIBUTION: The package is available by "anonymous ftp" over electronic networks (including the PCDOS, 386 PCDOS, 386 Windows, and Macintosh executables) from evolution.genetics.washington.edu (128.95.12.41). Contact me by electronic mail for details or start by fetching file pub/phylip/Read.Me. European users may (or may not) get faster service from bioss.sari.ac.uk, which mirrors our distribution. Look in directory pub/phylogeny. I can also send the source code and documentation files (but not executables) over Bitnet/EARN and other networks. The easiest method of network distribution is to use our World Wide Web site:

   http://evolution.genetics.washington.edu/phylip.html

DISKETTE DISTRIBUTION: The package is also distributed in a variety of microcomputer diskette formats. You should send FORMATTED diskettes, which I will return with the package written on them. See below for how many diskettes to send. The source code of the programs on the electronic network or magnetic tape versions may of course also be moved to microcomputers and compiled there.

PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS, 386 Windows, 386 PCDOS, and Macintosh systems are available from me. Specify the "386 Windows executable version", "386 PCDOS executable version", "PCDOS executable version" or "Macintosh executable version" and send the number of diskettes indicated below. Source code sent will be in C unless you specify Pascal.

HOW MANY DISKETTES TO SEND: The following table shows for different formats how many diskettes to send, and how many extra diskettes to send for the executable version:

Diskette size     Density         For source code      For executables send
                                  and documentation    in addition

3.5 inch PCDOS    1.44 Mb                1                     3
5.25 inch PCDOS   1.2 Mb                 1                     3
Macintosh         High density           1                     1

Some other formats are also available. You MUST tell me EXACTLY which of these formats you need. The diskettes MUST be formatted by you before being sent to me. Sending an extra diskette may be helpful.

POLICIES: The package is distributed free. It will be written on the diskettes or tape, which will be mailed back. They can be sent to:

   Joe Felsenstein
   Department of Genetics
   University of Washington
   Box 357360
   Seattle, Washington 98195-7360, U.S.A.

   Electronic mail addresses: joe@genetics.washington.edu


CONTENTS OF THIS PACKAGE

The source code and documentation of the package consist of 87 files, plus 4 more for the programs in the Unsupported Division. In the electronic mail version some of these files may be split into parts, so there may be more.

The package is organized into three major parts: the source code, the documentation, and the unsupported programs. The documentation is organized hierarchically, with groups of documentation files for different kinds of data each preceded by a documentation file for the group as well. The "unsupported division" of PHYLIP contains programs contributed by others (and not supported by us) that we feel may be of use to you.
     Files                    Contents
     -----                    --------

 1   README              -- describes the contents of the package
 2   main.doc            -- this general documentation file

The Source code

 3   Makefile            -- the "Makefile" to be used by C's that have "make"
 4   Makefile.qc         -- the Makefile for Microsoft C and Quick C
 5   Makefile.tc         -- the Makefile for Borland Turbo C and Borland C
 6   phylip.h            -- the PHYLIP "header file"
 7   compile.com         -- a VMS command file to compile all of PHYLIP
 8   vaxfix.c            -- procedures needed to fix VMS printf(" %hd ")
 9   protpars.c          -- parsimony for protein sequence data
10   dnapars.c           -- DNA parsimony program
11   dnamove.c           -- interactive DNA parsimony
12   dnapenny.c          -- branch and bound method for DNA
13   dnacomp.c           -- DNA compatibility program
14   dnainvar.c          -- computation of Lake's and Cavender's invariants
15   dnaml.c             -- DNA maximum likelihood program, part 1
16   dnaml2.c            -- DNA maximum likelihood program, part 2
17   dnamlk.c            -- DNA maximum likelihood with molecular clock
18   dnamlk2.c           -- DNA maximum likelihood with clock, part 2
19   dnadist.c           -- computes distance matrix from sequences
20   protdist.c          -- computes distance matrix from sequences
21   restml.c            -- maximum likelihood for restriction sites
22   restml2.c           -- maximum likelihood for restriction sites, part 2
23   seqboot.c           -- makes multiple data sets by bootstrap resampling
24   fitch.c             -- Fitch-Margoliash and least-squares methods
25   kitsch.c            -- F-M, L-S methods with evolutionary clock
26   neighbor.c          -- neighbor-joining and UPGMA methods
27   contml.c            -- maximum likelihood program
28   gendist.c           -- computes genetic distances
29   contrast.c          -- contrasts etc. for comparative method studies
30   mix.c               -- Wagner, Camin-Sokal parsimony and mixtures, part 1
31   mix2.c              -- Wagner, Camin-Sokal parsimony and mixtures, part 2
32   move.c              -- interactive Wagner, Camin-Sokal and mixed parsimony
33   penny.c             -- finds all most parsimonious trees
34   dollop.c            -- Dollo and polymorphism parsimony methods
35   dolmove.c           -- interactive Dollo and polymorphism parsimony
36   dolpenny.c          -- branch and bound for Dollo, polymorphism
37   clique.c            -- compatibility program
38   factor.c            -- recode multistate to binary characters
39   drawgraphics.h      -- header file for drawgraphics.c
40   drawgraphics.c      -- routines used in both drawgram.c and drawtree.c
41   interface.h         -- header for Mac interface
42   interface.c         -- Mac routines used in Mac interface
43   drawgram.c          -- makes plots of cladograms, phenograms
44   drawtree.c          -- makes plots of unrooted phylogenies
45   font1               -- digitized font (simple sans-serif Roman)
46   font2               -- digitized font (medium quality sans-serif Roman)
47   font3               -- digitized font (high quality serifed Roman)
48   font4               -- digitized font (medium quality sans-serif Italic)
49   font5               -- digitized font (high quality serifed Italic)
50   font6               -- digitized font (Russian Cyrillic)
51   consense.c          -- majority-rule and strict consensus trees
52   retree.c            -- reroots, rearranges and changes lengths on trees

The Documentation

53   sequence.doc        -- documentation for molecular sequence programs
54   protpars.doc        -- documentation for protpars.c
55   dnapars.doc         -- documentation for dnapars.c
56   dnamove.doc         -- documentation for dnamove.c
57   dnapenny.doc        -- documentation for dnapenny.c
58   dnacomp.doc         -- documentation for dnacomp.c
59   dnainvar.doc        -- documentation for dnainvar.c
60   dnaml.doc           -- documentation for dnaml.c and dnaml2.c
61   dnamlk.doc          -- documentation for dnamlk.c and dnamlk2.c
62   dnadist.doc         -- documentation for dnadist.c
63   protdist.doc        -- documentation for protdist.c
64   restml.doc          -- documentation for restml.c and restml2.c
65   seqboot.doc         -- documentation for seqboot.c
66   distance.doc        -- documentation for distance matrix programs
67   fitch.doc           -- documentation for fitch.c
68   kitsch.doc          -- documentation for kitsch.c
69   neighbor.doc        -- documentation for neighbor.c
70   contchar.doc        -- documentation for gene frequency and continuous character programs
71   contml.doc          -- documentation for contml.c
72   gendist.doc         -- documentation for gendist.c
73   contrast.doc        -- documentation for contrast.c
74   discrete.doc        -- documentation for discrete character programs
75   mix.doc             -- documentation for mix.c
76   move.doc            -- documentation for move.c
77   penny.doc           -- documentation for penny.c
78   dollop.doc          -- documentation for dollop.c
79   dolmove.doc         -- documentation for dolmove.c
80   dolpenny.doc        -- documentation for dolpenny.c
81   clique.doc          -- documentation for clique.c
82   factor.doc          -- documentation for factor.c
83   draw.doc            -- documentation for tree plotting programs
84   drawgram.doc        -- documentation for drawgram.c
85   drawtree.doc        -- documentation for drawtree.c
86   consense.doc        -- documentation for consense.c
87   retree.doc          -- documentation for retree.c

The Unsupported Division

88   makeinf.doc         -- documentation for makeinf (by Arend Sidow)
89   makeinf.c           -- C source for makeinf
90   protml.doc          -- documentation for ProtML (by Adachi and Hasegawa)
91   protml.pas          -- Pascal source for ProtML


WHAT THE PROGRAMS DO

Here is a short description of each of the programs. For more detailed discussion you should definitely read the documentation file for the individual program and the documentation file for the group of programs it is in.

PROTPARS. Estimates phylogenies from protein sequences (input using the standard one-letter code for amino acids) using the parsimony method, in a variant which counts only those nucleotide changes that change the amino acid, on the assumption that silent changes are more easily accomplished.

DNAPARS. Estimates phylogenies by the parsimony method using nucleic acid sequences. Allows use of the full IUB ambiguity codes, and estimates ancestral nucleotide states. Gaps are treated as a fifth nucleotide state.

DNAMOVE. Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility and the display of reconstructed ancestral bases.
This can be used to find parsimony or compatibility estimates by hand.

DNAPENNY. Finds all most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search. This may not be practical (depending on the data) for more than 10 or 11 species.

DNACOMP. Estimates phylogenies from nucleic acid sequence data using the compatibility criterion, which searches for the largest number of sites which could have all states (nucleotides) uniquely evolved on the same tree. Compatibility is particularly appropriate when sites vary greatly in their rates of evolution, but we do not know in advance which are the less reliable ones.

DNAINVAR. For nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants, which test alternative tree topologies. The program also tabulates the frequencies of occurrence of the different nucleotide patterns. Lake's invariants are the method which he calls "evolutionary parsimony".

DNAML. Estimates phylogenies from nucleotide sequences by maximum likelihood. The model employed allows for unequal expected frequencies of the four nucleotides, for unequal rates of transitions and transversions, and for different (prespecified) rates of change in different categories of sites, with the program inferring which sites have which rates.

DNAMLK. Same as DNAML but assumes a molecular clock. The use of the two programs together permits a likelihood ratio test of the molecular clock hypothesis to be made.

DNADIST. Computes four different distances between species from nucleic acid sequences. The distances can then be used in the distance matrix programs. The distances are the Jukes-Cantor formula, one based on Kimura's 2-parameter method, Jin and Nei's distance which allows for rate variation from site to site, and a maximum likelihood method using the model employed in DNAML. The last of these methods of computing distances can be very slow.

PROTDIST. Computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. The distances can then be used in the distance matrix programs.

RESTML. Estimation of phylogenies by maximum likelihood using restriction sites data (not restriction fragments but presence/absence of individual sites). It employs the Jukes-Cantor symmetrical model of nucleotide change, which does not allow for differences of rate between transitions and transversions. This program is VERY slow.

SEQBOOT. Reads in a data set, and produces multiple data sets from it by bootstrap resampling. Since most programs in the current version of the package allow processing of multiple data sets, this can be used together with the consensus tree program CONSENSE to do bootstrap (or delete-half-jackknife) analyses with most of the methods in this package. This program also allows the Archie/Faith technique of permutation of species within characters.

FITCH. Estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species. Uses the Fitch-Margoliash criterion and some related least squares criteria. Does not assume an evolutionary clock. This program will be useful with distances computed from DNA sequences, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.

KITSCH. Estimates phylogenies from distance matrix data under the "ultrametric" model which is the same as the additive tree model except that an evolutionary clock is assumed. The Fitch-Margoliash criterion and other least squares criteria are assumed.
This program will be useful with distances computed from DNA sequences, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.

NEIGHBOR. An implementation by Mary Kuhner and John Yamato of Saitou and Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage clustering) method. Neighbor Joining is a distance matrix method producing an unrooted tree without the assumption of a clock. UPGMA does assume a clock. The branch lengths are not optimized by the least squares criterion but the methods are very fast and thus can handle much larger data sets.

CONTML. Estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations. Does not assume a molecular clock. An alternative method of analyzing this data is to compute Nei's genetic distance and use one of the distance matrix programs.

GENDIST. Computes one of three different genetic distance formulas from gene frequency data. The formulas are Nei's genetic distance, the Cavalli-Sforza chord measure, and the genetic distance of Reynolds et al. The former is appropriate for data in which new mutations occur in an infinite isoalleles neutral mutation model, the latter two for a model without mutation and with pure genetic drift. The distances are written to a file in a format appropriate for input to the distance matrix programs.

CONTRAST. Reads a tree from a tree file, and a data set with continuous characters data, and produces the independent contrasts for those characters, for use in any multivariate statistics package. Will also produce covariances, regressions and correlations between characters for those contrasts.

MIX. Estimates phylogenies by some parsimony methods for discrete character data with two states (0 and 1). Allows use of the Wagner parsimony method, the Camin-Sokal parsimony method, or arbitrary mixtures of these. Also reconstructs ancestral states and allows weighting of characters.

MOVE. Interactive construction of phylogenies from discrete character data with two states (0 and 1). Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

PENNY. Finds all most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria using the branch-and-bound method of exact search. May be impractical (depending on the data) for more than 10-11 species.

DOLLOP. Estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0 and 1). Also reconstructs ancestral states and allows weighting of characters. Dollo parsimony is particularly appropriate for restriction sites data; with ancestor states specified as unknown it may be appropriate for restriction fragments data.

DOLMOVE. Interactive construction of phylogenies from discrete character data with two states (0 and 1) using the Dollo or polymorphism parsimony criteria. Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

DOLPENNY. Finds all most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria using the branch-and-bound method of exact search. May be impractical (depending on the data) for more than 10-11 species.

CLIQUE. Finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states. The largest clique (or all cliques within a given size range of the largest one) are found by a very fast branch and bound search method. The method does not allow for missing data.
For such cases the T (Threshold) option of MIX may be a useful alternative. Compatibility methods are particularly useful when some characters are of poor quality and the rest of good quality, but when it is not known in advance which ones are which.

FACTOR. Takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1). Written by Christopher Meacham.

DRAWGRAM. Plots rooted phylogenies, cladograms, and phenograms in a wide variety of user-controllable formats. The program is interactive and allows previewing of the tree on PC graphics screens, and Tektronix or DEC graphics terminals. Final output can be on a laser printer (such as the Apple Laserwriter or HP Laserjet), on graphics screens or terminals, on pen plotters (Hewlett-Packard or Houston Instruments) or on dot matrix printers capable of graphics (Epson, Okidata, Imagewriter, or Toshiba).

DRAWTREE. Similar to DRAWGRAM but plots unrooted phylogenies.

CONSENSE. Computes consensus trees by the majority-rule consensus tree method, which also allows one to easily find the strict consensus tree. Does NOT compute the Adams consensus tree. Trees are input in a tree file in standard nested-parenthesis notation, which is produced by many of the tree estimation programs in the package when the Y option is invoked. This program can be used as the final step in doing bootstrap analyses for many of the methods in the package.

RETREE. Reads in a tree (with branch lengths if necessary) and allows you to reroot the tree, to flip branches, to change species names and branch lengths, and then write the result out. Can be used to convert between rooted and unrooted trees.

Programs in the Unsupported Division

The Unsupported Division of PHYLIP consists of two programs that may be useful to you and have kindly been contributed by their authors. Those authors retain full copyright to their programs and documentation files. They are provided in the PHYLIP source code distribution but have not been provided as executables in the executables distribution. All questions about these programs should be directed to their authors, whose electronic mail addresses and regular mail addresses are given in their documentation files.

MAKEINF. This program by Arend Sidow can be used to translate the output files from Jotun Hein's popular multiple-sequence alignment program into PHYLIP input files. It also allows you to selectively analyze different codon positions and different organisms. The output from other alignment programs can rather easily be edited into a form that it will read.

PROTML. This large Pascal program from Jun Adachi and Masami Hasegawa carries out maximum likelihood estimation of phylogenies from protein sequence data. It is quite analogous to DNAML, but instead of a model for DNA evolution it uses the PAM matrix model of Margaret Dayhoff. Because of the larger number of states (20 instead of 4) it is necessarily slower than DNAML by a large factor. However the authors have adopted a different, and faster, rearrangement strategy to search among tree topologies for the best one. ProtML does not yet incorporate the Categories feature of DNAML and DNAMLK which allows different rates of evolution at different sites, without the user specifying in advance which site has which rate of evolution. For support, contact them at the Internet addresses hasegawa@ism.ac.jp and adachi@sunmh.ism.ac.jp at the Institute of Statistical Mathematics, Tokyo, Japan.


OVERVIEW OF THE INPUT AND OUTPUT FORMATS

When you run most of these programs, a menu will appear offering you choices of the various options available for that program. The data that the program reads should be in an input file called (in most cases) "infile". If there is no such file the programs will ask you for the name of the input file. Below we describe the input file format, and then the menu.
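The input-file lookup described above can be sketched as follows. This is only an illustration in Python, not code from the package (which is written in C): the function name choose_input_file, the prompt wording, and the choice to keep asking until an existing file is named are all assumptions of the sketch.

```python
import os


def choose_input_file(default="infile", ask=input, exists=os.path.isfile):
    """Return `default` if it exists; otherwise ask the user for another name.

    `ask` and `exists` are injectable so the behaviour can be tried without
    touching the keyboard or the file system. The prompt text is hypothetical.
    """
    name = default
    while not exists(name):
        # Assumption: the question is repeated until an existing file is named.
        name = ask(f'Cannot find "{name}". Name of the input file? ').strip()
    return name
```

Called with no arguments it looks for "infile" in the current directory and falls back to an interactive prompt.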
Input File Format
----- ---- ------

I have tried to adhere to a rather stereotyped input and output format. For the parsimony, compatibility and maximum likelihood programs, excluding the distance matrix methods, the simplest version of the input file looks something like this:

    6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of species and the number of characters, in free format, separated by blanks (not by commas). The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species. In the discrete-character, DNA and protein sequence programs the characters are each a single letter or digit, sometimes separated by blanks. In the continuous-characters programs they are real numbers with decimal points, separated by blanks:

Latimeria 2.03 3.457 100.2 0.0 -3.7

The conventions about continuing the data beyond one line per species are different between the molecular sequence programs and the others. The molecular sequence programs can take the data in "aligned" or "interleaved" format, with some lines giving the first part of each of the sequences, then lines giving the next part of each, and so on. Thus the sequences might look like this:

    6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC

Note that in these sequences we have a blank every ten sites to make them easier to read: any such blanks are allowed. The blank line which separates the two groups of lines (the ones containing sites 1-20 and ones containing sites 21-39) may or may not be present, but if it is, it should be a line of zero length and not contain any extra blank characters (this is because of a limitation of the current versions of the programs). It is important that the number of sites in each group be the same for all species (i.e., it will not be possible to run the programs successfully if the first species line contains 20 bases, but the first line for the second species contains 21 bases).

Alternatively, an option can be selected to take the data in "sequential" format, with all of the data for the first species, then all of the characters for the next species, and so on. This is also the way that the discrete characters programs and the gene frequencies and quantitative characters programs want to read the data. They do not allow the "interleaved" format.

In the sequential format, the character data can run on to a new line at any time (except in a species name or in the case of continuous character and distance matrix programs where you cannot go to a new line in the middle of a real number). Thus it is legal to have:

Archaeopt 001100
1101

or even:

Archaeopt 
0011001101

though note that the FULL ten characters of the species name MUST then be present: in the above case there must be a blank after the "t". In all cases it is possible to put internal blanks between any of the character values, so that

Archaeopt 0011001101 0111011100

is allowed.

If you make an error in the input file, the programs will often detect that they have been fed an illegal character or illegal numerical value and issue an error message such as "BAD CHARACTER STATE:", often printing out the bad value, and sometimes the number of the species and character in which it occurred. The program will then stop shortly after.
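The layout rules above can be sketched as two small readers. This is an illustrative Python sketch, not the package's C code: the function names, the `allowed` parameter, and every error message other than the quoted "BAD CHARACTER STATE:" prefix are assumptions, and the readers are stricter than the real programs may be (for example, they reject extra characters outright).

```python
NAME_WIDTH = 10  # the species name always occupies the first ten columns


def read_header(line):
    """Return (number of species, number of characters) from the first line."""
    n_species, n_chars = (int(tok) for tok in line.split()[:2])
    return n_species, n_chars


def read_sequential(text, allowed):
    """Sequential format: all characters of one species, then the next."""
    lines = text.splitlines()
    n_species, n_chars = read_header(lines[0])
    pos, data = 1, {}
    for sp in range(1, n_species + 1):
        if pos >= len(lines):
            raise ValueError("input ended before all species were read")
        line, pos = lines[pos], pos + 1
        if len(line) < NAME_WIDTH:
            raise ValueError(f"species {sp}: name must fill all {NAME_WIDTH} columns")
        name, rest = line[:NAME_WIDTH].rstrip(), line[NAME_WIDTH:]
        states = []
        while True:
            for ch in rest:
                if ch == " ":
                    continue  # internal blanks are allowed anywhere
                if ch not in allowed:
                    raise ValueError(f"BAD CHARACTER STATE: {ch!r} at character "
                                     f"{len(states) + 1} of species {sp}")
                states.append(ch)
            if len(states) >= n_chars:
                break
            if pos >= len(lines):
                raise ValueError(f"species {sp}: input ended too early")
            rest, pos = lines[pos], pos + 1  # data may run on to a new line
        if len(states) != n_chars:
            raise ValueError(f"species {sp}: too many characters")
        data[name] = "".join(states)
    return data


def read_interleaved(text):
    """Interleaved format: groups of lines, one line per species per group."""
    lines = text.splitlines()
    n_species, n_sites = read_header(lines[0])
    body = [ln for ln in lines[1:] if ln != ""]  # only zero-length separators are skipped
    names, seqs = [], [""] * n_species
    for start in range(0, len(body), n_species):
        group = body[start:start + n_species]
        if len(group) != n_species:
            raise ValueError("incomplete group of lines")
        if start == 0:  # only the first group carries the species names
            names = [ln[:NAME_WIDTH].rstrip() for ln in group]
            group = [ln[NAME_WIDTH:] for ln in group]
        chunks = [ln.replace(" ", "") for ln in group]
        if len({len(c) for c in chunks}) != 1:
            raise ValueError("each group must hold the same number of sites for every species")
        seqs = [s + c for s, c in zip(seqs, chunks)]
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("number of sites does not match the first line")
    return dict(zip(names, seqs))
```

For the discrete-character examples, `read_sequential(text, "01?")` would be a natural call; the set of legal states differs from program to program, which is why it is passed in rather than fixed.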
One of the things which can lead to a bad value is the omission of something earlier in the file, or the insertion of something superfluous, which cause the reading of the file to get out of synchronization. The program then starts reading things it didn't expect, and concludes that they are in error. So if you see this error message, you may also want to look for the earlier problem that may have led to this. The other major variation on the input data format is the options information. Many options are selected using the menu, but a few are selected by including extra information in the input file. Some options are described below. The Options Menu --- ------- ---The menu is straightforward. one is for DNAPARS): It typically looks like this (this DNA parsimony algorithm, version 3.57c Setting for this run: U Search for best tree? J Randomize input order of sequences? O Outgroup root? T Use Threshold parsimony? M Analyze multiple data sets? I Input sequences interleaved? 0 Terminal type (IBM PC, VT52, ANSI)? 1 Print out the data at start of run 2 Print indications of progress of run Yes No. Use input order No, use as outgroup species No, use ordinary parsimony No Yes ANSI No Yes 1 3 4 5 6 Print out tree Print out steps in each site Print sequences at all nodes of tree Write out trees onto tree file? Yes No No Yes Are these settings correct? (type Y or the letter for one to change) If you want to accept the default settings (they are shown in the above case) you can simply type "Y" followed by a carriage-return (Enter) character. If you want to change any of the options, you should type the letter shown to the left of its entry in the menu. For example, to set a threshold type "T". Lower-case letters will also work. For many of the options the program will ask for supplementary information, such as the value of the threshold. Note the "Terminal type" entry, which you will find on all menus. It allows you to specify which type of terminal your screen is. 
The options are an IBM PC screen, an ANSI standard terminal (such as a DEC VT100), a DEC VT52-compatible terminal, such as a Zenith Z29, or no terminal type. Choosing "0" toggles among these four options in cyclical order, changing each time the "0" option is chosen. If one of them is right for your terminal the screen will be cleared before the menu is displayed. If none works the "none" option should probably be chosen. Keep in mind that VT52-compatible terminals can freeze up if they receive the screen-clearing commands for the ANSI standard terminal! If this is a problem it may be helpful to recompile the program, setting the constants near its beginning so that the program starts up with the VT52 option set.

The other numbered options control which information the program will display on your screen or on the output files. The option to "Print indications of progress of run" will show information such as the names of the species as they are successively added to the tree, and the progress of global rearrangements. You will usually want to see these as reassurance that the program is running and to help you estimate how long it will take. But if you are running the program "in background" as can be done on multitasking and multiuser systems such as Unix, and do not have the program running in its own window, you may want to turn this option off so that it does not disturb your use of the computer while the program is running.

The Output File
--- ------ ----

Most of the programs write their output onto a file called (usually) "outfile", and a representation of the trees found onto a file called "treefile". The exact contents of the output file vary from program to program and also depend on which menu options you have selected.
For many programs, if you select all possible output information, the output will consist of (1) the name of the program and its version number, (2) the input information printed out, (3) a series of phylogenies, some with associated information indicating how much change there was in each character or on each part of the tree. A typical rooted tree looks like this:

                                     +-------------------Gibbon
        +----------------------------2
        !                            !      +------------------Orang
        !                            +------4
        !                                   !  +---------Gorilla
  +-----3                                   +--6
  !     !                                      !    +---------Chimp
  !     !                                      +----5
--1     !                                           +-----Human
  !     !
  !     +-----------------------------------------------Mouse
  !
  +------------------------------------------------Bovine

The interpretation of the tree is fairly straightforward: it "grows" from left to right. The numbers at the forks are arbitrary and are used (if present) merely to identify the forks. In some of the programs asterisks ("*") are used instead of numbers.

For many of the programs the tree produced is unrooted. It is printed out in nearly the same form, but with a warning message:

     remember: this is an unrooted tree!

The warning message ("remember: ...") indicates that this is an unrooted tree (mathematicians still call this a tree, though some systematists unfortunately use the term "network". This conflicts with standard mathematical usage, which reserves the name "network" for a completely different kind of graph). The root of this tree could be anywhere, say on the line leading immediately to Mouse. As an exercise, see if you can tell whether the following tree is or is not a different one from the above:

             +-----------------------------------------------Mouse
             !
   +---------4                                   +------------------Orang
   !         !                            +------3
   !         !                            !      !       +---------Chimp
---6         +----------------------------1      !  +----2
   !                                      !      +--5    +-----Human
   !                                      !         !
   !                                      !         +---------Gorilla
   !                                      !
   !                                      +-------------------Gibbon
   !
   +-------------------------------------------Bovine

     remember: this is an unrooted tree!

(it is NOT different).
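The exercise above can also be checked mechanically: two unrooted trees are the same when they define the same set of groupings of the tips, wherever the root happens to be drawn. The following Python sketch does this. It is an illustration only and is not part of the package. The nested-parenthesis strings are a hand transcription of the two diagrams above and should be treated as an assumption, as should the helper names parse_tree, tips and splits.

```python
def parse_tree(text):
    """Parse nested parentheses (no branch lengths) into nested tuples."""
    s = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        return s[start:pos]

    return node()


def tips(tree):
    """Return the set of tip names below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(tips(child) for child in tree))


def splits(tree):
    """Return the nontrivial groupings of tips implied by the forks.

    Each grouping is recorded as the side that does NOT contain a fixed
    reference tip, so the result does not depend on where the root is.
    """
    everyone = frozenset(tips(tree))
    reference = min(everyone)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            visit(child)
        side = frozenset(tips(node))
        if reference in side:
            side = everyone - side
        if 1 < len(side) < len(everyone) - 1:
            found.add(side)

    visit(tree)
    return found


first = "(((Gibbon,(Orang,(Gorilla,(Chimp,Human)))),Mouse),Bovine);"
second = "((Mouse,((Orang,((Chimp,Human),Gorilla)),Gibbon)),Bovine);"
same = splits(parse_tree(first)) == splits(parse_tree(second))
print("same unrooted tree" if same else "different unrooted trees")
```

Comparing sets of groupings rather than the strings themselves is what makes the answer independent of the order in which branches are drawn at each fork.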
It is IMPORTANT also to realize that the lengths of the segments of the printed tree may not be significant: some may actually represent branches of zero length, in the sense that there is no evidence that the branches are nonzero in length. Some of the diagrams of trees attempt to print branches approximately proportional to estimated branch lengths, while in others the lengths are purely conventional and are presented just to make the topology visible. You will have to look closely at the documentation that accompanies each program to see what it presents and what is known about the lengths of the branches on the tree. The above tree attempts to represent branch lengths approximately in the diagram. But even in those cases, some of the smaller branches are likely to be artificially lengthened to make the tree topology clearer. Here is what a tree from DNAPARS looks like, when no attempt is made to make the lengths of branches in the diagram proportional to estimated branch lengths:

                 +--Human
              +--5
           +--4  +--Chimp
           !  !
        +--3  +-----Gorilla
        !  !
     +--2  +--------Orang
     !  !
  +--1  +-----------Gibbon
  !  !
--6  +--------------Mouse
  !
  +-----------------Bovine

     remember: this is an unrooted tree!

Some of the parsimony programs in the package can print out a table of the number of steps that different characters (or sites) require on the tree. This table may not be obvious at first. A typical example looks like this:

 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1

The numbers across the top and down the side indicate which site is being referred to. Thus site 23 is column "3" of row "20" and has 1 step in this case.

The Tree File
--- ---- ----

In output from most programs, a representation of the tree is also written into the tree file (usually named "treefile"). The tree is specified by the nested pairs of parentheses, enclosing names and separated by commas.
If there are any blanks in the names, these must be replaced by the underscore character "_". Trailing blanks in the name may be omitted. The pattern of the parentheses indicates the pattern of the tree by having each pair of parentheses enclose all the members of a monophyletic group. The tree file for the above tree would have its first line look like this:

((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon));

In the above tree the first fork separates the lineage leading to Mouse and Bovine from the lineage leading to the rest. Within the latter group there is a fork separating Gibbon from the rest, and so on. The entire tree is enclosed in an outermost pair of parentheses. The tree ends with a semicolon.

In some programs such as DNAML, FITCH, and CONTML, the tree will be completely unrooted and specified by a bottommost fork with a three-way split, with three "monophyletic" groups separated by two commas:

(A,(B,(C,D)),(E,F));

The three "monophyletic" groups here are A, (B,C,D), and (E,F). The single three-way split corresponds to one of the interior nodes of the unrooted tree (it can be any interior node). The remaining forks are encountered as you move out from that first node, and each then appears as a two-way split. You should check the documentation files for the particular programs you are using to see which of these forms you can expect the user tree to be in. Note that many of the programs that estimate an unrooted tree produce trees in the treefile in rooted form! This is done for reasons of arbitrary internal bookkeeping. The placement of the root is arbitrary.

For programs estimating branch lengths, these are given in the trees in the tree file as real numbers following a colon, and placed immediately after the group descended from that branch.
Here is a typical tree with branch lengths:

((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);

Note that the tree may continue to a new line at any time except in the middle of a name or the middle of a branch length, although in trees written to the tree file this will only be done after a comma.

These representations of trees are a subset of the standard adopted on June 24, 1986 at the annual meetings of the Society for the Study of Evolution, at a meeting (the final session was in Newick's lobster restaurant -- hence the name, the Newick standard) of an informal committee consisting of Wayne Maddison (MacClade), David Swofford (PAUP), F. James Rohlf (NTSYS-PC), Chris Meacham (COMPROB and plotting programs), James Archie (character coding program), William H.E. Day, and me. This standard is a generalization of PHYLIP's format, itself based on a well-known representation of trees in terms of parenthesis patterns which has been around for almost a century. The standard is now employed by most phylogeny computer programs but unfortunately has yet to be described in a formal published description.

THE OPTIONS AND HOW TO INVOKE THEM

Most of the programs allow various options that alter the amount of information the program is provided or what it is to do with the information. Most options are selected in the menu. However a few are specified in the input file, or require part of their specification to be in the input file.

Options Information in the Input File
------- ----------- -- --- ----- ----

In such cases, the program is notified that an option has been invoked by the presence of one or more letters after the last number on the first line of the input file. These letters may or may not be separated from each other by blanks, though it is usually necessary to separate them from the number by a blank. They can be in any order.
Thus to invoke options A and W, the input file starts with the line:

   12   20 WA

or:

   12   20 A W

The options are described individually in the other documents of this package. For the options that require information to be in the input file, additional information must be provided. For all but one of these, this information is provided by placing a line after the first line of the file, but before the beginning of the species data. The first character of that line should match the option letter. These auxiliary information lines can be in any order. Thus if options A and W are both invoked, both of the following formats (and two others as well) are legal:

   12   20 AW
A         0001111000
Weights   00112221A0
(then the species information)

   12   20 A W
Weights   00112221A0
A         0001111000
(then the species information)

One of the options requires special discussion. Many of the programs have in their menu the option U, which signals that one or more user-defined trees is to be provided for evaluation. This "user tree" is supplied in the input file (not the tree file), but AFTER the species data, rather than before it. It does not require any indication to be placed in the first line of the input file, as do the options that place information before the species data. After the data, there is a line containing the number of user-defined trees being defined. Each user-defined tree starts on a new line. It is in the same form as the trees in the tree files mentioned above, namely the New Hampshire standard. Here is an example with one user-defined tree:

    6   13
Archaeopt 0011001110000
Hesperorni0001101101101
Baluchithe1111011011101
B. virgini1111011101101
Brontosaur0110100111011
B.subtilis0000000011010
1
((B.subtilis,Baluchithe),((Brontosaur,B._virgini),
(Hesperorni,Archaeopt)));

In using the user tree option, check the pattern of parentheses carefully.
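A user tree can be checked before it is handed to a program. The Python sketch below is an illustration under stated assumptions, not something the package provides: the function name check_user_tree is hypothetical, and the checks (balanced parentheses, a closing semicolon, and tip names that match the species names once blanks are replaced by underscores) are simply the rules stated in this document.

```python
import re


def check_user_tree(tree, species_names):
    """Raise ValueError if the tree is malformed; return True otherwise."""
    depth = 0
    for position, ch in enumerate(tree):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at position {position}")
    if depth != 0:
        raise ValueError(f"{depth} unmatched '('")
    if not tree.rstrip().endswith(";"):
        raise ValueError("tree does not end with a semicolon")
    # Tip names are whatever lies between parentheses and commas;
    # anything after a colon is a branch length and is dropped.
    tokens = re.split(r"[(),;\s]+", tree)
    tip_names = sorted(tok.split(":")[0] for tok in tokens if tok.split(":")[0])
    # Species names are ten characters wide; trailing blanks are dropped
    # and internal blanks become underscores.
    expected = sorted(name.strip().replace(" ", "_") for name in species_names)
    if tip_names != expected:
        raise ValueError(f"tips {tip_names} do not match species {expected}")
    return True


species = ["Archaeopt ", "Hesperorni", "Baluchithe",
           "B. virgini", "Brontosaur", "B.subtilis"]
user_tree = ("((B.subtilis,Baluchithe),((Brontosaur,B._virgini),\n"
             "(Hesperorni,Archaeopt)));")
print(check_user_tree(user_tree, species))
```

A check like this catches a dropped parenthesis or a misspelled name, but it does not establish that the tree has the rooted or unrooted form a particular program expects.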
The programs do not always detect whether the tree makes sense, and if it does not there will probably be a crash (hopefully, but not inevitably, with an error message indicating the nature of the problem).

Common Options in the Menu
------ ------- -- --- ----

Seven options from the menu, the U (User tree), G (Global), J (Jumble), O (Outgroup), T (Threshold), M (multiple data sets), and the tree output options, are used so widely that it is best to discuss them in this document.

(1) The U (User tree) option. This option toggles between the default setting, which allows the program to search for the best tree, and the User tree setting, which reads a tree or trees ("user trees") from the input file and evaluates them. The user trees must follow the other information in the data set, and be preceded by a line specifying the number of user trees that are to be evaluated. Each user tree then is given in standard form, each starting on a new line. The form that the user trees must take is described in some detail above, in the section on the tree file. In some cases a program may require that the trees fed in be rooted trees, even though the program cannot infer the placement of the root. In those cases you can place the root anywhere. Program RETREE can be used to convert between rooted and unrooted trees.

(2) The G (Global) option. In the programs which construct trees (except for NEIGHBOR, the "...PENNY" programs and CLIQUE, and of course the "...MOVE" programs where you construct the trees yourself), after all species have been added to the tree a rearrangements phase ensues. In most of these programs the rearrangements are automatically global, which in this case means that subtrees will be removed from the tree and put back on in all possible ways so as to have a better chance of finding a better tree.
Since this can be time consuming (it roughly triples the time taken for a run) it is left as an option in some of the programs, specifically CONTML, FITCH, and DNAML. In these programs the G menu option toggles between the default of local rearrangement and global rearrangement. The rearrangements are explained more below.

(3) The J (Jumble) option. In most of the tree construction programs (except for the "...PENNY" programs and CLIQUE), the exact details of the search of different trees depend on the order of input of species. In these programs the J option enables you to tell the program to use a random number generator to choose the input order of species. This option is toggled on and off by selecting option J in the menu. The program will then prompt you for a "seed" for the random number generator. The seed should be an integer between 1 and 32767, and should be of the form 4n+1, which means that it must give a remainder of 1 when divided by 4. This can be judged by looking at the last two digits of the number. Each different seed leads to a different sequence of addition of species. By simply changing the random number seed and rerunning the programs one can look for other, and better, trees. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.

The Jumble option also causes the program to ask you how many times you want to restart the process. If you answer 10, the program will try ten different orders of species in constructing the trees, and the results printed out will reflect this entire search process (that is, the best trees found among all 10 runs will be printed out, not the best trees from each individual run).

(4) The O (Outgroup) option. This specifies which species is to be used to root the tree by having it become the outgroup. This option is toggled on and off by choosing O in the menu.
When it is on, the program will then prompt for the number of the outgroup (the species being taken in the numerical order that they occur in the input file). Responding by typing "6" and then a carriage-return (Enter) character indicates that the sixth species in the data is the outgroup. Outgroup-rooting will not be attempted if the data have already established a root for the tree from some other consideration, and may not be if it is a user-defined tree, despite your invoking the option. Thus programs such as DOLLOP that produce only rooted trees do not allow the Outgroup option. It is also not available in KITSCH, DNAMLK, or CLIQUE. When it is used, the tree as printed out is still listed as being an unrooted tree, though the outgroup is connected to the bottommost node so that it is easy to visually convert the tree into rooted form. (5) The T (Threshold) option. This sets a threshold such that if the number of steps counted in a character is higher than the threshold, it will be taken to be the threshold value rather than the actual number of steps. The default is a threshold so high that it will never be surpassed. The T menu option toggles on and off asking the user to supply a threshold. The use of thresholds to obtain methods intermediate between parsimony and compatibility methods is described in my 1981b paper. When the T option is in force, the program will prompt for the numerical threshold value. This will be a positive real number greater than 1. In programs MIX, MOVE, PENNY, PROTPARS, DNAPARS, DNAMOVE, and DNAPENNY, do not use threshold values less than or equal to 1.0, as they have no meaning and lead to a tree which depends only on considerations such as the input order of species and not at all on the character state data! In programs DOLLOP, DOLMOVE, and DOLPENNY the threshold should never be 0.0 or less, for the same reason. 
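The effect of a threshold on the count of steps can be shown with a little arithmetic. The Python sketch below is an illustration only: the function name thresholded_steps and the step counts are invented for the example, and it assumes that every character in the data requires at least one step on any tree.

```python
def thresholded_steps(steps_per_character, threshold=None):
    """Total steps, with each character's count capped at the threshold."""
    if threshold is None:                 # default: the cap is never reached
        return sum(steps_per_character)
    return sum(min(steps, threshold) for steps in steps_per_character)


# Hypothetical step counts for the same five characters on two trees.
tree_a = [1, 2, 3, 1, 5]
tree_b = [3, 1, 1, 2, 2]

print(thresholded_steps(tree_a), thresholded_steps(tree_b))            # ordinary parsimony
print(thresholded_steps(tree_a, 2.0), thresholded_steps(tree_b, 2.0))  # threshold of 2
print(thresholded_steps(tree_a, 1.0), thresholded_steps(tree_b, 1.0))  # threshold of 1
```

With a threshold of 1 both trees receive the same total (one per character), which is why such values carry no information about the data: every tree ties, and the one reported depends only on the order in which trees happen to be examined.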
The T option is an important and underutilized one: it is, for example, the only way in this package (except for program DNACOMP) to do a compatibility analysis when there are missing data. It is a method of de-weighting characters that evolve rapidly. I wish more people were aware of its properties.

(6) The M (Multiple data sets) option. In menu programs there is an M menu option which allows one to toggle on the multiple data sets option. The program will ask you how many data sets it should expect. The data sets have the same format as the first data set. Here is a (very small) input file with two five-species data sets:

    5    6
Alpha     CCACCA
Beta      CCAAAA
Gamma     CAACCA
Delta     AACAAC
Epsilon   AACCCA
    5    6
Alpha     CACACA
Beta      CCAACC
Gamma     CAACAC
Delta     GCCTGG
Epsilon   TGCAAT

The main use of this option will be to allow all of the methods in these programs to be bootstrapped. Using the program SEQBOOT one can take any DNA, protein, restriction sites, or binary character data set and make multiple data sets by bootstrapping. Trees can be produced for all of these using the M option. They will be written on the tree output file if that option is left in force. Then the program CONSENSE can be used with that tree file as its input file. The result is a majority rule consensus tree which can be used to make confidence intervals. The present version of the package allows, with the use of SEQBOOT and CONSENSE and the M option, bootstrapping of many of the methods in the package.

(7) The option to write out the trees into a tree file. This specifies that you want the program to write out the tree not only on its usual output, but also onto a file in nested-parenthesis notation (as described above). This option is sufficiently useful that it is turned on by default in all programs that allow it. You can optionally turn it off if you wish, by typing the appropriate number from the menu (it varies from program to program).
This option is useful for creating tree files that can be read directly into the plotting programs and the consensus tree program, and that can be incorporated into the input file to specify user-defined trees in many of the other programs.

(8) The (0) terminal type option. The program will default to one particular assumption about your terminal (except in the case of Macintoshes, the default will be an ANSI compatible terminal). You can alternatively select it to be either an IBM PC, a DEC VT52, or nothing. This affects the ability of the programs to clear the screen when they display their menus, and the graphics characters used to display trees in the programs DNAMOVE, MOVE, DOLMOVE, and RETREE. If you are running a PCDOS system and have the ANSI.SYS driver installed in your CONFIG.SYS file, you may find that the screen clears correctly even with the default setting of ANSI.

Common Options Requiring Information in the Input File
------ ------- --------- ----------- -- --- ----- ----

There are a number of options (Ancestor, Factors, Categories and Weights) that are specified in the input file. Some of them must also be selected in the menu. Of these, the Ancestor and Factors options are specific to the Discrete Characters programs and are described in their group document. The Categories option is specific to some of the molecular sequence programs and is described in their group document. The Weights option is used throughout the package and is best introduced here. This allows us to specify weights on the individual characters. Weights are invoked by placing a W on the first line of the file. The weights are then specified by a line or lines which start with W and then have enough characters or blanks to complete the full length of a species name. Then they have a single character (0-9 or A-Z) for each character.
Thus they look like the data for a species:

Weights   0001111001112

or:

W         1110000ZZZZZ1

The weights cause a character to be counted as if it were n characters, where n is the weight. The values 0-9 give weights 0 through 9, and the values A-Z give weights 10 through 35. By use of the weights we can give overwhelming weight to some characters, and drop others from the analysis. In the molecular sequence programs only two values of the weights, 0 or 1, are allowed.

Weights can be used to analyze different subsets of characters (by weighting the rest as zero). Alternatively, in the discrete characters programs they can be used to force a certain group to appear on the phylogeny (in effect confining consideration to only phylogenies containing that group). This is done by adding an imaginary character that has 1's for the members of the group, and 0's for all the other species. That imaginary character is then given the highest weight possible: the result will be that any phylogeny that does not contain that group will be penalized by such a heavy amount that it will not (except in the most unusual circumstances) be considered. Of course, the new character brings extra steps to the tree, but the number of these can be calculated in advance and subtracted out of the total when reporting the results. This use of weights is an important one, and one sadly ignored by many users who could profit from it. In the case of molecular sequences we cannot use weights this way, so that to force a given group to appear we have to add a large extra segment of sites to the molecule, with (say) A's for that group and C's for every other species.

THE ALGORITHM FOR CONSTRUCTING TREES

All of the programs except FACTOR, DNADIST, GENDIST, DNAINVAR, SEQBOOT, CONTRAST, RETREE, and the plotting and consensus tree programs act to construct an estimate of a phylogeny. MOVE, DOLMOVE, and DNAMOVE let you construct it yourself by hand.
All of the rest but NEIGHBOR, the "...PENNY" programs and CLIQUE make use of a common approach involving additions and rearrangements. They are trying to minimize or maximize some quantity over the space of all possible evolutionary trees. Each program contains a part that, given the topology of the tree, evaluates the quantity that is being minimized or maximized. The straightforward approach would be to evaluate all possible tree topologies one after another and pick the one which, according to the criterion being used, is best. This would not be possible for more than a small number of species, since the number of possible tree topologies is enormous. A review of the literature on the counting of evolutionary trees will be found in one of my papers (Felsenstein, 1978a).

Since we cannot search all topologies, these programs are not guaranteed to always find the best tree, although they seem to do quite well in practice. The strategy they employ is as follows: the species are taken in the order in which they appear in the input file. The first two (in some programs the first three) are taken and a tree constructed containing only those. There is only one possible topology for this tree. Then the next species is taken, and we consider where it might be added to the tree. If the initial tree is (say) a rooted tree with two species and we want the resulting three-species tree to be a bifurcating tree, there are only three places where we could add the third species. Each of these is tried, and each time the resulting tree is evaluated according to the criterion. The best one is chosen to be the basis for further operations. Now we consider adding the fourth species, again at each of the five possible places that would result in a bifurcating tree. Again, the best of these is accepted.

Local Rearrangements
----- --------------

The process continues in this manner, with one important exception.
After each species is added, and before the next is added, a number of rearrangements of the tree are tried, in an effort to improve it. The algorithms move through the tree, making all possible local rearrangements of the tree. A local rearrangement involves an internal segment of the tree in the following manner. Each internal segment of the tree is of this form (where T1, T2, and T3 are subtrees -- parts of the tree that can contain further forks and tips):

        T1      T2       T3
         \      /        /
          \    /        /
           \  /        /
            \/        /
             *       /
              *     /
               *   /
                * /
                 !
                 !

the segment we are discussing being indicated by the asterisks. A local rearrangement consists of switching the subtrees T1 and T3 or T2 and T3, so as to obtain one of the following:

        T3      T2       T1
         \      /        /
          \    /        /
           \  /        /
            \/        /
             \       /
              \     /
               \   /
                \ /
                 !
                 !

        T1      T3       T2
         \      /        /
          \    /        /
           \  /        /
            \/        /
             \       /
              \     /
               \   /
                \ /
                 !
                 !

Each time a local rearrangement is successful in finding a better tree, the new arrangement is accepted. The phase of local rearrangements does not end until the program can traverse the entire tree, attempting local rearrangements, without finding any that improve the tree.

This strategy of adding species and making local rearrangements will look at about (n-1) times (2n-3) different topologies, though if rearrangements are frequently successful the number may be larger. I have been describing the strategy when rooted trees are being considered. For unrooted trees there is a precisely similar strategy, though the first tree constructed may be a three-species tree and the rearrangements may not start until after the addition of the fifth species.

Though we are not guaranteed to have found the best tree topology, we are guaranteed that no nearby topology (i.e. none accessible by a single local rearrangement) is better. In this sense we have reached a local optimum of our criterion. Note that the whole process is dependent on the order in which the species are present in the input file.
We can try to find a different and better solution by reordering the species in the input file and running the program again (or, more easily, by using the J option). If none of these attempts finds a better solution, then we have some indication that we may have found the best topology, though we can never be certain of this.

Note also that a new topology is never accepted unless it is better than the previous one, so that the rearrangement process can never fall into an endless loop. This is also the way ties in our criterion are resolved, namely by sticking with the tree found first. However, the tree construction programs other than CLIQUE, CONTML, FITCH, and DNAML do keep a record of all trees found that are tied with the best one found. This gives you some immediate idea of which parts of the tree can be altered without affecting the quality of the result.

Global Rearrangements
------ --------------

A feature of most of the programs, such as PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX, and DOLLOP, is "global" optimization of the tree. In four of these (CONTML, FITCH, DNAML and DNAMLK) this is an option, 'G'. In the others it automatically applies. When it is present there is an additional stage to the search for the best tree. Each possible subtree is removed from the tree and added back in all possible places. This process continues until all subtrees can be removed and added again without any improvement in the tree. The purpose of this extra rearrangement is to make it less likely that one or more species gets "stuck" in a suboptimal region of the space of all possible trees. The use of global optimization results in approximately a tripling (3x) of the run-time, which is why I have left it as an option in some of the slower programs.

The programs doing global optimization print out a dot "." after each group is removed and re-added to the tree, to give the user some sign that the rearrangements are proceeding. A new line of dots is started whenever a new round of global rearrangements is started following an improvement in the tree. On the line before the dots are printed there is printed a bar of the form "!--------------!" to show how many dots to expect. The dots will not be printed out at a uniform rate, but the later dots, which represent removal of larger groups from the tree and trying them consequently in fewer places, will print out more quickly. With some compilers each row of dots is not printed out until it is complete.

It should be noted that PENNY, DOLPENNY, DNAPENNY and CLIQUE use a more sophisticated strategy of "depth-first search" with a "branch and bound" search method that guarantees that all of the best trees will be found. In the case of PENNY, DOLPENNY and DNAPENNY there can be a considerable sacrifice of computer time if the number of species is greater than about ten: it is a matter for you to consider whether it is worth it for you to guarantee finding all the most parsimonious trees, and that depends on how much free computer time you have! CLIQUE finds all largest cliques, and does so without undue burning of computer time.

Multiple Jumbles
-------- -------

As just mentioned, for most of these programs the search depends on the order in which the species are entered into the tree. Using the J (Jumble) option you can supply a random number seed which will allow the program to add the species in a random order. A new feature (with version 3.5) is to allow this to be done multiple times. If you tell the program to do it 10 times, it will go through the tree-building process 10 times, each with a different random order of adding species. It will keep a record of the trees tied for best over the whole process. In other words, it does not just record the best trees from each of the 10 runs, but records the best ones overall.
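The bookkeeping of multiple jumbles can be sketched as follows. This Python fragment is an illustration only and makes several assumptions: the names valid_seed, multiple_jumbles and toy_build_tree are hypothetical, Python's own random number generator stands in for the one in the programs, smaller scores are taken to be better (as with parsimony), and toy_build_tree is a placeholder for a real tree-building run.

```python
import random


def valid_seed(seed):
    """Seed rule from this document: between 1 and 32767, of the form 4n+1."""
    return 1 <= seed <= 32767 and seed % 4 == 1


def multiple_jumbles(species, build_tree, seed, times):
    """Run build_tree on several random orders; keep all trees tied for best overall."""
    if not valid_seed(seed):
        raise ValueError("seed must be between 1 and 32767 and of the form 4n+1")
    rng = random.Random(seed)          # stand-in generator, not the one PHYLIP uses
    best_score, best_trees = None, []
    for _ in range(times):
        order = list(species)
        rng.shuffle(order)             # a new random input order for each run
        score, tree = build_tree(order)
        if best_score is None or score < best_score:
            best_score, best_trees = score, [tree]   # strictly better: start over
        elif score == best_score and tree not in best_trees:
            best_trees.append(tree)                  # tied for best: keep it too
    return best_score, best_trees


def toy_build_tree(order):
    """Placeholder for a real search: scores an order by its out-of-sequence pairs."""
    score = sum(1 for a, b in zip(order, order[1:]) if a > b)
    return score, tuple(order)


score, trees = multiple_jumbles("ABCDE", toy_build_tree, seed=29, times=10)
print(score, len(trees))
```

The point of the sketch is the comparison inside the loop: a tree from a later run displaces the earlier record only if it is strictly better, and is added to the record if it ties.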
Of course this is slow, taking 10 times longer than a single run. But it does give us a much greater chance of finding all of the most parsimonious trees. In the terminology of Maddison (1991) it can find different "islands" of trees. The present algorithms do not guarantee us to find all trees in a given "island" from a single run, so multiple runs also help explore those "islands" that are found.

STRATEGY FOR FINDING THE BEST TREE

In practice, it is advisable to use the Jumble option to evaluate many different orderings of the input species. When the programs which have global branch-swapping as default (such as DNAPARS) are used or when the G option is employed in other programs IT IS ADVISABLE TO USE THE JUMBLE OPTION AND SPECIFY THAT IT BE DONE MANY TIMES (AS MANY AS TEN), so as to use different orderings of the input species. When the G (Global rearrangement) option is not being used I have also found it useful to do multiple Jumbles.

People who want a magic "black box" program whose results they do not have to question (or think about) often are upset that these programs give results that are dependent on the order in which the species are entered in the data. To me this property is an advantage, for it permits you to try different searches for better trees, simply by varying the input order of species. If you do not use the multiple Jumble option, but do multiple individual runs instead, you can easily decide which to pay most attention to -- the one or ones that are best according to the criterion employed (for example, with parsimony, the one out of the runs that results in the tree with the fewest changes).

In practice, in a single run, it usually seems best to put species that are likely to be sources of confusion in the topology last, as by the time they are added the arrangement of the earlier species will have stabilized into a good configuration, and then the last few species will be fitted into that topology.
There will be less chance this way of a poor initial topology that would affect all subsequent parts of the search. However, a variety of arrangements of the input order of species should be tried, as can be done if the J option is used, and no species should be kept in a fixed place in the order of input. Note that the results of the "...PENNY" programs and CLIQUE are not sensitive to the input order of species, and NEIGHBOR is only slightly sensitive to it, so that multiple Jumbling is not possible with those programs. Note also that with global search, which is standard in many programs and in others is an option, each group (including each individual species) will be removed and re-added in all possible positions, so that a species causing confusion will have more chance of moving to a new location than it would without global rearrangement.

A WARNING ON INTERPRETING RESULTS

Probably the most important thing to keep in mind while running any of the parsimony or compatibility programs is not to overinterpret the result. Many users treat the set of most parsimonious trees as if it were a confidence interval. If a group appears in all of the most parsimonious trees then they treat it as well established. Unfortunately THE CONFIDENCE INTERVAL ON PHYLOGENIES APPEARS TO BE MUCH LARGER THAN THE SET OF ALL MOST PARSIMONIOUS TREES (Felsenstein, 1985b). Likewise, variation of result among different methods will not be a good indicator of the size of the confidence interval.

Consider a simple data set in which, out of 100 binary characters, 51 recommend the rooted tree ((A,B),C) and 49 the tree (A,(B,C)). Many different methods will all give the same result on such a data set: they will estimate the tree as ((A,B),C). Nevertheless it is clear that the 51:49 margin by which this tree is favored is not significantly different from 50:50. So CONSISTENCY AMONG DIFFERENT METHODS IS A POOR GUIDE TO STATISTICAL SIGNIFICANCE.
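The claim that 51:49 is not significantly different from 50:50 can be checked with a small calculation. The sketch below is an illustration added here, not a method taken from any PHYLIP program. It assumes the 100 characters can be treated as independent, each equally likely to favor either tree under the null hypothesis (a simple exact sign test); the function name two_sided_binomial_p is hypothetical.

```python
from math import comb

def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value under the null hypothesis that each
    character favors either tree with probability 1/2 (assumes the
    characters are independent)."""
    expected = trials / 2
    deviation = abs(successes - expected)
    # Count all outcomes at least as far from an even split as the one seen.
    extreme = sum(comb(trials, k) for k in range(trials + 1)
                  if abs(k - expected) >= deviation)
    return extreme / 2 ** trials

p = two_sided_binomial_p(51, 100)
print(round(p, 3))   # about 0.92
```

A p-value near 0.92 means that a split at least as uneven as 51:49 would arise in about 92% of such data sets even if neither tree were truly favored.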
RELATIVE SPEED OF DIFFERENT PROGRAMS AND MACHINES

Relative speed of the different programs
-------- ----- -- --- --------- --------

C compilers differ in efficiency of the code they generate, and some deal with some features of the language better than with others. Thus a program which is unusually fast on one computer may be unusually slow on another. Nevertheless, as a rough guide to relative execution speeds, I have tested the programs on three data sets, each of which has 10 species and 20 characters. The first is an imaginary one in which all characters are compatible ("The Willi Hennig Memorial Data Set" as J. S. Farris once called it). The second is the binary recoded form of the fossil horses data set of Camin and Sokal (1965). The third data set has data that is completely random: 10 species and 20 characters with a 50% chance that each character state is 0 or 1 (or A or G). The data sets range from a completely compatible one in which there is no homoplasy (parallelism or convergence), through the horses data set, which requires 29 steps where the possible minimum number would be 20, to the random data set, which requires 49 steps. We can thus see how this increasing messiness of the data affects running times.

Here are the nucleotide sequence versions of the three data sets:

    10   20
A         CACACACAAAAAAAAAAACA
B         CACACAACAAAAAAAAAACA
C         CACAACAAAAAAAAAAAACA
D         CAACAAAACAAAAAAAAACA
E         CAACAAAAACAAAAAAAACA
F         ACAAAAAAAACACACAAAAC
G         ACAAAAAAAACACAACAAAC
H         ACAAAAAAAACAACAAAAAC
I         ACAAAAAAAAACAAAACAAC
J         ACAAAAAAAAACAAAAACAC

    10   20
MesohippusAAAAAAAAAAAAAAAAAAAA
HypohippusAAACCCCCCCAAAAAAAAAC
ArchaeohipCAAAAAAAAAAAAAAAACAC
ParahippusCAAACAACAACAAAAAAAAC
MerychippuCCAACCACCACCCCACACCC
M. secunduCCAACCACCACCCACACCCC
Nannipus  CCAACCACAACCCCACACCC
NeohippariCCAACCCCCCCCCCACACCC
Calippus  CCAACCACAACCCACACCCC
PliohippusCCCACCCCCCCCCACACCCC

    10   20
A         CACACAACCAAACAAACCAC
B         AAACCACACACACAAACCCA
C         ACAAAACCAAACCACCCACA
D         AAAAACACAACACACCAAAC
E         AAACAACCACACACAACCAA
F         CCCAAACACCCCCAAAAAAC
G         ACACCCCCACACCCACCAAC
H         AAAACAACAACCACCCCACC
I         ACACAACAACACAAACAACC
J         CCAAAAACACCCAACCCAAC

Here are the timings of many of the version 3.5 programs on these three data sets as run after being compiled by Microsoft Quick C on a 16 MHz 80386SX computer under PCDOS 5.0. An 80387 math co-processor was present and was used by the compiled code.

              Hennigian Data    Horses Data    Random Data

PROTPARS            82.83           86.23         148.03
DNAPARS              5.98            5.66          11.54
DNAPENNY            46.03           23.51        5305.97
DNACOMP              7.14            6.43          11.86
DNAINVAR             0.61            0.66           0.61
DNAML             1928.99         2069.32        2611.48
DNAMLK            2247.12         6094.81        4993.00
DNADIST              3.57            4.50           5.38
RESTML            6818.34        13422.15       28418.34
FITCH               35.92           48.61          38.17
KITSCH              12.42           12.36          13.18
NEIGHBOR             2.20            2.14           2.903
CONTML              56.85           57.56          59.15
GENDIST              1.00            1.00           1.00
MIX                 13.62           14.60          25.92
PENNY                8.41           21.31        3851.1
DOLLOP              26.69           26.86          46.30
DOLPENNY            12.25           56.57       23934.22
CLIQUE               0.77            0.71           0.77
FACTOR               0.39            0.44           0.44

In all cases the programs were run under the default options, except as specified here. The data sets used for the discrete characters programs have 0's and 1's instead of A's and C's. For CONTML the 0's and 1's were made into 0.0's and 1.0's and considered as 20 2-allele loci. For the distance programs 10 x 10 distance matrices were computed from the three data sets. Nor does it make much sense to benchmark MOVE, DOLMOVE, or DNAMOVE, although when there are many characters and many species the response time after each alteration of the tree should be proportional to the product of the number of species and the number of characters. For DNAML and DNAMLK the frequencies of the four bases were set to be equal rather than determined empirically as is the default.
For RESTML the number of enzymes was set to 1.

Several patterns will be apparent from this. The algorithms (MIX, DOLLOP, CONTML, FITCH, KITSCH, PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, and RESTML) that use the above-described addition strategy have run times that do not depend strongly on the messiness of the data. The only exception to this is that if a data set such as the Random data requires one extra round of global rearrangements it takes longer. The programs differ greatly in run time: the likelihood programs RESTML, DNAML and CONTML are quite a bit slower than the others. The protein sequence parsimony program, which has to do a considerable amount of bookkeeping to keep track of which amino acids can mutate to each other, is also relatively slow.

Another class of algorithms includes PENNY, DOLPENNY, DNAPENNY and CLIQUE. These are branch-and-bound methods: in principle they should have execution times that rise exponentially with the number of species and/or characters, and they might be much more sensitive to messy data. This is apparent with PENNY, DOLPENNY, and DNAPENNY, which go from being reasonably fast with clean data to very slow with messy data. DOLPENNY is particularly slow on messy data - this is because this algorithm cannot make use of some of the lower-bound calculations that are possible with DNAPENNY and PENNY. CLIQUE is very fast on all data sets. Although in theory it should bog down if the number of cliques in the data is very large, that does not happen with random data, which in fact has few cliques and those small ones. Apparently the "worst-case" data sets are much rarer for CLIQUE than for the other branch-and-bound methods.

NEIGHBOR is quite fast compared to FITCH and KITSCH, and should make it possible to run much larger cases, although the results are expected to be a bit rougher than with those programs.
Speed with different numbers of species
----- ---- --------- ------- -- -------

How will the speed depend on the number of species and the number of characters? For the sequential-addition algorithms, the speed should be proportional to the cube of the number of species, and to the number of characters. Thus a case that has, instead of 10 species and 20 characters, 20 species and 50 characters would take 2 x 2 x 2 x 2.5 = 20 times as long. This implies that cases with more than 20 species will be slow, and cases with more than 40 species VERY slow. This places a premium on working on small subproblems rather than just dumping a whole large data set into the programs.

An exception to these rules will be some of the DNA programs that use an aliasing device to save execution time. In these programs execution time will not necessarily increase in proportion to the number of sites, as sites that show the same pattern of nucleotides will be detected as identical and the calculations for them will be done only once, which does not lead to more execution time. This is particularly likely to happen with few species and many sites, or with data sets that have small amounts of evolutionary divergence.

For programs FITCH and KITSCH, the distance matrix is square, so that when we double the number of species we also double the number of "characters", so that running times will go up as the fourth power of the number of species rather than the third power. Thus a 20-species case with FITCH is expected to run sixteen times more slowly than a 10-species case.

For programs like PENNY and CLIQUE the run times will rise faster than the cube of the number of species (in fact, they can rise faster than any power since these algorithms are not guaranteed to work in polynomial time). In practice, PENNY will frequently bog down above 11 species, while CLIQUE easily deals with larger numbers.
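The scaling rules for the sequential-addition programs and for FITCH and KITSCH can be turned into a rough calculator. This is a back-of-the-envelope sketch only: the function names are hypothetical, the 10-species, 20-character benchmark case is assumed as the reference point, and the result is an expected ratio, not a measured timing (the aliasing device described above can make the DNA programs do better than this).

```python
def relative_run_time(species, characters,
                      base_species=10, base_characters=20):
    """Expected slowdown of a sequential-addition program relative to
    the base case: cube of the species ratio times the character ratio."""
    return (species / base_species) ** 3 * (characters / base_characters)

def distance_matrix_run_time(species, base_species=10):
    """Expected slowdown for FITCH or KITSCH: the fourth power of the
    species ratio, since the distance matrix grows with the species."""
    return (species / base_species) ** 4

print(relative_run_time(20, 50))        # 20.0, the 2 x 2 x 2 x 2.5 case
print(distance_matrix_run_time(20))     # 16.0
```

Multiplying one of these ratios by a measured time for the 10-species case gives a first guess at whether a larger run is affordable.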
For NEIGHBOR the speed should vary only as the square of the number of species, so a case twice as large will take only four times as long. This will make it an attractive alternative to FITCH and KITSCH for large data sets. If you are unsure of how long a program will take, try it first on a few species, then work your way up until you get a feel for the speed and for what size programs you can afford to run. Execution time is not the most important criterion for a program, particularly as computer time gets much cheaper than your time or a programmer's time. With workstations on which background jobs can be run all night, execution speed is not overwhelmingly relevant. Some of us have been conditioned by an earlier era of computing to consider execution speed paramount. But ease of use, ease of adaptation to your computer system, and ease of modification are much more important in practice, and in these respects I think these programs are adequate. Only if you are engaged in 1960's style mainframe computing is minimization of execution time paramount. Nevertheless it would have been nice to have made the programs faster. The present speeds are a compromise between speed and effectiveness: by making them slower and trying more rearrangements in the trees, or by enumerating all possible trees, I could have made the programs more likely to find the best tree. By trying fewer rearrangements I could have speeded them up, but at the cost of finding worse trees. I could also have speeded them up by writing critical sections in assembly language, but this would have sacrificed ease of distribution to new computer systems. There are also some options included in these programs that make it harder to adopt some of the economies of bookkeeping that make other programs faster. However to some extent I have simply made the decision not to spend time trying to speed up program bookkeeping when there were new likelihood and statistical methods to be developed. 
Relative speed of different machines
-------- ----- -- --------- --------

It is interesting to compare different machines using DNAPARS as the standard task. One can rate a machine on the DNAPARS benchmark by summing the times for all three of the data sets. Here are relative total timings over all three data sets (done with various versions of DNAPARS) for some machines, taking Microsoft Quick C running under PCDOS on a 16 MHz 80386 clone as the standard. Pascal benchmarks from version 3.4 of the program are also included -- they are compared only with each other and their times are in parentheses. This use of two separate standards is necessary not because of different languages but because different versions of the package are being compared. Thus, the "Time" is the ratio of the Total to that for the 386SX, for the appropriate standard, so that the Time for the Macintosh Classic for DNAPARS 3.4 on Think Pascal 3 is compared to the Time for the 386SX running DNAPARS 3.4 on Turbo Pascal 6.0, but the Time for the Macintosh Classic running version 3.5 on Think C is compared to the Time for the 386SX running version 3.5 on Quick C. The Speed is the reciprocal of the Time.
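The two derived columns follow directly from the Total. As a small worked sketch (the function name time_and_speed is hypothetical), here is the calculation for the Macintosh Classic running version 3.5 under Think C, using its Total of 43.0 seconds and the 12.01-second Total of the 386SX standard under Quick C:

```python
def time_and_speed(total_seconds, standard_seconds):
    """Return (Time, Speed): Time is the Total divided by the Total for
    the appropriate standard machine, and Speed is its reciprocal."""
    time = total_seconds / standard_seconds
    return time, 1.0 / time

t, s = time_and_speed(43.0, 12.01)
print(round(t, 2), round(s, 3))   # 3.58 0.279
```

For a Pascal (version 3.4) entry the second argument would instead be the parenthesized Total of the 386SX under Turbo Pascal 6.0, since the two sets of timings are compared only within themselves.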
Machine               OS     Compiler                 Total     Time     Speed
-------               --     --------                 -----     ----     -----

Toshiba T1100+        PCDOS  Turbo Pascal 3.01A       (269)     7.912    0.126
Apple Mac Plus        MacOS  Lightspeed Pascal 2      (175.84)  5.172    0.193
Toshiba T1100+        PCDOS  Turbo Pascal 5.0         (162)     4.765    0.210
Macintosh Classic     MacOS  Think Pascal 3           (160)     4.706    0.212
Macintosh Classic     MacOS  Think C                  43.0      3.58     0.279
IBM PS2/60            PCDOS  Turbo Pascal 5.0         (58.76)   1.728    0.579
80286 (12 Mhz)        PCDOS  Turbo Pascal 5.0         (47.09)   1.385    0.722
Apple Mac IIcx        MacOS  Think Pascal 3           (42)      1.235    0.810
Apple Mac SE/30       MacOS  Think Pascal 3           (42)      1.235    0.810
Apple Mac IIcx        MacOS  Lightspeed Pascal 2      (39.84)   1.172    0.853
Apple Mac IIcx        MacOS  Lightspeed Pascal 2#     (39.69)   1.167    0.857
Zenith Z386 (16MHz)   PCDOS  Turbo Pascal 5.0         (38.27)   1.155    0.866
Macintosh SE/30       MacOS  Think C                  13.6      1.132    0.883
80386SX (16 MHz)      PCDOS  Turbo Pascal 6.0         (34)      1.0      1.0
80386SX (16 MHz)      PCDOS  Microsoft Quick C        12.01     1.0      1.0
Sequent-S81           DYNIX  Silicon Valley Pascal    (13.0)    0.382    2.615
VAX 11/785            Unix   Berkeley Pascal          (11.9)    0.35     2.857
80486-33              PCDOS  Turbo Pascal 6.0         (11.46)   0.337    2.967
Sun 3/60              SunOS  Sun C                    3.93      0.327    3.056
NeXT Cube (68030)     Mach   Gnu C                    2.608     0.217    4.605
Sequent S-81          DYNIX  Sequent Symmetry C       2.604     0.217    4.612
VAXstation 3500       Unix   Berkeley Pascal          (7.3)     0.215    4.658
Sequent S-81          DYNIX  Berkeley Pascal          (5.6)     0.1647   6.07
Unisys 7000/40        Unix   Berkeley Pascal          (5.24)    0.1541   6.49
VAX 8600              VMS    DEC VAX Pascal           (3.96)    0.1165   8.59
Sun SPARC IPX         SunOS  Gnu C version 2.1        1.28      0.1066   9.383
VAX 6000-530          VMS    DEC C                    0.858     0.0714   13.998
VAXstation 4000       VMS    DEC C                    0.809     0.0674   14.845
IBM RS/6000 540       AIX    XLP Pascal               (2.276)   0.0669   14.94
NeXTstation(040/25)   Mach   Gnu C                    0.75      0.0624   16.013
Sun SPARC IPX         SunOS  Sun C                    0.68      0.0566   17.662
486DX (33 MHz)        Linux  Gnu C #                  0.63      0.0525   19.063
Sun SPARCstation-1+   Unix   Sun Pascal               (1.7)     0.05     20.00
DECstation 5000/200   Unix   DEC Ultrix C             0.45      0.0375   26.69
Sun SPARC 1+          SunOS  Sun C                    0.40      0.0333   30.025
DECstation 3100       Unix   DEC Ultrix RISC Pascal   (0.77)    0.0226   44.16
IBM 3090-300E         AIX    Metaware High C          0.27      0.0225   44.48
DECstation 5000/125   Unix   DEC Ultrix RISC C        0.267     0.0222   44.98
DECstation 5000/200   Unix   DEC Ultrix RISC C        0.256     0.0222   44.98
Sun SPARC 4/50        SunOS  Sun C                    0.249     0.02073  48.23
DEC 3000/400 AXP      Unix   DEC C                    0.224     0.01865  53.62
DECstation 5000/240   Unix   DEC Ultrix RISC C        0.1889    0.01573  63.58
SGI Iris R4000        Unix   SGI C                    0.184     0.01532  65.27
IBM 3090-300E         VM     Pascal VS                (0.464)   0.0136   73.28
DECstation 5000/200   Unix   DEC Ultrix RISC Pascal   (0.39)    0.0114   87.18

The Toshiba T1100+ should be exactly as fast as an 8 MHz PC clone. For a couple of the machines I am not sure that this benchmark is representative of timings on non-numerical programs in PHYLIP. This is particularly the case for the DEC 3000/400 AXP (the DEC "Alpha") which is probably quite a bit faster than indicated here. The numerical programs benchmark below gives it a fairer test. The IBM RS/6000 is probably up to ten times faster than shown here: it may have been ill-served by its Pascal compiler. Note that parallel machines like the Sequent are not really as slow as indicated by the data here, as these runs did nothing to take advantage of their parallelism.

For a picture of speeds for a more numerically intensive program, here are benchmarks using DNAML, with the 16 MHz 386SX with math co-processor active as the standard. Numbers are total run times (total user time in the case of Unix) over all three data sets.
                       Operating
Machine                System   Compiler           Seconds    Time     Speed
-------                ---------  --------         -------    ----     -----

386SX 16 Mhz           PCDOS    Turbo Pascal 6     (7826)     1.0      1.0
386SX 16 Mhz           PCDOS    Quick C            6549.79    1.0      1.0
Compudyne 486DX/33     Linux    Gnu C              1599.9     0.2441   4.096
SUN Sparcstation 1+    SunOS    Sun C              1402.8     0.2142   4.669
Everex STEP 386/20     PCDOS    Turbo Pascal 5.5   (1440.8)   0.1841   5.432
486DX/33               PCDOS    Turbo C++          1107.2     0.1690   5.916
Compudyne 486DX/33     PCDOS    Waterloo C/386     1045.78    0.1597   6.263
Sun SPARCstation IPX   SunOS    Gnu C              960.2      0.1466   6.821
NeXTstation(68040/25)  Mach     Gnu C              916.6      0.1399   7.146
486DX/33               PCDOS    Waterloo C/386     861.0      0.1314   7.607
Sun SPARCstation IPX   SunOS    Sun C              787.7      0.1203   8.315
486DX/33               PCDOS    Gnu C              650.9      0.0994   10.063
VAX 6000-530           VMS      DEC C              637.0      0.0973   10.282
DECstation 5000/200    Unix     DEC Ultrix RISC C  423.3      0.0646   15.473
IBM 3090-300E          AIX      Metaware High C    201.8      0.0308   32.46
Convex C240/1024       Unix     C                  101.6      0.01551  64.47
DEC 3000/400 AXP       Unix     DEC C              98.29      0.01501  66.64

You are invited to send me figures for your machine for inclusion in future tables. Use the data sets above and compute the total times for DNAPARS and for DNAML for the three data sets (setting the frequencies of the four bases to 0.25 each for the DNAML runs). Be sure to tell me the name and version of your compiler, and the version of PHYLIP you tested.

Published Benchmarks
--------- ----------

Some of you may have seen the "benchmark" published by Luckow and Pimentel (1985). PHYLIP's WAGNER (an immediate ancestor of MIX) did not do well in it, either in terms of the quality of result or execution speed. I do not believe that this was a fair benchmark. WAGNER was run only with one order of input species, not ten as recommended here. Had it been, perhaps the shortest tree would have been found more often. No credit was given to PHYLIP in that article for its free distribution, availability on microcomputers, availability in source code form, or portability to new computers.
Pimentel's laboratory commissioned the development of a competing package, PHYSYS, which is a commercial product, and that involvement was not stated in the article.

The benchmarks done by Fink (1986) are fairer, although there are impressions given by that article which do not apply to the present version. In particular, I have since added to many of the programs the ability to save multiple equally-parsimonious trees, and have changed the outputs so that reconstruction of states in the hypothetical ancestral nodes is much easier, thus answering Fink's major criticisms. I have since eliminated the Metropolis annealing method algorithms which he criticized. I disagree with Fink's view of PHYLIP that one should "be wary of published results from an analysis using it", as I do not think that a tree slightly longer than the most parsimonious one should be rejected out of hand. Nor do I agree that "it is really too slow to use as a teaching tool", as in teaching one uses small data sets and speed is not of the essence. Rather, simplicity of user interface is paramount, and there PHYLIP does very well (as is the ability to run on a variety of computers, in which respect PHYLIP is also superior). In fact, it is widely used as a teaching tool. Nevertheless MIX is undoubtedly not as fast or as sophisticated as PAUP or Hennig86. The present version of PHYLIP is closer to its competitors in quality of result than was the version Fink reviewed.

Platnick's (1987) benchmarks concentrated, as did the other benchmarkers (all of them members of the same school of systematists), on parsimony as the only phylogeny criterion worthy of attention. He concluded that PHYLIP could be used effectively, especially if up to ten different input orders of species were used. Again, as with the other benchmarks, no credit was given for diversity of methods, portability, price, or availability of source code.
Platnick's second benchmark paper (1989) concentrates on Hennig86 and PAUP, and concludes that PHYLIP has not kept up with those programs in its features. Again, the review is entirely concerned with parsimony, and only the barest mention is made of ... (you can complete this sentence).

Sanderson's (1990) benchmark paper breaks with the method of the others by specifying 36 features of the packages rated and giving separate ratings in each. Like the other benchmark papers it concentrates almost exclusively on parsimony as applied to morphological characters, but does at least give some credit where credit is due.

My own, obviously biased, feeling is that there is a discrepancy between the benchmarkers' projections of how satisfied users of PHYLIP will be, and how satisfied they actually are. And that this discrepancy is in PHYLIP's favor.

ENDORSEMENTS

Here are some comments about PHYLIP. Explanatory material in square brackets is my own:

From the pages of Cladistics:

"Under no circumstances can we recommend PHYLIP/WAG [their name for the Wagner parsimony option of MIX]."
                                   Luckow, M. and R. A. Pimentel (1985)

"PHYLIP has not proven very effective in implementing parsimony (Luckow and Pimentel, 1985)."
                                   J. Carpenter (1987a)

"... PHYLIP. This is the computer program where every newsletter concerning it is mostly bug-catching, some of which have been put there by previous corrections. As Platnick (1987) documents, through dint of much labor useful results may be attained with this program, but I would suggest an easier way: FORMAT b:"
                                   J. Carpenter (1987b)

"PHYLIP is bug-infested and both less effective and orders of magnitude slower than other programs ...."
                                   "T. N. Nayenizgani" [J. S. Farris] (1990)

"Hennig86 [by J. S. Farris] provides such substantial improvements over previously available programs (for both mainframes and microcomputers) that it should now become the tool of choice for practising systematists."
                                   N. Platnick (1989)

and in the pages of other journals:

"The availability, within PHYLIP of distance, compatibility, maximum likelihood, and generalized 'invariants' algorithms (Cavender and Felsenstein, 1987) sets it apart from other packages .... One of the strengths of PHYLIP is its documentation ...."
                                   Michael J. Sanderson (1990)

(Sanderson also criticizes PHYLIP for slowness and inflexibility of its parsimony algorithms, and compliments other packages on their strengths).

"This package of programs has gradually become a basic necessity to anyone working seriously on various aspects of phylogenetic inference .... The package includes more programs than any other known phylogeny package. But it is not just a collection of cladistic and related programs. The package has great value added to the whole, and for this it is unique and of extreme importance .... its various strengths are in the great array of methods provided ...."
                                   Bernard R. Baum (1989)

(see also above under Benchmarks for W. Fink's critical remarks (1986) on version 2.8 of PHYLIP).

GENERAL COMMENTS ON ADAPTING THE PACKAGE TO DIFFERENT COMPUTER SYSTEMS

In the sections following you will find instructions on how to adapt the programs to different computers and compilers. The programs should compile without alteration on most versions of C. They use the "malloc" or "calloc" library functions to allocate memory, so that the upper limit on how many species or how many sites or characters they can run is set by the system memory available to that memory-allocation function.

In the document file for each program, I have supplied a small input example, and the output it produces, to help you check whether the programs are running properly. Most of the programs read their data from a file called "infile" and write their output to a file called "outfile" and a tree file to a file "treefile". If "infile" does not exist the program will prompt you for its name.
Compiling the programs
--------- --- --------

Many machines that have C compilers, particularly Unix systems, have a utility called "make" available that considerably simplifies the process of compiling these programs. I will first discuss how to compile these programs with "make" and then, after a digression on how to move PHYLIP to a microcomputer, discuss for different individual systems how to compile the programs. As we shall see below, for some DOS and Macintosh compilers one cannot simply use "make" and the standard Makefile.

Using "make"
----- ------

If your machine has "make" you can place all the programs for the package, together with the file "Makefile" and the header files "phylip.h" and "drawgraphics.h", in one directory. The Makefile and header files are constructed to detect, for many varieties of C, which one they are dealing with, and to inform the programs accordingly so that they can (by using "#ifdef") adapt to the idiosyncrasies of the compiler. To compile all the programs just type:

   make all

To compile just one program, such as DNAML, type:

   make dnaml

After a time the compiler will finish compiling. The names of the executables will be the same as the names of the C programs, but without the ".c" suffix. Thus dnaml.c compiles to make an executable called "dnaml". If object modules ending in ".o" are found in the directory after compilation they can be removed if you need space.

Getting PHYLIP onto your microcomputer
------- ------ ---- ---- -------------

C is widely available on microcomputers, and in any case we also distribute executable versions for PCDOS, 386 PCDOS, and Macintosh systems. Your institution may have an Internet connection, and if so there is probably a PCDOS system or a Macintosh somewhere connected directly to it. Using that machine you could download the executables and put them directly onto a diskette for transfer to your own machine.
You can also get the source code, documentation, and executables by sending me the appropriate number of diskettes (see the general information at the start of this document).

If you cannot do this, you may be able to transfer the entire package, in the form of self-extracting archives (which is one of the ways we distribute it for microcomputers), to your system using a terminal program with file transfer capabilities. Some users are sufficiently terrified of this prospect that they prefer to mail us diskettes and wait for several weeks. But if your institution has an Internet connection it is much faster to do it that way.

If you have a serial port to which a modem can be hooked, you can get a terminal program and do the transfers yourself. For most microcomputer systems, public-domain or shareware terminal programs are available, such as the widely-distributed KERMIT and MODEM families of programs. Most university computer centers have communications programs (KERMIT or XMODEM) to "talk" to KERMIT, MODEM, or PC-TALK and transfer files to and from it. Thus, if you cannot get from me a disk format readable by your machine, you can:

(1) Get an account on your mainframe and learn to use its facilities for "anonymous ftp" (transfer of files over Internet) or electronic mail.

(2a) If you are on Internet (or NSFNET) use the "anonymous ftp" method to receive the self-extracting archive files (start by downloading and reading the file "pub/phylip/Read.Me" from my system whose Internet address is evolution.genetics.washington.edu (128.95.12.41)), or

(2b) if your institution is not on Internet but does have Bitnet electronic mail, you can request that I send you the PHYLIP source code files and documentation as e-mail messages over BITNET/EARN (not the executables, however).

(3) Make sure the files are saved on your mainframe account (you will need about 2.2 Megabytes of space) under appropriate names.
(4) Use the file transfer provisions of your terminal program to transfer the archives to your microcomputer, or if they came as many e-mail messages, to transfer these to your machine individually (most file transfer programs can transfer many files with one command) for later compilation of the C source.

If you cannot read the diskette formats that I can write, and if you absolutely INSIST that I distribute the package in this format, please send me the computer and thirteen diskettes. I will promptly write the diskettes and return them (but of course I will keep your computer).

Now we turn to particular C compilers and describe particular problems that may be encountered.

Microsoft Quick C and Microsoft C
--------- ----- - --- --------- -

These comments apply to Microsoft Quick C but may also work with Microsoft C. A Makefile for Microsoft Quick C is included with the source code. It is called "Makefile.qc". If you copy it and call the copy "Makefile" (making sure to first save the generic Makefile that comes with this package under some name such as Makefile.old), you should be able to use "make" as described above, except that it is called "nmake". Note that the command you must use to compile (for example) DNAPARS is "nmake dnapars.exe", not "nmake dnapars", as the program that results is to be called "dnapars.exe" and the Quick C Makefile is set up that way.

To compile individual programs without using the makefile, you need to do the following. For a non-graphics program use the following command (DOS> is the PCDOS prompt, so you do not type it):

DOS> qcl /AH /F 4000 /FPi [source files]

If the program you are trying to compile is a 1-part source (for example, neighbor only has one part, neighbor.c) you should replace "[source files]" with "neighbor.c".
So the command would be:

DOS> qcl /AH /F 4000 /FPi neighbor.c

If the program you are trying to compile is a 2-part source (for example, mix has two parts, mix.c and mix2.c) you can replace [source files] with both of the source files. Make sure that the first source file in the list has the same name as the executable file you want, i.e. use mix.c mix2.c and not the other way around. If you reorder them, the executable file will be called "MIX2.EXE". For mix, the command would be:

DOS> qcl /AH /F 4000 /FPi mix.c mix2.c

To compile a graphics program (i.e. drawgram, drawtree) under Quick C without using the makefile, use one of the following commands:

for DRAWGRAM:

DOS> qcl /AH /F 4000 /FPi drawgram.c drawgraphics.c graphics.lib

for DRAWTREE:

DOS> qcl /AH /F 4000 /FPi drawtree.c drawgraphics.c graphics.lib

Turbo C++ for PCDOS
----- --- --- -----

The following instructions are for Turbo C++ but may also work for Turbo C and for Borland C, perhaps with slight modifications.

Under normal situations you can use the makefile. The makefile for Turbo C++ is included in the package as "Makefile.tc". Copy it and call the copy "Makefile" (it would be wise to first rename the original "Makefile" to "Makefile.old"). Then to compile, say, DNAPARS, just type:

   make dnapars.exe

However, if for some reason you want to do it by hand, follow the following steps:

For the non-graphical programs (all those other than DRAWGRAM and DRAWTREE): to compile dnapars.c type the following (DOS> is the PCDOS prompt)

DOS> tcc -mh dnapars.c

If the source file is sufficiently large to require two sources (for example, dnaml.c and dnaml2.c), you will need to use both dnaml.c and dnaml2.c.
Examples:

DOS> tcc -mh dnaml.c dnaml2.c
DOS> tcc -mh neighbor.c

If you would like to use the program under the TD debugger, you should add a "-v" flag as a compiler option:

DOS> tcc -mh -v restml.c restml2.c

For the graphical programs (DRAWGRAM and DRAWTREE):

First you need to build the "BGI" drivers. The BGI drivers are included with your TURBOC compiler, and should be in the "BGI" directory (this is a subdirectory of the main turboc directory). To do this you need to use the "bgiobj" program, also in the BGI directory. The current version of PHYLIP supports the EGA/VGA, CGA, and Hercules drivers. If you have modified the sources to take advantage of other drivers, you will have to include those as well.

To build the BGI drivers:

DOS> cd \tc\bgi          [this should be replaced with whatever your turboc dir is]
DOS> BGIOBJ EGAVGA
DOS> BGIOBJ CGA
DOS> BGIOBJ HERC

This generates the files "EGAVGA.OBJ", "CGA.OBJ", and "HERC.OBJ" in the current directory. You want to copy these into your main source directory (assume this is \phylip):

DOS> COPY EGAVGA.OBJ \phylip   [replace this with your source directory]
DOS> COPY CGA.OBJ \phylip
DOS> COPY HERC.OBJ \phylip

To compile the program, cd back to your source directory. You want to compile each source file, plus a shared graphics file called "drawgraphics.c". You also want to link it to the newly created BGI object files and to the graphics library. Examples:

DOS> tcc -mh drawgram.c drawgraphics.c herc.obj egavga.obj cga.obj graphics.lib
DOS> tcc -mh drawtree.c drawgraphics.c herc.obj egavga.obj cga.obj graphics.lib

(to compile drawgram and drawtree, respectively)

If you want to compile for the TD debugger, add the -v flag as above.

Waterloo C/386
-------- -----

Waterloo C/386 is the compiler we use to create the 386 PCDOS and 386 Windows versions of the executables. It has a "make" capability called "wmake". We have had problems using this so the instructions here are for individually compiling programs without wmake.
Watcom C/386 is a very flexible compiler which can generate executable programs for many different environments. Following are instructions for using Watcom C/386 to compile for DOS using the DOS/4GW DOS extender (included with the Watcom distribution) and for Microsoft Windows.

DOS/4GW:

To compile a program under Watcom C/386 for the DOS/4GW DOS extender use the following (the "DOS>" is the PCDOS prompt, not something you type):

    DOS> wcl386 /l=dos4gw /p /k65520 [source files]

If the program you are trying to compile is a 1-part source (for example, neighbor only has one part, neighbor.c) you can replace [source files] with "neighbor.c". So the command would be:

    DOS> wcl386 /l=dos4gw /p /k65520 neighbor.c

If the program you are trying to compile is a 2-part source (for example, mix has two parts, mix.c and mix2.c) you can replace [source files] with both of the source files. Make sure that the first source file in the list has the same name as the executable file you want, i.e. use mix.c mix2.c and not the other way around. If you reorder them, the executable file will be called "MIX2.EXE". For mix, the command would be:

    DOS> wcl386 /l=dos4gw /p /k65520 mix.c mix2.c

The resultant executable file will take advantage of your system's extended memory and will not be limited to using only the first 640K. However, it needs the file "dos4gw.exe" in order to run. If you want to be able to use the program generated, make sure that this program is somewhere in your path. (To ensure this you can copy the program into the directory where the compiled program resides). This "DOS extender" is bundled with the Watcom C/386 compiler and is freely redistributable.

For Windows:

To compile a program under Watcom C/386 for Windows use the following:

    DOS> wcl386 /l=win386 /zw /p /k65520 [source files]

Again, replace [source files] with either the complete program (i.e. neighbor.c) or both parts of the program (i.e. mix.c mix2.c).
Once you have compiled the Windows program you are not quite ready to run the program under Windows. The final step is to link it with the "windows supervisor". To do this do the following:

    DOS> wbind [program] -n

i.e.:

    DOS> wbind mix -n

This will generate [programname].exe, an application that will be runnable under Windows.

CAVEATS:

1. Make sure that when you use wbind that \watcom\binw is somewhere in your path. If it is not, you may have to tell wbind explicitly where the windows supervisor file is, as in the following example:

    DOS> wbind mix -n -s c:\watcom\binw\win386.ext

in which you replace c:\watcom\binw\win386.ext with the full path of win386.ext.

2. The draw programs (drawgram, drawtree) currently do not compile under Windows. Compile them for DOS/4GW and use them in a DOS shell under Windows.

Think C for Macintosh
----- - --- ---------

For Symantec's Think C compiler (formerly called Lightspeed C) a "make" utility is not available. Thus you cannot use the Makefile but must compile the programs individually. Here are the steps you should follow to compile a typical program.

(1) Start up Think C.

(2) Click on "New project" in the Think C project menu. You will be asked to enter the name of the project.

(3) Add the source code for the program to the project. To add sources to the project, you need to click on "add" from the source menu. You will need to add the sources from the main program (i.e. "neighbor.c" in the case of a program in 1 part or "dnaml.c" and "dnaml2.c" in the case of a 2-part program). You also need to add "interface.c" (included with the distribution) and two things which are included with the Think C compiler. The first one is "MacTraps", and is contained within the Think C folder under a directory called "MacLibraries". The second one is "ANSI", and is contained within the Think C folder under a directory called "C Libraries".

(4) Segment the project: After adding each of the sources to the project, you need to segment the project.
This means that every source file is contained within its own 32K segment. In order to do this within Think C, you can click on a source file name in the Think C project window (the window that lists each of the sources) and drag it down to the bottom of the source list. After you have done this for each of the source files, a dotted line should appear around each source file in the project window.

(5) Set up compile options: The first thing you need to do is set up what sort of project you're compiling, and some of the characteristics of how the memory is set up. To do this, select "Set project type" in the "Project" menu, and make sure it's set up to be an Application with far code and far data. Depending on the hardware you will be running on, you may want to select different compilation options. Most notably, if your machine has a 68881 math coprocessor, enable the use of the coprocessor by selecting "Options" under the "Edit" window, selecting "Compiler settings" through the list at the upper left corner of the display, and then checking the box next to "Generate 68881 instructions".

(6) Compile the project: select "Make" under the source window. After this has completed (assuming that there were no compile errors), you need to generate a Mac application. To do this, select "Build Application" under the project menu. Select a name for the application, and Think C will create a Macintosh application.

Although this is more tedious than using a Makefile, Think C works very well with the PHYLIP programs and is the compiler we use for creating the Macintosh executables.

Unix
----

I have already mentioned that under Unix you can use the "make" command to compile programs. This works on all Unix systems. To compile an individual program like dnapars.c you can give the command "make dnapars" or alternatively "cc dnapars.c -lm".
When compiling programs that come in two parts, such as dnaml.c and dnaml2.c, you will have to issue three commands, two compile commands and one link command:

    cc -c dnaml.c
    cc -c dnaml2.c
    cc dnaml.o dnaml2.o -lm -o dnaml

where the first two commands produce the object modules dnaml.o and dnaml2.o and the third command links them together into an executable that is called dnaml.

In running the programs, you may sometimes want to put them in background so you can proceed with other work. On systems with a windowing environment they can be put in their own window, and commands like "nice" used to make them have lower priority so that they do not interfere with interactive applications in other windows. If there is no windowing environment, you will want to use an ampersand ("&") after the command file name when invoking it to put the job in the background. You will have to put all the responses to the interactive menu of the program into a file and tell the background job to take its input from that file.

For example: suppose you want to run DNAPARS in the background, taking its input data from a file called sequences.dat, putting its interactive output to a file called "screenout", and using a file called "input" as the place to store the interactive input. The file "input" need only contain two lines:

    sequences.dat
    Y

which is what you would have typed to run the program interactively, in response to the program's request for an input file name if it did not find a file named "infile", and in response to the menu.

To run the program in background, you would simply give the command:

    dnapars < input > screenout &

which runs the program with input responses coming from "input" and interactive output being put into file "screenout".
The usual output file and tree file will also be created by this run (keep that in mind, as if you run any other PHYLIP program from the same directory while this one is running in background you may overwrite the output file from one program with that from the other!).

If you wanted to give the program lower priority, so that it would not interfere with other work, and you have Berkeley Unix type job control facilities in your Unix, you can use the "nice" command:

    nice +10 dnapars < input > screenout &

which lowers the priority of the run. To also time the run and put the timing at the end of "screenout", you can do this:

    nice +10 ( time dnapars < input ) >& screenout &

which I will not attempt to explain.

You may also want to explore putting the interactive output into the null file "/dev/null" so as to not be bothered with it (but then you cannot look at it to see why something went wrong). If you have problems with creating output files that are too large, you may want to explore carefully the turning off of options in the programs you run.

If you are doing several runs in one, as for example when you do a bootstrap analysis using SEQBOOT, DNAPARS (say), and CONSENSE, you can use an editor to create a "batch file" with these commands:

    seqboot < input1 > screenout
    mv outfile infile
    dnapars < input2 >> screenout
    mv treefile infile
    consense < input3 >> screenout

and then take the file (say "foofile") containing these commands and give it execute permission by using the command "chmod +x foofile" followed by the command "rehash". Then the job that foofile describes can be run as a single job in background by giving the command "foofile &". Note that you must also have the interactive input commands for SEQBOOT (including the random number seed), DNAPARS, and CONSENSE in the separate files "input1", "input2", and "input3".
With Berkeley-style job control the "nice" command can be used within the batch file "foofile" before each program name to reduce the priority with which the programs run.

VMS VAX systems
--- --- -------

On the VMS operating system with DEC VAX VMS C the programs will compile without alteration, except that we have to add some extra routines because the "%hd" format in printf and fprintf does not work. These extra routines are in the file VAXFIX.C. The commands for compiling a typical program (DNAPARS) are:

    $ DEFINE LNK$LIBRARY SYS$LIBRARY:VAXCRTL
    $ CC DNAPARS.C
    $ CC VAXFIX.C
    $ LINK DNAPARS,VAXFIX

Once you use this "$ DEFINE" statement during a given interactive session, you need not repeat it again as the symbol "LNK$LIBRARY" is thereafter properly defined. The compilation process leaves a file DNAPARS.OBJ in your directory: this can be discarded. The executable program is named DNAPARS.EXE. To run the program one then uses the command:

    $ R DNAPARS

The program defaults to the filenames "INFILE.", "OUTFILE.", and "TREEFILE.". If the input file "INFILE." does not exist the program will prompt you to type in its name. Note that some commands on VMS such as "TYPE OUTFILE" will fail because the name of the file that it will attempt to type out will be not "OUTFILE." but "OUTFILE.LIS". To get it to type the right file you would have to instead issue the command "TYPE OUTFILE.".

Some of the programs come in several pieces that have to be compiled and linked together. For example, DNAML comes in two pieces, dnaml.c and dnaml2.c. To compile them and link the resulting object files together into one executable, use the commands:

    $ DEFINE LNK$LIBRARY SYS$LIBRARY:VAXCRTL
    $ CC DNAML.C
    $ CC DNAML2.C
    $ CC VAXFIX.C
    $ LINK DNAML,DNAML2,VAXFIX

This will make an executable called DNAML.EXE plus two ".OBJ" files that can be discarded.
Note that when a LINK command is issued the name of the first file (in this case DNAML) becomes the name of the ".EXE" file that is produced by the linker.

To make it easier to compile all of the programs on VMS systems, we have supplied a command file, "compile.com", that will do this. If you install that file and issue the command "@compile" it will compile all of the programs. However it is recommended that you also know how to recompile individual programs so that they can be altered to your purposes.

The programs DRAWGRAM and DRAWTREE both use routines in drawgraphics.c. To compile (for example) DRAWGRAM, use:

    $ DEFINE LNK$LIBRARY SYS$LIBRARY:VAXCRTL
    $ CC DRAWGRAPHICS.C
    $ CC DRAWGRAM.C
    $ CC VAXFIX.C
    $ LINK DRAWGRAM,DRAWGRAPHICS,VAXFIX

which will create a file called DRAWGRAM.EXE, plus two ".OBJ" files. When you run DRAWGRAM you must have a font file present in your directory, as well as the tree file. If they are not found under their default names the program will prompt you for these.

When you are using the interactive previewing feature of DRAWGRAM (or DRAWTREE) on a Tektronix or DEC ReGIS compatible terminal, you will want before running the program to have issued the command:

    $ SET TERM/NOWRAP/ESCAPE

so that you do not run into trouble from the VMS line length limit of 255 characters or the filtering of escape characters.

Some later versions of Digital's VAX VMS operating system have a C compiler that no longer needs the VAXFIX patch. If so, follow the instructions below for OpenVMS and all will be well.

OpenVMS DEC Alpha systems
------- --- ----- -------

The OpenVMS operating system on Digital AlphaStations and other Digital Alpha AXP computers has many of the properties of the VAX VMS systems mentioned above except one important one. It does not need any of the VAXFIX.C corrections. Thus the programs should be compiled without this. Remove all mention of VAXFIX from COMPILE.COM (the lines compiling it and the linking of it).
Also take PHYLIP.H and comment out the section in which "vax_printf_is_broken" is defined. Then the compilation should proceed normally.

Cray
----

A number of people (F. James Rohlf, Kent Fiala, Shan Duncan, and Ron DeBry) succeeded in various ways in adapting the Pascal version of PHYLIP to several models of Crays. Recently Cray has been adopting Unicos, a Unix clone, as the operating system for its machines, and this means the Unix instructions should work for compiling the programs on Crays. However, although the underlying algorithms of most programs, which treat sites independently, should be amenable to vector processors, there are details of the code which might best be changed. In particular within the innermost loops of the programs there are often scalar quantities that are used for temporary bookkeeping. These quantities, such as sum1, sum2, zz, z1, yy, y1, aa, bb, cc, sum, and denom in procedure makenewv of DNAML (and similar quantities in procedure nuview) are there to minimize the number of array references. For vectorizing compilers such as the Cray compilers it will be better to replace them by arrays so that processing can occur simultaneously.

IBM Mainframes running CMS
--- ---------- ------- ---

The following information applies not only to IBM mainframes, but to IBM-compatible mainframes such as Amdahls, Fujitsu, Hitachis, and ICLs when they run IBM operating systems or IBM-compatible operating systems. It does not apply to IBM mainframes running AIX (IBM's version of Unix) as for those one can simply use the Unix instructions above without modification.

Because IBM is IBM, it tried to impose the EBCDIC character code on the world. There are good arguments for and against EBCDIC; in any case, the ASCII (or ISO) code is winning out. I have chosen to distribute PHYLIP in the ASCII character code, as more likely to be readable on more machines.
Some characters in ASCII have no equivalent in EBCDIC and get arbitrarily changed when my ASCII files are read into an EBCDIC machine. You may find some characters which look strange when viewed on a 3270 terminal on a CMS system, but we have found none that cause trouble for the compiler.

Andrew Keeffe was asked to investigate how to compile the C version of PHYLIP on our IBM 3090 system, and here is what he has found.

These are the procedures for compiling the phylip package in C on an IBM mainframe. These instructions were developed using IBM C/370 on an IBM 3090 running VM/XA CMS 5.6 Service Level 201.

If you fetch PHYLIP directly as an ftp binary transfer, getting a compressed tar archive file, as available from our machine, we do not know whether there is an "uncompress" and a "tar" utility available on CMS to extract the files from the archive and translate them from ASCII to EBCDIC. You should ask your computer consultants about that. Alternatively, you could fetch the files to a PCDOS or Unix machine, extract the archives there, and then move the resulting text files for the source code and documentation to the CMS system. If you that, after establishing the connection between the IBM and the other host, type will translate the text files properly.

CMS prefers the names of files to have a minimum of two parts, called the filename (abbreviated fn) and the filetype (abbreviated ft), separated by a space. We have chosen "data" as the filetype, so that "infile" becomes "infile data", "outfile" becomes "outfile data" and so forth.

All commands that you give to the host are shown in UPPER CASE. You can type them in upper or lower case; CMS does not care.

Before compiling, give these commands to CMS:

    SETUP C370
    GLOBAL TXTLIB EDCBASE IBMLIB

It would make sense to put these commands in your profile exec until the compiling and linking is complete.

To compile a single program, such as dnapars.c:

    CC DNAPARS

If there are no errors, the compiler will produce a file with the same filename and a filetype of 'text', DNAPARS TEXT in this case. Now give these commands:

    LOAD DNAPARS
    GENMOD DNAPARS

The genmod command generates an executable module file (DNAPARS MODULE) which may be invoked by typing its name on the command line. Use this procedure to compile all of the phylip programs except dnaml, dnamlk, restml, drawgram, and drawtree.

The source files for dnaml, dnamlk, and restml have been split into two parts. To compile one of these programs, give these commands:

    CC DNAML
    CC DNAML2
    LOAD DNAML DNAML2
    GENMOD DNAML

Proceed similarly for dnamlk and restml.

The draw programs, drawgram and drawtree, both depend on common code which is stored in drawgraphics.c and drawgraphics.h. These names will be truncated to DRAWGRAP C and DRAWGRAP H on the CMS system. The contents of the files are not affected.

Compile the drawgraphics code:

    CC DRAWGRAP

Compile and link the draw programs:

    CC DRAWGRAM
    LOAD DRAWGRAM DRAWGRAP
    GENMOD DRAWGRAM

    CC DRAWTREE
    LOAD DRAWTREE DRAWGRAP
    GENMOD DRAWTREE

If you are having trouble getting the programs running on your machine, contact me. If I can't help, I can at least find out whether there is anyone else who has adapted them to the same machine and put you in touch with them.

Other Computer Systems
----- -------- -------

As you can see from the variety of different systems on which these programs have been successfully run, there are no serious incompatibility problems with most computer systems. PHYLIP in various past Pascal versions has also been compiled on 8080 and Z80 CP/M systems, Apple II systems running UCSD Pascal, a variety of minicomputer systems such as DEC PDP-11's and HP 1000's, CDC Cyber systems, and so on. We hope gradually to accumulate experience on a wider variety of C compilers.
If you succeed in compiling the C version of PHYLIP on a different machine or a different compiler, I would like to hear the details so that I can include the instructions in a future version of this manual.

FREQUENTLY ASKED QUESTIONS

(1) "If I copied PHYLIP from a friend without you knowing, should I try to keep you from finding out?" No. It is to your advantage and mine for you to let me know. If you did not get PHYLIP "officially" from me or from someone authorized by me, but copied a friend's version, you are not in my database of users. You probably also have an old version which has since been substantially improved (see the beginning of this main document file for the date on which this version was released). I don't mind you "bootlegging" PHYLIP (it's free anyway, and that saves me the work of writing diskettes), but you should realize that you may have an outdated version. You may be able to get the latest version just as quickly over Internet. You can read about subsequent bug fixes in the electronic news bulletins the person you got it from may (or may not) have subscribed to. It will help both of us if you get onto my mailing list. If you are on it, then I will give your name to other nearby users when they get a new copy, and they are urged to contact you and update your copy. (I benefit by getting a better feel for how many distributions there have been, and having a better mailing list to use to give other users local people to contact). Send me your name and address (five lines maximum), and your phone number, with the number of the version that you have, plus the type of your computer, operating system, and C compiler, so that I can add you to the address list. Note also the listserver information which you can get, which provides news about PHYLIP by electronic mail. This is described in the next to last section of this document.

(2) "How do I make a citation to the PHYLIP package in the paper I am writing?" One way is like this: Felsenstein, J.
1993. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle. or if the editor for whom you are writing insists that the citation must be to a printed publication, you could cite a notice for version 3.2 published in Cladistics: Felsenstein, J. 1989. PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166. For a while a printed version of the PHYLIP documentation was available and one could cite that. This is no longer true. Other than that, this is difficult, because I have never written a paper announcing PHYLIP! My 1985b paper in Evolution (see the References section below) on the bootstrap method contains a one-paragraph Appendix describing the availability of this package, and that can also be cited as a reference for the package, although it has been distributed since 1980 while the bootstrap paper is 1985. A paper on PHYLIP is needed mostly to give people something to cite, as word-of-mouth, references in other people's papers, and electronic newsgroup postings have spread the word about PHYLIP's existence quite effectively. (3) "How do I bootstrap? Why has DNABOOT disappeared?" DNABOOT, BOOT, and DOLBOOT, the previous parsimony-based bootstrap programs, have been removed from the package as there is now a more general way of bootstrapping. It involves running SEQBOOT to make multiple bootstrapped data sets out of your one data set, then running one of the tree-making programs with the Multiple data sets option to analyze them all, then running CONSENSE to make a majority rule consensus tree from the resulting tree file. Read the documentation of SEQBOOT to get further information. Before, only parsimony methods could be bootstrapped. With this new system almost any of the tree-making methods in the package can be bootstrapped. It is somewhat more tedious but you will find it much more rewarding. 
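The resampling step behind the bootstrap can be pictured with a short C sketch. This is not the code of SEQBOOT: the function names bootstrap_weights and total_weight, and the small random number generator, are assumptions made only for illustration. The sketch draws as many sites as the data set has, sampling with replacement, and records how many times each original site was drawn.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PHYLIP: a small linear congruential
   generator, so that a replicate can be reproduced from its seed. */
static unsigned long lcg_next(unsigned long *state)
{
    *state = (*state * 1103515245UL + 12345UL) & 0x7fffffffUL;
    return *state;
}

/* Draw one bootstrap replicate: sample nsites sites with replacement and
   record in weight[i] how many times site i was drawn.  The weights of a
   replicate always add up to nsites. */
void bootstrap_weights(size_t nsites, unsigned long seed, int *weight)
{
    size_t i;
    unsigned long state = seed;

    assert(weight != NULL);
    for (i = 0; i < nsites; i++)
        weight[i] = 0;
    for (i = 0; i < nsites; i++) {
        /* discard the low-order bits, which are the least random ones */
        size_t pick = (size_t)((lcg_next(&state) >> 8) % nsites);
        weight[pick]++;
    }
}

/* Add up the weights, as a sanity check on a replicate. */
int total_weight(size_t nsites, const int *weight)
{
    size_t i;
    int sum = 0;

    for (i = 0; i < nsites; i++)
        sum += weight[i];
    return sum;
}
```

Calling bootstrap_weights once per replicate, each time with a different seed, gives one resampled data set per call. A site with weight 0 is left out of that replicate and a site with weight 2 appears in it twice.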
(4) "How do I specify a multi-species outgroup with your parsimony programs?" It's not a feature but is not too hard to do in many of the programs. In parsimony programs like MIX, for which the W (Weights) and A (Ancestral states) options are available, and weights can be larger than 1, all you need to do is:

(a) In MIX, make up an extra character with states 0 for all the outgroups and 1 for all the ingroups. If using DNAPARS the ingroup can have (say) "G" and the outgroup "A".

(b) Assign this character an enormous weight (such as Z for 35) using the W option, all other characters getting weight 1, or whatever weight they had before.

(c) If it is available, use the A (Ancestral states) option to designate that for that new character the state found in the outgroup is the ancestral state.

(d) In MIX do not use the O (Outgroup) option.

(e) After the tree is found, the designated ingroup should have been held together by the fake character. The tree will be rooted somewhere in the outgroup (the program may or may not have a preference for one place in the outgroup over another). Make sure that you subtract from the total number of steps on the tree all steps in the new character.

In programs like DNAPARS, you cannot use this method as weights of sites cannot be greater than 1. But you can do an analogous trick, by adding a largish number of extra sites to the data, with one nucleotide state ("A") for the ingroup and another ("G") for the outgroup. You will then have to use RETREE to manually reroot the tree in the desired place.

(5) "How do I force certain groups to remain monophyletic in your parsimony programs?" By the same method, using multiple fake characters, any number of groups of species can be forced to be monophyletic. In MOVE, DOLMOVE, and DNAMOVE you can specify whatever outgroups you want without going to this trouble.

(6) "How can I reroot one of the trees written out by PHYLIP?" Use the program RETREE.
But keep in mind whether the tree inferred by the original program was already rooted, or whether you are free to reroot it.

(7) "Why doesn't NEIGHBOR read my DNA sequences correctly?" Because it wants to have as input a distance matrix, not sequences. You have to use DNADIST to make the distance matrix first.

(8) "What do I do about deletions and insertions in my sequences?" The molecular sequence programs will accept sequences that have gaps (the "-" character). They do various things with them, mostly not optimal. DNAPARS counts "gap" as if it were a fifth nucleotide state (in addition to A, C, G, and T). Each site counts one change when a gap arises or disappears. The disadvantage of this treatment is that a long gap will be overweighted, with one event per gapped site. So a gap of 10 nucleotides will count as being as much evidence as 10 single site nucleotide substitutions. If there are no overlapping gaps, one way to correct this is to recode the first site in the gap as "-" but make all the others be "?" so the gap only counts as one event. Other programs such as DNAML and DNADIST count gaps as equivalent to unknown nucleotides (or unknown amino acids) on the grounds that we don't know what would be there if something were there. This completely leaves out the information from the presence or absence of the gap itself, but does not bias the gapped sequence to be close to or far from other gapped or ungapped sequences.

(9) "Why don't your parsimony programs print out branch lengths?" Because there are problems defining the branch lengths. If you look closely at the reconstructions of the states of the hypothetical ancestral nodes for almost any data set and almost any parsimony method you will find some ambiguous states on those nodes. There is then usually an ambiguity as to which branch the change is actually on.
Other parsimony programs resolve this in one or another arbitrary fashion, sometimes with the user specifying how (for example, methods that push the changes up the tree as far as possible or down it as far as possible). I have preferred to leave it to the user to do this. Few programs available from others currently correct the branch lengths for multiple changes of state that may have overlain each other. One possible way to get branch lengths with nucleotide sequence data is to take the tree topology that you got, use RETREE to convert it to be unrooted, prepare a distance matrix from your data using DNADIST, and then use FITCH with that tree as User Tree and see what branch lengths it estimates. (10) "Why can't your programs handle unordered multistate characters?" Well, they can if they are 4-state characters whose states are A, C, G, and T (or U) because then one can use the DNA sequence parsimony programs. But in general the discrete characters parsimony programs can only handle two states, 0 and 1. This is mostly because I have not yet had time to modify them to do so - the modifications would have to be extensive. Ultimately I hope to get these done, but in the meantime the best I can do is suggest that you either use one of the excellent parsimony programs produced by others (PAUP or Hennig86, for example) or if you have four or fewer states recode your states to look like nucleotides and use the parsimony programs in the molecular sequence section of PHYLIP. (11) "Where can I get a printed version of the PHYLIP documents?" For the moment, you can only get a printed version by printing it yourself. For versions 3.1 to 3.3 a printed version was sold by Christopher Meacham and Tom Duncan, then at the University Herbarium of the University of California at Berkeley. But they have had to discontinue this as it was too much work. 
You should be able to print out the documentation files on almost any printer and make yourself a printed version of whichever of them you need. (12) "Why have I been dropped from your newsletter mailing list?" You haven't. The newsletter was dropped. It simply was too hard to mail it out to such a large mailing list. The last issue of the newsletter was Number 9 in May, 1987. I am hoping that the Listserver News Bulletins will replace the old PHYLIP Newsletter. If you have electronic mail access you should definitely sign up for these bulletins. For details see the section on the Listserver News Bulletins below. (13) "How many copies of PHYLIP have been distributed?" Currently (July, 1995) I have a bit over 2700 registered installations worldwide. Of course there are many more people who have got copies from friends. PHYLIP is the most widely distributed phylogeny package. PAUP is catching up in terms of official registrations, but PHYLIP is probably far ahead in terms of numbers of actual copies out there. In terms of phylogenies published, however, PAUP is ahead, but PHYLIP is gaining on it. In recent years magnetic tape distribution of PHYLIP has declined precipitously, electronic mail distribution is decreasing, and there has been a slow decrease of diskette distributions. But all this has been more than offset by a huge explosion of distributions by anonymous ftp over Internet (a rate of about 6 ftp sessions per day, at the moment). Because some people who get the package by anonymous ftp forget to register their copies, it is hard to estimate how many people have got it this way. ADDITIONAL FREQUENTLY ASKED QUESTIONS, OR: "Why didn't it occur to you to ... (1) ... write these programs in Pascal?" These programs started out in Pascal in 1980. In 1993 we have released both Pascal and C versions. All future versions will be C-only. 
I make fewer mistakes in Pascal and do like the language better than C, but C has overtaken Pascal and Pascal compilers are starting to be hard to find on some machines. Also C is a bit better standardized, which makes the number of modifications a user has to make to adapt the programs to their system much smaller.

(2) ... forget about all those inferior systems and just develop PHYLIP for Unix?" This is self-answering, since the same people first said I should just develop it for Apple II's, then for CP/M Z-80's, then for IBM PCDOS, and now they're starting to tell me to just develop it for Macintoshes or for Sun workstations. If I had listened to them and done any one of these, I would have had a very hard time adapting the package to any of the other ones once these folks changed their mind!

(3) ... write these programs in PROLOG (or Ada, or Modula-2, or SIMULA, or BCPL, or PL/I, or APL, or LISP)?" These are all languages I have considered. All have advantages, but they are not really spreading (C is).

(4) ... include in the package a program to do the Distance Wagner method, (or successive approximations character weighting, or transformation series analysis)?" In most cases where I have not included other methods, it is because I decided that they had no substantial advantages over methods that were included (such as the programs FITCH, KITSCH, NEIGHBOR, the T option of MIX and DOLLOP, and the "?" ancestral states option of the discrete characters parsimony programs).

(5) ... include in the package ordination methods and more clustering algorithms?" Because this is NOT a clustering package, it's a package for phylogeny estimation. Those are different tasks with different objectives and mostly different methods. Mary Kuhner has, however, included in NEIGHBOR an option for UPGMA clustering, which will be very similar to KITSCH in results.

(6) ... include in the package a program to do nucleotide sequence alignment?"
Well, yes, I should have, and this is scheduled to be in future releases. But multiple sequence alignment programs, in the era after Sankoff, Morel, and Cedergren's 1973 classic paper, need to use substantial computer horsepower to estimate the alignment and the tree together. So I will be slow getting this into the package and in the meantime you may want to investigate ClustalV or TreeAlign.

(7) ... send me the programs over the electronic mail network I use, BUTTERFLYNET?"

Well, I am trying to. Maybe there is a BUTTERFLYNET gateway hanging off FISHNET, which hangs off HAIRNET, which ... I am connected to Internet, which connects to Bitnet. I can mail to Bitnet (EARN, NetNorth) and to UUCP networks. Keep in mind that the resulting files take up about 2.2 Megabytes and that if you are not going to use them on the machine I send them to, you will have to download the files to your other machine. Also in some cases networks and gateways lose or truncate files (these can be up to about 60K long). So sometimes diskette or tape is a better medium. I hope to continually expand and solidify network distribution.

For a couple of years, PHYLIP has been available over Internet by "anonymous ftp" from my machine, evolution.genetics.washington.edu (128.95.12.41). You can start by fetching file "Read.Me" from directory pub/phylip. My electronic mail addresses are given at the end of this document. Contact me by electronic mail if you are interested in getting PHYLIP over your network but cannot get ftp to work.

(8) ... let me log in to your computer in Seattle and copy the files out over a phone line?"

No thanks. It would cost you over two hours of long-distance telephone time, plus a half hour of my time and yours in which I would have to explain to you how to log in and do the copying.

(9) ... send me a listing of your program?"

Damn it, it's not "a program", it's 30 programs, in a total of 87 files.
What were you thinking of doing, having 1800-line programs typed in by slaves at your end? If you were going to go to all that trouble why not try network transfer or diskettes? If you have these then you can print out all the listings you want to and add them to the huge stack of printed output in the corner of your office. (This and the following two questions, once common, are finally disappearing, I am pleased to report).

(10) ... write a magnetic tape in our computer center's favorite format (inverted Lithuanian EBCDIC at 998 bpi)?"

Because the ANSI standard format is the most widely used one, and even though your computer center may pretend it can't read a tape written this way, if you sniff around you will find a utility to read it. It's just a LOT easier for me to let you do that work. If I tried to put the tape into your format, I would probably get it wrong anyway.

(11) ... give us a version of these in FORTRAN?"

Because the programs are FAR easier to write and debug in C or Pascal, and cannot easily be rewritten into FORTRAN (they make extensive use of recursive calls and of records and pointers). In any case, C is widely available. If you don't have a C compiler or don't know how to use it, you are going to have to learn a language like C or Pascal sooner or later, and the sooner the better.

NEW FEATURES IN RECENT VERSIONS

Version 3.5 has many new features. They include:

1. The programs now exist in C as well as in Pascal. In the future we will support only the C versions, and as of now will not make any more improvements in the Pascal version. It will cease to be distributed with the next release of PHYLIP. A Makefile has been included in the distribution to simplify the problems of compiling the package. The existence of a C compiler on most workstations means that we have ceased to directly distribute executables for workstations, as people can easily create them themselves by following our instructions.

2.
All programs now have had the upper limits on the numbers of species and numbers of sites (or characters) removed. They instead use the "malloc" and "free" functions of C to try to allocate as much memory as they need. If they fail to find it they will complain, and you will have to look for a bigger machine, or install more memory, or remove other jobs that are competing for the memory. We no longer have to guess how large a computer you have and where you want to put the tradeoff between species and sites.

3. The program SEQBOOT has now fully superseded the former programs DNABOOT, BOOT, and DOLBOOT, which have been withdrawn. SEQBOOT also now can carry out Archie-Faith permutation of characters across species.

4. The DNA likelihood programs DNAML and DNAMLK now have a revised Categories option that allows them to cope with rate variation from site to site. Instead of the user specifying in advance the rate category of each site, they need only specify how many categories there are, what their rates are, what their relative probabilities are, and how long are the patches of spread of a single rate along the molecule, on average. The program then computes the likelihood allowing for all of these, and adding up over all possibilities of rate patterns, without being dependent on assuming that it has inferred rates at individual sites correctly. This should go far to address the criticism that maximum likelihood assumes constancy of rate at all sites.

5. A new program PROTDIST has been added to compute distance matrices from protein sequences, using several different methods. This will allow protein sequence data to be analyzed by distance matrix methods as well as parsimony methods.

6. A new program, RETREE, has been added to allow users easily and interactively to reroot trees, flip branches around, change or remove branch lengths, change species names, and so on.

7.
Programs that estimate a tree with branch lengths can now all read in a user tree that has branch lengths, and can be told to use these rather than re-estimating them (this was already possible for DNAML and DNAMLK). In addition, the programs that estimate an unrooted tree (DNAML, FITCH, RESTML and CONTML) can read in a tree with branch lengths on some branches and not on others, and be told to hold the ones read in constant while iterating the rest. Thus you can, for example, specify that a certain branch must have length zero.

8. DRAWTREE and DRAWGRAM can now write out a PICT file that can be read by the MacDraw drawing program. They can also write out the file format for the Xwindows drawing program XFIG, and the input format for the freely distributed ray tracing program RAYSHADE (for trees seen in 3 dimensions floating above a landscape). In addition they allow fonts to be specified for species names when a PostScript printer is being used, and they can also make an Xwindows X-bitmap file. DRAWTREE has a new option that allows the program to (slowly) calculate node positions so as to make them avoid each other better. Both programs now, when plotting on raster devices such as dot-matrix printers, use round pens to make the lines smoother, and are faster at drawing the lines.

9. DNADIST now computes its distances much more quickly. It also can compute the Jin and Nei (1990) distance that allows for rate variation among sites.

10. The programs that estimate trees by adding species sequentially to a tree (PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, FITCH, KITSCH, MIX, and DOLLOP) now allow the user to specify that multiple tries will be made with different input orders of species (using the Jumble option) with only the trees tied for best overall being reported. The trees found will be those that are tied for best among all of those found by all these runs, not the trees found as best by each run.
This improves the chances of finding the best tree.

11. A program COALLIKE was added to compute likelihood functions for 4Nu, the product of 4 times the effective population size times the mutation rate, for samples of genes from a single isolated population, where the program read trees that had been sampled from the data by bootstrapping followed by maximum likelihood. This method was described by me in a paper in late 1992 in Genetical Research. Subsequent work by Richard Hudson and our lab has shown the method to be biased. COALLIKE has been withdrawn from the package in version 3.57. It is replaced by a program "coalesce" in a new package, LAMARC, which is available from our ftp server.

Version 3.4 also had many new features. They included:

1. All programs were given interactive menus which allow the user to see and alter option settings. The programs read from a file INFILE and write to a file OUTFILE, as well as to a treefile TREEFILE. The result should be much easier for novice users to deal with. Most of the options which once were set by altering the input file can now be selected using the menu. Only options that require separate information for each character or site, such as Weights, Ancestors, Factors, and the Categories option, continued to require that information be entered into the input file (although user-defined trees are put there also).

2. The molecular sequence programs now allowed either interleaved or sequential sequence input (i.e. sequences entered in "aligned" form or by having all of one sequence followed by all of another). The choice is made using the interactive menu.

3. Three new programs were added:

NEIGHBOR carried out Saitou and Nei's neighbor-joining method for distance matrix data, which is much faster than FITCH and KITSCH and should be able to handle much larger data sets. It also carried out the UPGMA clustering method.
SEQBOOT allowed the user to bootstrap nucleotide sequence data sets, protein sequence data sets, or discrete-characters data sets and write out to a file the multiple data sets that result.

CONTRAST accepted a continuous-characters data set and a series of user trees, and wrote out the series of contrasts for each character that are independent under a Brownian motion model of character evolution, as well as regressions, correlations, and covariances between them.

4. All of the programs that inferred trees now accepted multiple data sets. This allowed us to use SEQBOOT together with this feature to analyze bootstrapped data sets and find different trees for the different bootstrap replicates. Their variation could be summarized by the consensus tree program CONSENSE. Thus almost everything in this package could now be bootstrapped.

5. A serious error was fixed in version 3.31. It had made the DNA likelihood programs and DNADIST give incorrect results when the Categories option was used and there was more than one category of rates. Runs of these programs that used the Categories option before that version should be rerun.

6. Almost all programs now printed out trees in the "phenogram" form, so that they grew left-to-right, rather than in the triangular diagram used before.

7. The tree-plotting programs DRAWGRAM and DRAWTREE now supported the Hewlett-Packard Laserjet printers and also could produce output files compatible with the PC-Paint drawing program. The code for placement of interior nodes in DRAWGRAM was corrected, and preview of trees using Tektronix graphics was made easier by having it clear the screen more often.

8. The DNA likelihood program DNAML now ran about 60% faster.

9. The restriction sites likelihood program RESTML now allowed for data arising from digests with multiple enzymes.

COMING ATTRACTIONS, FUTURE PLANS

There are some obvious deficiencies in this version. Some of these holes will be filled in the next few releases (3.6, 3.7, etc.). They include:

1.
A program to align molecular sequences on a predefined User Tree may ultimately be included. This will allow alignment and phylogeny reconstruction to proceed iteratively by successive runs of two programs, one aligning on a tree and the other finding a better tree based on that alignment. In the shorter run a simple two-sequence alignment program may be included.

2. An interactive "likelihood explorer" for DNA sequences will be written. This will allow, either with or without the assumption of a molecular clock, trees to be varied interactively so that the user can get a much better feel for the shape of the likelihood surface. Likelihood will be able to be plotted against branch lengths for any branch.

3. The DNAML and DNAMLK programs will reinstate the previous Categories option, where the user specified categories of rates of evolution for each site, while also retaining the present one, which infers them. The hope is to allow for variation in rate in 1st, 2nd and 3rd positions in a coding sequence (these being identified by the user) while also allowing for autocorrelated rates of evolution in adjacent codons.

4. If possible we will find some way of correcting for purine/pyrimidine richness variations among species, within the framework of the maximum likelihood programs. That the maximum likelihood programs do not allow for base composition variation is their major limitation at the moment.

5. Inclusion of some kind of protein sequence maximum likelihood program is an obvious need (right now we have Adachi and Hasegawa's program in the Unsupported Division).

6. The Categories option of DNAML and DNAMLK will be generalized to allow for rates at sites to gradually change as one moves along the tree, in an attempt to implement Fitch and Markowitz's (1970) notion of "covarions".

7. Obviously we need to start thinking about a more visual X windows interface, but only if that can be used on most systems.

8.
Program PENNY and its relatives will be improved so as to run faster and find all most parsimonious trees more quickly.

9. A more sophisticated compatibility program should be included, if I can find one.

10. An "evolutionary clock" version of CONTML will be done, and the same may also be done for RESTML.

12. We hope gradually to generalize the tree structures in the programs to infer multifurcating trees as well as bifurcating ones.

13. We hope to economize on the size of the source code, and enforce some standardization of it, by putting frequently used routines in a library from which they can be linked into various programs. This will enforce a rather complete standardization of our code.

14. We may decide to gradually move our code to an object-oriented language, most likely C++. One could describe the language that version 3.4 was written in as "Pascal", version 3.5 as "Pascal written in C", version 4.0 as "C written in C", and maybe version 4.1 as "C++ written in C" and then 4.2 as "C++ written in C++". At least that scenario is one possibility.

Much of the future development of the package will be in the DNA likelihood programs and the distance matrix programs. This is for several reasons. First, I am more interested in those problems. Second, collection of molecular data is increasing rapidly, and those programs have the most promise for future development for those data.

REFERENCES FOR THE DOCUMENTATION FILES

In the documentation files that follow I frequently refer to papers in the literature. In order to centralize the references they are given in this section. If you want to find further papers beyond these, my Quarterly Review of Biology review of 1982 and my Annual Review of Genetics review of 1988 list many further references. The chapter by David Swofford and Gary Olsen (1990) is also an excellent review of the issues in phylogeny reconstruction.

Adams, E. N. 1972. Consensus techniques and the comparison of taxonomic trees.
Systematic Zoology 21: 390-397.

Adams, E. N. 1986. N-trees as nestings: complexity, similarity, and consensus. Journal of Classification 3: 299-317.

Archie, J. W. 1989. A randomization test for phylogenetic information in systematic data. Systematic Zoology 38: 219-252.

Astolfi, P., K. K. Kidd, and L. L. Cavalli-Sforza. 1981. A comparison of methods of reconstructing evolutionary trees. Systematic Zoology 30: 156-169.

Baum, B. R. 1989. PHYLIP: Phylogeny Inference Package. Version 3.2. (Software review). Quarterly Review of Biology 64: 539-541.

Bron, C., and J. Kerbosch. 1973. Algorithm 457: Finding all cliques of an undirected graph. Communications of the Association for Computing Machinery 16: 575-577.

Camin, J. H., and R. R. Sokal. 1965. A method for deducing branching sequences in phylogeny. Evolution 19: 311-326.

Carpenter, J. 1987a. A report on the Society for the Study of Evolution workshop "Computer Programs for Inferring Phylogenies". Cladistics 3: 363-375.

Carpenter, J. 1987b. Cladistics of cladists. Cladistics 3: 363-375.

Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Evolution 21: 550-570 (also Amer. J. Human Genetics 19: 233-257).

Cavender, J. A. and J. Felsenstein. 1987. Invariants of phylogenies in a simple case with discrete states. Journal of Classification 4: 57-71.

Churchill, G. A. 1989. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology 51: 79-94.

Conn, E. E. and P. K. Stumpf. 1963. Outlines of Biochemistry. John Wiley and Sons, New York.

Day, W. H. E. 1983. Computationally difficult parsimony problems in phylogenetic systematics. Journal of Theoretical Biology 103: 429-438.

Dayhoff, M. O. 1979. Atlas of Protein Sequence and Structure, Volume 5, Supplement 3, 1978. National Biomedical Research Foundation, Washington, D.C.

DeBry, R. W. and N. A. Slade. 1985. Cladistic analysis of restriction endonuclease cleavage maps within a maximum-likelihood framework.
Systematic Zoology 34: 21-34.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39: 1-38.

Eck, R. V., and M. O. Dayhoff. 1966. Atlas of Protein Sequence and Structure 1966. National Biomedical Research Foundation, Silver Spring, Maryland.

Edwards, A. W. F., and L. L. Cavalli-Sforza. 1964. Reconstruction of evolutionary trees. pp. 67-76 in Phenetic and Phylogenetic Classification, ed. V. H. Heywood and J. McNeill. Systematics Association Volume No. 6. Systematics Association, London.

Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris. 1976a. A mathematical foundation for the analysis of character compatibility. Mathematical Biosciences 23: 181-187.

Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris. 1976b. An algebraic analysis of cladistic characters. Discrete Mathematics 16: 141-147.

Estabrook, G. F., F. R. McMorris, and C. A. Meacham. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology 34: 193-200.

Faith, D. P. 1990. Chance marsupial relationships. Nature 345: 393-394.

Faith, D. P. and P. S. Cranston. 1991. Could a cladogram this short have arisen by chance alone?: On permutation tests for cladistic structure. Cladistics 7: 1-28.

Farris, J. S. 1977. Phylogenetic analysis under Dollo's Law. Systematic Zoology 26: 77-88.

Farris, J. S. 1978a. Inferring phylogenetic trees from chromosome inversion data. Systematic Zoology 27: 275-284.

Farris, J. S. 1981. Distance data in phylogenetic analysis. pp. 3-23 in Advances in Cladistics: Proceedings of the first meeting of the Willi Hennig Society, ed. V. A. Funk and D. R. Brooks. New York Botanical Garden, Bronx, New York.

Farris, J. S. 1983. The logical basis of phylogenetic analysis. pp. 1-47 in Advances in Cladistics, Volume 2, Proceedings of the Second Meeting of the Willi Hennig Society. ed. Norman I. Platnick and V. A. Funk.
Columbia University Press, New York.

Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85.

Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157.

Farris, J. S. ["T. N. Nayenizgani"]. 1990. The systematics association enters its golden years (review of "Prospects in Systematics", ed. D. Hawksworth). Cladistics 6: 307-314.

Felsenstein, J. 1973a. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology 22: 240-249.

Felsenstein, J. 1973b. Maximum-likelihood estimation of evolutionary trees from continuous characters. Amer. J. Human Genetics 25: 471-492.

Felsenstein, J. 1978a. The number of evolutionary trees. Systematic Zoology 27: 27-33.

Felsenstein, J. 1978b. Cases in which parsimony and compatibility methods will be positively misleading. Systematic Zoology 27: 401-410.

Felsenstein, J. 1979. Alternative methods of phylogenetic inference and their interrelationship. Systematic Zoology 28: 49-62.

Felsenstein, J. 1981a. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Molecular Evolution 17: 368-376.

Felsenstein, J. 1981b. A likelihood approach to character weighting and what it tells us about parsimony and compatibility. Biological Journal of the Linnean Society 16: 183-196.

Felsenstein, J. 1981c. Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution 35: 1229-1242.

Felsenstein, J. 1982. Numerical methods for inferring evolutionary trees. Quarterly Review of Biology 57: 379-404.

Felsenstein, J. 1983b. Parsimony in systematics: biological and statistical issues. Annual Review of Ecology and Systematics 14: 313-333.

Felsenstein, J. 1984a. Distance methods for inferring phylogenies: a justification. Evolution 38: 16-24.

Felsenstein, J. 1984b. The statistical approach to inferring evolutionary trees and what it tells us about parsimony and compatibility. pp.
169-191 in: Cladistics: Perspectives in the Reconstruction of Evolutionary History, edited by T. Duncan and T. F. Stuessy. Columbia University Press, New York.

Felsenstein, J. 1985a. Confidence limits on phylogenies with a molecular clock. Systematic Zoology 34: 152-161.

Felsenstein, J. 1985b. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791.

Felsenstein, J. 1985c. Phylogenies from gene frequencies: a statistical problem. Systematic Zoology 34: 300-311.

Felsenstein, J. 1985d. Phylogenies and the comparative method. American Naturalist 125: 1-12.

Felsenstein, J. 1986. Distance methods: a reply to Farris. Cladistics 2: 130-144.

Felsenstein, J. and E. Sober. 1986. Parsimony and likelihood: an exchange. Systematic Zoology 35: 617-626.

Felsenstein, J. 1988a. Phylogenies and quantitative characters. Annual Review of Ecology and Systematics 19: 445-471.

Felsenstein, J. 1988b. Phylogenies from molecular sequences: inference and reliability. Annual Review of Genetics 22: 521-565.

Felsenstein, J. 1992a. Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genetical Research 59: 139-147.

Felsenstein, J. 1992b. Phylogenies from restriction sites, a maximum likelihood approach. Evolution 46: 159-173.

Felsenstein, J. 1992c. Estimating effective population size from samples of sequences: a bootstrap Monte Carlo integration approach. Genetical Research, (December issue), in press.

Fink, W. L. 1986. Microcomputers and phylogenetic analysis. Science 234: 1135-1139.

Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155: 279-284.

Fitch, W. M. 1971. Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology 20: 406-416.

Fitch, W. M. 1975. Toward finding the tree of maximum parsimony. pp.
189-230 in Proceedings of the Eighth International Conference on Numerical Taxonomy, ed. G. F. Estabrook. W. H. Freeman, San Francisco.

Fitch, W. M. and E. Markowitz. 1970. An improved method for determining codon variability and its application to the rate of fixation of mutations in evolution. Biochemical Genetics 4: 579-593.

George, D. G., L. T. Hunt, and W. C. Barker. 1988. Current methods in sequence comparison and analysis. pp. 127-149 in Macromolecular Sequencing and Synthesis, ed. D. H. Schlesinger. Alan R. Liss, New York.

Gomberg, D. 1966. "Bayesian" post-diction in an evolution process. unpublished manuscript: University of Pavia, Italy.

Graham, R. L., and L. R. Foulds. 1982. Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time. Mathematical Biosciences 60: 133-142.

Hasegawa, M. and T. Yano. 1984a. Maximum likelihood method of phylogenetic inference from DNA sequence data. Bulletin of the Biometric Society of Japan No. 5: 1-7.

Hasegawa, M. and T. Yano. 1984b. Phylogeny and classification of Hominoidea as inferred from DNA sequence data. Proceedings of the Japan Academy 60 B: 389-392.

Hasegawa, M., Y. Iida, T. Yano, F. Takaiwa, and M. Iwabuchi. 1985a. Phylogenetic relationships among eukaryotic kingdoms as inferred from ribosomal RNA sequences. Journal of Molecular Evolution 22: 32-38.

Hasegawa, M., H. Kishino, and T. Yano. 1985b. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22: 160-174.

Hendy, M. D., and D. Penny. 1982. Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59: 277-290.

Higgins, D. G. and P. M. Sharp. 1989. Fast and sensitive multiple sequence alignments on a microcomputer. Computer Applications in the Biological Sciences (CABIOS) 5: 151-153.

Holmquist, R., M. M. Miyamoto, and M. Goodman. 1988. Higher-primate phylogeny -- why can't we decide?
Molecular Biology and Evolution 5: 201-216.

Inger, R. F. 1967. The development of a phylogeny of frogs. Evolution 21: 369-384.

Jin, L. and M. Nei. 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Molecular Biology and Evolution 7: 82-102.

Jukes, T. H. and C. R. Cantor. 1969. Evolution of protein molecules. pp. 21-132 in Mammalian Protein Metabolism, ed. H. N. Munro. Academic Press, New York.

Kim, J. and M. A. Burgman. 1988. Accuracy of phylogenetic-estimation methods using simulated allele-frequency data. Evolution 42: 596-602.

Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120.

Kimura, M. 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.

Kingman, J. F. C. 1982a. The coalescent. Stochastic Processes and Their Applications 13: 235-248.

Kingman, J. F. C. 1982b. On the genealogy of large populations. Journal of Applied Probability 19A: 27-43.

Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29: 170-179.

Kluge, A. G., and J. S. Farris. 1969. Quantitative phyletics and the evolution of anurans. Systematic Zoology 18: 1-32.

Lake, J. A. 1987. A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Molecular Biology and Evolution 4: 167-191.

Le Quesne, W. J. 1969. A method of selection of characters in numerical taxonomy. Systematic Zoology 18: 201-205.

Le Quesne, W. J. 1974. The uniquely evolved character concept and its cladistic application. Systematic Zoology 23: 513-517.

Lewis, H. R., and C. H. Papadimitriou. 1978. The efficiency of algorithms. Scientific American 238: 96-109 (January issue).

Luckow, M. and D. Pimentel. 1985.
An empirical comparison of numerical Wagner computer programs. Cladistics 1: 47-66.

Lynch, M. 1990. Methods for the analysis of comparative data in evolutionary biology. Evolution 45: 1065-1080.

Maddison, D. R. 1991. The discovery and importance of multiple islands of most-parsimonious trees. Systematic Zoology 40: 315-328.

Margush, T. and F. R. McMorris. 1981. Consensus n-trees. Bulletin of Mathematical Biology 43: 239-244.

Nelson, G. 1979. Cladistic analysis and synthesis: principles and definitions, with a historical note on Adanson's Familles des Plantes (1763-1764). Systematic Zoology 28: 1-21.

Nei, M. 1972. Genetic distance between populations. American Naturalist 106: 283-292.

Nei, M. and W.-H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences, USA 76: 5269-5273.

Page, R. D. M. 1989. Comments on component-compatibility in historical biogeography. Cladistics 5: 167-182.

Platnick, N. 1987. An empirical comparison of microcomputer parsimony programs. Cladistics 3: 121-144.

Platnick, N. 1989. An empirical comparison of microcomputer parsimony programs. II. Cladistics 5: 145-161.

Reynolds, J. B., B. S. Weir, and C. C. Cockerham. 1983. Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105: 767-779.

Rohlf, F. J. and M. C. Wooten. 1988. Evaluation of the restricted maximum likelihood method for estimating phylogenetic trees using simulated allele-frequency data. Evolution 42: 581-595.

Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425.

Sanderson, M. J. 1990. Flexible phylogeny reconstruction: a review of phylogenetic inference packages using parsimony. Systematic Zoology 39: 414-420.

Sankoff, D. D., C. Morel, and R. J. Cedergren. 1973. Evolution of 5S RNA and the nonrandomness of base replacement. Nature New Biology 245: 232-234.
Smouse, P. E. and W.-H. Li. 1987. Likelihood analysis of mitochondrial restriction-cleavage patterns for the human-chimpanzee-gorilla trichotomy. Evolution 41: 1162-1176.

Sober, E. 1983a. Parsimony in systematics: philosophical issues. Annual Review of Ecology and Systematics 14: 335-357.

Sober, E. 1983b. A likelihood justification of parsimony. Cladistics 1: 209-233.

Sober, E. 1988. Reconstructing the Past: Parsimony, Evolution, and Inference. MIT Press, Cambridge, Massachusetts.

Sokal, R. R., and P. H. A. Sneath. 1963. Principles of Numerical Taxonomy. W. H. Freeman, San Francisco.

Studier, J. A. and K. J. Keppler. 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution 5: 729-731.

Swofford, D. L. and G. J. Olsen. 1990. Phylogeny reconstruction. Chapter 11, pages 411-501 in Molecular Systematics, ed. D. M. Hillis and C. Moritz. Sinauer Associates, Sunderland, Massachusetts.

Templeton, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37: 221-244.

Thompson, E. A. 1975. Human Evolutionary Trees. Cambridge University Press, Cambridge.

Wu, C. F. J. 1986. Jackknife, bootstrap and other resampling plans in regression analysis. Annals of Statistics 14: 1261-1295.

CREDITS

Over the years various granting agencies have contributed to support of the PHYLIP project (at first without knowing it). They are:

Years        Agency                         Grant or Contract Number

1995-1999    NIH NIGMS                      1 R01 GM51929-01
1992-1995    National Science Foundation    DEB-9207558
1992-1994    NIH NIGMS Shannon Award        2 R55 GM41716-04
1989-1992    NIH NIGMS                      1 R01-GM41716-01
1990-1992    National Science Foundation    BSR-8918333
1987-1990    National Science Foundation    BSR-8614807
1979-1987    U.S. Department of Energy      DE-AM06-76RLO2225 TA DE-AT06-76EV71005

I am particularly grateful to program administrators William Moore, Irene Eckstrand, Peter Arzberger, and Conrad Istock, who have gone beyond the call of duty to make sure that PHYLIP continued.

Booby prizes for funding are awarded to:

(1) The people at the U.S. Department of Energy who, in 1987, decided they were "not interested in phylogenies",

(2) The members of the Systematics Panel of NSF who twice (in 1989 and 1992) positively recommended that my applications NOT be funded. I am very grateful to program director William Moore for courageously overruling their decision the first time. The current (1992) Systematics Panel can claim no credit for PHYLIP whatsoever.

(3) The members of the 1992 Genetics Study Section of NIH who rated my proposal in the 53rd percentile (I don't know if that's 53rd from the top or the bottom, but does it matter?), thus denying it funding. I am, however, grateful to the NIGMS administrators who supported giving me a "Shannon award" partially funding my work for a period in spite of this rating.

The original Camin-Sokal parsimony program and the polymorphism parsimony program were written by me in 1977 and 1978. They were Pascal versions of earlier FORTRAN programs I wrote in 1966 and 1967 using the same algorithm to infer phylogenies under the Camin-Sokal and polymorphism parsimony criteria. Harvey Motulsky worked for me as a programmer in 1971 and wrote FORTRAN programs to carry out the Camin-Sokal, Dollo, and polymorphism methods. But most of the work on PHYLIP other than my own was by Jerry Shurman and Mark Moehring. Jerry Shurman worked for me in the summers of 1979 and 1980, and Mark Moehring worked for me in the summers of 1980 and 1981. Both wrote original versions of many of the other programs, based on the original versions of my Camin-Sokal parsimony program and POLYM.
These formed the basis of Version 1 of the Package, first distributed in October, 1980. Version 2, released in the spring of 1982, involved a fairly complete rewrite by me of many of those programs. Jerry and Mark are not to be held responsible for problems arising from use of these programs.

Hisashi Horino has for version 3.3 reworked some parts of the programs CLIQUE and CONSENSE to make their output more comprehensible, and has added some code to the tree-drawing programs DRAWGRAM and DRAWTREE as well.

My part-time programmers Akiko Fuseki, Sean Lamont and Andrew Keeffe gave me substantial help with the current release, and their excellent work is greatly appreciated. Akiko in particular did much of the hard work of adding new features and changing old ones in the 3.4 and 3.5 releases, and Andrew prepared the Macintosh version, wrote RETREE, and added the raytracing and PICT code to the DRAW programs. Sean was central to the conversion to C, and tested it extensively. My postdoctoral fellow Mary Kuhner and her associate Jon Yamato created NEIGHBOR, the neighbor-joining and UPGMA program, for the current release, for which I am also grateful (Naruya Saitou kindly encouraged us to use some of the code from his own implementation of this method).

I am very grateful to many users for algorithmic suggestions, complaints about features (or lack of features), and information about the behavior of their operating systems and compilers.
Among these are: Jim Archie, Mary Barkworth, Vincent Bauchau, Bernard Baum, Mary Berbee, Biff Bermingham, Yves Bertheau, Pierre Boursot, Tom Bruns, Tsan Iang Chuang, Stephen Clark, Bruce Cochrane, Joel Cracraft, Ross Crozier, Mark Dalton, Dan Davison, Ron DeBry, Allen Delaney, Terry Delaney, John Devereux, Tod Distotell, John Doebley, Ken Dodds, Jim Doyle, Guy Drouin, Shan Duncan, Tom Duncan, Robert Eaglen, Scott Edwards, Willem Ellis, Ted Emigh, John Endler, Laurent Excoffier, James Farmer, David Featherston, Kent Fiala, Tim Flannery, Vera Ford, Kurt Fristrup, Douglas Futuyma, Michael Garrick, Don Gilbert, John Gillespie, Nick Goldman, Timothy Goldsmith, Rees Griffiths, George Gutman, Linda Hardison, Gene Hart, Masami Hasegawa, Bill Hatheway, David Hillis, Richard Holliday, Eddie Holmes, Kent Holsinger, Dan Hough, Richard Jensen, Bo Johansson, Quentin Kay, Steve Kelem, Kim Cheol-Min, Joseph H. Kirkbride, John Kirsch, Andrew Knight, Dennis Knudson, Mary Kuhner, Jan Kwiatowski, John LaDuke, Lionel Landry, Franz Lang, Niels Larsen, Jerry Learn, Rev. Arthur Lee, Pierre Legendre, Jack A.M. Leunissen, Andrew Lloyd, Wolfgang Ludwig, David Maddison, Wayne Maddison, George McKay, Brian McMahon, Christopher Meacham, Brook Milligan, Sanzo Miyazawa, Janice Moore, Susumu Nakayama, Jean-Marc Neuhaus, Haolin Ni, Dan Nickrent, Trang Nguyen, Cary O'Donnell, Steve O'Kane, Gary Olsen, John Olsen, Steve O'Neill, Greg Orloff, Pekka Pamilo, David Penny, Norman Platnick, Mark Ragan, Neil Rawlings, Tom Ritch, Alistair Robertson, Joseph R. Rohrer, Naruya Saitou, Kay Schneitz, Paul Sharp, Arend Sidow, Hans Siegismund, Chuck Smart, Douglas Smith, Dave Spencer, Lisa Steiner, Per Sundberg, Susan Swensen, David Swofford, John Sved, Naoko Takezaki, Eric Taylor, Jeff Thorne, Clive Trotman, John Turnbull, Hans Ullitz-Moeller, Michael Vodkin, Carl Wadsworth, Ryk Ward, Daniel Weeks, Loni West, George D.F. Wilson, Thomas K. Wilson, M. Zandee, Eric Zurcher.

My apologies to anyone who has accidentally been left out of this list. Keep making suggestions and you will get on eventually.
A growing contribution to this package has been made by others writing programs or parts of programs. Chris Meacham contributed the important program FACTOR, long demanded by users, and the even more important ones PLOTREE and PLOTGRAM. Important parts of the code in DRAWGRAM and DRAWTREE were taken over from those two programs. He is thus mostly to blame for all problems with these programs. Kent Fiala wrote PROCEDURE reroot to do outgroup-rooting, which was an essential part of many programs in earlier versions. Someone at the Western Australia Institute of Technology suggested the name PHYLIP (by writing it on a magnetic tape as the tape label), but they all seem to deny having done so (and I've lost the relevant letter).

Arend Sidow contributed makeinf.c to the Unsupported Division of this release, and Masami Hasegawa and Jun Adachi contributed ProtML.pas. Their generosity is much appreciated.

The distribution of the package also owes much to Buz Wilson and Willem Ellis, who have put a lot of effort into the past distribution of the PCDOS and Macintosh versions respectively. Christopher Meacham and Tom Duncan for three versions distributed a printed version of these documentation files (they are no longer able to do so), and I am very grateful to them for those efforts. William H.E. Day and F. James Rohlf have been very helpful in setting up the listserver news bulletin service.

I also wish to thank the people who have made computer resources available to me, mostly in the loan of use of microcomputers. These include Jeremy Field, Clem Furlong, Rick Garber, Dan Jacobson, Rochelle Kochin, Monty Slatkin, Jim Archie, Jim Thomas, and George Gilchrist.
I should also acknowledge the computers used to develop this package: These include a CDC 6400, two DECSystem 1090s, my trusty old SOL-20, my old Osborne-1, a VAX 11/780, a VAX 8600, my old MicroVAX I, my old DECstation 3100, my old Toshiba 1100+, and my present mainstays, a DECstation 5000/200, a DECstation 5000/125, a Compudyne 486DX/33, a Trinity Genesis 386SX, a Zenith Z386 and a Mac Classic. (One of the reasons we have been successful in achieving compatibility between different computer systems is that I have had to run them myself under so many different operating systems and compilers).

OTHER PHYLOGENY PROGRAMS AVAILABLE ELSEWHERE

Here are some of the other phylogeny packages that I know about. Some of them are available over Internet from ftp server machines, or by World Wide Web. If you are on Internet you should familiarize yourself with the server machines (see entries 22 and 23 below for more information).

Another major list of phylogeny software is being compiled by David Maddison and Wayne Maddison as part of their "Tree of Life" project on the World Wide Web. Its URL is: http://phylogeny.arizona.edu/tree/programs/programs.html It is still very incomplete as of this writing but may be more up-to-date than this listing can be.

The programs listed below include both free and non-free ones; in some cases I do not know whether a program is free. I have listed as free those that I knew were free; for the others you have to ask their distributor. The list starts with programs and packages to estimate phylogenies, continues with alignment-and-phylogeny programs, and ends with programs to do other phylogeny-related tasks.

1. David Swofford of the Laboratory of Molecular Systematics, National Museum of Natural History, Smithsonian Institution, Washington, D.C. has written PAUP (which originally meant Phylogenetic Analysis Using Parsimony). Version 3.0 was available for Macintoshes.
It is currently not available, but a new version, to be called PAUP*, will be released by Sinauer Associates, of Sunderland, Massachusetts, in late 1995 or early 1996. It will have Macintosh, DOS, and Unix versions. It will include parsimony, distance matrix, invariants, and maximum likelihood methods. PAUP 3.0 was probably the most sophisticated parsimony program, with many options and close compatibility with MacClade (for which see below). The new program will become much broader with the inclusion of more methods. The price will be in the vicinity of $100 US. Sinauer Associates' e-mail address is biology@sinauer.com.

2. If you have a Macintosh computer and any interest in discrete-state parsimony methods (including DNA and protein parsimony), you should definitely get MacClade. It was written by Wayne Maddison and David Maddison of the University of Arizona. All distribution is by Sinauer Associates, Sunderland, Massachusetts 01375, USA. Their phone number is: (413) 665 3722, FAX: (413) 665 7292. A disk with program, help file, and example data files, plus book (which has about 100 pages of intro to phylogenetic theory, and 250 pages of program instructions), is $75 U.S. ($40 for the book alone). Site licenses are also available. An earlier and less capable Version 2 (which for example cannot read nucleic acid sequences and has fewer features for discrete characters) is also available by anonymous ftp from the EMBL, Indiana and Houston molecular biology software servers. Their addresses are given below under the descriptions of TreeAlign and ClustalW. MacClade 2.1 will be found among their Mac software, as a squeezed and then binhexed file. MacClade enables you to use the mouse-window interface to specify and rearrange phylogenies by hand, and watch the number of character steps and the distribution of states of a given character on the tree change as you do so.
MacClade is positively addictive and will give you a much better feel for the tree and your data. It's the closest thing to a phylogeny video game that I have seen. It has been influential in spurring the inclusion of interaction and graphics into other phylogeny programs. (I have tried to supply this functionality in PHYLIP by incorporating the programs MOVE, DOLMOVE, and DNAMOVE, which act somewhat like MacClade). MacClade does not have a sophisticated search algorithm to find best trees: it largely relies on you to do it by hand (which is surprisingly effective), with only a local rearrangement algorithm available to improve on that tree.

3. J. S. Farris has produced Hennig86, a fast parsimony program including branch-and-bound search for most parsimonious trees and interactive tree rearrangement. Although complete benchmarks have not been published it is said to be faster than Swofford's PAUP; both are a great many times faster than the parsimony programs in PHYLIP. The program is distributed in executable object code only and costs $50, plus $5 mailing costs ($10 outside of the U.S.). The user's name should be stated, as copies are personalized as a copy-protection measure. It is distributed by Arnold Kluge, Amphibians and Reptiles, Museum of Zoology, University of Michigan, Ann Arbor, Michigan 48109-1079, U.S.A. (Arnold.G.Kluge@um.cc.umich.edu) and by Diana Lipscomb at George Washington University (BIODL@gwuvm.gwu.edu). It runs on PC-compatible microcomputers with at least 512K of RAM and needs no math coprocessor or graphics monitor. It can handle up to 180 taxa and 999 characters.

4. Mark Siddall, of the Virginia Institute of Marine Sciences (mes@vims.edu) has released Random Cladistics, a set of programs that can carry out bootstrapping, jackknifing, and a variety of kinds of permutation tests, using Hennig86 to analyze the data. To use it you must have a copy of Hennig86 (for whose distribution see above).
Random Cladistics will carry out the appropriate transformations of your data and will call Hennig86 and have it analyze them, and then it will summarize the results. Random Cladistics is available free by anonymous ftp from zoo.utoronto.ca in directory "pub" (files random.doc and random.exe).

5. J. S. Farris has recently released RNA (Rapid Nucleotide Analysis). It features rapid bootstrapping. It is available from Arnold Kluge, Amphibians and Reptiles, Museum of Zoology, University of Michigan, Ann Arbor, Michigan 48109-1079, U.S.A. (Arnold.G.Kluge@um.cc.umich.edu) and Diana Lipscomb at George Washington University (BIODL@gwuvm.gwu.edu) who may be contacted for details. The cost is said to be about $30 US.

6. ClaDOS, an interactive program which allows rearrangement of trees and their evaluation, mapping of characters into them, and more, is available for DOS systems from Kevin Nixon, L. H. Bailey Hortorium, Cornell University, 467 Mann Library, Ithaca, New York 14853. Rumor has it that the cost is in the vicinity of $55 US.

7. MEGA (Molecular Evolutionary Genetic Analysis) has been released by Sudhir Kumar, Koichiro Tamura, and Masatoshi Nei of the Institute of Molecular Evolutionary Genetics, 328 Mueller Lab, Pennsylvania State University, University Park, Pennsylvania 16802, U.S.A. It is an executable program for DOS machines, and is menu-driven with context-sensitive help. It will also run under Windows in a DOS Window. It will analyze data from DNA, RNA and protein sequences, and distance matrices produced from other kinds of data as well. It will include the Neighbor-Joining distance matrix method, a branch and bound parsimony method, and bootstrapping. It will also plot trees on many kinds of printers. The program costs $15 (for the documentation). Inquiries can also be made by mail to Joyce White at the above address or by electronic mail to imeg@psuvm.psu.edu.

8.
Yves van de Peer of the University of Antwerp (yvdp@reks.uia.ac.be) has developed TREECON 3.0, a program package for analysis of molecular data sets. It is menu driven and runs on 386 (and higher) DOS systems, and also on Windows systems. It carries out inference of phylogenies by distance matrix methods, with bootstrapping and a program to draw the trees. It is written in C and is available free by anonymous ftp from uiam3.uia.ac.be. It was described in CABIOS 9: 177-182 (1993). A fee is asked to defray expenses. For information or ordering contact Van de Peer at the above e-mail address or at the Department of Biochemistry, University of Antwerp (UIA), Universiteitsplein 1, B-2610 Antwerpen, BELGIUM.

9. Jun Adachi and Masami Hasegawa have written a package MOLPHY 2.2, carrying out maximum likelihood inference of phylogenies for either nucleotide sequences or protein sequences. Their protein sequence maximum likelihood program, ProtML, is a successor to the one they made available to me for distribution on a nonsupported basis in PHYLIP, and is much improved over that. It is the best protein maximum likelihood program available. The package is distributed free in C source code, with documentation, by ftp from sunmh.ism.ac.jp.

10. Gary Olsen, of the Department of Microbiology, University of Illinois, has developed a speeded-up version of my program DNAML coded in C, called "fastDNAml". It achieves a number of economies and also is organized so that it can be run on parallel processors -- he and his co-workers have constructed trees of very large size on a high-speed parallel processor. The program can be compiled using the "p4" portable parallel processing toolkit. It can also be run in ordinary serial mode on workstations where it is faster than DNAML. The C program is available by anonymous ftp from the Ribosomal Database Project at info.mcs.anl.gov in directory pub/RDP/programs/fastDNAml.

11.
Ziheng Yang of the Institute of Molecular Evolutionary Genetics at Pennsylvania State University (who is soon to be moving to the Department of Integrative Biology, University of California, Berkeley), (yang@imeg.bio.psu.edu) has released PAML 1.0, a program for the maximum likelihood analysis of nucleotide or protein sequences (including Hidden Markov Model analysis like the features we have in DNAML). It is available as C source code for Unix systems, and is free by anonymous ftp from the molecular biology software servers. It will be found on ftp.bio.indiana.edu, for example, in directory molbio/evolve.

12. Pablo Goloboff, of the American Museum of Natural History (goloboff@amnh.org), distributes PEEWEE and NONA, to carry out weighted parsimony analyses. The programs run on DOS with versions available for both 386-486-Pentium machines and earlier 16-bit machines. Goloboff's address is Dept. of Entomology, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024. His telephone number is 212 769 5619, and fax number is 212 769 5277.

13. Yasuo Ina of the National Institute of Genetics, Mishima, Japan (yina@ddbj.nig.ac.jp) has developed ODEN, a package of programs for doing distance matrix analyses on nucleotide or protein sequences. It is described in CABIOS 10: 11-12 (1994). It is available free by anonymous ftp from directory pub/oden in bioslave.uio.no as C source code for Unix systems.

14. A. Luettke and R. Fuchs have written MacT, a package of programs for Macintoshes that compute distances and compute Neighbor-Joining phylogenies for them. The programs work on 4 through 26 sequences, and source code in Microsoft QuickBasic is provided as well as compiled executables. The package is free and is available on the molecular biology software servers. On ftp.bio.indiana.edu it will be found in directory molbio/mac. The programs are described in CABIOS 8: 591-594, 1992.

15. Andrey A.
Zharkikh, Andrey Rzhetsky, and co-workers in the Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia, Ex-USSR, have produced VOSTORG, a package of programs for alignment (both manual and automatic) and inferring phylogenies by distance methods and parsimony for molecular sequences. It runs on IBM PC-compatibles and includes some rather fancy graphics. The authors are currently in the U.S., not in Siberia. A version of the program is available free by anonymous ftp from gsbs18.gs.uth.tmc.edu in directory pub/zharkikh/vostorg. The programs are described in a paper by Zharkikh et al. in Gene 101: 251-254 (1991).

16. Rainer Wetzel and Daniel Huson have developed a Macintosh program for carrying out the "split decomposition" method of A. Bandelt and A. Dress (Molecular Phylogenetics 1: 242-252 (1992)). Contact huson@mathematik.uni-bielefeld.de for details.

17. James Lake distributes "Evomony", a program for using the "evolutionary parsimony" (invariants) method for inferring phylogenies from DNA or RNA sequences. It runs on 286 or higher DOS systems with at least 500k bytes of memory. A Macintosh version was also contemplated. I do not know what the current distribution arrangements are. Lake's address is Department of Biology, University of California, Los Angeles, California 90024.

18. Walter Fitch (Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92717, U.S.A.) has a package "Molevol" available free (on receipt of an appropriate number of PCDOS formatted floppy disks) with about 20 FORTRAN programs for not only estimating trees by parsimony and distance methods but doing various other manipulations of data that might be needed such as format interconversions and searching for homology and secondary structure. They are available as FORTRAN source and/or as PCDOS executables. The FORTRAN programs will also run on Sun workstations (and probably others too, I would suspect).
His electronic mail address is wfitch@daedalus.bio.uci.edu.

19. Pierre Roux and Tim Littlejohn of the Informatics Division of the Organelle Genome Megasequencing Program at the Universite de Montreal have made available PARBOOT, a program that takes bootstrap sampled data sets and splits them up, submitting each to a different computer, so as to run bootstrapping quickly on networks of computers. It is available free as C source code by ftp from megasun.bch.umontreal.ca in directory pub/parboot. It requires a networked system of computers with PHYLIP, a "perl" interpreter, and appropriate accounts and permissions.

20. Andrey Zharkikh of the Genetics Centers at the University of Texas Health Sciences Center in Houston has programs for bootstrapping of nucleotide sequences, including his innovative double-bootstrap method for getting less biased P values. They are available free by anonymous ftp at gsbs18.gs.uth.tmc.edu/pub/zharkikh/bootstrap or gsbs18.gs.uth.tmc.edu/pub/zharkikh/bootstrap/double-bootstrap. The programs njbootjc, njbootk2, and njbootli implement methods based on Jukes-Cantor, Kimura, and Li distances, respectively.

21. David Penny (Department of Botany and Zoology, Massey University, Palmerston North, New Zealand) has been offering for free distribution several PCDOS programs, one a fast parsimony program, TurboTree. There are also two others, Hadtree, which computes expected frequencies of all distributions of nucleotides among species, and Great Deluge, an approximate search for the most parsimonious tree by a quasi-random method. He tells me that funding exigencies are such that he may soon have to start charging for these. His electronic mail address is dpenny@massey.ac.nz.

22. Jotun Hein, (Institute of Genetics and Ecology, University of Aarhus, 8000 Aarhus C, Denmark) has produced TreeAlign, a multiple sequence alignment program that builds trees as it aligns DNA or protein sequences.
It uses a combination of distance matrix and approximate parsimony methods. TreeAlign uses too much memory for it to run on PC's (DOS or Mac systems) but is really designed for a workstation or mainframe. It is available by anonymous ftp at the Indiana, Houston, and EMBL molecular biology software distribution sites. Their network addresses are respectively: ftp.bio.indiana.edu, ftp.bchs.uh.edu, and ftp.ebi.ac.uk. In the Indiana archive one must enter directory molbio/align; in the Houston archive it is in directory pub/gene-server in the directories unix and vms. If you are on Internet and use molecular data it is important that you learn to use anonymous ftp and become familiar with these ftp servers.

23. Another multisequence alignment program that estimates trees as it aligns multiple sequences is ClustalW. Currently it is distributed as C source code, and in Macintosh and DOS executables by its author, Desmond Higgins. He is at the European Bioinformatics Institute in Cambridge, England. ClustalW successfully compiles and runs on many different workstations. DOS, Mac, and PowerMac executables are also available. It is a complete rewrite and upgrade of the Clustal and ClustalV packages; the first was described by Higgins and Sharp (1989). New features include the ability to detect and read different input formats (NBRF/PIR, Fasta, EMBL/Swissprot); align old alignments; produce phylogenetic trees after alignment (Neighbor Joining trees with a bootstrap option); write different alignment formats (Clustal, NBRF/PIR, GCG, PHYLIP); full command line interface. The program is available by anonymous ftp at the Indiana, Houston, and EMBL molecular biology distribution sites. Their network addresses are respectively: ftp.bio.indiana.edu, ftp.bchs.uh.edu, and ftp.ebi.ac.uk.
In the Indiana archive one must enter directory molbio/align; in the Houston archive it is in directory pub/gene-server in all of the four directories dos, Mac, unix, and vms (I do not know exactly where it is in the EBI machine). If you are on Internet and use molecular data it is important that you learn to use anonymous ftp and become familiar with one or more of these ftp servers.

24. Ward Wheeler and David Gladstein have written MALIGN, a parsimony-based alignment program for molecular sequences. It implements the original suggestion by Sankoff, Morel, and Cedergren (1973) that alignment and phylogenies could be done at the same time by finding that tree that minimizes the total alignment score along the tree. Jotun Hein's program TreeAlign (mentioned above) is another, more approximate but probably faster, attempt to implement the Sankoff-Morel-Cedergren suggestion. MALIGN is available from Ward Wheeler at the American Museum of Natural History in New York City. His email address is wheeler@amnh.org. It comes in DOS, Mac and SUN versions.

25. Rod Page has written COMPONENT, a program for PCDOS systems for comparing cladograms for use in phylogeny and biogeography studies. It has many tree comparison and consensus methods, and far more features for biogeographic studies (such as comparing species and area cladograms) than any other package. It runs on PCDOS 286 or 386 systems under Windows 3.0 or higher. Its cost is 40 pounds U.K., and it can be ordered from Liz Timpson at the Department of Botany, Natural History Museum, London (emt@nhm.ic.ac.uk). Rod's e-mail address is rod.page@zoology.oxford.ac.uk. There is a review of the program in Cladistics 9: 351-353 (1993). COMPONENT has a World Wide Web site: http://evolve.zps.ox.ac.uk/Rod/cpw.html which includes an order form.

26. Andrew Purvis and Andrew Rambaut of the Department of Zoology, University of Oxford, England, have written CAIC (Comparative Analysis of Independent Contrasts).
It is a Macintosh program that carries out the contrasts method (like my CONTRAST) but with some modifications by others to cope with lack of resolution of the phylogeny. It is available free by anonymous ftp from directory packages/CAIC at evolve.zps.ox.ac.uk. It is described in CABIOS 11: 247-251 (1995).

27. Joaquin Dopazo at the Centro Nacional de Biotecnologia in Madrid, Spain, has written a program ABLE (Analysis of Branch Length Errors) which implements the method described by Adell and Dopazo in J. Mol. Evol. 38:305-309 (1994). This is a form of the parametric bootstrap. It makes use of PHYLIP. It is available as a DOS executable over World Wide Web at http://www.cnb.uam.es/www/ximo or by anonymous ftp at: ftp.cnb.uam.es in directory software/molevol.

28. Kent Fiala, now of SAS Institute, has written a compatibility (clique) program, based on an earlier program written by Kent and George Estabrook. Christopher Meacham has put the latest version of CLINCH (6.2), with Kent's permission, as a self-extracting DOS archive available free on Jim Beach's TAXACOM fileserver, muse.bio.cornell.edu. CLINCH 6.2 and associated files can be found by anonymous ftp in /pub/software/clinch as clinch62.exe, which is a self-extracting archive. Documentation, sample input and output, and FORTRAN source code are included. PC-CLINCH is probably the most sophisticated compatibility analysis program. The Taxacom server, by the way, also has other material related to botanical systematics, including flora information.

29. Christopher Meacham (Museum Informatics Project, University of California, Berkeley, California 94720, U.S.A.) produces COMPROB, a Pascal program to compute probabilities that characters would be compatible at random, thus telling us which clique is "most surprising". He can be contacted as meacham@violet.berkeley.edu about receiving a copy. The program is free.

30. The program MARKOV computes a distance measure between pairs of nucleotide sequences.
It also constructs phylogenies from these and summarizes the 4x4 substitution matrices between the pairs of species. It uses a more general model of substitution than used in PHYLIP, the Stationary Markov Model described in the paper by Saccone et al. in Methods in Enzymology volume 183, pages 570-583, 1990. Bootstrapping is used to analyze the statistical error of the results. Output files from CLUSTAL and PILEUP, as well as some other formats, can be used for input, and analysis can be confined to certain codon positions in coding sequences. The program is written in FORTRAN and runs on VMS and Unix systems. It was produced by Dr. Graziano Pesole and Professor Cecilia Saccone at the University of Bari, Italy, and is available (for free?) from Dr. Cecilia Lanave at CSMME-CNR, Dipartimento di Biochimica e Biologia Molecolare, Universita` di Bari, via Orabona 4, 70126 Bari, Italy. Her phone number is 39-80-243305, her fax number is 39-80-243317, and her e-mail address is lanave@vaxba0.ba.it or mvx36@ibacsata.it.

31. J. S. Armstrong, A. J. Gibbs, R. Peakall and G. Weiller, of Australian National University, Canberra, have produced RAPDistance, a package for DOS and (presumably) Windows systems for computing distance matrices for RAPD analyses, for use in various phylogeny programs. RAPDistance is available free by anonymous ftp from directory pub/RAPDistance at life.anu.edu.au, or on the World Wide Web at http://life.anu.edu.au/molecular/software/rapd.html.

32. P. R. Reeves and colleagues at Sydney University, Australia, have produced MULTICOMP, a program for computing various distances from sequence data. It is described in a paper by Reeves et al. in CABIOS 10: 281-284 (1994). I do not know what computer systems it runs on. Reeves may be contacted at reeves@angis.su.oz.au for distribution information.

33.
Ken Rice of the Department of Organismal and Evolutionary Biology of Harvard University has produced RSVP (restriction site variability program) which calculates several measures of genetic variability based on restriction map data. It also produces Jukes-Cantor corrected distance matrices with standard errors from collections of restriction maps. C source code for Version 2.08 of RSVP is available free by anonymous ftp from: oeb.harvard.edu/rice or you can get it on WWW from: http://oeb.harvard.edu/~rice. It runs under Unix.

34. J. S. Farris and Mary Mickevich earlier released a package of phylogeny programs, PHYSYS, which, at about $5,000, was extremely expensive (in my opinion, which is certainly a biased one). I am not sure whether, from whom, or under what conditions it is still available.

35. Fujitsu Ltd. ("a $21 billion global leader in advanced computer, telecommunications, and electronic devices") sells for $28,000 US a Fujitsu S family workstation complete with a program, SINCAIDEN, which allows "experimental researchers, even those unfamiliar with such analyses, [to] easily create phylogenetic trees in their own laboratories." The program also allows searches of the major nucleic acid sequence and protein databases (the ad I saw does not make it clear whether these databases are provided with the workstation). The methods available are UPGMA, neighbor-joining, Farris's (Distance Wagner) and the modified Farris distance matrix methods. The workstation is SPARC compatible and runs SunOS. The SINCAIDEN program was developed by the group at the National Institute of Genetics, Japan under Dr. Takashi Gojobori. Fujitsu Ltd. may be contacted at 21-8, Nishi-Shinbashi 3-chome, Minato-ku, Tokyo 105, Japan (phone 81-3-3437-5111 ext. 2831, fax 81-3-5472-4354), or in the U.S. at Fujitsu America Inc., 3055 Orchard Drive, San Jose, California 95134-2017 (phone 1-408-432-1300 ext. 5168, fax 1-408-434-1045).

36.
MUST, a package of sequence management programs, is distributed on a shareware basis by Herve Phillippe, Laboratoire de Biologie Cellulaire (URA CNRS 1134 D), Batiment 444, Universite de Paris-Sud, 91405 Orsay cedex, France. His e-mail address is: adoutte@frciti51 on Bitnet/EARN. His phone and fax numbers are respectively 33.1.69.41.64.81 and 33.1.69.41.21.30. MUST is available on a shareware basis ($100 registration fee if you do not send diskettes) and runs on DOS systems using DOS version 3 or later. It is intended as complementary to existing phylogeny and alignment programs and can produce output files in the formats of PHYLIP, PAUP, Hennig86, and CLUSTAL. It contains a variety of sequence input, editing, checking, and storage functions, as well as a sequence editor and a phylogeny plotter. It also allows further analyses of the results from these phylogeny programs.

37. Steve Smith, formerly of the Harvard Genome Laboratory, has written an X-Windows interactive sequence editor, GDE (Genetic Data Environment) which allows the user to edit sequences and align them by hand, and to select subsets of sites and sequences and call a variety of analysis programs including ClustalV and many of the PHYLIP 3.5 programs. The GDE 2.0 system will run on many workstations that have the X windowing system. It also includes the TreeTool tree-plotting program (see below). GDE 2.0 is free and is available for anonymous ftp transfer at the molecular biology software servers, such as ftp.bio.indiana.edu in directory molbio/unix/GDE, or at megasun.bch.umontreal.ca in directory pub/gde. At the latter location there are also Linux binaries, and at both there are Sun binaries.

38. Mike Maciukenas, at the Department of Microbiology of the University of Illinois, has written a wonderful X-windows based interactive tree-plotting program called TreeTool.
It takes as input a PHYLIP tree file, with branch lengths if they are provided, displays the tree in either rooted or unrooted form on any X-windows screen, and allows the user to modify the form of the tree and the placement of nodes and labels. When the tree is in final form the user can have it written to a Postscript file and/or printed to a Postscript-compatible printer. TreeTool is free as a C program for X windows and is available for anonymous ftp from ftp.bio.indiana.edu in directory molbio/unix/GDE. It is also included in the GDE 2.0 sequence analysis environment mentioned above.

39. Manolo Gouy of the University of Lyon, France, has produced NJplot, which displays phylogenies (input in the standard form) on Macintosh screens and saves them in PICT files. It is available free and can be retrieved by anonymous ftp from molecular biology software servers such as the European Bioinformatics Institute's server, ftp.ebi.ac.uk, where it is in directory pub/software/mac.

HOW YOU CAN HELP ME

Simply let me know of any problems you have had adapting the programs to your computer. I can often make "transparent" changes that, by making the code avoid the wilder, woolier, and less standard parts of C, not only help others who have your machine but even improve the chance of the programs functioning on new machines. I would like fairly detailed information on what gave trouble, on what operating system, compiler and machine, and what had to be done to make it work. I will be pleased to help do some over-the-telephone trouble-shooting, particularly if I don't pay for the call. Electronic mail is a particularly convenient way for me to be asked about problems, as you can include your input and output files so I can see what is going on.
I'd really like these programs to be able to run with only routine changes on ABSOLUTELY EVERYTHING, down to and possibly including the Amana Touchmatic Radarange Microwave Oven (which is an Intel 8080 system -- early versions of this package did run successfully on Intel 8080 systems).

I would also like to know timings of programs from the package, when run on the three test input files provided above, for various computer and compiler combinations, so that I can provide this information in the section on speeds of this document. For the phylogeny plotting programs DRAWGRAM and DRAWTREE, Chris Meacham and I are particularly interested in knowing what has to be done to adapt them for other common plotters, laser printers, and dot matrix printers.

You can also be helpful to PHYLIP users in your part of the world by giving them the latest version of PHYLIP and helping them with any problems they may have in getting PHYLIP working on their data.

Your help is appreciated. I am always happy to hear suggestions for features and programs that ought to be incorporated in the package, but please do not be upset if I turn out to have already considered the particular possibility you suggest and decided against it.

I would also like to know of any applications of PHYLIP that get published: I would appreciate receiving a reprint of any paper reporting work that used PHYLIP.

IN CASE OF TROUBLE

READ THE (DOCUMENTATION) FILES METICULOUSLY ("RTFM"). If that doesn't solve the problem, get in touch with me. I am on electronic mail at the addresses given below. If you do ask about a problem, please specify the program name, version of the package, computer and compiler, and be prepared to send me your data file so I can test the problem. It also helps to have the relevant input, output, and documentation files nearby so that we can refer to them. I can also be reached by calling me in my office: (206)-543-0150, or at home: (206)-526-9057 (how's THAT for user support!).
If I cannot be reached at either place, a message can be left at the office of the Department of Genetics, (206)-543-1657, but I strongly prefer that I not call you, as in any phone consultation the least you can do is pay the phone bill.

Particularly if you are in a part of the world distant from me, you may also want to try to get in touch with other users of PHYLIP nearby. I can also, if requested, provide a list of nearby users.

Joe Felsenstein
Department of Genetics
University of Washington
Box 357360
Seattle, Washington 98195-7360, U.S.A.

Electronic mail addresses (I prefer that you use the Internet address if possible):

joe@genetics.washington.edu
joe@evolution.genetics.washington.edu
joe@128.95.12.41