PHYLIP (Phylogeny Inference Package) Version 3.57c
by Joseph Felsenstein
July, 1995
COPYRIGHT NOTICE
(c) Copyright 1986-1995 by Joseph Felsenstein and the University of
Washington.
Permission is granted to copy this document provided that no fee is
charged for it and that this copyright notice is not removed.
CONTENTS OF THIS DOCUMENT
Copyright notice
Contents of this document
General description of PHYLIP
Contents of this package
What the programs do
Overview of the input and output formats
Input File Format
The Options Menu
The Output File
The Tree File
The Options and How to Invoke Them
Options Information in the Input File
Common Options in the Menu
The U (User Tree) option
The G (Global) option
The J (Jumble) option
The O (Outgroup) option
The T (Threshold) option
The M (multiple data sets) option
The option to write out the trees into a tree file
The (0) terminal type option
Common Options Requiring Information in the Input File
The Weights option
The Algorithm for Constructing Trees
Local Rearrangements
Global Rearrangements
Multiple Jumbles
Strategy for Finding the Best Tree
A Warning on Interpreting Results
Relative Speed of Different Programs and Machines
Relative speed of the different programs
Speed with different numbers of species
Relative speed of different machines
Published benchmarks
Endorsements
General Comments on Adapting the Package to Different Computer Systems
Compiling the programs
Using "make"
Getting PHYLIP onto your microcomputer
Microsoft Quick C and Microsoft C
Turbo C++ for PCDOS
Waterloo C/386
Think C for Macintosh
Unix
VMS VAX systems
OpenVMS DEC Alpha systems
Cray
IBM mainframes running CMS
Other Computer Systems
Frequently Asked Questions
"If I copied PHYLIP from a friend without you knowing, ...?"
"How do I make a citation to the PHYLIP package ...?"
"How do I bootstrap? Why has DNABOOT disappeared?"
"How do I specify a multi-species outgroup? ..."
"How do I force certain groups to remain monophyletic ...?"
"How can I reroot one of the trees written out by PHYLIP?"
"Why doesn't NEIGHBOR read my DNA sequences correctly?"
"What do I do about deletions and insertions in my sequences?"
"Why don't your parsimony programs print out branch lengths?"
"Why can't your programs handle unordered multistate characters?"
"Where can I get a printed version of the PHYLIP documents"
"Why have I been dropped from your newsletter mailing list?"
"How many copies of PHYLIP have been distributed?"
Additional Frequently Asked Questions, or:
"Why didn't it occur to you to..."
write these programs in Pascal?"
forget about all those inferior systems and just develop PHYLIP for Unix?"
write these programs in PROLOG (or Ada, or Modula-2, or Simula, or ...)?"
include in the package a program to do the Distance Wagner method ... ?
include in the package ordination methods and more clustering algorithms?"
include in the package a program to do nucleotide sequence alignment ...?"
send me the programs over the electronic network I use, BUTTERFLYNET?"
let me log in to your computer in Seattle and copy the files ....?"
send me a listing of your program?"
write a magnetic tape in our computer center's favorite format ....?"
give us a version of these in FORTRAN?"
New Features in Recent Versions
Coming Attractions, Future Plans
References for the Documentation Files
Credits
Other phylogeny programs available elsewhere
PAUP
MacClade
Hennig86
Random Cladistics
RNA
ClaDOS
MEGA
TREECON
MOLPHY
fastDNAml
PAML
PEEWEE/NONA
ODEN
MacT
Vostorg
Wetzel/Huson programs
Evomony
Molevol
PARBOOT
Zharkikh programs
Turbotree/Hadtree
TreeAlign
ClustalW
MALIGN
COMPONENT
CAIC
ABLE
CLINCH
COMPROB
MARKOV
RAPDistance
MULTICOMP
RSVP
PHYSYS
SINCAIDEN
MUST
GDE
TreeTool
NJPlot
How You Can Help Me
In case of trouble
PHYLIP - Phylogeny Inference Package (version 3.5)
This is a FREE package of programs for inferring phylogenies and carrying
out certain related tasks. At present it contains 30 programs, which carry
out different algorithms on different kinds of data. The programs in the
package are:
---------- Programs for molecular sequence data ----------
   PROTPARS   Protein parsimony
   DNAPARS    Parsimony method for DNA
   DNAMOVE    Interactive DNA parsimony
   DNAPENNY   Branch and bound for DNA
   DNACOMP    Compatibility for DNA
   DNAINVAR   Phylogenetic invariants
   DNAML      Maximum likelihood method
   DNAMLK     DNA ML with molecular clock
   DNADIST    Distances from sequences
   PROTDIST   Distances from proteins
   RESTML     ML for restriction sites
   SEQBOOT    Bootstraps sequence data sets
---------- Programs for distance matrix data ----------
   FITCH      Fitch-Margoliash and least-squares methods
   KITSCH     Fitch-Margoliash and least squares methods with evolutionary
              clock
   NEIGHBOR   Neighbor-joining and UPGMA methods
---------- Programs for gene frequencies and continuous characters ----------
   CONTML     Maximum likelihood method
   GENDIST    Computes genetic distances
   CONTRAST   Computes contrasts and correlations for comparative method
              studies
---------- Programs for 0-1 discrete state data ----------
   MIX        Wagner, Camin-Sokal, and mixed parsimony criteria
   MOVE       Interactive Wagner, C-S, mixed parsimony program
   PENNY      Finds all most parsimonious trees by branch-and-bound
   DOLLOP, DOLMOVE, DOLPENNY
              same as preceding three programs, but for the Dollo and
              polymorphism parsimony criteria
   CLIQUE     Compatibility method
   FACTOR     Recode multistate characters
---------- Programs for plotting trees and consensus trees ----------
   DRAWGRAM   Draws cladograms and phenograms on screens, plotters and
              printers
   DRAWTREE   Draws unrooted phylogenies on screens, plotters and printers
   CONSENSE   Majority-rule and strict consensus trees
   RETREE     Reroots, changes names and branch lengths, and flips trees
There is also an Unsupported Division containing two programs, makeinf and
ProtML, which were contributed by others and are maintained by their
authors.
The package includes extensive documentation files that provide the
information necessary to use and modify the programs.
The programs are written in a very standard subset of C, a language that is
available on most computers (including microcomputers). The programs
require no modifications to run on most machines: for example they work
without modification with Microsoft C, Turbo C, Think C, and on the C
compilers available on Unix and VAX VMS systems. C source code is
distributed in the regular version of PHYLIP. To use it, you must have a C
compiler. A Pascal version can also be supplied on request. Precompiled
executables are available for PCDOS, 386 PCDOS, 386 Windows, PowerMacs, and
Macintoshes as described below.
NETWORK DISTRIBUTION: The package is available by "anonymous ftp" over
electronic networks (including the PCDOS, 386 PCDOS, 386 Windows, and
Macintosh executables) from evolution.genetics.washington.edu
(128.95.12.41). Contact me by electronic mail for details or start by
fetching file pub/phylip/Read.Me. European users may (or may not) get
faster service from bioss.sari.ac.uk, which mirrors our distribution. Look
in directory pub/phylogeny. I can also send the source code and
documentation files (but not executables) over Bitnet/EARN and other
networks. The easiest method of network distribution is to use our World
Wide Web site:
http://evolution.genetics.washington.edu/phylip.html
DISKETTE DISTRIBUTION: The package is also distributed in a variety of
microcomputer diskette formats. You should send FORMATTED diskettes, which
I will return with the package written on them. See below for how many
diskettes to send. The source code of the programs on the electronic
network or magnetic tape versions may of course also be moved to
microcomputers and compiled there.
PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS, 386
Windows, 386 PCDOS, and Macintosh systems are available from me. Specify
the "386 Windows executable version", "386 PCDOS executable version",
"PCDOS executable version" or "Macintosh executable version" and send the
number of diskettes indicated below. Source code sent will be in C unless
you specify Pascal.
HOW MANY DISKETTES TO SEND: The following table shows for different formats
how many diskettes to send, and how many extra diskettes to send for the
executable version:

   Diskette size     Density        For source code     For executables send
                                    and documentation   in addition
   3.5 inch PCDOS    1.44 Mb              1                   3
   5.25 inch PCDOS   1.2 Mb               1                   3
   Macintosh         High density         1                   1
Some other formats are also available. You MUST tell me EXACTLY which of
these formats you need. The diskettes MUST be formatted by you before being
sent to me. Sending an extra diskette may be helpful.
POLICIES: The package is distributed free. It will be written on the
diskettes or tape, which will be mailed back. They can be sent to:

   Joe Felsenstein
   Department of Genetics
   University of Washington
   Box 357360
   Seattle, Washington 98195-7360, U.S.A.

Electronic mail addresses: joe@genetics.washington.edu
CONTENTS OF THIS PACKAGE
The source code and documentation of the package consist of 87 files, plus
4 more for the programs in the Unsupported Division. In the electronic mail
version some of these files may be split into parts, so there may be more.
The package is organized into three major parts, the source code, the
documentation, and the unsupported programs. The documentation is organized
hierarchically, with groups of documentation files for different kinds of
data each preceded by a documentation file for the group as well. The
"unsupported division" of PHYLIP contains programs contributed by others
(and not supported by us) that we feel may be of use to you.
      Files              Contents
      -----              --------
   1  README             -- describes the contents of the package
   2  main.doc           -- this general documentation file

 The Source code

   3  Makefile           -- the "Makefile" to be used by C's that have "make"
   4  Makefile.qc        -- the Makefile for Microsoft C and Quick C
   5  Makefile.tc        -- the Makefile for Borland Turbo C and Borland C
   6  phylip.h           -- the PHYLIP "header file"
   7  compile.com        -- a VMS command file to compile all of PHYLIP
   8  vaxfix.c           -- procedures needed to fix VMS printf(" %hd ")
   9  protpars.c         -- parsimony for protein sequence data
  10  dnapars.c          -- DNA parsimony program
  11  dnamove.c          -- interactive DNA parsimony
  12  dnapenny.c         -- branch and bound method for DNA
  13  dnacomp.c          -- DNA compatibility program
  14  dnainvar.c         -- computation of Lake's and Cavender's invariants
  15  dnaml.c            -- DNA maximum likelihood program, part 1
  16  dnaml2.c           -- DNA maximum likelihood program, part 2
  17  dnamlk.c           -- DNA maximum likelihood with molecular clock
  18  dnamlk2.c          -- DNA maximum likelihood with clock, part 2
  19  dnadist.c          -- computes distance matrix from sequences
  20  protdist.c         -- computes distance matrix from sequences
  21  restml.c           -- maximum likelihood for restriction sites
  22  restml2.c          -- maximum likelihood for restriction sites, part 2
  23  seqboot.c          -- makes multiple data sets by bootstrap resampling
  24  fitch.c            -- Fitch-Margoliash and least-squares methods
  25  kitsch.c           -- F-M, L-S methods with evolutionary clock
  26  neighbor.c         -- neighbor-joining and UPGMA methods
  27  contml.c           -- maximum likelihood program
  28  gendist.c          -- computes genetic distances
  29  contrast.c         -- contrasts etc. for comparative method studies
  30  mix.c              -- Wagner, Camin-Sokal parsimony and mixtures, part 1
  31  mix2.c             -- Wagner, Camin-Sokal parsimony and mixtures, part 2
  32  move.c             -- interactive Wagner, Camin-Sokal and mixed parsimony
  33  penny.c            -- finds all most parsimonious trees
  34  dollop.c           -- Dollo and polymorphism parsimony methods
  35  dolmove.c          -- interactive Dollo and polymorphism parsimony
  36  dolpenny.c         -- branch and bound for Dollo, polymorphism
  37  clique.c           -- compatibility program
  38  factor.c           -- recode multistate to binary characters
  39  drawgraphics.h     -- header file for drawgraphics.c
  40  drawgraphics.c     -- routines used in both drawgram.c and drawtree.c
  41  interface.h        -- header for Mac interface
  42  interface.c        -- Mac routines used in Mac interface
  43  drawgram.c         -- makes plots of cladograms, phenograms
  44  drawtree.c         -- makes plots of unrooted phylogenies
  45  font1              -- digitized font (simple sans-serif Roman)
  46  font2              -- digitized font (medium quality sans-serif Roman)
  47  font3              -- digitized font (high quality serifed Roman)
  48  font4              -- digitized font (medium quality sans-serif Italic)
  49  font5              -- digitized font (high quality serifed Italic)
  50  font6              -- digitized font (Russian Cyrillic)
  51  consense.c         -- majority-rule and strict consensus trees
  52  retree.c           -- reroots, rearranges and changes lengths on trees

 The Documentation

  53  sequence.doc       -- documentation for molecular sequence programs
  54  protpars.doc       -- documentation for protpars.c
  55  dnapars.doc        -- documentation for dnapars.c
  56  dnamove.doc        -- documentation for dnamove.c
  57  dnapenny.doc       -- documentation for dnapenny.c
  58  dnacomp.doc        -- documentation for dnacomp.c
  59  dnainvar.doc       -- documentation for dnainvar.c
  60  dnaml.doc          -- documentation for dnaml.c and dnaml2.c
  61  dnamlk.doc         -- documentation for dnamlk.c and dnamlk2.c
  62  dnadist.doc        -- documentation for dnadist.c
  63  protdist.doc       -- documentation for protdist.c
  64  restml.doc         -- documentation for restml.c and restml2.c
  65  seqboot.doc        -- documentation for seqboot.c
  66  distance.doc       -- documentation for distance matrix programs
  67  fitch.doc          -- documentation for fitch.c
  68  kitsch.doc         -- documentation for kitsch.c
  69  neighbor.doc       -- documentation for neighbor.c
  70  contchar.doc       -- documentation for gene frequency
                            and continuous character programs
  71  contml.doc         -- documentation for contml.c
  72  gendist.doc        -- documentation for gendist.c
  73  contrast.doc       -- documentation for contrast.c
  74  discrete.doc       -- documentation for discrete character programs
  75  mix.doc            -- documentation for mix.c
  76  move.doc           -- documentation for move.c
  77  penny.doc          -- documentation for penny.c
  78  dollop.doc         -- documentation for dollop.c
  79  dolmove.doc        -- documentation for dolmove.c
  80  dolpenny.doc       -- documentation for dolpenny.c
  81  clique.doc         -- documentation for clique.c
  82  factor.doc         -- documentation for factor.c
  83  draw.doc           -- documentation for tree plotting programs
  84  drawgram.doc       -- documentation for drawgram.c
  85  drawtree.doc       -- documentation for drawtree.c
  86  consense.doc       -- documentation for consense.c
  87  retree.doc         -- documentation for retree.c

 The Unsupported Division

  88  makeinf.doc        -- documentation for makeinf (by Arend Sidow)
  89  makeinf.c          -- C source for makeinf
  90  protml.doc         -- documentation for ProtML (by Adachi and Hasegawa)
  91  protml.pas         -- Pascal source for ProtML
WHAT THE PROGRAMS DO
Here is a short description of each of the programs. For more detailed
discussion you should definitely read the documentation file for the
individual program and the documentation file for the group of programs it
is in.
PROTPARS. Estimates phylogenies from protein sequences (input using the
   standard one-letter code for amino acids) using the parsimony method, in
   a variant which counts only those nucleotide changes that change the
   amino acid, on the assumption that silent changes are more easily
   accomplished.
DNAPARS. Estimates phylogenies by the parsimony method using nucleic acid
   sequences. Allows use of the full IUB ambiguity codes, and estimates
   ancestral nucleotide states. Gaps are treated as a fifth nucleotide
   state.
DNAMOVE. Interactive construction of phylogenies from nucleic acid
   sequences, with their evaluation by parsimony and compatibility and the
   display of reconstructed ancestral bases. This can be used to find
   parsimony or compatibility estimates by hand.
DNAPENNY. Finds all most parsimonious phylogenies for nucleic acid
   sequences by branch-and-bound search. This may not be practical
   (depending on the data) for more than 10 or 11 species.
DNACOMP. Estimates phylogenies from nucleic acid sequence data using the
   compatibility criterion, which searches for the largest number of sites
   which could have all states (nucleotides) uniquely evolved on the same
   tree. Compatibility is particularly appropriate when sites vary greatly
   in their rates of evolution, but we do not know in advance which are the
   less reliable ones.
DNAINVAR. For nucleic acid sequence data on four species, computes Lake's
   and Cavender's phylogenetic invariants, which test alternative tree
   topologies. The program also tabulates the frequencies of occurrence of
   the different nucleotide patterns. Lake's invariants are the method
   which he calls "evolutionary parsimony".
DNAML. Estimates phylogenies from nucleotide sequences by maximum
   likelihood. The model employed allows for unequal expected frequencies
   of the four nucleotides, for unequal rates of transitions and
   transversions, and for different (prespecified) rates of change in
   different categories of sites, with the program inferring which sites
   have which rates.
DNAMLK. Same as DNAML but assumes a molecular clock. The use of the two
   programs together permits a likelihood ratio test of the molecular clock
   hypothesis to be made.
DNADIST. Computes four different distances between species from nucleic
   acid sequences. The distances can then be used in the distance matrix
   programs. The distances are the Jukes-Cantor formula, one based on
   Kimura's 2-parameter method, Jin and Nei's distance which allows for
   rate variation from site to site, and a maximum likelihood method using
   the model employed in DNAML. The last of these methods of computing
   distances can be very slow.
PROTDIST. Computes a distance measure for protein sequences, using maximum
   likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983
   approximation to it, or a model based on the genetic code plus a
   constraint on changing to a different category of amino acid. The
   distances can then be used in the distance matrix programs.
RESTML. Estimation of phylogenies by maximum likelihood using restriction
   sites data (not restriction fragments but presence/absence of individual
   sites). It employs the Jukes-Cantor symmetrical model of nucleotide
   change, which does not allow for differences of rate between transitions
   and transversions. This program is VERY slow.
SEQBOOT. Reads in a data set, and produces multiple data sets from it by
   bootstrap resampling. Since most programs in the current version of the
   package allow processing of multiple data sets, this can be used
   together with the consensus tree program CONSENSE to do bootstrap (or
   delete-half-jackknife) analyses with most of the methods in this
   package. This program also allows the Archie/Faith technique of
   permutation of species within characters.
FITCH. Estimates phylogenies from distance matrix data under the "additive
   tree model" according to which the distances are expected to equal the
   sums of branch lengths between the species. Uses the Fitch-Margoliash
   criterion and some related least squares criteria. Does not assume an
   evolutionary clock. This program will be useful with distances computed
   from DNA sequences, with DNA hybridization measurements, and with
   genetic distances computed from gene frequencies.
KITSCH. Estimates phylogenies from distance matrix data under the
   "ultrametric" model which is the same as the additive tree model except
   that an evolutionary clock is assumed. The Fitch-Margoliash criterion
   and other least squares criteria are assumed. This program will be
   useful with distances computed from DNA sequences, with DNA
   hybridization measurements, and with genetic distances computed from
   gene frequencies.
NEIGHBOR. An implementation by Mary Kuhner and John Yamato of Saitou and
   Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage
   clustering) method. Neighbor Joining is a distance matrix method
   producing an unrooted tree without the assumption of a clock. UPGMA does
   assume a clock. The branch lengths are not optimized by the least
   squares criterion but the methods are very fast and thus can handle much
   larger data sets.
CONTML. Estimates phylogenies from gene frequency data by maximum
   likelihood under a model in which all divergence is due to genetic drift
   in the absence of new mutations. Does not assume a molecular clock. An
   alternative method of analyzing this data is to compute Nei's genetic
   distance and use one of the distance matrix programs.
GENDIST. Computes one of three different genetic distance formulas from
   gene frequency data. The formulas are Nei's genetic distance, the
   Cavalli-Sforza chord measure, and the genetic distance of Reynolds et
   al. The former is appropriate for data in which new mutations occur in
   an infinite isoalleles neutral mutation model, the latter two for a
   model without mutation and with pure genetic drift. The distances are
   written to a file in a format appropriate for input to the distance
   matrix programs.
CONTRAST. Reads a tree from a tree file, and a data set with continuous
   characters data, and produces the independent contrasts for those
   characters, for use in any multivariate statistics package. Will also
   produce covariances, regressions and correlations between characters for
   those contrasts.
MIX. Estimates phylogenies by some parsimony methods for discrete character
   data with two states (0 and 1). Allows use of the Wagner parsimony
   method, the Camin-Sokal parsimony method, or arbitrary mixtures of
   these. Also reconstructs ancestral states and allows weighting of
   characters.
MOVE. Interactive construction of phylogenies from discrete character data
   with two states (0 and 1). Evaluates parsimony and compatibility
   criteria for those phylogenies and displays reconstructed states
   throughout the tree. This can be used to find parsimony or compatibility
   estimates by hand.
PENNY. Finds all most parsimonious phylogenies for discrete-character data
   with two states, for the Wagner, Camin-Sokal, and mixed parsimony
   criteria using the branch-and-bound method of exact search. May be
   impractical (depending on the data) for more than 10-11 species.
DOLLOP. Estimates phylogenies by the Dollo or polymorphism parsimony
   criteria for discrete character data with two states (0 and 1). Also
   reconstructs ancestral states and allows weighting of characters. Dollo
   parsimony is particularly appropriate for restriction sites data; with
   ancestor states specified as unknown it may be appropriate for
   restriction fragments data.
DOLMOVE. Interactive construction of phylogenies from discrete character
   data with two states (0 and 1) using the Dollo or polymorphism parsimony
   criteria. Evaluates parsimony and compatibility criteria for those
   phylogenies and displays reconstructed states throughout the tree. This
   can be used to find parsimony or compatibility estimates by hand.
DOLPENNY. Finds all most parsimonious phylogenies for discrete-character
   data with two states, for the Dollo or polymorphism parsimony criteria
   using the branch-and-bound method of exact search. May be impractical
   (depending on the data) for more than 10-11 species.
CLIQUE. Finds the largest clique of mutually compatible characters, and the
   phylogeny which they recommend, for discrete character data with two
   states. The largest clique (or all cliques within a given size range of
   the largest one) is found by a very fast branch and bound search method.
   The method does not allow for missing data. For such cases the T
   (Threshold) option of MIX may be a useful alternative. Compatibility
   methods are particularly useful when some characters are of poor quality
   and the rest of good quality, but when it is not known in advance which
   ones are which.
FACTOR. Takes discrete multistate data with character state trees and
   produces the corresponding data set with two states (0 and 1). Written
   by Christopher Meacham.
DRAWGRAM. Plots rooted phylogenies, cladograms, and phenograms in a wide
   variety of user-controllable formats. The program is interactive and
   allows previewing of the tree on PC graphics screens, and Tektronix or
   DEC graphics terminals. Final output can be on a laser printer (such as
   the Apple Laserwriter or HP Laserjet), on graphics screens or terminals,
   on pen plotters (Hewlett-Packard or Houston Instruments) or on dot
   matrix printers capable of graphics (Epson, Okidata, Imagewriter, or
   Toshiba).
DRAWTREE. Similar to DRAWGRAM but plots unrooted phylogenies.
CONSENSE. Computes consensus trees by the majority-rule consensus tree
   method, which also allows one to easily find the strict consensus tree.
   Does NOT compute the Adams consensus tree. Trees are input in a tree
   file in standard nested-parenthesis notation, which is produced by many
   of the tree estimation programs in the package when the Y option is
   invoked. This program can be used as the final step in doing bootstrap
   analyses for many of the methods in the package.
RETREE. Reads in a tree (with branch lengths if necessary) and allows you
   to reroot the tree, to flip branches, to change species names and branch
   lengths, and then write the result out. Can be used to convert between
   rooted and unrooted trees.
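As an illustration of one of the quantities named in the DNADIST entry above, here is a minimal Python sketch of the Jukes-Cantor distance between two aligned sequences. It is not taken from the PHYLIP source: the function name `jukes_cantor_distance` is hypothetical, and the decision to skip any site that is not a plain A, C, G or T is an assumption of this sketch (DNADIST's own handling of gaps and ambiguity codes may well differ).

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p), where p is the
    fraction of compared sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption of this sketch: only unambiguous bases are compared.
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = differing / compared
    if p >= 0.75:
        # The formula has no finite value once p reaches 3/4.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The distance is always at least as large as the raw fraction of differing sites, because the formula corrects for multiple changes at the same site.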
Programs in the Unsupported Division
The Unsupported Division of PHYLIP consists of two programs contributed by
others that may be useful to you and have kindly been contributed by their
authors. Those authors retain full copyright to their programs and
documentation files. They are provided in the PHYLIP source code
distribution but have not been provided as executables in the executables
distribution. All questions about these programs should be directed to
their authors, whose electronic mail addresses and regular mail addresses
are given in their documentation files.
MAKEINF. This program by Arend Sidow can be used to translate the output
files from Jotun Hein's popular multiple-sequence alignment program into
PHYLIP input files. It also allows you to selectively analyze different
codon positions and different organisms. The output from other alignment
programs can rather easily be edited into a form that it will read.
PROTML. This large Pascal program from Jun Adachi and Masami Hasegawa
carries out maximum likelihood estimation of phylogenies from protein
sequence data. It is quite analogous to DNAML, but uses instead of a model
for DNA evolution the PAM matrix model of Margaret Dayhoff. Because of the
larger number of states (20 instead of 4) it is necessarily slower than
DNAML by a large factor. However the authors have adopted a different, and
faster, rearrangement strategy to search among tree topologies for the best
one. ProtML does not yet incorporate the Categories feature of DNAML and
DNAMLK which allows different rates of evolution at different sites,
without the user specifying in advance which site has which rate of
evolution. For support, contact them at the Internet addresses
hasegawa@ism.ac.jp and adachi@sunmh.ism.ac.jp at the Institute of
Statistical Mathematics, Tokyo, Japan.
OVERVIEW OF THE INPUT AND OUTPUT FORMATS
When you run most of these programs, a menu will appear offering you
choices of the various options available for that program. The data that
the program reads should be in an input file called (in most cases)
"infile". If there is no such file the programs will ask you for the name
of the input file. Below we describe the input file format, and then the
menu.
Input File Format
----- ---- ------

I have tried to adhere to a rather stereotyped input and output format.
For the parsimony, compatibility and maximum likelihood programs, excluding
the distance matrix methods, the simplest version of the input file looks
something like this:

    6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of species and the
number of characters, in free format, separated by blanks (not by commas).
The information for each species follows, starting with a ten-character
species name (which can include punctuation marks and blanks), and
continuing with the characters for that species. In the discrete-character,
DNA and protein sequence programs the characters are each a single letter
or digit, sometimes separated by blanks. In the continuous-characters
programs they are real numbers with decimal points, separated by blanks:

Latimeria   2.03   3.457   100.2   0.0   -3.7

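The layout just described (a free-format first line, then a fixed ten-character name followed by the data) can be sketched in a few lines of Python. This is only an illustration of the format as described here, not code from the package; the function names `parse_header`, `split_species_line` and `parse_continuous` are hypothetical, and ignoring anything after the first two numbers on the header line is an assumption of the sketch.

```python
NAME_WIDTH = 10  # species names occupy exactly the first ten columns


def parse_header(line):
    """Return (number of species, number of characters) from the first line.
    The numbers are in free format, separated by blanks."""
    fields = line.split()
    return int(fields[0]), int(fields[1])


def split_species_line(line):
    """Split a species line into its name and the remaining data text.
    The name is taken by column position, so it may contain blanks or
    punctuation and may run straight into the data."""
    name = line[:NAME_WIDTH].rstrip()
    return name, line[NAME_WIDTH:]


def parse_continuous(line):
    """Parse a species line from a continuous-characters data set:
    the data are real numbers separated by blanks."""
    name, rest = split_species_line(line)
    return name, [float(value) for value in rest.split()]
```

For instance, `split_species_line("HesperorniCGTTACTCGT TGT")` yields the name `Hesperorni` even though no blank separates it from the sequence, which is why the name has to be cut by column rather than by splitting on whitespace.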
The conventions about continuing the data beyond one line per species are
different between the molecular sequence programs and the others. The
molecular sequence programs can take the data in "aligned" or "interleaved"
format, with some lines giving the first part of each of the sequences,
then lines giving the next part of each, and so on. Thus the sequences
might look like this:

    6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC

Note that in these sequences we have a blank every ten sites to make them
easier to read: any such blanks are allowed. The blank line which separates
the two groups of lines (the ones containing sites 1-20 and ones containing
sites 21-39) may or may not be present, but if it is, it should be a line
of zero length and not contain any extra blank characters (this is because
of a limitation of the current versions of the programs). It is important
that the number of sites in each group be the same for all species (i.e.,
it will not be possible to run the programs successfully if the first
species line contains 20 bases, but the first line for the second species
contains 21 bases).
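To make the interleaved layout concrete, the following Python sketch reassembles each species' full sequence from the groups of lines. It is an illustration of the format as described above, not the reading routine used by the programs; `read_interleaved` is a hypothetical name. Unlike the programs, this sketch tolerates separator lines that contain stray blanks, but it does enforce the rule that every species must contribute the same number of sites in each group.

```python
EXAMPLE = """\
    6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC
"""


def read_interleaved(text, name_width=10):
    """Return {species name: full sequence} for interleaved sequence data."""
    lines = text.splitlines()
    n_species, n_sites = (int(x) for x in lines[0].split()[:2])
    # Drop the optional blank lines that separate groups.
    body = [line for line in lines[1:] if line.strip()]
    if len(body) % n_species != 0:
        raise ValueError("number of data lines is not a multiple of the species count")
    names = []
    seqs = [""] * n_species
    for start in range(0, len(body), n_species):
        for k, line in enumerate(body[start:start + n_species]):
            if start == 0:
                # Only the first group carries the ten-character names.
                names.append(line[:name_width].rstrip())
                line = line[name_width:]
            # Blanks inside the sequence are only there for readability.
            seqs[k] += "".join(line.split())
        if len({len(s) for s in seqs}) != 1:
            raise ValueError("species differ in the number of sites within a group")
    if len(seqs[0]) != n_sites:
        raise ValueError("number of sites read does not match the first line")
    return dict(zip(names, seqs))
```

Calling `read_interleaved(EXAMPLE)` gives six sequences of 39 sites each, with the second group of lines appended to the first for every species.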
Alternatively, an option can be selected to take the data in "sequential"
format, with all of the data for the first species, then all of the
characters for the next species, and so on. This is also the way that the
discrete characters programs and the gene frequencies and quantitative
characters programs want to read the data. They do not allow the
"interleaved" format.
In the sequential format, the character data can run on to a new line at
any time (except in a species name or in the case of continuous character
and distance matrix programs where you cannot go to a new line in the
middle of a real number). Thus it is legal to have:

Archaeopt 001100
1101

or even:

Archaeopt 
0011001101

though note that the FULL ten characters of the species name MUST then be
present: in the above case there must be a blank after the "t". In all
cases it is possible to put internal blanks between any of the character
values, so that

Archaeopt 0011001101 0111011100

is allowed.
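The sequential rules above can be sketched as follows for data in which each character is a single letter or digit. This is an assumed illustration rather than the programs' own input code; `read_sequential` is a hypothetical name, and the sketch does not handle continuous-character data, where values are blank-separated real numbers.

```python
def read_sequential(text, name_width=10):
    """Return {species name: characters} for sequential one-symbol data.
    Characters for a species may continue over any number of lines."""
    lines = text.splitlines()
    n_species, n_chars = (int(x) for x in lines[0].split()[:2])
    data = {}
    pos = 1
    for _ in range(n_species):
        if pos >= len(lines):
            raise ValueError("input ended before all species were read")
        line = lines[pos]
        pos += 1
        # The FULL ten-character name must be present on the name line.
        if len(line) < name_width:
            raise ValueError("species name shorter than %d characters" % name_width)
        name = line[:name_width].rstrip()
        chars = "".join(line[name_width:].split())
        # Keep reading lines until this species has all of its characters.
        while len(chars) < n_chars:
            if pos >= len(lines):
                raise ValueError("input ended in the middle of %r" % name)
            chars += "".join(lines[pos].split())
            pos += 1
        if len(chars) != n_chars:
            raise ValueError("too many characters for %r" % name)
        data[name] = chars
    return data
```

Because the reader counts characters rather than lines, a file that omits or adds a single character causes every later species to be read out of step, which is the kind of loss of synchronization that usually surfaces as an error reported further down the file.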
If you make an error in the input file, the programs will often detect that
they have been fed an illegal character or illegal numerical value and
issue an error message such as "BAD CHARACTER STATE:", often printing out
the bad value, and sometimes the number of the species and character in
which it occurred. The program will then stop shortly after. One of the
things which can lead to a bad value is the omission of something earlier
in the file, or the insertion of something superfluous, which cause the
reading of the file to get out of synchronization. The program then starts
reading things it didn't expect, and concludes that they are in error. So
if you see this error message, you may also want to look for the earlier
problem that may have led to this.
The other major variation on the input data format is the options
information. Many options are selected using the menu, but a few are
selected by including extra information in the input file. Some options are
described below.
The Options Menu
--- ------- ----

The menu is straightforward. It typically looks like this (this one is for
DNAPARS):

DNA parsimony algorithm, version 3.57c

Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

If you want to accept the default settings (they are shown in the above
case)
you can simply type "Y" followed by a carriage-return (Enter)
character. If
you want to change any of the options, you should type the letter shown
to the
left of its entry in the menu. For example, to set a threshold
type "T".
Lower-case letters will also work. For many of the options the program
will
ask for supplementary information, such as the value of the threshold.
Note the "Terminal type" entry, which you will find on all
menus.
It
allows you to specify which type of terminal your screen is. The
options are
an IBM PC screen, an ANSI standard terminal (such as a DEC VT100), a DEC
VT52-compatible terminal, such as a Zenith Z29, or no terminal type.
Choosing "0"
toggles among these four options in cyclical order, changing each time
the "0"
option is chosen. If one of them is right for your terminal the screen
will be
cleared before the menu is displayed. If none works the "none" option
should
probably be chosen. Keep in mind that VT-52 compatible terminals can
freeze up
if they receive the screen-clearing commands for the ANSI standard
terminal!
If this is a problem it may be helpful to recompile the program,
setting the
constants near its beginning so that the program starts up with the VT52
option
set.
The other numbered options control which information the program
will
display on your screen or on the output files.
The option to
"Print
indications of progress of run" will show information such as the names
of the
species as they are successively added to the tree, and the progress of
global
rearrangements. You will usually want to see these as reassurance
that the
program is running and to help you estimate how long it will take. But
if you
are running the program "in background" as can be done on
multitasking and
multiuser systems such as Unix, and do not have the program running in
its own
window, you may want to turn this option off so that it does not disturb
your
use of the computer while the program is running.
The Output File
--- ------ ----
Most of the programs write their output onto a file called
(usually)
"outfile", and a representation of the trees found onto a file
called
"treefile".
The exact contents of the output file vary from program to
program and
also depend on which menu options you have selected. For many programs,
if you
select all possible output information, the output will consist of (1)
the name
of the program and its version number, (2) the input information
printed out,
(3) a series of phylogenies, some with associated information
indicating how
much change there was in each character or on each part of the tree. A
typical
rooted tree looks like this:
                                       +-------------------Gibbon
          +----------------------------2
          !                            !      +------------------Orang
          !                            +------4
          !                                   !  +---------Gorilla
    +-----3                                   +--6
    !     !                                      !    +---------Chimp
    !     !                                      +----5
  --1     !                                           +-----Human
    !     !
    !     +-----------------------------------------------Mouse
    !
    +------------------------------------------------Bovine
The interpretation of the tree is fairly straightforward: it "grows" from
left
to right.
The numbers at the forks are arbitrary and are used (if
present)
merely to identify the forks. In some of the programs asterisks ("*")
are used
instead of numbers.
For many of the programs the tree produced is
unrooted.
It is printed out in nearly the same form, but with a warning message:
remember: this is an unrooted tree!
The warning message ("remember: ...") indicates that this is an unrooted
tree
(mathematicians still call this a tree, though some systematists
unfortunately
use the term "network". This conflicts with standard mathematical usage,
which
reserves the name "network" for a completely different kind of
graph). The
root of this tree could be anywhere, say on the line leading
immediately to
Mouse. As an exercise, see if you can tell whether the following tree is
or is
not a different one from the above:
               +-----------------------------------------------Mouse
               !
     +---------4                                   +------------------Orang
     !         !                            +------3
     !         !                            !      !       +---------Chimp
  ---6         +----------------------------1      !  +----2
     !                                      !      +--5    +-----Human
     !                                      !         !
     !                                      !         +---------Gorilla
     !                                      !
     !                                      +-------------------Gibbon
     !
     +-------------------------------------------Bovine
remember: this is an unrooted tree!
(it is NOT different). It is IMPORTANT also to realize that the lengths
of the
segments of the printed tree may not be significant: some may
actually
represent branches of zero length, in the sense that there is no evidence
that
the branches are nonzero in length. Some of the diagrams of trees
attempt to
print branches approximately proportional to estimated branch lengths,
while in
others the lengths are purely conventional and are presented just to
make the
topology visible. You will have to look closely at the documentation
that
accompanies each program to see what it presents and what is known
about the
lengths of the branches on the tree. The above tree attempts to
represent
branch lengths approximately in the diagram. But even in those cases,
some of
the smaller branches are likely to be artificially lengthened to make the
tree
topology clearer. Here is what a tree from DNAPARS looks like, when no
attempt
is made to make the lengths of branches in the diagram
proportional to
estimated branch lengths:
                 +--Human
              +--5
           +--4  +--Chimp
           !  !
        +--3  +-----Gorilla
        !  !
     +--2  +--------Orang
     !  !
  +--1  +-----------Gibbon
  !  !
--6  +--------------Mouse
  !
  +-----------------Bovine
remember: this is an unrooted tree!
Some of the parsimony programs in the package can print out a table
of the
number of steps that different characters (or sites) require on the tree.
This
table may not be obvious at first. A typical example looks like this:
 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1
The numbers across the top and down the side indicate which site is
being
referred to.
Thus site 23 is column "3" of row "20" and has 1 step
in this
case.
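The lookup rule for this table can be written as a short sketch. It assumes
the table has already been read into a dictionary keyed by the row label
(0, 10, 20, ...); the names steps_table and steps_at_site are illustrative
only and are not part of the package.

```python
# Hypothetical in-memory form of the "steps in each site" table above.
# Row 0 has no entry in column 0 because sites are numbered from 1.
steps_table = {
    0:  [None, 2, 2, 2, 2, 1, 1, 2, 2, 1],
    10: [1, 2, 3, 1, 1, 1, 1, 1, 1, 2],
    20: [1, 2, 2, 1, 2, 2, 1, 1, 1, 2],
    30: [1, 2, 1, 1, 1, 2, 1, 3, 1, 1],
    40: [1],
}


def steps_at_site(table, site):
    """Return the number of steps recorded for a 1-based site number."""
    tens, column = divmod(site, 10)   # site 23 -> row "20", column "3"
    return table[tens * 10][column]
```

Site 23 is found by splitting the number into its tens (the row) and its
last digit (the column).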
The Tree File
--- ---- ----
In output from most programs, a representation of the tree is also
written
into the tree file (usually named "treefile"). The tree is specified
by the
nested pairs of parentheses, enclosing names and separated by commas. If
there
are any blanks in the names, these must be replaced by the underscore
character
"_". Trailing blanks in the name may be omitted.
The pattern
of the
parentheses indicates the pattern of the tree by having each
pair of
parentheses enclose all the members of a monophyletic group. The tree
file for
the above tree would have its first line look like this:
((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon));
In the above tree the first fork separates the lineage leading to
Mouse and
Bovine from the lineage leading to the rest. Within the latter group
there is
a fork separating Gibbon from the rest, and so on. The entire tree is
enclosed
in an outermost pair of parentheses. The tree ends with a semicolon.
In some
programs such as DNAML, FITCH, and CONTML, the tree will be completely
unrooted
and specified by a bottommost fork with a three-way split, with
three
"monophyletic" groups separated by two commas:
(A,(B,(C,D)),(E,F));
The three "monophyletic" groups here are A, (B,C,D), and (E,F).
The
single
three-way split corresponds to one of the interior nodes of the
unrooted tree
(it can be any interior node). The remaining forks are encountered as
you move
out from that first node, and each then appears as a two-way split. You
should
check the documentation files for the particular programs you are using
to see
in which of these forms you can expect the user tree to be. Note
that many
of the programs that estimate an unrooted tree produce trees in the
treefile in
rooted form! This is done for reasons of arbitrary internal
bookkeeping. The
placement of the root is arbitrary.
For programs estimating branch lengths, these are given in the
trees in
the tree file as real numbers following a colon, and placed immediately
after
the group descended from that branch. Here is a typical tree with
branch
lengths:
((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);
Note that the tree may continue to a new line at any time except in the
middle
of a name or the middle of a branch length, although in trees written
to the
tree file this will only be done after a comma.
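As an illustration of how this notation can be read, here is a minimal
sketch of a parser for the subset of the format described above (names,
nested parentheses, optional branch lengths, final semicolon). It is not
the code used by the PHYLIP programs; the function names and the tuple
layout are assumptions made for this example, and it ignores anything
else a tree file might contain.

```python
def parse_newick(text):
    """Parse one tree; return nested (name, length, children) tuples.

    name is None for forks, length is None when no ":length" is given.
    """
    # A tree may be broken across lines, but never inside a name or a
    # branch length, so all white space can simply be dropped.
    s = "".join(text.split())
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def token():
        # Read characters up to the next structural character.
        nonlocal pos
        start = pos
        while peek() and peek() not in "(),:;":
            pos += 1
        return s[start:pos]

    def node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1
            children.append(node())
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
        # Underscores in the file stand for blanks in the species name.
        name = token().replace("_", " ") or None
        length = None
        if peek() == ":":
            pos += 1
            length = float(token())
        return (name, length, children)

    tree = node()
    if peek() != ";":
        raise ValueError("tree must end with a semicolon")
    return tree


def leaf_names(tree):
    """List the species names at the tips, left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]
```

Applied to the first tree shown in this section, leaf_names returns the
seven species in the order in which they are written.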
These representations of trees are a subset of the standard
adopted on
June 24, 1986 at the annual meetings of the Society for the Study of
Evolution
at a meeting (the final session in Newick's lobster restaurant -- hence its
name, the Newick standard) of an informal committee consisting of
Wayne
Maddison (MacClade), David Swofford (PAUP), F. James Rohlf (NTSYS-PC),
Chris
Meacham (COMPROB and plotting programs), James Archie (character
coding
program), William H.E. Day, and me.
This standard is a
generalization of
PHYLIP's format, itself based on a well-known representation of trees in
terms
of parenthesis patterns which has been around for almost a century.
The
standard is now employed by most phylogeny computer programs but
unfortunately
has yet to be described in a formal published description.
THE OPTIONS AND HOW TO INVOKE THEM
Most of the programs allow various options that alter the
amount of
information the program is provided or what it is to do with the
information.
Most options are selected in the menu. However a few are specified
in the
input file, or require part of their specification to be in the input
file.
Options Information in the Input File
------- ----------- -- --- ----- ----
In such cases, the program is notified that an option has been
invoked by
the presence of one or more letters after the last number on the first
line of
the input file. These letters may or may not be separated from each
other by
blanks, though it is usually necessary to separate them from the
number by a
blank. They can be in any order. Thus to invoke options A and W, the
input
file starts with the line:
    12   20 WA
or:
    12   20 A W
The options are described individually in the other documents of this
package.
For the options that require information to be in the input file,
additional
information must be provided. For all but one of these, this
information is
provided by placing a line after the first line of the file, but
before the
beginning of the species data. The first character of that line should
match
the option letter.
These auxiliary information lines can be in any
order.
Thus if options A and W are both invoked, both of the following formats
(and
two others as well) are legal:
    12   20 AW
A         0001111000
Weights   00112221A0
(then the species information)

    12   20 A W
Weights   00112221A0
A         0001111000
(then the species information)
One of the options requires special discussion. Many of the programs
have in
their menu the option U, which signals that one or more user-defined
trees is
to be provided for evaluation. This "user tree" is supplied in the input
file
(not the tree file), but AFTER the species data, rather than before it.
It does
not require any indication to be placed in the first line of the input
file, as
do the options that place information before the species data. After the
data,
there is a line containing the number of user-defined trees being
defined.
Each user-defined tree starts on a new line. It is in the same form
as the
trees in the tree files mentioned above, namely the New Hampshire
standard.
Here is an example with one user-defined tree:
    6   13
Archaeopt 0011001110000
Hesperorni0001101101101
Baluchithe1111011011101
B. virgini1111011101101
Brontosaur0110100111011
B.subtilis0000000011010
1
((B.subtilis,Baluchithe),((Brontosaur,B._virgini),
(Hesperorni,Archaeopt)));
In using the user tree option, check the pattern of parentheses
carefully.
The programs do not always detect whether the tree makes sense, and if
it does
not there will probably be a crash (hopefully, but not inevitably,
with an
error message indicating the nature of the problem).
Common Options in the Menu
------ ------- -- --- ----
Seven options from the menu, the U (User tree), G (Global), J
(Jumble), O
(Outgroup), T (Threshold), M (multiple data sets), and the tree output
options,
are used so widely that it is best to discuss them in this document.
(1) The U (User tree) option. This option toggles between the
default
setting, which allows the program to search for the best tree, and
the User
tree setting, which reads a tree or trees ("user trees") from the input
file
and evaluates them.
The user trees must follow the other information in the
data set, and be preceded by a line specifying the number of user trees
that are to be evaluated. Each user tree then is given in standard
form, each starting on a new line. The form that the user trees must
take is described in some detail above, under the description of the
tree file.
In some cases a program may require that the trees fed in be rooted
trees, even
though the program cannot infer the placement of the root. In those
cases you
can place the root anywhere. Program RETREE can be used to convert
between
rooted and unrooted trees.
(2) The G (Global) option. In the programs which construct trees
(except
for NEIGHBOR, the "...PENNY" programs and CLIQUE, and of course the
"...MOVE"
programs where you construct the trees yourself), after all species have
been
added to the tree a rearrangements phase ensues. In most of these
programs the
rearrangements are automatically global, which in this case means that
subtrees
will be removed from the tree and put back on in all possible ways so
as to
have a better chance of finding a better tree.
Since this can be
time
consuming (it roughly triples the time taken for a run) it is left as an
option
in some of the programs, specifically CONTML, FITCH, and DNAML.
In
these
programs the G menu option toggles between the default of local
rearrangement
and global rearrangement. The rearrangements are explained more below.
(3) The J (Jumble) option. In most of the tree construction
programs
(except for the "...PENNY" programs and CLIQUE), the exact details
of the
search of different trees depend on the order of input of species.
In
these
programs the J option enables you to tell the program to use a random
number
generator to choose the input order of species. This option is toggled
on and
off by selecting option J in the menu. The program will then prompt you
for a
"seed" for the random number generator. The seed should be an integer
between
1 and 32767, and should be of the form 4n+1, which means that it must give a
remainder
of 1 when divided by 4. This can be judged by looking at the last two
digits
of the number. Each different seed leads to a different sequence of
addition
of species. By simply changing the random number seed and rerunning the
programs one can look for other, and better trees. If the seed entered
is not
odd, the program will not proceed, but will prompt for another seed.
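The rule for an acceptable seed can be checked before starting a run. The
following is a minimal sketch of the rule as stated above, not the test the
programs themselves apply; the function name is hypothetical.

```python
def is_valid_jumble_seed(seed):
    """True if seed is an integer from 1 to 32767 of the form 4n+1."""
    return isinstance(seed, int) and 1 <= seed <= 32767 and seed % 4 == 1
```

Because 100 is divisible by 4, only the last two digits matter: 4321 is
acceptable because 21 leaves a remainder of 1 when divided by 4.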
The Jumble option also causes the program to ask you how many
times you
want to restart the process.
If you answer 10, the program will
try ten
different orders of species in constructing the trees, and the results
printed
out will reflect this entire search process (that is, the best trees
found
among all 10 runs will be printed out, not the best trees from each
individual
run).
(4) The O (Outgroup) option. This specifies which species is to be
used
to root the tree by having it become the outgroup. This option is
toggled on
and off by choosing O in the menu. When it is on, the program will then
prompt
for the number of the outgroup (the species being taken in the numerical
order
that they occur in the input file).
Responding by typing "6" and
then a
carriage-return (Enter) character indicates that the sixth species in
the data
is the outgroup. Outgroup-rooting will not be attempted if the data
have
already established a root for the tree from some other consideration,
and may
not be if it is a user-defined tree, despite your invoking the option.
Thus
programs such as DOLLOP that produce only rooted trees do not
allow the
Outgroup option. It is also not available in KITSCH, DNAMLK, or CLIQUE.
When
it is used, the tree as printed out is still listed as being an unrooted
tree,
though the outgroup is connected to the bottommost node so that it is
easy to
visually convert the tree into rooted form.
(5) The T (Threshold) option. This sets a threshold such that
if the
number of steps counted in a character is higher than the threshold, it
will be
taken to be the threshold value rather than the actual number of steps.
The
default is a threshold so high that it will never be surpassed. The
T menu
option toggles on and off asking the user to supply a threshold.
The
use of
thresholds to obtain methods intermediate between parsimony and
compatibility
methods is described in my 1981b paper. When the T option is in
force, the
program will prompt for the numerical threshold value. This will be a
positive
real number greater than 1. In programs MIX, MOVE, PENNY, PROTPARS,
DNAPARS,
DNAMOVE, and DNAPENNY, do not use threshold values less than or equal
to 1.0,
as they have no meaning and lead to a tree which depends only on
considerations
such as the input order of species and not at all on the character state
data!
In programs DOLLOP, DOLMOVE, and DOLPENNY the threshold should never be
0.0 or
less, for the same reason. The T option is an important and
underutilized one:
it is, for example, the only way in this package (except for program
DNACOMP)
to do a compatibility analysis when there are missing data.
It is a
method of
de-weighting characters that evolve rapidly. I wish more people were
aware of
its properties.
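The effect of the threshold on the quantity being minimized can be sketched
in a few lines. This only illustrates the rule stated above (a character
never counts for more than the threshold); the function name is
hypothetical and character weights are left out.

```python
def thresholded_total(steps_per_character, threshold):
    """Sum the steps, counting at most `threshold` for any one character."""
    return sum(min(steps, threshold) for steps in steps_per_character)
```

With steps of 1, 2, 5 and 8 and a threshold of 3, the two rapidly changing
characters each contribute 3 rather than 5 and 8.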
(6) The M (Multiple data sets) option. In menu programs there is
an M
menu option which allows one to toggle on the multiple data sets
option. The
program will ask you how many data sets it should expect. The data sets
have
the same format as the first data set. Here is a (very small) input
file with
two five-species data sets:
    5    6
Alpha     CCACCA
Beta      CCAAAA
Gamma     CAACCA
Delta     AACAAC
Epsilon   AACCCA
    5    6
Alpha     CACACA
Beta      CCAACC
Gamma     CAACAC
Delta     GCCTGG
Epsilon   TGCAAT
The main use of this option will be to allow all of the methods in
these
programs to be bootstrapped. Using the program SEQBOOT one can take
any DNA,
protein, restriction sites, or binary character data set and make
multiple data
sets by bootstrapping.
Trees can be produced for all of these using
the M
option. They will be written on the tree output file if that option is
left in
force.
Then the program CONSENSE can be used with that tree file as its
input
file. The result is a majority rule consensus tree which can be used to
make
confidence intervals. The present version of the package allows, with
the use
of SEQBOOT and CONSENSE and the M option, bootstrapping of many of the
methods
in the package.
(7) The option to write out the trees into a tree file.
This
specifies
that you want the program to write out the tree not only on its usual
output,
but also onto a file in nested-parenthesis notation (as described above).
This
option is sufficiently useful that it is turned on by default in all
programs
that allow it. You can optionally turn it off if you wish, by
typing the
appropriate number from the menu (it varies from program to program).
This
option is useful for creating tree files that can be directly read
into the
plotting programs, the consensus tree program, and can be incorporated
into the
input file to specify user-defined trees in many of the other programs.
(8) The (0) terminal type option. The program will default to one
particular assumption about your terminal (except in the case of
Macintoshes,
the default will be an ANSI compatible terminal). You can alternatively
select
it to be either an IBM PC, a DEC VT52, or nothing. This affects the
ability of
the programs to clear the screen when they display their menus,
and the
graphics characters used to display trees in the programs DNAMOVE,
MOVE,
DOLMOVE, and RETREE. If you are running a PCDOS system and have the
ANSI.SYS
driver installed in your CONFIG.SYS file, you may find that the screen
clears
correctly even with the default setting of ANSI.
Common Options Requiring Information in the Input File
------ ------- --------- ----------- -- --- ----- ----
There are a number of options (Ancestor, Factors, Categories and
Weights)
that are specified in the input file. Some of them must also be
selected in
the menu. Of these, the Ancestor and Factors options are specific
to the
Discrete Characters programs and are described in their group
document. The
Categories option is specific to some of the molecular sequence programs
and is
described in their group document. The Weights option is used
throughout the
package and is best introduced here.
This allows us to specify weights on the individual characters.
Weights
are invoked by placing a W on the first line of the file. The weights
are then
specified by a line or lines which start with W and then have enough
characters
or blanks to complete the full length of a species name. Then they
have a
single character (0-9 or A-Z) for each character. Thus they look like
the data
for a species:
Weights   0001111001112
or:
W         1110000ZZZZZ1
The weights cause a character to be counted as if it were n characters,
where n
is the weight.
The values 0-9 give weights 0 through 9, and the
values A-Z
give weights 10 through 35. By use of the weights we can give
overwhelming
weight to some characters, and drop others from the analysis. In the
molecular
sequence programs only two values of the weights, 0 or 1 are allowed.
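The mapping from weight symbols to numbers can be sketched as follows. This
is an illustration of the coding just described, not code from the package;
the names are hypothetical.

```python
import string

# "0".."9" stand for weights 0-9, "A".."Z" for weights 10-35.
WEIGHT_SYMBOLS = string.digits + string.ascii_uppercase


def decode_weights(symbols):
    """Convert a string of weight symbols into a list of integer weights."""
    weights = []
    for ch in symbols:
        value = WEIGHT_SYMBOLS.find(ch)
        if value < 0:
            raise ValueError("illegal weight symbol: %r" % ch)
        weights.append(value)
    return weights
```

For the second example above, the five Z symbols decode to the maximum
weight of 35.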
Weights can be used to analyze different subsets of
characters (by
weighting the rest as zero).
Alternatively, in the discrete
characters
programs they can be used to force a certain group to appear on the
phylogeny
(in effect confining consideration to only phylogenies containing that
group).
This is done by adding an imaginary character that has 1's for the
members of
the group, and 0's for all the other species. That imaginary character
is then
given the highest weight possible: the result will be that any phylogeny
that
does not contain that group will be penalized by such a heavy amount
that it
will not (except in the most unusual circumstances) be considered. Of
course,
the new character brings extra steps to the tree, but the number of
these can
be calculated in advance and subtracted out of the total when
reporting the
results.
This use of weights is an important one, and one sadly
ignored by
many users who could profit from it. In the case of molecular
sequences we
cannot use weights this way, so that to force a given group to appear
we have
to add a large extra segment of sites to the molecule, with (say) A's for
that
group and C's for every other species.
THE ALGORITHM FOR CONSTRUCTING TREES
All of the programs except FACTOR, DNADIST, GENDIST, DNAINVAR,
SEQBOOT,
CONTRAST, RETREE, and the plotting and consensus tree programs act to
construct
an estimate of a phylogeny. MOVE, DOLMOVE, and DNAMOVE let you
construct it
yourself by hand.
All of the rest but NEIGHBOR, the "...PENNY"
programs and
CLIQUE make use of a common approach involving additions and
rearrangements.
They are trying to minimize or maximize some quantity over the space
of all
possible evolutionary trees. Each program contains a part that,
given the
topology of the tree, evaluates the quantity that is being
minimized or
maximized. The straightforward approach would be to evaluate all
possible tree
topologies one after another and pick the one which, according to the
criterion
being used, is best. This would not be possible for more than a small
number
of species, since the number of possible tree topologies is enormous. A
review
of the literature on the counting of evolutionary trees will be found in one
of my
papers (Felsenstein, 1978a).
Since we cannot search all topologies, these programs are not
guaranteed
to always find the best tree, although they seem to do quite well in
practice.
The strategy they employ is as follows: the species are taken in the
order in
which they appear in the input file. The first two (in some programs the
first
three) are taken and a tree constructed containing only those. There is
only
one possible topology for this tree. Then the next species is taken,
and we
consider where it might be added to the tree. If the initial tree is
(say) a
rooted tree with two species and we want the resulting three-species tree
to be
a bifurcating tree, there are only three places where we could add the
third
species. Each of these is tried, and each time the resulting tree is
evaluated
according to the criterion. The best one is chosen to be the basis for
further
operations.
Now we consider adding the fourth species, again at each
of the
five possible places that would result in a bifurcating tree. Again, the
best
of these is accepted.
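The counts of three and five places follow a simple pattern: a rooted
bifurcating tree with k species has 2k-1 places where another species can
be attached. Multiplying these counts gives the total number of rooted
bifurcating topologies, which shows why exhaustive search is impractical.
The sketch below uses hypothetical function names.

```python
def places_to_add(species_in_tree):
    """Places where one more species can join a rooted bifurcating tree."""
    return 2 * species_in_tree - 1


def rooted_topologies(n_species):
    """Number of distinct rooted bifurcating topologies for n_species tips."""
    count = 1
    for k in range(2, n_species):
        count *= places_to_add(k)   # choices when adding species k+1
    return count
```

For ten species this product is already 34,459,425.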
Local Rearrangements
----- --------------
The process continues in this manner, with one important exception.
After
each species is added, and before the next is added, a number of
rearrangements
of the tree are tried, in an effort to improve it. The algorithms move
through
the tree, making all possible local rearrangements of the tree. A
local
rearrangement involves an internal segment of the tree in the following
manner.
Each internal segment of the tree is of this form (where T1, T2, and
T3 are
subtrees -- parts of the tree that can contain further forks and tips):
           T1       T2       T3
             \      /        /
              \    /        /
               \  /        /
                \/        /
                 *       /
                  *     /
                   *   /
                    * /
                     *
                     !
                     !
the segment we are discussing being indicated by the asterisks.
A
local
rearrangement consists of switching the subtrees T1 and T3 or T2 and T3,
so as
to obtain one of the following:
       T3       T2      T1             T1       T3      T2
        \       /       /               \       /       /
         \     /       /                 \     /       /
          \   /       /                   \   /       /
           \ /       /                     \ /       /
            \       /                       \       /
             \     /                         \     /
              \   /                           \   /
               \ /                             \ /
                !                               !
                !                               !
                !                               !
Each time a local rearrangement is successful in finding a better tree,
the new
arrangement is accepted. The phase of local rearrangements does not end
until
the program can traverse the entire tree, attempting local
rearrangements,
without finding any that improve the tree.
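The two alternatives produced by a local rearrangement can be written down
directly if the segment is represented as nested pairs. This is only a
sketch of the idea, with a hypothetical function name and tuple layout; the
programs use their own internal tree structures.

```python
def local_rearrangements(segment):
    """Given ((T1, T2), T3), return the two neighbouring arrangements.

    The first swaps T1 with T3, the second swaps T2 with T3.
    """
    (t1, t2), t3 = segment
    return [((t3, t2), t1), ((t1, t3), t2)]
```

Each subtree may itself be a nested pair, so the same function applies to
any internal segment of the tree.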
This strategy of adding species and making local rearrangements will
look
at about (n-1) times (2n-3) different topologies, though if
rearrangements are
frequently successful the number may be larger. I have been
describing the
strategy when rooted trees are being considered. For unrooted trees
there is a
precisely similar strategy, though the first tree constructed may be a
three-species tree and the rearrangements may not start until after the
addition of
the fifth species.
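To give a feeling for the size of this search, the approximate count quoted
above can be evaluated for a given number of species. The function name is
hypothetical and the figure is only the rough estimate stated in the text.

```python
def topologies_examined(n_species):
    """Approximate number of topologies looked at: (n-1) times (2n-3)."""
    return (n_species - 1) * (2 * n_species - 3)
```

For ten species this is about 153 topologies, a tiny fraction of the
34,459,425 rooted bifurcating topologies that exist for ten species.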
Though we are not guaranteed to have found the best tree topology,
we are
guaranteed that no nearby topology (i. e. none accessible by a single
local
rearrangement) is better. In this sense we have reached a local optimum
of our
criterion.
Note that the whole process is dependent on the order in
which the
species are present in the input file. We can try to find a
different and
better solution by reordering the species in the input file and
running the
program again (or, more easily, by using the J option).
If none of
these
attempts finds a better solution, then we have some indication that we
may have
found the best topology, though we can never be certain of this.
Note also that a new topology is never accepted unless it is better
than
the previous one, so that the rearrangement process can never fall
into an
endless loop. This is also the way ties in our criterion are resolved,
namely
by sticking with the tree found first. However, the tree construction
programs
other than CLIQUE, CONTML, FITCH, and DNAML do keep a record of all trees
found
that are tied with the best one found. This gives you some immediate
idea of
which parts of the tree can be altered without affecting the quality
of the
result.
Global Rearrangements
------ --------------
A feature of most of the programs, such as PROTPARS, DNAPARS,
DNACOMP,
DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX, and DOLLOP, is
"global"
optimization of the tree. In four of these (CONTML, FITCH, DNAML and
DNAMLK)
this is an option, 'G'. In the others it automatically applies. When
it is
present there is an additional stage to the search for the best tree.
Each
possible subtree is removed from the tree and added back
in all
possible places. This process continues until all subtrees can be
removed and
added again without any improvement in the tree. The purpose of this
extra
rearrangement is to make it less likely that one or more species get
"stuck"
in a suboptimal region of the space of all possible trees. The use of
global
optimization results in approximately a tripling (3x) of the run-time,
which is
why I have left it as an option in some of the slower programs.
The programs doing global optimization print out a dot "." after
each
group is removed and re-added to the tree, to give the user some sign
that the
rearrangements are proceeding. A new line of dots is started whenever
a new
round of global rearrangements is started following an improvement in the
tree.
On the line before the dots are printed there is printed a bar of the
form
"!--------------!" to show how many dots to expect. The dots will
not be
printed out at a uniform rate, but the later dots, which represent
removal of
larger groups from the tree and trying them consequently in fewer
places, will
print out more quickly. With some compilers each row of dots is not
printed
out until it is complete.
It should be noted that PENNY, DOLPENNY, DNAPENNY and CLIQUE use a
more
sophisticated strategy of "depth-first search" with a "branch and bound"
search
method that guarantees that all of the best trees will be found. In the
case
of PENNY, DOLPENNY and DNAPENNY there can be a considerable
sacrifice of
computer time if the number of species is greater than about ten: it
is a
matter for you to consider whether it is worth it for you to guarantee
finding
all the most parsimonious trees, and that depends on how much free
computer
time you have!
CLIQUE finds all largest cliques, and does so without
undue
burning of computer time.
Multiple Jumbles
-------- -------
As just mentioned, for most of these programs the search depends
on the
order in which the species are entered into the tree. Using the J
(Jumble)
option you can supply a random number seed which will allow the program
to add
the species in a random order. A new feature (with version 3.5) is to
allow
this to be done multiple times. If you tell the program to do it 10
times, it
will go through the tree-building process 10 times, each with a
different
random order of adding species. It will keep a record of the trees
tied for
best over the whole process. In other words, it does not just record
the best
trees from each of the 10 runs, but records the best ones overall.
Of
course
this is slow, taking 10 times longer than a single run. But it does
give us a
much greater chance of finding all of the most parsimonious trees.
In the
terminology of Maddison (1991) it can find different "islands" of
trees. The
present algorithms do not guarantee us to find all trees in a given
"island"
from a single run, so multiple runs also help explore those "islands"
that are
found.
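The bookkeeping for multiple jumbles can be sketched as below. This is an
illustration under stated assumptions, not the package's code: build_tree
is a hypothetical function standing in for one complete tree-building run,
returning a tree and a score where smaller is better (as with parsimony),
and Python's own random number generator is used in place of the one in
the programs, so the 4n+1 seed rule does not apply to it.

```python
import random


def multiple_jumbles(species, build_tree, times, seed):
    """Run build_tree on `times` random input orders; keep the overall best.

    Returns (best_score, best_trees), where best_trees holds every distinct
    tree tied for best across all runs, not the best of each run.
    """
    rng = random.Random(seed)
    best_score = None
    best_trees = []
    for _ in range(times):
        order = list(species)
        rng.shuffle(order)                  # a new random addition order
        tree, score = build_tree(order)
        if best_score is None or score < best_score:
            best_score, best_trees = score, [tree]
        elif score == best_score and tree not in best_trees:
            best_trees.append(tree)
    return best_score, best_trees
```

Trees from a run that does worse than the best score seen so far are
discarded, which is why the output reflects the whole search.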
STRATEGY FOR FINDING THE BEST TREE
In practice, it is advisable to use the Jumble option to evaluate
many
different orderings of the input species. When the programs which have
global
branch-swapping as default (such as DNAPARS) are used or when the G
option is
employed in other programs IT IS ADVISABLE TO USE THE JUMBLE OPTION AND
SPECIFY
THAT IT BE DONE MANY TIMES (AS MANY AS TEN) to use different orderings
of the
input species.
When the G (Global rearrangement) option is not being
used I
have also found it useful to do multiple Jumbles.
People who want a magic "black box" program whose results they do
not have
to question (or think about) often are upset that these programs give
results
that are dependent on the order in which the species are entered in the
data.
To me this property is an advantage, for it permits you to try
different
searches for better trees, simply by varying the input order of
species.
If
you do not use the multiple Jumble option, but do multiple
individual runs
instead, you can easily decide which to pay most attention to -- the
one or
ones that are best according to the criterion employed (for
example, with
parsimony, the one out of the runs that results in the tree with the
fewest
changes).

In practice, in a single run, it usually seems best to put species that
are likely to be sources of confusion in the topology last, as by the time
they are added the arrangement of the earlier species will have stabilized
into a good configuration, and then the last few species will be fitted
into that topology. There will be less chance this way of a poor initial
topology that would affect all subsequent parts of the search. However, a
variety of arrangements of the input order of species should be tried, as
can be done if the J option is used, and no species should be kept in a
fixed place in the order of input. Note that the results of the
"...PENNY" programs and CLIQUE are not sensitive to the input order of
species, and NEIGHBOR is only slightly sensitive to it, so that multiple
Jumbling is not possible with those programs. Note also that with global
search, which is standard in many programs and in others is an option, each
group (including each individual species) will be removed and re-added in
all possible positions, so that a species causing confusion will have more
chance of moving to a new location than it would without global
rearrangement.

A WARNING ON INTERPRETING RESULTS

Probably the most important thing to keep in mind while running any of the
parsimony or compatibility programs is not to overinterpret the result.
Many users treat the set of most parsimonious trees as if it were a
confidence interval. If a group appears in all of the most parsimonious
trees then they treat it as well established. Unfortunately THE CONFIDENCE
INTERVAL ON PHYLOGENIES APPEARS TO BE MUCH LARGER THAN THE SET OF ALL MOST
PARSIMONIOUS TREES (Felsenstein, 1985b). Likewise, variation of result
among different methods will not be a good indicator of the size of the
confidence interval. Consider a simple data set in which, out of 100
binary characters, 51 recommend the rooted tree ((A,B),C) and 49 the tree
(A,(B,C)). Many different methods will all give the same result on such a
data set: they will estimate the tree as ((A,B),C). Nevertheless it is
clear that the 51:49 margin by which this tree is favored is not
significantly different from 50:50. So CONSISTENCY AMONG DIFFERENT METHODS
IS A POOR GUIDE TO STATISTICAL SIGNIFICANCE.
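
The 51:49 example can be checked with an exact two-sided binomial (sign)
test. The sketch below is an illustration only: it assumes the 100
characters are independent and that, under the null hypothesis, each one
is equally likely to favor either tree. The function name is made up for
the example.

```python
from math import comb


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value under a fair-coin (p = 0.5) null:
    the probability of a split at least as lopsided as the one observed."""
    observed = abs(successes - trials / 2)
    extreme = sum(comb(trials, k) for k in range(trials + 1)
                  if abs(k - trials / 2) >= observed)
    return extreme / 2 ** trials


p_value = two_sided_binomial_p(51, 100)
print(f"p = {p_value:.3f}")   # roughly 0.92, nowhere near significance
```

Every outcome other than an exact 50:50 split is at least as extreme as
51:49, which is why the p-value is so close to 1.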

RELATIVE SPEED OF DIFFERENT PROGRAMS AND MACHINES

Relative speed of the different programs
-------- ----- -- --- --------- --------

C compilers differ in efficiency of the code they generate, and some deal
with some features of the language better than with others. Thus a program
which is unusually fast on one computer may be unusually slow on another.
Nevertheless, as a rough guide to relative execution speeds, I have tested
the programs on three data sets, each of which has 10 species and 20
characters. The first is an imaginary one in which all characters are
compatible ("The Willi Hennig Memorial Data Set" as J. S. Farris once
called it). The second is the binary recoded form of the fossil horses
data set of Camin and Sokal (1965). The third data set has data that are
completely random: 10 species and 20 characters with a 50% chance that each
character state is 0 or 1 (or A or C). The data sets range from a
completely compatible one in which there is no homoplasy (parallelism or
convergence), through the horses data set, which requires 29 steps where
the possible minimum number would be 20, to the random data set, which
requires 49 steps. We can thus see how this increasing messiness of the
data affects running times.

Here are the nucleotide sequence versions of the three data sets:

   10   20
A         CACACACAAAAAAAAAAACA
B         CACACAACAAAAAAAAAACA
C         CACAACAAAAAAAAAAAACA
D         CAACAAAACAAAAAAAAACA
E         CAACAAAAACAAAAAAAACA
F         ACAAAAAAAACACACAAAAC
G         ACAAAAAAAACACAACAAAC
H         ACAAAAAAAACAACAAAAAC
I         ACAAAAAAAAACAAAACAAC
J         ACAAAAAAAAACAAAAACAC

   10   20
MesohippusAAAAAAAAAAAAAAAAAAAA
HypohippusAAACCCCCCCAAAAAAAAAC
ArchaeohipCAAAAAAAAAAAAAAAACAC
ParahippusCAAACAACAACAAAAAAAAC
MerychippuCCAACCACCACCCCACACCC
M. secunduCCAACCACCACCCACACCCC
Nannipus  CCAACCACAACCCCACACCC
NeohippariCCAACCCCCCCCCCACACCC
Calippus  CCAACCACAACCCACACCCC
PliohippusCCCACCCCCCCCCACACCCC

   10   20
A         CACACAACCAAACAAACCAC
B         AAACCACACACACAAACCCA
C         ACAAAACCAAACCACCCACA
D         AAAAACACAACACACCAAAC
E         AAACAACCACACACAACCAA
F         CCCAAACACCCCCAAAAAAC
G         ACACCCCCACACCCACCAAC
H         AAAACAACAACCACCCCACC
I         ACACAACAACACAAACAACC
J         CCAAAAACACCCAACCCAAC
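
A data set of the third (random) kind can be generated with a short
Python sketch. This is an illustration only: it will not reproduce the
particular random data set printed above, the function name and defaults
are assumptions, and the layout simply imitates the listings shown here
(a line with the numbers of species and characters, then each name padded
to ten characters and followed by its sequence).

```python
import random


def random_data_set(n_species=10, n_sites=20, states="AC", seed=1):
    """Return a random data set as text: each site of each species is
    drawn independently, with equal chance for each state."""
    if n_species > 26:
        raise ValueError("this sketch names species A..Z only")
    rng = random.Random(seed)
    lines = [f"{n_species:5d}{n_sites:5d}"]
    for i in range(n_species):
        name = chr(ord("A") + i)
        sequence = "".join(rng.choice(states) for _ in range(n_sites))
        lines.append(f"{name:<10}{sequence}")   # name padded to 10 columns
    return "\n".join(lines) + "\n"


print(random_data_set(), end="")
```

Passing states="01" gives the 0/1 form used for the discrete characters
programs.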

Here are the timings of many of the version 3.5 programs on these three
data sets as run after being compiled by Microsoft Quick C on a 16 MHz
80386SX computer under PCDOS 5.0. An 80387 math co-processor was present
and was used by the compiled code.

Program     Hennigian Data   Horses Data   Random Data
-------     --------- ----   ------ ----   ------ ----
PROTPARS    82.83            86.23         148.03
DNAPARS     5.98             5.66          11.54
DNAPENNY    46.03            23.51         5305.97
DNACOMP     7.14             6.43          11.86
DNAINVAR    0.61             0.66          0.61
DNAML       1928.99          2069.32       2611.48
DNAMLK      2247.12          6094.81       4993.00
DNADIST     3.57             4.50          5.38
RESTML      6818.34          13422.15      28418.34
FITCH       35.92            48.61         38.17
KITSCH      12.42            12.36         13.18
NEIGHBOR    2.20             2.14          2.903
CONTML      56.85            57.56         59.15
GENDIST     1.00             1.00          1.00
MIX         13.62            14.60         25.92
PENNY       8.41             21.31         3851.1
DOLLOP      26.69            26.86         46.30
DOLPENNY    12.25            56.57         23934.22
CLIQUE      0.77             0.71          0.77
FACTOR      0.39             0.44          0.44

In all cases the programs were run under the default options, except as
specified here. The data sets used for the discrete characters programs
have 0's and 1's instead of A's and C's. For CONTML the 0's and 1's were
made into 0.0's and 1.0's and considered as 20 2-allele loci. For the
distance programs 10 x 10 distance matrices were computed from the three
data sets. It does not make much sense to benchmark the interactive
programs MOVE, DOLMOVE, or DNAMOVE, although when there are many
characters and many species the response time after each alteration of the
tree should be proportional to the product of the number of species and
the number of characters. For DNAML and DNAMLK the frequencies of the four
bases were set to be equal rather than determined empirically as is the
default. For RESTML the number of enzymes was set to 1.

Several patterns will be apparent from this. The algorithms (MIX, DOLLOP,
CONTML, FITCH, KITSCH, PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, and
RESTML) that use the above-described addition strategy have run times that
do not depend strongly on the messiness of the data. The only exception to
this is that if a data set such as the Random data requires one extra
round of global rearrangements it takes longer. The programs differ
greatly in run time: the likelihood programs RESTML, DNAML and CONTML are
quite a bit slower than the others. The protein sequence parsimony
program, which has to do a considerable amount of bookkeeping to keep
track of which amino acids can mutate to each other, is also relatively
slow.

Another class of algorithms includes PENNY, DOLPENNY, DNAPENNY and CLIQUE.
These are branch-and-bound methods: in principle they should have
execution times that rise exponentially with the number of species and/or
characters, and they might be much more sensitive to messy data. This is
apparent with PENNY, DOLPENNY, and DNAPENNY, which go from being reasonably
fast with clean data to very slow with messy data. DOLPENNY is
particularly slow on messy data - this is because this algorithm cannot
make use of some of the lower-bound calculations that are possible with
DNAPENNY and PENNY. CLIQUE is very fast on all data sets. Although in
theory it should bog down if the number of cliques in the data is very
large, that does not happen with random data, which in fact has few cliques
and those small ones. Apparently the "worst-case" data sets are much rarer
for CLIQUE than for the other branch-and-bound methods.

NEIGHBOR is quite fast compared to FITCH and KITSCH, and should make it
possible to run much larger cases, although the results are expected to be
a bit rougher than with those programs.

Speed with different numbers of species
----- ---- --------- ------- -- -------

How will the speed depend on the number of species and the number of
characters? For the sequential-addition algorithms, the run time should be
proportional to the cube of the number of species, and to the number of
characters. Thus a case that has, instead of 10 species and 20 characters,
20 species and 50 characters would take 2 x 2 x 2 x 2.5 = 20 times as long.
This implies that cases with more than 20 species will be slow, and cases
with more than 40 species VERY slow. This places a premium on working on
small subproblems rather than just dumping a whole large data set into the
programs.
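
The arithmetic of this rule of thumb is easy to put into a small Python
helper. The function and parameter names are made up for the example, and
the result is only the rough proportionality stated above, not a
measurement.

```python
def relative_run_time(n_species, n_chars,
                      base_species=10, base_chars=20, species_power=3):
    """Run time relative to the base case, assuming time grows as
    (number of species) ** species_power times the number of characters."""
    species_factor = (n_species / base_species) ** species_power
    char_factor = n_chars / base_chars
    return species_factor * char_factor


print(relative_run_time(20, 50))   # 2 x 2 x 2 x 2.5 = 20.0
```

The same helper shows why going beyond 40 species is painful:
relative_run_time(40, 20) is 64 times the 10-species case even with the
number of characters unchanged.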

An exception to these rules will be some of the DNA programs that use an
aliasing device to save execution time. In these programs execution time
will not necessarily increase in proportion to the number of sites, as
sites that show the same pattern of nucleotides will be detected as
identical and the calculations for them will be done only once, which does
not lead to more execution time. This is particularly likely to happen
with few species and many sites, or with data sets that have small amounts
of evolutionary divergence.
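
The idea behind aliasing can be sketched in Python as follows. This is
not the code used in the DNA programs, whose details are not given here;
the function names and the toy per-site cost are assumptions chosen only
to show that each distinct site pattern needs to be evaluated once.

```python
from collections import Counter


def site_patterns(sequences):
    """Count how often each site pattern (alignment column) occurs."""
    columns = zip(*sequences)            # one tuple of states per site
    return Counter("".join(column) for column in columns)


def weighted_total(sequences, per_pattern_cost):
    """Evaluate per_pattern_cost once per distinct pattern and weight
    the result by the number of sites that show that pattern."""
    patterns = site_patterns(sequences)
    return sum(count * per_pattern_cost(pattern)
               for pattern, count in patterns.items())


def toy_cost(pattern):
    """Hypothetical per-site cost: number of distinct states minus one."""
    return len(set(pattern)) - 1


sequences = ["AACA", "AACA", "CACC"]     # 3 species, 4 sites
print(site_patterns(sequences))          # only 3 distinct patterns
print(weighted_total(sequences, toy_cost))
```

With few species or little divergence the number of distinct patterns
stays small however many sites there are, which is the situation the
paragraph above describes.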

For programs FITCH and KITSCH, the distance matrix is square, so that when
we double the number of species we also double the number of "characters",
so that running times will go up as the fourth power of the number of
species rather than the third power. Thus a 20-species case with FITCH is
expected to run sixteen times more slowly than a 10-species case.

For programs like PENNY and CLIQUE the run times will rise faster than the
cube of the number of species (in fact, they can rise faster than any power
since these algorithms are not guaranteed to work in polynomial time). In
practice, PENNY will frequently bog down above 11 species, while CLIQUE
easily deals with larger numbers.

For NEIGHBOR the run time should vary only as the square of the number of
species, so a case twice as large will take only four times as long. This
will make it an attractive alternative to FITCH and KITSCH for large data
sets.

If you are unsure of how long a program will take, try it first on a few
species, then work your way up until you get a feel for the speed and for
what size of problem you can afford to run.
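
One way to "work your way up" is to time two small runs and extrapolate
with a power law. The Python sketch below is only an illustration: the
timings in it are hypothetical, the function names are made up, and a
power law will badly underestimate the branch-and-bound programs, whose
run times can rise faster than any power.

```python
from math import log


def fitted_power(n_small, t_small, n_large, t_large):
    """Exponent p such that run time grows roughly as n ** p,
    estimated from two timed runs with different numbers of species."""
    return log(t_large / t_small) / log(n_large / n_small)


def extrapolate(n_target, n_ref, t_ref, power):
    """Predicted run time at n_target species from one reference run."""
    return t_ref * (n_target / n_ref) ** power


# Hypothetical measurements: 5 species took 2 s, 10 species took 16 s.
power = fitted_power(5, 2.0, 10, 16.0)
print(round(power, 2))                           # about 3, i.e. cubic growth
print(round(extrapolate(20, 10, 16.0, power)))   # about 128 s for 20 species
```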

Execution time is not the most important criterion for a program,
particularly as computer time gets much cheaper than your time or a
programmer's time. With workstations on which background jobs can be run
all night, execution speed is not overwhelmingly relevant. Some of us have
been conditioned by an earlier era of computing to consider execution speed
paramount. But ease of use, ease of adaptation to your computer system,
and ease of modification are much more important in practice, and in these
respects I think these programs are adequate. Only if you are engaged in
1960's style mainframe computing is minimization of execution time
paramount.

Nevertheless it would have been nice to have made the programs faster.
The present speeds are a compromise between speed and effectiveness: by
making them slower and trying more rearrangements in the trees, or by
enumerating all possible trees, I could have made the programs more likely
to find the best tree. By trying fewer rearrangements I could have speeded
them up, but at the cost of finding worse trees. I could also have speeded
them up by writing critical sections in assembly language, but this would
have sacrificed ease of distribution to new computer systems. There are
also some options included in these programs that make it harder to adopt
some of the economies of bookkeeping that make other programs faster.
However to some extent I have simply made the decision not to spend time
trying to speed up program bookkeeping when there were new likelihood and
statistical methods to be developed.

Relative speed of different machines
-------- ----- -- --------- --------

It is interesting to compare different machines using DNAPARS as the
standard task. One can rate a machine on the DNAPARS benchmark by summing
the times for all three of the data sets. Here are relative total timings
over all three data sets (done with various versions of DNAPARS) for some
machines, taking Microsoft Quick C running under PCDOS on a 16 MHz 80386
clone as the standard. Pascal benchmarks from version 3.4 of the program
are also included -- they are compared only with each other and their times
are in parentheses. This use of two separate standards is necessary not
because of different languages but because different versions of the
package are being compared. Thus, the "Time" is the ratio of the Total to
that for the 386SX, for the appropriate standard, so that the Time for the
Macintosh Classic for DNAPARS 3.4 on Think Pascal 3 is compared to the Time
for the 386SX running DNAPARS 3.4 on Turbo Pascal 6.0, but the Time for the
Macintosh Classic running version 3.5 on Think C is compared to the Time
for the 386SX running version 3.5 on Quick C. The Speed is the reciprocal
of the Time.

Machine              OS     Compiler               Total     Time     Speed
-------              --     --------               -----     ----     -----
Toshiba T1100+       PCDOS  Turbo Pascal 3.01A     (269)     7.912    0.126
Apple Mac Plus       MacOS  Lightspeed Pascal 2    (175.84)  5.172    0.193
Toshiba T1100+       PCDOS  Turbo Pascal 5.0       (162)     4.765    0.210
Macintosh Classic    MacOS  Think Pascal 3         (160)     4.706    0.212
Macintosh Classic    MacOS  Think C                43.0      3.58     0.279
IBM PS2/60           PCDOS  Turbo Pascal 5.0       (58.76)   1.728    0.579
80286 (12 Mhz)       PCDOS  Turbo Pascal 5.0       (47.09)   1.385    0.722
Apple Mac IIcx       MacOS  Think Pascal 3         (42)      1.235    0.810
Apple Mac SE/30      MacOS  Think Pascal 3         (42)      1.235    0.810
Apple Mac IIcx       MacOS  Lightspeed Pascal 2    (39.84)   1.172    0.853
Apple Mac IIcx       MacOS  Lightspeed Pascal 2#   (39.69)   1.167    0.857
Zenith Z386 (16MHz)  PCDOS  Turbo Pascal 5.0       (38.27)   1.155    0.866
Macintosh SE/30      MacOS  Think C                13.6      1.132    0.883
80386SX (16 MHz)     PCDOS  Turbo Pascal 6.0       (34)      1.0      1.0
80386SX (16 MHz)     PCDOS  Microsoft Quick C      12.01     1.0      1.0
Sequent S-81         DYNIX  Silicon Valley Pascal  (13.0)    0.382    2.615
VAX 11/785           Unix   Berkeley Pascal        (11.9)    0.35     2.857
80486-33             PCDOS  Turbo Pascal 6.0       (11.46)   0.337    2.967
Sun 3/60             SunOS  Sun C                  3.93      0.327    3.056
NeXT Cube (68030)    Mach   Gnu C                  2.608     0.217    4.605
Sequent S-81         DYNIX  Sequent Symmetry C     2.604     0.217    4.612
VAXstation 3500      Unix   Berkeley Pascal        (7.3)     0.215    4.658
Sequent S-81         DYNIX  Berkeley Pascal        (5.6)     0.1647   6.07
Unisys 7000/40       Unix   Berkeley Pascal        (5.24)    0.1541   6.49
VAX 8600             VMS    DEC VAX Pascal         (3.96)    0.1165   8.59
Sun SPARC IPX        SunOS  Gnu C version 2.1      1.28      0.1066   9.383
VAX 6000-530         VMS    DEC C                  0.858     0.0714   13.998
VAXstation 4000      VMS    DEC C                  0.809     0.0674   14.845
IBM RS/6000 540      AIX    XLP Pascal             (2.276)   0.0669   14.94
NeXTstation(040/25)  Mach   Gnu C                  0.75      0.0624   16.013
Sun SPARC IPX        SunOS  Sun C                  0.68      0.0566   17.662
486DX (33 MHz)       Linux  Gnu C #                0.63      0.0525   19.063
Sun SPARCstation-1+  Unix   Sun Pascal             (1.7)     0.05     20.00
DECstation 5000/200  Unix   DEC Ultrix C           0.45      0.0375   26.69
Sun SPARC 1+         SunOS  Sun C                  0.40      0.0333   30.025
DECstation 3100      Unix   DEC Ultrix RISC Pascal (0.77)    0.0226   44.16
IBM 3090-300E        AIX    Metaware High C        0.27      0.0225   44.48
DECstation 5000/125  Unix   DEC Ultrix RISC C      0.267     0.0222   44.98
DECstation 5000/200  Unix   DEC Ultrix RISC C      0.256     0.0222   44.98
Sun SPARC 4/50       SunOS  Sun C                  0.249     0.02073  48.23
DEC 3000/400 AXP     Unix   DEC C                  0.224     0.01865  53.62
DECstation 5000/240  Unix   DEC Ultrix RISC C      0.1889    0.01573  63.58
SGI Iris R4000       Unix   SGI C                  0.184     0.01532  65.27
IBM 3090-300E        VM     Pascal VS              (0.464)   0.0136   73.28
DECstation 5000/200  Unix   DEC Ultrix RISC Pascal (0.39)    0.0114   87.18
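
The Time and Speed columns of the DNAPARS table follow directly from the
Total column and the appropriate 386SX standard. The Python sketch below
recomputes one row of each kind; the function and constant names are made
up for the example, and the two standard totals are the 386SX entries
from the table.

```python
QUICK_C_STANDARD = 12.01       # 386SX, version 3.5, Microsoft Quick C
TURBO_PASCAL_STANDARD = 34.0   # 386SX, version 3.4, Turbo Pascal 6.0


def time_and_speed(total, standard_total):
    """Time is the Total divided by the standard's Total;
    Speed is the reciprocal of Time."""
    time = total / standard_total
    return time, 1.0 / time


# Macintosh Classic, Think C (a version 3.5 entry)
time, speed = time_and_speed(43.0, QUICK_C_STANDARD)
print(f"{time:.2f} {speed:.3f}")      # 3.58 0.279

# Macintosh Classic, Think Pascal 3 (a version 3.4 entry, in parentheses)
time, speed = time_and_speed(160.0, TURBO_PASCAL_STANDARD)
print(f"{time:.3f}")                  # 4.706
```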

The Toshiba T1100+ should be exactly as fast as an 8 MHz PC clone. For a
couple of the machines I am not sure that this benchmark is representative
of timings on non-numerical programs in PHYLIP. This is particularly the
case for the DEC 3000/400 AXP (the DEC "Alpha") which is probably quite a
bit faster than indicated here. The numerical programs benchmark below
gives it a fairer test. The IBM RS/6000 is probably up to ten times faster
than shown here: it may have been ill-served by its Pascal compiler.

Note that parallel machines like the Sequent are not really as slow as
indicated by the data here, as these runs did nothing to take advantage of
their parallelism.

For a picture of speeds for a more numerically intensive program, here are
benchmarks using DNAML, with the 16 MHz 386SX with math co-processor active
as the standard. Numbers are total run times (total user time in the case
of Unix) over all three data sets.

Machine                OS     Compiler           Seconds   Time     Speed
-------                --     --------           -------   ----     -----
386SX 16 Mhz           PCDOS  Turbo Pascal 6     (7826)    1.0      1.0
386SX 16 Mhz           PCDOS  Quick C            6549.79   1.0      1.0
Compudyne 486DX/33     Linux  Gnu C              1599.9    0.2441   4.096
SUN Sparcstation 1+    SunOS  Sun C              1402.8    0.2142   4.669
Everex STEP 386/20     PCDOS  Turbo Pascal 5.5   (1440.8)  0.1841   5.432
486DX/33               PCDOS  Turbo C++          1107.2    0.1690   5.916
Compudyne 486DX/33     PCDOS  Waterloo C/386     1045.78   0.1597   6.263
Sun SPARCstation IPX   SunOS  Gnu C              960.2     0.1466   6.821
NeXTstation(68040/25)  Mach   Gnu C              916.6     0.1399   7.146
486DX/33               PCDOS  Waterloo C/386     861.0     0.1314   7.607
Sun SPARCstation IPX   SunOS  Sun C              787.7     0.1203   8.315
486DX/33               PCDOS  Gnu C              650.9     0.0994   10.063
VAX 6000-530           VMS    DEC C              637.0     0.0973   10.282
DECstation 5000/200    Unix   DEC Ultrix RISC C  423.3     0.0646   15.473
IBM 3090-300E          AIX    Metaware High C    201.8     0.0308   32.46
Convex C240/1024       Unix   C                  101.6     0.01551  64.47
DEC 3000/400 AXP       Unix   DEC C              98.29     0.01501  66.64

You are invited to send me figures for your machine for inclusion in
future tables. Use the data sets above and compute the total times for
DNAPARS and for DNAML for the three data sets (setting the frequencies of
the four bases to 0.25 each for the DNAML runs). Be sure to tell me the
name and version of your compiler, and the version of PHYLIP you tested.

Published Benchmarks
--------- ----------

Some of you may have seen the "benchmark" published by Luckow and Pimentel
(1985). PHYLIP's WAGNER (an immediate ancestor of MIX) did not do well in
it, either in terms of the quality of result or execution speed. I do not
believe that this was a fair benchmark. WAGNER was run only with one order
of input species, not ten as recommended here. Had it been, perhaps the
shortest tree would have been found more often. No credit was given to
PHYLIP in that article for its free distribution, availability on
microcomputers, availability in source code form, or portability to new
computers. Pimentel's laboratory commissioned the development of a
competing package, PHYSYS, which is a commercial product, and that
involvement was not stated in the article.

The benchmarks by Fink (1986) are fairer, although there are some
impressions given by that article which do not apply to the present
version. In particular, I have since added to many of the programs the
ability to save multiple equally-parsimonious trees, and have changed the
outputs so that reconstruction of states in the hypothetical ancestral
nodes is much easier, thus answering Fink's major criticisms. I have since
eliminated the Metropolis annealing method algorithms which he criticized.
I disagree with Fink's view of PHYLIP that one should "be wary of published
results from an analysis using it", as I do not think that a tree slightly
longer than the most parsimonious one should be rejected out of hand. Nor
do I agree that "it is really too slow to use as a teaching tool", as in
teaching one uses small data sets and speed is not of the essence. Rather,
simplicity of user interface is paramount, and there PHYLIP does very well
(so is the ability to run on a variety of computers, in which respect
PHYLIP is also superior). In fact, it is widely used as a teaching tool.

Nevertheless MIX is undoubtedly not as fast or as sophisticated as PAUP or
Hennig86. The present version of PHYLIP is closer to its competitors in
quality of result than was the version Fink reviewed.

Platnick's (1987) benchmarks concentrated, as did the other benchmarkers
(all of them members of the same school of systematists) on parsimony as
the only phylogeny criterion worthy of attention. He concluded that PHYLIP
could be used effectively, especially if up to ten different input orders
of species were used. Again, as with the other benchmarks, no credit was
given for diversity of methods, portability, price, or availability of
source code.

Platnick's second benchmark paper (1989) concentrates on Hennig86 and
PAUP, and concludes that PHYLIP has not kept up with those programs in its
features. Again, the review is entirely concerned with parsimony, and only
the barest mention is made of ... (you can complete this sentence).

Sanderson's (1990) benchmark paper breaks with the method of the others by
specifying 36 features of the packages rated and giving separate ratings in
each. Like the other benchmark papers it concentrates almost exclusively
on parsimony as applied to morphological characters, but does at least give
some credit where credit is due.

My own, obviously biased, feeling is that there is a discrepancy between
the benchmarkers' projections of how satisfied users of PHYLIP will be, and
how satisfied they actually are. And that this discrepancy is in PHYLIP's
favor.

ENDORSEMENTS

Here are some comments about PHYLIP. Explanatory material in square
brackets is my own:

From the pages of Cladistics:

   "Under no circumstances can we recommend PHYLIP/WAG [their name for the
   Wagner parsimony option of MIX]."
                                   Luckow, M. and R. A. Pimentel (1985)

   "PHYLIP has not proven very effective in implementing parsimony (Luckow
   and Pimentel, 1985)."
                                   J. Carpenter (1987a)

   "... PHYLIP. This is the computer program where every newsletter
   concerning it is mostly bug-catching, some of which have been put there
   by previous corrections. As Platnick (1987) documents, through dint of
   much labor useful results may be attained with this program, but I would
   suggest an easier way: FORMAT b:"
                                   J. Carpenter (1987b)

   "PHYLIP is bug-infested and both less effective and orders of magnitude
   slower than other programs ...."
                                   "T. N. Nayenizgani" [J. S. Farris]
                                   (1990)

   "Hennig86 [by J. S. Farris] provides such substantial improvements over
   previously available programs (for both mainframes and microcomputers)
   that it should now become the tool of choice for practising
   systematists."
                                   N. Platnick (1989)

and in the pages of other journals:

   "The availability, within PHYLIP of distance, compatibility, maximum
   likelihood, and generalized 'invariants' algorithms (Cavender and
   Felsenstein, 1987) sets it apart from other packages .... One of the
   strengths of PHYLIP is its documentation ...."
                                   Michael J. Sanderson (1990)

   (Sanderson also criticizes PHYLIP for slowness and inflexibility of its
   parsimony algorithms, and compliments other packages on their
   strengths).

   "This package of programs has gradually become a basic necessity to
   anyone working seriously on various aspects of phylogenetic inference
   .... The package includes more programs than any other known phylogeny
   package. But it is not just a collection of cladistic and related
   programs. The package has great value added to the whole, and for this
   it is unique and of extreme importance .... its various strengths are in
   the great array of methods provided ...."
                                   Bernard R. Baum (1989)

(see also above under Benchmarks for W. Fink's critical remarks (1986) on
version 2.8 of PHYLIP).

GENERAL COMMENTS ON ADAPTING THE PACKAGE TO DIFFERENT COMPUTER SYSTEMS

In the sections following you will find instructions on how to adapt the
programs to different computers and compilers. The programs should compile
without alteration on most versions of C. They use the "malloc" or
"calloc" library function to allocate memory, so that the upper limits on
how many species or how many sites or characters they can run are set by
the system memory available to that memory-allocation function.

In the document file for each program, I have supplied a small input
example, and the output it produces, to help you check whether the programs
are running properly.

Most of the programs read their data from a file called "infile" and write
their output to a file called "outfile" and a tree file to a file
"treefile". If "infile" does not exist the program will prompt you for its
name.

Compiling the programs
--------- --- --------

Many machines that have C compilers, particularly Unix systems, have a
utility called "make" available that considerably simplifies the process of
compiling these programs. I will first discuss how to compile these
programs with "make" and then, after a digression on how to move PHYLIP to
a microcomputer, discuss for different individual systems how to compile
the programs. As we shall see below, for some DOS and Macintosh compilers
one cannot simply use "make" and the standard Makefile.

Using "make"
----- ------

If your machine has "make" you can place all the programs for the package,
together with the file "Makefile" and the header files "phylip.h" and
"drawgraphics.h", in one directory. The Makefile and header files are
constructed to detect, for many varieties of C, which it is dealing with,
and inform the programs accordingly so that they can (by using "#ifdef")
adapt to the idiosyncrasies of the compiler.

To compile all the programs just type:

   make all

To compile just one program, such as DNAML, type:

   make dnaml

After a time the compiler will finish compiling. The names of the
executables will be the same as the names of the C programs, but without
the ".c" suffix. Thus dnaml.c compiles to make an executable called
"dnaml". If object modules ending in ".o" are found in the directory after
compilation they can be removed if you need space.

Getting PHYLIP onto your microcomputer
------- ------ ---- ---- -------------

C is widely available on microcomputers, and in any case we also distribute
executable versions for PCDOS, 386 PCDOS, and Macintosh systems. Your
institution may have an Internet connection, and if so there is probably a
PCDOS system or a Macintosh somewhere connected directly to it. Using that
machine you could download the executables and put them directly onto
diskette for transfer to your own machine. You can also get the source
code, documentation, and executables by sending me the appropriate number
of diskettes (see the general information at the start of this document).

If you cannot do this, you may be able to transfer the entire package, in
the form of self-extracting archives (which is one of the ways we
distribute it for microcomputers) to your system using a terminal program
with file transfer capabilities. Some users are sufficiently terrified of
this prospect that they prefer to mail us diskettes and wait for several
weeks. But if your institution has an Internet connection it is much
faster to do it that way. If you have a serial port to which a modem can
be hooked, you can get a terminal program and do the transfers yourself.
For most microcomputer systems, public-domain or shareware terminal
programs are available, such as the widely-distributed KERMIT and MODEM
families of programs. Most university computer centers have communications
programs (KERMIT or XMODEM) to "talk" to KERMIT, MODEM, or PC-TALK and
transfer files to and from it.

Thus, if you cannot get from me a disk format readable by your machine,
you can:

(1)  Get an account on your mainframe and learn to use its facilities for
     "anonymous ftp" (transfer of files over Internet) or electronic mail.

(2a) If you are on Internet (or NSFNET) use the "anonymous ftp" method to
     receive the self-extracting archive files (start by downloading and
     reading the file "pub/phylip/Read.Me" from my system whose Internet
     address is evolution.genetics.washington.edu (128.95.12.41)), or

(2b) if your institution is not on Internet but does have Bitnet
     electronic mail, you can request that I send you the PHYLIP source
     code files and documentation as e-mail messages over BITNET/EARN (not
     the executables, however).

(3)  Make sure the files are saved on your mainframe account (you will
     need about 2.2 Megabytes of space) under appropriate names.

(4)  Use the file transfer provisions of your terminal program to transfer
     the archives to your microcomputer, or if they came as many e-mail
     messages, to transfer these to your machine individually (most file
     transfer programs can transfer many files with one command) for later
     compilation of the C source.

If you cannot read the diskette formats that I can write, and if you
absolutely INSIST that I distribute the package in this format, please send
me the computer and thirteen diskettes. I will promptly write the
diskettes and return them (but of course I will keep your computer).

Now we turn to particular C compilers and describe particular problems
that may be encountered.

Microsoft Quick C and Microsoft C
--------- ----- - --- --------- -

These comments apply to Microsoft Quick C but may also work with Microsoft
C.

A Makefile for Microsoft Quick C is included with the source code. It is
called "Makefile.qc". If you copy it and call the copy "Makefile" (making
sure to first save the generic Makefile that comes with this package under
some name such as Makefile.old), you should be able to use "make" as
described above, except that it is called "nmake". Note that the command
you must use to compile (for example) DNAPARS is "nmake dnapars.exe", not
"nmake dnapars", as the program that results is to be called "dnapars.exe"
and the Quick C Makefile is set up that way.

To compile individual programs without using the makefile, you need to do
the following. For a non-graphics program use the following command (DOS>
is the PCDOS prompt, so you do not type it):

   DOS> qcl /AH /F 4000 /FPi [source files]

If the program you are trying to compile is a 1-part source (for example,
neighbor only has one part, neighbor.c) you should replace "[source files]"
with "neighbor.c". So the command would be:

   DOS> qcl /AH /F 4000 /FPi neighbor.c

If the program you are trying to compile is a 2-part source (for example,
mix has two parts, mix.c and mix2.c) you can replace [source files] with
both of the source files. Make sure that the first source file in the list
has the same name as the executable file you want, i.e. use mix.c mix2.c
and not the other way around. If you reorder them, the executable file
will be called "MIX2.EXE". For mix, the command would be:

   DOS> qcl /AH /F 4000 /FPi mix.c mix2.c
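
The rule about the order of source files can be made concrete with a small
Python sketch that assembles the command line and reports which executable
name to expect. The helper and its name are hypothetical and are not part
of the package; the flags are simply the ones quoted above, and the naming
rule is the one just described (the executable is named after the first
source file).

```python
def qcl_command(sources, flags=("/AH", "/F", "4000", "/FPi")):
    """Build the qcl command line for a list of source files and return
    it together with the executable name implied by the first source."""
    if not sources:
        raise ValueError("at least one source file is required")
    # The executable takes its name from the first source file listed.
    executable = sources[0].rsplit(".", 1)[0].upper() + ".EXE"
    command = " ".join(["qcl", *flags, *sources])
    return command, executable


print(qcl_command(["mix.c", "mix2.c"]))
print(qcl_command(["mix2.c", "mix.c"])[1])   # wrong order gives MIX2.EXE
```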

To compile a graphics program (i.e. drawgram, drawtree) under Quick C
without using the makefile, use one of the following commands:

for DRAWGRAM:

   DOS> qcl /AH /F 4000 /FPi drawgram.c drawgraphics.c graphics.lib

for DRAWTREE:

   DOS> qcl /AH /F 4000 /FPi drawtree.c drawgraphics.c graphics.lib

Turbo C++ for PCDOS
----- --- --- -----

The following instructions are for Turbo C++ but may also work for Turbo C
and for Borland C, perhaps with slight modifications. Under normal
situations you can use the makefile. The makefile for Turbo C++ is
included in the package as "Makefile.tc". Copy it and call the copy
"Makefile" (it would be wise to first rename the original "Makefile" to
"Makefile.old"). Then to compile, say, DNAPARS, just type:

   make dnapars.exe

However, if for some reason you want to do it by hand, follow these steps:

For the non-graphical programs (all those other than DRAWGRAM and
DRAWTREE):

To compile dnapars.c type the following (DOS> is the PCDOS prompt):

   DOS> tcc -mh dnapars.c

If the source file is sufficiently large to require two sources (for
example, dnaml.c and dnaml2.c), you will need to use both dnaml.c and
dnaml2.c. Examples:

   DOS> tcc -mh dnaml.c dnaml2.c
   DOS> tcc -mh neighbor.c

If you would like to use the program under the TD debugger, you should add
a "-v" flag as a compiler option:

   DOS> tcc -mh -v restml.c restml2.c
For the graphical programs (DRAWGRAM and DRAWTREE):
First you need to build the "BGI" drivers. The BGI drivers are
included
with your TURBOC compiler, and should be in the "BGI" directory (this
is
a subdirectory of the main turboc directory). To do this you need to
use
the "bgiobj" program, also in the BGI directory. The current version
of PHYLIP supports the EGA/VGA, CGA, and hercules drivers. If you
have
modified the sources to take advantage of other drivers, you will have
to include those as well.
To build the BGI drivers:
DOS> cd \tc\bgi   [this should be replaced with whatever your turboc dir is]
DOS> BGIOBJ EGAVGA
DOS> BGIOBJ CGA
DOS> BGIOBJ HERC
This generates the files "EGAVGA.OBJ", "CGA.OBJ", and "HERC.OBJ" in the
current directory. You want to copy these into your main source
directory (assume this is \phylip):
DOS> COPY EGAVGA.OBJ \phylip   [replace this with your source directory]
DOS> COPY CGA.OBJ \phylip
DOS> COPY HERC.OBJ \phylip
To compile the program, cd back to your source directory. You want
to compile each source file, plus a shared graphics file called
"drawgraphics.c". You also want to link it to the newly created BGI
object files and to the graphics library.
Examples:
DOS> tcc -mh drawgram.c drawgraphics.c herc.obj egavga.obj cga.obj graphics.lib
DOS> tcc -mh drawtree.c drawgraphics.c herc.obj egavga.obj cga.obj graphics.lib
(to compile drawgram and drawtree, respectively)
If you want to compile for the TD debugger, add the -v flag as above.
Waterloo C/386
-------- -----
Waterloo C/386 is the compiler we use to create the 386 PCDOS and 386
Windows versions of the executables. It has a "make" capability called
"wmake". We have had problems using this so the instructions here are
for individually compiling programs without wmake.
Watcom C/386 is a very flexible compiler which can generate executable
programs for many different environments. Following are instructions for
using Watcom C/386 to compile for DOS using the DOS/4GW DOS extender
(included with the Watcom distribution) and for Microsoft Windows.
DOS/4GW:
to compile a program under watcom C/386 for the DOS/4GW dos
extender use
the following (the "DOS>" is the PCDOS prompt, not something you type):
DOS> wcl386 /l=dos4gw /p /k65520 [source files]
If the program you are trying to compile is a 1-part source (for
example,
neighbor only has one part, neighbor.c) you can replace [source files]
with
"neighbor.c". So the command would be:
DOS> wcl386 /l=dos4gw /p /k65520 neighbor.c
If the program you are trying to compile is a 2-part source (for
example, mix
has two parts, mix.c and mix2.c) you can replace [source files] with
both of
the source files. Make sure that the first source file in the list
has the
same name as the executable file you want. i.e. use mix.c mix2.c and
not the
other way around. If you reorder them, the executable file will be
called
"MIX2.EXE". For mix, the command would be:
DOS> wcl386 /l=dos4gw /p /k65520 mix.c mix2.c
The resultant executable file will take advantage of your system's
extended
memory and will not be limited to using only the first 640K. However, it
needs
the file "dos4gw.exe" in order to run. If you want to be able to
use the
program generated, make sure that this program is somewhere in your
path. (To
ensure this you can copy the program into the directory where the
compiled
program resides).
This "dos extender" is bundled with the Watcom
C/386
compiler and is freely redistributable.
For Windows:
to compile a program under watcom C/386 for windows use the following:
DOS> wcl386 /l=win386 /zw /p /k65520 [source files]
again, replace [source files] with either the complete program (ie
neighbor.c)
or both parts of the program (ie mix.c mix2.c).
Once you have compiled the Windows program you are not quite ready to
run the program under Windows. The final step is to link it with the
"windows supervisor". To do this do the following:
DOS> wbind [program] -n
i.e.:
DOS> wbind mix -n
This program will generate [programname].exe. This application will be
runnable under Windows.
CAVEATS:
1. Make sure that when you use wbind that \watcom\binw is somewhere in
your path. If it is not, you may have to tell wbind explicitly where
the windows supervisor file is, as in the following example:
DOS> wbind mix -n -s c:\watcom\binw\win386.ext
Replace the c:\watcom\binw\win386.ext with the full path of win386.ext.
2. The draw programs (drawgram, drawtree) currently do not compile
under Windows. Compile them for DOS/4GW and use them in a DOS shell
under Windows.
Think C for Macintosh
----- - --- ---------
For Symantec's Think C compiler (formerly called Lightspeed C) a "make"
utility is not available. Thus you cannot use the Makefile but must
compile the programs individually. Here are the steps you should follow
to compile a typical program.
(1) Start up Think-C.
(2) Click on "New project" in the Think C project menu. You will be
asked to enter the name of the project.
(3) Add the source code for the program to the project. To add sources
to the
project, you need to click on "add" from the source menu. You will need
to add
the sources from the main program (i.e. "neighbor.c" in the case of a
program
in 1 part or "dnaml.c" and "dnaml2.c" in the case of a 2-part
program). You
also need to add "interface.c" (included with the distribution) and two
things which are included with the Think C compiler. The first one is
"MacTraps", and is contained within the Think C folder under a directory
called "MacLibraries". The second one is "ANSI", and is contained within
the Think C folder under a directory called "C Libraries".
(4) Segment the project: After adding each of the sources to the
project, you
need to segment the project. This means that every source file is
contained
within its own 32K segment. In order to do this within Think C, you can
click
on a source file name in the Think C project window (the window that
lists each
of the sources) and drag it down to the bottom of the source list.
After you
have done this for each of the source files, a dotted line should appear
around
each source file in the project window.
(5) Set up compile options: The first thing you need to do is set up what
sort
of project you're compiling, and some of the characteristics of how the
memory
is set up. To do this, select "Set project type" in the "Project"
menu, and
make sure it's set up to be an Application with far code and far data.
Depending on the hardware you will be running on, you may want to
select
different compilation options. Most notably, if your machine has a
68881 math
coprocessor, enable the use of the coprocessor by selecting "Options"
under the
"Edit" window, selecting "Compiler settings" through the list at the
upper left
corner of the display, and then checking the box next to "Generate
68881
instructions".
(6) Compile the project: select "Make" under the source window. After
this has
completed (assuming that there were no compile errors), you need to
generate a
mac application. To do this, select "Build Application" under the
project
menu. Select a name for the application, and think C will create a
Macintosh
application.
Although this is more tedious than using a Makefile, Think C works very
well
with the PHYLIP programs and is the compiler we use for creating the
Macintosh
executables.
Unix
----
I have already mentioned that under Unix you can use the "make" command
to compile programs. This works on all Unix systems. To compile an
individual program like dnapars.c you can give the command
"make dnapars" or alternatively "cc dnapars.c -lm".
When compiling programs that come in two parts, such as dnaml.c and
dnaml2.c, you will have to issue three commands, two compile commands
and one link command:
cc -c dnaml.c
cc -c dnaml2.c
cc dnaml.o dnaml2.o -lm -o dnaml
where the first two commands produce the object modules dnaml.o and
dnaml2.o and the third command links them together into an executable
that is called dnaml.
In running the programs, you may sometimes want to put them in
background
so you can proceed with other work. On systems with a windowing
environment
they can be put in their own window, and commands like "nice" used to
make them
have lower priority so that they do not interfere with interactive
applications
in other windows. If there is no windowing environment, you will want
to use
an ampersand ("&") after the command file name when invoking it to put
the job
in the background. You will have to put all the responses to the
interactive
menu of the program into a file and tell the background job to take its
input
from that file.
For example: suppose you want to run DNAPARS in background, taking its
input data from a file called sequences.dat, putting its interactive
output to a file called "screenout", and using a file called "input" as
the place to store the interactive input. The file "input" need only
contain two lines:
sequences.dat
Y
which is what you would have typed to run the program interactively, in
response to the program's request for an input file name if it did not
find a file named "infile", and in response to the menu.
To run the program in background, you would simply give the command:
dnapars < input > screenout &
which runs the program with input responses coming from "input" and
interactive
output being put into file "screenout". The usual output file and
tree file
will also be created by this run (keep that in mind as if you run any
other
PHYLIP program from the same directory while this one is running in
background
you may overwrite the output file from one program with that from the
other!).
If you wanted to give the program lower priority, so that it
would not
interfere with other work, and you have Berkeley Unix type job
control
facilities in your Unix, you can use the "nice" command:
nice +10 dnapars < input > screenout &
which lowers the priority of the run. To also time the run and put the
timing
at the end of "screenout", you can do this:
nice +10 ( time dnapars < input ) >& screenout &
which I will not attempt to explain.
You may also want to explore putting the interactive output into the
null file "/dev/null" so as to not be bothered with it (but then you
cannot look at it to see why something went wrong). If you have problems
with creating output files that are too large, you may want to explore
carefully the turning off of options in the programs you run.
If you are doing several runs in one, as for example when you
do a
bootstrap analysis using SEQBOOT, DNAPARS (say), and CONSENSE, you can
use an
editor to create a "batch file" with these commands:
seqboot < input1 > screenout
mv outfile infile
dnapars < input2 >> screenout
mv treefile infile
consense < input3 >> screenout
and then take the file (say "foofile") containing these commands and
give it
execute permission by using the command "chmod +x foofile" followed
by the
command "rehash". Then the job that foofile describes can be run as a
single
job in background by giving the command "foofile &". Note that you
must also
have the interactive input commands for SEQBOOT (including the random
number
seed), DNAPARS, and CONSENSE in the separate files "input1",
"input2", and
"input3".
With Berkeley-style job control the "nice" command can be
used
within the batch file "foofile" before each program name to reduce the
priority
with which the programs run.
VMS VAX systems
--- --- -------
On the VMS operating system with DEC VAX VMS C the programs will
compile
without alteration, except that we have to add some extra routines
because the
"%hd" format in printf and fprintf does not work. These extra routines
are in
the file VAXFIX.C. The commands for compiling a typical program
(DNAPARS) are:
$ DEFINE LNK$LIBRARY SYS$LIBRARY:VAXCRTL
$ CC DNAPARS.C
$ CC VAXFIX.C
$ LINK DNAPARS,VAXFIX
Once you use this "$ DEFINE" statement during a given interactive
session, you
need not repeat it again as the symbol "LNK$LIBRARY" is thereafter
properly
defined. The compilation process leaves a file DNAPARS.OBJ in your
directory:
this can be discarded. The executable program is named DNAPARS.EXE.
To run
the program one then uses the command:
$ R DNAPARS
The compiler defaults to the filenames "INFILE.", "OUTFILE.", and
"TREEFILE.". If the input file "INFILE." does not exist the program will
prompt you to type in its name. Note that some commands on VMS such as
"TYPE OUTFILE" will fail because the name of the file that it will
attempt to type out will be not "OUTFILE." but "OUTFILE.LIS". To get it
to type the right file you would have to instead issue the command
"TYPE OUTFILE.".
Some of the programs come in several pieces that have to be compiled and
linked together. For example, DNAML comes in two pieces, dnaml.c and
dnaml2.c. To compile them and link the resulting object files together
into one executable, use the commands:
$ DEFINE LNK$LIBRARY SYS$LIBRARY:VAXCRTL
$ CC DNAML.C
$ CC DNAML2.C
$ CC VAXFIX.C
$ LINK DNAML,DNAML2,VAXFIX
This will make an executable called DNAML.EXE plus two ".OBJ" files that
can be
discarded.
Note that when a LINK command is issued the name of the
first file
(in this case DNAML) becomes the name of the ".EXE" file that is
produced by
the linker.
To make it easier to compile all of the programs on VMS systems, we have
supplied a command file, "compile.com" that will do this. If you install
that file and issue the command "@compile" it will compile all of the
programs. However it is recommended that you also know how to recompile
individual programs so that they can be altered to your purposes.
The programs DRAWGRAM and DRAWTREE both use routines in drawgraphics.c.
To compile (for example) DRAWGRAM, use:
$ DEFINE LNK$LIBRARY SYS$LIBRARY:VAXCRTL
$ CC DRAWGRAPHICS.C
$ CC DRAWGRAM.C
$ CC VAXFIX.C
$ LINK DRAWGRAM,DRAWGRAPHICS,VAXFIX
which will create a file called DRAWGRAM.EXE, plus two ".OBJ" files.
When you run DRAWGRAM you must have a font file present in your
directory, as well as the tree file. If they are not found under their
default names the program will prompt you for these.
When you are using the interactive previewing feature of DRAWGRAM (or
DRAWTREE) on a Tektronix or DEC ReGIS compatible terminal, you will want
before running the program to have issued the command:
$ SET TERM/NOWRAP/ESCAPE
so that you do not run into trouble from the VMS line length limit of
255 characters or the filtering of escape characters.
Some later versions of Digital's VAX VMS operating system have a C
compiler that no longer needs the VAXFIX patch. If so, follow the
instructions below for OpenVMS and all will be well.
OpenVMS DEC Alpha systems
------- --- ----- -------
The OpenVMS operating system on Digital AlphaStations and other Digital
Alpha AXP computers has many of the properties of the VAX VMS systems
mentioned above except one important one. It does not need any of the
VAXFIX.C corrections. Thus the programs should be compiled without this.
Remove all mention of VAXFIX from COMPILE.COM (the lines compiling it
and the linking of it). Also take PHYLIP.H and comment out the section
in which "vax_printf_is_broken" is defined. Then the compilation should
proceed normally.
Cray
----
A number of people (F. James Rohlf, Kent Fiala, Shan Duncan, and Ron
DeBry) succeeded in various ways in adapting the Pascal version of
PHYLIP to several models of Crays. Recently Cray has been adopting
Unicos, a Unix clone, as the operating system for its machines, and this
means the Unix instructions should work for compiling the programs on
Crays.
However, although the underlying algorithms of most programs, which
treat
sites independently, should be amenable to vector processors, there are
details
of the code which might best be changed. In particular within the
innermost
loops of the programs there are often scalar quantities that are
used for
temporary bookkeeping. These quantities, such as sum1, sum2, zz, z1,
yy, y1,
aa, bb, cc, sum, and denom in procedure makenewv of DNAML (and
similar
quantities in procedure nuview) are there to minimize the number of
array
references.
For vectorizing compilers such as the Cray compilers it
will be
better to replace them by arrays so that processing can occur
simultaneously.
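The change suggested above is sometimes called scalar expansion. The
following is only a minimal sketch of the idea: the function names, the
arrays x and w, the bound MAXSITES, and the arithmetic are all
hypothetical and are not taken from makenewv or nuview. The first
function keeps its bookkeeping in scalar temporaries inside the loop over
sites; the second stores the same quantities in arrays indexed by site,
so that each loop is a simple whole-array operation.

```c
#include <assert.h>
#include <stddef.h>

#define MAXSITES 1024   /* hypothetical upper bound on the number of sites */

/* Scalar form: zz and yy are scalar temporaries that are reused on
   every pass through the loop over sites. */
double sum_scalar(const double *x, const double *w, size_t nsites)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < nsites; i++) {
        double zz = x[i] * w[i];   /* temporary for this site */
        double yy = 1.0 - zz;      /* second temporary for this site */
        sum += zz * yy;
    }
    return sum;
}

/* Expanded form: the temporaries become arrays with one element per
   site, and the work is split into loops that each do one operation
   across all sites. */
double sum_expanded(const double *x, const double *w, size_t nsites)
{
    static double zz[MAXSITES], yy[MAXSITES];
    double sum = 0.0;
    size_t i;

    assert(nsites <= MAXSITES);
    for (i = 0; i < nsites; i++)
        zz[i] = x[i] * w[i];
    for (i = 0; i < nsites; i++)
        yy[i] = 1.0 - zz[i];
    for (i = 0; i < nsites; i++)
        sum += zz[i] * yy[i];
    return sum;
}
```

Both functions compute the same quantity. Whether the expanded form is
actually faster depends on the compiler and the machine, so it would be
worth timing both before changing the real code.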
IBM Mainframes running CMS
--- ---------- ------- ---
The following information applies not only to IBM mainframes, but to
IBM-compatible mainframes such as Amdahls, Fujitsus, Hitachis, and ICLs
when they run IBM operating systems or IBM-compatible operating systems.
It does not apply to IBM mainframes running AIX (IBM's version of Unix)
as for those one can simply use the Unix instructions above without
modification.
Because IBM is IBM, it tried to impose the EBCDIC character code on the
world.
There are good arguments for and against EBCDIC; in any case, the
ASCII (or
ISO) code is winning out. I have chosen to distribute PHYLIP in the
ASCII
character code, as more likely to be readable on more machines.
Some
characters in ASCII have no equivalent in EBCDIC and get arbitrarily
changed
when my ASCII files are read into an EBCDIC machine. You may
find some
characters which look strange when viewed on a 3270 terminal on a CMS
system,
but we have found none that cause trouble for the compiler.
Andrew Keeffe was asked to investigate how to compile the C version of
PHYLIP on our IBM 3090 system, and here is what he has found.
These are the procedures for compiling the phylip package in C on an IBM
mainframe. These instructions were developed using IBM C/370 on an IBM
3090 running VM/XA CMS 5.6 Service Level 201.
If you fetch PHYLIP directly as an ftp binary transfer, getting a
compressed tar archive file, as available from our machine, we do not
know whether there is an "uncompress" and a "tar" utility available on
CMS to extract the files from the archive and translate them from ASCII
to EBCDIC. You should ask your computer consultants about that.
Alternatively, you could fetch the files to a PCDOS or Unix machine,
extract the archives there, and then move the resulting text files for
the source code and documentation to the CMS system. If you do that,
after establishing the connection between the IBM and the other host,
type will translate the text files properly.
CMS prefers the names of files to have a minimum of two parts,
called the
filename (abbreviated fn) and the filetype (abbreviated ft),
separated by a
space. We have chosen "data" as the filetype, so that "infile" becomes
"infile
data", "outfile" becomes "outfile data" and so forth.
All commands that you give to the host are shown in UPPER CASE. You can
type them in upper or lower case; CMS does not care.
Before compiling, give these commands to CMS:
SETUP C370
GLOBAL TXTLIB EDCBASE IBMLIB
It would make sense to put these commands in your profile exec until the
compiling and linking is complete.
To compile a single program, such as dnapars.c:
CC DNAPARS
If there are no errors, the compiler will produce a file with the same
filename
and a filetype of 'text', DNAPARS TEXT in this case. Now give these
commands:
LOAD DNAPARS
GENMOD DNAPARS
The genmod command generates an executable module file (DNAPARS MODULE)
which
may be invoked by typing its name on the command line. Use this
procedure to
compile all of the phylip programs except dnaml, dnamlk, restml,
drawgram, and
drawtree.
The source files for dnaml, dnamlk, and restml have been split into two
parts.
To compile one of these programs, give these commands:
CC DNAML
CC DNAML2
LOAD DNAML DNAML2
GENMOD DNAML
Proceed similarly for dnamlk and restml.
The draw programs, drawgram and drawtree, both depend on common code
which
is stored in drawgraphics.c and drawgraphics.h. These names will be
truncated
to DRAWGRAP C and DRAWGRAP H on the CMS system. The contents of the
files are
not affected.
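As a rough illustration of the renaming described above, the sketch
below turns a name such as "drawgraphics.c" into the two-part CMS form
"DRAWGRAP C". It assumes that each part is simply cut off after 8
characters and converted to upper case; the function name cms_name is
hypothetical, and the way your own file transfer program renames files
may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CMS_PART_MAX 8   /* assumed limit on a CMS filename or filetype */

/* Convert a name such as "drawgraphics.c" into the CMS form
   "DRAWGRAP C": the part before the last dot becomes the filename and
   the part after it becomes the filetype, each cut to CMS_PART_MAX
   characters and upper-cased.
   out must have room for 2 * CMS_PART_MAX + 2 characters. */
void cms_name(const char *unix_name, char *out)
{
    const char *dot = strrchr(unix_name, '.');
    size_t fn_len = dot ? (size_t)(dot - unix_name) : strlen(unix_name);
    size_t i, n = 0;

    assert(fn_len > 0);
    for (i = 0; i < fn_len && i < CMS_PART_MAX; i++)
        out[n++] = (char)toupper((unsigned char)unix_name[i]);
    if (dot != NULL && dot[1] != '\0') {
        out[n++] = ' ';
        for (i = 1; dot[i] != '\0' && i <= CMS_PART_MAX; i++)
            out[n++] = (char)toupper((unsigned char)dot[i]);
    }
    out[n] = '\0';
}
```

Names that already fit, such as "dnaml2.c", come through unchanged apart
from the upper case and the space ("DNAML2 C").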
Compile the drawgraphics code:
CC DRAWGRAP
Compile and link the draw programs:
CC DRAWGRAM
LOAD DRAWGRAM DRAWGRAP
GENMOD DRAWGRAM
CC DRAWTREE
LOAD DRAWTREE DRAWGRAP
GENMOD DRAWTREE
If you are having trouble getting the programs running on your machine,
contact
me.
If I can't help, I can at least find out whether there is anyone
else who
has adapted them to the same machine and put you in touch with them.
Other Computer Systems
----- -------- -------
As you can see from the variety of different systems on which these
programs have been successfully run, there are no serious
incompatibility problems with most computer systems. PHYLIP in various
past Pascal versions has also been compiled on 8080 and Z80 CP/M
systems, Apple II systems running UCSD Pascal, a variety of minicomputer
systems such as DEC PDP-11's and HP 1000's, CDC Cyber systems, and so
on. We hope gradually to accumulate experience on a wider variety of C
compilers. If you succeed in compiling the C version of PHYLIP on a
different machine or a different compiler, I would like to hear the
details so that I can include the instructions in a future version of
this manual.
FREQUENTLY ASKED QUESTIONS
(1) "If I copied PHYLIP from a friend without you knowing, should I
try to
keep you from finding out?". No. It is to your advantage and mine for
you to
let me know. If you did not get PHYLIP "officially" from me or from
someone
authorized by me, but copied a friend's version, you are not in my
database of
users.
You probably also have an old version which has
since
been
substantially improved (see the beginning of this main document file
for the
date on which this version was released).
I don't mind you
"bootlegging"
PHYLIP (it's free anyway, and that saves me the work of writing
diskettes), but
you should realize that you may have an outdated version. You may be
able to
get the latest version just as quickly over Internet. You can read
about
subsequent bug fixes in the electronic news bulletins the person you
got it
from may (or may not) have subscribed to. It will help both of us if
you get
onto my mailing list. If you are on it, then I will give your name to
other
nearby users when they get a new copy, and they are urged to contact
you and
update your copy.
(I benefit by getting a better feel for how
many
distributions there have been, and having a better mailing list to use
to give
other users local people to contact). Send me your name and address
(five
lines maximum), and your phone number, with the number of the version
that you
have, plus the type of your computer, operating system, and C compiler,
so that
I can add you to the address list. Note also the listserver information
which
you can get, which provides news about PHYLIP by electronic mail.
This is
described in the next to last section of this document.
(2) "How do I make a citation to the PHYLIP package in the paper
I am
writing?"
One way is like this:
Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version
3.5c.
Distributed by the author. Department of Genetics, University of
Washington, Seattle.
or if the editor for whom you are writing insists that the citation must
be to
a printed publication, you could cite a notice for version 3.2
published in
Cladistics:
Felsenstein, J. 1989. PHYLIP -- Phylogeny Inference Package (Version
3.2).
Cladistics 5: 164-166.
For a while a printed version of the PHYLIP documentation was available
and one
could cite that. This is no longer true. Other than that, this is
difficult,
because I have never written a paper announcing PHYLIP!
My 1985b
paper in
Evolution (see the References section below) on the bootstrap method
contains a
one-paragraph Appendix describing the availability of this package, and
that
can also be cited as a reference for the package, although it
has been
distributed since 1980 while the bootstrap paper is 1985.
A paper on
PHYLIP
is needed mostly to give people something to cite, as word-of-mouth,
references
in other people's papers, and electronic newsgroup postings have
spread the
word about PHYLIP's existence quite effectively.
(3) "How do I bootstrap? Why has DNABOOT disappeared?"
DNABOOT,
BOOT, and
DOLBOOT, the previous parsimony-based bootstrap programs, have been
removed
from the package as there is now a more general way of
bootstrapping.
It
involves running SEQBOOT to make multiple bootstrapped data sets out
of your
one data set, then running one of the tree-making programs with the
Multiple
data sets option to analyze them all, then running CONSENSE to make a
majority
rule consensus tree from the resulting tree file. Read the
documentation of
SEQBOOT to get further information. Before, only parsimony methods
could be
bootstrapped. With this new system almost any of the tree-making
methods in
the package can be bootstrapped. It is somewhat more tedious but you
will find
it much more rewarding.
(4) "How do I specify a multi-species outgroup with your parsimony
programs?"
It's not a feature but is not too hard to do in many of the
programs. In
parsimony programs like MIX, for which the W (Weights) and A (Ancestral
states)
options are available, and weights can be larger than 1, all you need to
do is:
(a) In MIX, make up an extra character with states 0 for all the
outgroups
and 1 for all the ingroups. If using DNAPARS the ingroup can have
(say)
"G" and the outgroup "A".
(b) Assign this character an enormous weight (such as Z for 35) using
the W
option, all other characters getting weight 1, or whatever weight
they had
before.
(c) If it is available, use the A (Ancestral states) option to
designate that for that new character the state found in the outgroup
is the ancestral state.
(d) In MIX do not use the O (Outgroup) option.
(e) After the tree is found, the designated ingroup should have been
held
together by the fake character. The tree will be rooted somewhere
in the
outgroup (the program may or may not have a preference for one
place in
the outgroup over another). Make sure that you subtract from the
total
number of steps on the tree all steps in the new character.
In programs like DNAPARS, you cannot use this method as weights of sites
cannot be greater than 1. But you can do an analogous trick, by adding a
largish number of extra sites to the data, with one nucleotide state
("A") for the ingroup and another ("G") for the outgroup. You will then
have to use RETREE to manually reroot the tree in the desired place.
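Step (b) of the MIX recipe above can be sketched as follows. This is
only an illustration: the function names are hypothetical, the fake
character is assumed to have been added as the last character, all
original characters are assumed to have weight 1, and weights above 9
are assumed to be written as the letters A to Z for 10 to 35 (which is
consistent with "Z for 35" above). How the weights are actually supplied
to the program is described in the section on the Weights option.

```c
#include <assert.h>
#include <stddef.h>

/* Map a numeric weight 0..35 to the single character used for it:
   '0'-'9' for 0-9, then 'A'-'Z' for 10-35 (so 35 is 'Z').
   A lookup string is used so that the result does not depend on the
   letters being contiguous in the machine's character code. */
char weight_char(int w)
{
    static const char symbols[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    assert(w >= 0 && w <= 35);
    return symbols[w];
}

/* Fill out[] with one weight character per character in the data:
   weight 1 for each of the nchars original characters, followed by
   fake_weight for the extra ingroup/outgroup character added at the
   end. out must have room for nchars + 2 characters. */
void make_weights(size_t nchars, int fake_weight, char *out)
{
    size_t i;

    for (i = 0; i < nchars; i++)
        out[i] = weight_char(1);
    out[nchars] = weight_char(fake_weight);
    out[nchars + 1] = '\0';
}
```

For a data set with 5 original characters, make_weights(5, 35, buf)
leaves the string "11111Z" in buf.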
(5) "How do I force certain groups to remain monophyletic in your
parsimony
programs?"
By the same method, using multiple fake characters, any
number of
groups of species can be forced to be monophyletic.
In MOVE,
DOLMOVE, and
DNAMOVE you can specify whatever outgroups you want without going
to this
trouble.
(6) "How can I reroot one of the trees written out by PHYLIP?" Use the
program
RETREE. But keep in mind whether the tree inferred by the original
program was
already rooted, or whether you are free to reroot it.
(7) "Why doesn't NEIGHBOR read my DNA sequences correctly?". Because it
wants
to have as input a distance matrix, not sequences. You have to use
DNADIST to
make the distance matrix first.
(8) "What do I do about deletions and insertions in my sequences?"
The
molecular sequence programs will accept sequences that have gaps
(the "-"
character). They do various things with them, mostly not optimal.
DNAPARS
counts "gap" as if it were a fifth nucleotide state (in addition to A,
C, G,
and T). Each site counts one change when a gap arises or disappears.
The
disadvantage of this treatment is that a long gap will be
overweighted, with
one event per gapped site. So a gap of 10 nucleotides will count as
being as
much evidence as 10 single site nucleotide substitutions. If there
are not
overlapping gaps, one way to correct this is to recode the first site
in the
gap as "-" but make all the others be "?" so the gap only counts as one
event.
Other programs such as DNAML and DNADIST count gaps as equivalent to
unknown
nucleotides (or unknown amino acids) on the grounds that we don't
know what
would be there if something were there.
This completely leaves
out the
information from the presence or absence of the gap itself, but does
not bias
the gapped sequence to be close to or far from other gapped or
ungapped
sequences.
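The recoding suggested above for DNAPARS (first site of each gap left as
"-", the remaining sites of that gap changed to "?") can be sketched as
below. The function name recode_gaps is hypothetical, and the sketch
looks at one sequence at a time, so it does not check the condition that
gaps in different sequences do not overlap; that still has to be
verified by eye.

```c
#include <assert.h>
#include <stddef.h>

/* Recode gaps in place in one aligned sequence: within every run of
   consecutive '-' characters the first '-' is kept and each following
   '-' is changed to '?', so that the whole gap counts as one event.
   All other characters are left untouched. */
void recode_gaps(char *seq)
{
    int in_gap = 0;   /* nonzero while we are inside a run of '-' */
    size_t i;

    assert(seq != NULL);
    for (i = 0; seq[i] != '\0'; i++) {
        if (seq[i] == '-') {
            if (in_gap)
                seq[i] = '?';   /* later site of the same gap */
            in_gap = 1;
        } else {
            in_gap = 0;
        }
    }
}
```

Applied to the sequence "AC----GT-A--" this gives "AC-???GT-A-?".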
(9) "Why don't your parsimony programs print out branch lengths?"
Because
there are problems defining the branch lengths. If you look closely
at the
reconstructions of the states of the hypothetical ancestral nodes for
almost
any data set and almost any parsimony method you will find some
ambiguous
states on those nodes. There is then usually an ambiguity as to which
branch
the change is actually on. Other parsimony programs resolve this in
one or
another arbitrary fashion, sometimes with the user specifying how (for
example,
methods that push the changes up the tree as far as possible or down it
as far
as possible). I have preferred to leave it to the user to do this.
Few
programs available from others currently correct the branch
lengths for
multiple changes of state that may have overlain each other. One
possible way
to get branch lengths with nucleotide sequence data is to take
the tree
topology that you got, use RETREE to convert it to be unrooted,
prepare a
distance matrix from your data using DNADIST, and then use FITCH with
that tree
as User Tree and see what branch lengths it estimates.
(10) "Why can't your programs handle unordered multistate characters?"
Well,
they can if they are 4-state characters whose states are A, C, G, and T
(or U)
because then one can use the DNA sequence parsimony programs. But in
general
the discrete characters parsimony programs can only handle two states, 0
and 1.
This is mostly because I have not yet had time to modify them to do so - the
modifications would have to be extensive. Ultimately I hope to get these
done,
but in the meantime the best I can do is suggest that you either use one
of the
excellent parsimony programs produced by others (PAUP or Hennig86, for
example)
or if you have four or fewer states recode your states to look like
nucleotides
and use the parsimony programs in the molecular sequence section of
PHYLIP.
(11) "Where can I get a printed version of the PHYLIP documents?"
For the
moment, you can only get a printed version by printing it
yourself. For
versions 3.1 to 3.3 a printed version was sold by Christopher Meacham
and Tom
Duncan, then at the University Herbarium of the University of
California at
Berkeley. But they have had to discontinue this as it was too much work.
You
should be able to print out the documentation files on almost any
printer and
make yourself a printed version of whichever of them you need.
(12) "Why have I been dropped from your newsletter mailing list?" You
haven't.
The newsletter was dropped. It simply was too hard to mail it out to
such a
large mailing list. The last issue of the newsletter was Number 9 in
May,
1987.
I am hoping that the Listserver News Bulletins will replace
the old
PHYLIP Newsletter. If you have electronic mail access you should
definitely
sign up for these bulletins. For details see the section on the
Listserver
News Bulletins below.
(13) "How many copies of PHYLIP have been distributed?" Currently (July,
1995)
I have a bit over 2700 registered installations worldwide. Of course
there are
many more people who have got copies from friends. PHYLIP is the most
widely
distributed phylogeny package.
PAUP is catching up in terms of
official
registrations, but PHYLIP is probably far ahead in terms of numbers of
actual
copies out there. In terms of phylogenies published, however, PAUP is
ahead,
but PHYLIP is gaining on it. In recent years magnetic tape
distribution of
PHYLIP has declined precipitously, electronic mail distribution is
decreasing,
and there has been a slow decrease of diskette distributions. But all
this has
been more than offset by a huge explosion of distributions by
anonymous ftp
over Internet (a rate of about 6 ftp sessions per day, at the moment).
Because
some people who get the package by anonymous ftp forget to register
their
copies, it is hard to estimate how many people have got it this way.
ADDITIONAL FREQUENTLY ASKED QUESTIONS, OR:
"Why didn't it occur to you to ...
(1) ... write these programs in Pascal?" These programs started
out in
Pascal in 1980.
In 1993 we released both Pascal and C
versions. All
future versions will be C-only. I make fewer mistakes in Pascal and do
like
the language better than C, but C has overtaken Pascal and Pascal
compilers are
starting to be hard to find on some machines.
Also C is a bit
better
standardized which makes the number of modifications a user has to
make to
adapt the programs to their system much less.
(2) ... forget about all those inferior systems and just develop
PHYLIP
for Unix?" This is self-answering, since the same people first said I
should
just develop it for Apple II's, then for CP/M Z-80's, then for IBM
PCDOS, and
now they're starting to tell me to just develop it for Macintoshes or
for Sun
workstations. If I had listened to them and done any one of these, I
would
have had a very hard time adapting the package to any of the other
ones once
these folks changed their mind!
(3) ... write these programs in PROLOG (or Ada, or Modula-2, or
SIMULA, or
BCPL, or PL/I, or APL, or LISP)?" These are all languages I have
considered.
All have advantages, but they are not really spreading (C is).
(4) ... include in the package a program to do the Distance Wagner
method,
(or successive approximations character weighting, or transformation
series
analysis)?" In most cases where I have not included other methods,
it is
because I decided that they had no substantial advantages over
methods that
were included (such as the programs FITCH, KITSCH, NEIGHBOR, the T
option of
MIX and DOLLOP, and the "?" ancestral states option of the discrete
characters
parsimony programs).
(5) ... include in the package ordination methods and more
clustering
algorithms?"
Because this is NOT a clustering package, it's a
package for
phylogeny estimation. Those are different tasks with different
objectives and
mostly different methods.
Mary Kuhner has, however, included in
NEIGHBOR an
option for UPGMA clustering, which will be very similar to KITSCH in
results.
(6) ... include in the package a program to do nucleotide
sequence
alignment?"
Well, yes, I should have, and this is scheduled to be in
future
releases. But multiple sequence alignment programs, in the era after
Sankoff,
Morel, and Cedergren's 1973 classic paper, need to use substantial
computer
horsepower to estimate the alignment and the tree together. So I will be
slow
getting this into the package and in the meantime you may want to
investigate
ClustalV or TreeAlign.
(7) ... send me the programs over the electronic mail network I
use,
BUTTERFLYNET?"
Well, I am trying to. Maybe there is a BUTTERFLYNET
gateway
hanging off FISHNET, which hangs off HAIRNET, which ...
I am
connected to
Internet, which connects to Bitnet. I can mail to Bitnet (EARN,
NetNorth) and
to UUCP networks. Keep in mind that the resulting files take up
about 2.2
Megabytes and that if you are not going to use them on the machine I
send them
to, you will have to download the files to your other machine.
Also in
some
cases networks and gateways lose or truncate files (these can be up to
about
60K long). So sometimes diskette or tape are a better medium.
I
hope to
continually expand and solidify network distribution. For a couple of
years,
PHYLIP has been available over Internet by "anonymous ftp" from my
machine,
evolution.genetics.washington.edu (128.95.12.41).
You can start by
fetching
file "Read.Me" from directory pub/phylip. My electronic mail
addresses are
given at the end of this document. Contact me by electronic mail if
you are
interested in getting PHYLIP over your network but cannot get ftp to
work.
(8) ... let me log in to your computer in Seattle and copy the
files out
over a phone line?" No thanks. It would cost you for over two hours of
long-distance telephone time, plus a half hour of my time and yours in which
I would have
to explain to you how to log in and do the copying.
(9) ... send me a listing of your program?"
Damn it, it's
not "a
program", it's 30 programs, in a total of 87 files. What were you
thinking of
doing, having 1800-line programs typed in by slaves at your end? If you
were
going to go to all that trouble why not try network transfer or
diskettes? If
you have these then you can print out all the listings you want to and
add them
to the huge stack of printed output in the corner of your office.
(This and
the following two questions, once common, are finally disappearing,
I am
pleased to report).
(10) ... write a magnetic tape in our computer center's favorite
format
(inverted Lithuanian EBCDIC at 998 bpi)?" Because the ANSI standard
format is
the most widely used one, and even though your computer center may
pretend it
can't read a tape written this way, if you sniff around you will find a
utility
to read it. It's just a LOT easier for me to let you do that work. If I
tried
to put the tape into your format, I would probably get it wrong anyway.
(11) ... give us a version of these in FORTRAN?" Because the
programs are
FAR easier to write and debug in C or Pascal, and cannot easily be
rewritten
into FORTRAN (they make extensive use of recursive calls and of
records and
pointers). In any case, C is widely available. If you don't have a C
compiler
or don't know how to use it, you are going to have to learn a language
like C
or Pascal sooner or later, and the sooner the better.
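The remark in (11) about recursive calls, records, and pointers can be
illustrated with a small sketch. It is written in Python rather than C
only to keep it short, and it is not code from the package: the names
Node, to_newick, and write_tree are hypothetical. Each node is a record
that refers to its descendants, and the tree is written out in
nested-parenthesis form by a routine that calls itself on each subtree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a tree: a record holding references to its descendants."""
    name: str = ""                    # species name (empty for interior nodes)
    length: Optional[float] = None    # length of the branch above this node
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node) -> str:
    """Recursively write the subtree below `node` in nested-parenthesis form."""
    if node.children:
        # An interior node is written as its subtrees, each one produced
        # by a recursive call on the corresponding child record.
        text = "(" + ",".join(to_newick(child) for child in node.children) + ")"
    else:
        text = node.name
    if node.length is not None:
        text += ":" + format(node.length, "g")
    return text


def write_tree(root: Node) -> str:
    """Write the whole tree, ending it with a semicolon."""
    return to_newick(root) + ";"


# Hypothetical three-species tree, used only for illustration.
tree = Node(children=[
    Node(children=[Node("A", 0.1), Node("B", 0.2)], length=0.05),
    Node("C", 0.3),
])
print(write_tree(tree))
```

A language without recursion or linked records would need an explicit
stack and parallel arrays to do the same job, which is the kind of
rewriting that (11) describes as not easy.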
NEW FEATURES IN RECENT VERSIONS
Version 3.5 has many new features.
They include:
1. The programs now exist in C as well as in Pascal. In the future we
will
support only the C versions, and as of now will not make any more
improvements
in the Pascal version. It will cease to be distributed with the next
release
of PHYLIP.
A Makefile has been included in the distribution to
simplify the
problems of compiling the package. The existence of a C compiler on
most
workstations means that we have ceased to directly distribute
executables for
workstations, as people can easily create them themselves by
following our
instructions.
2. All programs now have had the upper limits on the numbers of
species and
numbers of sites (or characters) removed. They instead use the
"malloc" and
"free" functions of C to try to allocate as much memory as they need. If
they
fail to find it they will complain, and you will have to look for a
bigger
machine, or install more memory, or remove other jobs that are
competing for
the memory. We no longer have to guess how large a computer you have and
where
you want to put the tradeoff between species and sites.
3. The program SEQBOOT has now fully superseded the former programs
DNABOOT,
BOOT, and DOLBOOT, which have been withdrawn. SEQBOOT also now can
carry out
Archie-Faith permutation of characters across species.
4. The DNA likelihood programs DNAML and DNAMLK now have a revised
Categories
option that allows them to cope with rate variation from site to site.
Instead
of the user specifying in advance the rate category of each site, they
need
only specify how many categories there are, what their rates are, what
their
relative probabilities are, and how long are the patches of spread of a
single
rate along the molecule, on average. The program then computes the
likelihood
allowing for all of these, and adding up over all possibilities of
rate
patterns, without being dependent on assuming that it has inferred
rates at
individual sites correctly. This should go far to address the criticism
that
maximum likelihood assumes constancy of rate at all sites.
5. A new program PROTDIST has been added to compute distance matrices
from
protein sequences, using several different methods. This will allow
protein
sequence data to be analyzed by distance matrix methods as well as
parsimony
methods.
6. A new program, RETREE, has been added to allow users easily
and
interactively to reroot trees, flip branches around, change or remove
branch
lengths, change species names, and so on.
7. Programs that estimate a tree with branch lengths can now all read in
a user tree that has branch lengths and be told to use these rather than
re-estimating them (this was already possible for DNAML and DNAMLK). In
addition, the ones that estimate an unrooted tree (DNAML, FITCH, RESTML
and CONTML) can read in a tree with branch lengths on some branches and
not on others, and be told to hold the ones they read in constant while
iterating the rest. Thus you can, for example, specify that a certain
branch must have length zero.
8. DRAWTREE and DRAWGRAM can now write out a PICT file that can be read
by the
MacDraw drawing program.
They can also write out the file format for
the Xwindows drawing program XFIG, and the input format for the freely distributed
ray tracing program RAYSHADE (for trees seen in 3 dimensions floating
above a
landscape). In addition they allow fonts to be specified for species
names
when a Postscript printer is being used, and they can also make an Xwindows
X-bitmap file. DRAWTREE has a new option that allows the program to
(slowly)
calculate node positions so as to make them avoid each other better.
Both
programs now, when plotting on raster devices such as dot-matrix
printers, use
round pens to make the lines smoother, and are faster at drawing the
lines.
9. DNADIST now computes its distances much more quickly. It also can
compute
the Nei and Jin (1991) distance that allows for rate variation among
sites.
10. The programs that estimate trees by adding species sequentially to a
tree
(PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, FITCH, KITSCH,
MIX, and
DOLLOP) now allow the user to specify that multiple tries will be made
with
different input orders of species (using the Jumble option) with only the
trees
tied for best overall being reported. The trees found will be those
that are
tied for best among all of those found by all these runs, not the trees
found
as best by each run. This improves the chances of finding the best tree.
11. A program COALLIKE was added to compute likelihood functions for
4Nu, the
product of 4 times the effective population size times the mutation
rate, for
samples of genes from a single isolated population, where the program
read
trees that had been sampled from the data by bootstrapping followed by
maximum
likelihood. This method was described by me in a paper in late
1992 in
Genetical Research.
Subsequent work by Richard Hudson and our lab has
shown
the method to be biased. It has been withdrawn from the package in
version
3.57.
It is replaced by a program "coalesce" in a new package, LAMARC,
which
is available from our ftp server.
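The summation over rate patterns described in item 4 above can be
sketched as follows. This is only an illustration in Python, not the
code of DNAML or DNAMLK. It assumes (the assumption is mine, it is not
stated in the list above) that a site keeps the rate category of the
preceding site with probability 1 - 1/L, where L is the mean patch
length, and otherwise draws a fresh category from the category
probabilities. The likelihood of each site given each rate category is
taken as an input, and the function and argument names are hypothetical.

```python
def likelihood_over_rate_patterns(cond_like, probs, mean_patch_len):
    """Sum the likelihood over all assignments of rate categories to sites.

    cond_like[i][k] is the likelihood of site i given rate category k,
    probs[k] is the probability of category k, and mean_patch_len is the
    assumed average length of a run of sites sharing one rate (>= 1).
    """
    if mean_patch_len < 1.0:
        raise ValueError("mean patch length must be at least 1")
    # Assumed autocorrelation: probability that the rate carries over.
    lam = 1.0 - 1.0 / mean_patch_len

    # forward[k] = likelihood of the sites so far, with the latest site
    # in category k, summed over the categories of all earlier sites.
    forward = [p * like for p, like in zip(probs, cond_like[0])]
    for site in cond_like[1:]:
        total = sum(forward)
        forward = [
            (lam * forward[k] + (1.0 - lam) * probs[k] * total) * site[k]
            for k in range(len(probs))
        ]
    return sum(forward)


# Two sites, two equally probable categories (made-up numbers).
cond = [[0.1, 0.2], [0.3, 0.4]]
print(likelihood_over_rate_patterns(cond, [0.5, 0.5], 1.0))
```

With a mean patch length of 1 the sites are independent, and the result
is the product over sites of the category-averaged likelihoods. The work
grows with the number of sites times the number of categories, not with
the number of possible rate patterns. A real implementation would work
with logarithms or rescale at each site to avoid underflow on long
sequences.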
Version 3.4 also had many new features.
They included:
1. All programs were given interactive menus which allow the user to
see and
alter option settings.
The programs read from a file INFILE and
write to a
file OUTFILE, as well as to a treefile TREEFILE. The result should be
much
easier for novice users to deal with. Most of the options which once
were set
by altering the input file can now be selected using the menu.
Only
options
that require separate information for each character or site, such as
Weights,
Ancestors, Factors, and the Categories option continued to require
that
information be entered into the input file (although user-defined trees
are put
there also).
2. The molecular sequence programs now allowed either interleaved or
sequential
sequence input (i.e. sequences put in in "aligned" form or by having all
of one
sequence followed by all of another). The choice is made using the
interactive
menu.
3. Three new programs were added:
NEIGHBOR carried out Saitou and
Nei's
neighbor-joining method for distance matrix data which is much
faster than
FITCH and KITSCH and should be able to handle much larger data sets.
It
also
carried out the UPGMA clustering method. SEQBOOT allowed the user to
bootstrap
nucleotide sequence data sets, protein sequence data sets, or
discrete-characters data sets and write out to a file the multiple data
sets that
result. CONTRAST accepted a continuous-characters data set and a
series of
user trees, and wrote out the series of contrasts for each character
that are
independent under a Brownian motion model of character evolution, as
well as
regressions, correlations, and covariances between them.
4. All of the programs that inferred trees now accepted multiple data
sets.
This allowed us to use SEQBOOT together with this feature to
analyze
bootstrapped data sets and find different trees for the different
bootstrap
replicates.
Their variation could be summarized by the consensus tree
program
CONSENSE. Thus almost everything in this package could now be
bootstrapped.
5. A serious error was fixed in version 3.31. It had made the DNA
likelihood programs and DNADIST give incorrect results when the
Categories option was used and there was more than one category of
rates. Runs that used Categories with these programs before that
version should be rerun.
6. Almost all programs now printed out trees in the "phenogram" form so
that they grew left-to-right, rather than in the triangular diagram used
before.
7. The tree-plotting programs DRAWGRAM and DRAWTREE now supported the
Hewlett-Packard Laserjet printers and also could produce output files
compatible with
the PC-Paint drawing program. The code for placement of interior
nodes in
DRAWGRAM was corrected, and preview of trees using Tektronix graphics
was made
easier by having it clear the screen more often.
8. The DNA likelihood program DNAML now ran about 60% faster.
9. The restriction sites likelihood program RESTML now allowed for
data arising from digests with multiple enzymes.
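The bootstrapping of data sets mentioned in items 3 and 4 above can be
sketched as follows. This is a minimal illustration in Python, not
SEQBOOT's own code. It assumes that each replicate is made by sampling
sites (columns) with replacement, using the same sampled columns for
every species, and the function names and the tiny data set are
hypothetical.

```python
import random


def bootstrap_sites(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    `alignment` maps species names to equal-length sequences. Sites are
    drawn with replacement, and the same draw is applied to every
    species so that each sampled column stays intact.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    if any(len(alignment[name]) != n_sites for name in names):
        raise ValueError("all sequences must have the same length")
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


def bootstrap_data_sets(alignment, replicates, seed):
    """Return a list of `replicates` bootstrapped data sets."""
    rng = random.Random(seed)
    return [bootstrap_sites(alignment, rng) for _ in range(replicates)]


# Made-up four-site alignment of three species.
aln = {"A": "ACGT", "B": "AGGT", "C": "TCGA"}
for rep in bootstrap_data_sets(aln, 3, seed=1):
    print(rep)
```

Each replicate would then be analyzed as one of the multiple data sets,
and the resulting trees summarized by a consensus tree.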
COMING ATTRACTIONS, FUTURE PLANS
There are some obvious deficiencies in this version. Some of these
holes will be filled in the next few releases (3.6, 3.7, etc.).
They include:
1. A program to align molecular sequences on a predefined User
Tree may
ultimately be included. This will allow alignment and phylogeny
reconstruction
to proceed iteratively by successive runs of two programs, one aligning
on a
tree and the other finding a better tree based on that alignment.
In the
shorter run a simple two-sequence alignment program may be included.
2. An interactive "likelihood explorer" for DNA sequences will be
written. This will allow, either with or without the assumption of a
molecular clock, trees to be varied interactively so that the user can
get a much better feel for the shape of the likelihood surface.
Likelihood will be able to be plotted against branch lengths for any
branch.
3. The DNAML and DNAMLK programs will reinstate the previous
Categories
option, where the user specified categories of rates of evolution
for each
site, but also retaining the present one, that infers them.
The hope
is to
allow for variation in rate in 1st, 2nd and 3rd positions in a coding
sequence
(these being identified by the user) while also allowing for
autocorrelated
rates of evolution in adjacent codons.
4. If possible we will find some way of correcting for
purine/pyrimidine
richness variations among species, within the framework of the
maximum
likelihood programs. That the maximum likelihood programs do not
allow for
base composition variation is their major limitation at the moment.
5. Inclusion of some kind of protein sequence maximum likelihood program
is an
obvious need (right now we have Adachi and Hasegawa's program
in the
Unsupported Division).
6. The Categories option of DNAML and DNAMLK will be generalized to
allow for
rates at sites to gradually change as one moves along the tree, in an
attempt
to implement Fitch and Markowitz's (1970) notion of "covarions".
7. Obviously we need to start thinking about a more visual X windows
interface,
but only if that can be used on most systems.
8. Program PENNY and its relatives will be improved so as to run faster and
find
all most parsimonious trees more quickly.
9. A more sophisticated compatibility program should be included, if
I can find one.
10. An "evolutionary clock" version of CONTML will be done, and the
same may
also be done for RESTML.
12. We hope gradually to generalize the tree structures in the
programs to infer multifurcating trees as well as bifurcating ones.
13. We hope to economize on the size of the source code, and enforce
some
standardization of it, by putting frequently used routines in a
library from
which they can be linked into various programs. This will enforce a
rather
complete standardization of our code.
14. We may decide to gradually move our code to an object-oriented
language,
most likely C++. One could describe the language that version 3.4 was
written
in as "Pascal", version 3.5 as "Pascal written in C", version 4.0 as "C
written
in C", and maybe version 4.1 as "C++ written in C" and then 4.2 as "C++
written
in C++".
At least that scenario is one possibility.
Much of the future development of the package will be in
the DNA
likelihood programs and the distance matrix programs. This is for
several
reasons. First, I am more interested in those problems. Second,
collection of
molecular data is increasing rapidly, and those programs have the most
promise
for future development for those data.
REFERENCES FOR THE DOCUMENTATION FILES
In the documentation files that follow I frequently refer to papers
in the
literature.
In order to centralize the references they are given
in this
section. If you want to find further papers beyond these, my Quarterly
Review
of Biology review of 1982 and my Annual Review of Genetics review of
1988 list
many further references. The chapter by David Swofford and Gary Olsen
(1990)
is also an excellent review of the issues in phylogeny reconstruction.
Adams, E. N. 1972. Consensus techniques and the comparison of taxonomic
trees. Systematic Zoology 21: 390-397.
Adams, E. N. 1986. N-trees as nestings: complexity, similarity, and
consensus. Journal of Classification 3: 299-317.
Archie, J. W. 1989. A randomization test for phylogenetic information
in
systematic data. Systematic Zoology 38: 219-252.
Astolfi, P., K. K. Kidd, and L. L. Cavalli-Sforza. 1981. A comparison
of
methods of reconstructing evolutionary trees. Systematic Zoology
30:
156-169.
Baum, B. R. 1989. PHYLIP: Phylogeny Inference Package. Version 3.2.
(Software
review). Quarterly Review of Biology 64: 539-541.
Bron, C., and J. Kerbosch. 1973. Algorithm 457: Finding all cliques of
an
undirected graph. Communications of the Association for Computing
Machinery 16: 575-577.
Camin, J. H., and R. R. Sokal. 1965. A method for deducing branching
sequences in phylogeny. Evolution 19: 311-326.
Carpenter, J. 1987a. A report on the Society for the Study of Evolution
workshop "Computer Programs for Inferring Phylogenies". Cladistics
3:
363-375.
Carpenter, J. 1987b. Cladistics of cladists. Cladistics 3: 363-375.
Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic
analysis:
models and estimation procedures. Evolution 32: 550-570 (also
Amer. J.
Human Genetics 19: 233-257).
Cavender, J. A. and J. Felsenstein. 1987. Invariants of phylogenies in
a
simple case with discrete states. Journal of Classification 4: 57-71.
Churchill, G.A. 1989. Stochastic models for heterogeneous DNA
sequences.
Bulletin of Mathematical Biology 51: 79-94.
Conn, E. E. and P. K. Stumpf. 1963. Outlines of Biochemistry. John
Wiley and
Sons, New York.
Day, W. H. E. 1983. Computationally difficult parsimony problems in
phylogenetic systematics. Journal of Theoretical Biology 103: 429-438.
Dayhoff, M. O. 1979. Atlas of Protein Sequence and Structure, Volume 5,
Supplement 3, 1978. National Biomedical Research Foundation,
Washington,
D.C.
DeBry, R. W. and N. A. Slade. 1985. Cladistic analysis of restriction
endonuclease cleavage maps within a maximum-likelihood framework.
Systematic Zoology 34: 21-34.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood
from
incomplete data via the EM algorithm. Journal of the Royal
Statistical
Society B 39: 1-38.
Eck, R. V., and M. O. Dayhoff. 1966. Atlas of Protein Sequence and
Structure
1966. National Biomedical Research Foundation, Silver Spring,
Maryland.
Edwards, A. W. F., and L. L. Cavalli-Sforza. 1964. Reconstruction of
evolutionary trees. pp. 67-76 in Phenetic and Phylogenetic
Classification, ed. V. H. Heywood and J. McNeill. Systematics
Association
Volume No. 6. Systematics Association, London.
Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris. 1976a. A
mathematical foundation for the analysis of character compatibility.
Mathematical Biosciences 23: 181-187.
Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris. 1976b. An
algebraic
analysis of cladistic characters. Discrete Mathematics 16: 141-147.
Estabrook, G. F., F. R. McMorris, and C. A. Meacham. 1985. Comparison
of
undirected phylogenetic trees based on subtrees of four evolutionary
units. Systematic Zoology 34: 193-200.
Faith, D. P. 1990. Chance marsupial relationships. Nature 345: 393-394.
Faith, D. P. and P. S. Cranston. 1991. Could a cladogram this short
have
arisen by chance alone?: On permutation tests for cladistic
structure.
Cladistics 7: 1-28.
Farris, J. S. 1977. Phylogenetic analysis under Dollo's Law.
Systematic
Zoology 26: 77-88.
Farris, J. S. 1978a. Inferring phylogenetic trees from chromosome
inversion
data. Systematic Zoology 27: 275-284.
Farris, J. S. 1981. Distance data in phylogenetic analysis. pp. 3-23
in
Advances in Cladistics: Proceedings of the first meeting of the
Willi
Hennig Society, ed. V. A. Funk and D. R. Brooks. New York Botanical
Garden, Bronx, New York.
Farris, J. S. 1983. The logical basis of phylogenetic analysis. pp. 1-47 in
Advances in Cladistics, Volume 2, Proceedings of the Second Meeting
of the
Willi Hennig Society. ed. Norman I. Platnick and V. A. Funk.
Columbia
University Press, New York.
Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85.
Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157.
Farris, J. S. ["T. N. Nayenizgani"]. 1990. The systematics association
enters
its golden years (review of "Prospects in Systematics", ed. D.
Hawksworth). Cladistics 6: 307-314.
Felsenstein, J. 1973a. Maximum likelihood and minimum-steps methods for
estimating evolutionary trees from data on discrete characters.
Systematic Zoology 22: 240-249.
Felsenstein, J. 1973b. Maximum-likelihood estimation of evolutionary
trees
from continuous characters. Amer. J. Human Genetics 25: 471-492.
Felsenstein, J. 1978a. The number of evolutionary trees. Systematic
Zoology
27: 27-33.
Felsenstein, J. 1978b. Cases in which parsimony and compatibility
methods
will be positively misleading. Systematic Zoology 27: 401-410.
Felsenstein, J. 1979. Alternative methods of phylogenetic inference and
their
interrelationship. Systematic Zoology 28: 49-62.
Felsenstein, J. 1981a. Evolutionary trees from DNA sequences: a maximum
likelihood approach. J. Molecular Evolution 17: 368-376.
Felsenstein, J. 1981b. A likelihood approach to character weighting and
what
it tells us about parsimony and compatibility. Biological Journal
of the
Linnean Society 16: 183-196.
Felsenstein, J. 1981c. Evolutionary trees from gene frequencies and
quantitative characters: finding maximum likelihood estimates.
Evolution
35: 1229-1242.
Felsenstein, J. 1982. Numerical methods for inferring evolutionary
trees.
Quarterly Review of Biology 57: 379-404.
Felsenstein, J. 1983b. Parsimony in systematics: biological and
statistical
issues. Annual Review of Ecology and Systematics 14:313-333.
Felsenstein, J. 1984a. Distance methods for inferring phylogenies: a
justification. Evolution 38: 16-24.
Felsenstein, J. 1984b. The statistical approach to inferring
evolutionary
trees and what it tells us about parsimony and compatibility. pp.
169-191
in: Cladistics: Perspectives in the Reconstruction of Evolutionary
History, edited by T. Duncan and T. F. Stuessy. Columbia
University
Press, New York.
Felsenstein, J. 1985a. Confidence limits on phylogenies with a
molecular
clock. Systematic Zoology 34: 152-161.
Felsenstein, J. 1985b. Confidence limits on phylogenies: an approach
using
the bootstrap. Evolution 39: 783-791.
Felsenstein, J. 1985c. Phylogenies from gene frequencies: a statistical
problem. Systematic Zoology 34: 300-311.
Felsenstein, J. 1985d. Phylogenies and the comparative method.
American
Naturalist 125: 1-12.
Felsenstein, J. 1986. Distance methods: a reply to Farris. Cladistics
2:
130-144.
Felsenstein, J. and E. Sober. 1986. Parsimony and likelihood: an
exchange.
Systematic Zoology 35: 617-626.
Felsenstein, J. 1988a. Phylogenies and quantitative characters. Annual
Review of Ecology and Systematics 19: 445-471.
Felsenstein, J. 1988b. Phylogenies from molecular sequences: inference
and
reliability. Annual Review of Genetics 22: 521-565.
Felsenstein, J. 1992a. Estimating effective population size from
samples of
sequences: inefficiency of pairwise and segregating sites as
compared to
phylogenetic estimates. Genetical Research 59: 139-147.
Felsenstein, J. 1992b. Phylogenies from restriction sites, a maximum
likelihood approach. Evolution 46: 159-173.
Felsenstein, J. 1992c. Estimating effective population size from
samples of
sequences: a bootstrap Monte Carlo integration approach. Genetical
Research, (December issue), in press.
Fink, W. L. 1986. Microcomputers and phylogenetic analysis. Science
234:
1135-1139.
Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic
trees.
Science 155: 279-284.
Fitch, W. M. 1971. Toward defining the course of evolution: minimum
change
for a specified tree topology. Systematic Zoology 20: 406-416.
Fitch, W. M. 1975. Toward finding the tree of maximum parsimony. pp.
189-230
in Proceedings of the Eighth International Conference on Numerical
Taxonomy, ed. G. F. Estabrook. W. H. Freeman, San Francisco.
Fitch, W. M. and E. Markowitz. 1970. An improved method for determining
codon
variability and its application to the rate of fixation of mutations
in
evolution. Biochemical Genetics 4: 579-593.
George, D. G., L. T. Hunt, and W. C. Barker. 1988. Current methods in
sequence comparison and analysis. pp. 127-149 in Macromolecular
Sequencing and Synthesis, ed. D. H. Schlesinger. Alan R. Liss, New
York.
Gomberg, D. 1966. "Bayesian" post-diction in an evolution process.
unpublished manuscript: University of Pavia, Italy.
Graham, R. L., and L. R. Foulds. 1982. Unlikelihood that minimal
phylogenies
for a realistic biological study can be constructed in reasonable
computational time. Mathematical Biosciences 60: 133-142.
Hasegawa, M. and T. Yano. 1984a. Maximum likelihood method of
phylogenetic
inference from DNA sequence data. Bulletin of the Biometric Society
of
Japan No. 5: 1-7.
Hasegawa, M. and T. Yano. 1984b. Phylogeny and classification of
Hominoidea
as inferred from DNA sequence data. Proceedings of the Japan
Academy 60
B: 389-392.
Hasegawa, M., Y. Iida, T. Yano, F. Takaiwa, and M. Iwabuchi. 1985a.
Phylogenetic relationships among eukaryotic kingdoms as inferred
from
ribosomal RNA sequences. Journal of Molecular Evolution 22: 32-38.
Hasegawa, M., H. Kishino, and T. Yano. 1985b. Dating of the human-ape
splitting by a molecular clock of mitochondrial DNA. Journal of
Molecular
Evolution 22: 160-174.
Hendy, M. D., and D. Penny. 1982. Branch and bound algorithms to
determine
minimal evolutionary trees. Mathematical Biosciences 59: 277-290.
Higgins, D. G. and P. M. Sharp. 1989. Fast and sensitive multiple
sequence
alignments on a microcomputer. Computer Applications in the
Biological
Sciences (CABIOS) 5: 151-153.
Holmquist, R., M. M. Miyamoto, and M. Goodman. 1988. Higher-primate
phylogeny
-- why can't we decide? Molecular Biology and Evolution 5: 201-216.
Inger, R. F. 1967. The development of a phylogeny of frogs. Evolution
21:
369-384.
Jin, L. and M. Nei. 1990. Limitations of the evolutionary parsimony
method of
phylogenetic analysis. Molecular Biology and Evolution 7: 82-102.
Jukes, T. H. and C. R. Cantor. 1969. Evolution of protein molecules.
pp.
21-132 in Mammalian Protein Metabolism, ed. H. N. Munro. Academic
Press,
New York.
Kim, J. and M. A. Burgman. 1988. Accuracy of phylogenetic-estimation
methods
using simulated allele-frequency data. Evolution 42: 596-602.
Kimura, M. 1980. A simple model for estimating evolutionary rates of
base
substitutions through comparative studies of nucleotide sequences.
Journal of Molecular Evolution 16: 111-120.
Kimura, M. 1983. The Neutral Theory of Molecular Evolution. Cambridge
University Press, Cambridge.
Kingman, J. F. C. 1982a. The coalescent. Stochastic Processes and
Their
Applications 13: 235-248.
Kingman, J. F. C. 1982b. On the genealogy of large populations.
Journal of
Applied Probability 19A: 27-43.
Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood
estimate of the evolutionary tree topologies from DNA sequence data,
and
the branching order in Hominoidea. Journal of Molecular Evolution
29:
170-179.
Kluge, A. G., and J. S. Farris. 1969. Quantitative phyletics and the
evolution of anurans. Systematic Zoology 18: 1-32.
Lake, J. A. 1987. A rate-independent technique for analysis of nucleic
acid
sequences: evolutionary parsimony. Molecular Biology and Evolution
4:
167-191.
Le Quesne, W. J. 1969. A method of selection of characters in numerical
taxonomy. Systematic Zoology 18: 201-205.
Le Quesne, W. J. 1974. The uniquely evolved character concept and its
cladistic application. Systematic Zoology 23: 513-517.
Lewis, H. R., and C. H. Papadimitriou. 1978. The efficiency of
algorithms.
Scientific American 238: 96-109 (January issue)
Luckow, M. and D. Pimentel. 1985. An empirical comparison of numerical
Wagner computer programs. Cladistics 1: 47-66.
Lynch, M. 1990. Methods for the analysis of comparative data in
evolutionary
biology. Evolution 45: 1065-1080.
Maddison, D. R. 1991. The discovery and importance of multiple islands
of
most-parsimonious trees. Systematic Zoology 40: 315-328.
Margush, T. and F. R. McMorris. 1981. Consensus n-trees. Bulletin of
Mathematical Biology 43: 239-244.
Nelson, G. 1979. Cladistic analysis and synthesis: principles and
definitions, with a historical note on Adanson's Familles des Plantes
(1763-1764). Systematic Zoology 28: 1-21.
Nei, M. 1972. Genetic distance between populations. American
Naturalist
106: 283-292.
Nei, M. and W.-H. Li. 1979. Mathematical model for studying genetic
variation in terms of restriction endonucleases. Proceedings of the
National Academy of Sciences, USA 76: 5269-5273.
Page, R. D. M. 1989. Comments on component-compatibility in historical
biogeography. Cladistics 5: 167-182.
Platnick, N. 1987. An empirical comparison of microcomputer parsimony
programs. Cladistics 3: 121-144.
Platnick, N. 1989. An empirical comparison of microcomputer parsimony
programs. II. Cladistics 5: 145-161.
Reynolds, J. B., B. S. Weir, and C. C. Cockerham. 1983. Estimation of
the
coancestry coefficient: basis for a short-term genetic distance.
Genetics 105: 767-779.
Rohlf, F. J. and M. C. Wooten. 1988. Evaluation of the restricted
maximum
likelihood method for estimating phylogenetic trees using simulated
allele-frequency data. Evolution 42: 581-595.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution
4:
406-425.
Sanderson, M. J. 1990. Flexible phylogeny reconstruction: a review of
phylogenetic inference packages using parsimony. Systematic Zoology
39:
414-420.
Sankoff, D. D., C. Morel, and R. J. Cedergren. 1973. Evolution of 5S RNA
and the
nonrandomness of base replacement. Nature New Biology 245: 232-234.
Smouse, P. E. and W.-H. Li. 1987. Likelihood analysis of mitochondrial
restriction-cleavage patterns for the human-chimpanzee-gorilla
trichotomy.
Evolution 41: 1162-1176.
Sober, E. 1983a. Parsimony in systematics: philosophical issues.
Annual
Review of Ecology and Systematics 14: 335-357.
Sober, E. 1983b. A likelihood justification of parsimony. Cladistics
1:
209-233.
Sober, E. 1988. Reconstructing the Past: Parsimony, Evolution, and
Inference.
MIT Press, Cambridge, Massachusetts.
Sokal, R. R., and P. H. A. Sneath. 1963. Principles of Numerical
Taxonomy.
W. H. Freeman, San Francisco.
Studier, J. A. and K. J. Keppler. 1988. A note on the neighbor-joining
algorithm of Saitou and Nei. Molecular Biology and Evolution 5:
729-731.
Swofford, D. L. and G. J. Olsen. 1990. Phylogeny reconstruction.
Chapter 11,
pages 411-501 in Molecular Systematics, ed. D. M. Hillis and C.
Moritz.
Sinauer Associates, Sunderland, Massachusetts.
Templeton, A. R. 1983. Phylogenetic inference from restriction
endonuclease
cleavage site maps with particular reference to the evolution of
humans
and the apes. Evolution
37: 221-244.
Thompson, E. A. 1975. Human Evolutionary Trees. Cambridge University
Press,
Cambridge.
Wu, C. F. J. 1986. Jackknife, bootstrap and other resampling plans in
regression analysis.
Annals of Statistics
14: 1261-1295.
CREDITS
Over the years various granting agencies have contributed to the support
of the PHYLIP project (at first without knowing it). They are:

Years       Agency                        Grant or Contract Number
1995-1999   NIH NIGMS                     1 R01 GM51929-01
1992-1995   National Science Foundation   DEB-9207558
1992-1994   NIH NIGMS Shannon Award       2 R55 GM41716-04
1989-1992   NIH NIGMS                     1 R01-GM41716-01
1990-1992   National Science Foundation   BSR-8918333
1987-1990   National Science Foundation   BSR-8614807
1979-1987   U.S. Department of Energy     DE-AM06-76RLO2225 TA DE-AT06-76EV71005

I am particularly grateful to program administrators William Moore,
Irene
Eckstrand, Peter Arzberger, and Conrad Istock, who have gone beyond the
call of
duty to make sure that PHYLIP continued.
Booby prizes for funding are awarded to:
(1) The people at the U.S. Department of Energy who, in 1987, decided
they were
"not interested in phylogenies",
(2) The members of the Systematics Panel of NSF who twice (in 1989 and
1992)
positively recommended that my applications NOT be funded. I am very
grateful
to program director William Moore for courageously overruling their
decision
the first time. The current (1992) Systematics Panel can claim no
credit for
PHYLIP whatsoever.
(3) The members of the 1992 Genetics Study Section of NIH who rated my
proposal
in the 53rd percentile (I don't know if that's 53rd from the top or the
bottom,
but does it matter?), thus denying it funding. I am, however, grateful
to the
NIGMS administrators who supported giving me a "Shannon award"
partially
funding my work for a period in spite of this rating.
The original Camin-Sokal parsimony program and the polymorphism
parsimony
program were written by me in 1977 and 1978. They were Pascal
versions of
earlier FORTRAN programs I wrote in 1966 and 1967 using the same
algorithm to
infer phylogenies under the Camin-Sokal and polymorphism parsimony
criteria.
Harvey Motulsky worked for me as a programmer in 1971 and wrote
FORTRAN
programs to carry out the Camin-Sokal, Dollo, and polymorphism
methods. But
most of the work on PHYLIP other than my own was by Jerry Shurman and
Mark
Moehring.
Jerry Shurman worked for me in the summers of 1979 and 1980, and Mark
Moehring worked for me in the summers of 1980 and 1981. Both wrote
original versions of many of the other programs, based on the original
versions of my Camin-Sokal parsimony program and POLYM. These formed the
basis of Version 1 of the Package, first distributed in October, 1980.
Version 2, released in the spring of 1982, involved a fairly complete
rewrite by me of many of those programs. Jerry and Mark are not to be
held responsible for problems arising from use of these programs.
Hisashi Horino has for version 3.3 reworked some parts of the programs
CLIQUE and CONSENSE to make their output more comprehensible, and has
added some code to the tree-drawing programs DRAWGRAM and DRAWTREE as
well.
My part-time programmers Akiko Fuseki, Sean Lamont and Andrew Keeffe
gave
me substantial help with the current release, and their excellent
work is
greatly appreciated. Akiko in particular did much of the hard work of
adding
new features and changing old ones in the 3.4 and 3.5 releases, and
Andrew
prepared the Macintosh version, wrote RETREE, and added the ray-tracing and
PICT code to the DRAW programs. Sean was central to the conversion to
C, and
tested it extensively. My postdoctoral fellow Mary Kuhner and her
associate
Jon Yamato created NEIGHBOR, the neighbor-joining and UPGMA program,
for the
current release, for which I am also grateful (Naruya Saitou kindly
encouraged
us to use some of the code from his own implementation of this method).
I am very grateful to many users for algorithmic suggestions,
complaints
about features (or lack of features), and information about the
behavior of
their operating systems and compilers. Among these are:
Jim Archie
Mary Barkworth
Vincent Bauchau
Bernard Baum
Mary Berbee
Biff Bermingham
Yves Bertheau
Pierre Boursot
Tom Bruns
Tsan Iang Chuang
Timothy Goldsmith
Rees Griffiths
George Gutman
Linda Hardison
Gene Hart
Masami Hasegawa
Bill Hatheway
David Hillis
Richard Holliday
Eddie Holmes
Kent Holsinger
Dan Nickrent
Trang Nguyen
Cary O'Donnell
Steve O'Kane
Gary Olsen
John Olsen
Steve O'Neill
Greg Orloff
Pekka Pamilo
David Penny
Norman Platnick
Stephen Clark
Bruce Cochrane
Joel Cracraft
Ross Crozier
Mark Dalton
Dan Davison
Ron DeBry
Allen Delaney
Terry Delaney
John Devereux
Tod Distotell
John Doebley
Ken Dodds
Jim Doyle
Guy Drouin
Shan Duncan
Tom Duncan
Robert Eaglen
Scott Edwards
Willem Ellis
Ted Emigh
John Endler
Laurent Excoffier
James Farmer
David Featherston
Kent Fiala
Tim Flannery
Vera Ford
Kurt Fristrup
Douglas Futuyma
Michael Garrick
Don Gilbert
John Gillespie
Nick Goldman
Dan Hough
Richard Jensen
Bo Johansson
Quentin Kay
Steve Kelem
Kim Cheol-Min
Joseph H. Kirkbride
John Kirsch
Andrew Knight
Dennis Knudson
Mary Kuhner
Jan Kwiatowski
John LaDuke
Lionel Landry
Franz Lang
Niels Larsen
Jerry Learn
Rev. Arthur Lee
Pierre Legendre
Jack A.M. Leunissen
Andrew Lloyd
Wolfgang Ludwig
David Maddison
Wayne Maddison
George McKay
Brian McMahon
Christopher Meacham
Brook Milligan
Sanzo Miyazawa
Janice Moore
Susumu Nakayama
Jean-Marc Neuhaus
Haolin Ni
Mark Ragan
Neil Rawlings
Tom Ritch
Alistair Robertson
Joseph R. Rohrer
Naruya Saitou
Kay Schneitz
Paul Sharp
Arend Sidow
Hans Siegismund
Chuck Smart
Douglas Smith
Dave Spencer
Lisa Steiner
Per Sundberg
Susan Swensen
David Swofford
John Sved
Naoko Takezaki
Eric Taylor
Jeff Thorne
Clive Trotman
John Turnbull
Hans Ullitz-Moeller
Michael Vodkin
Carl Wadsworth
Ryk Ward
Daniel Weeks
Loni West
George D.F. Wilson
Thomas K. Wilson
M. Zandee
Eric Zurcher
My apologies to anyone who has accidentally been left out of this list.
Keep making suggestions and you will get on eventually.
A growing contribution to this package has been made by others
writing
programs or parts of programs. Chris Meacham contributed the important
program
FACTOR, long demanded by users, and the even more important ones
PLOTREE and
PLOTGRAM. Important parts of the code in DRAWGRAM and DRAWTREE were
taken over
from those two programs. He is thus mostly to blame for all problems
with
these programs.
Kent Fiala wrote PROCEDURE reroot to do outgroup-rooting, which was an
essential part of many programs in earlier versions. Someone at the
Western Australia Institute of Technology suggested the name PHYLIP (by
writing it on a magnetic tape as the tape label), but they all seem to
deny having done so (and I've lost the relevant letter).
Arend Sidow contributed makeinf.c to the Unsupported Division of
this
release, and Masami Hasegawa and Jun Adachi contributed ProtML.pas.
Their
generosity is much appreciated.
The distribution of the package also owes much to Buz Wilson and
Willem
Ellis, who have put a lot of effort into the past distribution of the
PCDOS and
Macintosh versions respectively. Christopher Meacham and Tom Duncan for
three
versions distributed a printed version of these documentation files
(they are
no longer able to do so), and I am very grateful to them for those
efforts.
William H.E. Day and F. James Rohlf have been very helpful in setting
up the
listserver news bulletin service.
I also wish to thank the people who have made computer resources
available
to me, mostly in the loan of use of microcomputers. These include
Jeremy
Field, Clem Furlong, Rick Garber, Dan Jacobson, Rochelle Kochin, Monty
Slatkin,
Jim Archie, Jim Thomas, and George Gilchrist.
I should also acknowledge the computers used to develop this package.
These include a CDC 6400, two DECSystem 1090s, my trusty old SOL-20, my
old Osborne-1, a VAX 11/780, a VAX 8600, my old MicroVAX I, my old
DECstation 3100, my old Toshiba 1100+, and my present mainstays, a
DECstation 5000/200, a DECstation 5000/125, a Compudyne 486DX/33, a
Trinity Genesis 386SX, a Zenith Z386 and a Mac Classic. (One of the
reasons we have been successful in achieving compatibility between
different computer systems is that I have had to run them myself under
so many different operating systems and compilers).
OTHER PHYLOGENY PROGRAMS AVAILABLE ELSEWHERE
Here are some of the other phylogeny packages that I know about.
Some of
them are available over Internet from ftp server machines, or by
World Wide
Web. If you are on Internet you should familiarize yourself with the
server
machines (see entries 6 and 7 below for more information). Another
major list
of phylogeny software is being compiled by David Maddison and Wayne
Maddison as
part of their "Tree of Life" project on the World Wide Web. Its URL is:
http://phylogeny.arizona.edu/tree/programs/programs.html
It is still very incomplete as of this writing but may be more up-to-date
than
this listing can be. The programs listed below include both free and
non-free
ones; in some cases I do not know whether a program is free. I have
listed as
free those that I knew were free; for the others you have to ask
their
distributor.
The list starts with programs and packages to estimate phylogenies,
continues with alignment-and-phylogeny programs, and ends with programs
to do other phylogeny-related tasks.
1. David Swofford of the Laboratory of Molecular Systematics, National
Museum of Natural History, Smithsonian Institution, Washington, D.C. has
written PAUP (which originally meant Phylogenetic Analysis Using
Parsimony). Version 3.0 was available for Macintoshes. It is currently
not available, but a new version, to be called PAUP*, will be released
by Sinauer Associates, of Sunderland, Massachusetts, in late 1995 or
early 1996. It will have Macintosh, DOS, and Unix versions. It will
include parsimony, distance matrix, invariants, and maximum likelihood
methods. PAUP 3.0 was probably the most sophisticated parsimony program,
with many options and close compatibility with MacClade (for which see
below). The new program will become much broader with the inclusion of
more methods. The price will be in the vicinity of $100 US. Sinauer
Associates' e-mail address is biology@sinauer.com.
2. If you have a Macintosh computer and any interest in
discrete-state
parsimony methods (including DNA and protein parsimony), you should
definitely
get MacClade. It was written by Wayne Maddison and David Maddison
of the
University of Arizona. All distribution is by Sinauer Associates,
Sunderland, Massachusetts 01375, USA. Their phone number is: (413) 665
3722, FAX: (413) 665 7292. A disk with program, help file, and example
data files, plus book (which has about 100 pages of intro to
phylogenetic theory, and 250 pages of program instructions), is $75 U.S.
($40 for the book alone). Site licenses are also available.
An earlier and less capable Version 2 (which for
example
cannot read nucleic acid sequences and has fewer features for
discrete
characters) is also available by anonymous ftp from the EMBL,
Indiana and
Houston molecular biology software servers. Their addresses are given
below
under the descriptions of TreeAlign and ClustalV. MacClade 2.1 will be
found
among their Mac software, as a squeezed and then binhexed file.
MacClade enables you to use the mouse-window interface to
specify and
rearrange phylogenies by hand, and watch the number of character steps
and the
distribution of states of a given character on the tree change as you
do so.
MacClade is positively addictive and will give you a much better feel
for the
tree and your data. It's the closest thing to a phylogeny video game
that I
have seen.
It has been influential in spurring the inclusion of interaction and
graphics into other phylogeny programs. (I have tried to supply this
functionality in PHYLIP by incorporating the programs MOVE, DOLMOVE, and
DNAMOVE, which act somewhat like MacClade). MacClade does not have a
sophisticated search algorithm to find best trees: it largely relies on
you to do it by hand (which is surprisingly effective), with only a
local rearrangement algorithm available to improve on that tree.
3. J. S. Farris has produced Hennig86, a fast parsimony program
including
branch-and-bound search for most parsimonious trees and
interactive tree
rearrangement. Although complete benchmarks have not been published it
is said
to be faster than Swofford's PAUP; both are a great many times faster
than the
parsimony programs in PHYLIP. The program is distributed in executable
object
code only and costs $50, plus $5 mailing costs ($10 outside of the
U.S.). The user's name should be stated, as copies are personalized as a
copy-protection measure. It is distributed by Arnold Kluge, Amphibians
and Reptiles, Museum of Zoology, University of Michigan, Ann Arbor,
Michigan 48109-1079, U.S.A. (Arnold.G.Kluge@um.cc.umich.edu) and by
Diana Lipscomb at George Washington University (BIODL@gwuvm.gwu.edu). It
runs on PC-compatible microcomputers with at least 512K of RAM and needs
no math coprocessor or graphics monitor. It can handle up to 180 taxa
and 999 characters.
4. Mark Siddall, of the Virginia Institute of Marine Sciences
(mes@vims.edu) has released Random Cladistics, a set of programs that
can carry out bootstrapping, jackknifing, and a variety of kinds of
permutation tests, using Hennig86 to analyze the data. To use it you
must have a copy of Hennig86 (for whose distribution see above). Random
Cladistics will carry out the appropriate transformations of your data
and will call Hennig86 and have it analyze them, and then it will
summarize the results. Random Cladistics is available free by anonymous
ftp from zoo.utoronto.ca in directory "pub" (files random.doc and
random.exe).
5. J. S. Farris has recently released RNA (Rapid Nucleotide
Analysis). It
features rapid bootstrapping.
It is available from Arnold Kluge, Amphibians and Reptiles, Museum of
Zoology, University of Michigan, Ann Arbor, Michigan 48109-1079, U.S.A.
(Arnold.G.Kluge@um.cc.umich.edu) and Diana Lipscomb at George Washington
University (BIODL@gwuvm.gwu.edu) who may be contacted for details. The
cost is said to be about $30 US.
6. ClaDOS, an interactive program which allows rearrangement of
trees and
their evaluation, mapping of characters into them, and more, is
available for
DOS systems from Kevin Nixon, L. H. Bailey Hortorium, Cornell
University, 467
Mann Library, Ithaca, New York 14853. Rumor has it that the cost is
in the
vicinity of $55 US.
7. MEGA (Molecular Evolutionary Genetic Analysis) has been released by
Sudhir Kumar, Koichiro Tamura, and Masatoshi Nei of the Institute of
Molecular Evolutionary Genetics, 328 Mueller Lab, Pennsylvania State
University, University Park, Pennsylvania 16802, U.S.A. It is an
executable program for DOS machines, and is menu-driven with
context-sensitive help. It will also run under Windows in a DOS Window.
It will analyze data from DNA, RNA and protein sequences, and distance
matrices produced from other kinds of data as well. It will include the
Neighbor-Joining distance matrix method, a branch and bound parsimony
method, and bootstrapping. It will also plot trees on many kinds of
printers. The program costs $15 (for the documentation). Inquiries can
also be made by mail to Joyce White at the above address or by
electronic mail to imeg@psuvm.psu.edu.
8. Yves van de Peer of the University of Antwerp
(yvdp@reks.uia.ac.be) has
developed TREECON 3.0, a program package for analysis of molecular data
sets.
It is menu driven and runs on 386 (and higher) DOS systems, and also on
Windows
systems.
It carries out inference of phylogenies by distance matrix methods, with
bootstrapping and a program to draw the trees. It is written in C and is
available free by anonymous ftp from uiam3.uia.ac.be. It was described
in CABIOS 9: 177-182 (1993). A fee is asked to defray expenses. For
information or ordering contact Van de Peer at the above e-mail address
or at the Department of Biochemistry, University of Antwerp (UIA),
Universiteitsplein 1, B-2610 Antwerpen, BELGIUM.
9. Jun Adachi and Masami Hasegawa have written a package MOLPHY
2.2,
carrying out maximum likelihood inference of phylogenies for either
nucleotide
sequences or protein sequences.
Their protein sequence maximum likelihood program, ProtML, is a
successor to the one they made available to me for distribution on a
nonsupported basis in PHYLIP, and is much improved over that. It is the
best protein maximum likelihood program available. The package is
distributed free in C source code, with documentation, by ftp from
sunmh.ism.ac.jp.
10. Gary Olsen, of the Department of Microbiology, University of
Illinois,
has developed a speeded-up version of my program DNAML coded in C,
called
"fastDNAml". It achieves a number of economies and also is organized so
that
it can be run on parallel processors -- he and his co-workers have
constructed
trees of very large size on a high-speed parallel processor. The
program can
be compiled using the "p4" portable parallel processing toolkit. It
can also
be run in ordinary serial mode on workstations where it is faster than
DNAML.
The C program is available by anonymous ftp from the Ribosomal Database
Project
at info.mcs.anl.gov in directory pub/RDP/programs/fastDNAml.
11. Ziheng Yang of the Institute of Molecular Evolutionary Genetics at
Pennsylvania State University (who is soon to be moving to the
Department of Integrative Biology, University of California, Berkeley),
(yang@imeg.bio.psu.edu) has released PAML 1.0, a program for the maximum
likelihood analysis of nucleotide or protein sequences (including Hidden
Markov Model analysis like the features we have in DNAML). It is
available as C source code for Unix systems, and is free by anonymous
ftp from the molecular biology software servers. It will be found on
ftp.bio.indiana.edu, for example, in directory molbio/evolve.
12. Pablo Goloboff, of the American Museum of Natural History
(goloboff@amnh.org), distributes PEWEE and NONA, to carry out weighted
parsimony analyses. The programs run on DOS with versions available for
both 386-486-Pentium machines and earlier 16-bit machines. Goloboff's
address is Dept. of Entomology, American Museum of Natural History,
Central Park West at 79th Street, New York, NY 10024. His telephone
number is 212 769 5619, and fax number is 212 769 5277.
13. Yasuo Ina of the National Institute of Genetics, Mishima,
Japan
(yina@ddbj.nig.ac.jp) has developed ODEN, a package of programs for
doing
distance matrix analyses on nucleotide or protein sequences. It is
described
in CABIOS 10: 11-12 (1994).
It is available free by anonymous
ftp from
directory pub/oden in bioslave.uio.no as C source code for Unix systems.
14. A. Luettke and R. Fuchs have written MacT, a package of programs for
Macintoshes that compute distances and compute Neighbor-Joining
phylogenies for them. The programs work on 4 through 26 sequences, and
source code in Microsoft QuickBasic is provided as well as compiled
executables. The package is free and is available on the molecular
biology software servers. On ftp.bio.indiana.edu it will be found in
directory molbio/mac. The programs are described in CABIOS 8: 591-594,
1992.
15. Andrey A. Zharkikh, Andrey Rzhetsky, and co-workers in the Institute
of Cytology and Genetics, Siberian Branch of the Russian Academy of
Sciences, Novosibirsk, Russia, Ex-USSR, have produced VOSTORG, a package
of programs for alignment (both manual and automatic) and inferring
phylogenies by distance methods and parsimony for molecular sequences.
It runs on IBM PC-compatibles and includes some rather fancy graphics.
The authors are currently in the U.S., not in Siberia. A version of the
program is available free by anonymous ftp from gsbs18.gs.uth.tmc.edu in
directory pub/zharkikh/vostorg. The programs are described in a paper by
Zharkikh et al. in Gene 101: 251-254 (1991).
16. Rainer Wetzel and Daniel Huson have developed a Macintosh program
for carrying out the "split decomposition" method of A. Bandelt and A.
Dress (Molecular Phylogenetics 1: 242-252 (1992)). Contact
huson@mathematik.uni-bielefeld.de for details.
17. James Lake distributes "Evomony", a program for using the
"evolutionary parsimony" (invariants) method for inferring phylogenies
from DNA or RNA sequences. It runs on 286 or higher DOS systems with at
least 500k bytes of memory. A Macintosh version was also contemplated. I
do not know what the current distribution arrangements are. Lake's
address is Department of Biology, University of California, Los Angeles,
California 90024.
18. Walter Fitch (Department of Ecology and Evolutionary Biology,
University of California, Irvine, California 92717, U.S.A.) has a
package "Molevol" available free (on receipt of an appropriate number of
PCDOS formatted floppy disks) with about 20 FORTRAN programs for not
only estimating trees by parsimony and distance methods but doing
various other manipulations of data that might be needed such as format
interconversions and searching for homology and secondary structure.
They are available as FORTRAN source and/or as PCDOS executables. The
FORTRAN programs will also run on Sun workstations (and probably others
too, I would suspect). His electronic mail address is
wfitch@daedalus.bio.uci.edu.
19. Pierre Roux and Tim Littlejohn of the Informatics Division of the
Organelle Genome Megasequencing Program at the Universite de Montreal
have made available PARBOOT, a program that takes bootstrap sampled data
sets and splits them up, submitting each to a different computer, so as
to run bootstrapping quickly on networks of computers. It is available
free as C source code by ftp from megasun.bch.umontreal.ca in directory
pub/parboot. It requires a networked system of computers with PHYLIP, a
"perl" interpreter, and appropriate accounts and permissions.
20. Andrey Zharkikh of the Genetics Centers at the University of Texas
Health Sciences Center in Houston has programs for bootstrapping of
nucleotide sequences, including his innovative double-bootstrap method
for getting less biased P values. They are available free by anonymous
ftp at
gsbs18.gs.uth.tmc.edu/pub/zharkikh/bootstrap
or
gsbs18.gs.uth.tmc.edu/pub/zharkikh/bootstrap/double-bootstrap.
The programs njbootjc, njbootk2, and njbootli implement methods based on
Jukes-Cantor, Kimura, and Li distances, respectively.
21. David Penny (Department of Botany and Zoology, Massey University,
Palmerston North, New Zealand) has been offering for free distribution
several PCDOS programs, one a fast parsimony program, TurboTree. There
are also two others, Hadtree which computes expected frequencies of all
possible distributions of nucleotides among species, and Great Deluge,
an approximate search for the most parsimonious tree by a quasi-random
method. He tells me that funding exigencies are such that he may soon
have to start charging for these. His electronic mail address is
dpenny@massey.ac.nz.
22. Jotun Hein (Institute of Genetics and Ecology, University of Aarhus,
8000 Aarhus C, Denmark) has produced TreeAlign, a multiple sequence
alignment program that builds trees as it aligns DNA or protein
sequences. It uses a combination of distance matrix and approximate
parsimony methods. TreeAlign uses too much memory for it to run on PC's
(DOS or Mac systems) but is really designed for a workstation or
mainframe. It is available by anonymous ftp at the Indiana, Houston, and
EMBL molecular biology software distribution sites. Their network
addresses are respectively: ftp.bio.indiana.edu, ftp.bchs.uh.edu, and
ftp.ebi.ac.uk. In the Indiana archive one must enter directory
molbio/align, in the Houston archive it is in directory pub/gene-server
in the directories unix and vms. If you are on Internet and use
molecular data it is important that you learn to use anonymous ftp and
become familiar with these ftp servers.
23. Another multisequence alignment program that estimates trees as it
aligns multiple sequences is ClustalW. Currently it is distributed as C
source code, and in Macintosh and DOS executables by its author, Desmond
Higgins. He is at the European Bioinformatics Institute in Cambridge,
England. ClustalW successfully compiles and runs on many different
workstations. DOS, Mac, and PowerMac executables are also available.
It is a complete rewrite and upgrade of the Clustal and ClustalV
packages; the first was described by Higgins and Sharp (1989). New
features include the ability to detect and read different input formats
(NBRF/PIR, Fasta, EMBL/Swissprot); align old alignments; produce
phylogenetic trees after alignment (Neighbor Joining trees with a
bootstrap option); write different alignment formats (Clustal, NBRF/PIR,
GCG, PHYLIP); and a full command line interface.
The program is available by anonymous ftp at the Indiana, Houston, and
EMBL molecular biology distribution sites. Their network addresses are
respectively: ftp.bio.indiana.edu, ftp.bchs.uh.edu, and ftp.ebi.ac.uk.
In the Indiana archive one must enter directory molbio/align, in the
Houston archive it is in directory pub/gene-server in all of the four
directories dos, Mac, unix, and vms (I do not know exactly where it is
in the EBI machine). If you are on Internet and use molecular data it is
important that you learn to use anonymous ftp and become familiar with
one or more of these ftp servers.
24. Ward Wheeler and David Gladstein have written MALIGN, a
parsimony-based alignment program for molecular sequences. It implements
the original suggestion by Sankoff, Morel, and Cedergren (1973) that
alignment and phylogenies could be done at the same time by finding that
tree that minimizes the total alignment score along the tree. Jotun
Hein's program TreeAlign (mentioned above) is another, more approximate
but probably faster, attempt to implement the Sankoff-Morel-Cedergren
suggestion. MALIGN is available from Ward Wheeler at the American Museum
of Natural History in New York City. His email address is
wheeler@amnh.org. It comes in DOS, Mac and SUN versions.
25. Rod Page has written COMPONENT, a program for PCDOS
systems for
comparing cladograms for use in phylogeny and biogeography studies.
It has
many tree comparison and consensus methods, and far more
features for
biogeographic studies (such as comparing species and area cladograms)
than any
other package. It runs on PCDOS 286 or 386 systems under Windows 3.0 or
higher. Its cost is 40 pounds U.K., and it can be ordered from Liz
Timpson at the Department of Botany, Natural History Museum, London
(emt@nhm.ic.ac.uk). Rod's e-mail address is
rod.page@zoology.oxford.ac.uk. There is a review of the program in
Cladistics 9: 351-353 (1993). COMPONENT has a World Wide Web site:
http://evolve.zps.ox.ac.uk/Rod/cpw.html which includes an order form.
26. Andrew Purvis and Andrew Rambaut of the Department of
Zoology,
University of Oxford, England, have written CAIC (Comparative
Analysis of
Independent Contrasts). It is a Macintosh program that carries
out the
contrasts method (like my CONTRAST) but with some modifications by
others to
cope with lack of resolution of the phylogeny.
It is available
free by
anonymous ftp from directory packages/CAIC at evolve.zps.ox.ac.uk.
It is
described in CABIOS 11: 247-251 (1995).
27. Joaquin Dopazo at the Centro Nacional de Biotecnologia in Madrid,
Spain, has written a program ABLE (Analysis of Branch Length Errors)
which implements the method described by Adell and Dopazo in J. Mol.
Evol. 38: 305-309 (1994). This is a form of the parametric bootstrap. It
makes use of PHYLIP. It is available as a DOS executable over World Wide
Web at http://www.cnb.uam.es/www/ximo or by anonymous ftp at:
ftp.cnb.uam.es in directory software/molevol.
28. Kent Fiala, now of SAS Institute, has written a compatibility
(clique)
program, based on an earlier program written by Kent and George
Estabrook.
Christopher Meacham has put the latest version of CLINCH (6.2), with
Kent's
permission, as a self-extracting DOS archive available free on Jim
Beach's
TAXACOM fileserver, muse.bio.cornell.edu. CLINCH 6.2 and associated
files can
be found by anonymous ftp in /pub/software/clinch as clinch62.exe,
which is a
self-extracting archive. Documentation, sample input and output, and
FORTRAN
source code are included.
PC-CLINCH is probably the most
sophisticated
compatibility analysis program. The Taxacom server, by the way, also has
other
material related to botanical systematics, including flora information.
29. Christopher Meacham (Museum Informatics Project,
University
of
California, Berkeley, California 94720, U.S.A.) produces COMPROB, a
Pascal
program to compute probabilities that characters would be compatible at
random,
thus telling us which clique is "most surprising". He can be
contacted as
meacham@violet.berkeley.edu about receiving a copy. The program is free.
30. The program MARKOV computes a distance measure between
pairs of
nucleotide sequences. It also constructs phylogenies from these and
summarizes
the 4x4 substitution matrices between the pairs of species.
It uses a more general model of substitution than used in PHYLIP, the
Stationary Markov Model described in the paper by Saccone et al. in
Methods in Enzymology volume 183, pages 570-583, 1990. Bootstrapping is
used to analyze the statistical error of the results. Output files from
CLUSTAL and PILEUP, as well as some other formats, can be used for
input, and analysis can be confined to certain codon positions in coding
sequences. The program is written in FORTRAN and runs on VMS and Unix
systems. It was produced by Dr. Graziano Pesole and Professor Cecilia
Saccone at the University of Bari, Italy, and is available (for free?)
from Dr. Cecilia Lanave at CSMME-CNR, Dipartimento di Biochimica e
Biologia Molecolare, Universita` di Bari, via Orabona 4, 70126 Bari,
Italy. Her phone number is 39-80-243305, her fax number is 39-80-243317,
and her e-mail address is lanave@vaxba0.ba.it or mvx36@ibacsata.it.
31. J. S. Armstrong, A. J. Gibbs, R. Peakall and G. Weiller, of
Australian
National University, Canberra, have produced RAPDistance, a package for
DOS and
(presumably) Windows systems for computing distance matrices for RAPD
analyses,
for use in various phylogeny programs.
RAPDistance is available
free by
anonymous ftp from directory pub/RAPDistance at life.anu.edu.au, or
on the
World Wide Web at http://life.anu.edu.au/molecular/software/rapd.html.
32. P. R. Reeves and colleagues at Sydney University, Australia, have
produced MULTICOMP, a program for computing various distances from
sequence data. It is described in a paper by Reeves et al. in CABIOS 10:
281-284 (1994). I do not know what computer systems it runs on. Reeves
may be contacted at reeves@angis.su.oz.au for distribution information.
33. Ken Rice of the Department of Organismal and Evolutionary Biology of
Harvard University has produced RSVP (restriction site variability
program) which calculates several measures of genetic variability based
on restriction map data. It also produces Jukes-Cantor corrected
distance matrices with standard errors from collections of restriction
maps. C source code for Version 2.08 of RSVP is available free by
anonymous ftp from: oeb.harvard.edu/rice or you can get it on WWW from:
http://oeb.harvard.edu/~rice. It runs under Unix.
34. J. S. Farris and Mary Mickevich earlier released a
package of
phylogeny programs, PHYSYS, which, at about $5,000, was extremely
expensive (in
my opinion, which is certainly a biased one). I am not sure whether,
from
whom, or under what conditions it is still available.
35. Fujitsu Ltd. ("a $21 billion global leader in advanced
computer,
telecommunications, and electronic devices") sells for $28,000 US a
Fujitsu S
family workstation complete with a program, SINCAIDEN, which allows
"experimental researchers, even those unfamiliar with such analyses, [to]
easily create phylogenetic trees in their own laboratories." The program
also
allows searches of the major nucleic acid sequence and protein
databases (the
ad I saw does not make it clear whether these databases are provided
with the
workstation).
The methods available are UPGMA, neighbor-joining,
Farris's
(Distance Wagner) and the modified Farris distance matrix methods.
The workstation is SPARC compatible and runs SunOS. The SINCAIDEN
program was
developed by the group at the National Institute of Genetics, Japan
under Dr.
Takashi Gojobori.
Fujitsu Ltd. may be contacted at 21-8, Nishi-Shinbashi 3-chome, Minato-ku,
Tokyo 105, Japan (phone 81-3-3437-5111 ext. 2831, fax 81-3-5472-4354), or in
the U.S. at Fujitsu America Inc., 3055 Orchard Drive, San Jose, California
95134-2017 (phone 1-408-432-1300 ext. 5168, fax 1-408-434-1045).
36. MUST, a package of sequence management programs, is distributed
on a
shareware basis by Herve Philippe, Laboratoire de Biologie
Cellulaire (URA
CNRS 1134 D), Batiment 444, Universite de Paris-Sud, 91405 Orsay cedex,
France.
His e-mail address is: adoutte@frciti51 on Bitnet/EARN. His phone
and fax
numbers are respectively 33.1.69.41.64.81 and 33.1.69.41.21.30.
MUST is
available on a shareware basis ($100 registration fee if you do
not send
diskettes) and runs on DOS systems using DOS version 3 or later.
It is
intended as complementary to existing phylogeny and alignment programs
and can
produce output files in the formats of PHYLIP, PAUP, Hennig86, and
CLUSTAL. It
contains a variety of sequence input, editing, checking, and storage
functions,
as well as a sequence editor and a phylogeny plotter. It also allows
further
analyses of the results from these phylogeny programs.
37. Steve Smith, formerly of the Harvard Genome Laboratory, has
written
an X-Windows interactive sequence editor, GDE (Genetic Data Environment)
which
allows the user to edit sequences and align them by hand, and to select
subsets
of sites and sequences and call a variety of analysis programs
including
ClustalV and many of the PHYLIP 3.5 programs. The GDE 2.0 system will
run on
many workstations that have the X windowing system. It also
includes the
TreeTool tree-plotting program (see below). GDE 2.0 is free and is
available
for anonymous ftp transfer at the molecular biology software servers, such as
ftp.bio.indiana.edu in directory molbio/unix/GDE, or at
megasun.bch.umontreal.ca in directory pub/gde. At the latter location there
are also Linux binaries, and at both there are Sun binaries.
38. Mike Maciukenas, at the Department of Microbiology of the
University
of Illinois, has written a wonderful X-windows based interactive tree-plotting
program called TreeTool. It takes as input a PHYLIP tree file, with
branch
lengths if they are provided, displays the tree in either rooted or
unrooted
form on any X-windows screen, and allows the user to modify the form
of the
tree and the placement of nodes and labels. When the tree is in final
form the
user can have it written to a Postscript file and/or printed to a
Postscript-compatible printer. TreeTool is free as a C program for X windows
and is available for anonymous ftp from ftp.bio.indiana.edu in directory
molbio/unix/GDE.
It is also included in the GDE 2.0 sequence
analysis
environment mentioned above.
39. Manolo Gouy of the University of Lyon, France, has produced
NJplot,
which displays phylogenies (input in the standard form) on Macintosh
screens
and saves them in PICT files. It is available free and can be
retrieved by
anonymous ftp from molecular biology software servers such as the European
Bioinformatics Institute's server, ftp.ebi.ac.uk, where it is in
directory
pub/software/mac.
HOW YOU CAN HELP ME
Simply let me know of any problems you have had adapting the
programs to
your computer. I can often make "transparent" changes that, by making
the code
avoid the wilder, woolier, and less standard parts of C, not only help
others
who have your machine but even improve the chance of the programs
functioning
on new machines. I would like fairly detailed information on what
gave
trouble, on what operating system, compiler and machine, and what had
to be
done to make it work. I will be pleased to help do some over-the-telephone
trouble-shooting, particularly if I don't pay for the call. Electronic
mail is
a particularly convenient way for me to be asked about problems, as
you can
include your input and output files so I can see what is going on. I'd
really
like these programs to be able to run with only routine changes on
ABSOLUTELY
EVERYTHING, down to and possibly including the Amana Touchmatic
Radarange
Microwave Oven (which is an Intel 8080 system -- early versions of this
package
did run successfully on Intel 8080 systems).
I would also like to know timings of programs from the package,
when run
on the three test input files provided above, for various computer and
compiler
combinations, so that I can provide this information in the section on
speeds
of this document.
For the phylogeny plotting programs DRAWGRAM and DRAWTREE, Chris
Meacham
and I are particularly interested in knowing what has to be done to
adapt them
for other common plotters, laser printers, and dot matrix printers.
You can also be helpful to PHYLIP users in your part of the
world by
giving them the latest version of PHYLIP and helping them with any
problems
they may have in getting PHYLIP working on their data.
Your help is appreciated. I am always happy to hear
suggestions for
features and programs that ought to be incorporated in the package, but
please
do not be upset if I turn out to have already considered the
particular
possibility you suggest and decided against it.
I would also like to know of any applications of PHYLIP
that get
published: I would appreciate receiving a reprint of any paper
reporting work
that used PHYLIP.
IN CASE OF TROUBLE
READ THE (DOCUMENTATION) FILES METICULOUSLY ("RTFM"). If that doesn't
solve the problem, get in touch with me. I am on electronic mail
at the
addresses given below. If you do ask about a problem, please
specify the
program name, version of the package, computer and compiler, and be
prepared to
send me your data file so I can test the problem. Also it helps to
have the
relevant input and output and documentation file nearby so that we can
refer to
it. I can also be reached by calling me in my office: (206)-543-0150,
or at
home: (206)-526-9057 (how's THAT for user support!). If I cannot be
reached at
either place, a message can be left at the office of the
Department of
Genetics, (206)-543-1657 but I prefer strongly that I not call you, as
in any
phone consultation the least you can do is pay the phone bill.
Particularly if you are in a part of the world distant from me,
you may
also want to try to get in touch with other users of PHYLIP nearby.
I can
also, if requested, provide a list of nearby users.
Joe Felsenstein
Department of Genetics
University of Washington
Box 357360
Seattle, Washington 98195-7360, U.S.A.
Electronic mail addresses (I prefer that you use the Internet address
if possible):
joe@genetics.washington.edu
joe@evolution.genetics.washington.edu
joe@128.95.12.41