Documentation for FREQPARS
Release 1.0
October 31, 1988
I. General features of the program
FREQPARS is a program that constructs phylogenetic trees by applying the principle of parsimony to allele frequency data. The program was written by D.L. Swofford, and carries out the procedure described by D.L. Swofford and S.H. Berlocher (1987).
The objective of the program is to find the tree on which the frequency of each allele undergoes the least possible amount of change, while at the same time ensuring that allele frequencies in hypothetical ancestors add to one (the "additivity requirement").
The procedure for finding the shortest tree has two parts, as described in Swofford and Berlocher (1987). First, given a particular tree topology, one must find a set of allele frequencies for hypothetical ancestral taxa that results in the least possible amount of allele frequency change (subject to the additivity requirement) on that particular tree. When the minimal amount of change is summed across all alleles, one obtains the tree length. Part two is to find, from the set of all possible tree topologies, the particular tree that yields the shortest tree length when step one is carried out.
FREQPARS will carry out both parts of the procedure or will carry out only part one, when supplied with a tree topology. If both parts of the procedure are executed, FREQPARS uses a modification of the Wagner method (a "heuristic" method) to find the tree topology yielding the shortest tree length (Swofford and
Berlocher, 1987).
This version of FREQPARS is written in a very standard subset of Fortran 77. A version compiled for the IBM-PC (FREQPARS.EXE) and source code are included.
II. Limitations of the current version
FREQPARS is in a very early stage of development. We had hoped to provide a more sophisticated version, but have been unable to find the time to develop it. Consequently, we have decided to send out the same version of the program that we used to write our Systematic Zoology paper, which is admittedly not very user-friendly. We plan to develop a subsequent version
(written in C) that provides simpler input, more complete and useful output, and faster operation. In the meantime, two caveats are in order.
The two parts of the procedure described above are not of equal difficulty. Part one, finding ancestor frequencies for a given topology, is much simpler than part two, searching for the topology on which the shortest tree can be built. This corresponds exactly to the case of conventional data; finding the shortest tree is difficult for large numbers of taxa and characters, regardless of the procedure or program employed.
Caveat one, then, is that FREQPARS at present has a very limited ability to search for the shortest tree. That is, the second part of the procedure is not efficient in this release, and may very well not find the shortest tree. The program has no provision for the kind of searches possible in PAUP (branch-and- bound, automatic branch swapping) (Swofford, 1985).
The reason that these features are not yet incorporated into
FREQPARS is that part one -- finding ancestor frequencies that minimize total frequency change -- is computationally intensive.
Even though shortcuts are taken when possible, many loci in a data set will require linear programming optimization (see Swofford and
Berlocher, 1987), which require copious amounts of central processor time.
Therefore, our second caveat is that FREQPARS will use large amounts of time for large data sets, just to build a single tree.
Having issued these warnings, we hasten to add that we are not seeking to discourage you from using FREQPARS but simply trying to give you realistic expectations. As long as you are aware of the limitations of the current release, there is still much that can be done with the program. In addition to straightforward tree estimation, there are several applications of the program not discussed in our paper. For example, you can use the program to evaluate tree topologies produced by other methods. If one topology is suggested by a Qualitative Hennigian
Analysis (Patton and Avise, 1983), and another by a Distance
Wagner analysis of, say, Manhattan Genetic Distance, then FREQPARS could be used to judge between the two topologies.
III. Format for datafile
A. Description of datafile
The datafile contains the allele frequencies for the
terminal taxa (henceforth referred to as OTUs), and other
information that the program requires. Lines are usually
limited to 80 characters for convenience but may be as long
as required by the data.
Line 1: Title line: Up to 80 characters.
Line 2: Numbers of OTUs, number of loci: Free format, separate the
two integer values with one or more blanks.
Line 3: OTU labels: Each label can be from 1-8 characters, each
enclosed in single-quotes. Use as many lines as needed.
Line 4: Options line: This line can be omitted to take defaults for
all values. If used, the first word in the line must be
"OPTIONS", followed by one or more of the items below (see
the example in the appendix):
a. USERTREE:
Read tree topology (or topologies) from treefile. If
USERTREE is not specified the program will compute a
tree using the modified Wagner procedure (Swofford and
Berlocher, 1987) as the default.
b. NOTABLE:
Suppress the output of a table of allele frequencies
either read from the data file (terminal taxa, OTUs) or
assigned by the program to hypothetical ancestral taxa
(HTUs). The default is to print these tables.
c. LIST:
Creates a file containing lengths of each tree evaluated
(useful when input treefile contains many topologies,
and you use NOTABLE to reduce the amount of output).
Default is to not create this file.
d. NOCHECK:
Suppresses the check that allele frequencies sum to 1
(within a 0.01 tolerance for roundoff error) for each
locus in each taxon. Default is to check these sums.
Warnings are issued if frequencies do not sum to one,
but execution proceeds.
Line 5: Fortran format specification for data: An "A" field is used
to read allele names and one or more "F" fields are used to
the allele frequencies (see example in Appendix and
description of next lines).
Line 6: Name of locus 1 (enclosed in single quotes) and number of
alleles for this locus. There must be at least one blank
between locus name and number of alleles.
Line 7: Allele name and frequencies for allele 1 of locus 1. The
first 1-5 characters, in the A field of the FORTRAN
specification, are for the allele name. Allele frequencies
then follow, in the order of OTUs given in line 3. That is,
the frequency of allele 1 of locus 1 for OTU 3 is in the
3rd
F field in line 7. (Note that as many lines as necessary may
be used, so long as the format specification is compatible
with a multiline entry.)
Line 8-n: Lines in the same format as line 7 list the allele name and
frequencies for allele 2 of locus 1 etc., with subsequent
lines as necessary until all alleles of locus 1 are accounted
for. Then repeat line 6 for locus 2, etc. See example file
in Appendix.
B. Conversion program for BIOSYS-1 data sets.
A short FORTRAN program entitled "BIO2FREQ", included with FREQPARS, will convert a BIOSYS-1 input file (Swofford and Selander, 1981) into a
FREQPARS datafile.
C. Limits on size of data set.
As currently dimensioned, the compiled PC version of FREQPARS will accept up to 10 alleles/locus, 200 total alleles, 20 OTUs, and 50 loci
(see
Section VI for information on changing these limits). On mainframes, the program can be redimensioned to accept larger data sets (assuming enough system memory is available). See Section VI for more information.
IV. Format for treefile
By specifying the USERTREE option, the user can request optimization of one or more user defined tree topologies. The resulting output includes the minimum length required by each tree and a set of allele frequency assignments to the HTUs consistent with this length. If USERTREE is employed, a file (treefile) containing the topology (topologies) of the tree
(trees) of interest must be provided. The only information in the treefile is an ancestor function representing the topology of each tree topology to be optimized.
An ancestor function is a concise way of describing the topology of a given tree. Trees must be rooted in a particular way in order to define an ancestor function, but this is only for convenience, since FREQPARS in reality produces unrooted trees. Note that trees input to FREQPARS must be completely bifurcating. That is, each interior node (HTU) must be connected to exactly three other nodes.
A. Format for one topology
Assume that we wish to evaluate the unrooted tree below
(or
any of the 7 possible rooted trees that can be produced from the
tree).
1 3 4
\ | /
\ | /
*------*------*
/ \
/ \
2 5
1. Rooting.
To proceed with the ancestor function, the tree should
be rooted at the interior node adjacent to the last taxon in
the datafile, as shown below:
1 2 3 4 5
\ / / | /
\ / / | /
* / | /
\ / | /
* | /
\ | /
\ | /
\ | /
\ | /
\ | /
\ | /
\|/
*
OTUs need not be ranked in numerical order, but the interior
node connected to the last taxon (rightmost in the data file)
must be the root.
2. Labeling interior nodes.
If there are n OTUs, there will be n + 1 through 2n -
2
interior nodes representing HTUs. The interior nodes are
labeled with numbers, starting with n + 1. The labels must
always increase as one moves on the tree from any tip toward
the root. Consequently, the only valid labeling for the
above tree is as follows:
1 2 3 4 5
\ / / | /
\ / / | /
6 / | /
\ / | /
7 | /
\ | /
\ | /
\ | /
\ | /
\ | /
\ | /
\|/
8
3. Determining ancestor functions.
The ancestor function is simply a list of immediate
ancestor of each OTU and HTU written in the order of the
OTUs and HTUs. For the labeled tree above (the one on the
right), the ancestor function is determined as follows:
-------- OTUs ------- --HTUs--
( 1 2 3 4 5 6 7 )
6 6 7 8 8 7 8
The ancestor function itself is the bottom line; this is
all that need be present in the treefile. (The top line, in
parenthesis, is shown only to simplify explanation, and is
not entered in the treefile.)
There should be exactly 2n - 3 entries for each tree.
The ancestor function is entered in free format, with a blank
between numbers. As many lines as needed can be used.
B. Multiple tree topologies.
The ancestor function for each succeeding tree topology
simply follows the preceeding one. You may use as many lines as
necessary for each tree, but start each ancestor function on a new
line.
V. Execution of program
A. Execution on IBM PC:
A math coprocessor (8087, 80287, or 80387, depending on your
machine) is required for running FREQPARS.
Make sure the line "FILES=10" (or >10) appears in the
CONFIG.SYS
file of the disk you boot from. Remember that if you change the
CONFIG.SYS, you will have to reboot before the change takes
effect.
To run FREQPARS.EXE, type the following at the DOS prompt:
FREQPARS datafile
where "datafile" may be any 1-8 character filename. The file
"datafile.FRQ" contains the allele frequency data. If USERTREE
was requested in the OPTIONS line, the file "datafile.TRE",
containing one or more ancestor functions in the format described
above, must be present.
Primary output is written to "datafile.OUT" (unless NOTABLE was
requested in the OPTIONS line). If "LIST" was requested in the
options line, the list of tree lengths for each topology evaluated
will be written to the file "datafile.LST".
B. Implementation and execution on mainframe:
In addition to the IBM PC, FREQPARS has been successfully
implemented on an IBM 3081 mainframe running under VM/CMS. See
comments in the source code for further information. The source
code is written in very standard Fortran 77, so we do not
anticipate great difficulties in implementing the program on other
systems. The only major changes involve redimensioning (see
below), and associating data and output files with the
corresponding unit numbers in the program itself. A sample CMS
EXEC is provided on the distribution disk.
VI. Redimensioning
As provided for the IBM PC, the program will allow up to 10
alleles per locus (MA), a total of 200 alleles (MTOTA), 20 taxa
(MT),
and 20 loci (ML). Although these limits are somewhat restrictive,
remember that monomorphic loci contribute no information and should be
eliminated from the data set. Also, loci that are fixed identically in
n-1 taxa can be eliminated. If a taxon has two alleles that were
observed only in that taxon and not elsewhere, these can be pooled into
a single allele.
For larger data sets, a mainframe or minicomputer is required.
To
change the dimensions, simply change the values of the dimension
parameters (shown in parentheses in the above paragraph) in the file
FREQPARM.FOR and recompile. (Note that in Fortran implementations that
do not provide an 'INCLUDE' statement, this file will have to be copied
into the source code at the appropriate locations.)
VII. References
Patton, J.C., and J.C. Avise. 1983. An empirical evaluation of
qualitative Hennigian analyses of protein electrophoretic data.
J. Mol. Evol., 19: 244-254.
Swofford, D.L. 1985. PAUP: Phylogenetic analysis using parsimony.
User's manual. Illinois Natural History Survey, Champaign,
Illinois.
Swofford, D.L. and S.H. Berlocher. 1987. Inferring evolutionary
trees from gene frequency data under the principle of maximum
parsimony. Syst. Zool. 36: 293-325.
Swofford, D.L., Selander, R.B. 1981. BIOSYS-1: A FORTRAN program for
the comprehensive analysis of electrophoretic data in population
genetics and systematics. J. Hered. 72: 281-83.