Documentation for FREQPARS Release 1.0 October 31, 1988 I

advertisement

Documentation for FREQPARS

Release 1.0

October 31, 1988

I. General features of the program

FREQPARS is a program that constructs phylogenetic trees by applying the principle of parsimony to allele frequency data. The program was written by D.L. Swofford, and carries out the procedure described by D.L. Swofford and S.H. Berlocher (1987).

The objective of the program is to find the tree on which the frequency of each allele undergoes the least possible amount of change, while at the same time ensuring that allele frequencies in hypothetical ancestors add to one (the "additivity requirement").

The procedure for finding the shortest tree has two parts, as described in Swofford and Berlocher (1987). First, given a particular tree topology, one must find a set of allele frequencies for hypothetical ancestral taxa that results in the least possible amount of allele frequency change (subject to the additivity requirement) on that particular tree. When the minimal amount of change is summed across all alleles, one obtains the tree length. Part two is to find, from the set of all possible tree topologies, the particular tree that yields the shortest tree length when step one is carried out.

FREQPARS will carry out both parts of the procedure or will carry out only part one, when supplied with a tree topology. If both parts of the procedure are executed, FREQPARS uses a modification of the Wagner method (a "heuristic" method) to find the tree topology yielding the shortest tree length (Swofford and

Berlocher, 1987).

This version of FREQPARS is written in a very standard subset of Fortran 77. A version compiled for the IBM-PC (FREQPARS.EXE) and source code are included.

II. Limitations of the current version

FREQPARS is in a very early stage of development. We had hoped to provide a more sophisticated version, but have been unable to find the time to develop it. Consequently, we have decided to send out the same version of the program that we used to write our Systematic Zoology paper, which is admittedly not very user-friendly. We plan to develop a subsequent version

(written in C) that provides simpler input, more complete and useful output, and faster operation. In the meantime, two caveats are in order.

The two parts of the procedure described above are not of equal difficulty. Part one, finding ancestor frequencies for a given topology, is much simpler than part two, searching for the topology on which the shortest tree can be built. This corresponds exactly to the case of conventional data; finding the shortest tree is difficult for large numbers of taxa and characters, regardless of the procedure or program employed.

Caveat one, then, is that FREQPARS at present has a very limited ability to search for the shortest tree. That is, the second part of the procedure is not efficient in this release, and may very well not find the shortest tree. The program has no provision for the kind of searches possible in PAUP (branch-and- bound, automatic branch swapping) (Swofford, 1985).

The reason that these features are not yet incorporated into

FREQPARS is that part one -- finding ancestor frequencies that minimize total frequency change -- is computationally intensive.

Even though shortcuts are taken when possible, many loci in a data set will require linear programming optimization (see Swofford and

Berlocher, 1987), which require copious amounts of central processor time.

Therefore, our second caveat is that FREQPARS will use large amounts of time for large data sets, just to build a single tree.

Having issued these warnings, we hasten to add that we are not seeking to discourage you from using FREQPARS but simply trying to give you realistic expectations. As long as you are aware of the limitations of the current release, there is still much that can be done with the program. In addition to straightforward tree estimation, there are several applications of the program not discussed in our paper. For example, you can use the program to evaluate tree topologies produced by other methods. If one topology is suggested by a Qualitative Hennigian

Analysis (Patton and Avise, 1983), and another by a Distance

Wagner analysis of, say, Manhattan Genetic Distance, then FREQPARS could be used to judge between the two topologies.

III. Format for datafile

A. Description of datafile

The datafile contains the allele frequencies for the

terminal taxa (henceforth referred to as OTUs), and other

information that the program requires. Lines are usually

limited to 80 characters for convenience but may be as long

as required by the data.

Line 1: Title line: Up to 80 characters.

Line 2: Numbers of OTUs, number of loci: Free format, separate the

two integer values with one or more blanks.

Line 3: OTU labels: Each label can be from 1-8 characters, each

enclosed in single-quotes. Use as many lines as needed.

Line 4: Options line: This line can be omitted to take defaults for

all values. If used, the first word in the line must be

"OPTIONS", followed by one or more of the items below (see

the example in the appendix):

a. USERTREE:

Read tree topology (or topologies) from treefile. If

USERTREE is not specified the program will compute a

tree using the modified Wagner procedure (Swofford and

Berlocher, 1987) as the default.

b. NOTABLE:

Suppress the output of a table of allele frequencies

either read from the data file (terminal taxa, OTUs) or

assigned by the program to hypothetical ancestral taxa

(HTUs). The default is to print these tables.

c. LIST:

Creates a file containing lengths of each tree evaluated

(useful when input treefile contains many topologies,

and you use NOTABLE to reduce the amount of output).

Default is to not create this file.

d. NOCHECK:

Suppresses the check that allele frequencies sum to 1

(within a 0.01 tolerance for roundoff error) for each

locus in each taxon. Default is to check these sums.

Warnings are issued if frequencies do not sum to one,

but execution proceeds.

Line 5: Fortran format specification for data: An "A" field is used

to read allele names and one or more "F" fields are used to

the allele frequencies (see example in Appendix and

description of next lines).

Line 6: Name of locus 1 (enclosed in single quotes) and number of

alleles for this locus. There must be at least one blank

between locus name and number of alleles.

Line 7: Allele name and frequencies for allele 1 of locus 1. The

first 1-5 characters, in the A field of the FORTRAN

specification, are for the allele name. Allele frequencies

then follow, in the order of OTUs given in line 3. That is,

the frequency of allele 1 of locus 1 for OTU 3 is in the

3rd

F field in line 7. (Note that as many lines as necessary may

be used, so long as the format specification is compatible

with a multiline entry.)

Line 8-n: Lines in the same format as line 7 list the allele name and

frequencies for allele 2 of locus 1 etc., with subsequent

lines as necessary until all alleles of locus 1 are accounted

for. Then repeat line 6 for locus 2, etc. See example file

in Appendix.

B. Conversion program for BIOSYS-1 data sets.

A short FORTRAN program entitled "BIO2FREQ", included with FREQPARS, will convert a BIOSYS-1 input file (Swofford and Selander, 1981) into a

FREQPARS datafile.

C. Limits on size of data set.

As currently dimensioned, the compiled PC version of FREQPARS will accept up to 10 alleles/locus, 200 total alleles, 20 OTUs, and 50 loci

(see

Section VI for information on changing these limits). On mainframes, the program can be redimensioned to accept larger data sets (assuming enough system memory is available). See Section VI for more information.

IV. Format for treefile

By specifying the USERTREE option, the user can request optimization of one or more user defined tree topologies. The resulting output includes the minimum length required by each tree and a set of allele frequency assignments to the HTUs consistent with this length. If USERTREE is employed, a file (treefile) containing the topology (topologies) of the tree

(trees) of interest must be provided. The only information in the treefile is an ancestor function representing the topology of each tree topology to be optimized.

An ancestor function is a concise way of describing the topology of a given tree. Trees must be rooted in a particular way in order to define an ancestor function, but this is only for convenience, since FREQPARS in reality produces unrooted trees. Note that trees input to FREQPARS must be completely bifurcating. That is, each interior node (HTU) must be connected to exactly three other nodes.

A. Format for one topology

Assume that we wish to evaluate the unrooted tree below

(or

any of the 7 possible rooted trees that can be produced from the

tree).

1 3 4

\ | /

\ | /

*------*------*

/ \

/ \

2 5

1. Rooting.

To proceed with the ancestor function, the tree should

be rooted at the interior node adjacent to the last taxon in

the datafile, as shown below:

1 2 3 4 5

\ / / | /

\ / / | /

* / | /

\ / | /

* | /

\ | /

\ | /

\ | /

\ | /

\ | /

\ | /

\|/

*

OTUs need not be ranked in numerical order, but the interior

node connected to the last taxon (rightmost in the data file)

must be the root.

2. Labeling interior nodes.

If there are n OTUs, there will be n + 1 through 2n -

2

interior nodes representing HTUs. The interior nodes are

labeled with numbers, starting with n + 1. The labels must

always increase as one moves on the tree from any tip toward

the root. Consequently, the only valid labeling for the

above tree is as follows:

1 2 3 4 5

\ / / | /

\ / / | /

6 / | /

\ / | /

7 | /

\ | /

\ | /

\ | /

\ | /

\ | /

\ | /

\|/

8

3. Determining ancestor functions.

The ancestor function is simply a list of immediate

ancestor of each OTU and HTU written in the order of the

OTUs and HTUs. For the labeled tree above (the one on the

right), the ancestor function is determined as follows:

-------- OTUs ------- --HTUs--

( 1 2 3 4 5 6 7 )

6 6 7 8 8 7 8

The ancestor function itself is the bottom line; this is

all that need be present in the treefile. (The top line, in

parenthesis, is shown only to simplify explanation, and is

not entered in the treefile.)

There should be exactly 2n - 3 entries for each tree.

The ancestor function is entered in free format, with a blank

between numbers. As many lines as needed can be used.

B. Multiple tree topologies.

The ancestor function for each succeeding tree topology

simply follows the preceeding one. You may use as many lines as

necessary for each tree, but start each ancestor function on a new

line.

V. Execution of program

A. Execution on IBM PC:

A math coprocessor (8087, 80287, or 80387, depending on your

machine) is required for running FREQPARS.

Make sure the line "FILES=10" (or >10) appears in the

CONFIG.SYS

file of the disk you boot from. Remember that if you change the

CONFIG.SYS, you will have to reboot before the change takes

effect.

To run FREQPARS.EXE, type the following at the DOS prompt:

FREQPARS datafile

where "datafile" may be any 1-8 character filename. The file

"datafile.FRQ" contains the allele frequency data. If USERTREE

was requested in the OPTIONS line, the file "datafile.TRE",

containing one or more ancestor functions in the format described

above, must be present.

Primary output is written to "datafile.OUT" (unless NOTABLE was

requested in the OPTIONS line). If "LIST" was requested in the

options line, the list of tree lengths for each topology evaluated

will be written to the file "datafile.LST".

B. Implementation and execution on mainframe:

In addition to the IBM PC, FREQPARS has been successfully

implemented on an IBM 3081 mainframe running under VM/CMS. See

comments in the source code for further information. The source

code is written in very standard Fortran 77, so we do not

anticipate great difficulties in implementing the program on other

systems. The only major changes involve redimensioning (see

below), and associating data and output files with the

corresponding unit numbers in the program itself. A sample CMS

EXEC is provided on the distribution disk.

VI. Redimensioning

As provided for the IBM PC, the program will allow up to 10

alleles per locus (MA), a total of 200 alleles (MTOTA), 20 taxa

(MT),

and 20 loci (ML). Although these limits are somewhat restrictive,

remember that monomorphic loci contribute no information and should be

eliminated from the data set. Also, loci that are fixed identically in

n-1 taxa can be eliminated. If a taxon has two alleles that were

observed only in that taxon and not elsewhere, these can be pooled into

a single allele.

For larger data sets, a mainframe or minicomputer is required.

To

change the dimensions, simply change the values of the dimension

parameters (shown in parentheses in the above paragraph) in the file

FREQPARM.FOR and recompile. (Note that in Fortran implementations that

do not provide an 'INCLUDE' statement, this file will have to be copied

into the source code at the appropriate locations.)

VII. References

Patton, J.C., and J.C. Avise. 1983. An empirical evaluation of

qualitative Hennigian analyses of protein electrophoretic data.

J. Mol. Evol., 19: 244-254.

Swofford, D.L. 1985. PAUP: Phylogenetic analysis using parsimony.

User's manual. Illinois Natural History Survey, Champaign,

Illinois.

Swofford, D.L. and S.H. Berlocher. 1987. Inferring evolutionary

trees from gene frequency data under the principle of maximum

parsimony. Syst. Zool. 36: 293-325.

Swofford, D.L., Selander, R.B. 1981. BIOSYS-1: A FORTRAN program for

the comprehensive analysis of electrophoretic data in population

genetics and systematics. J. Hered. 72: 281-83.

Download