TreeFitter Version 1.0 Manual
(c) F. Ronquist 1999-2001

Tree fitting has important applications in historical biogeography, coevolution and gene tree-species tree fitting [see recent reviews by Page, 1998 #1419; Ronquist, 1998 #760]. A general characteristic of these problems is that two different kinds of trees are fitted to each other because we are interested in inferring how members of these two tree types have interacted with each other during their evolution. In historical biogeography, the two kinds of trees are organism phylogenies and area cladograms; in coevolution, they are parasite and host phylogenies; and in gene tree-species tree problems, they are gene trees and species phylogenies.

I have argued that parsimony methods for tree fitting should be based on models recognising different types of events and associating each of these events with a cost inversely related to the likelihood of the event. Because many other workers do not believe in deriving parsimony methods from models with events, I have called my approach “event-based parsimony”. In my view, however, event-based parsimony really represents the only logically defensible way in which parsimony inference can be applied to the problem of tree fitting.

A number of different models can be used in parsimony-based tree fitting, but most of the work has thus far concentrated on the four-event model first introduced by Page [1995 #551; Ronquist, 1998 #739] or subsets and variants thereof [Ronquist, 1990 #552; Goodman, 1979 #751; Ronquist, 1996 #547; Ronquist, in press #1591]. In this model, we recognize four different types of events: codivergence events, duplication events, sorting events and switching events. Codivergence events correspond to vicariance in historical biogeography, simultaneous host and parasite speciation in coevolving species associations, and gene tree divergence caused by speciation in gene tree-species tree studies.
Duplication events correspond to sympatric speciation or allopatric speciation in response to a temporary barrier in biogeography, independent parasite speciation in coevolution, and gene duplication in gene tree analysis. Sorting events correspond to (partial) extinction in biogeography, and lineage sorting in coevolution and gene tree analysis. Finally, switches correspond to dispersal between isolated areas in biogeography, host shifts in coevolving associations, and horizontal gene transfer in gene tree analysis.

TreeFitter is a simple program for parsimony-based tree fitting. It can handle arbitrary cost assignments, provided that duplication events, sorting events, and switches all have zero or positive cost. Codivergence events can be associated with positive, negative or zero cost.

In TreeFitter terminology, one kind of tree is called a P-tree and the other an H-tree, by analogy with parasite and host trees in coevolutionary analysis. In historical biogeography, the H-trees are the area cladograms and the P-trees are the organism phylogenies. In coevolution, the interpretation is self-evident. In gene tree-species tree fitting, the P-trees are the gene trees and the H-trees are the species trees.

TreeFitter has a limited number of commands but still allows a number of useful inferences to be drawn from the data sets. It fits any number of P-trees to a given H-tree, and it can search for the best H-tree given a set of P-trees. It can calculate the events implied by the minimum-cost solutions; saving reconstructions in TreeMap format is planned but not yet implemented. Inferences about historical constraints or the number of events of a particular type can be tested against inferences drawn from random data sets. These random data sets are derived from the original data by random permutation of the terminals in the P-tree, the H-tree, or both.
Alternatively, the P-trees, the H-tree or both are replaced by trees drawn at random from a tree universe generated by a Markov process (all labelled histories equally probable) or from the tree universe in which all distinct labelled cladograms are equally probable. Finally, TreeFitter can examine portions of parameter space to find the combination of cost assignments giving the best chances of finding historically constrained patterns, given a set of P-trees and an H-tree.

By default, TreeFitter works with the following cost assignments: codivergence and duplication events have zero cost, sorting events have a cost of 1.0, and switches have a cost of 2.0. This combination of cost assignments works well for a wide variety of problems but not for all cases where it is possible to retrieve phylogenetically conserved association patterns [Ronquist, in press #1591]. The commands available in TreeFitter are summarized below.

Please remember that this software has been developed mainly for my own research needs and is not being maintained as a commercial software package. It is provided for free on the understanding that there are no guarantees that the software will not crash your system, destroy your files or fail to perform as you expect. Always keep backup copies of your files. Any suggestions for improvements or detailed bug reports are welcome and should be addressed to me (fredrik.ronquist@ebc.uu.se).

TreeFitter commands

The TreeFitter commands are described with a syntax similar to that used in the PAUP manual. A line fed to TreeFitter should contain a command, followed by some options with corresponding settings. In describing the syntax, items that are optional are given within square brackets [ ]. A setting can be a floating point value (specified by floatval), an integer value (specified by intval), or any of a set of alternative keyword settings (given within curly brackets and separated by vertical lines, as in {setting1|setting2|setting3}).
The commands can either be typed in from the keyboard or entered in a batch file. The batch file can then be processed by using the execute command. The format of the batch files is similar to the NEXUS format, with different blocks of commands. TreeFitter commands can also be issued outside blocks. TreeFitter is case-insensitive except for the labels of the H-tree and P-tree terminals.

Data file format

Data files should commence with the line #NEXUS. The commands should then be divided into blocks, each block starting with “Begin {HOSTS | AREAS | SPECIES | PARASITES | ORGANISMS | GENES | TREEFITTER};” and ending with “End;” or “Endblock;”. Depending on the block, different commands are available. The TreeFitter commands are also valid if issued out of block, but all other commands issued out of block will generate error messages. See Appendix 1 for examples of data files.

Commands used in an AREAS, HOSTS or SPECIES block

An AREAS, HOSTS or SPECIES block is used to feed TreeFitter with one or more H-trees, which are added to those already in memory, if any. The syntax is

    Begin {AREAS | HOSTS | SPECIES};
        [tree [<tree-name1>] <tree-description1>;]
        [tree [<tree-name2>] <tree-description2>;]
    End;

The hosts (or areas or species) block may contain as many trees as desired. Each tree may be named by a label. The label can be any combination of printing characters of any length. However, the label may not begin with a number or a left parenthesis. Furthermore, the name may not, when converted to lower case, be “all”. If no label is provided, the trees will be labelled “noname1”, “noname2”, etc.

The tree description follows the Newick format. If branch lengths are provided, TreeFitter will use the branch lengths to order the nodes in the tree (see below). Otherwise, the trees will be ordered arbitrarily (see above for discussion on ordered versus nonordered trees).
The labels used for the terminal taxa in the H-tree are critical: the same labels must be used in the range descriptions of the P-trees that you wish to fit to the H-tree (this match is case-sensitive). The labels of the terminal taxa can contain any symbols except white space. Any polytomies in the tree will be broken according to the settings of the polyresolve option (see the SET command).

Examples of H-tree blocks:

    Begin AREAS;
        tree hypothesis1 ((WN:2,EN:2):1,(WP:1,EP:1):2):1;
        tree hypothesis2 ((WN:1,EN:1):2,(WP:2,EP:2):1):1;
    End;

    Begin HOSTS;
        tree (Pap,(((Fab,(Fag,Ros)),(Sap,Ana)),(Lam,(Api,(Ast,Val)))));
    End;

    Begin SPECIES;
        tree right (kangaroo,(dog,(human,chimp)));
        tree wrong (chimp,(kangaroo,(human,dog)));
        tree very_wrong (human,(dog,(kangaroo,chimp)));
    End;

The first block defines two different H-trees describing the same area relationships but differing in the order of the postulated vicariance events. The order is set by specifying the branch lengths of the two H-trees (these ordered trees may more appropriately be termed H-tree histories). TreeFitter attempts to interpret the branch lengths as if they were ultrametric (measured in time units rather than in amounts of evolutionary change) and sets the order of the splitting events accordingly. The length of each time segment separating two consecutive splitting events can be set to an arbitrary number, such as 1, for the purpose of ordering the H-tree (Fig. XXX).

Commands used in an ORGANISMS, PARASITES or GENES block

Any of these blocks describes a set of P-trees, which TreeFitter adds to the P-trees in memory. The syntax is

    Begin {ORGANISMS | PARASITES | GENES};
        [tree [<tree-name1>] [weight = <floatval>] <tree-description1>;]
        [tree [<tree-name2>] [weight = <floatval>] <tree-description2>;]
    End;

The organisms (or parasites or genes) block may contain as many trees as desired. Each tree may be named by a label. The label can be any combination of printing characters of any length.
However, the label may not begin with a number or a left parenthesis. Furthermore, the name may not, when converted to lower case, be “all” or “weight”. If no label is provided, the trees will be labelled “noname1”, “noname2”, etc. Each tree may be associated with a weight between 0 and 1. If no weight is given, the weight defaults to 1.0.

The tree description follows the Newick format. Branch lengths are ignored. The labels used for the terminal taxa in the P-tree are critical: they are used to match the P-tree terminals to the H-tree terminals. These labels are case-sensitive.

Example of an ORGANISMS block:

    Begin ORGANISMS;
        tree lithobiusA weight=0.5 ((lava,lcom),(lser,lbig));
        tree lithobiusB weight=0.5 (lava,(lcom,(lser,lbig)));
        tree Carabus (((C_aratus,C_hortensis),C_nemoralis),C_nitens);
    End;

Commands used in an ASSOCIATIONS or DISTRIBUTIONS block

An ASSOCIATIONS or DISTRIBUTIONS block specifies the match between the terminals in a specified P-tree and the terminals in an H-tree. The syntax is as follows:

    Begin {ASSOCIATIONS | DISTRIBUTIONS};
        [range <tree-name1> <range-description1>;]
        [range <tree-name2> <range-description2>;]
    End;

The following distributions block specifies the distribution areas of the terminals in the three P-trees described in the ORGANISMS block given above:

    Begin DISTRIBUTIONS;
        range lithobiusA lava: WN EN, lser: WP EN, lcom:EP, lbig:EP;
        range lithobiusB lava: WN EN, lser: WP EN, lcom:EP, lbig:EP;
        range Carabus C_nitens:A B C D, C_nemoralis:C, C_nitens:D, C_hortensis:E, C_aratus:D;
    End;

Note that the range statement can be divided into several lines. TreeFitter uses the semicolon to find the end of a statement in a data file.

Commands used in a TREEFITTER block

Everything that can be done in the blocks discussed above can also be achieved using statements within a TREEFITTER block. These commands can also be issued out of block.
The syntax is as follows (listing all currently implemented commands):

    Begin TREEFITTER;
        [ptree [<treename>] <tree-description>;]
        [htree [<treename>] <tree-description>;]
        [range <treename> <range-description>;]
        [select {htrees | ptrees} {all | <tree-list>};]
        [deselect {htrees | ptrees} {all | <tree-list>};]
        [clear {all | htrees | ptrees};]
        [show {htrees | ptrees} [<tree-list>];]
        [list {htrees | ptrees} <options>;]
        [set <options>;]
        [estimate <options>;]
        [fit <options>;]
        [search <options>;]
        [order <options>;]
        [filter <options>;]
        [log <file-name>;]
        [execute <file-name>;]
        [ihtest <options>;]
    End;

Each of these commands is described in more detail below.

ptree [<treename>] <tree-description>;

This command is exactly equivalent to the tree command issued within a block describing P-trees (that is, an ORGANISMS, PARASITES or GENES block). See above.

htree [<treename>] <tree-description>;

This command is exactly equivalent to the tree command issued within a block describing H-trees (an AREAS, HOSTS or SPECIES block). See above.

range <treename> <range-description>;

This command has been described above under the ASSOCIATIONS or DISTRIBUTIONS block.

select {htrees | ptrees} {ALL | <tree-list>};

This command is used to select some H-trees or some P-trees from those currently in memory. The user must specify whether H-trees or P-trees are being selected, and then choose whether all those trees (ALL) or only a subset specified by a tree list should be selected. The tree list should give either the names or the numbers of the trees that are to be selected. All trees that are not selected are automatically deselected. It is possible to specify a range of trees by giving the name or the number of the first tree followed by a hyphen and the name or number of the last tree. Note that the names of the trees are case-sensitive. Thus, ‘tree1’ is not the same as ‘Tree1’. To obtain a list of the trees and their number and selection status, use the list command.
Examples:

    select htrees tree1 tree2 tree3;
    select htrees 1 2 3;
    select htrees 1-3;
    select htrees tree1 tree5-tree7;
    select htrees all;

deselect {htrees | ptrees} {all | <tree-list>};

This command deselects H-trees or P-trees currently in memory but is otherwise equivalent to the SELECT command.

clear {htrees | ptrees | all};

This command clears H-trees, P-trees, or both H-trees and P-trees from memory.

show {htrees | ptrees} [<tree-list>];

This command writes an ASCII representation of either the H-trees or the P-trees specified in the tree list. If no tree list is given, all H-trees or P-trees are shown.

list {htrees | ptrees} <options>;

This command lists all H-trees or all P-trees, their number, name and whether they are selected or not. If there are valid costs for the H-trees (calculated by a FIT or a SEARCH command), these costs are given.

set <options>;

This command is used to change the settings of a number of different parameters, as follows.

algorithm = {LB | UB}
This option determines whether TreeFitter will use a lower-bound or an upper-bound algorithm to fit H-trees and P-trees. The lower-bound algorithm is recommended for general usage but can occasionally give reconstructions with incompatible switches [Ronquist, 1996 #547; Ronquist, in press #1591]. The upper-bound algorithm is slower but gives exact solutions without incompatible switches for ordered H-trees. Unless you have many P-terminals per H-terminal and many switches, there is not likely to be much information in the P-trees about the order of the nodes in the H-tree, and all or most of the orderings of the nodes of the H-tree will have the same cost and imply the same set of events. Therefore, if you use the upper-bound algorithm in H-tree searches you are likely to obtain a large set of equally optimal H-trees that are identical in topology and differ only in the order of the splitting events (nodes).
To check whether you have problems with incompatible switches, you can compare the lengths of the H-trees fitted with the upper-bound algorithm to the lengths of the same trees fitted with the lower-bound algorithm.

treespace = {MARKOV | EQUAL}
This sets the tree universe used for drawing random trees. If MARKOV is chosen, random trees are generated by a random speciation-extinction process (default setting). If EQUAL is chosen instead, trees are picked from a universe with all distinct labelled trees equiprobable.

ccost = <floatval>
This sets the codivergence cost to the specified value. The value can be zero, negative or positive. The default value is 0.0.

ucost = <floatval>
This sets the duplication cost to the specified value, which must be larger than or equal to zero. The default value is 0.0.

scost = <floatval>
This sets the sorting cost to the specified value, which must be larger than or equal to zero. The default value is 1.0.

icost = {<floatval> | HFUNCTION}
This sets the switch cost to the specified value, which must be larger than or equal to zero. The default value is 2.0. If HFUNCTION is given instead of a value, the switching cost is determined by the node distance between H-tree elements, as in modified Brooks Parsimony Analysis (not yet implemented).

cost = {DEFAULT | MC | BPA | FITCH}
This sets all the event-cost assignments at the same time. If DEFAULT is specified, the cost values are set to the defaults (see above). If MC is specified, the cost values are set as appropriate for maximum codivergence analysis (ccost = -1, ucost = scost = icost = 0). If BPA is specified, the cost values are set as appropriate for modified Brooks Parsimony Analysis (ccost = INFINITY, ucost = 0, scost = 1, icost = HFUNCTION) (not yet available). If FITCH is specified, the cost values are set up for Fitch optimisation (ccost = INFINITY, ucost = 0, scost = INFINITY, icost = 1). INFINITY represents an arbitrarily large number (in practice, 10 000 is used).
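The cost presets can be read as small tables of four numbers. The following Python sketch is not part of TreeFitter; it only illustrates how the presets would translate into totals, under the assumption that the cost of a reconstruction is the sum over the four event types of event count times event cost. The names PRESETS, INFINITY and reconstruction_cost are hypothetical, and BPA is left out because its switch cost (HFUNCTION) is not a fixed number.

```python
# Illustrative sketch only; not TreeFitter code. All names are hypothetical.
INFINITY = 10_000.0  # the manual states that 10 000 is used in practice

# (ccost, ucost, scost, icost) for the presets that have fixed numeric costs
PRESETS = {
    "DEFAULT": (0.0, 0.0, 1.0, 2.0),
    "MC": (-1.0, 0.0, 0.0, 0.0),
    "FITCH": (INFINITY, 0.0, INFINITY, 1.0),
}


def reconstruction_cost(preset, codivergences, duplications, sortings, switches):
    """Assumed total cost: sum of (event count x event cost) over the four event types."""
    costs = PRESETS[preset.upper()]
    counts = (codivergences, duplications, sortings, switches)
    return sum(n * c for n, c in zip(counts, costs))


if __name__ == "__main__":
    # A made-up reconstruction: 3 codivergences, 1 duplication, 2 sortings, 1 switch
    print(reconstruction_cost("default", 3, 1, 2, 1))  # 2 * 1.0 + 1 * 2.0 = 4.0
    print(reconstruction_cost("mc", 3, 1, 2, 1))       # 3 * -1.0 = -3.0
```

Under MC only codivergences contribute, so minimising the total is the same as maximising the number of codivergence events, which matches the name of the preset.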
polyresolve = <intval>
When a polytomous P-tree is read, the polytomies are arbitrarily resolved to produce one or more binary trees. The value of polyresolve determines how many arbitrarily resolved trees are produced. If polyresolve is set to 1, only one tree of the same weight as the original tree is produced. If polyresolve is set to a value larger than 1, each resolved tree receives a weight corresponding to the weight of the original tree divided by the number of resolved trees produced (not yet available).

mstaxa = {RECENT | ANCIENT | FREE}
Determines whether a widespread P-tree terminal is treated using the recent, ancient or free option [Ronquist, in press #1591]. Default setting is RECENT.

estimate <options>;

This command explores different event-cost assignments and their effects on the possibilities of finding phylogenetically conserved association patterns. The p values obtained with different event-cost assignments are reported. It is then up to the user to evaluate the results and to set the event-cost assignments accordingly. (The parameter space tested is currently hard-coded.)

cmin = <floatval>
Determines the minimum codivergence cost. Default setting is 0.0.

cmax = <floatval>
Determines the maximum codivergence cost. Default setting is 0.0.

cstep = <floatval>
Determines the interval between successive codivergence costs tried. Default setting is 0.2.

umin = <floatval>
Determines the minimum duplication cost. Default setting is 0.0.

umax = <floatval>
Determines the maximum duplication cost. Default setting is 0.0.

ustep = <floatval>
Determines the interval between successive duplication costs tried. Default setting is 0.5.

smin = <floatval>
Determines the minimum sorting cost. Default setting is 1.0.

smax = <floatval>
Determines the maximum sorting cost. Default setting is 1.0.

sstep = <floatval>
Determines the interval between successive sorting costs tried. Default setting is 0.5.

imin = <floatval>
Determines the minimum switching cost.
Default setting is 0.0.

imax = <floatval>
Determines the maximum switching cost. Default setting is 10.0.

istep = <floatval>
Determines the interval between successive switching costs tried. Default setting is 0.5.

fit <options>;

This command will fit the selected H-trees onto the selected P-trees using the currently chosen event-cost assignments (altered with the set command). Available options:

output = {SUMMARY | STANDARD | DETAILED}
The setting of this option determines the type of report produced by the fit command. If SUMMARY is chosen, only the cost (and p value, if relevant) is printed for each H-tree. If STANDARD is chosen, a more detailed report is printed for each H-tree. If DETAILED is chosen, results are printed separately for each P-tree.

perm = {HTERMS | PTERMS | HPTERMS | HTREE | PTREE | HPTREE}
The setting of this option determines the type of permutation used to test the significance of results. If HTERMS is chosen, H-tree terminals are permuted; if PTERMS is chosen, P-tree terminals are permuted instead; and if HPTERMS is selected, both H-tree and P-tree terminals are permuted. If HTREE is chosen, a random H-tree is drawn for each permutation; if PTREE is chosen, a random P-tree is drawn instead. Finally, if HPTREE is chosen, both the H-tree and the P-tree are replaced by random trees. The tree universe used for the random trees is set by the treespace option.

nperm = <intval>
Sets the number of permutations used in permutation tests of the fit. If 0 is chosen, no permutations will be performed.

calcevents = {YES | NO}
Determines whether the program will calculate the frequency of different types of events (codivergences, duplications, sortings and switches) when fitting H-trees and P-trees. The reported frequency is the range (minimum and maximum) over the equally optimal reconstructions.

showancstates = {YES | NO}
Determines whether the ancestral states (the ancestral hosts) are output for each P-tree.
Ignored unless output = DETAILED (not yet available).

showreconstructions = {YES | NO}
Determines whether the optimal reconstructions are output for each P-tree. Ignored unless output = DETAILED (not yet available).

search <options>;

Searches for the best H-tree given the selected P-trees. Available options:

type = {EXHAUSTIVE | HEURISTIC}
Determines whether an exhaustive or a heuristic search will be used.

start = {RANDOM | STEPWISE}
Determines whether a random tree or a stepwise built tree will be used as the starting point for heuristic searches.

neighbourhood = <intval>
Determines the swapping neighbourhood of the TBR algorithm. Setting neighbourhood to 1 is slightly more extensive than NNI.

keep = {ONE | MIN | BOUND}
Determines whether the search should keep only one tree of minimum cost, all trees of minimum cost, or all trees with a maximum cost set by the bound option.

bound = <floatval>
Determines the maximum cost of the H-trees to be kept.

hterms = <list of H-tree terminals>
This option sets the H-tree terminals that should be included in the calculated H-trees. The list of H-tree terminals should be put within quotation marks. For instance, the line ‘hterms = “A B C”;’ would restrict the H-tree terminals to the areas named ‘A’, ‘B’ and ‘C’. By default, all H-tree terminals appearing in the range descriptions of the selected P-trees will be included in the calculated H-trees.

order <options>;

Determines the order of the nodes in the currently selected H-trees (not yet available).

keep = {ONE | MIN | BOUND}
Determines whether the search should keep only one tree of minimum cost for each starting tree, all trees of minimum cost, or all trees with a maximum cost set by the bound option.

bound = <floatval>
Determines the maximum cost of the H-trees to be kept.

type = {EXHAUSTIVE | HEURISTIC}
Determines whether an exhaustive or a heuristic search will be used.
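The keep and bound options can be pictured as a filter over H-trees that have already been given a cost. The sketch below is a hypothetical Python illustration, not TreeFitter's implementation. The function name keep_trees and the (name, cost) pairs are assumptions, and it assumes that BOUND retains every tree whose cost does not exceed the bound value.

```python
def keep_trees(scored, keep="MIN", bound=None):
    """Filter (name, cost) pairs in the way keep = ONE | MIN | BOUND is described.

    scored: list of (tree_name, cost) tuples (hypothetical representation)
    """
    if not scored:
        return []
    best = min(cost for _, cost in scored)
    mode = keep.upper()
    if mode == "ONE":
        # a single tree of minimum cost (here: the first one encountered)
        return [next(t for t in scored if t[1] == best)]
    if mode == "MIN":
        # every tree that ties for the minimum cost
        return [t for t in scored if t[1] == best]
    if mode == "BOUND":
        if bound is None:
            raise ValueError("keep = BOUND requires a bound value")
        # every tree whose cost does not exceed the bound (assumed inclusive)
        return [t for t in scored if t[1] <= bound]
    raise ValueError(f"unknown keep setting: {keep}")


if __name__ == "__main__":
    trees = [("h1", 4.0), ("h2", 3.0), ("h3", 3.0), ("h4", 6.5)]
    print(keep_trees(trees, "ONE"))
    print(keep_trees(trees, "MIN"))
    print(keep_trees(trees, "BOUND", bound=4.0))
```

With the made-up costs in the usage block, ONE returns only h2, MIN returns h2 and h3, and BOUND with a bound of 4.0 returns h1, h2 and h3.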
filter {htrees | ptrees} <options>;

This command will filter H-trees based on cost and P-trees based on whether they are informative about the relationships among the terminal taxa in the selected H-trees. Unlike all other commands, the options used for this command are not persistent; they have to be given each time the command is invoked (not yet available).

cost = <floatval>
This option sets the maximum cost value of the H-trees to be retained in memory. It will be ignored if P-trees are being filtered. The default value is INFINITY.

compress = {YES | NO}
Determines whether H-trees that have the same cost and differ only in the order of the splitting events should be compressed to a single tree.

log <file-name>;

This command is used to log the results to a file with the specified name. The file will be stored in the same directory as the TreeFitter program.

execute <file-name>;

This command will execute the file with the specified name. The file must be in the same directory as the TreeFitter program, unless the correct path is given as part of the file name.
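A file passed to execute can be written by hand or generated by a script. The Python sketch below is only an illustration of how the block types described in this manual fit together in one batch file; it is not distributed with TreeFitter. The function name build_batch is hypothetical, and the way options are spelled on the set and fit lines is an assumption based on the option lists above, so it should be checked against the examples in Appendix 1 before use.

```python
def build_batch(htrees, ptrees, ranges, commands):
    """Assemble a NEXUS-style batch file from the block types described in the manual.

    htrees, ptrees: dicts mapping tree name -> Newick description (without ';')
    ranges: dict mapping P-tree name -> range description (without ';')
    commands: list of TreeFitter command strings (without ';')
    """
    lines = ["#NEXUS", "", "Begin AREAS;"]
    lines += [f"    tree {name} {newick};" for name, newick in htrees.items()]
    lines += ["End;", "", "Begin ORGANISMS;"]
    lines += [f"    tree {name} {newick};" for name, newick in ptrees.items()]
    lines += ["End;", "", "Begin DISTRIBUTIONS;"]
    lines += [f"    range {name} {desc};" for name, desc in ranges.items()]
    lines += ["End;", "", "Begin TREEFITTER;"]
    lines += [f"    {cmd};" for cmd in commands]
    lines.append("End;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_batch(
        htrees={"hypothesis1": "((WN:2,EN:2):1,(WP:1,EP:1):2)"},
        ptrees={"lithobiusA": "((lava,lcom),(lser,lbig))"},
        ranges={"lithobiusA": "lava: WN EN, lser: WP EN, lcom:EP, lbig:EP"},
        # Option spelling on these lines is an assumption, not confirmed syntax.
        commands=["set cost = DEFAULT", "fit nperm = 100 perm = PTERMS"],
    )
    print(text)
```

Generating the file from data structures keeps the terminal labels in the trees and in the range statements in one place, which helps with the case-sensitive label matching that the manual requires.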