GSA_Denver_2013

advertisement
Not Applicable? Not a Problem for Parsimony
Troy
1Department
1
Fadiga
and Ann F.
2
Budd
of Earth and Planetary Sciences, University of Tennessee, 1412 Circle Drive, Knoxville, TN 37996
2Department
of Geoscience, University of Iowa, 121 Trowbridge Hall, Iowa City, IA 52242
The Problem with Not Applicable Characters
Absence Coding of Not Applicable Characters
Absence coding , one suggestion for overcoming the NA problem,
When describing and delineating characters for phylogenetic analyses, the characters should aim to be
treats the unobservable condition as a character state. This means in
biologically independent, but some desirable characters are logically dependent on the observed condition of
our example that the tail color will be coded as red, blue, or tail
other characters (Sereno 2007). The classic example, given by Maddison (1993), deals with the presence and
absent.
absence of a tail and the color of the tail --if it is present. When dealing with a tailless taxon, additional
characters dealing with aspects of the tail would be considered not applicable (NA).
This is problematic in several ways: it makes the “absent” tail state
redundant with the unobservable tail color condition and it asserts a
The problem with these types of characters was elaborated by Maddison (1993). Traditionally NA character state
homology based not on an observed attribute but the tautology that
codings were treated the same way as missing/unobserved characters. This resulted in equally parsimonious
no observable difference can be observed among unobservable
trees being erroneously scored for longer tree lengths.
phenomena. In the tail example, the redundancy imposed by this
method would automatically give the observation that a tail is absent
Purple arrows depict transitions we want
to include when figuring out the number
twice the weight of observing the presence of the tail. In fact, the
of steps in the most parsimonious trees.
weight of the absent tail can be further inflated by adding more
The solid red lines depict incorrect
characters that describe the tail, such as tail shape.
transitions that are consistently counted
when figuring out the number of steps in
a phylogenetic tree.
While this approach is rightly rejected by researchers, there are two
improvements over the traditional treatment of NA coding: it places
the transitions at the correct places on the trees and its treatment of
newly applicable characters is consistent.
This diagram depicts the logical
relationship between characters (in
rectangles) and character states (in
rounded boxes) with black arrows. Purple
arrows depict transitions we want to
include when figuring out the number of
steps in the most parsimonious trees.
The red, dashed lines depict incorrect
transitions that are intermittently
counted when figuring out the number
of steps in a phylogenetic tree.
P- present, A- absent, R- red, B- blue
The above tree summarizes our
knowledge about a hypothetical clade.
Below are two trees that are both
equally consistent with the above tree
and equally parsimonious. The lower
trees have reconstructed the tail color,
mistakenly predicting the tail color for
taxa that have no tails. Even though both
trees below are equally parsimonious,
current algorithms mistakenly retrieve
different numbers of implied transitions
for the tail color character.
Composite Coding
Composite coding takes every logical
combination of character states from a set of
characters and yields a single character with
4
3
2
1
5
6
7
8
9
10
11
12
1
2
3
4
5
6
7
8
9
10
11
12
each logical combination as a character state.
Strong and Lipscomb (1999) critiqued the use
of composite coding when there are
secondary losses of the feature which other
Three inferred changes in tail color
Two inferred changes in tail color
The above diagram depicts the relationship between three characters: Tail
characters describe, but the critique is
(present/absent), Tail color (red/blue), and Tail shape (round/flat). Note that
mistaken. Their argument was based on a
transitions between a flat ,blue tail and a round, red tail is the same as
changes implying only a change in tail color or tail shape.
mappings on a particular tree and should be
Both of the above trees show the two independent acquisitions of tails. The inferred changes in tail color should
dismissed.
count the same number of steps in both trees. Because one clade with tails is well supported and originally had
blue tails, the algorithms are going to penalize trees where the other independent acquisition of tails start off as
red.
failure to consider all possible character state
Composite coding can work, but it depends on the particular relationships between characters (see next panel).
In the diagram above, three characters are composite coded and transitions that should be considered multiple
transitions are reduced to a single implied transition.
This results in treating newly applicable characters inconsistently and causing what should be equally
parsimonious trees to have different tree lengths. The inconsistent treatment originates from the current
algorithms’ (Fitch and Sankoff) progressive serial sweeps along the nodes of proposed trees which fail to detect
inappropriate influences from more distant parts of the tree.
A step matrix is a matrix of costs that applies different tree lengths to different inferred transitions. If the
maximum number of transitions between any two combinations of character states is two, then a step matrix can
be used to preserve the correct number of transitions. If the maximum number of transitions between any two
combinations of character states is more than two, then the step matrix can violate triangular inequalities and
Real data sets are not always amenable to manual inspection for the “accidental assumptions” introduced by
this treatment of NA coding, making this traditional treatment for NA characters an unacceptable option. For
some sets of taxa numerous NA codings are required, meaning those groups are poorly served by current
software.
provide undercounts of the number of transitions implied by a given tree.
The Solution
The solution is a radically simple subtraction problem that includes the use of two matrices to count the proper number of implied transitions. The original matrix gets decomposed into two matrices. The first matrix is the absence coded version of
the original matrix. This first matrix is going to take advantage of the fact that the absence coded matrix recovers all of the inappropriate transitions between a NA condition and an applicable condition along with all of the appropriate transitions.
The traditional treatment might retrieve some or none of the inappropriate transitions. The second matrix only encodes whether the character is applicable or not applicable for each taxon. This matrix, when optimized on a tree will retrieve all
inappropriate transitions and none of the appropriate transitions. The difference between the number of transitions in the absence coded matrix and the applicability matrix gives the number of true transitions implied for a tree by the original
character matrix . This approach allows the continued use of either the Fitch or the Sankoff algorithms for determining the number of transitions implied by each matrix.
Taxon 1
Taxon 2
Taxon 3
Taxon 4
Taxon 5
Taxon 6
Taxon 7
Taxon 8
Char 1 Char 2 Char 3 Char 4 Char 5
0
1
1
NA
NA
0
0
0
NA
NA
0
1
1
NA
NA
1
1
0
0
1
1
1
0
0
0
1
0
1
1
1
1
0
1
1
1
1
1
1
1
0
This matrix records the original observations and indications
of which characters are not applicable for which taxa.
Taxon 1
Taxon 2
Taxon 3
Taxon 4
Taxon 5
Taxon 6
Taxon 7
Taxon 8
Char 1 Char 2 Char 3 Char 4 Char 5
0
1
1
2
2
0
0
0
2
2
0
1
1
2
2
1
1
0
0
1
1
1
0
0
0
1
0
1
1
1
1
0
1
1
1
1
1
1
1
0
This matrix encodes the original matrix and changes the
NA codings to a character state that will be treated the
same as other character states.
Taxon 1
Taxon 2
Taxon 3
Taxon 4
Taxon 5
Taxon 6
Taxon 7
Taxon 8
Char 4 Char 5
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
This matrix only indicates whether a character is applicable or not
applicable for a taxon. Note that characters 1-3 were dropped
because they were applicable to all taxa and would be
uninformative in this matrix
These diagrams show the logical
dependencies between the characters and
character states (black arrows), all of the
appropriate transitions (purple arrows), and
inappropriate transitions (red arrows)
detected by each matrix. The method takes
advantage of absence coded matrices
implying a knowable number of inappropriate
transitions for a given tree.
Actions for Analyses
While the procedure is implemented in R and can be done with Mesquite (Maddison and Maddison 2011), no current software can search for trees that maximizes the difference in tree length of two
matrices in a time-appropriate manner that does a reasonable job of exploring tree space. If the most parsimonious trees all have the NA characters applicable for a monophyletic group, then the tree
length calculated by the traditional method is correct. Unfortunately this can’t be known a priori. When assessing which approach to use, a diagram of all of the logical dependencies between characters is
very valuable. Groups of characters with logical dependencies are called character complexes.
When to use composite coding
Use composite coding when the number of transitions implied by any combination of character states in a character complex is one. While this seems very restrictive, it is very frequently met. Most
instances of NA coding occur in one character that is logically dependent on the presence of one piece of anatomy. Use composite coding and conventional software will return the correct trees and tree
length. If the character complex is more elaborate, the condition can still be met. If all the NA characters are only immediately logically dependent on one character and all characters have at most one
character dependent on it, then composite coding and current software will work correctly.
When to use a step matrix
A step matrix can be applied to a composite coded character if the step matrix that would preserve the appropriate number of transitions between composited states maintains triangular inequality–
meaning that the total tree length implied by transitions from A to B and B to C are never less than the tree length implied by the transition from A to C. This condition is usually met in the literature when
two, but not more, NA characters are dependent on the same character.
This diagram shows the logical dependencies between
the characters and character states (black arrows). All
character complexes which show this pattern (the
number of characters can vary) can be composite coded
and used with current software to avoid NA related
miscalculations of tree length and tree searches.
When neither of the above conditions are met, current software will not be guaranteed to return the correct trees and tree length. Mesquite can calculate and filter trees based on the difference in tree
length from two matrices. Use this software and the needed two matrices to see if the most parsimonious trees returned by current software all have the same tree length. When a fast implementation
of the proposed method is written, the advantages will be that it will work regardless of the nature of the logical dependencies within a character complex.
Acknowledgements and Literature Cited
This work was supported by the NSF grant DEB-0343208.
Maddison, W. P. 1993. Missing data versus missing characters in phylogenetic analysis. Systematic Biology 42, no. 4: 576.
Maddison, W. P., and D. R. Maddison. 2011. Mesquite: a modular system for evolutionary analysis. Version 2.75 http://mesquiteproject. org/mesquite/mesquite. html.
Sereno, Paul C. 2007. “Logical Basis for Morphological Characters in Phylogenetics.” Cladistics 23 (6): 565–587.
Strong, E. E, and D. Lipscomb. 1999. Character coding and inapplicable data. Cladistics 15, no. 4: 363–371.
Download