2014-SystBiol-Adams-SuppMaterial-Kmult

Supplementary Material from D.C. Adams, “A generalized K statistic for estimating phylogenetic
signal from shape and other high-dimensional multivariate data”. Systematic Biology.
Type I error and Statistical Power
Here I use computer simulations to show the statistical properties of the proposed method for
evaluating phylogenetic signal in multivariate data. Six sets of simulations were performed. The first three
sets of simulations were conducted on four perfectly balanced phylogenetic trees that differed in their
number of taxa: N = 16, 32, 64, 128. The remaining three sets of simulations were conducted on randomly
generated trees that differed in their number of taxa: N = 16, 32, 64, 128. For each simulation, a
phylogeny was specified, and the number of trait dimensions was specified (p = 2, 4, 6, 8, 10). Next, input
covariance matrices of dimension p × p were constructed, from which multivariate data were simulated
under Brownian motion. For simulations, three different models of input covariance structure were
utilized: 1) isotropic structure, where the input variance for each trait dimension was identical
($\sigma_1^2 = \sigma_2^2 = \dots$) and there was no input covariation between dimensions, 2) non-isotropic
structure, where the input variance in each trait dimension was allowed to differ
($\sigma_1^2 \neq \sigma_2^2 \neq \dots$) and there was no input covariance between trait dimensions, and
3) non-isotropic structure that included covariation among trait dimensions. For models with isotropic
covariance structure, $\sigma^2 = 1.0$ was chosen for all trait dimensions. For simulations modeling
non-isotropic covariance structure, the input $\sigma^2$ for each trait dimension was drawn from a normal
distribution ($\mu = 1.0$; std = 0.2), and the p × p
covariance matrix was constructed using these values as the diagonal elements. For simulations modeling
non-isotropic covariance structure with covariation among trait dimensions, a random p × p covariance
matrix was generated in the following manner. First, a lower-triangular matrix L was generated by
drawing values from the normal distribution ($\mu = 0$; std = 1). Next, the matrix product $LL^t$ was
calculated, which produces a positive semi-definite covariance matrix (following the Cholesky
decomposition: $\Sigma = LL^t$). Thus, $LL^t$ represents a random covariance matrix that includes differing
amounts of input variation among trait dimensions and covariation among trait dimensions.
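The three input covariance structures described above can be sketched as follows. This is a minimal illustration in standard-library Python, not the code used for the study (which was run in R); the function names are hypothetical, and the handling of random draws is an assumption based only on the description given here.

```python
import random


def isotropic_cov(p, sigma2=1.0):
    """Model 1: equal variance in every dimension, no covariation."""
    return [[sigma2 if i == j else 0.0 for j in range(p)] for i in range(p)]


def diagonal_cov(p, rng, mean=1.0, sd=0.2):
    """Model 2: variances drawn from a normal distribution, no covariation.

    With mean 1.0 and std 0.2 a negative draw is extremely unlikely;
    this sketch does not guard against it.
    """
    variances = [rng.gauss(mean, sd) for _ in range(p)]
    return [[variances[i] if i == j else 0.0 for j in range(p)] for i in range(p)]


def random_full_cov(p, rng):
    """Model 3: random covariance matrix formed as the product L L^t."""
    # Lower-triangular L with standard normal entries on and below the diagonal.
    L = [[rng.gauss(0.0, 1.0) if j <= i else 0.0 for j in range(p)] for i in range(p)]
    # Entry (i, j) of L L^t is the dot product of rows i and j of L.
    return [[sum(L[i][k] * L[j][k] for k in range(p)) for j in range(p)] for i in range(p)]


rng = random.Random(2014)  # fixed seed, chosen arbitrarily for repeatability
sigma_iso = isotropic_cov(4)
sigma_diag = diagonal_cov(4, rng)
sigma_full = random_full_cov(4, rng)
```

Because each entry of the third matrix is a dot product of two rows of L, the result is symmetric with non-negative diagonal elements, as a covariance matrix requires.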
From the initial covariance matrices, 1000 phenotypic datasets were obtained by evolving multidimensional traits along a phylogeny according to a Brownian motion model of evolution. Data were
simulated using the function ‘sim.char’ in the R package Geiger (Harmon et al. 2008). For tests of Type I
error, data were simulated on a star phylogeny, obtained by transforming the original phylogeny with the
lambda transformation ($\lambda = 0.0$). Phylogenetic signal in these datasets was then evaluated on the
original fully-resolved tree. For tests of statistical power, data were simulated and evaluated on the
resolved tree (for details of this approach see Blomberg et al. 2003). To obtain a known range of
phylogenetic signal, the branch lengths of the phylogeny were transformed by the parameter $\lambda$ prior to
simulating phenotypic data, where $\lambda = 0.0$ transforms the tree to a star phylogeny while $\lambda = 1.0$
represents the original fully-resolved tree. Transformation values used in this study were:
$\lambda$ = 0, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0. For those
simulations utilizing randomly generated trees, a new phylogeny was simulated for each data set. Across
all simulation conditions, Kmult was estimated for each dataset and statistically evaluated using
permutation. The proportion of significant results (out of 1000) was then treated as the Type I error or
statistical power of the test, depending upon initial simulation conditions.
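The bookkeeping of this procedure can be illustrated with a short sketch in standard-library Python. It rests on several assumptions. The lambda transformation is applied to the phylogenetic covariance matrix by scaling its off-diagonal elements, a standard formulation equivalent to rescaling internal branch lengths. The permutation test shuffles rows of the data among the tips. The P-value counts the observed value as one permutation, a common convention. The `statistic` argument is a hypothetical stand-in for the Kmult calculation, which is defined in the main article and not reproduced here, and all function names are illustrative.

```python
import random


def lambda_transform(C, lam):
    """Scale off-diagonal elements of a phylogenetic covariance matrix by lam.

    lam = 0.0 gives a star phylogeny (no shared history); lam = 1.0 leaves C unchanged.
    """
    n = len(C)
    return [[C[i][j] if i == j else lam * C[i][j] for j in range(n)] for i in range(n)]


def permutation_p_value(statistic, data, n_perm=999, rng=None):
    """Proportion of permuted statistics at least as large as the observed one.

    `data` is a list of rows (one per tip, in tip order); rows are shuffled among tips.
    """
    rng = rng or random.Random()
    observed = statistic(data)
    rows = list(data)  # shuffle a copy so the caller's data keep their order
    count = 1  # the observed value is counted as one permutation
    for _ in range(n_perm):
        rng.shuffle(rows)
        if statistic(rows) >= observed:
            count += 1
    return count / (n_perm + 1)


def rejection_rate(p_values, alpha=0.05):
    """Proportion of datasets with a significant test at level alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)


# Balanced four-taxon tree with unit branch lengths (illustrative values).
C = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
C_star = lambda_transform(C, 0.0)   # diagonal matrix: star phylogeny
C_half = lambda_transform(C, 0.5)   # intermediate level of phylogenetic signal
```

Applied to the P-values from the 1000 datasets of one simulation condition, `rejection_rate` returns the Type I error rate when data were simulated with $\lambda = 0.0$ and the statistical power otherwise.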
Results: For all simulations, hypothesis tests of phylogenetic signal displayed appropriate Type I
error rates near the nominal value of $\alpha = 0.05$. In addition, the statistical power of tests based on Kmult
increased rapidly as the degree of phylogenetic signal increased, and this pattern remained consistent
across the range of trait dimensionality examined in this study, as well as across the range of numbers of
species in the phylogeny. For balanced phylogenies, patterns were concordant between simulations where
data were generated using isotropic variation (Fig. A1), non-isotropic variation (Fig. A2), and non-isotropic
variation with covariance among trait dimensions (Fig. A3). This was also the case for randomly generated
phylogenies (Figs. A4-A6). The increase in power with the input level of phylogenetic signal became
steeper as the number of species in the phylogeny increased, and as the number of trait dimensions
increased (Figs. A1-A6).
Overall these simulations reveal that tests of phylogenetic signal based on Kmult display
appropriate Type I error and statistical power. Thus, Kmult is an appropriate method for evaluating
phylogenetic signal in high-dimensional datasets.
Fig. A1. Statistical power curves for tests evaluating phylogenetic signal using Kmult when data are
simulated on balanced phylogenies using an isotropic model to generate the data. Each point on each
power curve is the result of 1000 simulations at the conditions specified.
Fig. A2. Statistical power curves for tests evaluating phylogenetic signal using Kmult when data are
simulated on balanced phylogenies using a non-isotropic model to generate the data. Each point on each
power curve is the result of 1000 simulations at the conditions specified.
Fig. A3. Statistical power curves for tests evaluating phylogenetic signal using Kmult when data are
simulated on balanced phylogenies using a non-isotropic model with trait covariance to generate the data.
Each point on each power curve is the result of 1000 simulations at the conditions specified.
Fig. A4. Statistical power curves for tests evaluating phylogenetic signal using Kmult when data are
simulated on random phylogenies using an isotropic model to generate the data. Each point on each
power curve is the result of 1000 simulations at the conditions specified.
Fig. A5. Statistical power curves for tests evaluating phylogenetic signal using Kmult when data are
simulated on random phylogenies using a non-isotropic model to generate the data. Each point on each
power curve is the result of 1000 simulations at the conditions specified.
Fig. A6. Statistical power curves for tests evaluating phylogenetic signal using Kmult when data are
simulated on random phylogenies using a non-isotropic model with covariance among trait dimensions to
generate the data. Each point on each power curve is the result of 1000 simulations at the conditions
specified.
Literature Cited
Blomberg SP, Garland T, Ives AR. 2003. Testing for phylogenetic signal in comparative data: behavioral
traits are more labile. Evolution, 57:717-745.
Harmon LJ, Weir J, Brock C, Glor RE, Challenger W. 2008. GEIGER: Investigating evolutionary
radiations. Bioinformatics, 24:129-131.