DOCX

advertisement
Biol 206/306 – Advanced Biostatistics
Lab 10 – Phylogenetically Independent Contrasts
By Philip J. Bergmann
0. Laboratory Objectives
1. Learn why phylogenetic relationships need to be taken into account in comparative
studies
2. Explore the phylo class object and learn how to work with phylogenies in R
3. Learn what Phylogenetically Independent Contrasts are and how to calculate them in R
4. Learn how Phylogenetically Independent Contrasts are typically analyzed
1. Analyzing Comparative Data
Whenever a dataset includes variables that have been measured/recorded for multiple taxa, we
are analyzing comparative data. Most often, we think of comparative data as constituting values
collected from multiple species, but this is not always the case. Comparative data could include
taxa above the species level, which are called clades of various levels of inclusion, sometimes
ranked into genera, families, etc. Comparative data could also include taxa below the species
level, including sub-species, varieties, or populations. The key here is that your units of
comparison are related to one another through evolution.
The problem facing an analysis of comparative data is that they are not independent, which is a
key assumption of most statistical techniques. The reason for this is that taxa share part of their
evolutionary history with other taxa. If you trace the branches of a phylogeny from the base
towards the tips, where two taxa of interest appear, you will notice that part of the pathway you
follow is the same for both taxa and part is unique to each taxon. The unique segments of the
route are the time the two species were evolving independently of one another, and the common
segments are where they were the same lineage. Try this out on the phylogeny shown:
What are two taxa that have been
evolving largely independently of
one another given the phylogeny?
What are two taxa that are
minimally independent, having
shared most of their evolutionary
history?
Biol 203/306 2012 – Page 1
A basic approach to ensuring that comparisons that you make among taxa are independent is
called sister group comparisons. Sister groups share a node (branching event) on a phylogeny,
and one can compare everything along one branch from the node with everything on the other
branch from the same node. For example, in the phylogeny on the previous page, you can
compare taxa A and B, but not taxa B and C. You can also compare taxa (A+B) and C. What
are the two remaining sister group comparisons you can make on the phylogeny?
Phylogenetically Independent Contrasts (PICs), the focus of today’s lab, are a formalized
approach to sister group comparison that is far more powerful because it is developed in an
explicitly statistical and evolutionary context. There are a number of components to PICs. First
of all, they are contrasts, which are general statistical approach that makes comparisons through
looking at differences. So, a contrast is a difference. If you have two species (1 & 2) for which
you are comparing a variable (X), you could calculate the contrast simply as X1-X2. Note that
the order in which you do this operation is arbitrary, so there is no reason to do X1-X2 in favor
of X2-X1. This means that the sign (+/-) of a contrast is also arbitrary. In the PICs approach you
can calculate the value for variable X for each of the nodes, allowing for n-1 contrasts, where n is
the number of taxa on the phylogeny. Using the phylogeny on the previous page, list all of the
contrasts that you can make. The nodes are already labeled for you for convenience.
The contrasts that you list above are now independent of phylogeny because they only include
sister group comparisons. However, PICs are a little more complicated because the branches on
a phylogeny are typically of different lengths and are frequently interpreted as being proportional
to evolutionary time. The PICs approach assumes that your trait of interest, X, evolves following
a Brownian Motion (BM) model. In this model, the maximum amount of divergence between
two taxa is proportional to time (in this case, branch lengths, and actually the sum of the two
descending branch lengths). PICs are, therefore, standardized by their variance, which is
equivalent to the sum of their branch lengths. This means that the contrasts themselves
(differences) are divided by the summed branch lengths. An important assumption of PICs is
that the branch lengths of the phylogeny adequately standardize the PICs. How would you test
this assumption?
We will not go into the details of how you calculate PICs beyond what is explained above
because we have covered this in the lecture and discussion on PICs. PICs are also implemented
in R, so will be calculated for you.
Biol 203/306 2012 – Page 2
2. Working with Phylogenies in R
The ape package for R treats phylogenies as objects of class phylo, and provides various
functions necessary for reading and manipulating phylogenetic trees, and for doing some
analyses. The packages geiger and phytools provided added functionality, mainly in the
analysis of phylogenetic and comparative data. If you have not done so previously, install ape,
geiger and phytools (don’t confuse it with phylotools, which is another R package). Load
these packages. TIP: If you load phytools, it depends on the other two packages, and so all
will be loaded automoatically. Also, download the phylogeny and dataset for this week’s lab
from the course website.
The most important and first step when using phylogenies in R is loading them into the work
space. The following function does this very simply:
> read.tree(“filename.tre”)
Note that this is the appropriate function for a Newick format tree (Take a look at the tree file
you downloaded using notepad to see what Newick format looks like – it is akin to a Venn
diagram, in that it represents the phylogeny as a nested series of brackets). If you have a
Nexus format tree (this is simply another file format for storing phylogenies), then a different
function, read.nexus() should be used. We will use only Newick trees in this class, so do not
worry about Nexus files. Use the read.tree function to load the phylogeny for the lab into R,
and assign the phylogeny to an object called “tree”.
If you just type your tree object’s name, you get very basic information: the number of taxa, the
number of nodes, some of the taxon names, and a few other pieces of information. To visualize
the phylogeny, simply use plot(tree). What is the sister group of the species Scel. spinosus?
You can find out more about your phylogeny using the functions:
> is.binary.tree(tree)
> is.ultrametric(tree)
A binary tree is another term for a tree that is fully resolved, meaning that all nodes are
dichotomous, or have exactly two descendents. An ultrametric tree is one where all of the tips
line up at the same level (all taxa are contemporary/extant). Both of these functions return
TRUE if the tree is binary or ultrametric, and FALSE otherwise. If a tree is not binary, but you
need it to be, you can resolve any nodes with >2 descendents randomly , producing new nodes,
separated by branches of length zero. Use the function:
> multi2di(tree)
If you need to resolve an incompletely resolved tree. Is the tree you have loaded fully resolved?
Is it ultrametric?
Another useful trick for working with phylogenies is using the str(tree) function, which , as
usual, tells you the components of the object. $edge provides you with a Sx2 matrix, where S is
the number of branches in the phylogeny. If each node, including the tips (taxa) are numbered,
then $edge gives you the ancestor and descendent node numbers for each branch. Note that
“edge” and “branch” are synonymous. Branch is typically used in biology, edge is typically used
in graph theory. $Nnode tells you how many nodes are on your tree. $tip.label is a list of your
taxa. $edge.length is a vector of length S that gives you the length of each branch. $root.edge is
Biol 203/306 2012 – Page 3
the length of the single branch at the base (root) of the tree. Look at the structure (str) of your
tree. How many nodes does it have?
How many branches does it have?
What is the root branch length?
As mentioned in the introduction to PICs, in some cases, when a phylogeny’s branch lengths do
no t adequately standardize the contrasts, you may need to transform the branch lengths.
Because tree$edge.length is a vector that contains all the branch lengths, transformation can be
quite easy to do. One common transformation that people do to branch lengths in an attempt to
meet the assumption of PICs is to convert all of the branch lengths to 1. The tree you have
loaded contains 74 branches, and so we need to create a vector containing 74 ones and substitute
it into the phylo object. However, we also want to keep the original tree as an object, so you first
need to create a new phylo object that contains the tree we will modify. You can do these steps
as follows:
> tree1s <- tree
> tree1s$edge.length <- rep(1,74)
Plot the new tree to see what it looks like – all branch lengths should be the same length. Is
the new tree ultrametric?
Another transformation that could be useful is to square-root-transform all the branch lengths on
a tree. Do this using the same approach as above, but save this tree to an object called
“tree_sqrt”. Is this tree ultrametric?
A final useful transformation is making a non-ultrametric tree ultrametric. The first step is to see
how long the tree is from root to tip. Then we can use that information to generate an ultrametric
tree that is the same length. Do this now, assigning the resulting tree to “tree1_ultra”.
> nodeHeights(tree1s)
This function is part of the phytools package, and gives you the distance of each node from the
base of the tree (the base is at zero). The output is the node Height from the base for each branch,
with the first column being the ancestral side of a given branch, and the second being the
descendent side of each branch. If you take the maximum of this object (the greatest number),
you have the height of the entire tree.
> chronopl(tree1s, lambda, age.min=y)
This function is part of the ape package, and transforms the tree into a chronogram, which is
another term for an ultrametric tree. The first argument is the tree you are transforming, the
second is a smoothing parameter (just use 1 for this in this lab), and the third one gives the length
of the tree, so y=max(nodeHeights(tree1s)).
Is your new tree ultrametric?
Biol 203/306 2012 – Page 4
4. Calculating Phylogenetically Independent Contrasts in R
You now have a collection of trees and are ready to load some data for the species in the trees
and see how they have evolved. The dataset we will be using today is a lizard body shape
dataset that we haven’t used yet. Download and view the Excel spreadsheet. There are two
worksheets in the file. The first, called “Raw_Data” contains measurements for adult individuals
of a number of species of phrynosomatine lizards. The variables are explained to the right. The
second worksheet, “Spp_Data”, contains one line for each species, with the average value for
each for the relative body proportions, and the maximum SVL. All of these data are calculated
from the Raw_Data worksheet. You will be using the “Spp_Data” worksheet for this lab, so
save it as a tab delimited text file and load it into R, assigning it do object “ldata”. In this
dataset, what are the experimental units on which we will be doing analyses?
Why are most of the variables in the dataset “relative” lengths? And why do we also include
maximum SVL?
When you have a dataset that you plan to analyze using a phylogeny, the first step is to ensure
that the same taxa are in the dataset and in the tree. Something as simple as a minor typo in one
but not the other can result in a lot of time spent solving problems. The following function
checks if the same taxa are on the tree and in the dataset (part of the Geiger package):
>name.check(tree,data)
Use the above function to confirm that the same taxa are in your tree and loaded data. Now
delete one of the taxa in the tree, saving the new tree as “tree_demo”:
> tree_demo <- drop.tip(tree, “Uro_ornatus”)
This function is part of the Geiger package and allows you to edit trees in yet another way.
Repeat your name.check on the new tree and your ldata object. What do you get?
It is now time to calculate PICs for each of your variables. We will use the following function
from the ape package:
> pic(x, tree, scaled=TRUE, var.contrasts=FALSE)
Here, x is a vector containing the variable for which you want to calculate contrasts, tree is your
phylo class tree object, scaled refers to whether the contrasts should be standardized by the
phylogeny branch lengths (you virtually always would want this), and var.contrasts specifies
whether you want to get the variances for each contrast, calculated from the branch lengths.
Note that although this function provides the variances of the contrasts, contrasts are
standardized by not variances, but standard deviations! The assumption of PICs is also that the
contrasts are adequately standardized by their standard deviations, not their variances. Calculate
PICs for each of the variables in your dataset, assigning the output for each to a different
Biol 203/306 2012 – Page 5
object. Be sure to calculate the scaled PICs and also obtain the variances. Is the output in the
form of a matrix or a data frame?
Now that you have the PICs and their variances calculated, it is time to test the assumption that
they adequately standardized. The easiest and probably most rigorous way to do this is to
calculate the Pearson correlation between each set of PICS and their standard deviations and to
test if that correlation is significant. What would a significant correlation between a set of PICs
and their standard deviations mean?
For each set of contrasts, use the cor.test function to test your assumption. Provide the results
in the table below.
Variable
rBW
rFLL
rHLL
rTL
SVL
Original Tree
R
p
Branches=1
R
p
R
p
Which sets of PICs (for which variables) do not meet the assumption of standardization by
their branch lengths? Put an asterisk (*) beside the p-value for any that do not meet the
assumption. For these variables, you will need to redo the PIC calculation, but using a tree
with transformed branch lengths. Try the tree with all branch lengths equal to 1. Do this only
for the variables that did not have adequately standardized branch lengths. Are any of the
new PICs still not standardized properly? If they are not, then choose another of the
transformed trees and redo the PIC calculations for the appropriate variables. Repeat until
you have adequately standardized contrasts.
5. Analysis of Trait Co-evolution Using PICs
Now that you have a set of standardized PICs for each of the body shape variables of interest,
you are ready to do the actual analysis of the now-independent data points. We will use linear
regressions and Pearson correlations to consider the co-evolutionary relationships among traits,
and we will do this in a pairwise manner. However, there are a few extra things to consider
when doing these analyses on PICs. First of all, since the sign of the contrasts is arbitrary (see
above for explanation), we need to account for this. When we regress PICs for one variable
Biol 203/306 2012 – Page 6
against PICs for another variable, those for the variable (generally the one on the x axis by
convention) need to be “positivized”. This essentially means that we take their absolute values
and use these in the regression/correlation analysis. However, when we do this, we need to
reverse the sign of a PIC for the y-variable if we switched the sign for the x-variable. To do
this, we should first put each pair of PICs into a new object. Then we can use an if statement to
do the work. There are two ways to do this:
> if(condition1=TRUE) {statements 1} else {statements 2}
> ifelse(test, true_statement, false_statement)
In the if() function, condition1=TRUE can be any logical statement, for example “x<0”. If that
condition is true (if x is less than zero), then statements 1 are carried out, but if the condition is
false, then statements 2 are carried out. In the ifelse() function, test is the condition, followed
by what to do if test is true and if it is false. The ifelse() function is better if your series of
statements are simple. To do what we need, we also need to place the if or ifelse functions in a
for loop.
Start by placing the standardized contrasts for two variables into a new object, for example:
> rFLL_SVL <- cbind(PICrFLL[,1],PICSVL[,1])
Now it is time to do the if function within the for loop:
> rFLL_SVL_use <- rFLL_SVL
> for (i in 1:length(rFLL_SVL[,1]))
> {ifelse(rFLL_SVL[i,2]<0, rFLL_SVL_use[i,]<-rFLL_SVL[i,]*-1,
rFLL_SVL_use[i,]<-rFLL_SVL[i,])}
The first line above makes a new object that will contain your analyzable PICs (positivized as
appropriate). The example just fills the object with the unpositivized PICs, which simply ensures
that the dimensions are the same (you could do this in other ways as well). The second line sets
up the for loop to count so that i goes from 1 to the number of rows in your newly produced PIC
dataset. The third line then is applied to each row in the object holding your PICs. If the ith cell
in the second column is less than zero, then the whole row is multiplied by -1 (switching the
second column PIC to positive, and reversing the sign on the first column PIC). If the ith cell of
the second column is not less than zero, then the line is left untouched. Again, there are other
ways in which this could be implemented. Considering the lines above, the PICs for which
variable are being positivized?
Use the lines above for rFLL and SVL PICs. You now have an object that is ready to be
analyzed using correlation and regression. Since there are five variables that we wish to study
in a pairwise manner, there are ten combinations of variables that you need to do this for.
Keep track of which variable you are positivizing by keeping it in the same column in each of
your ten new datasets. We will then use this positivized variable as x for our analyses.
The second modification that needs to be done when doing a regression analysis on PICs is that
the regression must be run through the origin (0,0). What this means is that the regression is
estimated with a slope but without an intercept. This is simply done in R by expressing the
Biol 203/306 2012 – Page 7
linear model in an lm() function as: Y~X-1. The -1 simple means “no intercept”. Why do we
need to run the regression of PICs through the origin?
It is now time to do your regression and correlation analysis. For each of your (positivized)
pairwise PIC datasets, do a Pearson correlation using cor.test to obtain the R-value and its
associated p-value. Then do a regression through the origin using the summary() and lm()
functions. Since all of the variables that we are studying have error associated with them, you
will need to convert the OLS slopes to RMA slopes (Hint: Neither the RMA nor the lmodel2
functions are set up to do this – you can do this conversion using a very simple calculation).
Also, since we are doing multiple pairwise comparisons, you will need to adjust your alpha
values to account for this. Use the BH method but use an experiment-wide alpha of 0.1
instead of 0.05 (Hint: Refer to lab 2 for how to calculate BH alpha values). Put an asterisk
beside p-values that remain significant after BH correction. You may notice that the lm()
function provides a p-value and correlation coefficient as well. However, the p-value
corresponds to the OLS slope, testing the null hypothesis that it is not significantly different
from zero. Since we are using RMA slopes, that p-value is not valid, which is why we need to
use cor.test as well.
X var
Y var
n
R
p
OLS
RMA
At long last, you have completed an analysis of phylogenetically independent contrasts.
Provide a biological interpretation of the results you presented in the above table. Attach an
extra page if needed.
Biol 203/306 2012 – Page 8
Download