Biol 206/306 – Advanced Biostatistics Lab 10 – Phylogenetically Independent Contrasts By Philip J. Bergmann 0. Laboratory Objectives 1. Learn why phylogenetic relationships need to be taken into account in comparative studies 2. Explore the phylo class object and learn how to work with phylogenies in R 3. Learn what Phylogenetically Independent Contrasts are and how to calculate them in R 4. Learn how Phylogenetically Independent Contrasts are typically analyzed 1. Analyzing Comparative Data Whenever a dataset includes variables that have been measured/recorded for multiple taxa, we are analyzing comparative data. Most often, we think of comparative data as constituting values collected from multiple species, but this is not always the case. Comparative data could include taxa above the species level, which are called clades of various levels of inclusion, sometimes ranked into genera, families, etc. Comparative data could also include taxa below the species level, including sub-species, varieties, or populations. The key here is that your units of comparison are related to one another through evolution. The problem facing an analysis of comparative data is that they are not independent, which is a key assumption of most statistical techniques. The reason for this is that taxa share part of their evolutionary history with other taxa. If you trace the branches of a phylogeny from the base towards the tips, where two taxa of interest appear, you will notice that part of the pathway you follow is the same for both taxa and part is unique to each taxon. The unique segments of the route are the time the two species were evolving independently of one another, and the common segments are where they were the same lineage. Try this out on the phylogeny shown: What are two taxa that have been evolving largely independently of one another given the phylogeny? What are two taxa that are minimally independent, having shared most of their evolutionary history? Biol 203/306 2012 – Page 1 A basic approach to ensuring that comparisons that you make among taxa are independent is called sister group comparisons. Sister groups share a node (branching event) on a phylogeny, and one can compare everything along one branch from the node with everything on the other branch from the same node. For example, in the phylogeny on the previous page, you can compare taxa A and B, but not taxa B and C. You can also compare taxa (A+B) and C. What are the two remaining sister group comparisons you can make on the phylogeny? Phylogenetically Independent Contrasts (PICs), the focus of today’s lab, are a formalized approach to sister group comparison that is far more powerful because it is developed in an explicitly statistical and evolutionary context. There are a number of components to PICs. First of all, they are contrasts, which are general statistical approach that makes comparisons through looking at differences. So, a contrast is a difference. If you have two species (1 & 2) for which you are comparing a variable (X), you could calculate the contrast simply as X1-X2. Note that the order in which you do this operation is arbitrary, so there is no reason to do X1-X2 in favor of X2-X1. This means that the sign (+/-) of a contrast is also arbitrary. In the PICs approach you can calculate the value for variable X for each of the nodes, allowing for n-1 contrasts, where n is the number of taxa on the phylogeny. Using the phylogeny on the previous page, list all of the contrasts that you can make. The nodes are already labeled for you for convenience. The contrasts that you list above are now independent of phylogeny because they only include sister group comparisons. However, PICs are a little more complicated because the branches on a phylogeny are typically of different lengths and are frequently interpreted as being proportional to evolutionary time. The PICs approach assumes that your trait of interest, X, evolves following a Brownian Motion (BM) model. In this model, the maximum amount of divergence between two taxa is proportional to time (in this case, branch lengths, and actually the sum of the two descending branch lengths). PICs are, therefore, standardized by their variance, which is equivalent to the sum of their branch lengths. This means that the contrasts themselves (differences) are divided by the summed branch lengths. An important assumption of PICs is that the branch lengths of the phylogeny adequately standardize the PICs. How would you test this assumption? We will not go into the details of how you calculate PICs beyond what is explained above because we have covered this in the lecture and discussion on PICs. PICs are also implemented in R, so will be calculated for you. Biol 203/306 2012 – Page 2 2. Working with Phylogenies in R The ape package for R treats phylogenies as objects of class phylo, and provides various functions necessary for reading and manipulating phylogenetic trees, and for doing some analyses. The packages geiger and phytools provided added functionality, mainly in the analysis of phylogenetic and comparative data. If you have not done so previously, install ape, geiger and phytools (don’t confuse it with phylotools, which is another R package). Load these packages. TIP: If you load phytools, it depends on the other two packages, and so all will be loaded automoatically. Also, download the phylogeny and dataset for this week’s lab from the course website. The most important and first step when using phylogenies in R is loading them into the work space. The following function does this very simply: > read.tree(“filename.tre”) Note that this is the appropriate function for a Newick format tree (Take a look at the tree file you downloaded using notepad to see what Newick format looks like – it is akin to a Venn diagram, in that it represents the phylogeny as a nested series of brackets). If you have a Nexus format tree (this is simply another file format for storing phylogenies), then a different function, read.nexus() should be used. We will use only Newick trees in this class, so do not worry about Nexus files. Use the read.tree function to load the phylogeny for the lab into R, and assign the phylogeny to an object called “tree”. If you just type your tree object’s name, you get very basic information: the number of taxa, the number of nodes, some of the taxon names, and a few other pieces of information. To visualize the phylogeny, simply use plot(tree). What is the sister group of the species Scel. spinosus? You can find out more about your phylogeny using the functions: > is.binary.tree(tree) > is.ultrametric(tree) A binary tree is another term for a tree that is fully resolved, meaning that all nodes are dichotomous, or have exactly two descendents. An ultrametric tree is one where all of the tips line up at the same level (all taxa are contemporary/extant). Both of these functions return TRUE if the tree is binary or ultrametric, and FALSE otherwise. If a tree is not binary, but you need it to be, you can resolve any nodes with >2 descendents randomly , producing new nodes, separated by branches of length zero. Use the function: > multi2di(tree) If you need to resolve an incompletely resolved tree. Is the tree you have loaded fully resolved? Is it ultrametric? Another useful trick for working with phylogenies is using the str(tree) function, which , as usual, tells you the components of the object. $edge provides you with a Sx2 matrix, where S is the number of branches in the phylogeny. If each node, including the tips (taxa) are numbered, then $edge gives you the ancestor and descendent node numbers for each branch. Note that “edge” and “branch” are synonymous. Branch is typically used in biology, edge is typically used in graph theory. $Nnode tells you how many nodes are on your tree. $tip.label is a list of your taxa. $edge.length is a vector of length S that gives you the length of each branch. $root.edge is Biol 203/306 2012 – Page 3 the length of the single branch at the base (root) of the tree. Look at the structure (str) of your tree. How many nodes does it have? How many branches does it have? What is the root branch length? As mentioned in the introduction to PICs, in some cases, when a phylogeny’s branch lengths do no t adequately standardize the contrasts, you may need to transform the branch lengths. Because tree$edge.length is a vector that contains all the branch lengths, transformation can be quite easy to do. One common transformation that people do to branch lengths in an attempt to meet the assumption of PICs is to convert all of the branch lengths to 1. The tree you have loaded contains 74 branches, and so we need to create a vector containing 74 ones and substitute it into the phylo object. However, we also want to keep the original tree as an object, so you first need to create a new phylo object that contains the tree we will modify. You can do these steps as follows: > tree1s <- tree > tree1s$edge.length <- rep(1,74) Plot the new tree to see what it looks like – all branch lengths should be the same length. Is the new tree ultrametric? Another transformation that could be useful is to square-root-transform all the branch lengths on a tree. Do this using the same approach as above, but save this tree to an object called “tree_sqrt”. Is this tree ultrametric? A final useful transformation is making a non-ultrametric tree ultrametric. The first step is to see how long the tree is from root to tip. Then we can use that information to generate an ultrametric tree that is the same length. Do this now, assigning the resulting tree to “tree1_ultra”. > nodeHeights(tree1s) This function is part of the phytools package, and gives you the distance of each node from the base of the tree (the base is at zero). The output is the node Height from the base for each branch, with the first column being the ancestral side of a given branch, and the second being the descendent side of each branch. If you take the maximum of this object (the greatest number), you have the height of the entire tree. > chronopl(tree1s, lambda, age.min=y) This function is part of the ape package, and transforms the tree into a chronogram, which is another term for an ultrametric tree. The first argument is the tree you are transforming, the second is a smoothing parameter (just use 1 for this in this lab), and the third one gives the length of the tree, so y=max(nodeHeights(tree1s)). Is your new tree ultrametric? Biol 203/306 2012 – Page 4 4. Calculating Phylogenetically Independent Contrasts in R You now have a collection of trees and are ready to load some data for the species in the trees and see how they have evolved. The dataset we will be using today is a lizard body shape dataset that we haven’t used yet. Download and view the Excel spreadsheet. There are two worksheets in the file. The first, called “Raw_Data” contains measurements for adult individuals of a number of species of phrynosomatine lizards. The variables are explained to the right. The second worksheet, “Spp_Data”, contains one line for each species, with the average value for each for the relative body proportions, and the maximum SVL. All of these data are calculated from the Raw_Data worksheet. You will be using the “Spp_Data” worksheet for this lab, so save it as a tab delimited text file and load it into R, assigning it do object “ldata”. In this dataset, what are the experimental units on which we will be doing analyses? Why are most of the variables in the dataset “relative” lengths? And why do we also include maximum SVL? When you have a dataset that you plan to analyze using a phylogeny, the first step is to ensure that the same taxa are in the dataset and in the tree. Something as simple as a minor typo in one but not the other can result in a lot of time spent solving problems. The following function checks if the same taxa are on the tree and in the dataset (part of the Geiger package): >name.check(tree,data) Use the above function to confirm that the same taxa are in your tree and loaded data. Now delete one of the taxa in the tree, saving the new tree as “tree_demo”: > tree_demo <- drop.tip(tree, “Uro_ornatus”) This function is part of the Geiger package and allows you to edit trees in yet another way. Repeat your name.check on the new tree and your ldata object. What do you get? It is now time to calculate PICs for each of your variables. We will use the following function from the ape package: > pic(x, tree, scaled=TRUE, var.contrasts=FALSE) Here, x is a vector containing the variable for which you want to calculate contrasts, tree is your phylo class tree object, scaled refers to whether the contrasts should be standardized by the phylogeny branch lengths (you virtually always would want this), and var.contrasts specifies whether you want to get the variances for each contrast, calculated from the branch lengths. Note that although this function provides the variances of the contrasts, contrasts are standardized by not variances, but standard deviations! The assumption of PICs is also that the contrasts are adequately standardized by their standard deviations, not their variances. Calculate PICs for each of the variables in your dataset, assigning the output for each to a different Biol 203/306 2012 – Page 5 object. Be sure to calculate the scaled PICs and also obtain the variances. Is the output in the form of a matrix or a data frame? Now that you have the PICs and their variances calculated, it is time to test the assumption that they adequately standardized. The easiest and probably most rigorous way to do this is to calculate the Pearson correlation between each set of PICS and their standard deviations and to test if that correlation is significant. What would a significant correlation between a set of PICs and their standard deviations mean? For each set of contrasts, use the cor.test function to test your assumption. Provide the results in the table below. Variable rBW rFLL rHLL rTL SVL Original Tree R p Branches=1 R p R p Which sets of PICs (for which variables) do not meet the assumption of standardization by their branch lengths? Put an asterisk (*) beside the p-value for any that do not meet the assumption. For these variables, you will need to redo the PIC calculation, but using a tree with transformed branch lengths. Try the tree with all branch lengths equal to 1. Do this only for the variables that did not have adequately standardized branch lengths. Are any of the new PICs still not standardized properly? If they are not, then choose another of the transformed trees and redo the PIC calculations for the appropriate variables. Repeat until you have adequately standardized contrasts. 5. Analysis of Trait Co-evolution Using PICs Now that you have a set of standardized PICs for each of the body shape variables of interest, you are ready to do the actual analysis of the now-independent data points. We will use linear regressions and Pearson correlations to consider the co-evolutionary relationships among traits, and we will do this in a pairwise manner. However, there are a few extra things to consider when doing these analyses on PICs. First of all, since the sign of the contrasts is arbitrary (see above for explanation), we need to account for this. When we regress PICs for one variable Biol 203/306 2012 – Page 6 against PICs for another variable, those for the variable (generally the one on the x axis by convention) need to be “positivized”. This essentially means that we take their absolute values and use these in the regression/correlation analysis. However, when we do this, we need to reverse the sign of a PIC for the y-variable if we switched the sign for the x-variable. To do this, we should first put each pair of PICs into a new object. Then we can use an if statement to do the work. There are two ways to do this: > if(condition1=TRUE) {statements 1} else {statements 2} > ifelse(test, true_statement, false_statement) In the if() function, condition1=TRUE can be any logical statement, for example “x<0”. If that condition is true (if x is less than zero), then statements 1 are carried out, but if the condition is false, then statements 2 are carried out. In the ifelse() function, test is the condition, followed by what to do if test is true and if it is false. The ifelse() function is better if your series of statements are simple. To do what we need, we also need to place the if or ifelse functions in a for loop. Start by placing the standardized contrasts for two variables into a new object, for example: > rFLL_SVL <- cbind(PICrFLL[,1],PICSVL[,1]) Now it is time to do the if function within the for loop: > rFLL_SVL_use <- rFLL_SVL > for (i in 1:length(rFLL_SVL[,1])) > {ifelse(rFLL_SVL[i,2]<0, rFLL_SVL_use[i,]<-rFLL_SVL[i,]*-1, rFLL_SVL_use[i,]<-rFLL_SVL[i,])} The first line above makes a new object that will contain your analyzable PICs (positivized as appropriate). The example just fills the object with the unpositivized PICs, which simply ensures that the dimensions are the same (you could do this in other ways as well). The second line sets up the for loop to count so that i goes from 1 to the number of rows in your newly produced PIC dataset. The third line then is applied to each row in the object holding your PICs. If the ith cell in the second column is less than zero, then the whole row is multiplied by -1 (switching the second column PIC to positive, and reversing the sign on the first column PIC). If the ith cell of the second column is not less than zero, then the line is left untouched. Again, there are other ways in which this could be implemented. Considering the lines above, the PICs for which variable are being positivized? Use the lines above for rFLL and SVL PICs. You now have an object that is ready to be analyzed using correlation and regression. Since there are five variables that we wish to study in a pairwise manner, there are ten combinations of variables that you need to do this for. Keep track of which variable you are positivizing by keeping it in the same column in each of your ten new datasets. We will then use this positivized variable as x for our analyses. The second modification that needs to be done when doing a regression analysis on PICs is that the regression must be run through the origin (0,0). What this means is that the regression is estimated with a slope but without an intercept. This is simply done in R by expressing the Biol 203/306 2012 – Page 7 linear model in an lm() function as: Y~X-1. The -1 simple means “no intercept”. Why do we need to run the regression of PICs through the origin? It is now time to do your regression and correlation analysis. For each of your (positivized) pairwise PIC datasets, do a Pearson correlation using cor.test to obtain the R-value and its associated p-value. Then do a regression through the origin using the summary() and lm() functions. Since all of the variables that we are studying have error associated with them, you will need to convert the OLS slopes to RMA slopes (Hint: Neither the RMA nor the lmodel2 functions are set up to do this – you can do this conversion using a very simple calculation). Also, since we are doing multiple pairwise comparisons, you will need to adjust your alpha values to account for this. Use the BH method but use an experiment-wide alpha of 0.1 instead of 0.05 (Hint: Refer to lab 2 for how to calculate BH alpha values). Put an asterisk beside p-values that remain significant after BH correction. You may notice that the lm() function provides a p-value and correlation coefficient as well. However, the p-value corresponds to the OLS slope, testing the null hypothesis that it is not significantly different from zero. Since we are using RMA slopes, that p-value is not valid, which is why we need to use cor.test as well. X var Y var n R p OLS RMA At long last, you have completed an analysis of phylogenetically independent contrasts. Provide a biological interpretation of the results you presented in the above table. Attach an extra page if needed. Biol 203/306 2012 – Page 8