A Brief Introduction to Phylogenetics with R
César Sánchez (csanc15@lsu.edu)
11/03/09

Given the flexibility that R possesses as a language and environment for data, graphical, and statistical analysis, it has become an important tool for scientists working in computational molecular biology. In the last decade there has been a surge of programs that use molecular sequences to build evolutionary hypotheses about species relationships, divergence times, and speciation rates (among many others), and these analyses can now be done in R.

Phylogenetic analyses have been done with R since the early 2000s, and a number of packages have been implemented to address different types of analysis. Some of the more commonly used packages include APE (Paradis et al. 2004), LASER (Likelihood Analysis of Speciation/Extinction Rates from Phylogenies, Rabosky 2006), GEIGER (Analysis of evolutionary diversification, Harmon et al. 2008), MATICCE (Mapping Transitions in Continuous Character Evolution, Hipp 2007), and ouch (Ornstein-Uhlenbeck models for phylogenetic comparative hypotheses, Butler & King 2004). Here I will show the use of the package ape and some of its many functions.

The package ape can be downloaded from http://cran.r-project.org/web/packages/ape/index.html, as can other packages that work together with ape, such as apTreeshape, seqinr, or ade4. All these packages have to be installed by the user and, given that they can be used jointly, it is best to load them all in the same session. Readers interested in learning more about ape should consult the book by Paradis (2006), as well as the R-phylo wiki (http://www.r-phylo.org/wiki/Main_Page).

Sequence downloading from the internet

I will start by exploring data from one of the many internet databases that hold genetic sequences. For the examples used here, sequences will be downloaded from GenBank.
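Before running the commands in this section, the packages have to be installed and loaded in the current session. The following is a minimal sketch of one way to check this, not a required procedure. The vector pkgs is a name chosen here for illustration and can be extended with any other package mentioned above, and the installation line is left commented out because it assumes a reachable CRAN mirror.

```r
# Packages used in this tutorial (extend the vector as needed).
pkgs <- c("ape", "seqinr", "ade4")

# TRUE for each package that is already installed in this R library.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (any(!installed)) {
  message("Not installed yet: ", paste(pkgs[!installed], collapse = ", "))
  # Assumption: a CRAN mirror is reachable. Uncomment to install:
  # install.packages(pkgs[!installed])
}

# Attach every available package so they can be used together in one session.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

Printing installed afterwards shows which of the packages still have to be installed before continuing.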
A list of databases can be obtained by calling the seqinr function choosebank in R:

> choosebank()
 [1] "genbank" "embl" "emblwgs" "swissprot" "ensembl" "refseq"
 [7] "refseqViruses" "nrsub" "hobacnucl" "hobacprot" "hovergendna" "hovergen"
[13] "hogenom5" "hogenom5dna" "hogenom" "hogenomdna" "homolens" "homolensdna"
[19] "greviews" "polymorphix" "emglib" "taxobacgen" "apis" "anopheles"
[25] "caenorhabditis" "cionasavignyi" "danio" "drosophila" "felis" "gallus"
[31] "human" "mouse" "saccharomyces" "tetraodon" "xenopus"

One of these databases can be selected by passing its name to choosebank; here the result is stored in s:

> s <- choosebank("genbank")

Both ape and seqinr can be used to find and download sequences from GenBank, but seqinr is more robust in this respect. Here we use the command query, which retrieves the sequences that match a set of criteria we specify.

> cb <- query("cb", "sp=Catharus bicknelli")   # query retrieves the list of sequences for the species Catharus bicknelli
> sapply(cb$req[1:4], getName)   # names of the sequences in positions one to four, just to be sure we have the information
> sapply(cb$req[], getName)   # names of all sequences of the species
 [1] "AF373783" "AF373784" "AF373785" "AF373786" "AF373787" "AF373788"
 [7] "AF373789" "AF529137" "AF529151" "AY049475" "AY049490" "AY049520"
[13] "DQ432826" "DQ432827" "DQ433438" "DQ433439" "DQ433440" "DQ433441"
[19] "DQ433442" "DQ433443" "DQ433444" "DQ433445" "DQ433446" "DQ433447"
> sapply(cb$req[], getSequence, as.string = TRUE)   # retrieves all the sequences listed above
[1] "gatgcactttgaccccattcacgagggggaggctatttacctcttaagtatgcagatagtgtaatggtcaccgcacatattta
gattgtttcccctttctaggaatttccatctaaacccctaaaaatcatcatttttttcgttcgtttatttttatcatgacattttcgtttaaa
attgaccaaatattcttagacatctccctacctttaaccaaagcattcatcatcacaaaactaacgaacaaacttcctctatttt
ccccctatttatcagaaccgaaagtacaacaaacttttccatctttacacacacacacaaacagcaacccccctgacaaa
ccaccccaactaaaaccaaacaaaaacacaacgcatgttcttgtagcttaac"
[2] "gatgcactttgaccccattcacgagggggaggctatttacctcttaagtatgcagatagtgtaatggtcaccgcacatattta
gattgtttcccctttctaggaatttccatctaaacccctaaaaatcatcatttttttcgttcgtttatttttatcatgacattttcgtttaaa
attgaccaaatattcttagacatctccctacctttaaccaaagcattcatcatcacaaaactaacgaacaaacttcctctgtttt
ccccctatttaccagaaccgaaagtacaacaaacttttccatctttacacacacacacaaacagcaacccccctgacaaa
ccaccccaactaaaaccaaacaaaaacacaacgcatgttcttgtagcttaac"
[3] ....
[24]

> x = getSequence(cb$req[[1]])   # stores the first sequence as x
> length(x)   # length (number of bp) of x
[1] 393

Alternatively, the ape command read.GenBank fetches a single sequence whose accession number we know:

> y = read.GenBank("AF373783", species.names = TRUE, as.character = FALSE)
> length(y[[1]])
[1] 393
# hence we obtain the same result as above
> save(y, file = "cathb.Rdata")   # to save your data

Tree building with ape

With R it is possible to work with several tree file formats (such as NEWICK, NEXUS, and PHYLIP), and tree figures can be easily manipulated. Here I will load a tree from a file previously generated with the program MrBayes. The consensus (.con) file used here holds two trees, and the read.nexus command can read it:

> cathTrees = read.nexus("catharus.nex.con")   # reads the nexus file previously stored

Note that if the call fails with an error such as:

Error in file(file, "r") : cannot open the connection
In addition: Warning message:
In file(file, "r") :
  cannot open file 'catharus.nex.con': No such file or directory

then you have to change the working directory to the one that holds the tree files. In the R GUI this can be done from Misc in the menu bar, choosing Change Working Directory; the function setwd does the same from the console.

> cathTrees   # check for the 2 stored trees
2 phylogenetic trees

Now we extract the tree that contains the posterior probability values, which by default is the first tree:

> cathTree = cathTrees[[1]]
> cathTree

Phylogenetic tree with 20 tips and 15 internal nodes.
Tip labels:
  _Catharus_gracilirostris, _Catharus_ustulatus, _Catharus_ustulatus_2, _Catharus_aurantiirostris, _Catharus_fuscater, _Catharus_mexicanus, ...
Node labels:
  , 0.98, 0.71, 0.59, 0.66, 0.77, ...

Unrooted; includes branch lengths.

Once the tree has been loaded, we can use the function plot to get its graphical representation. The function plot offers a number of options that control how the tree is drawn. Here I list a few, followed by some example calls that modify them.

Option            Effect
type              Type of tree
show.tip.label    Shows tip labels
show.node.label   Shows node labels
cex               Relative character size
root.edge         Draw the root edge
underscore        Display the underscores in tip labels
direction         Direction of the tree

From Paradis (2006)

> plot(cathTree)
> plot(cathTree, use.edge.length = FALSE)
> plot(cathTree, use.edge.length = FALSE, type = "r")
> plot(cathTree, use.edge.length = FALSE, direction = "u")
> plot(cathTree, use.edge.length = FALSE, show.node.label = TRUE, edge.color = "3")
> plot(cathTree, use.edge.length = FALSE, show.node.label = TRUE, cex = 0.75)

Finally we can save our tree, which will be written to the current working directory (here, the folder where the .con file is stored):

> write.nexus(cathTree, file = "cathTree.nex")

Character mapping with ape

Ancestral character estimation can be done by mapping the character states onto the tips of the tree and then fitting a model of character evolution:

> cathmigtree = read.nexus("CathxR.nex.con")
> char1=c(0,0,0,0,0,0,0,1,1,1,1,1,1,1)   # one state per tip, in the order of the tip labels
> names(char1)= cathmigtree$tip.label

Now the ancestral states can be reconstructed under three models: the equal-rates model (ER), the symmetrical model (SYM), and the all-rates-different model (ARD).
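The three models differ only in how many distinct transition rates they allow between character states. The sketch below, written in base R for illustration, builds an index matrix for each model in which cells sharing a number share one rate parameter. The helpers rate_index and n_rates are hypothetical names introduced here and are not part of ape.

```r
# Index matrix for a character with k states.
# Off-diagonal cells with the same index share one rate parameter;
# the diagonal (0) is not a free parameter.
rate_index <- function(k, model = c("ER", "SYM", "ARD")) {
  model <- match.arg(model)
  m <- matrix(0L, k, k)
  off <- row(m) != col(m)
  if (model == "ER") {
    # one rate for every change
    m[off] <- 1L
  } else if (model == "SYM") {
    # one rate per pair of states, equal in both directions
    m[lower.tri(m)] <- seq_len(k * (k - 1) / 2)
    m <- m + t(m)
  } else {
    # a separate rate for every ordered pair of states
    m[off] <- seq_len(k * (k - 1))
  }
  m
}

# Number of free rate parameters under a given model.
n_rates <- function(k, model) max(rate_index(k, model))

# A two-state character such as char1 (states 0 and 1):
rate_index(2, "ER")
rate_index(2, "SYM")
rate_index(2, "ARD")

# Parameter counts for two and three states.
sapply(c(ER = "ER", SYM = "SYM", ARD = "ARD"), function(m) n_rates(2, m))
sapply(c(ER = "ER", SYM = "SYM", ARD = "ARD"), function(m) n_rates(3, m))
```

For a two-state character the ER and SYM matrices are identical, so those two models are expected to give the same likelihood; only ARD adds a parameter. With three or more states the three models have different numbers of parameters.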
Each model is fitted with a call to ace:

> ERreconstruction = ace(char1, cathmigtree, type="discrete", model="ER")
> SYMreconstruction = ace(char1, cathmigtree, type="discrete", model="SYM")
> ARDreconstruction = ace(char1, cathmigtree, type="discrete", model="ARD")

The log-likelihoods that result from the above models can be inspected by typing:

> ERreconstruction$loglik
> SYMreconstruction$loglik
> ARDreconstruction$loglik

Finally, which model should be preferred depends on the likelihood results. To determine whether a model with more parameters is justified, a likelihood ratio test must be performed: twice the difference in log-likelihood between the more complex and the simpler model is compared with a chi-square distribution whose degrees of freedom equal the difference in the number of rate parameters.

BIBLIOGRAPHY

Butler, M. A. & A. A. King. 2004. Phylogenetic comparative analysis: a modeling approach for adaptive evolution. American Naturalist 164:683-695.

Harmon, L. J., J. Weir, C. Brock, R. E. Glor, & W. Challenger. 2008. GEIGER: Investigating evolutionary radiations. Bioinformatics 24:129-131.

Hipp, A. L. 2007. Non-uniform processes of chromosome evolution in sedges (Carex: Cyperaceae). Evolution 61:2175-2194.

Paradis, E., J. Claude, & K. Strimmer. 2004. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics 20:289-290.

Paradis, E. 2006. Analysis of phylogenetics and evolution with R. Springer, New York.

Rabosky, D. L. 2006. LASER: a maximum likelihood toolkit for inferring temporal shifts in diversification rates. Evolutionary Bioinformatics 2:257-260.