A Brief Introduction to Phylogenetics with R
César Sánchez (csanc15@lsu.edu)
11/03/09
Given the flexibility that R possesses as a language and environment for data, graphical and
statistical analysis, it has become an important tool for scientists working in
computational molecular biology. In the last decade, there has been an outburst of programs that
use molecular sequences to build evolutionary hypotheses in order to understand species
relationships, divergence times and speciation rates (among many others); these analyses can now
be done with R.
Phylogenetic analyses have been done with R since the early 2000s, and a number of packages have
been implemented to address different types of analysis. Some of the more commonly used
packages include APE (Paradis et al. 2004), LASER (Likelihood Analysis of Speciation/Extinction
Rates from Phylogenies, Rabosky 2006), GEIGER (Analysis of evolutionary diversification,
Harmon et al. 2008), MATICCE (Mapping Transitions in Continuous Character Evolution, Hipp
2007), and ouch (Ornstein-Uhlenbeck models for phylogenetic comparative hypotheses, Butler &
King 2004).
Here I will show the use of the package “ape” and some of its many functions. The package
ape can be downloaded from http://cran.r-project.org/web/packages/ape/index.html, as can
other packages that work together with ape, such as apTreeshape, seqinr, or ade4. All these
packages have to be installed by the user and, given that they can be used jointly, it is
best to load them all during the same session. Readers interested in learning more about
ape should consult the book by Paradis (2006), as well as
the R-phylo wiki (http://www.r-phylo.org/wiki/Main_Page).
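The installation and loading step described above could be scripted roughly as follows. This is only a sketch: the helper name missing_packages is hypothetical and introduced here for illustration, while requireNamespace, install.packages and library are standard R functions.

```r
# Packages mentioned in the text; all are distributed through CRAN.
pkgs <- c("ape", "seqinr", "apTreeshape", "ade4")

# Hypothetical helper: returns the packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Install only in an interactive session, so that a script never downloads
# software without the user noticing.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}

# Attach every package that is available, so they can be used jointly.
for (p in setdiff(pkgs, missing_packages(pkgs))) {
  library(p, character.only = TRUE)
}
```

Checking for missing packages first avoids reinstalling software that is already present.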
Sequence downloading from internet
I will start by exploring data from one of the many internet databases holding genetic
sequences. For the examples used here, sequences will be downloaded from GenBank.
A list of the available databases can be obtained by calling the function choosebank (from the
package seqinr) in R:
> choosebank()
 [1] "genbank"        "embl"           "emblwgs"        "swissprot"      "ensembl"        "refseq"
 [7] "refseqViruses"  "nrsub"          "hobacnucl"      "hobacprot"      "hovergendna"    "hovergen"
[13] "hogenom5"       "hogenom5dna"    "hogenom"        "hogenomdna"     "homolens"       "homolensdna"
[19] "greviews"       "polymorphix"    "emglib"         "taxobacgen"     "apis"           "anopheles"
[25] "caenorhabditis" "cionasavignyi"  "danio"          "drosophila"     "felis"          "gallus"
[31] "human"          "mouse"          "saccharomyces"  "tetraodon"      "xenopus"
One of these databases can be selected by passing its name to the function choosebank:
> s <- choosebank("genbank")
Now, both ape and seqinr can be used to find and download sequences from GenBank, but
seqinr is more robust in this sense. Here we can make use of the command query, which retrieves
the sequences that match a set of criteria that we specify.
> query("cb", "sp=Catharus bicknelli")
# query retrieves the list of sequences for the species Catharus bicknelli
> sapply(cb$req[1:4], getName)
# sapply retrieves the names of the sequences in positions one to four, just to be sure we
# have the information
> sapply(cb$req[], getName)
# sapply retrieves the names of all sequences of the species
[1] "AF373783" "AF373784" "AF373785" "AF373786" "AF373787" "AF373788"
[7] "AF373789" "AF529137" "AF529151" "AY049475" "AY049490" "AY049520"
[13] "DQ432826" "DQ432827" "DQ433438" "DQ433439" "DQ433440" "DQ433441"
[19] "DQ433442" "DQ433443" "DQ433444" "DQ433445" "DQ433446" "DQ433447"
> sapply(cb$req[], getSequence, as.string = TRUE)
# retrieves all the sequences from above:
[1]
"gatgcactttgaccccattcacgagggggaggctatttacctcttaagtatgcagatagtgtaatggtcaccgcacatattta
gattgtttcccctttctaggaatttccatctaaacccctaaaaatcatcatttttttcgttcgtttatttttatcatgacattttcgtttaaa
attgaccaaatattcttagacatctccctacctttaaccaaagcattcatcatcacaaaactaacgaacaaacttcctctatttt
ccccctatttatcagaaccgaaagtacaacaaacttttccatctttacacacacacacaaacagcaacccccctgacaaa
ccaccccaactaaaaccaaacaaaaacacaacgcatgttcttgtagcttaac"
[2]
"gatgcactttgaccccattcacgagggggaggctatttacctcttaagtatgcagatagtgtaatggtcaccgcacatattta
gattgtttcccctttctaggaatttccatctaaacccctaaaaatcatcatttttttcgttcgtttatttttatcatgacattttcgtttaaa
attgaccaaatattcttagacatctccctacctttaaccaaagcattcatcatcacaaaactaacgaacaaacttcctctgtttt
ccccctatttaccagaaccgaaagtacaacaaacttttccatctttacacacacacacaaacagcaacccccctgacaaa
ccaccccaactaaaaccaaacaaaaacacaacgcatgttcttgtagcttaac"
[3] ....
[24]
> x = getSequence(cb$req[[1]])
# stores the first sequence as x
> length(x)
[1] 393
# length (number of bp) of x
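By default getSequence returns a vector with one character per base, which is why length gives the number of base pairs. The following sketch illustrates this with only the first 20 bases of sequence [1] shown above, so it runs without a database connection; the object names x_demo, base_counts and gc_content are introduced here for illustration.

```r
# First 20 bases of sequence [1] shown above, split into single characters
# to mimic the vector that getSequence() returns by default.
x_demo <- strsplit("gatgcactttgaccccattc", "")[[1]]

length(x_demo)                 # number of bases in the fragment

base_counts <- table(x_demo)   # counts of a, c, g and t
base_counts

# Proportion of g and c bases in the fragment
gc_content <- sum(x_demo %in% c("g", "c")) / length(x_demo)
gc_content
```

The same calls can be applied to the full vector x once it has been retrieved.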
Now, we could use a command from ape, read.GenBank, to retrieve a single sequence for which we
know the accession number:
> y = read.GenBank("AF373783", species.names = TRUE, as.character = FALSE)
> length(y[[1]])
[1] 393
# hence we obtain the same result as above
> save(y, file = "cathb.Rdata")
# saves the data to a file
Tree building with ape
With R it is possible to work with several types of trees (and file formats such as NEWICK,
NEXUS, PHYLIP), and tree figures can be easily manipulated. Here I will
load a tree from a file previously generated with the program MrBayes. The consensus
file written by this program holds two trees, which the command read.nexus can read:
> cathTrees = read.nexus("catharus.nex.con")
#reads the nexus file previously stored
Note: if R returns an error such as:
Error in file(file, "r") : cannot open the connection
In addition: Warning message:
In file(file, "r") :
  cannot open file 'catharus.nex.con': No such file or directory
then you have to change the working directory to the one that holds the tree files. This can be
done from Misc in the menu bar, choosing Change Working Directory and selecting the folder we
are about to use.
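The same check and change of directory can also be done from the console. The sketch below assumes the file name used in the text; check_input is a hypothetical helper and the path in the last comment is a placeholder, not a real location.

```r
# Name of the MrBayes consensus file used in the text.
tree_file <- "catharus.nex.con"

# Hypothetical helper: reports whether a file is visible from the current
# working directory and prints a hint when it is not.
check_input <- function(file) {
  found <- file.exists(file)
  if (!found) {
    message("'", file, "' not found in ", getwd(),
            "; use setwd() to move to the folder that contains it.")
  }
  found
}

check_input(tree_file)

# Example with a placeholder path (replace it with your own folder):
# setwd("path/to/tree/files")
```

getwd() shows the folder R is currently reading from, and setwd() changes it on any platform, whatever the menu layout of the R interface.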
> cathTrees
2 phylogenetic trees
# check for the 2 stored trees
Now we call the tree that contains the posterior probability values, which by default is the first
tree:
> cathTree = cathTrees[[1]]
> cathTree
Phylogenetic tree with 20 tips and 15 internal nodes.
Tip labels:
_Catharus_gracilirostris, _Catharus_ustulatus, _Catharus_ustulatus_2,
_Catharus_aurantiirostris, _Catharus_fuscater, _Catharus_mexicanus, ...
Node labels:
, 0.98, 0.71, 0.59, 0.66, 0.77,...
Unrooted; includes branch lengths.
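The components printed in this summary can also be accessed one by one. The sketch below uses a small made-up tree (not the Catharus data) so that it runs without the MrBayes file; the NEWICK string and the object name toy are assumptions for illustration, while read.tree, Ntip, Nnode and is.rooted are functions from ape.

```r
library(ape)

# Toy tree in NEWICK format; the values after the closing parentheses play
# the role of the posterior probabilities stored as node labels.
toy <- read.tree(text = "((A:1,B:1)0.98:1,(C:1,D:1)0.71:1);")

Ntip(toy)          # number of tips
Nnode(toy)         # number of internal nodes
toy$tip.label      # tip names
toy$node.label     # node labels (support values), stored as character strings
is.rooted(toy)     # whether the tree is rooted
```

The same calls applied to cathTree return the values summarized in the printout above.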
Once we have loaded the tree, we can use the function plot to get a graphical representation
of it. The function plot accepts a number of options that control how the tree is drawn.
Here I list a few, and also include some examples of how the trees look after modifying some of
them:
Option            Effect
type              Type of tree
show.tip.label    Shows tip labels
show.node.label   Shows node labels
cex               Relative character size
root.edge         Draw the root edge
underscore        Display the underscores in tip labels
direction         Direction of the tree
From Paradis (2006)
> plot(cathTree)
> plot(cathTree, use.edge.length = FALSE)
> plot(cathTree, use.edge.length = FALSE, type = "r")
> plot(cathTree, use.edge.length = FALSE, direction = "u")
> plot(cathTree, use.edge.length = FALSE, show.node.label = TRUE, edge.color = "3")
> plot(cathTree, use.edge.length = FALSE, show.node.label = TRUE, cex = 0.75)
Finally, we can save our tree, which will be written to the current working directory (here,
the folder where the .con file was stored):
> write.nexus(cathTree, file = "cathTree.nex")
Character mapping with ape
Ancestral character states can be estimated by mapping the character states onto the tips of
the tree and then fitting a model of character evolution:
> cathmigtree = read.nexus("CathxR.nex.con")
> char1=c(0,0,0,0,0,0,0,1,1,1,1,1,1,1)
> names(char1)= cathmigtree$tip.label
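The states are assigned to the tips in the order in which the tip labels are stored, so the vector needs exactly one value per tip, in that order. A minimal sketch of a check that could be run before the reconstruction is shown below; name_states is a hypothetical helper, and the labels are illustrative stand-ins for cathmigtree$tip.label.

```r
# Hypothetical helper: attaches tip labels to a vector of character states
# and stops early when the two do not have the same length.
name_states <- function(states, tip_labels) {
  if (length(states) != length(tip_labels)) {
    stop("Got ", length(states), " states for ", length(tip_labels), " tips.")
  }
  names(states) <- tip_labels
  states
}

# Illustrative labels only; in the tutorial they come from cathmigtree$tip.label.
demo_labels <- paste0("taxon_", 1:14)
demo_states <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)

char_demo <- name_states(demo_states, demo_labels)
table(char_demo)   # number of tips in each state
```

Printing the named vector next to the tree is a quick way to confirm that each taxon received the intended state.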
Now the ancestral states can be reconstructed under three models: the equal-rates model
(ER), the symmetrical model (SYM) and the all-rates-different model (ARD).
> ERreconstruction = ace(char1, cathmigtree, type="discrete", model="ER")
> SYMreconstruction = ace(char1, cathmigtree, type="discrete", model="SYM")
> ARDreconstruction = ace(char1, cathmigtree, type="discrete", model="ARD")
# The log-likelihoods that result from the above models can be inspected by typing:
> ERreconstruction$loglik
> SYMreconstruction$loglik
> ARDreconstruction$loglik
Finally, the model that should be preferred depends on the likelihood results; hence, in order
to determine whether a model with more parameters is justified, a likelihood ratio test must be
performed.
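A likelihood ratio test between two nested models could be sketched as follows. The helper lrt is hypothetical, and the log-likelihood values are placeholders rather than results from the Catharus data. The sketch assumes the usual chi-square approximation, with degrees of freedom equal to the difference in the number of estimated rates.

```r
# Hypothetical helper: likelihood ratio test between a simpler model (null)
# and a more parameter-rich model in which it is nested (alternative).
lrt <- function(loglik_null, loglik_alt, df) {
  statistic <- 2 * (loglik_alt - loglik_null)
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p.value = p_value)
}

# Placeholder log-likelihoods, NOT results from the Catharus data.
# With the objects created above one would pass ERreconstruction$loglik and
# ARDreconstruction$loglik instead.
loglik_er  <- -10.0
loglik_ard <- -8.0

# For a two-state character, ER estimates one rate and ARD two, so df = 1.
lrt(loglik_er, loglik_ard, df = 1)
```

A small p-value would favour the model with more parameters; otherwise the simpler model is retained. For a character with only two states, the ER and SYM models both estimate a single rate, so the informative comparison is between ER and ARD.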
BIBLIOGRAPHY
Butler, M. A. & A. A. King. 2004. Phylogenetic comparative analysis: a modeling approach for
adaptive evolution. American Naturalist 164:683-695.
Harmon, L. J., J. Weir, C. Brock, R. E. Glor & W. Challenger. 2008. GEIGER: Investigating
evolutionary radiations. Bioinformatics 24:129-131.
Hipp, A. L. 2007. Non-uniform processes of chromosome evolution in sedges (Carex:
Cyperaceae). Evolution 61:2175-2194.
Paradis, E., J. Claude & K. Strimmer. 2004. APE: Analyses of phylogenetics and evolution in R
language. Bioinformatics 20:289-290.
Paradis, E. 2006. Analysis of phylogenetics and evolution with R. Springer, New York.
Rabosky, D. L. 2006. LASER: a maximum likelihood toolkit for inferring temporal shifts in
diversification rates. Evolutionary Bioinformatics 2:257-260.