Practical on phylogenetic trees based on sequence alignments
Kyrylo Bessonov
November 26th, 2013

Talk plan
• How to build phylogenetic trees of types
  – Unrooted
  – Rooted
• Context
  – comparison of viral proteins of dengue virus
• Examples on phylogenetic tree building
  – Dengue virus

Building a phylo tree using ape
• Ape - Analyses of Phylogenetics and Evolution
  – Functions to create and manipulate phylo trees
  – Graphical exploration of phylogenetic data
• To build a phylogenetic tree
  – Download protein sequences from DB
  – Align sequences
  – Calculate pairwise distances (dist.alignment() from seqinr)
  – Build and visualize a phylogenetic tree (nj() and plot.phylo() from ape)

Building an unrooted phylogenetic tree (1)
#install required libraries
#note: this code targets the muscle package interface of 2013
#(fasta-class objects such as umax); later releases may differ
install.packages("seqinr")
install.packages("muscle")
install.packages("ape")
library("seqinr")
library("muscle")
library("ape")

multipleSeqAlignment <- function(seqnames, seqs) {
  #umax is an object of class fasta from the muscle package, used here as a template
  fasta_seqs_Object <- umax
  #one row per sequence: V1 = name, V2 = sequence collapsed into a single string
  tmp <- data.frame(V1 = rep(0, length(seqs)), V2 = rep(0, length(seqs)))
  for (i in seq_along(seqs)) {
    tmp[i, 1] <- seqnames[i]
    tmp[i, 2] <- paste(seqs[[i]], collapse = "")
  }
  fasta_seqs_Object$seqs <- tmp
  #remove the conflicting ape library from memory before aligning
  try(detach("package:ape"), silent = TRUE)
  #multiple sequence alignment
  alignment <- muscle(seqs = fasta_seqs_Object, out = NULL)
  #convert the result to an object of class alignment
  alignment_ape <- ape::as.alignment(matrix(alignment$seqs[, 2]))
  alignment_ape$nam <- alignment$seqs[, 1]
  return(alignment_ape)
}

Building an unrooted phylogenetic tree (2)
#main part of the code
choosebank("swissprot") #selects database for query
seqnames <- c("P06747", "P0C569", "O56773", "Q5VKP1")
seqs <- list()
for (i in seq_along(seqnames)) {
  res <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(res)
}
#multipleSeqAlignment() is defined on the previous slide
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)
#multipleSeqAlignment() detaches ape, so load it again before calling nj()
library("ape")
#nj() performs the neighbor-joining tree estimation of Saitou and Nei
mytree <- nj(mydist)
#labels must be given in the tip order of the tree (check mytree$tip.label first)
mytree$tip.label <- c("Q5VKP1-\nWestern Caucasian bat virus\nphosphoprotein",
                      "P06747-\nrabies virus\nphosphoprotein",
                      "P0C569-\nMokola virus\nphosphoprotein",
                      "O56773-\nLagos bat virus\nphosphoprotein")
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3,
           cex = 0.8, no.margin = TRUE, srt = 50)

Unrooted Phylogenetic Tree
• Phylogenetic tree showing the distances between 4 viral protein sequences
• The genetic distance between O56773 and P0C569 is the smallest

Unrooted phylogenetic tree (1)
• The lengths of the branches in the plot of the tree are proportional to the amount of evolutionary change (estimated by the number of mutations) along the tree branches
• This is an unrooted phylogenetic tree because it does not contain an outgroup sequence, that is, a sequence of a protein that is known to be more distantly related to the other proteins in the tree than they are to each other

Unrooted phylogenetic tree (2)
• As a result, we cannot tell in which direction evolutionary time ran along the internal branches of the tree. For example, we cannot tell whether the node representing the common ancestor of (O56773, P0C569) was an ancestor of the node representing the common ancestor of (Q5VKP1, P06747), or the other way around.
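The branch lengths discussed above are derived from pairwise distances between aligned sequences. As a rough illustration, the base-R sketch below computes an identity-based distance, sqrt(1 - fraction of identical positions), which is how the seqinr documentation describes dist.alignment() with matrix = "identity". The toy sequences, the helper name identity_dist() and the choice to skip gapped columns are assumptions made for this example, not part of the original slides.

```r
# Toy aligned protein fragments (hypothetical); all have equal length, "-" marks a gap
aln <- c(seqA = "DMGYWIESEK",
         seqB = "DMGYWIESQK",
         seqC = "DMGYWLDS-K")

# Hypothetical helper: sqrt(1 - fraction of identical positions).
# Columns where either sequence has a gap are skipped (an assumption).
identity_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  keep <- x != "-" & y != "-"
  sqrt(1 - mean(x[keep] == y[keep]))
}

# Fill a symmetric matrix with all pairwise distances
n <- length(aln)
d <- matrix(0, n, n, dimnames = list(names(aln), names(aln)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- identity_dist(aln[[i]], aln[[j]])
  }
}

# Convert to a dist object, the input type expected by tree-building functions such as nj()
toy_dist <- as.dist(d)
print(round(toy_dist, 3))
```

The smallest entry of such a matrix identifies the closest pair of sequences, which is how the tree and the distance matrix can be checked against each other.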
Distance matrix
         Q5VKP1  P06747  P0C569
P06747   0.49
P0C569   0.48    0.45
O56773   0.50    0.46    0.41
• Inspecting the calculated distance matrix between the aligned sequences confirms the results seen in the phylogenetic tree
• The closest pair is the O56773 and P0C569 proteins

Rooted phylogenetic tree
• In order to convert the unrooted tree into a rooted tree, we need to add an outgroup sequence
  – Outgroup
    • a taxon outside the group of interest
    • will branch off at the base of the phylogeny
    • Caenorhabditis elegans (UniProt accession Q10572) and Caenorhabditis remanei (UniProt E3M2K8)
• If we were to build a phylogenetic tree of the Fox-1 homologues in vertebrates, the distantly related sequence from worms would probably be a good choice of outgroup, since the protein is from a different taxon/group (worms)

Building a rooted phylogenetic tree (1)
#BUILDING ROOTED TREE OF PROTEIN SEQUENCES (FOX1)
#Q9NWB1 - Human
#Q17QD3 - Cow
#Q95KI0 - Monkey
#A1A5R1 - Rat
#Q10572 - Worm C.elegans (Root)
#E1G4K8 - Eye worm
seqnames <- c("Q9NWB1", "Q17QD3", "Q95KI0", "A1A5R1", "Q10572", "E1G4K8")
choosebank("swissprot") #selects database for query
seqs <- list()
for (i in seq_along(seqnames)) {
  res <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(res)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)

Building a rooted phylogenetic tree (2)
library("ape")
mytree <- nj(mydist)
#labels must be given in the tip order of the tree (check mytree$tip.label first)
mytree$tip.label <- c("E1G4K8-Eye worm ", "Q10572-C.elegans(Root)", "A1A5R1-Rat",
                      "Q9NWB1-Human", "Q17QD3-Cow", "Q95KI0-Monkey")
#root the tree on the outgroup; resolve.root = TRUE makes the root a bifurcating node
myrootedtree <- root(mytree, outgroup = "Q10572-C.elegans(Root)", resolve.root = TRUE)
#printing myrootedtree reports:
#Phylogenetic tree with 6 tips and 5 internal nodes.
#Rooted; includes branch lengths.
plot.phylo(myrootedtree, edge.color = "blue", edge.width = 3, type = "p")

Rooted tree of FOX1 proteins
• The invertebrates are grouped together
• Worms form a distinct group, yet with a large genetic distance between them
• Human FOX1 is closest to the monkey and cow sequences
[Figure: rooted tree of FOX1 proteins; outgroup (worms) labelled]

Distance matrix
         E1G4K8  Q10572  A1A5R1  Q9NWB1  Q17QD3
Q10572   0.72
A1A5R1   0.75    0.63
Q9NWB1   0.72    0.62    0.44
Q17QD3   0.73    0.62    0.50    0.28
Q95KI0   0.73    0.61    0.49    0.28    0.14
Table legend:
Q9NWB1 – Human
Q17QD3 – Cow
Q95KI0 – Monkey
A1A5R1 – Rat
Q10572 – Worm C.elegans (Root)
E1G4K8 – Eye worm
• As expected, eye worms are the most distantly related species to vertebrates
• Cow and monkey have the closest relationship and the lowest genetic distance (0.14)

Rooted tree
• Time runs from left to right
• Monkey, Cow and Human have common ancestor 3
• Ancestor 1 is common to ancestors 2 and 3
[Figure: rooted tree with TIME axis]

Exercises on phylogenetic tree building (1)
• Q1. Calculate the genetic distances between the following NS1 proteins from different Dengue virus strains: Dengue virus 1 NS1 protein (UniProt: Q9YRR4), Dengue virus 2 NS1 protein (UniProt: Q9YP96), Dengue virus 3 NS1 protein (UniProt: B0LSS3), and Dengue virus 4 NS1 protein (UniProt: Q6TFL5). Which viruses are the most closely related, and which are the least closely related, based on the genetic distances?
  Note: Dengue virus causes Dengue fever, which is classified by the WHO as a neglected tropical disease. There are four main types of Dengue virus: Dengue virus 1, Dengue virus 2, Dengue virus 3, and Dengue virus 4.
• Q2. Build an unrooted phylogenetic tree of the NS1 proteins from Dengue virus 1, Dengue virus 2, Dengue virus 3 and Dengue virus 4, using the neighbour-joining algorithm. Which are the most closely related proteins, based on the tree?

Exercises on phylogenetic tree building (2)
• Q3. The Zika virus is related to Dengue viruses, but is not a Dengue virus, and can therefore be used as an outgroup in phylogenetic trees of Dengue virus sequences.
UniProt accession Q32ZE1 consists of a sequence with similarity to the Dengue NS1 protein, so it seems to be a related protein from Zika virus. Build a rooted phylogenetic tree of the Dengue NS1 proteins based on an alignment, using the Zika virus protein as the outgroup. Which are the most closely related Dengue virus proteins, based on the tree? What extra information does this tree tell you, compared to the unrooted tree in Q2?

Answers Question 1:
Summary of viral proteins and UniProt accession numbers:
Q9YRR4 – Dengue virus 1 NS1 protein
Q9YP96 – Dengue virus 2 NS1 protein
B0LSS3 – Dengue virus 3 NS1 protein
Q6TFL5 – Dengue virus 4 NS1 protein
seqnames <- c("Q9YRR4", "Q9YP96", "B0LSS3", "Q6TFL5")
choosebank("swissprot") #selects database for query
seqs <- list()
for (i in seq_along(seqnames)) {
  res <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(res)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape); mydist

Answers
• Q1. The distance matrix is as follows
         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.306
Q9YP96   0.333   0.254
B0LSS3   0.297   0.230   0.227
The most distant are Q9YP96 (V2) and Q6TFL5 (V4), with a genetic distance of 0.333, while the most closely related are Q9YP96 (V2) and B0LSS3 (V3), with a genetic distance of 0.227.

Answers Question 2:
library("ape")
mytree <- nj(mydist)
#plotting unrooted tree based on alignment of whole protein sequences
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3,
           cex = 1.2, no.margin = TRUE, srt = 0)
#trim the sequences to the well-aligned region to reduce gaps
seqs_trim <- seqs
for (i in seq_along(seqs)) {
  s <- paste(seqs_trim[[i]], collapse = "")
  #regexpr() returns the position of the first match (-1 if the motif is absent)
  start_pos <- regexpr("DMGY", s)[1]
  stop_pos <- regexpr("GEDG", s)[1]
  seqs_trim[[i]] <- seqs_trim[[i]][start_pos:stop_pos]
}

Answers Question 2 (continued):
alignment_ape <- multipleSeqAlignment(seqnames, seqs_trim)
mydist <- dist.alignment(alignment_ape); mydist
library("ape")
mytree <- nj(mydist)
#tree based on the best aligned portion
plot.phylo(mytree, type = "u", edge.color = "blue", edge.width = 3,
           cex = 1.2, no.margin = TRUE, srt = 0)

Answers
• The resulting Q2 unrooted tree
This unrooted tree agrees with the genetic distance matrix calculated in Q1. The tree suggests that B0LSS3 and Q9YP96 are the most closely related proteins. To improve the quality of the tree, it is best to select a region that has a minimal number of gaps between the protein sequences. The alignment of the full-length proteins contains regions with lots of gaps.
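One way to locate gap-heavy regions is to count the gaps in each alignment column and keep the longest gap-free stretch. The base-R sketch below does this on a toy alignment; the sequences and variable names are hypothetical and are not the NS1 alignment from the slides, which is trimmed by searching for the DMGY and GEDG motifs instead.

```r
# Toy alignment (hypothetical): equal-length strings, "-" marks a gap
aln <- c(s1 = "MKT--DMGYWIESGEDGRPL",
         s2 = "-----DMGYWLESGEDG---",
         s3 = "MRTAADMGYWIDSGEDGKPL",
         s4 = "---AADMGYWIESGEDG---")

# One row per sequence, one column per alignment position
m <- do.call(rbind, strsplit(aln, ""))

# Fraction of sequences with a gap at each position
gap_frac <- colMeans(m == "-")

# Positions where no sequence has a gap
gap_free <- which(gap_frac == 0)

# Group consecutive gap-free positions into runs and keep the longest run
runs <- split(gap_free, cumsum(c(1, diff(gap_free) != 1)))
best <- runs[[which.max(lengths(runs))]]

# Extract that region from every sequence
core <- substr(aln, min(best), max(best))
print(core)
```

Selecting columns by gap content avoids hard-coding motifs, at the cost of depending on the quality of the initial alignment.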
Let's build another tree based on the most conserved region (bolded on the original slide; the code trims from the DMGY motif to the GEDG motif) to see if it is the same.

Alignment of proteins:
Q6TFL5  DMGCVVSWNGKELKC…KDQKAVHADMGYWIESSKNQTWQIEKASLIEVKTCLWPKTHTL…GMEIRPLSEKEENMVKSQVTA
Q9YRR4  ------------------------DMGYWIESEKNETWKLARASFIEVKTCIWPKSHTL…GMEI----------------
Q9YP96  DSGCVVSWKNKELKC…KDNRAVHADMGYWIESALNDTWKIEKASFIEVKNCHWPKSHTL…GMEIRPLKEKEENLVNSLVTA
B0LSS3  --------------------ASHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTL…------------------------

Answers
• The resulting tree looks the same, but we achieved better overall resolution between the proteins
[Figure: tree built using the full lengths of proteins]
Whole protein sequences used
         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.306
Q9YP96   0.332   0.254
B0LSS3   0.297   0.230   0.227
[Figure: tree built using the bolded region]
Best aligned portion of protein sequences used
         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.317
Q9YP96   0.317   0.264
B0LSS3   0.292   0.233   0.216

Answers Question 3:
#Q3 building rooted tree based on Q89277 (yellow fever virus) as out group
library("seqinr")
library("muscle")
library("ape")
seqnames <- c("Q9YRR4", "Q9YP96", "B0LSS3", "Q6TFL5", "Q89277")
choosebank("swissprot") #selects database for query
seqs <- list()
for (i in seq_along(seqnames)) {
  res <- query(paste("AC=", seqnames[i], sep = ""))
  seqs[i] <- getSequence(res)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape); mydist
library("ape")
mytree <- nj(mydist)
#root the tree on the outgroup sequence
myrootedtree <- root(mytree, outgroup = "Q89277", resolve.root = TRUE)
plot.phylo(myrootedtree, type = "p", edge.color = "blue", edge.width = 3,
           cex = 1.2, no.margin = TRUE, srt = 0)

Answers
• Q3 asks to build a rooted tree using an outgroup; the answer here uses the yellow fever virus protein (Q89277) as the outgroup instead of the Zika virus protein (Q32ZE1) named in the question
[Figure: rooted tree; outgroup labelled]
         Q89277  Q6TFL5  Q9YRR4  Q9YP96
Q6TFL5   0.523
Q9YRR4   0.511   0.306
Q9YP96   0.486   0.333   0.254
B0LSS3   0.487   0.297   0.230   0.227
• Most closely related viruses:
  – B0LSS3 and Q9YP96
• This rooted tree tells you which of the Dengue virus NS1 proteins branched off the earliest from the ancestors. An unrooted tree does not provide ancestry information (i.e. time sequence)

References
• Ape library for phylogenetic trees and ancestry with bootstrap methods http://cran.r-project.org/web/packages/ape/ape.pdf