Practical on phylogenetic trees
based on sequence alignments
Kyrylo Bessonov
November 26th, 2013
Talk plan
• How to build two types of phylogenetic trees
– Unrooted
– Rooted
• Context
– comparison of viral proteins of dengue virus
• Examples on phylogenetic tree building
– Dengue virus
Building a phylo tree using ape
• Ape - Analyses of Phylogenetics and Evolution
– Functions to create and manipulate phylo trees
– Graphical exploration of phylogenetic data
• To build a phylogenetic tree
– Download protein sequences from DB
– Align sequences
– Calculate pairwise distances (dist.alignment() from seqinr) and build the tree (nj() from ape)
– Visualize a phylogenetic tree
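The distance step can be tried without any database access. The sketch below is a simplified stand-in, not what the slides run: it assumes the sequences are already aligned and uses an identity-based distance, sqrt(1 - fraction of identical positions), which is the form seqinr documents for dist.alignment() with matrix="identity". The toy sequences, the helper name identity_distance and the choice to skip gapped columns are assumptions made for illustration only.

```r
# Toy aligned sequences (hypothetical, equal length, "-" marks a gap)
aln <- c(seqA = "DMGYWIESEK",
         seqB = "DMGYWIESAK",
         seqC = "DMGCWVES-K")

# Distance between two aligned sequences: sqrt(1 - fraction of identical sites).
# Assumption of this sketch: columns with a gap in either sequence are skipped.
identity_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  keep <- x != "-" & y != "-"
  sqrt(1 - mean(x[keep] == y[keep]))
}

# Fill a symmetric matrix and convert it to a "dist" object
n <- length(aln)
m <- matrix(0, n, n, dimnames = list(names(aln), names(aln)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    m[i, j] <- identity_distance(aln[[i]], aln[[j]])
  }
}
toy_dist <- as.dist(m)
print(round(toy_dist, 3))
```

A "dist" object like toy_dist has the same shape as the mydist object passed to nj() later in these slides.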
Building an unrooted phylogenetic tree (1)
#install req. libraries
install.packages("seqinr")
install.packages("muscle")
install.packages("ape")
library("seqinr")
library("muscle")
library("ape")
multipleSeqAlignment <- function(seqnames, seqs) {
  #umax is an object of class fasta from the muscle package,
  #reused here as a template for our own sequences
  fasta_seqs_Object <- umax
  #one row per sequence: column 1 = name, column 2 = sequence as one string
  tmp <- data.frame(V1 = rep("", length(seqs)), V2 = rep("", length(seqs)),
                    stringsAsFactors = FALSE)
  for (i in 1:length(seqs)) {
    tmp[i, 1] <- seqnames[i]
    tmp[i, 2] <- paste(seqs[[i]], collapse = "")
  }
  fasta_seqs_Object$seqs <- tmp
  #remove the conflicting ape library from memory before calling muscle()
  try(detach("package:ape"), silent = TRUE)
  #multiple sequence alignment
  alignment <- muscle(seqs = fasta_seqs_Object, out = NULL)
  #convert the result to an object of class alignment
  alignment_ape <- ape::as.alignment(matrix(alignment$seqs[, 2]))
  alignment_ape$nam <- alignment$seqs[, 1]
  return(alignment_ape)
}
Building an unrooted phylogenetic tree (2)
#main part of the code
choosebank("swissprot") #selects database for query
seqnames <- c("P06747", "P0C569", "O56773", "Q5VKP1")
seqs=list()
for(i in 1:length(seqnames)){
query <- query(paste("AC=",seqnames[i],sep=""))
seqs[i]=getSequence(query)
}
#multipleSeqAlignment() is defined on previous slide
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape)
#nj() performs the neighbor-joining tree estimation by Saitou and Nei
mytree <- nj(mydist)
#labels must be listed in the same order as the tips in mytree$tip.label
mytree$tip.label=c("Q5VKP1-\nWestern Caucasian bat virus\nphosphoprotein",
"P06747-\nrabies virus\nphosphoprotein",
"P0C569-\nMokola virus\nphosphoprotein",
"O56773-\nLagos bat virus\nphosphoprotein")
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3,
cex=0.8, no.margin=TRUE, srt=50)
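To see what neighbor-joining does in its first step, the textbook selection criterion can be computed by hand in base R. This is only a sketch of that one step, not ape's implementation of nj(): it takes the rounded distances printed on the distance-matrix slide and computes Q[i, j] = (n - 2) * d[i, j] - r[i] - r[j], where r holds the row sums. The variable names are chosen for illustration.

```r
# Pairwise distances copied from the distance-matrix slide (rounded values)
ids <- c("Q5VKP1", "P06747", "P0C569", "O56773")
d <- matrix(0, 4, 4, dimnames = list(ids, ids))
d[lower.tri(d)] <- c(0.49, 0.48, 0.50, 0.45, 0.46, 0.41)  # filled column by column
d <- d + t(d)                                             # make it symmetric

# Neighbor-joining criterion for every pair of sequences
n <- nrow(d)
r <- rowSums(d)
Q <- (n - 2) * d - outer(r, r, "+")
diag(Q) <- NA
print(round(Q, 2))

# Pairs whose Q value is (numerically) the smallest are joined first
best <- which(Q - min(Q, na.rm = TRUE) < 1e-9, arr.ind = TRUE)
print(best)
```

With only four sequences, a pair and its complementary pair always get the same Q value, so (P0C569, O56773) and (Q5VKP1, P06747) tie here. Both choices describe the same unrooted tree, the one with these two pairs on opposite sides of the internal branch.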
Unrooted Phylogenetic Tree
• Phylogenetic tree showing the distances between 4 viral protein sequences
• The genetic distance between O56773 and P0C569 is the smallest
Unrooted phylogenetic tree (1)
• The lengths of the branches in the plot of the tree are proportional to the amount of evolutionary change (estimated by the number of mutations) along the tree branches
• This is an unrooted phylogenetic tree as it does not contain an outgroup sequence, that is, a sequence of a protein that is known to be more distantly related to the other proteins in the tree than they are to each other.
Unrooted phylogenetic tree (2)
• As a result, we cannot tell which direction evolutionary time ran in along the internal branches of the tree. For example, we cannot tell whether the node representing the common ancestor of (O56773, P0C569) was an ancestor of the node representing the common ancestor of (Q5VKP1, P06747), or the other way around.
Distance matrix
         Q5VKP1  P06747  P0C569
P06747   0.49
P0C569   0.48    0.45
O56773   0.50    0.46    0.41
• Inspecting the calculated distance matrix between the aligned sequences confirms the results seen in the phylogenetic tree
• The closest pair is the O56773 and P0C569 proteins
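Reading the closest pair off a distance matrix can also be done programmatically. The sketch below rebuilds the matrix from the rounded values on this slide and looks up the smallest entry; closest_pair is a hypothetical helper written for this example, not a function from ape or seqinr. The same helper should work on the mydist object returned by dist.alignment(), since it only relies on as.matrix().

```r
# Distances copied from the slide; as.dist() reads the lower triangle
ids <- c("Q5VKP1", "P06747", "P0C569", "O56773")
m <- matrix(0, 4, 4, dimnames = list(ids, ids))
m[lower.tri(m)] <- c(0.49, 0.48, 0.50, 0.45, 0.46, 0.41)
mydist_slide <- as.dist(m)

# Hypothetical helper: labels of the pair with the smallest distance
closest_pair <- function(d) {
  full <- as.matrix(d)
  full[upper.tri(full, diag = TRUE)] <- NA      # keep each pair only once
  idx <- which(full == min(full, na.rm = TRUE), arr.ind = TRUE)[1, ]
  c(rownames(full)[idx["row"]], colnames(full)[idx["col"]])
}

closest_pair(mydist_slide)
```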
Rooted phylogenetic tree
• In order to convert the unrooted tree into a rooted tree, we need to add an outgroup sequence
– Outgroup
• a taxon outside the group of interest
• will branch off at the base of the phylogeny
• Example worm sequences: Caenorhabditis elegans (UniProt accession Q10572) and Caenorhabditis remanei (UniProt E3M2K8)
• If we were to build a phylogenetic tree of the Fox-1 homologues in vertebrates, the distantly related sequence from worms would probably be a good choice of outgroup, since the protein is from a different taxon (worms)
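Rooting on an outgroup can be tried on a small hand-written tree before running the full pipeline. The sketch below uses a made-up Newick string (the tip names and branch lengths are hypothetical) and ape's read.tree(), is.rooted() and root(). The requireNamespace() guard is only there so that the snippet can be sourced on a machine where ape is not installed.

```r
# Hypothetical toy tree in Newick format: four vertebrate tips and one worm tip
newick <- "((human:0.1,monkey:0.1):0.2,(cow:0.2,rat:0.3):0.1,worm:0.7);"

if (requireNamespace("ape", quietly = TRUE)) {
  unrooted <- ape::read.tree(text = newick)
  # Three branches meet at the base, so ape treats this tree as unrooted
  print(ape::is.rooted(unrooted))

  # Re-root on the outgroup tip; resolve.root = TRUE makes the root bifurcating
  rooted <- ape::root(unrooted, outgroup = "worm", resolve.root = TRUE)
  print(ape::is.rooted(rooted))
}
```

Plotting rooted with plot.phylo(rooted, type="p") should show the worm tip branching off at the base, as described for the outgroup above.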
Building a rooted phylogenetic tree (1)
#BUILDING ROOTED TREE OF PROTEIN SEQUENCES (FOX1)
#Q9NWB1 - Human
#Q17QD3 - Cow
#Q95KI0 - Monkey
#A1A5R1 - Rat
#Q10572 - Worm C.elegans(Root)
#E1G4K8 - Eye worm
seqnames <- c("Q9NWB1","Q17QD3","Q95KI0","A1A5R1","Q10572","E1G4K8")
choosebank("swissprot") #selects database for query
seqs=list()
for(i in 1:length(seqnames)){
query <- query(paste("AC=",seqnames[i],sep=""))
seqs[i]=getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape)
Building a rooted phylogenetic tree (2)
library("ape")
mytree <- nj(mydist)
#labels must be listed in the same order as the tips in mytree$tip.label
mytree$tip.label=c("E1G4K8-Eye worm ", "Q10572-C.elegans(Root)",
"A1A5R1-Rat", "Q9NWB1-Human", "Q17QD3-Cow", "Q95KI0-Monkey")
#resolve.root=TRUE makes the new root a bifurcating node
myrootedtree <- root(mytree, outgroup="Q10572-C.elegans(Root)",
resolve.root=TRUE)
#printing myrootedtree reports, among other things:
#Phylogenetic tree with 6 tips and 5 internal nodes.
#Rooted; includes branch lengths.
#(the tip labels listed are the ones assigned above)
plot.phylo(myrootedtree, edge.color = "blue", edge.width = 3 ,
type="p")
Rooted tree of FOX1 proteins
• The invertebrates are grouped together
• Worms form a distinct group, yet with a large genetic distance
• Human FOX1 is closest to the monkey and cow sequences
[Figure: rooted tree; the worm sequences are labelled as the outgroup]
Distance matrix
         E1G4K8  Q10572  A1A5R1  Q9NWB1  Q17QD3
Q10572   0.72
A1A5R1   0.75    0.63
Q9NWB1   0.72    0.62    0.44
Q17QD3   0.73    0.62    0.50    0.28
Q95KI0   0.73    0.61    0.49    0.28    0.14
Table legend:
Q9NWB1 – Human
Q17QD3 – Cow
Q95KI0 – Monkey
A1A5R1 – Rat
Q10572 - Worm C.elegans (Root)
E1G4K8 - Eye worm
• As expected, eye worms are the most distantly related species to the vertebrates
• Cow and monkey have the closest relationship and the lowest genetic distance
Rooted tree
• Time runs from left to right
• Monkey, Cow and Human share common ancestor 3
• Ancestor 1 is the common ancestor of ancestors 2 and 3
[Figure: rooted tree with a TIME axis running from left to right]
Exercises on phylogenetic tree building
Q1. Calculate the genetic distances between the following
NS1 proteins from different Dengue virus strains: Dengue virus 1 NS1 protein
(Uniprot ID: Q9YRR4), Dengue virus 2 NS1 protein (UniProt: Q9YP96), Dengue
virus 3 NS1 protein (UniProt: B0LSS3), and Dengue virus 4 NS1 protein (UniProt:
Q6TFL5). Which viruses are the most closely related, and which are the least
closely related, based on the genetic distances? Note: Dengue virus causes
Dengue fever, which is classified by the WHO as a neglected tropical disease.
There are four main types of Dengue virus, Dengue virus 1, Dengue virus 2,
Dengue virus 3, and Dengue virus 4.
Q2. Build an unrooted phylogenetic tree of the NS1 proteins from Dengue virus 1,
Dengue virus 2, Dengue virus 3 and Dengue virus 4, using the neighbour-joining
algorithm. Which are the most closely related proteins, based on the tree?
Q3. The Zika virus is related to Dengue viruses, but is not a Dengue virus, and
can therefore be used as an outgroup in phylogenetic trees of Dengue virus
sequences. UniProt accession Q32ZE1 consists of a sequence with similarity to the
Dengue NS1 protein, so seems to be a related protein from Zika virus. Build a
rooted phylogenetic tree of the Dengue NS1 proteins based on an alignment, using
the Zika virus protein as the outgroup. Which are the most closely related Dengue
virus proteins, based on the tree? What extra information does this tree tell you,
compared to the unrooted tree in Q2?
Answers
Question 1:
Summary of viral proteins and UniProt accession numbers:
Q9YRR4 – Dengue virus 1 NS1 protein
Q9YP96 – Dengue virus 2 NS1 protein
B0LSS3 – Dengue virus 3 NS1 protein
Q6TFL5 – Dengue virus 4 NS1 protein
seqnames <- c("Q9YRR4","Q9YP96","B0LSS3","Q6TFL5")
choosebank("swissprot") #selects database for query
seqs=list()
for(i in 1:length(seqnames)){
query <- query(paste("AC=",seqnames[i],sep=""))
seqs[i]=getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape);
mydist
Answers
• Q1. The distance matrix is as follows
         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.306
Q9YP96   0.333   0.254
B0LSS3   0.297   0.230   0.227
The most distant are Q9YP96 (V2) and Q6TFL5 (V4), with a genetic distance of 0.333, while the most closely related are Q9YP96 (V2) and B0LSS3 (V3), with a genetic distance of 0.227
Answers
Question 2:
library("ape")
mytree <- nj(mydist)
#plotting unrooted tree based on alignment of whole protein sequences
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3, cex=1.2,
no.margin=TRUE, srt=0)
Answers
Question 2 (continued):
#trim each sequence to the best aligned region,
#from the DMGY motif up to the start of the GEDG motif
seqs_trim=seqs
for(i in 1:length(seqs)){
  start=regexpr("DMGY", paste(seqs_trim[[i]],collapse="") ) [1]
  stop=regexpr("GEDG", paste(seqs_trim[[i]],collapse="") ) [1]
  seqs_trim[[i]]=seqs_trim[[i]][start:stop]
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs_trim);
mydist <- dist.alignment(alignment_ape);mydist
library("ape")
mytree <- nj(mydist)
#tree based on the best aligned portion
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3, cex=1.2,
no.margin=TRUE, srt=0)
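The motif-based trimming can be checked on a toy sequence first. In the sketch below, trim_between is a hypothetical helper and the toy sequence is made up. It differs from the loop in the answer in two ways, both of which are choices of this sketch: it keeps the whole right-hand motif (the loop stops at the first residue of GEDG), and it returns NULL when a motif is not found instead of indexing with -1.

```r
# Toy sequence as a vector of single characters, the same shape as the
# sequences stored in seqs in these slides
toy <- strsplit("AAHADMGYWIESKKGEDGTT", "")[[1]]

# Hypothetical helper: keep residues from the start of `left` through the
# end of `right`; return NULL if a motif is missing or they are out of order
trim_between <- function(chars, left, right) {
  s <- paste(chars, collapse = "")
  hit_r <- regexpr(right, s, fixed = TRUE)
  first <- as.integer(regexpr(left, s, fixed = TRUE))
  right_start <- as.integer(hit_r)
  if (first < 0 || right_start < first) return(NULL)
  last <- right_start + attr(hit_r, "match.length") - 1L
  chars[first:last]
}

paste(trim_between(toy, "DMGY", "GEDG"), collapse = "")
# → "DMGYWIESKKGEDG"
```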
Answers
• The resulting Q2 unrooted tree
This unrooted tree agrees with the genetic distance matrix calculated in Q1. The tree suggests that B0LSS3 and Q9YP96 are the most closely related proteins. To improve the quality of the tree, it is best to select a region that has a minimal number of gaps between the protein sequences.
Below you can see that there are regions with lots of gaps. Let's build another tree based on the bolded (most conserved) region to see if it is the same
Alignment of proteins:
Q6TFL5  DMGCVVSWNGKELKC…KDQKAVHADMGYWIESSKNQTWQIEKASLIEVKTCLWPKTHTL…GMEIRPLSEKEENMVKSQVTA
Q9YRR4  ------------------------DMGYWIESEKNETWKLARASFIEVKTCIWPKSHTL…GMEI----------------
Q9YP96  DSGCVVSWKNKELKC…KDNRAVHADMGYWIESALNDTWKIEKASFIEVKNCHWPKSHTL…GMEIRPLKEKEENLVNSLVTA
B0LSS3  --------------------ASHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTL…------------------------
[Figure: unrooted tree built using the full lengths of the proteins]
Answers
• The resulting tree looks the same, but we achieved better overall resolution between the proteins
Whole protein sequences used
         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.306
Q9YP96   0.332   0.254
B0LSS3   0.297   0.230   0.227
Best aligned portion of protein sequences used
         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.317
Q9YP96   0.317   0.264
B0LSS3   0.292   0.233   0.216
[Figure: unrooted tree built using the bolded region]
Answers
Question 3:
#Q3 building rooted tree with Q89277 (yellow fever virus) as outgroup
library("seqinr")
library("muscle")
library("ape")
seqnames <- c("Q9YRR4","Q9YP96","B0LSS3","Q6TFL5", "Q89277")
choosebank("swissprot") #selects database for query
seqs=list()
for(i in 1:length(seqnames)){
  query <- query(paste("AC=",seqnames[i],sep=""))
  seqs[i]=getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape);mydist
library("ape")
mytree <- nj(mydist)
#resolve.root=TRUE makes the new root a bifurcating node
myrootedtree <- root(mytree, outgroup="Q89277", resolve.root=TRUE)
plot.phylo(myrootedtree ,type="p", edge.color = "blue", edge.width = 3,
cex=1.2, no.margin=TRUE, srt=0)
Answers
• Q3 asks to build a rooted tree using an outgroup; yellow fever virus (Q89277) is used as the outgroup here
         Q89277  Q6TFL5  Q9YRR4  Q9YP96
Q6TFL5   0.523
Q9YRR4   0.511   0.306
Q9YP96   0.486   0.333   0.254
B0LSS3   0.487   0.297   0.230   0.227
• Most closely related viruses:
– B0LSS3 and Q9YP96
• This rooted tree tells you which of the Dengue virus NS1 proteins branched off the earliest from the ancestors. An unrooted tree does not provide ancestry information (i.e. time sequence)
[Figure: rooted tree with the outgroup labelled]
References
• Ape library for phylogenetic trees and
ancestry with bootstrap methods
http://cran.r-project.org/web/packages/ape/ape.pdf