CS566 Tutorial Presentation
Phylogenetic Analysis
Dayong Guo
Introduction
Phylogenetics is the study of evolutionary relatedness among species, populations, or sets
of sequences. The concept of phylogeny was first stated by Ernst Haeckel in his theory that
"ontogeny recapitulates phylogeny" [1]. Besides the study of morphology or phenotype
with traditional definitions and concepts, molecular analysis with modern computational tools
has shown its unique strength in phylogenetics, since DNA, RNA, and protein sequence
data are naturally discrete. However, it is difficult to infer a phylogenetic tree from a
multiple sequence alignment because of the ambiguity introduced by insertions and deletions.
Several computational algorithms have therefore been developed to build phylogenetic trees
from multiple input sequences. The most commonly used types of algorithms include
distance-matrix methods (e.g. neighbor-joining), maximum parsimony, maximum likelihood,
and Bayesian inference. PHYLIP [2] (PHYLogeny Inference Package) is one of the
most popular tools for phylogenetic analysis. It includes parsimony, distance-matrix, and
likelihood methods, so we can practice and compare these algorithms in PHYLIP
on some input datasets.
Dataset
Both an artificial dataset and a real dataset were used. The original sequences were first
aligned with ClustalX.
Artificial dataset from our textbook, page 303:
>SeqA
ACGCGTTGGGCGATGGCAAC
>SeqB
ACGCGTTGGGCGACGGTAAT
>SeqC
ACGCATTGAATGATGATAAT
>SeqD
ACACATTGAGTGATAATAAT
Real dataset from NCBI: sequences of bone morphogenetic protein 2 (BMP2) from
mouse, rat, human and frog. BMP2 is a conserved protein with ~90% identity among species.
>human BMP2
MVAGTRCLLALLLPQVLLGGAAGLVPELGRRKFAAASSGRPSSQPSDEVLSEFELRLLSMFGLKQRPTPS
RDAVVPPYMLDLYRRHSGQPGSPAPDHRLERAASRANTVRSFHHEESLEELPETSGKTTRRFFFNLSSIP
TEEFITSAELQVFREQMQDALGNNSSFHHRINIYEIIKPATANSKFPVTRLLDTRLVNQNASRWESFDVT
PAVMRWTAQGHANHGFVVEVAHLEEKQGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKRE
KRQAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVN
SVNSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>rat BMP2
MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAGASRPLSRPSEDVLSEFELRLLSMFGLKQRPTPSKD
VVVPPYMLDLYRRHSGQPGALAPDHRLERAASRANTVLSFHHEEAIEELSEMSGKTSRRFFFNLSSVPTD
EFLTSAELQIFREQMQEALGNSSFQHRINIYEIIKPATASSKFPVTRLLDTRLVTQNTSQWESFDVTPAV
MRWTAQGHTNHGFVVEVAHLEEKPGVSKRHVRISRSLHQDEHSWSQVRPLLVTFGHDGKGHPLHKREKRQ
AKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSVN
SKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>mouse BMP2
MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAAASSRPLSRPSEDVLSEFELRLLSMFGLKQRPTPSK
DVVVPPYMLDLYRRHSGQPGAPAPDHRLERAASRANTVRSFHHEEAVEELPEMSGKTARRFFFNLSSVPS
DEFLTSAELQIFREQIQEALGNSSFQHRINIYEIIKPAAANLKFPVTRLLDTRLVNQNTSQWESFDVTPA
VMRWTTQGHTNHGFVVEVAHLEENPGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKREKR
QAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSV
NSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>frog BMP2
MVAGIHSLLLLLFYQVLLSGCTGLIPEEGKRKYTESGRSSPQQSQRVLNQFELRLLSMFGLKRRPTPGKN
VVIPPYMLDLYHLHLAQLAADEGTSAMDFQMERAASRANTVRSFHHEESMEEIPESREKTIQRFFFNLSS
IPNEELVTSAELRIFREQVQEPFESDSSKLHRINIYDIVKPAAAASRGPVVRLLDTRLVHHNESKWESFD
VTPAIARWIAHKQPNHGFVVEVTHLDNDKNVPKKHVRISRSLTPDKDNWPQIRPLLVTFSHDGKGHALHK
RQKRQARHKQRKRLKSSCRRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTL
VNSVNTNIPKACCVPTELSAISMLYLDENEK
Methods
For both the artificial and the real input sequences, the following steps were followed to
generate the final phylogenetic trees. The Fitch-Margoliash algorithm represents the
distance methods, and the Protpars algorithm represents the parsimony methods. The
results are then compared and discussed.
1) The input dataset of FASTA sequences is loaded into CLUSTALW (from EBI) to generate
an alignment file in PHYLIP format;
2) The alignment file is loaded into PHYLIP programs to generate output files containing
the distance matrix and the tree. The available programs include:
Distance methods:
   Dnadist    DNA distance matrix calculation
   Protdist   Protein distance matrix calculation
   Fitch      Fitch-Margoliash tree drawing method without molecular clock
   Kitsch     Fitch-Margoliash tree drawing method with molecular clock
   Neighbor   Neighbor-Joining and UPGMA tree drawing method
Character-based methods:
   Dnapars    DNA parsimony
   Dnapenny   DNA parsimony using branch-and-bound
   Dnaml      DNA maximum likelihood without molecular clock
   Dnamlk     DNA maximum likelihood with molecular clock
   Protpars   Protein parsimony
   Proml      Protein maximum likelihood
3) Draw the tree with the result from the previous step. Options include:
   Drawgram   Draws a rooted tree
   Drawtree   Draws an unrooted tree
   Retree     Interactive tree-rearrangement
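Step 1 above produces a PHYLIP-format alignment file. As a rough illustration of what that file contains, the sketch below writes already-aligned sequences in the sequential PHYLIP layout: a header with the number of sequences and sites, then one line per sequence with the name padded to 10 columns. The function name to_phylip is hypothetical and is not part of CLUSTALW or PHYLIP; the artificial sequences are used because they have equal length and need no gaps.

```python
def to_phylip(records):
    """Format aligned (name, sequence) pairs as sequential PHYLIP text.

    Hypothetical helper for illustration; it assumes the sequences are
    already aligned, i.e. they all have the same length.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    n_sites = lengths.pop()

    # Header line: number of sequences, then number of sites.
    lines = [f"{len(records):6d}{n_sites:7d}"]
    for name, seq in records:
        # Classic PHYLIP reads the name from the first 10 columns.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


records = [
    ("SeqA", "ACGCGTTGGGCGATGGCAAC"),
    ("SeqB", "ACGCGTTGGGCGACGGTAAT"),
    ("SeqC", "ACGCATTGAATGATGATAAT"),
    ("SeqD", "ACACATTGAGTGATAATAAT"),
]
print(to_phylip(records), end="")
```

Names longer than 10 characters are truncated here, so distinct names should differ within their first 10 characters.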
Results
1. Using the artificial sequences with the distance method (Fitch-Margoliash tree drawing
without molecular clock):
1) Alignment by CLUSTALW is:
     4     20
SeqC      ACGCATTGAA TGATGATAAT
SeqD      ACACATTGAG TGATAATAAT
SeqA      ACGCGTTGGG CGATGGCAAC
SeqB      ACGCGTTGGG CGACGGTAAT
2) Distance matrix generated by protdist.exe:
    4
SeqC        0.000000  0.146148  0.497602  0.387604
SeqD        0.146148  0.000000  0.574539  0.456486
SeqA        0.497602  0.574539  0.000000  0.220676
SeqB        0.387604  0.456486  0.220676  0.000000
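To give a feel for where a distance matrix comes from, the following sketch computes the simplest possible distance, the uncorrected proportion of differing sites (p-distance), for the four artificial sequences. This is not the calculation PHYLIP performs: the values reported above come from a model-based correction, so they are not expected to equal these raw proportions. The function name p_distance is an assumption made for this example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions that differ, skipping gap columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


aligned = {
    "SeqC": "ACGCATTGAATGATGATAAT",
    "SeqD": "ACACATTGAGTGATAATAAT",
    "SeqA": "ACGCGTTGGGCGATGGCAAC",
    "SeqB": "ACGCGTTGGGCGACGGTAAT",
}

# Print the proportion of differing sites for every pair of sequences.
for (name1, seq1), (name2, seq2) in combinations(aligned.items(), 2):
    print(f"{name1}-{name2}: {p_distance(seq1, seq2):.2f}")
```

Even at this level the two most similar pairs are SeqC/SeqD and SeqA/SeqB, each differing at 3 of 20 sites, which agrees with the grouping in the matrix above.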
3) Tree generated by fitch.exe:
   4 Populations

Fitch-Margoliash method version 3.66

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  ------------
                                2
                   i  j      Obs

Negative branch lengths not allowed

  +------SeqD
  !
  !                 +--SeqB
  1-----------------2
  !                 +---------SeqA
  !
  +-SeqC

remember: this is an unrooted tree!
Sum of squares =     0.00014

Average percent standard deviation =     0.37228

Between        And            Length
-------        ---            ------
   1          SeqD            0.10906
   1             2            0.29559
   2          SeqB            0.05363
   2          SeqA            0.16705
   1          SeqC            0.03709

(SeqD:0.10906,(SeqB:0.05363,SeqA:0.16705):0.29559,SeqC:0.03709);
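The sum-of-squares criterion in the Fitch output can be checked by hand. The sketch below takes the observed distances from the matrix in step 2 and the branch lengths from the table above, adds up the branch lengths along the path between each pair of sequences to get the expected distance, and evaluates the weighted sum of squares over all ordered pairs. Two points are assumptions on my part rather than something stated in the output: that the sum runs over both (i, j) and (j, i), and that the average percent standard deviation is 100 * sqrt(SS / (N - 2)) with N the number of off-diagonal matrix entries. Both choices happen to reproduce the reported figures for this dataset.

```python
from itertools import permutations
from math import sqrt

names = ["SeqC", "SeqD", "SeqA", "SeqB"]

# Observed distances, copied from the distance matrix in step 2.
observed = {
    frozenset(("SeqC", "SeqD")): 0.146148,
    frozenset(("SeqC", "SeqA")): 0.497602,
    frozenset(("SeqC", "SeqB")): 0.387604,
    frozenset(("SeqD", "SeqA")): 0.574539,
    frozenset(("SeqD", "SeqB")): 0.456486,
    frozenset(("SeqA", "SeqB")): 0.220676,
}

# Branch lengths, copied from the Fitch table: each leaf hangs off
# internal node 1 or 2, and the two nodes are joined by one branch.
leaf_branch = {"SeqD": 0.10906, "SeqC": 0.03709, "SeqB": 0.05363, "SeqA": 0.16705}
attached_to = {"SeqD": 1, "SeqC": 1, "SeqB": 2, "SeqA": 2}
internal_branch = 0.29559


def expected(a, b):
    """Path length between two leaves on the fitted tree."""
    dist = leaf_branch[a] + leaf_branch[b]
    if attached_to[a] != attached_to[b]:
        dist += internal_branch
    return dist


# Weighted sum of squares over all ordered pairs (assumption: both halves
# of the symmetric matrix are counted).
sum_sq = 0.0
for a, b in permutations(names, 2):
    obs = observed[frozenset((a, b))]
    sum_sq += ((obs - expected(a, b)) / obs) ** 2

n_entries = len(names) * (len(names) - 1)        # off-diagonal entries
apsd = 100 * sqrt(sum_sq / (n_entries - 2))      # assumed normalization

print(f"Sum of squares = {sum_sq:.5f}")
print(f"Average percent standard deviation = {apsd:.3f}")
```

Because the branch lengths are taken from the printed table, they are rounded to five decimals, so the result agrees with the program output only to about that precision.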
4) Using drawtree.exe:
2. Using the artificial sequences with the character-based method (Protpars protein parsimony):
1) Alignment by CLUSTALW is:
     4     20
SeqC      ACGCATTGAA TGATGATAAT
SeqD      ACACATTGAG TGATAATAAT
SeqA      ACGCGTTGGG CGATGGCAAC
SeqB      ACGCGTTGGG CGACGGTAAT
2) Using protpars.exe to generate tree:
Protein parsimony algorithm, version 3.66
One most parsimonious tree found:

        +--SeqB
     +--3
  +--2  +--SeqA
  !  !
  1  +-----SeqD
  !
  +--------SeqC

  remember: this is an unrooted tree!

requires a total of     14.000

(((SeqB,SeqA),SeqD),SeqC);
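The idea behind parsimony scoring can be illustrated with Fitch's small-parsimony count, which finds the minimum number of state changes a given tree needs for each alignment column. The sketch below is a generic nucleotide version written for this tutorial, not the Protpars algorithm: Protpars scores changes between amino-acid codes under its own model, so its total of 14.000 is not expected to match the plain count computed here. The tree is transcribed by hand from the Newick string above as nested tuples, and the function names are hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible root states, minimum changes) for one column.

    `tree` is a leaf name or a pair (left, right); `states` maps each
    leaf name to its character in the column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch count over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


alignment = {
    "SeqA": "ACGCGTTGGGCGATGGCAAC",
    "SeqB": "ACGCGTTGGGCGACGGTAAT",
    "SeqC": "ACGCATTGAATGATGATAAT",
    "SeqD": "ACACATTGAGTGATAATAAT",
}

# Tree reported above, (((SeqB,SeqA),SeqD),SeqC), as nested tuples.
reported = ((("SeqB", "SeqA"), "SeqD"), "SeqC")
# An alternative grouping, for comparison.
alternative = (("SeqA", "SeqC"), ("SeqB", "SeqD"))

print(parsimony_length(reported, alignment))
print(parsimony_length(alternative, alignment))
```

Under this simple count the reported grouping needs fewer changes than the alternative, which is the sense in which it is "most parsimonious".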
3) Using drawtree.exe:
3. Using the real sequences with the distance method (Fitch-Margoliash tree drawing
without molecular clock):
1) Alignment by CLUSTALW is:
     4    400
rat       MVAGTRCLLV LLLPQVLLGG AAGLIPELGR KKFAGAS--R PLSRPSEDVL
mouse     MVAGTRCLLV LLLPQVLLGG AAGLIPELGR KKFAAASS-R PLSRPSEDVL
human     MVAGTRCLLA LLLPQVLLGG AAGLVPELGR RKFAAASSGR PSSQPSDEVL
frog      MVAGIHSLLL LLFYQVLLSG CTGLIPEEGK RKYTESG--R SSPQQSQRVL

          SEFELRLLSM FGLKQRPTPS KDVVVPPYML DLYRRHSGQ- ---PGALAPD
          SEFELRLLSM FGLKQRPTPS KDVVVPPYML DLYRRHSGQ- ---PGAPAPD
          SEFELRLLSM FGLKQRPTPS RDAVVPPYML DLYRRHSGQ- ---PGSPAPD
          NQFELRLLSM FGLKRRPTPG KNVVIPPYML DLYHLHLAQL AADEGTSAMD

          HRLERAASRA NTVLSFHHEE AIEELSEMSG KTSRRFFFNL SSVPTDEFLT
          HRLERAASRA NTVRSFHHEE AVEELPEMSG KTARRFFFNL SSVPSDEFLT
          HRLERAASRA NTVRSFHHEE SLEELPETSG KTTRRFFFNL SSIPTEEFIT
          FQMERAASRA NTVRSFHHEE SMEEIPESRE KTIQRFFFNL SSIPNEELVT

          SAELQIFREQ MQEALGN-SS FQHRINIYEI IKPATASSKF PVTRLLDTRL
          SAELQIFREQ IQEALGN-SS FQHRINIYEI IKPAAANLKF PVTRLLDTRL
          SAELQVFREQ MQDALGNNSS FHHRINIYEI IKPATANSKF PVTRLLDTRL
          SAELRIFREQ VQEPFESDSS KLHRINIYDI VKPAAAASRG PVVRLLDTRL

          VTQNTSQWES FDVTPAVMRW TAQGHTNHGF VVEVAHLEEK PGVSKRHVRI
          VNQNTSQWES FDVTPAVMRW TTQGHTNHGF VVEVAHLEEN PGVSKRHVRI
          VNQNASRWES FDVTPAVMRW TAQGHANHGF VVEVAHLEEK QGVSKRHVRI
          VHHNESKWES FDVTPAIARW IAHKQPNHGF VVEVTHLDND KNVPKKHVRI

          SRSLHQDEHS WSQVRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
          SRSLHQDEHS WSQIRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
          SRSLHQDEHS WSQIRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
          SRSLTPDKDN WPQIRPLLVT FSHDGKGHAL HKRQKRQARH KQRKRLKSSC

          KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
          KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
          KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
          RRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ

          TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
          TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
          TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
          TLVNSVNTNI PKACCVPTEL SAISMLYLDE NEK------- ----------
2) Distance matrix by protdist.exe:
    4
rat         0.000000  0.038781  0.081053  0.335587
mouse       0.038781  0.000000  0.078105  0.326765
human       0.081053  0.078105  0.000000  0.317470
frog        0.335587  0.326765  0.317470  0.000000
3) Tree generated by fitch.exe:
   4 Populations

Fitch-Margoliash method version 3.66

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  ------------
                                2
                   i  j      Obs

Negative branch lengths not allowed

  +mouse
  !
  ! +----------------frog
  1-2
  ! +-human
  !
  +rat

remember: this is an unrooted tree!
Sum of squares =     0.00030

Average percent standard deviation =     0.54531

Between        And            Length
-------        ---            ------
   1         mouse            0.01776
   1             2            0.02722
   2          frog            0.28449
   2         human            0.03298
   1           rat            0.02102

(mouse:0.01776,(frog:0.28449,human:0.03298):0.02722,rat:0.02102);
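The parenthesized line above is the tree in Newick notation, and the branch lengths in it are enough to recover the distance between any two species along the tree. The sketch below is a minimal parser written for this tutorial (the name patristic_distances is hypothetical, and it handles only simple Newick strings like the one above: unquoted names, an optional length after each subtree). It returns the path length between every pair of leaves.

```python
import re
from itertools import combinations


def patristic_distances(newick):
    """Leaf-to-leaf path lengths from a simple Newick string."""
    tokens = re.findall(r"[(),:;]|[^(),:;\s]+", newick)
    pos = 0
    pair_dist = {}

    def subtree():
        # Returns {leaf: distance from the top node of this subtree}.
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children = [branch()]
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(branch())
            pos += 1                      # consume ")"
            # Leaves in different children are joined through this node.
            for c1, c2 in combinations(children, 2):
                for a, da in c1.items():
                    for b, db in c2.items():
                        pair_dist[frozenset((a, b))] = da + db
            merged = {}
            for child in children:
                merged.update(child)
            return merged
        name = tokens[pos]
        pos += 1
        return {name: 0.0}

    def branch():
        # A subtree followed by an optional ":length".
        nonlocal pos
        depths = subtree()
        length = 0.0
        if tokens[pos] == ":":
            length = float(tokens[pos + 1])
            pos += 2
        return {leaf: d + length for leaf, d in depths.items()}

    branch()
    return pair_dist


tree = "(mouse:0.01776,(frog:0.28449,human:0.03298):0.02722,rat:0.02102);"
dist = patristic_distances(tree)
for pair, d in sorted(dist.items(), key=lambda item: item[1]):
    print(" - ".join(sorted(pair)), f"{d:.5f}")
```

Sorted this way, mouse and rat come out as the closest pair and the pairs involving frog as the most distant, which matches the observed distance matrix in step 2.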
4) Using drawtree.exe:
4. Using the real sequences with the character-based method (Protpars protein parsimony):
1) Alignment by CLUSTALW is the same as in 3.1).
2) Using protpars.exe to generate tree:
Protein parsimony algorithm, version 3.66
One most parsimonious tree found:

        +--frog
     +--3
  +--2  +--human
  !  !
  1  +-----mouse
  !
  +--------rat

  remember: this is an unrooted tree!

requires a total of    227.000

(((frog,human),mouse),rat);
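Both programs remind the user that their output is an unrooted tree, so two drawings can look different while describing the same branching pattern. One common way to compare unrooted trees is to list the bipartitions (splits) of the species that each internal branch induces and check whether the two lists agree. The sketch below does this for the Fitch and Protpars trees of the BMP2 data, transcribed by hand from their Newick strings as nested tuples. The function names are hypothetical, and branch lengths are deliberately ignored.

```python
def leaf_set(tree):
    """All leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the leaves induced by internal branches."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side;
        # splits that cut off a single leaf exist in every tree.
        if len(clade) >= 2 and len(other) >= 2:
            found.add(frozenset((clade, other)))
        for child in node:
            visit(child)

    visit(tree)
    return found


# (mouse,(frog,human),rat) from Fitch, branch lengths dropped.
fitch_tree = ("mouse", ("frog", "human"), "rat")
# (((frog,human),mouse),rat) from Protpars.
protpars_tree = ((("frog", "human"), "mouse"), "rat")

print(splits(fitch_tree) == splits(protpars_tree))
```

For four species there is only one non-trivial split per tree, so the comparison reduces to asking which two species are grouped together against the other two.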
3) Using drawtree.exe:
Discussion
For the short and simple artificial dataset, the distance method gave a result similar
to that of the parsimony method, and both are very close to the original result on textbook
page 303. For the real dataset of BMP2 sequences, however, the two methods generated trees
that look very different. The distance tree made frog the most distant group and rat and
mouse the closest pair. The parsimony tree was drawn with rat on the outermost branch and
reports no branch lengths. Both outputs are unrooted, so the position of the root in these
drawings is arbitrary. According to the biological evidence, the distance tree should fit
the evolutionary history of the BMP2 example better. Since parsimony is most reliable when
the sequences are highly similar, the unexpected shape of the parsimony tree could result
from the BMP2 proteins being less similar across species than the method requires.
The PHYLIP program set includes multiple algorithms and options, which provides
convenience and flexibility, and versions are available for various OS platforms.
However, the text-based interface is not user-friendly, and there is no online service.
Reference
1. Haeckel, E., Riddle of the Universe at the Close of the Nineteenth Century. 1866.
2. Felsenstein, J., PHYLIP. 2006.