CS566 Tutorial Presentation
Phylogenetic Analysis
Dayong Guo

Introduction

Phylogenetics is the study of evolutionary relatedness among species, populations, or sets of sequences. It was first stated by Ernst Haeckel in his theory of "Ontogeny recapitulates phylogeny"[1]. Besides the study of morphology or phenotype with traditional definitions and concepts, molecular analysis with modern computational tools has shown its unique strength in phylogenetics, since DNA, RNA and protein sequence data are naturally discrete. However, it is difficult to infer a phylogenetic tree from multiple sequence alignments because of the ambiguity of insertions and deletions. Therefore, several computational algorithms have been developed to build phylogenetic trees from multiple input sequences. The most commonly used types of algorithms include distance-matrix methods (e.g. neighbor-joining), maximum parsimony, maximum likelihood and Bayesian inference.

PHYLIP[2] (PHYLogeny Inference Package) is one of the most popular tools for phylogenetic analysis. It includes parsimony, distance-matrix and likelihood methods, so we can practice and compare these algorithms in PHYLIP with some input datasets.

Dataset

Both an artificial dataset and a real dataset were used. The original sequences were first aligned with ClustalX.

Artificial dataset from our textbook, page 303:

>SeqA
ACGCGTTGGGCGATGGCAAC
>SeqB
ACGCGTTGGGCGACGGTAAT
>SeqC
ACGCATTGAATGATGATAAT
>SeqD
ACACATTGAGTGATAATAAT

Real dataset from NCBI: sequences of bone morphogenetic protein 2 (BMP2) from mouse, rat, human and frog. BMP2 is a conserved protein with ~90% identity among species.

>human BMP2
MVAGTRCLLALLLPQVLLGGAAGLVPELGRRKFAAASSGRPSSQPSDEVLSEFELRLLSMFGLKQRPTPS
RDAVVPPYMLDLYRRHSGQPGSPAPDHRLERAASRANTVRSFHHEESLEELPETSGKTTRRFFFNLSSIP
TEEFITSAELQVFREQMQDALGNNSSFHHRINIYEIIKPATANSKFPVTRLLDTRLVNQNASRWESFDVT
PAVMRWTAQGHANHGFVVEVAHLEEKQGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKRE
KRQAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVN
SVNSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>rat BMP2
MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAGASRPLSRPSEDVLSEFELRLLSMFGLKQRPTPSKD
VVVPPYMLDLYRRHSGQPGALAPDHRLERAASRANTVLSFHHEEAIEELSEMSGKTSRRFFFNLSSVPTD
EFLTSAELQIFREQMQEALGNSSFQHRINIYEIIKPATASSKFPVTRLLDTRLVTQNTSQWESFDVTPAV
MRWTAQGHTNHGFVVEVAHLEEKPGVSKRHVRISRSLHQDEHSWSQVRPLLVTFGHDGKGHPLHKREKRQ
AKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSVN
SKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>mouse BMP2
MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAAASSRPLSRPSEDVLSEFELRLLSMFGLKQRPTPSK
DVVVPPYMLDLYRRHSGQPGAPAPDHRLERAASRANTVRSFHHEEAVEELPEMSGKTARRFFFNLSSVPS
DEFLTSAELQIFREQIQEALGNSSFQHRINIYEIIKPAAANLKFPVTRLLDTRLVNQNTSQWESFDVTPA
VMRWTTQGHTNHGFVVEVAHLEENPGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHKREKR
QAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTLVNSV
NSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCGCR
>frog BMP2
MVAGIHSLLLLLFYQVLLSGCTGLIPEEGKRKYTESGRSSPQQSQRVLNQFELRLLSMFGLKRRPTPGKN
VVIPPYMLDLYHLHLAQLAADEGTSAMDFQMERAASRANTVRSFHHEESMEEIPESREKTIQRFFFNLSS
IPNEELVTSAELRIFREQVQEPFESDSSKLHRINIYDIVKPAAAASRGPVVRLLDTRLVHHNESKWESFD
VTPAIARWIAHKQPNHGFVVEVTHLDNDKNVPKKHVRISRSLTPDKDNWPQIRPLLVTFSHDGKGHALHK
RQKRQARHKQRKRLKSSCRRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADHLNSTNHAIVQTL
VNSVNTNIPKACCVPTELSAISMLYLDENEK

Methods

For both the artificial and the real input sequences, the following steps were followed to generate the final phylogenetic trees. The Fitch-Margoliash algorithm was used to represent the distance methods, and the Protpars algorithm to represent the parsimony methods. The results were then compared and discussed.
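
To make the distance criterion concrete: the Fitch-Margoliash method looks for the tree whose path lengths (Exp) best match the observed pairwise distances (Obs), scored by the sum over all ordered pairs i, j of (Obs - Exp)^2 / Obs^2. The following is a minimal sketch, not PHYLIP code, that only evaluates this score for one fixed tree; it does not search for the tree. It uses the distance matrix and branch lengths reported for the artificial dataset in the Results. The names `expected`, `obs` and `side` are hypothetical, and the formula for the average percent standard deviation, 100 * sqrt(SS / (N - 2)) with N the number of off-diagonal distances, is an assumption taken from my reading of the PHYLIP documentation.

```python
import math
from itertools import permutations

# Observed distances for the artificial dataset (protdist output in the Results).
observed = {
    ("SeqC", "SeqD"): 0.146148, ("SeqC", "SeqA"): 0.497602,
    ("SeqC", "SeqB"): 0.387604, ("SeqD", "SeqA"): 0.574539,
    ("SeqD", "SeqB"): 0.456486, ("SeqA", "SeqB"): 0.220676,
}

# Branch lengths reported by fitch.exe: four terminal branches and the
# single internal branch joining node 1 and node 2.
terminal = {"SeqD": 0.10906, "SeqC": 0.03709, "SeqB": 0.05363, "SeqA": 0.16705}
internal = 0.29559
side = {"SeqD": 1, "SeqC": 1, "SeqB": 2, "SeqA": 2}  # internal node each tip hangs from


def obs(a, b):
    """Observed distance, looked up in either order."""
    return observed[(a, b)] if (a, b) in observed else observed[(b, a)]


def expected(a, b):
    """Path length between two tips along the tree."""
    length = terminal[a] + terminal[b]
    if side[a] != side[b]:          # the path crosses the internal branch
        length += internal
    return length


# Sum over all ordered pairs (i, j), i != j, as in the fitch.exe header.
ss = sum(((obs(a, b) - expected(a, b)) / obs(a, b)) ** 2
         for a, b in permutations(terminal, 2))

n = len(terminal) * (len(terminal) - 1)   # 12 off-diagonal distances
apsd = 100 * math.sqrt(ss / (n - 2))      # assumed APSD definition

print(f"Sum of squares = {ss:.5f}")
print(f"Average percent standard deviation = {apsd:.3f}")
```

Under these assumptions the sketch should come out close to the values that fitch.exe reports for this tree (0.00014 and about 0.372), which suggests the reported sum runs over both orderings of each pair.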

1) The input dataset of FASTA sequences is loaded into CLUSTALW (from EBI) to generate an alignment file in PHYLIP format.

2) The alignment file is loaded into the PHYLIP programs to generate output files containing the matrix and the tree. The available algorithms include:

Distance methods:
  Dnadist    DNA distance matrix calculation
  Protdist   Protein distance matrix calculation
  Fitch      Fitch-Margoliash tree drawing method without molecular clock
  Kitsch     Fitch-Margoliash tree drawing method with molecular clock
  Neighbor   Neighbor-Joining and UPGMA tree drawing method

Character-based methods:
  Dnapars    DNA parsimony
  Dnapenny   DNA parsimony using branch-and-bound
  Dnaml      DNA maximum likelihood without molecular clock
  Dnamlk     DNA maximum likelihood with molecular clock
  Protpars   Protein parsimony
  Proml      Protein maximum likelihood

3) The tree is drawn from the result of the previous step. Options include:
  Drawgram   Draws a rooted tree
  Drawtree   Draws an unrooted tree
  Retree     Interactive tree rearrangement

Results

1. Using the artificial sequences with the distance method (Fitch-Margoliash tree drawing without molecular clock):

1) Alignment by CLUSTALW:

     4     20
SeqC        ACGCATTGAA TGATGATAAT
SeqD        ACACATTGAG TGATAATAAT
SeqA        ACGCGTTGGG CGATGGCAAC
SeqB        ACGCGTTGGG CGACGGTAAT

2) Distance matrix generated by protdist.exe (the protein-distance program, here applied to the nucleotide alignment):

    4
SeqC        0.000000  0.146148  0.497602  0.387604
SeqD        0.146148  0.000000  0.574539  0.456486
SeqA        0.497602  0.574539  0.000000  0.220676
SeqB        0.387604  0.456486  0.220676  0.000000

3) Tree generated by fitch.exe:

   4 Populations

Fitch-Margoliash method version 3.66

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  -----------
                                2
                   i  j      Obs

Negative branch lengths not allowed

  +------SeqD
  !
  !                 +--SeqB
  1-----------------2
  !                 +---------SeqA
  !
  +-SeqC

remember: this is an unrooted tree!

Sum of squares =     0.00014

Average percent standard deviation =     0.37228

Between        And            Length
-------        ---            ------
   1          SeqD            0.10906
   1             2            0.29559
   2          SeqB            0.05363
   2          SeqA            0.16705
   1          SeqC            0.03709

(SeqD:0.10906,(SeqB:0.05363,SeqA:0.16705):0.29559,SeqC:0.03709);

4) Using drawtree.exe:

2. Using the artificial sequences with the character-based method (Protpars protein parsimony):

1) Alignment by CLUSTALW:

     4     20
SeqC        ACGCATTGAA TGATGATAAT
SeqD        ACACATTGAG TGATAATAAT
SeqA        ACGCGTTGGG CGATGGCAAC
SeqB        ACGCGTTGGG CGACGGTAAT

2) Using protpars.exe to generate the tree:

Protein parsimony algorithm, version 3.66

One most parsimonious tree found:

        +--SeqB
     +--3
  +--2  +--SeqA
  !  !
  1  +-----SeqD
  !
  +--------SeqC

remember: this is an unrooted tree!

requires a total of     14.000

(((SeqB,SeqA),SeqD),SeqC);

3) Using drawtree.exe:

3. Using the real sequences with the distance method (Fitch-Margoliash tree drawing without molecular clock):

1) Alignment by CLUSTALW:

     4    400
rat         MVAGTRCLLV LLLPQVLLGG AAGLIPELGR KKFAGAS--R PLSRPSEDVL
mouse       MVAGTRCLLV LLLPQVLLGG AAGLIPELGR KKFAAASS-R PLSRPSEDVL
human       MVAGTRCLLA LLLPQVLLGG AAGLVPELGR RKFAAASSGR PSSQPSDEVL
frog        MVAGIHSLLL LLFYQVLLSG CTGLIPEEGK RKYTESG--R SSPQQSQRVL

            SEFELRLLSM FGLKQRPTPS KDVVVPPYML DLYRRHSGQ- ---PGALAPD
            SEFELRLLSM FGLKQRPTPS KDVVVPPYML DLYRRHSGQ- ---PGAPAPD
            SEFELRLLSM FGLKQRPTPS RDAVVPPYML DLYRRHSGQ- ---PGSPAPD
            NQFELRLLSM FGLKRRPTPG KNVVIPPYML DLYHLHLAQL AADEGTSAMD

            HRLERAASRA NTVLSFHHEE AIEELSEMSG KTSRRFFFNL SSVPTDEFLT
            HRLERAASRA NTVRSFHHEE AVEELPEMSG KTARRFFFNL SSVPSDEFLT
            HRLERAASRA NTVRSFHHEE SLEELPETSG KTTRRFFFNL SSIPTEEFIT
            FQMERAASRA NTVRSFHHEE SMEEIPESRE KTIQRFFFNL SSIPNEELVT

            SAELQIFREQ MQEALGN-SS FQHRINIYEI IKPATASSKF PVTRLLDTRL
            SAELQIFREQ IQEALGN-SS FQHRINIYEI IKPAAANLKF PVTRLLDTRL
            SAELQVFREQ MQDALGNNSS FHHRINIYEI IKPATANSKF PVTRLLDTRL
            SAELRIFREQ VQEPFESDSS KLHRINIYDI VKPAAAASRG PVVRLLDTRL

            VTQNTSQWES FDVTPAVMRW TAQGHTNHGF VVEVAHLEEK PGVSKRHVRI
            VNQNTSQWES FDVTPAVMRW TTQGHTNHGF VVEVAHLEEN PGVSKRHVRI
            VNQNASRWES FDVTPAVMRW TAQGHANHGF VVEVAHLEEK QGVSKRHVRI
            VHHNESKWES FDVTPAIARW IAHKQPNHGF VVEVTHLDND KNVPKKHVRI

            SRSLHQDEHS WSQVRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
            SRSLHQDEHS WSQIRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
            SRSLHQDEHS WSQIRPLLVT FGHDGKGHPL HKREKRQAKH KQRKRLKSSC
            SRSLTPDKDN WPQIRPLLVT FSHDGKGHAL HKRQKRQARH KQRKRLKSSC

            KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
            KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
            KRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ
            RRHPLYVDFS DVGWNDWIVA PPGYHAFYCH GECPFPLADH LNSTNHAIVQ

            TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
            TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
            TLVNSVNSKI PKACCVPTEL SAISMLYLDE NEKVVLKNYQ DMVVEGCGCR
            TLVNSVNTNI PKACCVPTEL SAISMLYLDE NEK------- ----------

2) Distance matrix by protdist.exe:

    4
rat         0.000000  0.038781  0.081053  0.335587
mouse       0.038781  0.000000  0.078105  0.326765
human       0.081053  0.078105  0.000000  0.317470
frog        0.335587  0.326765  0.317470  0.000000

3) Tree generated by fitch.exe:

   4 Populations

Fitch-Margoliash method version 3.66

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  -----------
                                2
                   i  j      Obs

Negative branch lengths not allowed

  +mouse
  !
  ! +----------------frog
  1-2
  ! +-human
  !
  +rat

remember: this is an unrooted tree!

Sum of squares =     0.00030

Average percent standard deviation =     0.54531

Between        And            Length
-------        ---            ------
   1          mouse           0.01776
   1             2            0.02722
   2          frog            0.28449
   2          human           0.03298
   1          rat             0.02102

(mouse:0.01776,(frog:0.28449,human:0.03298):0.02722,rat:0.02102);

4) Using drawtree.exe:

4. Using the real sequences with the character-based method (Protpars protein parsimony):

1) Alignment by CLUSTALW is the same as in 3.1.

2) Using protpars.exe to generate the tree:

Protein parsimony algorithm, version 3.66

One most parsimonious tree found:

        +--frog
     +--3
  +--2  +--human
  !  !
  1  +-----mouse
  !
  +--------rat

remember: this is an unrooted tree!

requires a total of    227.000

(((frog,human),mouse),rat);

3) Using drawtree.exe:

Discussion

For the short and simple artificial dataset, the distance method gave a result similar to that of the parsimony method, and both are very close to the original result on textbook page 303. However, with the real dataset of BMP2 sequences, the two methods generated trees with very different lengths. The distance method made frog the most distant group, and rat and mouse the closest groups.
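
Because both programs report unrooted trees, one way to compare their outputs independently of how they are drawn is to reduce each Newick string to the bipartitions (splits) of taxa induced by its internal branches. The following is a minimal sketch, not part of PHYLIP; the helper names `leaf_sets` and `unrooted_splits` are hypothetical, and the parser assumes simple Newick strings like the ones reported in the Results (no quoted names, comments or internal node labels).

```python
import re


def leaf_sets(newick):
    """Collect the set of leaf names enclosed by each pair of parentheses."""
    # Strip branch lengths (":0.01776") and the final ";" so that only
    # names, commas and parentheses remain.
    text = re.sub(r":[0-9.eE+-]+", "", newick).strip().rstrip(";")
    tokens = re.findall(r"[(),]|[^(),\s]+", text)
    stack, groups = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            closed = stack.pop()
            groups.append(frozenset(closed))
            if stack:                    # leaves also belong to the enclosing group
                stack[-1] |= closed
        elif tok != ",":
            stack[-1].add(tok)           # a leaf name
    return groups


def unrooted_splits(newick):
    """Return the non-trivial splits (each side has >= 2 taxa) of a tree."""
    groups = leaf_sets(newick)
    all_leaves = max(groups, key=len)    # the outermost group holds every leaf
    splits = set()
    for group in groups:
        rest = all_leaves - group
        if len(group) >= 2 and len(rest) >= 2:
            splits.add(frozenset((group, rest)))
    return splits


# Newick strings copied from the fitch.exe and protpars.exe outputs for BMP2.
fitch_tree = "(mouse:0.01776,(frog:0.28449,human:0.03298):0.02722,rat:0.02102);"
protpars_tree = "(((frog,human),mouse),rat);"

for name, tree in (("fitch", fitch_tree), ("protpars", protpars_tree)):
    for split in unrooted_splits(tree):
        sides = sorted(split, key=sorted)
        print(name, " | ".join(",".join(sorted(side)) for side in sides))
```

For the two BMP2 strings, this sketch yields the same single split, frog and human on one side and mouse and rat on the other. Read this way, the two outputs agree as unrooted topologies, and the differences lie in the branch lengths and in the way each tree is drawn.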
The parsimony tree drew rat far away from the other groups (both outputs are unrooted trees, so this placement depends on how the tree is drawn). According to the biological evidence, the distance model should fit the evolutionary history better in the BMP2 example. Since the parsimony algorithm requires very high homology among sequences, the improper structure of the parsimony tree could result from the homology of the BMP2 proteins among species being lower than required.

The PHYLIP program set includes multiple algorithms and options, which provides convenience and flexibility, and its different versions allow it to run on various OS platforms. However, the text-based interface is not user-friendly, and there is no online service.

Reference

1. Haeckel, E., Riddle of the Universe at the Close of the Nineteenth Century. 1866.
2. Felsenstein, J., PHYLIP. 2006.