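Distance-based methods
Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca

Lecture Outline
• Objectives in this lecture
  – Grasp the basic concepts of distance-based tree-building algorithms
  – Learn the least-squares criterion and the minimum evolution criterion and how to use them to construct a tree
• Distance-based methods
  – Genetic distance: generally defined as the number of substitutions per site
    • JC69 distance
    • K80 distance
    • TN84 distance
    • F84 distance
    • TN93 distance
    • LogDet distance
  – Tree-building algorithms:
    • UPGMA
    • Neighbor-joining
    • Fitch-Margoliash
    • FastME

Genetic Distances
• Genetic distances: Assuming a substitution model, we can obtain the genetic distance (i.e., difference) between two nucleotide or amino acid sequences, e.g.,
• JC69, where p is the proportion of sites that differ:
$$K_{JC69} = -\frac{3}{4}\ln\left(1 - \frac{4p}{3}\right)$$
• K80, where P and Q are the proportions of sites with transitional and transversional differences:
$$K_{K80} = \frac{1}{2}\ln\frac{1}{1 - 2P - Q} + \frac{1}{4}\ln\frac{1}{1 - 2Q}$$
• TN93:
$$D_{TN93} = 4\pi_T\pi_C\alpha_1 + 4\pi_A\pi_G\alpha_2 + 4\pi_Y\pi_R\beta$$
$$\alpha_1 = \frac{-\ln\left(1 - \frac{\pi_Y P_1}{2\pi_T\pi_C} - \frac{Q}{2\pi_Y}\right) + \pi_R\ln\left(1 - \frac{Q}{2\pi_Y\pi_R}\right)}{2\pi_Y}$$
$$\alpha_2 = \frac{-\ln\left(1 - \frac{\pi_R P_2}{2\pi_A\pi_G} - \frac{Q}{2\pi_R}\right) + \pi_Y\ln\left(1 - \frac{Q}{2\pi_Y\pi_R}\right)}{2\pi_R}$$
$$\beta = -\frac{1}{2}\ln\left(1 - \frac{Q}{2\pi_Y\pi_R}\right)$$
  Here π_Y = π_T + π_C and π_R = π_A + π_G are the pyrimidine and purine frequencies, P1 and P2 are the proportions of sites with T↔C and A↔G transitions, and Q is the proportion of sites with transversions.

Calculation of K_JC69
[Figure: an ancestral sequence AACGACGATCG gives rise, after time t along each lineage, to Species 1 (AACGACGATCG) and Species 2 (AACGACGATCG).]
The time separating Species 1 and Species 2 is 2t.
$$K = -\frac{3}{4}\ln\left(1 - \frac{4p}{3}\right)$$

Sp1: AAG CCT CGG GGC CCT TAT TTT TTG
     ||  |   ||| ||| |   ||| ||| ||
Sp2: AAT CTC CGG GGC CTC TAT TTT TTT

p = 6/24 = 0.25
K = 0.304099
Genetic distances are scaled to be the number of substitutions per site.

Numerical Illustration

Sp1: AAG CCT CGG GGC CCT TAT TTT TTG
     ||  |   ||| ||| |   ||| ||| ||
Sp2: AAT CTC CGG GGC CTC TAT TTT TTT

What are P and Q?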
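One way to check the answer to this question is to count the differences directly. The sketch below is only an illustration under stated assumptions: the helper names `count_differences`, `jc69` and `k80` are made up here, and the sequences are assumed to be aligned and free of gaps or ambiguous bases. It classifies each differing site as a transition (purine↔purine or pyrimidine↔pyrimidine) or a transversion, then applies the JC69 and K80 formulas.

```python
import math

PURINES = {"A", "G"}  # everything else (C, T) is treated as a pyrimidine


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Same chemical class on both sides means a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    return len(seq1), transitions, transversions


def jc69(p):
    """JC69 distance from the proportion p of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)


def k80(P, Q):
    """K80 distance from transition (P) and transversion (Q) proportions."""
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


# The two sequences from the slide, with the codon spacing removed.
sp1 = "AAG CCT CGG GGC CCT TAT TTT TTG".replace(" ", "")
sp2 = "AAT CTC CGG GGC CTC TAT TTT TTT".replace(" ", "")

n, ts, tv = count_differences(sp1, sp2)
P, Q = ts / n, tv / n
p = P + Q

print(f"sites={n} transitions={ts} transversions={tv}")
print(f"JC69 = {jc69(p):.6f}")
print(f"K80  = {k80(P, Q):.6f}")
```

Both distance functions fail with a math domain error when the argument of a logarithm is not positive, which happens for very divergent sequences; a production implementation would need to handle that case explicitly.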
P = 4/24, Q = 2/24
$$K_{K80} = -\frac{\ln(1 - 2P - Q)}{2} - \frac{\ln(1 - 2Q)}{4} = 0.31507864$$

Comparison of distances:
  p = 0.25
  Poisson: -ln(1 - p) = 0.288
  K_JC69 = 0.304099
  K_K80 = 0.3150786

Distance-based phylogenetic algorithms

Algorithm           Optimization   Assumes a molecular clock
UPGMA               Local          Yes
Neighbor-joining    Local          No
Minimum Evolution   Global         No
Fitch-Margoliash    Global         No
FastME              Global         No

A Star Tree (Completely Unresolved Tree)
[Figure: star tree joining Human, Chimpanzee, Gorilla, Orangutan and Gibbon at a single node.]

Genetic Distance Matrix
Matrix of genetic distances (Dij):

         Human  Chimp  Gorilla  Orang
Chimp    0.015
Gorilla  0.045  0.030
Orang    0.143  0.126  0.092
Gibbon   0.198  0.179  0.179    0.179

UPGMA
• The smallest distance in the matrix is between Human and Chimp (0.015), so they are joined first: ((hu,ch),(go,or,gi))
• The distance from the new cluster hu-ch to each remaining OTU is the average of the original distances:
  D(hu-ch),go = (Dhu,go + Dch,go)/2 = 0.038
  D(hu-ch),or = (Dhu,or + Dch,or)/2 = 0.135
  D(hu-ch),gi = (Dhu,gi + Dch,gi)/2 = 0.189
• Reduced matrix:

         hu-ch  Gorilla  Orang
Gorilla  0.038
Orang    0.135  0.092
Gibbon   0.189  0.179    0.179

• The smallest distance is now between hu-ch and Gorilla (0.038): (((hu,ch),go),(or,gi))

UPGMA (continued)
• Distances from the cluster hu-ch-go are again averages over the original matrix:
  D(hu-ch-go),or = (Dhu,or + Dch,or + Dgo,or)/3 = 0.120
  D(hu-ch-go),gi = (Dhu,gi + Dch,gi + Dgo,gi)/3 = 0.185
• Reduced matrix:

         hu-ch-go  Orang
Orang    0.120
Gibbon   0.185     0.179

• Orangutan joins hu-ch-go (0.120), and Gibbon joins last:
  D(hu-ch-go-or),gi = (Dhu,gi + Dch,gi + Dgo,gi + Dor,gi)/4 = 0.184
• Final topology: ((((hu,ch),go),or),gi)

Phylogenetic Relationship from UPGMA
• Original matrix:

         Human  Chimp  Gorilla  Orang
Chimp    0.015
Gorilla  0.045  0.030
Orang    0.143  0.126  0.092
Gibbon   0.198  0.179  0.179    0.179

• After joining Human and Chimp:

         hu-ch  Gorilla  Orang
Gorilla  0.038
Orang    0.135  0.092
Gibbon   0.189  0.179    0.179

• After joining Gorilla:

         hu-ch-go  Orang
Orang    0.120
Gibbon   0.185     0.179
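The clustering steps on the UPGMA slides can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the function name `upgma`, the dictionary-of-pairs matrix representation and the two-letter OTU labels are assumptions made for the example. Each new cluster's distance to the others is the size-weighted mean of its two parts, which equals the plain average over the original pairwise distances used on the slides, and each node is placed at half the distance between the clusters it joins.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    dist maps frozenset({a, b}) to the distance between OTUs a and b.
    Returns a Newick string and the list of (cluster name, joining distance).
    """
    # Each cluster stores: Newick text, number of member OTUs, node height.
    clusters = {name: (name, 1, 0.0) for name in names}
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda ab: d[frozenset(ab)])
        dab = d[frozenset((a, b))]
        height = dab / 2
        (text_a, size_a, h_a), (text_b, size_b, h_b) = clusters[a], clusters[b]
        new = f"{a}-{b}"
        newick = f"({text_a}:{height - h_a:.4f},{text_b}:{height - h_b:.4f})"
        del clusters[a], clusters[b]
        # Size-weighted average = average over all original OTU pairs.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = (newick, size_a + size_b, height)
        merges.append((new, dab))
    (newick, _, _), = clusters.values()
    return newick + ";", merges


# Lower-triangular matrix from the slides (rows: ch, go, or, gi).
names = ["hu", "ch", "go", "or", "gi"]
lower = [
    [0.015],
    [0.045, 0.030],
    [0.143, 0.126, 0.092],
    [0.198, 0.179, 0.179, 0.179],
]
dist = {
    frozenset((names[i + 1], names[j])): value
    for i, row in enumerate(lower)
    for j, value in enumerate(row)
}

tree, merges = upgma(names, dist)
for cluster, distance in merges:
    print(f"{cluster}: joined at distance {distance:.4f}")
print(tree)
```

The sketch works with unrounded distances, so its joining distances (0.0375, 0.1203, 0.1838) and the branch lengths in the printed tree differ in the last digit or two from the rounded values shown on the slides. Ties are broken by whichever pair is encountered first, which is one of several possible conventions.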
Branch Lengths
Each node is placed at half the distance between the two clusters it joins:
  Dhu,ch = 0.015, node height 0.0075
  D(hu-ch),go = (Dhu,go + Dch,go)/2 = 0.038, node height 0.019
  D(hu-ch-go),or = (Dhu,or + Dch,or + Dgo,or)/3 = 0.120, node height 0.06
  D(hu-ch-go-or),gi = (Dhu,gi + Dch,gi + Dgo,gi + Dor,gi)/4 = 0.184, node height 0.092

Adding the branch lengths to the successive topologies:
  ((hu,ch),(go,or,gi)) becomes ((hu:0.0075,ch:0.0075),(go,or,gi))
  (((hu,ch),go),(or,gi)) becomes (((hu:0.0075,ch:0.0075):0.0115,go:0.019),(or,gi))
  ((((hu,ch),go),or),gi) becomes ((((hu:0.0075,ch:0.0075):0.0115,go:0.019):0.041,or:0.06):0.032,gi:0.092)

Final UPGMA Tree
[Figure: rooted UPGMA tree of Human, Chimp, Gorilla, Orang and Gibbon, with node heights 0.0075, 0.019, 0.060 and 0.092 and a time axis labelled 6, 8, 13 and 19 MY.]
((((hu:0.0075,ch:0.0075):0.0115,go:0.019):0.041,or:0.06):0.032,gi:0.092);

Distance-based method
• Distance matrix
• Tree-building algorithms
  – UPGMA
  – Neighbor-joining
  – FastME
  – Fitch-Margoliash
• Criterion-based methods
  – Branch-length estimation
  – Tree-selection criterion

Branch Length Estimation
• For three OTUs, the branch lengths can be estimated directly
• For more than three OTUs, there are two commonly used methods for estimating branch lengths
  – The least-squares method
  – The Fitch-Margoliash method
• Don't confuse the Fitch-Margoliash method of branch length estimation with the Fitch-Margoliash criterion of tree selection
• Illustration of the least-squares method of branch length estimation

For three OTUs

     1      2              1    2
2    0.092            2    d12
3    0.179  0.179     3    d13  d23

[Figure: star tree with branches x1, x2 and x3 leading to OTUs 1, 2 and 3.]
  d12 = x1 + x2
  d13 = x1 + x3
  d23 = x2 + x3
Solving these three equations gives x1 = (d12 + d13 - d23)/2, x2 = (d12 + d23 - d13)/2 and x3 = (d13 + d23 - d12)/2.

Least-squares method

4
Sp1                       Sp1
Sp2  0.3                  Sp2  d12
Sp3  0.4  0.5             Sp3  d13  d23
Sp4  0.4  0.6  0.6        Sp4  d14  d24  d34

[Figure: unrooted tree for four OTUs. OTUs 1 and 2 attach by branches x1 and x2 to one end of the internal branch x5; OTUs 3 and 4 attach by branches x3 and x4 to the other end.]

Least-squares method
The distance expected on the tree (d'ij) is the sum of the branches on the path between OTUs i and j:
$$\begin{aligned}
d'_{12} &= x_1 + x_2, & (d_{12} - d'_{12})^2 &= [d_{12} - (x_1 + x_2)]^2 \\
d'_{13} &= x_1 + x_5 + x_3, & (d_{13} - d'_{13})^2 &= [d_{13} - (x_1 + x_5 + x_3)]^2 \\
d'_{14} &= x_1 + x_5 + x_4, & (d_{14} - d'_{14})^2 &= [d_{14} - (x_1 + x_5 + x_4)]^2 \\
d'_{23} &= x_2 + x_5 + x_3, & (d_{23} - d'_{23})^2 &= [d_{23} - (x_2 + x_5 + x_3)]^2 \\
d'_{24} &= x_2 + x_5 + x_4, & (d_{24} - d'_{24})^2 &= [d_{24} - (x_2 + x_5 + x_4)]^2 \\
d'_{34} &= x_3 + x_4, & (d_{34} - d'_{34})^2 &= [d_{34} - (x_3 + x_4)]^2
\end{aligned}$$
$$SS = \sum_{i<j}^{n} (d_{ij} - d'_{ij})^2$$
Least-squares method: find the xi values that minimize SS.

Least-squares method
$$\begin{aligned}
SS ={} & [d_{12} - (x_1 + x_2)]^2 + [d_{13} - (x_1 + x_5 + x_3)]^2 + [d_{14} - (x_1 + x_5 + x_4)]^2 \\
& + [d_{23} - (x_2 + x_5 + x_3)]^2 + [d_{24} - (x_2 + x_5 + x_4)]^2 + [d_{34} - (x_3 + x_4)]^2
\end{aligned}$$
Taking the partial derivative of SS with respect to each xi, we have
$$\begin{aligned}
\partial SS/\partial x_1 &= 6x_1 + 2x_2 + 2x_3 + 2x_4 + 4x_5 - 2d_{12} - 2d_{13} - 2d_{14} \\
\partial SS/\partial x_2 &= 2x_1 + 6x_2 + 2x_3 + 2x_4 + 4x_5 - 2d_{12} - 2d_{23} - 2d_{24} \\
\partial SS/\partial x_3 &= 2x_1 + 2x_2 + 6x_3 + 2x_4 + 4x_5 - 2d_{13} - 2d_{23} - 2d_{34} \\
\partial SS/\partial x_4 &= 2x_1 + 2x_2 + 2x_3 + 6x_4 + 4x_5 - 2d_{14} - 2d_{24} - 2d_{34} \\
\partial SS/\partial x_5 &= 4x_1 + 4x_2 + 4x_3 + 4x_4 + 8x_5 - 2d_{13} - 2d_{14} - 2d_{23} - 2d_{24}
\end{aligned}$$
Setting these partial derivatives to 0 and solving for xi, we have
$$\begin{aligned}
x_1 &= d_{12}/2 + d_{13}/4 + d_{14}/4 - d_{23}/4 - d_{24}/4 \\
x_2 &= d_{12}/2 - d_{13}/4 - d_{14}/4 + d_{23}/4 + d_{24}/4 \\
x_3 &= d_{34}/2 + d_{13}/4 - d_{14}/4 + d_{23}/4 - d_{24}/4 \\
x_4 &= d_{34}/2 - d_{13}/4 + d_{14}/4 - d_{23}/4 + d_{24}/4 \\
x_5 &= -d_{12}/2 - d_{34}/2 + d_{13}/4 + d_{14}/4 + d_{23}/4 + d_{24}/4
\end{aligned}$$

Least-squares method
Substituting the example distances (d12 = 0.3, d13 = 0.4, d23 = 0.5, d14 = 0.4, d24 = 0.6, d34 = 0.6) into these solutions gives
  x1 = 0.075
  x2 = 0.225
  x3 = 0.275
  x4 = 0.325
  x5 = 0.025

Minimum Evolution Criterion
$$TreeLen = \sum_{i=1}^{2n-3} x_i$$
where n is the number of OTUs.
[Figure: three alternative unrooted topologies for four OTUs, each with terminal branches x1 to x4 and internal branch x5.]
The minimum evolution (ME) criterion: the tree with the shortest TreeLen is the best tree.
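The least-squares solutions and the tree length can be evaluated with a few lines of Python. The sketch below is an illustration under stated assumptions, not code from the lecture: `ls_quartet` and `sum_of_squares` are hypothetical helper names, the matrix is stored as a dictionary keyed by OTU pairs, and the closed-form solutions derived for the tree ((1,2),(3,4)) are reused for the other four-OTU topologies simply by relabelling which OTUs sit on each side of the internal branch.

```python
def ls_quartet(d, a, b, c, e):
    """Least-squares branch lengths for the unrooted tree ((a,b),(c,e)).

    d maps frozenset({i, j}) to the observed distance between OTUs i and j.
    Returns (x_a, x_b, x_c, x_e, x_internal); with (a, b, c, e) = (1, 2, 3, 4)
    these correspond to x1..x5 in the derivation.
    """
    dab, dce = d[frozenset((a, b))], d[frozenset((c, e))]
    dac, dae = d[frozenset((a, c))], d[frozenset((a, e))]
    dbc, dbe = d[frozenset((b, c))], d[frozenset((b, e))]
    xa = dab / 2 + (dac + dae - dbc - dbe) / 4
    xb = dab / 2 - (dac + dae - dbc - dbe) / 4
    xc = dce / 2 + (dac + dbc - dae - dbe) / 4
    xe = dce / 2 - (dac + dbc - dae - dbe) / 4
    x_int = (dac + dae + dbc + dbe) / 4 - (dab + dce) / 2
    return xa, xb, xc, xe, x_int


def sum_of_squares(d, a, b, c, e, x):
    """SS between observed distances and path lengths on ((a,b),(c,e))."""
    xa, xb, xc, xe, x_int = x
    fitted = {
        (a, b): xa + xb,
        (c, e): xc + xe,
        (a, c): xa + x_int + xc,
        (a, e): xa + x_int + xe,
        (b, c): xb + x_int + xc,
        (b, e): xb + x_int + xe,
    }
    return sum((d[frozenset(pair)] - v) ** 2 for pair, v in fitted.items())


# Example distance matrix from the least-squares slides.
d = {
    frozenset((1, 2)): 0.3,
    frozenset((1, 3)): 0.4,
    frozenset((2, 3)): 0.5,
    frozenset((1, 4)): 0.4,
    frozenset((2, 4)): 0.6,
    frozenset((3, 4)): 0.6,
}

# The three possible unrooted topologies for four OTUs.
for a, b, c, e in [(1, 2, 3, 4), (1, 3, 2, 4), (1, 4, 2, 3)]:
    x = ls_quartet(d, a, b, c, e)
    tree_len = sum(x)  # TreeLen = sum of all 2n - 3 = 5 branches
    ss = sum_of_squares(d, a, b, c, e, x)
    print(f"(({a},{b}),({c},{e})): TreeLen = {tree_len:.3f}, SS = {ss:.4f}")
```

For the topology ((1,2),(3,4)) the sketch reproduces the branch lengths on the slide and a tree length of 0.925. With this particular matrix, ((1,4),(2,3)) comes out at the same length, so the minimum evolution criterion alone does not separate those two trees, while ((1,3),(2,4)) is longer (0.95) and has a negative internal branch.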