Distance-based methods

Distance-based methods
Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca
Lecture Outline
• Objectives in this lecture
– Grasp the basic concepts of distance-based tree-building algorithms
– Learn the least-squares criterion and the minimum evolution criterion and how to use them to construct a tree
• Distance-based methods
– Genetic distance: generally defined as the number of substitutions per site.
  • JC69 distance
  • K80 distance
  • TN84 distance
  • F84 distance
  • TN93 distance
  • LogDet distance
– Tree-building algorithms (UPGMA):
  • UPGMA
  • Neighbor-joining
  • Fitch-Margoliash
  • FastME
Slide 2
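Genetic Distances
• Genetic distances: Assuming a substitution model, we can obtain the genetic distance (i.e., difference) between two nucleotide or amino acid sequences, e.g.,
• JC69 (p is the proportion of sites that differ):
$$K_{JC69} = -\frac{3}{4}\ln\left(1 - \frac{4p}{3}\right)$$
• K80 (P and Q are the proportions of sites with transitional and transversional differences):
$$K_{K80} = \frac{1}{2}\ln\left(\frac{1}{1 - 2P - Q}\right) + \frac{1}{4}\ln\left(\frac{1}{1 - 2Q}\right)$$
• TN93:
$$D_{TN93} = 4\pi_T\pi_C\alpha_1 + 4\pi_A\pi_G\alpha_2 + 4\pi_Y\pi_R\beta$$
where
$$\alpha_1 = \frac{-\ln\left(1 - \frac{\pi_Y P_1}{2\pi_T\pi_C} - \frac{Q}{2\pi_Y}\right) + \pi_R\ln\left(1 - \frac{Q}{2\pi_Y\pi_R}\right)}{2\pi_Y}$$
$$\alpha_2 = \frac{-\ln\left(1 - \frac{\pi_R P_2}{2\pi_A\pi_G} - \frac{Q}{2\pi_R}\right) + \pi_Y\ln\left(1 - \frac{Q}{2\pi_Y\pi_R}\right)}{2\pi_R}$$
$$\beta = \frac{-\ln\left(1 - \frac{Q}{2\pi_Y\pi_R}\right)}{2}$$
Here P1 and P2 are the proportions of sites with T↔C and A↔G transitional differences, the π values are nucleotide frequencies with π_Y = π_T + π_C and π_R = π_A + π_G, and α1, α2 and β are rate-by-time products.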
Slide 3
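The distance formulas on this slide can be checked numerically. The following is a minimal Python sketch, not code from the lecture: the function names `tn93_distance` and `k80_distance` and the equal-frequency test values are assumptions chosen for illustration. It relies on the fact that TN93 reduces to K80 when all four nucleotide frequencies are 0.25 and the two kinds of transitions are equally frequent.

```python
import math

def tn93_distance(p1, p2, q, freqs):
    """TN93 distance (hypothetical helper).

    p1: proportion of sites with T<->C transitional differences
    p2: proportion of sites with A<->G transitional differences
    q:  proportion of sites with transversional differences
    freqs: nucleotide frequencies keyed by "A", "C", "G", "T"
    """
    pa, pc, pg, pt = (freqs[base] for base in "ACGT")
    py, pr = pt + pc, pa + pg  # pyrimidine and purine frequencies

    log_q = math.log(1 - q / (2 * py * pr))
    alpha1 = (-math.log(1 - py * p1 / (2 * pt * pc) - q / (2 * py))
              + pr * log_q) / (2 * py)
    alpha2 = (-math.log(1 - pr * p2 / (2 * pa * pg) - q / (2 * pr))
              + py * log_q) / (2 * pr)
    beta = -log_q / 2
    return 4 * pt * pc * alpha1 + 4 * pa * pg * alpha2 + 4 * py * pr * beta

def k80_distance(p, q):
    """K80 distance from transition (p) and transversion (q) proportions."""
    return -math.log(1 - 2 * p - q) / 2 - math.log(1 - 2 * q) / 4

# Assumed test case: equal frequencies, transitions split evenly (P1 = P2 = P/2).
equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
d_tn93 = tn93_distance(2 / 24, 2 / 24, 2 / 24, equal)
d_k80 = k80_distance(4 / 24, 2 / 24)
print(d_tn93, d_k80)  # the two values should agree
```

With unequal frequencies or unequal P1 and P2 the two distances generally differ, which is the reason for using the more general model.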
Calculation of K_JC69
[Figure: an ancestral sequence AACGACGATCG diverges into Species 1 and Species 2 (each shown as AACGACGATCG). Each lineage evolves for time t, so the time between Species 1 and Species 2 is 2t.]
$$K = -\frac{3}{4}\ln\left(1 - \frac{4p}{3}\right)$$
Sp1: AAG CCT CGG GGC CCT TAT TTT TTG
     ||  |   ||| ||| |   ||| ||| ||
Sp2: AAT CTC CGG GGC CTC TAT TTT TTT
p = 6/24 = 0.25
K = 0.304099
Genetic distances are scaled to be the number of substitutions per site.
Slide 4
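The JC69 calculation on this slide can be reproduced in a few lines. This is a small sketch under the assumption that the two sequences are already aligned and contain no gaps; the helper names `p_distance` and `jc69_distance` are made up for this example.

```python
import math

# The two aligned sequences from the slide (spaces only separate codons).
sp1 = "AAG CCT CGG GGC CCT TAT TTT TTG".replace(" ", "")
sp2 = "AAT CTC CGG GGC CTC TAT TTT TTT".replace(" ", "")

def p_distance(a, b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc69_distance(p):
    """JC69 distance: K = -(3/4) ln(1 - 4p/3)."""
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance(sp1, sp2)
k_jc69 = jc69_distance(p)
print(p, round(k_jc69, 6))
```

Note that the formula is undefined once p reaches 0.75, so a production version would need to guard against saturated sequence pairs.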
Numerical Illustration
Sp1: AAG CCT CGG GGC CCT TAT TTT TTG
     ||  |   ||| ||| |   ||| ||| ||
Sp2: AAT CTC CGG GGC CTC TAT TTT TTT
What are P and Q?
P = 4/24 (four transitions), Q = 2/24 (two transversions)
$$K_{K80} = -\frac{\ln(1 - 2P - Q)}{2} - \frac{\ln(1 - 2Q)}{4} = 0.31507864$$
Comparison of distances:
p = 0.25
Poisson distance = -ln(1 - p) = 0.288
K_JC69 = 0.304099
K_K80 = 0.3150786
Slide 5
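To make the counting of P and Q concrete, here is a hedged Python sketch that classifies each difference as a transition or a transversion and then evaluates the K80 and Poisson distances. The function name `transition_transversion` and the variable names are illustrative assumptions, and the sequences are assumed to be gap-free and already aligned.

```python
import math

PURINES = {"A", "G"}

sp1 = "AAG CCT CGG GGC CCT TAT TTT TTG".replace(" ", "")
sp2 = "AAT CTC CGG GGC CTC TAT TTT TTT".replace(" ", "")

def transition_transversion(a, b):
    """Return (P, Q): proportions of transitional and transversional sites."""
    ts = tv = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        # A transition stays within purines or within pyrimidines.
        if (x in PURINES) == (y in PURINES):
            ts += 1
        else:
            tv += 1
    return ts / len(a), tv / len(a)

P, Q = transition_transversion(sp1, sp2)
k_k80 = -math.log(1 - 2 * P - Q) / 2 - math.log(1 - 2 * Q) / 4
poisson = -math.log(1 - (P + Q))  # P + Q is the overall p-distance
print(P, Q, k_k80, poisson)
```

The ordering p < Poisson < JC69 < K80 seen on the slide reflects how much multiple-hit correction each model applies.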
Distance-based phylogenetic algorithms
Algorithm            Optimization   Assuming a molecular clock
UPGMA                Local          Yes
Neighbor-joining     Local          No
Minimum Evolution    Global         No
Fitch-Margoliash     Global         No
FastME               Global         No
Slide 6
A Star Tree (Completely Unresolved Tree)
[Figure: star tree in which Human, Chimpanzee, Gorilla, Orangutan, and Gibbon all radiate from a single central node.]
Slide 7
Genetic Distance Matrix
Matrix of genetic distances (Dij):
          Human   Chimp   Gorilla   Orang
Chimp     0.015
Gorilla   0.045   0.030
Orang     0.143   0.126   0.092
Gibbon    0.198   0.179   0.179     0.179
Slide 8
UPGMA
• Start from the original matrix. The smallest distance is between Human and Chimp (0.015), so they are joined first:
          Human   Chimp   Gorilla   Orang
Chimp     0.015
Gorilla   0.045   0.030
Orang     0.143   0.126   0.092
Gibbon    0.198   0.179   0.179     0.179
(hu,ch),(go,or,gi)
• Distances from the hu-ch cluster to the remaining OTUs:
D(hu-ch),go = (Dhu,go + Dch,go)/2 = 0.038
D(hu-ch),or = (Dhu,or + Dch,or)/2 = 0.135
D(hu-ch),gi = (Dhu,gi + Dch,gi)/2 = 0.189
• Reduced matrix. The smallest distance is now between hu-ch and Gorilla (0.038), so Gorilla joins next:
          hu-ch   Gorilla   Orang
Gorilla   0.038
Orang     0.135   0.092
Gibbon    0.189   0.179     0.179
((hu,ch),go),(or,gi)
Slide 9
UPGMA
• Distances from the hu-ch-go cluster are averaged over the original matrix:
          Human   Chimp   Gorilla   Orang
Chimp     0.015
Gorilla   0.045   0.030
Orang     0.143   0.126   0.092
Gibbon    0.198   0.179   0.179     0.179
D(hu-ch-go),or = (Dhu,or + Dch,or + Dgo,or)/3 = 0.120
D(hu-ch-go),gi = (Dhu,gi + Dch,gi + Dgo,gi)/3 = 0.185
• Reduced matrix. The smallest distance is between hu-ch-go and Orang (0.120), so Orang joins next and Gibbon joins last:
          hu-ch-go   Orang
Orang     0.120
Gibbon    0.185      0.179
D(hu-ch-go-or),gi = (Dhu,gi + Dch,gi + Dgo,gi + Dor,gi)/4 = 0.184
((((hu,ch),go),or),gi)
Slide 10
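The clustering steps on the two UPGMA slides can be sketched in plain Python. This is an illustrative implementation rather than the lecture's own code: it stores clusters as tuples of taxon labels and recomputes each cluster-to-cluster distance as the mean of the original pairwise distances, which is what the averaging formulas on the slides do. The names `cluster_distance` and `upgma` are assumptions.

```python
from itertools import combinations

TAXA = ["hu", "ch", "go", "or", "gi"]
D = {
    ("hu", "ch"): 0.015,
    ("hu", "go"): 0.045, ("ch", "go"): 0.030,
    ("hu", "or"): 0.143, ("ch", "or"): 0.126, ("go", "or"): 0.092,
    ("hu", "gi"): 0.198, ("ch", "gi"): 0.179, ("go", "gi"): 0.179,
    ("or", "gi"): 0.179,
}

def dist(a, b):
    """Look up the original distance between two taxa in either order."""
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def cluster_distance(c1, c2):
    """Mean of all original distances between members of two clusters."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def upgma(taxa):
    """Return the list of merges as (cluster1, cluster2, joining distance)."""
    clusters = [(t,) for t in taxa]
    merges = []
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = upgma(TAXA)
for c1, c2, d in merges:
    print("-".join(c1), "+", "-".join(c2), f"{d:.4f}")
```

The unrounded joining distances (0.0375, 0.1203, 0.1838) correspond to the values 0.038, 0.120, and 0.184 reported on the slides.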
Phylogenetic Relationship from UPGMA
• Original matrix (Human and Chimp are joined first, D = 0.015):
          Human   Chimp   Gorilla   Orang
Chimp     0.015
Gorilla   0.045   0.030
Orang     0.143   0.126   0.092
Gibbon    0.198   0.179   0.179     0.179
• After joining hu-ch (Gorilla joins next, D = 0.038):
          hu-ch   Gorilla   Orang
Gorilla   0.038
Orang     0.135   0.092
Gibbon    0.189   0.179     0.179
• After joining hu-ch-go (Orang joins next, D = 0.120; Gibbon joins last):
          hu-ch-go   Orang
Orang     0.120
Gibbon    0.185      0.179
Slide 11
Branch Lengths
Dhu-ch = 0.015
D(hu-ch),go = (Dhu,go + Dch,go)/2 = 0.038
D(hu-ch),or = (Dhu,or + Dch,or)/2 = 0.135
D(hu-ch),gi = (Dhu,gi + Dch,gi)/2 = 0.189
D(hu-ch-go),or = (Dhu,or + Dch,or + Dgo,or)/3 = 0.120
D(hu-ch-go),gi = (Dhu,gi + Dch,gi + Dgo,gi)/3 = 0.185
D(hu-ch-go-or),gi = (Dhu,gi + Dch,gi + Dgo,gi + Dor,gi)/4 = 0.184
Topology at each step:
((hu,ch),(go,or,gi))
(((hu,ch),go),(or,gi))
((((hu,ch),go),or),gi)
[Figure: UPGMA tree of Human, Chimp, Gorilla, Orang, and Gibbon with node heights 0.0075, 0.019, 0.06, and 0.092, each equal to half the distance at which the two clusters were joined.]
Topology with branch lengths at each step (an internal branch is the difference between two successive node heights, e.g., 0.019 - 0.0075 = 0.0115):
((hu:0.0075,ch:0.0075),(go,or,gi))
(((hu:0.0075,ch:0.0075):0.0115,go:0.019),(or,gi))
((((hu:0.0075,ch:0.0075):0.0115,go:0.019):0.041,or:0.06):0.032,gi:0.092)
Slide 12
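The step from node heights to branch lengths can be written out explicitly. The sketch below assumes the rounded node heights shown on the slide (0.0075, 0.019, 0.06, 0.092) and assembles the Newick string by hand for this one ladder-shaped tree; it is not a general tree writer.

```python
# Node heights from the slide: half the distance at which clusters were joined.
heights = [0.0075, 0.019, 0.06, 0.092]  # hu-ch, +go, +or, +gi

# Each internal branch spans the gap between two successive node heights.
internal = [upper - lower for lower, upper in zip(heights, heights[1:])]

# Under a molecular clock, a tip branch equals the height of the node it joins.
newick = (
    f"((((hu:{heights[0]:.4f},ch:{heights[0]:.4f}):{internal[0]:.4f},"
    f"go:{heights[1]:.4f}):{internal[1]:.4f},"
    f"or:{heights[2]:.4f}):{internal[2]:.4f},"
    f"gi:{heights[3]:.4f});"
)
print([round(b, 4) for b in internal])
print(newick)
```

Using unrounded joining distances instead of the rounded heights would shift the internal branch lengths slightly in the third or fourth decimal place.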
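Final UPGMA Tree
[Figure: UPGMA tree of Human, Chimp, Gorilla, Orang, and Gibbon with node heights and divergence times in million years (MY).]
Node height   Divergence time (MY)
0.0075        6
0.019         8
0.060         13
0.092         19
((((hu:0.0075,ch:0.0075):0.0115,go:0.019):0.041,or:0.06):0.032,gi:0.092);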
Slide 13
Distance-based method
• Distance matrix
• Tree-building algorithms
– UPGMA
– Neighbor-joining
– FastME
– Fitch-Margoliash
• Criterion-based methods
– Branch-length estimation
– Tree-selection criterion
Slide 14
Branch Length Estimation
• For three OTUs, the branch lengths can be estimated directly
• For more than three OTUs, there are two commonly used methods for estimating branch lengths
– The least-squares method
– Fitch-Margoliash method
• Don’t confuse the Fitch-Margoliash method of branch length estimation with the Fitch-Margoliash criterion of tree selection
• Illustration of the least-squares method of branch length estimation
Slide 15
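For three OTUs
Distance matrix (example values, then the corresponding symbols):
     1       2
2    0.092
3    0.179   0.179

     1     2
2    d12
3    d13   d23
[Figure: unrooted tree with three branches x1, x2, and x3 leading from a central node to OTUs 1, 2, and 3.]
d12 = x1 + x2
d13 = x1 + x3
d23 = x2 + x3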
Slide 16
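With three OTUs there are three equations and three unknowns, so the branch lengths have an exact solution. The following sketch solves the system by adding two equations and subtracting the third; the function name `three_otu_branches` is an assumption, and the input values are the example distances from the slide.

```python
def three_otu_branches(d12, d13, d23):
    """Solve d12 = x1 + x2, d13 = x1 + x3, d23 = x2 + x3 for x1, x2, x3."""
    x1 = (d12 + d13 - d23) / 2
    x2 = (d12 + d23 - d13) / 2
    x3 = (d13 + d23 - d12) / 2
    return x1, x2, x3

x1, x2, x3 = three_otu_branches(0.092, 0.179, 0.179)
print(round(x1, 3), round(x2, 3), round(x3, 3))
```

Because the fit is exact, the three pairwise path lengths on the tree reproduce the input distances without error; this stops being true with four or more OTUs.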
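Least-squares method
Distance matrix for four OTUs (example values, then the corresponding symbols):
      Sp1   Sp2   Sp3
Sp2   0.3
Sp3   0.4   0.5
Sp4   0.4   0.6   0.6

      Sp1   Sp2   Sp3
Sp2   d12
Sp3   d13   d23
Sp4   d14   d24   d34
[Figure: unrooted tree ((1,2),(3,4)). Branches x1 and x2 lead to OTUs 1 and 2, branches x3 and x4 lead to OTUs 3 and 4, and x5 is the internal branch joining the two pairs.]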
Slide 17
Least-square method
1
x3
x1
3
x5
x2
2
x4
4
d’12 = x1 + x2
(d12 - d’12)2= [d12 – (x1 + x2)]2
d’13 = x1 + x5+ x3
(d13 - d’13)2 = [d13 – (x1 + x5+ x3)]2
d’14 = x1 + x5 + x4
(d14 - d’14)2 = [d14 – (x1 + x5 + x4)]2
d’23 = x2 + x5 + x3
(d23 - d’23)2 = [d23 – (x2 + x5 + x3)]2
d’24 = x2 + x5 + x4
(d24 - d’24)2 = [d24 – (x2 + x5 + x4)]2
d’34 = x3 + x4
(d34 - d’34)2 = [d34 – (x3 + x4)]2
n
SS 

i j
Xuhua Xia
( d ij  d ij )
'
2
Least-squares method: Find xi
values that minimize SS
Slide 18
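The sum of squares defined on this slide is easy to evaluate for any candidate set of branch lengths. The sketch below is illustrative only: `PATHS` and `sum_of_squares` are made-up names, the "rough guess" values are arbitrary, and the second set of branch lengths is the least-squares solution reported later in the lecture for the example matrix.

```python
# Branches on the path between each pair of OTUs for the tree ((1,2),(3,4)).
PATHS = {
    (1, 2): ("x1", "x2"),
    (1, 3): ("x1", "x5", "x3"),
    (1, 4): ("x1", "x5", "x4"),
    (2, 3): ("x2", "x5", "x3"),
    (2, 4): ("x2", "x5", "x4"),
    (3, 4): ("x3", "x4"),
}

def sum_of_squares(d, x):
    """SS = sum over pairs of (observed distance - path length on the tree)^2."""
    return sum((d[pair] - sum(x[b] for b in path)) ** 2
               for pair, path in PATHS.items())

d = {(1, 2): 0.3, (1, 3): 0.4, (1, 4): 0.4,
     (2, 3): 0.5, (2, 4): 0.6, (3, 4): 0.6}

rough_guess = {"x1": 0.1, "x2": 0.2, "x3": 0.3, "x4": 0.3, "x5": 0.0}
ls_solution = {"x1": 0.075, "x2": 0.225, "x3": 0.275, "x4": 0.325, "x5": 0.025}

ss_guess = sum_of_squares(d, rough_guess)
ss_ls = sum_of_squares(d, ls_solution)
print(ss_guess, ss_ls)  # the least-squares solution gives the smaller SS
```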
Least-squares method
SS = [d12 – (x1 + x2)]^2 + [d13 – (x1 + x5 + x3)]^2 + [d14 – (x1 + x5 + x4)]^2
+ [d23 – (x2 + x5 + x3)]^2 + [d24 – (x2 + x5 + x4)]^2 + [d34 – (x3 + x4)]^2
Taking the partial derivative of SS with respect to each xi, we have
∂SS/∂x1 = -2 d12 + 6 x1 + 2 x2 - 2 d13 + 4 x5 + 2 x3 - 2 d14 + 2 x4
∂SS/∂x2 = -2 d12 + 2 x1 + 6 x2 - 2 d23 + 4 x5 + 2 x3 - 2 d24 + 2 x4
∂SS/∂x3 = -2 d13 + 2 x1 + 4 x5 + 6 x3 - 2 d23 + 2 x2 - 2 d34 + 2 x4
∂SS/∂x4 = -2 d14 + 2 x1 + 4 x5 + 6 x4 - 2 d24 + 2 x2 - 2 d34 + 2 x3
∂SS/∂x5 = -2 d13 + 4 x1 + 8 x5 + 4 x3 - 2 d14 + 4 x4 - 2 d23 + 4 x2 - 2 d24
Setting these partial derivatives to 0 and solving for xi, we have
x1 = d13/4 + d12/2 - d23/4 + d14/4 - d24/4
x2 = d12/2 - d13/4 + d23/4 - d14/4 + d24/4
x3 = d13/4 + d23/4 + d34/2 - d14/4 - d24/4
x4 = d14/4 - d13/4 - d23/4 + d34/2 + d24/4
x5 = - d12/2 + d23/4 - d34/2 + d14/4 + d24/4 + d13/4
Slide 19
Least-squares method
x1 = d13/4 + d12/2 - d23/4 + d14/4 - d24/4
x2 = d12/2 - d13/4 + d23/4 - d14/4 + d24/4
x3 = d13/4 + d23/4 + d34/2 - d14/4 - d24/4
x4 = d14/4 - d13/4 - d23/4 + d34/2 + d24/4
x5 = - d12/2 + d23/4 - d34/2 + d14/4 + d24/4 + d13/4
Example distance matrix:
      Sp1   Sp2   Sp3
Sp2   0.3
Sp3   0.4   0.5
Sp4   0.4   0.6   0.6
Resulting branch lengths:
x1 = 0.075
x2 = 0.225
x3 = 0.275
x4 = 0.325
x5 = 0.025
[Figure: unrooted tree ((1,2),(3,4)) with terminal branches x1, x2, x3, x4 and internal branch x5.]
Slide 20
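The branch lengths above can also be obtained numerically instead of through the closed-form expressions. The sketch below builds a path matrix `A` for the tree ((1,2),(3,4)) and solves the normal equations (AᵀA)x = Aᵀd, which is equivalent to setting all partial derivatives of SS to zero. It uses only the standard library; `solve_linear` is a small hypothetical Gauss-Jordan helper written for this example, not a library routine.

```python
# Rows: pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4); columns: x1..x5.
# An entry is 1 when that branch lies on the path between the pair.
A = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
d = [0.3, 0.4, 0.4, 0.5, 0.6, 0.6]

def solve_linear(m, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(m)
    aug = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

k = len(A[0])
ata = [[sum(row[i] * row[j] for row in A) for j in range(k)] for i in range(k)]
atd = [sum(row[i] * di for row, di in zip(A, d)) for i in range(k)]

x = solve_linear(ata, atd)
print([round(v, 3) for v in x])
```

The same construction works for larger trees: only the path matrix changes, which is why the matrix form is usually preferred over deriving closed-form solutions by hand.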
Minimum Evolution Criterion
$$TreeLen = \sum_{i=1}^{2n-3} x_i$$
where n = number of OTUs
[Figure: the three possible unrooted topologies for four OTUs, each drawn with terminal branches x1, x2, x3, x4 and internal branch x5.]
The minimum evolution (ME) criterion: The tree with the shortest TreeLen is the best tree.
Slide 21
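As an illustration of the ME criterion, the sketch below evaluates TreeLen for each of the three four-OTU topologies by applying the lecture's closed-form least-squares solutions with the OTU labels permuted. The function name `ls_branch_lengths` and the way topologies are encoded as pairs of pairs are assumptions made for this example.

```python
d = {(1, 2): 0.3, (1, 3): 0.4, (1, 4): 0.4,
     (2, 3): 0.5, (2, 4): 0.6, (3, 4): 0.6}

def dist(i, j):
    """Look up a distance regardless of the order of the two OTUs."""
    return d[(min(i, j), max(i, j))]

def ls_branch_lengths(topology):
    """Least-squares x1..x5 for the topology ((a,b),(c,e))."""
    (a, b), (c, e) = topology
    dab, dce = dist(a, b), dist(c, e)
    dac, dae, dbc, dbe = dist(a, c), dist(a, e), dist(b, c), dist(b, e)
    x1 = dab / 2 + (dac + dae - dbc - dbe) / 4
    x2 = dab / 2 + (dbc + dbe - dac - dae) / 4
    x3 = dce / 2 + (dac + dbc - dae - dbe) / 4
    x4 = dce / 2 + (dae + dbe - dac - dbc) / 4
    x5 = (dac + dae + dbc + dbe) / 4 - dab / 2 - dce / 2
    return [x1, x2, x3, x4, x5]

topologies = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
tree_lengths = {t: sum(ls_branch_lengths(t)) for t in topologies}
for t, length in tree_lengths.items():
    print(t, round(length, 3))
```

For this particular example matrix the topologies ((1,2),(3,4)) and ((1,4),(2,3)) both have a tree length of 0.925, while ((1,3),(2,4)) is longer at 0.95, so the ME criterion excludes the latter but does not separate the first two.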