Inferring phylogenetic trees

Inferring phylogenetic trees:
Maximum likelihood methods
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
thabangh@gmail.com
One-minute responses
• First part of class was fine.
• I am struggling with Python.
• At first it was difficult to complete the program when I get the first half,
but it is getting easier now.
• The class lecture is always fine, but the Python problems are getting
tougher. However, they are really interesting and quite informative.
• We are learning a lot about programming.
• The class is more interesting every day. I enjoy the Python, especially
because I am able to fill in by myself.
• Thank you for helping us with sys.stdout.write. It will be very useful for
future work in Python.
Outline
• Parsimony
• Distance methods
– Computing distances
– Finding the tree
• Maximum likelihood
Revision
Multiple sequence alignment → Pairwise distance matrix → Phylogenetic tree
Revision
• Ideally, distances in a phylogenetic tree would
represent time. In practice, however, what do
the distance estimates represent?
– The expected number of changes per position.
• What is a “back mutation”?
– A pair of mutations that reverse one another (e.g.,
A → C → A)
Revision
• Compute the Jukes-Cantor distance between
the first yeast and mouse sequences shown
below.
\[ K_{AB} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} d_{AB}\right) \]
              XX   X  X    X
dha2_yeast 93 LRYTRHEPVGVCGEIIPWNI
dhac_mouse 93 FTYTRREPIGVCGQIIPWNI
dha5_yeast 92 FAYTLKVPFGVVAQIVPWNI
dhal_ecoli 92 LAMIVREPVGVIAAIVPWNI
• dha2_yeast and dhac_mouse differ at 5 of the 20 positions (marked X), so d_AB = 5/20.
\[ K_{AB} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \cdot \frac{5}{20}\right) = -\frac{3}{4} \ln\left(\frac{2}{3}\right) = 0.304 \]
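The calculation above can be checked with a short Python sketch. The function names `p_distance` and `jukes_cantor` are illustrative choices, not part of the lecture; the formula is the one on the slide.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)

def jukes_cantor(d):
    """Jukes-Cantor correction from the slide: K = -(3/4) ln(1 - (4/3) d)."""
    if d >= 0.75:
        raise ValueError("the correction is undefined for d >= 3/4")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d)

yeast = "LRYTRHEPVGVCGEIIPWNI"  # dha2_yeast
mouse = "FTYTRREPIGVCGQIIPWNI"  # dhac_mouse
d = p_distance(yeast, mouse)
print(f"d = {d:.2f}, K = {jukes_cantor(d):.3f}")  # d = 0.25, K = 0.304
```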
Perform the next merger
            Spar  Smik-Sbay  Skud-Scer   Scas   Sklu
Spar           0       31.5       30.5    300    229
Smik-Sbay   31.5          0      34.25    294    223
Skud-Scer   30.5      34.25          0  319.5    248
Scas         300        294      319.5      0     95
Sklu         229        223        248     95      0
[Figure: partial tree with leaves Smik, Sbay, Skud, Scer]
Perform the next merger
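               Skud-ScerSpar  Smik-Sbay  Skud-ScerSpar    Scas   Sklu
Skud-ScerSpar              0     32.875              0  309.75  238.5
Smik-Sbay             32.875          0         32.875     294    223
Skud-ScerSpar              0     32.875              0  309.75  238.5
Scas                  309.75        294         309.75       0     95
Sklu                   238.5        223          238.5      95      0
• The merged cluster Skud-ScerSpar replaces both the Spar and the Skud-Scer rows and columns.
[Figure: partial tree with leaves Smik, Sbay, Skud, Scer]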
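The merger step can be sketched in Python as follows. This is a minimal sketch, assuming the new distances are the simple average of the two merged rows, which is what the values on the slide suggest (for example, (31.5 + 34.25) / 2 = 32.875). The function name `merge_closest` and the dict-of-dicts layout are illustrative assumptions.

```python
def merge_closest(dist):
    """Merge the closest pair of clusters in a symmetric dict-of-dicts matrix.

    Returns (new_matrix, merged_name). Distances to the merged cluster are
    the simple average of the two merged rows (assumption, see lead-in).
    """
    names = list(dist)
    # Find the pair of distinct clusters with the smallest distance.
    a, b = min(
        ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    merged = f"{a}-{b}"
    others = [n for n in names if n not in (a, b)]
    # Copy the untouched part of the matrix, then add the merged cluster.
    new = {n: {m: dist[n][m] for m in others} for n in others}
    new[merged] = {merged: 0.0}
    for n in others:
        average = (dist[a][n] + dist[b][n]) / 2.0
        new[merged][n] = average
        new[n][merged] = average
    return new, merged

names = ["Spar", "Smik-Sbay", "Skud-Scer", "Scas", "Sklu"]
rows = [
    [0, 31.5, 30.5, 300, 229],
    [31.5, 0, 34.25, 294, 223],
    [30.5, 34.25, 0, 319.5, 248],
    [300, 294, 319.5, 0, 95],
    [229, 223, 248, 95, 0],
]
dist = {name: dict(zip(names, row)) for name, row in zip(names, rows)}
new_dist, merged = merge_closest(dist)
print(merged, new_dist[merged])
```

Applied to the matrix above, the closest pair is Spar and Skud-Scer (30.5), and the averaged distances are 32.875, 309.75 and 238.5.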
Extend the corresponding tree
               Smik-Sbay  Skud-ScerSpar    Scas   Sklu
Smik-Sbay              0         32.875     294    223
Skud-ScerSpar     32.875              0  309.75  238.5
Scas                 294         309.75       0     95
Sklu                 223          238.5      95      0
[Figure: tree with leaves Sklu, Scas, Smik, Sbay, Spar, Skud, Scer]
Maximum parsimony
for each possible tree
    for each column of the alignment
        compute the parsimony score of the column, given the tree
return the tree with the best parsimony score
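The loop above can be sketched in Python. The slide does not say how the score of a column is computed; this sketch assumes Fitch's algorithm, a standard way to count the minimum number of substitutions on a fixed tree. Trees are nested two-item lists whose leaves are row indices into the alignment (an illustrative convention), and the alignment is the four-sequence example used later in this lecture.

```python
def fitch(node, column):
    """Return (possible_states, substitutions) for the subtree at `node`.

    `node` is a leaf (an int row index into `column`) or a two-item list.
    """
    if isinstance(node, int):
        return {column[node]}, 0
    left_states, left_cost = fitch(node[0], column)
    right_states, right_cost = fitch(node[1], column)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra substitution.
        return common, left_cost + right_cost
    # The children disagree: one substitution is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total number of substitutions over all columns of the alignment."""
    return sum(fitch(tree, column)[1] for column in zip(*alignment))

alignment = ["ACGCGTTGGG", "ACGCGTTGGG", "ACGCAATGAA", "ACACAGGGAA"]
# The three ways to pair up four sequences.
candidate_trees = [[[0, 1], [2, 3]], [[0, 2], [1, 3]], [[0, 3], [1, 2]]]
best = min(candidate_trees, key=lambda tree: parsimony_score(tree, alignment))
print(best, parsimony_score(best, alignment))  # [[0, 1], [2, 3]] 7
```

Here "best" means the lowest score, so the tree that pairs the two identical sequences wins.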
Maximum likelihood
for each possible tree
    for each column of the alignment
        compute the likelihood of the column, given the tree
return the tree with the highest likelihood
• Similar to parsimony, but capable of using a model of
evolution.
• Computationally expensive.
• DNAML is the Phylip program for maximum likelihood.
FastDNAML is a fast clone.
http://evolution.genetics.washington.edu/phylip.html
http://iubio.bio.indiana.edu/soft/molbio/evolve/fastdnaml/fastDNAml.html
Problem #1
ACGCGTTGGG
ACGCGTTGGG
ACGCAATGAA
ACACAGGGAA
[Figure: the sixth alignment column (T, T, A, G) placed on the leaves of a tree gives Pr(column | tree, model).]
• What is the probability of observing this
column, given this tree and an assumed model
of evolution?
Solution #1
[Figure: three copies of the tree with leaves T, T, A, G, each with a different assignment of bases to the internal nodes.]
• Solution: Enumerate all possible assignments to the
internal nodes. Compute the probability of each
tree, and sum.
Problem #2
ACGCGTTGGG
ACGCGTTGGG
ACGCAATGAA
ACACAGGGAA
[Figure: the column T, T, A, G on the leaves of a tree whose internal nodes are assigned A, T and A gives Pr(column | tree, model).]
• What is the probability of observing this
column, given this assigned tree and an
assumed model of evolution?
Solution #2
• The probability of observing a substitution from A to T on a branch of length m is given by the evolutionary model.
• The probability of the ancestral observation being A is just πA.
[Figure: assigned tree with leaves T, T, A, G, base frequencies πA, πC, πG, πT at the root, and a branch of length m from A to T.]
Solution #2
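[Figure: assigned tree with root A (term L0, base frequencies πA, πC, πG, πT), internal nodes T and A, leaves T, T, A, G, and branch terms L1 to L6.]
• The desired probability is the product of the
probabilities of the branches.
• L(tree) = L0 × L1 × L2 × L3 × L4 × L5 × L6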
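The product above can be sketched in Python. The slides leave the evolutionary model open; this sketch assumes the Jukes-Cantor model with uniform base frequencies and one branch length `m` (in expected changes per position) shared by all branches. The tree shape ((T, T), (A, G)) with root A and internal nodes T and A is read off the figure and is also an assumption, as are the function names.

```python
import math

PI = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed base frequencies

def jc_prob(parent, child, m):
    """Jukes-Cantor probability of `child` below `parent` on a branch of length m."""
    decay = math.exp(-4.0 * m / 3.0)
    if parent == child:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

def assigned_tree_likelihood(root, left, right, leaves, m):
    """Likelihood of the tree ((l1, l2), (l3, l4)) with every node labelled."""
    l1, l2, l3, l4 = leaves
    terms = [
        PI[root],                 # L0: probability of the ancestral base
        jc_prob(root, left, m),   # root to left internal node
        jc_prob(root, right, m),  # root to right internal node
        jc_prob(left, l1, m),     # the four branches leading to the leaves
        jc_prob(left, l2, m),
        jc_prob(right, l3, m),
        jc_prob(right, l4, m),
    ]
    return math.prod(terms)

print(assigned_tree_likelihood("A", "T", "A", "TTAG", m=0.1))
```

With these assumptions the assigned tree has two substitutions (A to T and A to G) and four unchanged branches, so the result is πA times two "different" terms times four "same" terms.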
Computing the likelihood
[Figure: tree1, tree2 and tree3, the tree with leaves T, T, A, G under three different assignments of the internal nodes.]
• The probability of the tree is the sum of the
probabilities of the individual trees.
• L(tree) = L(tree1) + L(tree2) + L(tree3) + …
Maximum likelihood revisited
for each possible tree
    for each column of the alignment
        for each assignment of internal nodes
            for each branch
                compute the probability of that branch
            assigned tree probability ← multiply branch probabilities
        column probability ← sum assigned tree probabilities
    tree probability ← multiply column probabilities
return the tree with the highest probability
Maximum likelihood revisited
for each possible tree
    for each column of the alignment
        for each assignment of internal nodes
            for each branch
                compute the probability of that branch
            assigned tree probability ← multiply branch probabilities
        column probability ← sum assigned tree probabilities
    tree probability ← multiply column probabilities
return the tree with the highest probability
• Multiply probabilities of independent events.
• Add probabilities of mutually exclusive events.
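The full procedure can be sketched in Python for the four-sequence example. As before, the Jukes-Cantor model, uniform base frequencies and a single shared branch length `m` are assumptions made only for illustration; the slides do not fix a model. A tree is described by the order in which the four alignment rows are placed on the leaves of the shape ((l1, l2), (l3, l4)).

```python
import itertools
import math

BASES = "ACGT"

def jc_prob(parent, child, m):
    """Jukes-Cantor probability of `child` below `parent` on a branch of length m."""
    decay = math.exp(-4.0 * m / 3.0)
    return 0.25 + 0.75 * decay if parent == child else 0.25 - 0.25 * decay

def column_likelihood(leaves, m):
    """Sum the assigned-tree probabilities over all internal-node assignments."""
    l1, l2, l3, l4 = leaves
    total = 0.0
    for root, left, right in itertools.product(BASES, repeat=3):
        # Multiply: root frequency (assumed 1/4) times the six branch terms.
        total += (0.25
                  * jc_prob(root, left, m) * jc_prob(root, right, m)
                  * jc_prob(left, l1, m) * jc_prob(left, l2, m)
                  * jc_prob(right, l3, m) * jc_prob(right, l4, m))
    return total

def tree_likelihood(alignment, order, m):
    """Multiply the column probabilities; `order` maps rows to leaves l1..l4."""
    return math.prod(
        column_likelihood([column[i] for i in order], m)
        for column in zip(*alignment)
    )

alignment = ["ACGCGTTGGG", "ACGCGTTGGG", "ACGCAATGAA", "ACACAGGGAA"]
candidate_orders = [(0, 1, 2, 3), (0, 2, 1, 3), (0, 3, 1, 2)]
best = max(candidate_orders, key=lambda order: tree_likelihood(alignment, order, 0.1))
print(best, tree_likelihood(alignment, best, 0.1))
```

For long alignments the product of column probabilities underflows, so practical programs usually sum log probabilities instead; the plain product is kept here to mirror the pseudocode.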
Overview
• Parsimony
• Distance methods
– Computing distances
– Finding the tree
• Fitch-Margoliash
• Neighbor-joining
• UPGMA
• Maximum likelihood
Representing trees
• ((mouse, rat), (human, chimp))
[Figure: tree with leaves mouse, rat, human, chimp]
myTree = [["mouse", "rat"], ["human", "chimp"]]
Problem #1
• Write a program to read a parenthesized tree
from a file and count the number of nodes.
> cat mytree.txt
(yeast, ((fly, spider), (dog, cat)))
> python read-tree.py mytree.txt
Read 5 species from mytree.txt.
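One possible approach to this problem is sketched below. It is not a reference solution: the helper names `parse_tree` and `count_leaves` are illustrative, the parser assumes well-formed input with names that contain no spaces, and it counts leaves because the sample output reports the number of species.

```python
import sys

def parse_tree(text):
    """Parse '(a, (b, c))' into nested lists of leaf-name strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").replace(",", " ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a leaf name
        children = []
        while tokens[pos] != ")":
            children.append(parse_node())
        pos += 1  # skip the closing parenthesis
        return children

    return parse_node()

def count_leaves(tree):
    """Count the species (leaves) in a nested-list tree."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)

if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        tree = parse_tree(handle.read())
    print(f"Read {count_leaves(tree)} species from {sys.argv[1]}.")
```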
Problem #2
• Modify the previous program to print the leaves
of the tree, indenting according to the depth.
> print-tree.py mytree.txt
 yeast
   fly
   spider
   dog
   cat
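A recursive traversal is one way to approach this. In the sketch below the tree is already a nested list, and the indentation unit (one space per level of depth, with the root at depth 0) is an assumption, since the slide does not specify it. The function name `leaf_lines` is illustrative.

```python
def leaf_lines(tree, depth=0, indent=" "):
    """Return one line per leaf, indented once for each level of depth."""
    if isinstance(tree, str):
        return [indent * depth + tree]
    lines = []
    for child in tree:
        # Each nested list is one level deeper than its parent.
        lines.extend(leaf_lines(child, depth + 1, indent))
    return lines

tree = ["yeast", [["fly", "spider"], ["dog", "cat"]]]
print("\n".join(leaf_lines(tree)))
```

Here yeast sits at depth 1 and the other four species at depth 3.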
Problem #3
• Given: a three-column file in which the first two columns
contain names of species and the third column contains the
distance between them.
• Print to standard output a formatted matrix in which the
species names are listed in the rows and columns, and
values are from the input file.
– Species should be listed in alphabetical order.
– The program should halt and complain if a value is missing.
– The matrix is assumed to be symmetric, and each pair appears
only once.
– Distances of zero along the diagonal are not included in the
input.
– Columns should be printed in the same width as the
corresponding species name.
> ./print-distance-matrix.py distances.txt
Read 30 values and 6 species from distances.txt.
Maximum species name width = 9.
           ape  cat  dog gerbil mouse zebrafish
ape          0 0.19 0.15   0.44  0.17      0.69
cat       0.19    0  0.1   0.48  0.24      0.77
dog       0.15  0.1    0   0.43  0.25      0.78
gerbil    0.44 0.48 0.43      0  0.42      0.78
mouse     0.17 0.24 0.25   0.42     0      0.85
zebrafish 0.69 0.77 0.78   0.78  0.85         0
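The core of this problem can be sketched as follows. The function names are illustrative, and one detail goes beyond the stated rules: a column is widened when a value is longer than the species name, so that the columns stay aligned. The three sample lines use values from the matrix above.

```python
def read_distances(lines):
    """Parse 'speciesA speciesB distance' lines into a symmetric lookup."""
    dist = {}
    species = set()
    n_values = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        a, b, value = fields[0], fields[1], float(fields[2])
        dist[(a, b)] = dist[(b, a)] = value  # the matrix is symmetric
        species.update((a, b))
        n_values += 1
    return dist, sorted(species), n_values

def format_matrix(dist, species):
    """Return the matrix as text; raise ValueError if a pair is missing."""
    def cell(a, b):
        if a == b:
            return "0"  # the diagonal is implied, not read from the input
        if (a, b) not in dist:
            raise ValueError(f"missing distance between {a} and {b}")
        return f"{dist[(a, b)]:g}"

    label_width = max(len(name) for name in species)
    # Column width: the species name, widened if a value is longer (assumption).
    widths = [max([len(b)] + [len(cell(a, b)) for a in species]) for b in species]
    header = " " * label_width + "".join(f" {b:>{w}}" for b, w in zip(species, widths))
    rows = [header]
    for a in species:
        cells = "".join(f" {cell(a, b):>{w}}" for b, w in zip(species, widths))
        rows.append(f"{a:<{label_width}}{cells}")
    return "\n".join(rows)

sample = ["cat dog 0.1", "ape cat 0.19", "ape dog 0.15"]
dist, species, n_values = read_distances(sample)
print(f"Read {n_values} values and {len(species)} species.")
print(format_matrix(dist, species))
```

A complete program would read the lines from the file named on the command line and exit with the error message when `format_matrix` raises.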