COMP 571: Homework #4 Spring 2016

Assigned on March 31, 2016.
Due in class on April 12, 2016.
The Honor Code applies to all homework sets. Sign the pledge on your solutions.
1. The infinite sites model states that each site in a set of sequences mutates at most
once during the evolutionary history of the set of sequences from their most recent
common ancestor. The infinite alleles model states that whenever a site mutates,
it changes to a new state not seen anywhere else (for that site) in the evolutionary
tree. Are there any differences between the infinite sites and infinite alleles models?
Given a tree leaf-labeled by a set of sequences, propose a method for testing whether
the sequences evolved under the infinite alleles model, assuming you can use Fitch’s
algorithm as a “black box”.
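For reference, the "black box" mentioned above can be pictured as follows. This is a minimal sketch of Fitch's parsimony count for a single site on a rooted bifurcating tree. The nested-tuple tree encoding, the function name `fitch`, and the example leaf states are assumptions made for illustration, not part of the assignment.

```python
def fitch(tree, leaf_states):
    """Return (candidate state set, minimum number of changes) for one site.

    tree        : a leaf name (str) or a pair (left_subtree, right_subtree)
    leaf_states : dict mapping each leaf name to its observed state
    """
    if isinstance(tree, str):
        # A leaf contributes its observed state and no changes.
        return {leaf_states[tree]}, 0

    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)

    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children cannot agree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical example: four leaves, one site.
example_tree = (("A", "C"), ("B", "D"))
example_states = {"A": "G", "C": "G", "B": "C", "D": "C"}
state_set, changes = fitch(example_tree, example_states)
print(state_set, changes)
```

Running the function once per alignment column gives the per-site parsimony scores that a test built on top of the black box could inspect.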
2. Informative sites are defined as columns in an alignment that favor one tree topology
over another. Find the informative sites in the following alignment:
ATGTA
TAGTA
CGCTG
GCCTG
Which tree will you find by the maximum parsimony method for this alignment?
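The definition above can be checked mechanically with the usual parsimony criterion: a column is informative when at least two different states each occur in at least two sequences. The sketch below assumes that criterion. The function name `informative_sites` and the 1-based site numbering are choices made here for illustration.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 1-based indices of parsimony-informative columns."""
    sites = []
    for index, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        # Count the states that appear in two or more sequences.
        repeated_states = sum(1 for n in counts.values() if n >= 2)
        if repeated_states >= 2:
            sites.append(index)
    return sites


alignment = ["ATGTA", "TAGTA", "CGCTG", "GCCTG"]
print(informative_sites(alignment))
```

Columns in which every sequence carries the same state, or in which every state occurs only once, do not discriminate between topologies and are skipped.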
3. Consider the following alignment:
N I P E L M K T A N A D N G R E I A K K
N I Q E L M K T A N A N H G R E I A K K
P I E K L L Q T A S V E K G A A A A K K
P I A K L L A S A D A A K G E A V F K K
(a) Construct a UPGMA tree from the Poisson corrected distances (that is,
$d_{ij} = -\ln(1 - p_{ij})$).
(b) Show that the distances defined by the UPGMA tree do not always coincide
with the initial distances (the ones used to build the tree).
(c) Construct an unrooted NJ tree from the Poisson corrected distances.
(d) Show that the molecular clock property fails for any rooted tree derived from
the tree you constructed in (c) by adding a root.
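The Poisson correction used in parts (a) and (c) is easy to compute by hand, but a short script helps when filling in a whole distance matrix. The sketch below assumes that p_ij is the fraction of aligned positions at which sequences i and j differ. The function names and the two short example sequences are hypothetical.

```python
import math


def p_distance(a, b):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def poisson_distance(a, b):
    """Poisson corrected distance d = -ln(1 - p).

    math.log raises ValueError when p equals 1 (no identical positions).
    """
    return -math.log(1.0 - p_distance(a, b))


def distance_matrix(sequences):
    """Symmetric matrix of Poisson corrected distances with a zero diagonal."""
    n = len(sequences)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = poisson_distance(sequences[i], sequences[j])
    return d


# Hypothetical pair differing at 1 of 10 positions, so p = 0.1.
print(poisson_distance("NIPELMKTAN", "NIQELMKTAN"))
```

The resulting matrix can be fed to either clustering procedure, since both UPGMA and NJ start from the same pairwise distances.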
4. Only two types of nucleotides, C and G, are present in sequences $x_1$ and $x_2$ of equal
length. Given the ungapped alignment of $x_1$ and $x_2$, calculate the likelihood of a
tree relating these sequences, assuming the Jukes-Cantor model of evolution. Show
that the maximum likelihood edge lengths, $t_1$ and $t_2$, satisfy the following equation:

$$t_1 + t_2 = \frac{1}{4\alpha} \ln \frac{3(n_1 + n_2)}{3n_1 - n_2},$$

where $n_1$ is the number of alignment sites with identical residues, and $n_2$ is the
number of sites with mismatches.
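A numerical sanity check of the stated result can be useful before attempting the derivation. The sketch below assumes the standard Jukes-Cantor transition probabilities with rate α, namely (1 + 3e^(−4αt))/4 for an unchanged residue and (1 − e^(−4αt))/4 for each specific change, together with uniform root frequencies of 1/4. The function names and the values of n1, n2, and α are hypothetical.

```python
import math


def jc_log_likelihood(t, n1, n2, alpha):
    """Log-likelihood as a function of the total path length t = t1 + t2.

    Requires t > 0 when n2 > 0, because the mismatch probability is 0 at t = 0.
    """
    e = math.exp(-4.0 * alpha * t)
    p_same = 0.25 * (1.0 + 3.0 * e)  # residue unchanged along the path
    p_diff = 0.25 * (1.0 - e)        # one specific change along the path
    # Every site also carries the root frequency 1/4.
    return n1 * math.log(0.25 * p_same) + n2 * math.log(0.25 * p_diff)


def ml_total_length(n1, n2, alpha):
    """Closed form from the problem statement for t1 + t2."""
    if 3 * n1 <= n2:
        raise ValueError("the closed form needs 3*n1 > n2")
    return math.log(3.0 * (n1 + n2) / (3.0 * n1 - n2)) / (4.0 * alpha)


# Hypothetical counts: 90 matching sites, 10 mismatching sites.
n1, n2, alpha = 90, 10, 1.0
t_hat = ml_total_length(n1, n2, alpha)
for t in (0.9 * t_hat, t_hat, 1.1 * t_hat):
    print(t, jc_log_likelihood(t, n1, n2, alpha))
```

Because the model is reversible, the likelihood depends on t1 and t2 only through their sum, which is why the check varies a single parameter.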
5. In class we described Fitch’s algorithm for bifurcating trees and for equal costs for
substitutions (one unit for change from $x$ to $y$ when $x \neq y$, and 0 when $x = y$).
Describe the modified algorithm so that it applies for trees in which nodes have
arbitrary numbers of children, and for a given cost matrix $W$, where $W(\sigma, \sigma')$ is
the cost of a change from state $\sigma$ to state $\sigma'$.
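One possible direction, offered only as a sketch and not as the expected answer, is a Sankoff-style dynamic program: each node stores, for every state, the minimum cost of its subtree given that the node takes that state. The nested-tuple tree encoding, the function names, and the unit-cost example are assumptions made for illustration.

```python
import math


def sankoff(tree, leaf_states, alphabet, W):
    """Return {state: minimum subtree cost if this node has that state}.

    tree        : a leaf name (str) or a tuple of any number of subtrees
    leaf_states : dict mapping each leaf name to its observed state
    W           : dict of dicts, W[s][t] = cost of a change from s to t
    """
    if isinstance(tree, str):
        # A leaf is free in its observed state and forbidden in all others.
        return {s: (0.0 if s == leaf_states[tree] else math.inf) for s in alphabet}

    tables = [sankoff(child, leaf_states, alphabet, W) for child in tree]
    # For each parent state, every child independently picks its cheapest state.
    return {
        s: sum(min(W[s][t] + table[t] for t in alphabet) for table in tables)
        for s in alphabet
    }


def parsimony_cost(tree, leaf_states, alphabet, W):
    """Minimum total cost over all choices of root state."""
    return min(sankoff(tree, leaf_states, alphabet, W).values())


alphabet = "ACGT"
# Hypothetical cost matrix: unit cost for every change, zero otherwise.
unit = {a: {b: (0 if a == b else 1) for b in alphabet} for a in alphabet}
tree = ("A", "B", ("C", "D", "E"))  # both internal nodes have three children
states = {"A": "G", "B": "G", "C": "T", "D": "T", "E": "G"}
print(parsimony_cost(tree, states, alphabet, unit))
```

With unit costs on a bifurcating tree the minimum returned here equals the count produced by Fitch's algorithm, which gives a convenient way to test an implementation.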
6. Consider tree T = ((A : 0.8, C : 0.1) : 0.05, (B : 0.8, D : 0.1) : 0.05), and base
frequencies f(A) = 0.1, f(C) = 0.2, f(G) = 0.3, and f(T) = 0.4. Further, let the
substitution rate matrix be the following:

      A   C   G   T
  A   -   5   6   2
  C   5   -   3   8
  G   6   3   -   1
  T   2   8   1   -

(a) Using the Seq-gen tool under the REV model and the above settings, and for
sequence lengths 100, 250, 500, 1000, 5000, and 10000, generate 20 sequence
datasets for each of the sequence lengths.
(b) For each dataset $S_i$, construct the MP tree $T^i_{MP}$ and the ML tree $T^i_{ML}$ (using
publicly available programs; e.g., Phylip), and compare them to tree $T$ using
the Robinson-Foulds (RF) measure (the RF distance between two trees is the
number of bipartitions that appear in one, but not both, of the trees, divided by 2).
(c) Plot a graph with two curves that show the RF value of the two methods as a
function of the sequence length (for each sequence length, plot the average RF
of the 20 runs of both methods). Explain the behavior of the two methods.
In this problem, submit the plot from Part (c), as well as your explanation of the
behavior and potential causes of the behavior (feel free to search the literature after
you generate the plot, but clearly cite any sources you use).
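The Robinson-Foulds measure defined in part (b) can be computed directly from the non-trivial bipartitions of the two trees. The sketch below assumes that trees are written as nested tuples of leaf names and that both trees share the same leaf set. The function names are hypothetical, and parsing the Newick output of the inference programs is left out.

```python
def leaf_set(tree):
    """All leaf names below (and including) this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the leaf set induced by internal edges."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) > 1 and len(other) > 1:
            # Store the split as an unordered pair so rooting does not matter.
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


def rf_distance(tree1, tree2):
    """Bipartitions in exactly one of the two trees, divided by 2."""
    return len(bipartitions(tree1) ^ bipartitions(tree2)) / 2


true_tree = (("A", "C"), ("B", "D"))
estimate = (("A", "B"), ("C", "D"))
print(rf_distance(true_tree, estimate))
```

For four leaves there is a single non-trivial bipartition per tree, so under this definition each comparison with T yields either 0 or 1, and the averages over 20 runs lie between those values.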