COMP 571: Homework #4
Spring 2016

Assigned on March 31, 2016. Due in class on April 12, 2016.

The Honor Code applies to all homework sets. Sign the pledge on your solutions.

1. The infinite sites model states that each site in a set of sequences mutates at most once during the evolutionary history of the set of sequences from their most recent common ancestor. The infinite alleles model states that whenever a site mutates, it changes to a new state not seen anywhere else (for that site) in the evolutionary tree. Are there any differences between the infinite sites and infinite alleles models? Given a tree leaf-labeled by a set of sequences, propose a method for testing whether the sequences evolved under the infinite alleles model, assuming you can use Fitch’s algorithm as a “black box”.

2. Informative sites are defined as columns in an alignment that favor one tree topology over another. Find the informative sites in the following alignment:

ATGTA
TAGTA
CGCTG
GCCTG

Which tree will you find by the maximum parsimony method for this alignment?

3. Consider the following alignment:

N N P P I P E L M I Q E L M I E K L L I A K L L K T A K T A Q T A A S A N A N A S V D A D N G R E I A K K N H G R E I A K K E K G A A A A K K A K G E A V F K K

(a) Construct a UPGMA tree from the Poisson corrected distances (that is, $d_{ij} = -\ln(1 - p_{ij})$).
(b) Show that the distances defined by the UPGMA tree do not always coincide with the initial distances (the ones used to build the tree).
(c) Construct an unrooted NJ tree from the Poisson corrected distances.
(d) Show that the molecular clock property fails for any rooted tree derived from the tree you constructed in (c) by adding a root.

4. Only two types of nucleotides, C and G, are present in sequences $x_1$ and $x_2$ of equal length. Given the ungapped alignment of $x_1$ and $x_2$, calculate the likelihood of a tree relating these sequences, assuming the Jukes-Cantor model of evolution.
Show that the maximum likelihood edge lengths, $t_1$ and $t_2$, satisfy the following equation:

$$t_1 + t_2 = \frac{1}{4\alpha} \ln \frac{3(n_1 + n_2)}{3n_1 - n_2},$$

where $n_1$ is the number of alignment sites with identical residues, and $n_2$ is the number of sites with mismatches.

5. In class we described Fitch’s algorithm for bifurcating trees and for equal costs for substitutions (one unit for a change from $x$ to $y$ when $x \neq y$, and 0 when $x = y$). Describe the modified algorithm so that it applies to trees in which nodes have arbitrary numbers of children, and to a given cost matrix $W$, where $W(\sigma, \sigma')$ is the cost of a change from state $\sigma$ to state $\sigma'$.

6. Consider tree $T$ = ((A:0.8, C:0.1):0.05, (B:0.8, D:0.1):0.05), and base frequencies $f(A) = 0.1$, $f(C) = 0.2$, $f(G) = 0.3$, and $f(T) = 0.4$. Further, let the substitution rate matrix be the following:

      A   C   G   T
  A   -   5   6   2
  C   5   -   3   8
  G   6   3   -   1
  T   2   8   1   -

(a) Using the Seq-Gen tool under the REV model and the above settings, generate 20 sequence datasets for each of the sequence lengths 100, 250, 500, 1000, 5000, and 10000.
(b) For each dataset $S_i$, construct the MP tree $T^i_{MP}$ and the ML tree $T^i_{ML}$ (using publicly available programs; e.g., Phylip), and compare them to tree $T$ using the Robinson-Foulds (RF) measure (the RF distance between two trees is the number of bipartitions present in one, but not both, of the trees, divided by 2).
(c) Plot a graph with two curves that show the RF value of the two methods as a function of the sequence length (for each sequence length, plot the average RF of the 20 runs of both methods). Explain the behavior of the two methods.

In this problem, submit the plot from Part (c), as well as your explanation of the behavior and potential causes of the behavior (feel free to search the literature after you generate the plot, but clearly cite any sources you use).
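
As an illustration of the RF measure defined in Problem 6(b), the following is a minimal Python sketch, not a reference implementation. It assumes each tree is written as nested tuples of leaf names; the representation and the function names `bipartitions` and `rf_distance` are hypothetical choices made for this example only. Only nontrivial bipartitions (at least two leaves on each side) are counted, and the result follows the handout's convention of dividing by 2.

```python
def bipartitions(tree):
    """Return the set of nontrivial bipartitions of a tree given as nested tuples."""

    def leaf_set(node):
        # A leaf is a string; an internal node is a tuple of children.
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) > 1 and len(other) > 1:
            # Store the split as an unordered pair so rooting does not matter.
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


def rf_distance(tree1, tree2):
    """Number of bipartitions in exactly one of the two trees, divided by 2."""
    b1, b2 = bipartitions(tree1), bipartitions(tree2)
    return len(b1 ^ b2) / 2


# Topology of tree T from Problem 6 (branch lengths ignored).
true_tree = (("A", "C"), ("B", "D"))
# An alternative topology that groups A with B instead.
alt_tree = (("A", "B"), ("C", "D"))

print(rf_distance(true_tree, true_tree))
print(rf_distance(true_tree, alt_tree))
```

With four leaves each tree has a single nontrivial bipartition, so under this convention the distance between two four-leaf trees is either 0 (same topology) or 1 (different topology).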