STAT 423 Fall 2009
Homeworks, Projects and Tests

Important: Cooperation and discussion of homework problems are allowed and encouraged, except for the final computations and write-up, which have to be done individually.

Homework 1, due September 10, 2009, in class.
Problems 5.2–5.6 from Ewens and Grant. Bonus problem: Problem 5.1 from Ewens and Grant.

Homework 2, due September 22, 2009, in class.
1. Consider a continuous-time Markov chain with transition intensity matrix
\[ Q = \begin{pmatrix} -3 & 2 & 1 \\ 1 & -3 & 2 \\ 3 & 1 & -4 \end{pmatrix}. \]
Check if the chain is reversible. Begin by computing the stationary distribution.
2. Consider a discrete-time Markov chain with transition probability matrix
\[ P = \begin{pmatrix} 0 & 1 & 0 \\ 1/2 & 3/8 & 1/8 \\ 0 & 1 & 0 \end{pmatrix}. \]
If the chain is in state 1 (states are numbered 1, 2, and 3) at time 0, what is the probability that it will be in state 2 at time 2? Also, find the stationary distribution and check if the chain is reversible (the conditions are the same as those for continuous-time chains).

Homework 3, due October 1, 2009, in class.
Problems 5.9–5.11 from Ewens and Grant. Bonus problem: Problem 5.12 from Ewens and Grant.

Homework 4, due October 19, 2009, in class.
Problems 6.2, 6.4 and 6.6 from Ewens and Grant.

Homework 5, due November 3, 2009, in class.
Problems 12.1, 12.2 and 12.3 from Ewens and Grant. Bonus problem: Problem 12.5 from Ewens and Grant.

Homework 6, due the last day of classes.
1. Problems 15.4 and 15.5 from Ewens and Grant. Bonus problem: Problem 15.3 from Ewens and Grant.
2. BLAST: Use the method of difference equations to find θ*, u_h, w_h, m_h, and then the asymptotic tail distribution of the excursion height as well as the expected interladder distance, for a random walk generated by the alignment of two random DNA sequences, in which the GG match has score +1, all other matches have score 0, and all mismatches have score -1. Hint: Begin by finding the probability P[S = k], for k = -1, 0, 1.

Projects, due electronically by December 23, 2009.
Choose one of the two enclosed topics or try to come up with your own. The selection of a topic is subject to the instructor’s approval. Provide detailed reports (up to 10 pages).

Topic 1.
Consider a long stretch of DNA which is otherwise "random" but in which, from time to time, runs of tetra-nucleotide repeats of the form ATXT appear, where X = A or C.
1. Propose an HMM for such DNA.
2. Write the Viterbi and posterior decoding algorithms for this problem.
3. Using your favorite programming language, write code for the algorithms in item 2.
4. Simulate a long sequence of DNA with the properties described.
5. Apply your algorithms. Compare the outcomes of Viterbi and of posterior decoding.
6. Try to optimize your algorithms by trying different values of the coefficients.

Topic 2.
1. Retrieve, from a database, an amino acid sequence of a protein of your choice (you may use the National Library of Medicine Entrez website, http://www.ncbi.nlm.nih.gov/entrez/, and choose "Search Protein").
2. Run a BLAST search to identify evolutionary relatives of this sequence (i.e. homologous proteins present in different species of organisms, or related protein families). BLAST can be found at http://www.ncbi.nlm.nih.gov/BLAST/. Choose the right type of BLAST for your task (see Blast Overview, Blast Course and Blast Tutorial for guidance).
3. Select about 20-30 proteins providing reasonably close matches to your sequence (the word "reasonable" should be read as either "statistically significant" or "making biological sense", or both). Provide comments regarding the criteria you used to select these sequences as well as numerical parameters such as E-values, P-values and so forth. Comment on the score matrix (BLOSUMn, PAMn) that you used; use another matrix if you feel it is needed.
4. Find one of the Phylip web servers and carry out phylogenetic analyses of your set of protein sequences. Use UPGMA, Neighbor-Joining, and Maximum Parsimony. Check bootstrap support for the best tree you obtained.
Compare trees you obtained by different methods.
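
Note on the Markov chain problems in Homework 2: the three computations involved (a two-step transition probability, a stationary distribution, and a detailed-balance check for reversibility) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions. The function names and the two example matrices `P_REV` and `P_CYC` are hypothetical illustrations and are not the homework matrices, and the power iteration used here is just one possible way to approximate the stationary distribution of an ergodic discrete-time chain.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def stationary(p, iters=10_000, tol=1e-13):
    """Approximate pi with pi = pi P by power iteration.

    Assumes the chain is irreducible and aperiodic; otherwise the
    iteration need not converge.
    """
    n = len(p)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def is_reversible(p, pi, tol=1e-9):
    """Detailed balance: pi_i * p_ij == pi_j * p_ji for every pair i < j."""
    n = len(p)
    return all(abs(pi[i] * p[i][j] - pi[j] * p[j][i]) < tol
               for i in range(n) for j in range(i + 1, n))


# Hypothetical example 1: a birth-death chain (moves only between neighbors).
P_REV = [[0.50, 0.50, 0.00],
         [0.25, 0.50, 0.25],
         [0.00, 0.50, 0.50]]

# Hypothetical example 2: a chain that drifts around a cycle 1 -> 2 -> 3 -> 1.
P_CYC = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.5, 0.0, 0.5]]

if __name__ == "__main__":
    # Entry [0][1] of P squared: probability of going from state 1 to state 2 in two steps.
    print("two-step probability 1 -> 2:", mat_mul(P_REV, P_REV)[0][1])
    for name, p in (("P_REV", P_REV), ("P_CYC", P_CYC)):
        pi = stationary(p)
        print(name, "stationary:", pi, "reversible:", is_reversible(p, pi))
```

For a continuous-time chain the same detailed-balance check applies with the off-diagonal intensities q_ij in place of p_ij, but the stationary distribution then has to be obtained from pi Q = 0 rather than by the iteration above.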