STAT 423
Fall 2009
Homeworks, Projects and Tests
Important: Cooperation and discussion of homework problems are allowed and encouraged, except for
the final computations and write-up, which have to be done individually.
Homework 1, due September 10, 2009, in class.
Problems 5.2 – 5.6 from Ewens and Grant. Bonus problem: Problem 5.1 from Ewens and Grant.
Homework 2, due September 22, 2009, in class.
1. Consider a time-continuous Markov chain with transition intensity matrix
$$
Q = \begin{pmatrix} -3 & 1 & 2 \\ 1 & -3 & 2 \\ 3 & 1 & -4 \end{pmatrix}.
$$
Check if the chain is reversible. Begin with computing the stationary distribution.
2. Consider a time-discrete Markov chain with transition probability matrix
$$
P = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 3/8 & 1/8 \\ 0 & 1 & 0 \end{pmatrix}.
$$
If the chain is in state 1 (states are numbered 1, 2, and 3) at time 0, what is the probability that it
will be in state 2 at time 2? Also, find the stationary distribution and check if the chain is reversible
(the conditions are the same as those for time-continuous chains).
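The computations requested in both problems can be sketched as follows. This is a minimal illustration, not a prescribed method: it solves the balance equations exactly with rational arithmetic, then tests detailed balance, pi_i a_ij = pi_j a_ji, where a is either Q or P. The function names and the small two-state example chains are hypothetical and are not the matrices from the problems; the chain is assumed to be irreducible so that the stationary distribution is unique.

```python
from fractions import Fraction


def stationary_distribution(A, discrete):
    """Solve pi P = pi (discrete=True) or pi Q = 0 (discrete=False), sum(pi) = 1.

    Assumes an irreducible chain, so the linear system has a unique solution.
    """
    n = len(A)
    A = [[Fraction(x) for x in row] for row in A]
    # Balance equations as rows: transpose of A, minus the identity if discrete.
    M = [[A[j][i] - (1 if discrete and i == j else 0) for j in range(n)]
         for i in range(n)]
    b = [Fraction(0)] * n
    # One balance equation is redundant; replace it with the normalization.
    M[-1] = [Fraction(1)] * n
    b[-1] = Fraction(1)
    # Gauss-Jordan elimination with exact fractions.
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / M[col][col]
        M[col] = [x * inv for x in M[col]]
        b[col] *= inv
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
                b[r] -= f * b[col]
    return b


def is_reversible(A, pi):
    """Detailed balance: pi_i * a_ij == pi_j * a_ji for every pair i < j."""
    n = len(A)
    return all(pi[i] * A[i][j] == pi[j] * A[j][i]
               for i in range(n) for j in range(i + 1, n))


def n_step(P, n):
    """n-step transition matrix P^n by repeated multiplication."""
    size = len(P)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = [[sum(result[i][k] * P[k][j] for k in range(size))
                   for j in range(size)] for i in range(size)]
    return result


# Hypothetical examples (not the matrices from the homework).
P_example = [[Fraction(1, 2), Fraction(1, 2)],
             [Fraction(1, 4), Fraction(3, 4)]]
Q_example = [[-1, 1],
             [2, -2]]

pi_P = stationary_distribution(P_example, discrete=True)
pi_Q = stationary_distribution(Q_example, discrete=False)
print(pi_P, is_reversible(P_example, pi_P))
print(pi_Q, is_reversible(Q_example, pi_Q))
print(n_step(P_example, 2)[0][1])  # P[X_2 = state 2 | X_0 = state 1]
```

Exact fractions are used so that the detailed-balance test can rely on equality; with floating-point entries a tolerance would be needed instead.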
Homework 3, due October 1, 2009, in class.
Problems 5.9 – 5.11 from Ewens and Grant. Bonus problem: Problem 5.12 from Ewens and Grant.
Homework 4, due October 19, 2009, in class.
Problems 6.2, 6.4 and 6.6 from Ewens and Grant.
Homework 5, due November 3, 2009, in class.
Problems 12.1, 12.2 and 12.3 from Ewens and Grant. Bonus problem: Problem 12.5 from Ewens and Grant.
Homework 6, due the last day of classes.
1. Problems 15.4 and 15.5 from Ewens and Grant. Bonus problem: Problem 15.3 from
Ewens and Grant.
2. BLAST: Use the method of difference equations to find θ*, u_h, w_h, m_h and then the
asymptotic tail distribution of the excursion height as well as the expected interladder
distance for a random walk, generated by alignment of two random DNA sequences, in
which the GG match has score +1, all other matches have score 0, and all mismatches
have score –1. Hint: Begin with finding out the probability P[S = k], for k = -1, 0, 1.
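The first step suggested by the hint can be sketched numerically. The sketch below computes P[S = k] from nucleotide frequencies and then finds θ* as the unique positive root of E[exp(θS)] = 1 by bisection. It assumes independent positions with equal frequencies 1/4 for A, C, G and T, which the problem does not state explicitly; the function names are hypothetical.

```python
import math


def step_distribution(freqs):
    """P[S = k] for k in {-1, 0, +1}: GG match +1, other matches 0, mismatches -1."""
    p_match = sum(p * p for p in freqs.values())  # any identical pair
    p_plus = freqs["G"] ** 2                      # the GG pair only
    return {1: p_plus, 0: p_match - p_plus, -1: 1.0 - p_match}


def theta_star(dist, tol=1e-12):
    """Positive root of E[exp(theta * S)] = 1, found by bisection.

    Requires a negative mean step and a positive probability of a +1 step,
    which together guarantee exactly one positive root.
    """
    def f(theta):
        return sum(p * math.exp(theta * k) for k, p in dist.items()) - 1.0

    mean = sum(k * p for k, p in dist.items())
    if mean >= 0 or dist.get(1, 0.0) <= 0:
        raise ValueError("need E[S] < 0 and P[S = 1] > 0")
    lo = 1.0
    while f(lo) >= 0:   # shrink until we are left of the root
        lo /= 2
    hi = 1.0
    while f(hi) <= 0:   # grow until we are right of the root
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Assumed uniform nucleotide frequencies (not given in the problem statement).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dist = step_distribution(uniform)
print(dist)
print(theta_star(dist))
```

Under the uniform assumption the equation becomes a quadratic in x = exp(θ), namely x^2 - 13x + 12 = 0, so the numerical root can be checked against log 12. With other frequencies only the `freqs` argument changes.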
Projects, due electronically by December 23, 2009. Choose one of the two enclosed topics
or try to come up with your own. Topic selection is subject to the instructor’s approval. Provide
detailed reports (up to 10 pages).
Topic 1. Consider a long stretch of DNA that is otherwise "random" but in which runs of
tetra-nucleotide repeats of the form ATXT, where X = A or C, appear from time to time.
1. Propose an HMM for such DNA.
2. Write the Viterbi and posterior decoding algorithms for this problem.
3. Using your favorite programming language, write code for the algorithms in item 2.
4. Simulate a long sequence of DNA with the properties described.
5. Apply your algorithms. Compare the outcomes of Viterbi and of posterior decoding.
6. Try to optimize your algorithms by trying different values of the coefficients.
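As a starting point for the Viterbi part of this topic, a minimal log-space sketch is given below. The five-state architecture (one background state B and a cycle R1 to R4 emitting A, T, A or C, T) and all numerical parameters are hypothetical choices made only for illustration; other architectures are possible, and posterior decoding and simulation are not covered.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to -infinity instead of raising an error."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, computed in log space."""
    if not obs:
        return []
    # v[s] = best log-probability of any path that ends in state s.
    v = {s: safe_log(start_p.get(s, 0.0)) + safe_log(emit_p[s].get(obs[0], 0.0))
         for s in states}
    back = []  # back[t][s] = best predecessor of s at step t + 1
    for sym in obs[1:]:
        new_v, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] + safe_log(trans_p[r].get(s, 0.0)))
            new_v[s] = (v[best] + safe_log(trans_p[best].get(s, 0.0))
                        + safe_log(emit_p[s].get(sym, 0.0)))
            ptr[s] = best
        back.append(ptr)
        v = new_v
    # Trace back from the best final state.
    path = [max(states, key=lambda s: v[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical model: B = background, R1..R4 = positions within an ATXT repeat.
STATES = ["B", "R1", "R2", "R3", "R4"]
START = {"B": 1.0}
TRANS = {
    "B": {"B": 0.99, "R1": 0.01},
    "R1": {"R2": 1.0},
    "R2": {"R3": 1.0},
    "R3": {"R4": 1.0},
    "R4": {"R1": 0.9, "B": 0.1},
}
EMIT = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "R1": {"A": 1.0},
    "R2": {"T": 1.0},
    "R3": {"A": 0.5, "C": 0.5},
    "R4": {"T": 1.0},
}

print(viterbi("GCATATATCTATATGG", STATES, START, TRANS, EMIT))
```

Working with log-probabilities avoids numerical underflow on long sequences, which matters for the simulated DNA in item 4. The transition values 0.99, 0.01, 0.9 and 0.1 are the kind of coefficients item 6 asks you to vary.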
Topic 2.
1. Retrieve, from a database, an amino acid sequence of a protein of your choice (you may use the
National Library of Medicine Entrez website, http://www.ncbi.nlm.nih.gov/entrez/, and choose
"Search Protein").
2. Run a BLAST search to identify evolutionary relatives of this sequence (i.e. homologous
proteins present in different species of organisms, or related protein families). BLAST can be
found at http://www.ncbi.nlm.nih.gov/BLAST/. Choose the right type of BLAST for your task (see
Blast Overview, Blast Course and Blast Tutorial for guidance).
3. Select about 20-30 proteins providing reasonably close matches to your sequence (the word
"reasonably" should be read as "statistically significant", as "making biological sense", or as
both). Provide comments regarding the criteria you used to select these sequences as well as
numerical parameters such as E-values, P-values and so forth. Comment on the score matrix
(BLOSUMn, PAMn) that you used; use another matrix if you feel it is needed.
4. Find one of the Phylip web servers and carry out phylogenetic analyses of your set of protein
sequences. Use UPGMA, Neighbor-Joining, and Maximum Parsimony. Check the bootstrap support
for the best tree you obtained. Compare the trees you obtained by the different methods.
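When commenting on the E-values and P-values in item 3, it may help to recall how the two are related. Under the usual assumption that the number of chance hits scoring at least S is Poisson with mean E, the probability of at least one such hit is P = 1 - exp(-E). The short sketch below implements that conversion; the function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """P(at least one chance hit) when the hit count is Poisson with mean E."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

For small E the two numbers are nearly equal, so the distinction only matters for weak matches with E around 1 or larger.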