Final Exam Stat 565_2015.doc

Final Exam Stat 565, DUE Dec 16th 2015, by 5 pm. Total 100 points

Attestation: I ________________________________________ have worked on this exam by myself without consulting or colluding with anybody else.

Signature:___________________________________________________________

This is a take-home exam; you may use any inanimate resources to help you work the exam. Please be neat and brief and answer the specific questions asked. Please print out the exam and answer the questions on the exam. ONLY give me relevant output. Please don't write on your output; answer the questions and attach output IF required. To differentiate them from the other files, all the files for this final have the suffix _final in their names.

Part 1 (30 points): Write TRUE or FALSE for the following WITH reasons:

a. For RNA-seq data we assume a continuous distribution for the response variables.
b. Large sample sizes are often required to get better power.
c. I have a data set with 20 variables which are correlated. I perform PCA on this and use the 20 created principal components as inputs for my regression analysis. In doing so, I have an uncorrelated and interpretable model.
d. After doing differential expression you construct a top 20 table. This means that these 20 genes are significant at the .05 level by the FDR method.
e. In microarray data (with 10,000 probes) a straight Bonferroni-type correction will be too conservative, so use a step-down Bonferroni-Holm approach.
f. If we have data from different conditions that are known to come from a smaller class of variables, it is best to cluster the data and see if the data cluster according to the known variables. Since we did not use the prior information, this VALIDATES the class information.
g. The "best" method for clustering is one that uses Euclidean distance and Maximum Linkage.
h. The likelihood rule used in Discriminant Analysis is always equivalent to the Linear Discriminant Rule. So use the latter as it's simpler.
i. Since hierarchical clustering does not require us to know the NUMBER of clusters ahead of time, it is preferable to non-hierarchical clustering.
j. The best way to analyze GWAS data is to use 2 by 2 contingency tables and perform chi-square tests.

Part 2 (28 points): Write your opinion and comments for the following questions. No more than 10 sentences.

1. There is a school of thought that does not believe in pre-processing genomic data. Here the idea is to use linear models for differential expression and allow the random error, with possibly systematic components, to account for the preprocessing issues. What is your opinion about this approach?
2. Give reasons why one uses the negative binomial distribution as opposed to the Poisson in the DESeq package.
3. Compare and contrast microarray and RNA-seq analysis. Which one would YOU prefer if you had to design a study?
4. Give a brief description of what you think and UNDERSTAND about GWAS. As a statistician, how would you be able to contribute?

Part 3: Problems: (Start each problem on a new page and write your name on each page.)

1. A. (14 points) For the data on the web, Pr1_Fin_2015JW, you are given 9 files. The data come from a one-factor design with 3 levels (A, B and C) and three replicates. We are interested in comparing the treatments to each other. Please use rma to normalize the AffyBatch data and write contrasts. Provide the top 20 table for each contrast. How many genes are common to all three lists? Provide a Venn diagram and comment.
2. (14 points) Use the data given in the file Pr2_final_2015.csv for this problem. In the data set you are given the normalized data on 10 conditions with 180 genes. It is of interest to use hierarchical clustering to look at the clustering among the 10 conditions. Use 3 distance measures (correlation, Manhattan, Euclidean) and 3 linkages (average, single and complete) to give 9 cluster patterns. Provide your dendrograms in a 3 by 3 plot. Comment on your findings. Also use k-means clustering. Compare the clusters to each other. ONLY provide the 9 DENDROGRAMS, the k-means clusters and your comments for this problem.
3. (8 points) Use the data in Pr3_final_2015.csv to discriminate between the three groups. Use lda. Use 130 observations to train and 20 to test. Report your results from testing.
4. (5 points) Use the data in Pr3_final_2015.csv for this problem. Use x1 as the response and x2 as the explanatory variable, do a LOWESS plot, and comment on the shape of the relationship between x1 and x2.
5. (1 point) What is ONE technique that you learned in this class that YOU think is most valuable for you? What is the one technique that will be LEAST valuable to you?
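As an illustration of the setup in Part 3, Problem 2, the sketch below enumerates the nine distance-and-linkage combinations and runs a small agglomerative clustering for each one. It is a minimal sketch in plain Python, not the course's own tooling: the function names (`agglomerate`, `run_grid`) and the four-condition data set `toy` are hypothetical, the exam file Pr2_final_2015.csv is not read, and correlation distance is assumed to mean 1 minus the Pearson correlation. Dendrogram plotting and k-means are left out.

```python
import math
from itertools import product


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def correlation_distance(u, v):
    # Assumed definition: 1 - Pearson correlation between the two profiles.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)


DISTANCES = {
    "correlation": correlation_distance,
    "manhattan": manhattan,
    "euclidean": euclidean,
}

# Each linkage reduces the list of pairwise distances between two clusters.
LINKAGES = {
    "average": lambda ds: sum(ds) / len(ds),
    "single": min,
    "complete": max,
}


def agglomerate(profiles, dist, link):
    """Naive agglomerative clustering.

    Returns the merge history as (members_a, members_b, height) tuples,
    where members are indices into `profiles`.
    """
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [dist(profiles[a], profiles[b])
                      for a in clusters[i] for b in clusters[j]]
                height = link(ds)
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


def run_grid(profiles):
    """Cluster the profiles under all 3 x 3 distance/linkage combinations."""
    return {
        (dname, lname): agglomerate(profiles, DISTANCES[dname], LINKAGES[lname])
        for dname, lname in product(DISTANCES, LINKAGES)
    }


# Hypothetical toy data: 4 conditions, each measured on 4 genes.
toy = {
    "c1": [1.0, 2.0, 3.0, 4.0],
    "c2": [1.1, 2.1, 2.9, 4.2],
    "c3": [4.0, 3.0, 2.0, 1.0],
    "c4": [4.3, 2.8, 2.2, 0.8],
}

if __name__ == "__main__":
    for (dname, lname), merges in run_grid(list(toy.values())).items():
        print(dname, lname, [(a, b) for a, b, _ in merges])
```

Each condition is treated as one profile across genes, so the clustering is among conditions, as the problem asks. With real data, the merge heights in each history are what a dendrogram would display; comparing the nine histories side by side shows how sensitive the grouping is to the choice of distance and linkage.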
