Final Exam Stat 565_2015.doc

Final Exam Stat 565, DUE Dec 16th 2015, by 5 pm.
Total 100 points
Attestation: I ________________________________________ have worked on this
exam by myself without consulting or colluding with anybody else.
Signature:___________________________________________________________
This is a take-home exam. You may use any inanimate resources to help you work the
exam. Please be neat and brief, and answer the specific questions asked. Please print
out the exam and answer the questions on the exam. ONLY give me relevant output. Please
don’t write on your output, but answer the questions and attach output IF required.
To differentiate them from the other files, all the files for this final have the extension _final
on them.
Part 1 (30 points): Write TRUE or FALSE for the following WITH reasons:
a. For RNA-seq data we assume a continuous distribution for the response variables.
b. Large sample sizes are often required to get better power.
c. I have a data set with 20 variables which are correlated. I perform PCA on this
and use the 20 created principal components as inputs for my regression analysis.
In doing so, I have an uncorrelated and interpretable model.
d. After doing differential expression you construct a top 20 table. This means that
these 20 genes are significant at .05 level by the FDR method.
e. In micro-array data (with 10,000 probes) a straight Bonferroni type correction will
be too conservative so use a Step-Down Bonferroni-Holm approach.
f. If we have data from different conditions that are known to come from a smaller
class of variables, it is best to cluster the data and see if the data cluster
according to the known variables. Since we did not use the prior information, this
VALIDATES the class information.
g. The “best” method for clustering is one that uses Euclidean distance and
Maximum Linkage.
h. The likelihood rule used in Discriminant Analysis is always equivalent to the
Linear Discriminant Rule. So use the latter as it’s simpler.
i. Since hierarchical clustering does not require us to know the NUMBER of
clusters ahead of time, it is preferable to non-hierarchical clustering.
j. The best way to analyze GWAS data is to use 2 by 2 contingency tables and
perform chi-square tests.
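For reference on item e, the two corrections it names can be sketched as below. This is a minimal Python illustration under stated assumptions: the five p-values are made up, and the function names `bonferroni` and `holm` are chosen here for the example rather than taken from any package.

```python
def bonferroni(pvals, alpha=0.05):
    """Single-step Bonferroni: compare every p-value to alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Step-down Bonferroni-Holm: compare the k-th smallest p-value
    (k = 0, 1, ...) to alpha / (m - k) and stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is retained as well
    return reject


# Made-up p-values, for illustration only.
pvals = [0.001, 0.012, 0.014, 0.04, 0.30]
print("Bonferroni rejections:", sum(bonferroni(pvals)))
print("Holm rejections:      ", sum(holm(pvals)))
```

Both procedures control the family-wise error rate, and Holm rejects every hypothesis that Bonferroni rejects.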
Part 2: Write your opinion and comments for the following questions. No more than 10
sentences. (28 points)
1. There is a school of thought that does not believe in pre-processing genomic data.
Here the idea is to use linear models for differential expression and allow the
random error, with possibly systematic components, to account for the pre-processing
issues. What is your opinion about this approach?
2. Give reasons why one uses the Negative Binomial distribution as opposed to the
Poisson for the DESeq package.
3. Compare and contrast microarray and RNA-seq analysis. Which one
would YOU prefer if you had to design a study?
4. Give a brief description of what you think and UNDERSTAND about GWAS.
As a Statistician how would you be able to contribute?
Part 3: Problems: (Start each problem on a new page and write your name on each
page.)
1. A. (14 points) For the data on the web Pr1_Fin_2015JW you are given 9
files. The result is from a one factor design with 3 levels, A, B and C with
three replicates. We are interested in comparing the treatments to each
other. Please use rma to normalize the affy-batch data and write contrasts.
Provide the top 20 table for each contrast. How many genes are common to
all three lists? Provide a Venn diagram and comment.
2. (14 points) Use the data given as file Pr2_final_2015.csv for this problem.
In the data set you are given the normalized data on 10 conditions with 180
genes. It is of interest to use Hierarchical clustering to look at the clustering
among the 10 conditions. Use 3 distance measures: Corr, Manhattan,
Euclidean and 3 linkages: Average, Single and Complete to give 9 cluster
patterns. Provide your dendrograms in a 3 by 3 plot. Comment on your
findings. Also use k-means clustering. Compare the clusters to each other.
ONLY provide the 9 DENDROGRAMS and the k-means clusters and
your comments for this problem.
3. (8 points) Use the data Pr3_final_2015.csv to discriminate between the
three groups. Use lda. Use 130 observations to train and 20 to test. Report
your results from testing.
4. (5 points) Use the data in Pr3_final_2015.csv for this problem. Use x1 as the
response and x2 as the explanatory variable and do a LOWESS plot and
comment on the shape of the relationship between x1 and x2.
5. (1 point) What is ONE technique that you learned in this class that YOU
think is most valuable for you? What is the one technique that will be
LEAST valuable to you?
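For Problem 2, the grid of 3 distance measures by 3 linkages can be sketched as below. This is a minimal Python illustration under stated assumptions, not a prescribed solution. Random numbers stand in for Pr2_final_2015.csv, whose layout is assumed here to be 180 genes in rows and 10 conditions in columns. "Corr" is taken to mean one minus the Pearson correlation, which is what SciPy's `"correlation"` metric computes.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Placeholder data (assumption): 180 genes (rows) by 10 conditions (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(180, 10))

# The conditions are the items being clustered, so they must be the rows.
conditions = data.T

# Exam wording mapped to SciPy metric names.
metrics = {
    "Corr": "correlation",     # 1 - Pearson correlation
    "Manhattan": "cityblock",
    "Euclidean": "euclidean",
}
methods = ["average", "single", "complete"]

# One linkage matrix per (distance, linkage) pair: 9 trees in total.
trees = {}
for label, metric in metrics.items():
    condensed = pdist(conditions, metric=metric)
    for method in methods:
        trees[(label, method)] = linkage(condensed, method=method)

# Cutting each tree into at most 3 groups gives labels that can be
# compared across the 9 combinations (3 is an arbitrary choice here).
labels = {key: fcluster(z, t=3, criterion="maxclust") for key, z in trees.items()}
for key, lab in labels.items():
    print(key, lab)
```

To draw the 3 by 3 figure, each linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` with its `ax` argument set to one panel of a matplotlib subplot grid.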