LAB #6 - Emerson Statistics

Biost 536, Fall 2014
Laboratory #6
November 2, 2014
Biost 536: Categorical Data Analysis in Epidemiology
Emerson, Fall 2014
Laboratory #6
November 2, 2014
Written problems: In this laboratory assignment, you will perform a large number of simulations
investigating one aspect of the controversy about the prevalence of “correct” results in the biomedical
literature.
On the web pages is a compilation of several papers discussing the accuracy of the medical literature:
• Fleming: “Clinical trials: Discerning hype from substance” (pp 1-7)
• Ioannidis: “Why most published research findings are false” (pp 8-13)
• Goodman, Greenland: “Assessing the unreliability of the medical literature…” (pp 14-38)
• Jager, Leek: “An estimate of the science-wise false discovery rate…” (pp 39-51)
• Ioannidis: Discussion of Jager & Leek (pp 52-60)
• Cox: Discussion of Jager & Leek (pp 61-63)
• Ioannidis: “How to make more published research true” (pp 64-69)
Much of the discussion of the attempts at formal analyses of the accuracy of the medical literature relates
to the distribution of p values that would be expected in the literature under the null and alternative
hypotheses. We will consider a setting in which
• We sample a normally distributed outcome that has variance 1.
• We consider a null hypothesis in which the mean is 0 and an alternative hypothesis in which the
mean is 0.1.
• We consider a range of sample sizes between n=100 and n=2,000.
• We only report studies having a one-sided p value less than or equal to 0.025.
We shall pretend that the variance is known, and we will find it adequate to simulate the “sufficient
statistic” of the sample mean from each simulated analysis. Each student should perform 100,000
simulated analyses for all combinations of the following parameters:
1. The mean is 0 or 0.1.
2. The sample size n is 100, 200, 500, 750, 1000, 1500, or 2000
During the discussion section on Tuesday, you should be prepared to discuss the implications of these
analyses as the prior prevalence of the alternative hypothesis ranges among 0.01, 0.05, 0.10, 0.20, and 0.50.
We will use the fact that if
\[ X_i \sim N\left(\mu, \sigma^2\right), \]
then
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \Rightarrow \quad Z = \sqrt{n}\,\frac{\bar{X} - \mu_0}{\sigma}, \]
noting that μ0 = 0 and σ = 1 in our simulated studies.
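As a check on the simulations, the rejection probabilities can also be computed directly. With μ0 = 0 and σ = 1, the statistic Z = sqrt(n) * Xbar is normally distributed with mean sqrt(n) * μ and variance 1, so the probability that Z exceeds 1.96 is an upper tail area of a normal distribution. The following is a minimal sketch in base R; the helper name reject_prob is hypothetical and is introduced only for illustration.

```r
# Theoretical probability that Z > 1.96 when Z ~ N(sqrt(n) * mu, 1).
# Assumes mu0 = 0 and sigma = 1, as in the simulated studies.
n <- c(100, 200, 500, 750, 1000, 1500, 2000)

# reject_prob is a hypothetical helper name, not part of any package.
reject_prob <- function(mu, n, zcrit = 1.96) {
  pnorm(zcrit, mean = sqrt(n) * mu, sd = 1, lower.tail = FALSE)
}

theory <- data.frame(
  n = n,
  null = reject_prob(0, n),          # type I error rate at each n
  alternative = reject_prob(0.1, n)  # power at each n when the mean is 0.1
)
print(round(theory, 4))
```

The simulated proportions of significant studies should agree with these values up to Monte Carlo error.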
In performing the simulation you will
1. Set the number of observations in your data set to nSim = 100000.
2. You will set the seed for your random number generator to the code assigned to you for
homework #4.
3. You will generate a random variable Xbar that contains simulated estimates from nSim
randomly generated studies using a normal distribution having mean 0 (for the null) or
mean 0.1 (for the alternative) and standard deviation equal to 1/sqrt(n), where n is the
sample size of the study you are simulating.
4. You will derive a test statistic Z = sqrt(n) * Xbar.
5. You will generate descriptive statistics for both Xbar and Z for those simulated studies
having significant results (i.e., Z > 1.96).
6. During discussion section, you will hand in a paper detailing the descriptive statistics for
each setting (based on null vs alternative and the sample size):
• the proportion of simulated studies that were significant,
• the mean, sd, minimum, quartiles, median, and maximum of Xbar among
significant study results, and
• the mean, sd, minimum, quartiles, median, and maximum of Z among significant
study results.
You should also explain how those statistics might be used to explore the positive
predictive value of a significant study under varying levels of prior prevalence of the
alternative.
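One possible way to connect the proportions of significant studies to the positive predictive value is Bayes' rule: if prev is the prior prevalence of the alternative, then PPV = prev * P(significant | alternative) / [prev * P(significant | alternative) + (1 - prev) * P(significant | null)]. The sketch below is only an illustration in base R. The function name ppv and its argument names are hypothetical, and the inputs shown are theoretical placeholder values for n = 500 rather than simulation results.

```r
# Positive predictive value of a significant study via Bayes' rule.
#   prev      : prior prevalence of the alternative hypothesis
#   p_sig_alt : proportion of significant studies under the alternative
#   p_sig_nul : proportion of significant studies under the null
# ppv is a hypothetical helper name used only for this illustration.
ppv <- function(prev, p_sig_alt, p_sig_nul) {
  (prev * p_sig_alt) / (prev * p_sig_alt + (1 - prev) * p_sig_nul)
}

prev <- c(0.01, 0.05, 0.10, 0.20, 0.50)

# Placeholder inputs: theoretical rejection probabilities for n = 500.
# Replace these with the proportions estimated from your own simulations.
p_sig_nul <- pnorm(1.96, lower.tail = FALSE)
p_sig_alt <- pnorm(1.96, mean = sqrt(500) * 0.1, lower.tail = FALSE)

ppv_table <- data.frame(prev = prev, ppv = ppv(prev, p_sig_alt, p_sig_nul))
print(round(ppv_table, 3))
```

Repeating the calculation for each sample size shows how the positive predictive value depends jointly on the power of the study and the prior prevalence.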
Stata Commands That Might Be Used
I recommend creating a Stata .do file. I pretend my code was 4444. Hence, in my .do file, I will have:
1. code to set the size of my dataset to nSim = 100000: set obs 100000
2. code to set the initial seed of my random number generator: set seed 4444
3. code to first define the variables I will later “replace” in each simulation:
• g Xbar= 0
• g Z= 0
4. code to perform a single simulation (I consider the alternative 0.1 and a sample size n= 500)
a. code to generate the estimate Xbar: replace Xbar = rnormal(0.1, 1/sqrt(500))
b. code to generate the test statistic Z: replace Z = sqrt(500) * Xbar
c. code to generate descriptive statistics for Xbar and Z for significant studies
5. Now repeat the fourth step for each of the scenarios
6. Summarize your results in a table to hand in during discussion section.
R Commands That Might Be Used
I presume that you are using the uwIntroStats R package (see www.emersonstatistics.com/R).
I recommend creating an R script file. I pretend my code was 4444. Hence, in my R script file, I will have:
1. code to set the size of my dataset to nSim = 100000: nSim <- 100000
2. code to set the initial seed of my random number generator: set.seed(4444)
3. code to perform a single simulation
a. code to generate the estimate Xbar: Xbar <- rnorm(nSim,0.1,1/sqrt(500))
b. code to generate the test statistic Z: Z <- sqrt(500) * Xbar
c. code to generate descriptive statistics for Xbar and Z for significant studies
4. Now repeat the third step for each of the scenarios
5. Summarize your results in a table to hand in during discussion section.
(There are many other ways this could be done. I just supplied code that would look the most like Stata.)
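As one example of those other ways, the repetition over scenarios could be written as a loop. The sketch below uses only base R (not uwIntroStats) and assumes the seed 4444 used in the examples above. The object names scenarios and results and the column names are hypothetical. Because all scenarios draw from one random number stream, the numbers depend on the order in which the scenarios are run.

```r
nSim <- 100000
set.seed(4444)  # replace 4444 with your own assigned code

mus <- c(0, 0.1)
ns <- c(100, 200, 500, 750, 1000, 1500, 2000)

# One row per (sample size, mean) scenario; n varies fastest.
scenarios <- expand.grid(n = ns, mu = mus)

results <- lapply(seq_len(nrow(scenarios)), function(i) {
  n <- scenarios$n[i]
  mu <- scenarios$mu[i]
  Xbar <- rnorm(nSim, mu, 1 / sqrt(n))  # simulated sample means
  Z <- sqrt(n) * Xbar                   # test statistic with mu0 = 0, sigma = 1
  sig <- Z > 1.96                       # significant studies only
  data.frame(
    mu = mu, n = n,
    propSig = mean(sig),
    meanXbar = mean(Xbar[sig]), sdXbar = sd(Xbar[sig]),
    meanZ = mean(Z[sig]), sdZ = sd(Z[sig])
  )
})
results <- do.call(rbind, results)
print(results)

# The minimum, quartiles, median, and maximum can be added in the same way,
# for example with summary(Xbar[sig]) and summary(Z[sig]) inside the function.
```

Collecting the results in a single data frame makes it straightforward to build the table requested for the discussion section.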