LAB #5 - Emerson Statistics

Biost 536, Fall 2014
Laboratory #5
October 25, 2014, Page 1 of 3
Biost 536: Categorical Data Analysis in Epidemiology
Emerson, Fall 2014
Laboratory #5
October 25, 2014
Written problems: In this laboratory assignment, you will perform a small number of simulations
investigating two methods commonly used to test for associations between two variables: fitting the
predictor as a linear continuous random variable, or fitting the predictor as dummy variables using a
categorization based on quartiles of the predictor distribution. Of interest is the statistical power to detect
associations. A full investigation of the properties of the various approaches would consider
• the behavior across a range of shapes of the true relationship between the two variables,
• the behavior according to the distribution of the predictor, and
• the behavior according to the strength of the association.
Because I do not want this task to be too arduous, I am asking each student to perform only 20 simulated
logistic regressions in one of two settings:
1. (Students whose last names start with A to J inclusive): When the true log odds of an event is
linear in X, or
2. (Students whose last names start with K to Z inclusive): When the true log odds of an event is a
step function that changes at the 62nd percentile of the true distribution of X.
During the discussion section on Tuesday, we will combine results across these small simulations to
consider the relative statistical power of regressions based on the linear continuous fit versus those based
on dummy variables.
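For the step-function setting, a quick numeric check can make the design concrete. The sketch below assumes X is standard normal (as in the simulation steps that follow) and uses only base R functions; the variable names are illustrative. It computes the 62nd percentile of X, which is close to the cutoff of 0.3 used later, and the event probabilities implied by log odds of 0 and 1.2.

```r
# 62nd percentile of a standard normal; this is close to the 0.3 cutoff used in the lab
cutoff <- qnorm(0.62)

# Event probabilities implied by the two levels of the step function
# plogis(lp) is the same as exp(lp) / (1 + exp(lp))
p_low  <- plogis(0)     # log odds 0 below the cutoff
p_high <- plogis(1.2)   # log odds 1.2 at or above the cutoff

print(c(cutoff = cutoff, p_low = p_low, p_high = p_high))
```

The log odds ratio between the two groups is 1.2, so the true odds ratio is exp(1.2).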
In performing the simulation you will
1. Set the number of observations in your data set to n = 200.
2. You will set the seed for your random number generator to the code assigned to you for
homework #3.
3. You will generate a random variable X that contains n randomly generated observations
from a standard normal distribution (mean 0, variance 1).
4. You will derive a categorized random variable Xctg based on the sample quartiles.
5. You will compute the linear predictor LP according to the first letter of your last name:
• (A-J):
LP = 1 + 0.5 * X
• (K-Z):
LP = 0 if X < 0.3 and LP = 1.2 if X >= 0.3.
6. You will compute a random variable p containing the event probability
p = exp(LP) / (1 + exp(LP))
7. You will simulate a random variable Y that contains n randomly generated observations
from Bernoulli distributions having parameters p.
8. You will perform two regressions on each data set:
• a logistic regression of Y on the linear continuous random variable X
• a logistic regression of Y on dummy variables using categorized Xctg
9. During discussion section, you will hand in a paper detailing the number of the 20
simulations that had statistically significant results (two-sided P < 0.05) for
• both the linear and dummy variable models,
• only the linear, but not the dummy variable model,
• only the dummy variable, but not the linear model, and
• neither the linear nor the dummy variable model.
You should also explain how those counts might be used to investigate which model
provided greater statistical power (i.e., what test might you perform to assess this).
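The four counts in step 9 form a two by two table of paired binary outcomes, since both models are fit to the same simulated data set. A minimal base R sketch of that tabulation follows. The helper name `tabulate_sig` and the six example indicators are hypothetical and made up only to show the shape of the table; they are not simulation results. McNemar's test is shown as one candidate for comparing paired proportions, not as the only acceptable answer.

```r
# Hypothetical helper: cross-tabulate significance indicators from the two models.
# sig_lin and sig_dum are logical vectors (TRUE = two-sided P < 0.05),
# with one entry per simulated data set.
tabulate_sig <- function(sig_lin, sig_dum) {
  table(linear = factor(sig_lin, levels = c(TRUE, FALSE)),
        dummy  = factor(sig_dum, levels = c(TRUE, FALSE)))
}

# Made-up indicators for six simulations, only to illustrate the layout
example_lin <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
example_dum <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
tab <- tabulate_sig(example_lin, example_dum)
print(tab)

# One candidate test for paired binary outcomes: McNemar's test,
# which depends only on the two discordant cells of the table
print(mcnemar.test(tab))
```

With few discordant pairs the chi-squared approximation is rough, so an exact binomial test on the discordant counts (for example with `binom.test`) may be preferable.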
Stata Commands That Might Be Used
I recommend creating a Stata .do file. I pretend my code was 3333. Hence, in my .do file, I will have:
1. code to set the size of my dataset to n = 200 (since I have no real data): set obs 200
2. code to set the initial seed of my random number generator: set seed 3333
3. code to first define the variables I will later “replace” in each simulation:
• g X = 0
• g Xctg = 0
• g LP = 0
• g p = 0
• g Y = 0
4. code to perform a single simulation
a. code to generate the predictor X: replace X = rnormal()
b. (we have to drop the previous categorization of X): drop Xctg
c. categorize the predictor X using quartiles: egen Xctg= cut(X), group(4)
d. code to generate the linear predictor LP (which you use here depends on your last name):
• (A-J assume true linear relationship):
◦ replace LP = 1 + 0.5 * X
• (K-Z assume true step function):
◦ replace LP = 0
◦ replace LP = 1.2 if X >= 0.3
e. code to compute the event probability p: replace p= exp(LP) / (1+exp(LP))
f. code to generate the binary observations Y: replace Y = rbinomial(1, p)
g. code to perform the logistic regression of Y on linear continuous X (record whether the
two-sided P value testing an association is less than 0.05)
h. code to perform the logistic regression of Y on dummy variables for categorized Xctg
(record whether the two-sided P value testing an association is less than 0.05)
5. Now repeat the fourth step 19 more times. (You should not change the size of your sample or
change the seed.)
6. Summarize your results in a two by two table to hand in during discussion section.
R Commands That Might Be Used
I presume that you are using the uwIntroStats R package (see www.emersonstatistics.com/R).
I recommend creating an R script file. I pretend my code was 3333. Hence, in my R script, I will have:
1. code to set the size of my dataset to n = 200 (since I have no real data): n <- 200
2. code to set the initial seed of my random number generator: set.seed(3333)
3. code to perform a single simulation
a. code to generate the predictor X: X <- rnorm(n)
b. categorize the predictor X using quartiles:
Xctg <- cut(X, breaks = quantile(X), include.lowest = TRUE)
c. code to generate the linear predictor LP (which you use here depends on your last name):
• (A-J assume true linear relationship):
◦ LP <- 1 + 0.5 * X
• (K-Z assume true step function):
◦ LP <- 1.2 * (X >= 0.3)
d. code to compute the event probability p: p <- exp(LP) / (1+exp(LP))
e. code to generate the binary observations Y: Y <- rbinom(n, 1, p)
f. code to perform the logistic regression of Y on linear continuous X (record whether the
two-sided P value testing an association is less than 0.05)
• regress("odds", Y, X)
g. code to perform the logistic regression of Y on dummy variables for categorized Xctg
(record whether the two-sided P value testing an association is less than 0.05)
• regress("odds", Y, dummy(Xctg))
4. Now repeat the third step 19 more times. (You should not change the size of your sample or
change the seed.)
5. Summarize your results in a two by two table to hand in during discussion section.
(There are many other ways this could be done. I just supplied code that would look the most like Stata.)
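As one of those other ways, here is a minimal sketch of the full set of 20 simulations in base R. It is written under several assumptions: `glm()` stands in for `regress()` from uwIntroStats, the linear model is tested with the Wald P value for the slope, and the dummy variable model is tested with a 3 degree of freedom likelihood ratio test against the intercept-only model. The helper name `one_sim` and its `setting` argument are hypothetical, and the P values need not match what `regress()` reports.

```r
n <- 200          # observations per simulated data set
nsim <- 20        # number of simulations
set.seed(3333)    # the pretend seed from the handout; set once, before the loop

# Hypothetical helper: simulate one data set and return both two-sided P values
one_sim <- function(n, setting = c("linear", "step")) {
  setting <- match.arg(setting)
  X <- rnorm(n)
  Xctg <- cut(X, breaks = quantile(X), include.lowest = TRUE)
  LP <- if (setting == "linear") 1 + 0.5 * X else 1.2 * (X >= 0.3)
  p <- plogis(LP)                 # same as exp(LP) / (1 + exp(LP))
  Y <- rbinom(n, 1, p)

  fit_null <- glm(Y ~ 1, family = binomial)
  fit_lin  <- glm(Y ~ X, family = binomial)
  fit_dum  <- glm(Y ~ Xctg, family = binomial)

  # Assumption: Wald test of the slope for the linear continuous model
  p_lin <- coef(summary(fit_lin))["X", "Pr(>|z|)"]
  # Assumption: likelihood ratio test of all three dummy variables at once
  p_dum <- anova(fit_null, fit_dum, test = "Chisq")[2, "Pr(>Chi)"]
  c(linear = p_lin, dummy = p_dum)
}

# One row per simulation, one column per model (use "step" for K-Z)
pvals <- t(replicate(nsim, one_sim(n, "linear")))
sig_lin <- factor(pvals[, "linear"] < 0.05, levels = c(TRUE, FALSE))
sig_dum <- factor(pvals[, "dummy"] < 0.05, levels = c(TRUE, FALSE))
print(table(linear = sig_lin, dummy = sig_dum))
```

Because the seed is set only once, each call to `one_sim` continues the same random number stream, which matches the instruction not to change the seed between simulations.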