Homework 8. Randomization and Bootstrapping
Due: Friday 4/11
Part I:
The goal of this part is to write R code to perform a randomization test.
Read the selected pages on randomization tests. The pages can be found as a pdf file in Moodle. You will need
to right click on the pdf to rotate the images.
You will duplicate Manly's randomization test to decide if the average male mandible length is significantly larger than
the average female mandible length. The mandible lengths are:
Males: 120, 107, 110, 116, 114, 111, 113, 117, 114, 112
Females: 110, 111, 107, 108, 110, 105, 107, 106, 111, 111
Define D0 as the difference Mean(Males) - Mean(Females) for the original data.
Use 4,999 randomizations, but first start off with a few hundred until you get your bugs worked out of your code.
(Hint: The sample() command will be useful for shuffling the values.)
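As a minimal sketch of the shuffling idea in the hint, the following uses two short made-up groups (not the mandible data) and hypothetical variable names. It shows only a single randomization step: pool the values, permute them with sample(), and split them back into groups of the original sizes.

```r
# Toy illustration of one randomization step; all values below are made up.
set.seed(1)                      # only so the sketch is reproducible
group_a <- c(5, 7, 9)            # hypothetical group 1
group_b <- c(4, 6, 8, 10)        # hypothetical group 2
pooled  <- c(group_a, group_b)   # combine both groups into one vector

shuffled <- sample(pooled)       # random permutation (no replacement)
first    <- seq_along(group_a)   # positions 1..3
new_a <- shuffled[first]         # first 3 shuffled values play the role of group 1
new_b <- shuffled[-first]        # remaining 4 values play the role of group 2
d_one <- mean(new_a) - mean(new_b)   # difference in means for this one shuffle
```

Called with a single vector argument, sample() returns a permutation of that vector, so every original value appears exactly once in the shuffled result.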
For this problem you will turn in:
1. Your R code: Use reasonable variable names and use the # sign to include comments inside your code. Have
a single variable, say N, with which you can specify the number of randomizations. (Hint: Use the sample( )
function.)
2. Calculate the p-value, which is the proportion of all D values that are greater than or equal to D0 (including D0).
Count the occurrence of D0 as the (N+1)th observation in the numerator and denominator. This p-value
(called level of significance in Manly) should be automatically calculated in your code.
3. A histogram, like the one below, which shows the distribution of the D's. You will need to use the hist( ),
paste( ), signif( ), title( ), and abline( ) commands. The paste( ) command is used to include the p-value and
number of simulations in the graph; e.g., hist( …., xlab=paste(N,"simulations")). The abline( ) command is used to place
a vertical line at the observed D0 value. The signif( ) command prevents the title's p-value from having
too many digits; e.g., title(main=paste("p-value=",signif(pval,3))).
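The three items above can be sketched as follows. This is one possible layout under stated assumptions, not the required solution: the function name randomization_test, the small tolerance used to guard against floating-point ties, and the reduced N = 499 for a quick trial run are all choices made for illustration.

```r
# Sketch of a one-sided randomization test for a difference in means.
# randomization_test() is a hypothetical helper name, not part of base R.
randomization_test <- function(x, y, N = 4999, tol = 1e-9) {
  d0     <- mean(x) - mean(y)     # observed difference D0
  pooled <- c(x, y)               # all values, group labels ignored
  first  <- seq_along(x)          # positions assigned to the first group
  d <- numeric(N)
  for (i in seq_len(N)) {
    shuffled <- sample(pooled)    # shuffle the pooled values
    d[i] <- mean(shuffled[first]) - mean(shuffled[-first])
  }
  # D0 counts as the (N+1)th value in both numerator and denominator;
  # tol is an assumed guard so exact ties are not lost to rounding.
  pval <- (sum(d >= d0 - tol) + 1) / (N + 1)
  list(d0 = d0, d = d, pval = pval)
}

set.seed(580)                     # only so the sketch is reproducible
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)
N   <- 499                        # small trial run; raise to 4999 when debugged
res <- randomization_test(males, females, N = N)

hist(res$d, xlab = paste(N, "simulations"), main = "")
title(main = paste("p-value =", signif(res$pval, 3)))
abline(v = res$d0, lty = 2)       # vertical line at the observed D0
```

Keeping N as a single variable means the trial run and the full 4,999-randomization run differ by one line.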
Part II:
The goals of this part are to:
• Learn how to perform a bootstrap estimate of the standard error of a statistic.
• Perform bootstrapping with regression data.
For all problems, use the "Rain" dataset. (Ignore that LA rain data are for July 1-June 30 whereas Eureka data are
for October 1-September 30.) We will only use the precipitation totals and ignore the El Nino/La Nina variable.
To import the data and make it easier to use, start with the following commands:
rain <- read.table(file="http://users.humboldt.edu/rizzardi/Data.dir/rain.TXT", header=TRUE)
EKA <- rain$EurekaPrecip  # Oct 1-Sept 30
LA <- rain$LAPrecip       # July 1-June 30
Problem A: In this problem you will calculate the correlation between the LA and EKA values and then estimate a
95% confidence interval for the correlation. Unfortunately, the formula for the standard error (SE) of a sample
correlation assumes the data are from a multivariate normal distribution, and the Central Limit Theorem does not
rescue this assumption. Consequently, you will use bootstrapping to estimate the SE and 95% confidence
interval.
1. Create a scatter plot of LA vs. EKA with EKA on the x-axis. Have the correlation value stated in the
graph's title.
2. Create two histograms, one of LA and one of EKA, to show that their distributions are not necessarily
symmetric. Make the ranges and breaks of the two graphs identical for easier comparison.
3. Sample pairs of the data to create B=5000 temporary bootstrapped samples. Calculate the correlation for
each bootstrapped sample as each sample is created.
a. Calculate the mean and sd of the B bootstrapped correlations. The sd is your bootstrapped
estimate of the SE for the sample correlation.
b. Calculate the 2.5th and 97.5th percentiles of the bootstrapped correlations. This is your 95%
confidence interval for the correlation. (Hint: Use the quantile() function.)
c. Create a histogram of your B bootstrapped correlation values. Place vertical lines at the original
data's correlation value and at the bootstrapped 95% confidence limits. Provide an appropriate
title. For the subtitle, state the 95% confidence interval with reasonable precision – paste() and
round() will help.
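The bootstrap steps in item 3 can be sketched as below. To keep the sketch self-contained it uses simulated data in place of the Rain dataset: x and y, and the gamma distributions that generate them, are hypothetical stand-ins for EKA and LA. The key point is that row indices are resampled, so each x value stays paired with its own y value.

```r
# Pairs bootstrap for a correlation, on simulated (hypothetical) data.
set.seed(42)                                    # only so the sketch is reproducible
n <- 60
x <- rgamma(n, shape = 4, scale = 10)           # stand-in for EKA (skewed)
y <- 0.3 * x + rgamma(n, shape = 2, scale = 3)  # stand-in for LA (skewed)
r_obs <- cor(x, y)                              # correlation of the original sample

B <- 5000
boot_r <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)   # resample row indices with replacement
  boot_r[b] <- cor(x[idx], y[idx])   # correlation of this bootstrapped sample
}

boot_mean <- mean(boot_r)
se_boot   <- sd(boot_r)                         # bootstrapped estimate of the SE
ci <- quantile(boot_r, c(0.025, 0.975))         # percentile 95% interval

hist(boot_r, main = "Bootstrapped correlations", xlab = "correlation",
     sub = paste("95% CI:", round(ci[1], 2), "to", round(ci[2], 2)))
abline(v = r_obs, lwd = 2)                      # original correlation
abline(v = ci, lty = 2)                         # bootstrapped 95% limits
```

Sampling x and y separately would break the pairing and push the bootstrapped correlations toward zero, which is why a single index vector is used for both.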
Problem B: In this problem you will estimate the SE for the slope of the least-squares regression line for the
model: LA = B0 + B1*EKA. First, use our classical methods. We will look at the residuals and see they clearly are
not normally distributed. Consequently, we will use bootstrapped pairs of data to estimate the SE of the slope.
1. Use a simple linear model to predict LA from EKA. What is the estimated slope of the line? What is the
SE of the slope? What is the 95% confidence interval for the slope? Create a scatter plot of the data and
place the regression line on the plot.
2. Create a histogram of the residuals from part 1. Do they look normally distributed? Describe the shape
of their distribution.
3. Using B=1000, sample pairs of the data to create new temporary bootstrapped samples and calculate the
slope for each of the B bootstrapped samples.
a. When performing the first 50 bootstrap samples, draw the bootstrapped dataset’s regression line
on the scatter plot of the original data. (Hint: An “if” statement in your for-loop will help you do
this. See example picture below.)
b. Calculate the mean and SD of the B bootstrapped slope coefficients. The SD is your
bootstrapped estimate of the SE for the slope’s estimate.
c. Calculate the bootstrapped 95% confidence interval for the slope.
d. Create a histogram of the B bootstrapped slopes. Include the vertical lines and labels as asked for
in problem A3c.
[Figure: example scatter plot of LA (y-axis) versus EKA (x-axis) with bootstrapped regression lines, as described in part 3a.]
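A minimal sketch of the pairs bootstrap for a regression slope, again on simulated data rather than the Rain dataset. The variables x and y and the skewed error distribution are hypothetical stand-ins for EKA and LA, chosen only so the block runs on its own.

```r
# Pairs bootstrap for a regression slope, on simulated (hypothetical) data.
set.seed(7)                                     # only so the sketch is reproducible
n <- 50
x <- runif(n, 20, 70)                           # stand-in for EKA
y <- 2 + 0.25 * x + (rexp(n, rate = 1/4) - 4)   # stand-in for LA, skewed errors

fit <- lm(y ~ x)                                # classical least-squares fit
slope_obs    <- coef(fit)[["x"]]
se_classical <- summary(fit)$coefficients["x", "Std. Error"]
ci_classical <- confint(fit, "x", level = 0.95)

plot(x, y)                                      # scatter plot of the original data
B <- 1000
boot_slope <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)              # resample (x, y) pairs together
  xb <- x[idx]
  yb <- y[idx]
  fit_b <- lm(yb ~ xb)
  boot_slope[b] <- coef(fit_b)[["xb"]]
  if (b <= 50) abline(fit_b, col = "grey")      # draw only the first 50 lines
}
abline(fit, lwd = 2)                            # original regression line on top

se_boot <- sd(boot_slope)                       # bootstrapped SE of the slope
ci_boot <- quantile(boot_slope, c(0.025, 0.975))

hist(boot_slope, main = "Bootstrapped slopes", xlab = "slope",
     sub = paste("95% CI:", round(ci_boot[1], 3), "to", round(ci_boot[2], 3)))
abline(v = slope_obs, lwd = 2)
abline(v = ci_boot, lty = 2)
```

Comparing se_classical with se_boot shows how far the normal-theory SE is from the bootstrapped one for a given dataset.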
(Stat 580 students only.) Problem C: Repeat problem B3a-d, but using bootstrapped residuals rather than
pairs of data. The steps are: (1) Calculate the residuals and predicted y values from the regression performed on
the original data, (2) Sample residual values and add them to the predicted y values to create new bootstrapped y
values (yi*), (3) Perform regression using the original x-values and the new bootstrapped y values (xi, yi*).
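Steps (1) to (3) can be sketched as follows, on simulated data standing in for the Rain dataset (x, y, and the error distribution are hypothetical). The sketch assumes the residuals are sampled with replacement, which is the usual convention but is not stated explicitly above.

```r
# Residual bootstrap for a regression slope, on simulated (hypothetical) data.
set.seed(11)                                    # only so the sketch is reproducible
n <- 50
x <- runif(n, 20, 70)                           # stand-in for EKA
y <- 2 + 0.25 * x + (rexp(n, rate = 1/4) - 4)   # stand-in for LA, skewed errors

fit   <- lm(y ~ x)
res   <- resid(fit)                             # step (1): residuals
y_hat <- fitted(fit)                            # step (1): predicted y values

B <- 1000
boot_slope <- numeric(B)
for (b in seq_len(B)) {
  # step (2): new y* = predicted y + resampled residual (assumed with replacement)
  y_star <- y_hat + sample(res, replace = TRUE)
  # step (3): regress y* on the ORIGINAL x values
  boot_slope[b] <- coef(lm(y_star ~ x))[["x"]]
}

se_boot <- sd(boot_slope)                       # bootstrapped SE of the slope
ci_boot <- quantile(boot_slope, c(0.025, 0.975))
```

Unlike the pairs bootstrap, the x values here are held fixed in every bootstrapped sample; only the y values change.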