Homework 8. Randomization and Bootstrapping
Due: Friday 4/11

Part I:
The goal of this part is to write R code to perform a randomization test. Read the selected pages on randomization tests. The pages can be found as a pdf file in Moodle. You will need to right-click on the pdf to rotate the images. You will duplicate Manly’s randomization test to decide whether the average male mandible length is significantly larger than the average female mandible length. The mandible lengths are:

Males: 120, 107, 110, 116, 114, 111, 113, 117, 114, 112
Females: 110, 111, 107, 108, 110, 105, 107, 106, 111, 111

Define D0 as the difference Mean(Males) - Mean(Females) for the original data. Use 4,999 randomizations, but start with a few hundred until you have worked the bugs out of your code. (Hint: The sample() command will be useful for shuffling the values.)

For this problem you will turn in:
1. Your R code: Use reasonable variable names and use the # sign to include comments inside your code. Have a single variable, say N, with which you specify the number of randomizations. (Hint: Use the sample() function.)
2. The p-value, which is the proportion of all D values that are greater than or equal to D0 (including D0). Count the occurrence of D0 as the (N+1)th observation in both the numerator and the denominator. This p-value (called the level of significance in Manly) should be calculated automatically in your code.
3. A histogram, like the one found below, which shows the distribution of the D’s. You will need to use the hist(), paste(), signif(), title(), and abline() commands. The paste() command is used to include the p-value and the number of simulations in the graph; e.g., hist( …., xlab=paste(N,"simulations")). The abline() command is used to place a vertical line at the observed D0 value. The signif() command keeps the title’s p-value from having too many digits; e.g., title(main=paste("p-value=",signif(pval))).
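The Part I steps can be sketched in R roughly as follows. This is a minimal sketch, not a model solution: the variable names (pooled, n_male, D) and the fixed seed are illustrative assumptions, and only the data values, D0, N, and the (N+1) counting rule come from the assignment.

```r
# Mandible lengths as listed in the assignment
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)

N <- 4999      # number of randomizations (use a few hundred while debugging)
set.seed(1)    # assumption: fixed seed so the run is reproducible

pooled <- c(males, females)             # all 20 lengths in one vector
n_male <- length(males)
D0 <- mean(males) - mean(females)       # observed difference in means

# Shuffle the pooled values N times and recompute the difference each time
D <- numeric(N)
for (i in seq_len(N)) {
  shuffled <- sample(pooled)            # random permutation of the 20 values
  D[i] <- mean(shuffled[1:n_male]) - mean(shuffled[-(1:n_male)])
}

# Count D0 itself as the (N+1)th observation in numerator and denominator
pval <- (sum(D >= D0) + 1) / (N + 1)

# Histogram of the randomized differences with the observed D0 marked
hist(D, xlab = paste(N, "simulations"), main = "")
title(main = paste("p-value=", signif(pval, 3)))
abline(v = D0, lty = 2)
```

Setting N to a few hundred at first makes the loop fast enough to rerun while checking that D0 and the shuffled differences look sensible.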
Part II:
The goals of this part are to:
- Learn how to perform a bootstrap estimate of the standard error of a statistic.
- Perform bootstrapping with regression data.

For all problems, use the “Rain” dataset. (Ignore that the LA rain data are for July 1-June 30 whereas the Eureka data are for October 1-September 30.) We will only use the precipitation totals and ignore the El Nino/La Nina variable. To import the data and make it easier to use, start with the following commands:

    rain <- read.table(file="http://users.humboldt.edu/rizzardi/Data.dir/rain.TXT", header=TRUE)
    EKA <- rain$EurekaPrecip   # Oct 1-Sept 30
    LA <- rain$LAPrecip        # July 1-June 30

Problem A:
In this problem you will calculate the correlation between the LA and EKA values and then estimate a 95% confidence interval for the correlation. Unfortunately, the formula for the standard error (SE) of a sample correlation assumes the data are from a multivariate normal distribution, and the Central Limit Theorem does not rescue this assumption. Consequently, you will use bootstrapping to estimate the SE and the 95% confidence interval.
1. Create a scatter plot of LA vs. EKA with EKA on the x-axis. State the correlation value in the graph’s title.
2. Create two histograms, one of LA and one of EKA, to show that their distributions are not necessarily symmetric. Make the ranges and breaks of the two graphs identical for easier comparison.
3. Sample pairs of the data (with replacement) to create B=5000 temporary bootstrapped samples. Calculate the correlation for each bootstrapped sample as each sample is created.
   a. Calculate the mean and sd of the B bootstrapped correlations. The sd is your bootstrapped estimate of the SE for the sample correlation.
   b. Calculate the 2.5th and 97.5th percentiles of the bootstrapped correlations. This is your 95% confidence interval for the correlation. (Hint: Use the quantile() function.)
   c. Create a histogram of your B bootstrapped correlation values.
      Place vertical lines at the original data’s correlation value and at the bootstrapped 95% confidence limits. Provide an appropriate title. For the subtitle, state the 95% confidence interval with reasonable precision; paste() and round() will help.

Problem B:
In this problem you will estimate the SE for the slope of the least-squares regression line for the model LA = B0 + B1*EKA. First, use our classical methods. We will then look at the residuals and see that they clearly are not normally distributed. Consequently, we will use bootstrapped pairs of data to estimate the SE of the slope.
1. Use a simple linear model to predict LA from EKA. What is the estimated slope of the line? What is the SE of the slope? What is the 95% confidence interval for the slope? Create a scatter plot of the data and place the regression line on the plot.
2. Create a histogram of the residuals from part 1. Do they look normally distributed? Describe the shape of their distribution.
3. Using B=1000, sample pairs of the data to create new temporary bootstrapped samples and calculate the slope for each of the B bootstrapped samples.
   a. For the first 50 bootstrap samples, draw the bootstrapped dataset’s regression line on the scatter plot of the original data. (Hint: An “if” statement in your for-loop will help you do this. See example picture below.)
   b. Calculate the mean and SD of the B bootstrapped slope coefficients. The SD is your bootstrapped estimate of the SE for the slope’s estimate.
   c. Calculate the bootstrapped 95% confidence interval for the slope.
   d. Create a histogram of the B bootstrapped slopes. Include the vertical lines and labels as asked for in problem A3c.

[Figure: example picture for part 3a; scatter plot with EKA on the x-axis and LA on the y-axis.]

Problem C: (Stat 580 students only.)
Repeat problem B3a-d, but using bootstrapped residuals rather than pairs of data.
The steps are:
(1) Calculate the residuals and predicted y values from the regression performed on the original data.
(2) Sample residual values and add them to the predicted y values to create new bootstrapped y values (yi*).
(3) Perform regression using the original x-values and the new bootstrapped y values (xi, yi*).
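The three residual-bootstrap steps above might look something like the sketch below. To keep it self-contained it runs on simulated stand-in data rather than the Rain dataset; the simulated values, the seed, and all variable names (x, y, fitted_y, res, boot_slopes) are assumptions made for illustration only.

```r
set.seed(580)   # assumption: fixed seed so the sketch is reproducible

# Simulated stand-in data (NOT the Rain dataset): skewed errors around a line
n <- 60
x <- runif(n, 20, 70)
y <- 5 + 0.25 * x + (rexp(n, rate = 1/3) - 3)

# Step (1): residuals and predicted y values from the original regression
fit      <- lm(y ~ x)
fitted_y <- fitted(fit)
res      <- resid(fit)

B <- 1000
boot_slopes <- numeric(B)

plot(x, y, main = "First 50 residual-bootstrap regression lines")
for (b in seq_len(B)) {
  # Step (2): resample residuals with replacement and add to predicted values
  y_star <- fitted_y + sample(res, replace = TRUE)
  # Step (3): regress the bootstrapped y values on the original x values
  boot_fit <- lm(y_star ~ x)
  boot_slopes[b] <- coef(boot_fit)[2]
  if (b <= 50) abline(boot_fit, col = "grey")   # draw only the first 50 lines
}
abline(fit, lwd = 2)   # regression line from the original data

boot_se <- sd(boot_slopes)                          # bootstrap SE of the slope
boot_ci <- quantile(boot_slopes, c(0.025, 0.975))   # percentile 95% interval
```

With the real data, x and y would be replaced by EKA and LA. Because the x-values stay fixed here, only the y-values vary from one bootstrap sample to the next, which is the main difference from the pairs approach in Problem B.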