Exam project in STK4170 and STK9170 Bootstrapping and resampling, fall 2011
Mathematical Institute, University of Oslo

This project starts December 5. The students have to submit a written report before December 13 at 16:00 by email to magne.aldrin@nr.no.

The project is divided into ten smaller problems. Problem 1e) counts 6 %, problems 2a) and 2b) count 12 % each, and the remaining seven problems count 10 % each in the evaluation of your project report. The report should contain a short description of how each problem is solved, the numerical answers and any comments on the answers. All computer code should be documented in an appendix.

You may use R or other computer programs for the computations. If you use R, you can either use the bootstrap routines from the book, or you can program from scratch. If you use a program other than R, every bootstrap routine must be programmed from scratch. It is always wise to plot the data, but it is up to you whether you want to present data plots in the report, except where they are explicitly asked for. If you run into problems with your computer code, it can be wise to use the R function browser(), which stops execution inside the code and gives you the opportunity to check the status of variables and to run single commands.

Use about 10000 bootstrap replicates when performing significance tests or constructing confidence intervals for problems that are not too computer-intensive. For some problems, for instance with the double bootstrap, it may be necessary to use only a few hundred replicates in the inner or outer loops. For most problems, it is possible to use the bootstrap routines in the book, but for at least problems 1h) and 2a) you have to program everything yourself.

If for some problems you are unable to do the numerical computations, but know the main ideas for how to solve the problem, please describe your solution in the main part of the report. The report can be in English or Norwegian.
If you have any technical problems, you can send me an email or phone me at 22 85 26 58.

Magne Aldrin

Problem 1

The two data sets NO2 and PM10 are available at the exam web page. They both contain 50 hourly observations of air pollution measured during a two-year period at a road in Oslo. Both data sets contain a random sample of 50 observations from a much larger data set of many thousand observations, and you can ignore correlation in time. The two data sets are collected for different hours, so there is no direct correspondence between the observations in the two data sets. The NO2 data set can for instance be read into R by read.table("NO2.dat",header=T).

The first column of the NO2 data contains the (natural) logarithm of the NO2 concentration (denoted y in the data frame), whereas the first column of the PM10 data contains the logarithm of the PM10 concentration, that is, particles smaller than 10 micrometers. The next seven columns in both data sets are the logarithm of the number of cars, the temperature 2 meters above the ground, the wind speed, the temperature difference between 25 meters and 2 meters above the ground, the wind direction in degrees between 0 and 360, the hour of the day, and the day number counted from the start of the original data period. In most problems below, the logarithm of NO2 or of PM10 will be treated as a response variable and the other variables as explanatory variables.

a) Consider the wind speed in the NO2 data. Calculate a 95 % confidence interval for the median wind speed, using the normal, basic, percentile and BCa bootstrap methods. Perform a significance test of the median being equal to or below 1.5 (H0: m <= 1.5, H1: m > 1.5) by using the BCa confidence interval method.

b) Consider the logarithm of the hourly number of cars in the NO2 data. Test whether its distribution is non-symmetric, but this time without using a method based on confidence intervals. Calculate an adjusted p-value by using the double bootstrap.
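If you use R, the confidence intervals in 1a) can be produced with the boot package. A minimal sketch of the workflow, shown on simulated data since the exam data are not reproduced here (the vector `wind` is a hypothetical stand-in for the wind speed column of NO2.dat):

```r
library(boot)

set.seed(1)
wind <- rexp(50, rate = 0.5)   # stand-in for the 50 wind speed observations

# The statistic function takes the data and a vector of resampled indices
med.fun <- function(d, i) median(d[i])

b <- boot(wind, med.fun, R = 10000)
boot.ci(b, conf = 0.95, type = c("norm", "basic", "perc", "bca"))
```

For the test in 1a), note that H0 can be rejected at level 5 % in a one-sided test if 1.5 falls below the lower endpoint of a suitably chosen BCa interval.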
It suffices to use 199 bootstrap replicates in the inner and outer loops, or more if your computer is fast.

c) Calculate NO2 and PM10 by taking exp() of their logarithms. NO2 and PM10 have different levels. First, adjust NO2 by a multiplicative factor so that it gets the same mean as PM10. Then, test whether PM10 and the adjusted NO2 have the same distribution.

d) Now, regress the logarithm of NO2 on all the seven explanatory variables, using multiple linear regression, for instance with the R function glm(). We will focus on the regression coefficient for the logarithm of the number of cars, here called beta1. Calculate the standard 95 % confidence interval based on normal theory. Then perform a case-wise bootstrap, and construct 95 % confidence intervals for beta1 by the normal, basic, percentile and BCa methods.

e) Then perform a case-wise bootstrap, and construct a 95 % confidence interval for beta1 by the studentized method. Use only 499 or 999 bootstrap replicates in this case.

f) Then perform a model-based or semi-parametric bootstrap by resampling residuals, and construct 95 % confidence intervals for beta1 by the normal, basic and percentile methods. Also use the BCa method if you know how.

g) Now, also regress the logarithm of PM10 on all the seven explanatory variables in the PM10 data, again using multiple linear regression. Again, we will focus on the regression coefficient for the logarithm of the number of cars, here called alpha1. In particular, we are interested in whether the number of cars has a different effect on PM10 than on NO2. Therefore, perform a case-wise bootstrap and construct a 95 % confidence interval for the difference beta1 - alpha1, by the normal, basic, percentile and BCa methods. Finally, use the percentile confidence interval to find the p-value for the test H0: beta1 = alpha1, H1: beta1 not equal to alpha1.

h) Some of the explanatory variables, for instance temperature, are expected to have a non-linear relation to the logarithm of NO2.
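For the case-wise bootstrap in 1d), the key point is that the statistic refits the regression on a resample of the rows (cases). A minimal sketch on simulated data (the variable names and the two-covariate model are hypothetical stand-ins for the NO2 data frame and its seven explanatory variables):

```r
library(boot)

set.seed(2)
n  <- 50
df <- data.frame(log.cars = rnorm(n), temp = rnorm(n))
df$y <- 1 + 0.5 * df$log.cars - 0.2 * df$temp + rnorm(n)

# Statistic: refit the model on a resampled set of rows and
# return the coefficient of interest (beta1 = coefficient of log.cars)
beta1.fun <- function(d, i) coef(glm(y ~ log.cars + temp, data = d[i, ]))[2]

b <- boot(df, beta1.fun, R = 2000)   # use about 10000 for the report
boot.ci(b, type = c("norm", "basic", "perc", "bca"))
```

The residual-resampling bootstrap in 1f) differs only in the statistic: keep the design fixed and add resampled residuals to the fitted values before refitting.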
However, with as few as 50 observations, a non-linear model may result in over-fitting. Consider the following three models of increasing complexity:

M1: All explanatory variables have linear effects, i.e. the linear model used above.
M2: Temperature at 2 meters, wind speed and wind direction have non-linear effects; the other four explanatory variables have linear effects.
M3: All explanatory variables have non-linear effects.

Models M2 and M3 can be specified as generalized additive models (GAMs) of the form y = s(x1) + b2*x2 + e, where s(x1) is a smooth function of x1, with the functional form estimated from the data, whereas x2 is included as a linear term as in linear regression. M2 is such a combined linear/non-linear model, whereas M3 has only non-linear terms. GAM models can be fitted by the function gam() in the gam library in R. Each non-linear function then uses four degrees of freedom (four free parameters). Predictions can be performed by the predict.gam function as predict.gam(gamobj,newdata=test.data). It is also useful to look at the gam plots by first writing par(mfrow=c(3,3)) and then plot(gamobj).

Perform a 10-fold cross validation, and compare the root mean squared prediction error (called PE in the lecture notes) for the three models. Select the best model. Repeat the cross validation once and compare the results. Perform a permuted 10-fold cross validation, where the 10-fold cross validation is repeated 10 times. Select the best model.

Problem 2

Two exercises on time series, to be done during the weekend.

a) The data set TimeUnivariate is available at the exam web page, with 300 observations of a univariate time series from time t=1 to t=300. The data set can for instance be read into R by read.table("TimeUnivariate.dat",header=T).
Fit an autoregressive model to the data, using maximum likelihood, where the number of autoregressive parameters (p) is found by minimizing Akaike's Information Criterion (AIC). You can use the R function ar(), with method="mle" and a maximum of 12 autoregressive parameters. What is the optimal value of p? Find the corresponding residual root mean squared error (residual RMSE).

For the optimal value of p, estimate the RMSE you can expect for future one-step-ahead predictions. Use forward validation, predicting the observations from number 51 to 300. Comment on the difference from the RMSE you found above.

Use instead forward validation, again predicting the observations from number 51 to 300, to select the optimal order p. What is the optimal order then?

Finally, use forward validation to estimate the RMSE you can expect for future one-step-ahead predictions, when the model is selected by minimizing the AIC and you also take into account the uncertainty introduced by the model selection.

b) The data set TimeReg is available at the exam web page, with 300 bivariate time series observations from time t=1 to t=300. The data set can for instance be read into R by read.table("TimeReg.dat",header=T). Variable y is a response variable and variable x is an explanatory variable. This is a regression problem with an autocorrelated noise term. A plausible model is

y_t = beta0 + beta1*x_t + n_t,

where the noise process n_t follows an AR(1) model, i.e. n_t = phi*n_{t-1} + e_t with e_t white noise. Our focus is on the regression parameter beta1.

Fit the model above to the data. You can use the R function arima() (corresponding to arima.mle() in S-PLUS). Report the estimate of beta1 and its standard error. Check also whether the residuals look Gaussian by using the R function qqnorm().

Find an alternative estimate of the standard error of beta1 by a block bootstrap with fixed block length 10, using overlapping blocks and the circular variant of the bootstrap algorithm. Use 1000 bootstrap replicates.
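One way to make sure the blocks really are overlapping and circular is to generate the resampled time indices explicitly. A from-scratch sketch (the function name is my own; in 2b) each index vector would be used to resample the data before re-estimating beta1):

```r
# Circular, overlapping block bootstrap of the time indices 1..n
# with fixed block length l
circular.block.indices <- function(n, l) {
  n.blocks <- ceiling(n / l)
  # every start point 1..n is allowed, so blocks can overlap
  starts <- sample(1:n, n.blocks, replace = TRUE)
  idx <- unlist(lapply(starts, function(s) s + 0:(l - 1)))
  # wrap indices past n back to the beginning: circular variant
  idx <- ((idx - 1) %% n) + 1
  idx[1:n]
}

set.seed(3)
circular.block.indices(12, 5)
```

Because the starts are drawn uniformly from all n positions (not only from positions where a full block fits), blocks overlap and every observation has the same inclusion probability, which is exactly the property the circular variant is designed to give.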
Are you sure that you used overlapping blocks and a circular bootstrap? Explain why.

Now, use the stationary bootstrap to estimate the standard error of beta1, and vary the expected block length over the values 1, 2, 3, 4, 5, 10, 15, 20, 25 and 30. Use 1000 bootstrap replicates. Use the results to choose a suitable block length, and give your final estimate of the standard error of beta1.
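The stationary bootstrap differs from the fixed-length block bootstrap only in how block lengths arise: blocks are continued with probability 1 - 1/l.mean and restarted at a uniform position otherwise, so block lengths are geometric with the given expected length. A from-scratch sketch of the index generation (function name hypothetical):

```r
# Stationary bootstrap indices for a series of length n,
# with expected block length l.mean (Politis-Romano style)
stationary.indices <- function(n, l.mean) {
  p <- 1 / l.mean
  idx <- integer(n)
  idx[1] <- sample(1:n, 1)
  for (t in 2:n) {
    if (runif(1) < p) {
      idx[t] <- sample(1:n, 1)          # start a new block
    } else {
      idx[t] <- (idx[t - 1] %% n) + 1   # continue the block, wrapping circularly
    }
  }
  idx
}

set.seed(4)
head(stationary.indices(300, 10), 20)
```

Note that l.mean = 1 gives p = 1, so every index is drawn independently and the scheme reduces to the ordinary i.i.d. bootstrap, which is a useful sanity check for the smallest expected block length in the list above.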