Fall 2013
STATISTICS 479
Assignment #7 (45 points)

Instructions: Turn in the programs, the plots and the written answers (when required) for each problem.

The data supplied in the file air pollution.txt are on air pollution in 41 U.S. cities. The type of air pollution under study is the annual mean concentration of sulfur dioxide. It is desired to develop a regression model to predict air pollution using 6 explanatory variables x1, ..., x6, measured as described in the Data section below. A SAS procedure was used to fit a multiple regression model:

    y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + β6 x6 + ε

using data for all 41 cities.

1. Use a scatterplot matrix and correlation matrices to carry out a preliminary assessment of the pairwise relationships between y and x1, x2, x3, x4, x5, and x6. On this basis alone, select a few explanatory variables that may be good predictors in a multiple regression model. Using the plot and the correlation matrix, find the four variables that are most strongly correlated among the explanatory variables. Based on the above analysis alone, suggest the explanatory variables that are most strongly involved in multicollinearity when fitting the full model.

2. Use a SAS procedure to fit the above multiple regression model to all 41 cities. Discuss the fit of this model using the ANOVA table, R^2, and the estimates table. Use other diagnostic tools, including output statistics and plots of residuals, to examine the adequacy of the model (use the diagnostic panel of plots). Comment on these diagnostics, identifying any possible problems with the model assumptions that they indicate and any cases that may be outliers and/or influential.

3. Identify any least squares estimates of the regression coefficients (β̂'s) from the fit of the model in Problem 2 that have a sign (positive or negative) different from what you would expect for the parameter. Is this an indication of multicollinearity?
Use the standard errors of the parameter estimates to show that these are poorly estimated. Do the variance inflation factors (VIFs) identify these parameters?

4. Remove the explanatory variables x3 and x6 from the six-variable model and use a multiple regression model to relate y to x1, x2, x4 and x5 only. What can you observe about the multicollinearity in the new model? Is there improvement in the accuracy of estimation of the parameters of this model (e.g., decreased standard errors, more significant t-statistics)? Justify your answers.

5. Perform a residual analysis using diagnostic statistics, including relevant graphics, for the above 4-variable model. Specifically, determine whether a single city does not appear to fit the model very well, and provide evidence from your analysis for this conclusion.

6. Remove the case you determined above to be a possible outlier from the data and use all 6 variables for the analyses described in the following three parts:

(i) Use a SAS procedure to do all possible regressions containing no fewer than 2 and no more than 4 explanatory variables. Print statistics for only the 4 best models in each case. Construct a plot of the Cp statistic for all models with 'reasonable' Cp values. Select a single model for each of 2, 3, and 4 explanatory variables, for the purpose of predicting the annual mean concentration of sulfur dioxide in a city, indicating your reasons for selecting each model. There may be several possible choices, i.e., there may be many 'good' models, but give arguments for each of your choices. Primarily, use s^2, R^2, and Cp in your arguments. Select one of these models as your final model and provide arguments supporting your choice.

(ii) Use the backward elimination subset selection procedure, with a significance level of 0.05 for deleting variables, to select a possible model.
State the model selected and report estimates of parameters and the analysis of variance table for this model.

(iii) Use the stepwise subset selection procedure, with significance levels of 0.10 for entry and 0.05 for deletion of variables, respectively, to select a possible model. State the model selected and report estimates of parameters and the analysis of variance table for this model.

Data: Note that the first column of the data set is a city number. Data values for the variables appear in the order they are described below. A link to a text file is available on the assignment page.

city = city number (input as character data)
y    = the annual mean concentration of sulfur dioxide (mcg/meter^3)
x1   = average annual temperature (°F)
x2   = number of manufacturing enterprises employing 20 or more workers
x3   = population size (1970 census) in thousands
x4   = average annual wind speed (mph)
x5   = average annual precipitation (inches)
x6   = average number of days with precipitation per year

 1   10  70.3   213   582   6.0   7.05   36
 2   13  61.0    91   132   8.2  48.52  100
 4   17  51.9   454   515   9.0  12.95   86
 5   56  49.1   412   158   9.0  43.37  127
 6   36  54.0    80    80   9.0  40.25  114
 7   29  57.3   434   757   9.3  38.89  111
 8   14  68.4   136   529   8.8  54.47  116
 9   10  75.5   207   335   9.0  59.80  128
10   24  61.5   368   497   9.1  48.34  115
11  110  50.6  3344  3369  10.4  34.44  122
12   28  52.3   361   746   9.7  38.74  121
13   17  49.0   104   201  11.2  30.85  103
14    8  56.6   125   277  12.7  30.58   82
15   30  55.6   291   593   8.3  43.11  123
16    9  68.3   204   361   8.4  56.77  113
17   47  55.0   625   905   9.6  41.31  111
18   35  49.9  1064  1513  10.1  30.96  129
19   29  43.5   699   744  10.6  25.94  137
20   14  54.5   381   507  10.0  37.00   99
21   56  55.9   775   622   9.5  35.89  105
22   14  51.5   181   347  10.9  30.18   98
23   11  56.8    46   244   8.9   7.77   58
24   46  47.6    44   116   8.8  33.36  135
25   11  47.1   391   463  12.4  36.11  166
26   23  54.0   462   453   7.1  39.04  132
27   65  49.7  1007   751  10.9  34.99  155
28   26  51.5   266   540   8.6  37.01  134
29   69  54.6  1692  1950   9.6  39.93  115
30   61  50.4   347   520   9.4  36.22  147
31   94  50.0   343   179  10.6  42.75  125
32   10  61.6   337   624   9.2  49.10  105
33   18  59.4   275   448   7.9  46.00  119
34    9  66.2   641   844  10.9  35.94   78
35   10  68.9   721  1233  10.8  48.19  103
36   28  51.0   137   176   8.7  15.17   89
37   31  59.3    96   308  10.6  44.68  116
38   26  57.8   197   299   7.6  42.59  115
39   29  51.1   379   531   9.4  38.79  164
40   31  55.2    35    71   6.5  40.75  148
41   16  45.7   569   717  11.8  29.07  123

Note: It may take several computer runs to complete the required analyses. You may cut-and-paste parts of computer printout into your answer sheets.

Due Thursday, November 21, 2013
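
For reference when working Problems 3, 4 and 6, the variance inflation factor and the Cp statistic are commonly defined as below. This is a sketch in standard textbook notation, not taken from the assignment itself. The symbols are assumptions introduced here: n is the number of cases used in the fit, p is the number of parameters in a candidate model (counting the intercept), SSE_p is the error sum of squares of that candidate model, s^2 is the mean squared error of the full model, and R_j^2 is the R^2 obtained by regressing x_j on the other explanatory variables in the model.

```latex
% Variance inflation factor for explanatory variable x_j
\[
  \mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]

% Mallows' C_p for a candidate model with p parameters (intercept included)
\[
  C_p = \frac{\mathrm{SSE}_p}{s^2} - (n - 2p)
\]
```

Under these definitions, a VIF near 1 means x_j is nearly uncorrelated with the other explanatory variables, and a candidate model with little bias has Cp close to p.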