Stat 301 – Final Exam
December 20, 2013
Name: ________________________

INSTRUCTIONS: Read the questions carefully and completely. Answer each question and show work in the space provided. Partial credit will not be given if work is not shown. Use the JMP output. It is not necessary to calculate something by hand that JMP has already calculated for you. When asked to explain, describe, or comment, do so within the context of the problem. Be sure to include units when discussing quantitative variables.

1. [14 pts] This question deals, in general, with the three model selection procedures: Forward, Backward and Mixed.

a) [3] Explain what the Prob to Enter is.

b) [4] For the Forward selection procedure, what is the first variable that will enter the model?

c) [3] Explain what the Prob to Leave is.

d) [4] For the Backward selection procedure, what is the first variable that will be removed from the model?

2. [26 pts] On the first two labs this semester we looked at the concentration of NonStructural Carbohydrates (NSC in mg/g) for trees and shrubs in dry and moist tropical forests. Researchers are interested in using the NSC concentrations to predict the concentration of sugar (in mg/g) for those tropical trees and shrubs. For the multiple regression model the centered value of NSC, NSC – 65.885, is used. The Forest Indicator is 0 if it is a Moist tropical forest and 1 if it is a Dry tropical forest.

Summary of Fit
RSquare                        0.643
RSquare Adj                    0.630
Root Mean Square Error         9.216
Mean of Response              29.161
Observations (or Sum Wgts)        87

Analysis of Variance
Source      DF   Sum of Squares   Mean Square    F Ratio
Model        3        12684.881       4228.29      49.79
Error       83         7048.866         84.93   Prob > F
C. Total    86        19733.747                  <.0001*

Parameter Estimates
Term                              Estimate   Std Error   t Ratio   Prob>|t|
Intercept                           28.296      1.3365     21.17    <.0001*
(NSC – 65.885)                       0.283      0.0328      8.63    <.0001*
Forest Indicator                     4.245      2.0832      2.04    0.0448*
Forest Indicator*(NSC – 65.885)      0.250      0.0703      3.55    0.0006*

a) [2] How much variation in sugar concentration can be explained by the model with (NSC – 65.885), Forest Indicator and the interaction between these two variables?

b) [4] Is the model useful? Support your answer with the value of the appropriate test statistic, P-value and conclusion based on the P-value.

c) [4] Give an interpretation of the estimated intercept within the context of the problem.

d) [4] Give an interpretation of the parameter estimate for (NSC – 65.885) within the context of the problem.

e) [4] Give an interpretation of the parameter estimate for the Forest Indicator variable within the context of the problem.

f) [4] Because the P-value for the parameter estimate for Forest Indicator*(NSC – 65.885) is so small, the Forest Indicator*(NSC – 65.885) term is statistically significant. What does this indicate about the relationship between sugar concentration and NSC concentration?

g) [4] Is there a statistically significant difference in average sugar concentrations for tropical trees and shrubs with 65.885 mg/g NSC in Moist and Dry tropical forests? Support your answer with the value of the appropriate test statistic, P-value and conclusion based on the P-value.

3. [37 pts] Standardized residuals, leverage (h) values, Cook’s D, and Studentized residuals are calculated for each of the n = 87 species using the fitted model in problem 2.

a) [5] The distribution of standardized residuals is given below. What does this indicate about the normal distribution condition? Be sure to make specific reference to the plots to support your answer.
[Plots: distribution of Standardized Residuals]

b) [3] The plot of standardized residuals versus the type of forest is given below. What does this plot tell you about the conditions necessary for statistical inference?

Level    n    Mean   Std Dev
Dry      38      0     0.974
Moist    49      0     0.999

c) [5] The four most extreme standardized residuals are given below. Which are statistically significant? Explain briefly.

Species                 Forest   Sugar   NSC   Std Resid, z   P-value, z
Ocotea sp. 1            Moist       52   230         –2.465       0.0137
Pseudolmedia laevis     Moist       48    69          2.042       0.0411
Zeyheria tuberculosa    Dry         81    92          3.749       0.0002
Pouteria macrophylla    Moist       57   103          1.975       0.0482

d) [3] How large does the leverage, h, have to be to be considered high leverage?

The ten largest leverage values are given below. The average NSC for trees and shrubs in Dry tropical forests is 56.8 mg/g while the average NSC for trees and shrubs in Moist tropical forests is 72.9 mg/g. The average sugar concentration for Dry tropical forest trees and shrubs is 27.7 mg/g while the average sugar concentration for Moist tropical forest trees and shrubs is 30.3 mg/g.

Species                      Forest   Sugar   NSC   h Sugar       F   P-value, F
Ocotea sp. 1                 Moist       52   230     0.333               0.0000
Ocotea sp. 2                 Moist       58   137     0.072    1.82       0.1505
Pourouma cecropiifolia       Moist       38   134     0.068    1.67       0.1806
Pouteria nemorosa            Moist       56   128     0.059    1.39       0.2512
Swietenia macrophylla        Moist       50   169     0.137    4.03       0.0099
Bougainvillea modesta        Dry         12    24     0.075    1.91       0.1339
Chrisyophyllum gonocarpon    Dry         59   129     0.264    9.48       0.0000
Neea cf. steimbachii         Dry         11    19     0.092    2.44       0.0704
Pouteria gardneriana         Dry         65   142     0.357   14.87       0.0000
Zeyheria tuberculosa         Dry         81    92     0.083    2.15       0.1002

e) [5] Calculate the F statistic for Ocotea sp. 1.

f) [4] What species of tropical trees and shrubs have statistically significant leverage values? Explain briefly.

g) [3] What is the reason for the statistically significant leverage?

The 5 trees and shrubs with either the largest values of Cook’s D or the most extreme Studentized residuals are given below.

Species                 Forest   Cook's D   Studentized Resid, t   P-value, t
Ocotea sp. 1            Moist       1.136                  –3.02       0.0034
Pouteria macrophylla    Moist       0.033                   2.01       0.0479
Pseudolmedia laevis     Moist       0.022                   2.06       0.0422
Pouteria gardneriana    Dry         0.166                  –1.09       0.2771
Zeyheria tuberculosa    Dry         0.346                   3.91       0.0002

h) [2] Which tree/shrub species has the largest value of Cook’s D? Give the name of the tree/shrub species and the value of Cook’s D.

i) [2] Is this considered a highly influential value? Explain briefly.

j) [2] Which tree/shrub species has the most extreme Studentized residual? Give the name of the tree/shrub species and the value of the Studentized residual.

k) [3] Does the tree/shrub species with the most extreme Studentized residual have statistically significant influence? Support your answer.

4. [48 pts] A random sample of 100 houses was selected from all houses sold in Ames in 2009-2010. We are interested in building a model for Sales Price ($1000) based on characteristics of the houses. There are 12 explanatory variables for each house.

Lot Area             Area of the lot in square feet
Quality              Index of quality, 0 = low, 10 = high
Condition            Index of condition, 0 = poor, 10 = excellent
Age                  Age of house at the time of sale
Basement Area        Area of the basement in square feet
First Floor Area     Area of the first floor in square feet
Second Floor Area    Area of the second floor in square feet
Baths                Number of baths (note: a half bath = ½)
Bedrooms             Number of bedrooms
Other Rooms          Number of other rooms
Fireplaces           Number of fireplaces
Garage Cars          Size of garage in terms of number of cars

Below is the output for the Forward selection procedure with Prob to Enter = 0.25. The C. Total sum of squares is 423164.51.

a) [5] What is the best single variable model for predicting Sales Price? How do you know this is the best single variable model? How much variability in Sales Price does this single variable explain?

b) [5] What variable was added to the model on the fourth step of the forward selection procedure? What was the P-value for this variable when it was entered? How much additional variability in Sales Price does adding this variable explain, given the variables entered in the first three steps?

c) [4] Could the 10 variable model (Lot Area, Quality, Condition, Age, Basement Area, First Floor Area, Second Floor Area, Bedrooms, Other Rooms, and Garage Cars) be the best model? Explain briefly.

Below is the initial setup for the Backward selection procedure with Prob to Leave = 0.10.

d) [5] What will be the first variable removed from the full model using the backward selection procedure? Give the P-value associated with this action and indicate how the value of R² will change and by how much when that variable is removed.

e) [5] After the first two variables are removed by the Backward selection procedure the Current Estimates are the same as those given at the end of the Forward selection procedure (see page 8). What will happen at the next step of the Backward procedure? Explain what will happen and why.

Running the Mixed procedure with Prob to Enter = Prob to Leave = 0.05 gives the Current Estimates given below.

f) [5] Complete the analysis of variance table that corresponds to the model with the terms checked as Entered (Lot Area, Quality, Condition, Age, Basement Area, First Floor Area, Second Floor Area, Bedrooms and Garage Cars).

Source      df     Sum of Squares   Mean Square   F        Prob > F
Model       ____   ______________   ___________   ______   <0.0001
Error       ____   ______________   ___________
C. Total    99     423164.51

g) [4] Could the model with the terms checked as Entered be the best model? Explain briefly.

The All Possible Models option in Stepwise was used to display the top 10 (according to R²) models for 9, 10, and 11 variable models along with the full 12 variable model. Below are summary values of various measures of how well the model fits the data. Use this output when answering parts h) – l).
Model   # variables   RSquare      RMSE       AICc        BIC        Cp
    1             9    0.8692   24.8034   940.4476   966.1045    9.9034
    2             9    0.8634   25.3390   944.7207   970.3776   13.8284
    3             9    0.8615   25.5224   946.1632   971.8201   15.1916
    4             9    0.8609   25.5724   946.5546   972.2114   15.5649
    5             9    0.8606   25.6011   946.7786   972.4355   15.7793
    6             9    0.8595   25.6983   947.5363   973.1931   16.5077
    7             9    0.8594   25.7101   947.6288   973.2856   16.5970
    8             9    0.8592   25.7337   947.8120   973.4689   16.7741
    9             9    0.8591   25.7365   947.8337   973.4906   16.7952
   10             9    0.8586   25.7812   948.1805   973.8374   17.1314
   11            10    0.8727   24.6001   940.2703   967.9461    9.4529
   12            10    0.8695   24.9061   942.7430   970.4188   11.6423
   13            10    0.8692   24.9419   943.0307   970.7066   11.9007
   14            10    0.8650   25.3368   946.1721   973.8479   14.7696
   15            10    0.8635   25.4781   947.2843   974.9602   15.8072
   16            10    0.8629   25.5298   947.6897   975.3655   16.1882
   17            10    0.8629   25.5318   947.7058   975.3817   16.2034
   18            10    0.8626   25.5616   947.9384   975.6142   16.4227
   19            10    0.8619   25.6204   948.3985   976.0743   16.8581
   20            10    0.8617   25.6400   948.5516   976.2274   17.0035
   21            11    0.8733   24.6840   942.4679   972.1025   11.0614
   22            11    0.8729   24.7188   942.7496   972.3843   11.3070
   23            11    0.8695   25.0465   945.3840   975.0187   13.6376
   24            11    0.8650   25.4803   948.8178   978.4525   16.7690
   25            11    0.8634   25.6304   949.9927   979.6274   17.8654
   26            11    0.8630   25.6682   950.2874   979.9221   18.1424
   27            11    0.8627   25.6986   950.5242   980.1588   18.3656
   28            11    0.8620   25.7572   950.9796   980.6142   18.7963
   29            11    0.8614   25.8161   951.4369   981.0716   19.2308
   30            11    0.8533   26.5645   957.1521   986.7867   24.8319
   31            12    0.8734   24.8167   945.1060   976.6372   13.0000

h) [2] Which model has the best R² value? Identify by giving the model number, number of variables and value of R².

i) [2] Which model has the best RMSE value? Identify by giving the model number, number of variables and value of RMSE.

j) [2] Which model has the best Cp value? Identify by giving the model number, number of variables and value of Cp.

k) [2] Which model has the best AICc value? Identify by giving the model number, number of variables and value of AICc.

l) [2] Which model has the best BIC value? Identify by giving the model number, number of variables and value of BIC.

m) [5] Model 1 includes variables Lot Area, Quality, Condition, Age, Basement Area, First Floor Area, Second Floor Area, Bedrooms and Garage Cars. Model 11 includes variables Lot Area, Quality, Condition, Age, Basement Area, First Floor Area, Second Floor Area, Bedrooms, Other Rooms and Garage Cars. Between these two models, which is best according to the definition of best we have been using in this course? Explain briefly.
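
Reference sketch: the Summary of Fit and Analysis of Variance values reported for the sugar model in problem 2 are tied together by the standard least squares identities (R² = SS Model / SS Total, Mean Square = SS / df, F Ratio = MS Model / MS Error, RMSE = square root of MS Error). The Python sketch below recomputes them from the sums of squares printed in the output, assuming an ordinary least squares fit with an intercept. The function name `fit_summary` and its argument names are hypothetical and are not part of JMP.

```python
import math


def fit_summary(ss_model, ss_error, n, k):
    """Recompute fit measures from sums of squares.

    ss_model, ss_error: model and error sums of squares
    n: number of observations
    k: number of model terms, not counting the intercept
    """
    ss_total = ss_model + ss_error          # C. Total sum of squares
    df_error = n - k - 1                    # error degrees of freedom
    ms_model = ss_model / k                 # Mean Square for Model
    ms_error = ss_error / df_error          # Mean Square for Error
    r2 = ss_model / ss_total                # RSquare
    r2_adj = 1 - (1 - r2) * (n - 1) / df_error  # RSquare Adj
    return {
        "ss_total": ss_total,
        "df_error": df_error,
        "ms_model": ms_model,
        "ms_error": ms_error,
        "r2": r2,
        "r2_adj": r2_adj,
        "rmse": math.sqrt(ms_error),        # Root Mean Square Error
        "f_ratio": ms_model / ms_error,     # F Ratio
    }


# Values taken from the problem 2 output: 87 species, 3 model terms.
summary = fit_summary(ss_model=12684.881, ss_error=7048.866, n=87, k=3)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Rounded to the precision shown in the output, the recomputed values agree with the printed RSquare, RSquare Adj, Root Mean Square Error, Mean Squares and F Ratio.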