Statistics 511 Final Exam
Dec. 16, 2004 4:40-6:30 p.m.

The following rules apply.

1. You may use up to 3 pages of notes, double-sided, any font.
2. You may use a calculator.
3. You may not collaborate or copy.
4. Failure to comply with item 3 could lead to reduction in your grade, or disciplinary action.

I have read the rules above and agree to comply with them.

Signature ________________________________________________

Name (printed) ___________________________________________

Statistics 511 Final Exam Fall 2004

1) Several variables were collected on 97 men with prostate cancer. The doctors would like to be able to determine which cancers will become invasive based on measurements of PSA (a blood chemical) and the cancer volume (CancerVol), which can be estimated noninvasively.

Below are loess fits to the regression of invasion probability on PSA (top plot) and invasion probability on Cancer Volume (lower plot) (2 separate regression fits). Does it look like logistic regression will provide an adequate fit to the data? Briefly justify your response.

[Figure: loess fit (smoothing parameter = 0.7) of Invasive versus PSA]

[Figure: loess fit of Invasive versus CancerVol]

Below is the SAS output for the logistic regression of Invasive (1=invasive cancer, 0=noninvasive cancer) on PSA and CancerVol.
The LOGISTIC Procedure

Model Information

Data Set                     WORK.PROSTATE
Response Variable            Invasive
Number of Response Levels    2
Number of Observations       97
Model                        binary logit
Optimization Technique       Fisher's scoring

Response Profile

Ordered Value   Invasive   Total Frequency
1               1          21
2               0          76

Probability modeled is Invasive=1.

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

Criterion   Intercept Only   Intercept and Covariates
AIC         103.353          65.503
SC          105.927          73.227
-2 Log L    101.353          59.503

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   41.8498      2    <.0001
Score              37.1807      2    <.0001
Wald               18.1894      2    0.0001

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -3.8156    0.6962           30.0399           <.0001
PSA         1    0.0675     0.0254           7.0624            0.0079
CancerVol   1    0.1141     0.0547           4.3440            0.0371

Odds Ratio Estimates

Effect      Point Estimate   95% Wald Confidence Limits
PSA         1.070            1.018    1.124
CancerVol   1.121            1.007    1.248

[Figure: Predicted probability versus PSA (from regression on PSA and CancerVol)]

[Figure: Predicted probability versus CancerVol (from regression on PSA and CancerVol)]

b) Test H0: β1 = β2 = 0 versus HA: at least one of the two coefficients is not zero.

Test Statistic:

Distribution of the Test Statistic Under the Null Hypothesis:

P-value:

Conclusion (stated in words):

c) Compute a 95% confidence interval for the regression coefficient for PSA.

d) A patient comes to the clinic with prostate cancer. His PSA is 100 and his Cancer Volume is 10. What is his predicted probability of having invasive cancer?
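The quantities in parts b) through d) can all be computed from numbers printed in the output above. The following is a minimal sketch, not an official solution: the variable names are made up for illustration, the interval in part c) is assumed to be a Wald interval with the normal critical value 1.96, and the p-value uses the fact that a chi-square variable with 2 degrees of freedom has upper tail probability exp(-x/2).

```python
import math

# Values copied from the PROC LOGISTIC output above.
neg2logl_null = 101.353   # -2 Log L, intercept only
neg2logl_full = 59.503    # -2 Log L, intercept and covariates
b0, b_psa, b_vol = -3.8156, 0.0675, 0.1141
se_psa = 0.0254

# Part b): likelihood ratio statistic, compared with chi-square on 2 df.
lr_stat = neg2logl_null - neg2logl_full
# For 2 degrees of freedom the chi-square upper tail is exactly exp(-x/2).
lr_pvalue = math.exp(-lr_stat / 2)

# Part c): Wald interval (assumption: normal critical value z = 1.96).
z = 1.96
ci_psa = (b_psa - z * se_psa, b_psa + z * se_psa)

# Part d): inverse logit of the linear predictor.
def predicted_probability(psa, cancer_vol):
    eta = b0 + b_psa * psa + b_vol * cancer_vol
    return 1.0 / (1.0 + math.exp(-eta))

p_hat = predicted_probability(100, 10)

print(f"LR statistic = {lr_stat:.3f}, p-value = {lr_pvalue:.2e}")
print(f"95% Wald CI for PSA coefficient = ({ci_psa[0]:.4f}, {ci_psa[1]:.4f})")
print(f"Predicted probability at PSA=100, CancerVol=10: {p_hat:.3f}")
```

Exponentiating the endpoints of `ci_psa` should reproduce, up to rounding, the odds ratio limits for PSA printed in the output, which is a quick consistency check on the arithmetic.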
e) A new drug has been developed that lowers PSA in men with prostate cancer. The drug company argues that the data presented here provide evidence that taking this drug will lower the risk of invasive cancer in men that have been diagnosed with prostate cancer and have elevated PSA. Do you agree with the drug company? Briefly support your answer.

Sulfur dioxide (SO2) is an important atmospheric pollutant. The level of SO2 varies considerably across the country. An investigator wants to predict the level of SO2 using the 22 variables in the table below. V1-V6 are composite variables taken from the Current Population Index.

Variables for predicting SO2

Variable    Description
YEARTEMP    Mean annual temperature
MANUFACT    Manufacturing output
POP70       Population - 1970
SPEEDWIN    Mean annual wind speed
PRECIP      Mean total annual precipitation
DAYNUM      Mean days of precipitation
CARS70      Number of registered cars 1970
GAS         Mean annual gasoline sales
HUMIDITY    Mean daily humidity
ROADS       Miles of roads - 1970
MinTemp     Mean Minimum Temperature
MaxTemp     Mean Maximum Temperature
ALTITUDE    Altitude
FOREST      Percent forested
TRUCKS      Number of registered trucks - 1970
COAL        Percent of electrical power generated by coal
V1
V2
V3
V4
V5
V6

Questions 2, 3 and 4 all refer to these data.

2. The investigator felt that 22 variables were too many for practical use. Hence he decided to use all subsets regression to select a smaller set of variables.

a. Below is some output from all subsets regression.
Number in Model   R-Square   SBC         Variables in Model
 1                0.4157     243.15913   MANUFACT
 2                0.5863     232.71636   MANUFACT POP70
 3                0.6198     232.96993   MANUFACT POP70 V5
 4                0.6680     231.13065   MANUFACT POP70 V3 V5
 5                0.7195     227.92702   MANUFACT POP70 FOREST TRUCKS V3
 6                0.7860     220.54826   MANUFACT POP70 PRECIP GAS ALTITUDE V3
 7                0.8070     220.02561   POP70 PRECIP CARS HUMIDITY ROADS ALTITUDE V3
 8                0.8268     219.30165   MANUFACT POP70 PRECIP FOREST COAL V2 V3 V6
 9                0.8346     221.11842   MANUFACT POP70 PRECIP GAS FOREST COAL V2 V3 V6
10                0.8453     222.10022   MANUFACT POP70 PRECIP HUMIDITY ROADS FOREST COAL V2 V3 V6
11                0.8561     222.84264   MANUFACT POP70 PRECIP DAYNUM GAS ALTITUDE V1 V2 V3 V5 V6
12                0.8639     224.27753   MANUFACT POP70 PRECIP DAYNUM GAS MINTEMP MAXTEMP ALTITUDE V1 V2 V3 V5
13                0.8706     225.91078   MANUFACT POP70 PRECIP DAYNUM GAS MINTEMP MAXTEMP ALTITUDE V1 V2 V3 V5 V6
14                0.8775     227.39496   YEARTEMP MANUFACT POP70 DAYNUM CARS MINTEMP MAXTEMP ALTITUDE TRUCKS COAL V1 V2 V3 V5
15                0.8802     230.17261   MANUFACT POP70 DAYNUM CARS GAS MINTEMP MAXTEMP ALTITUDE TRUCKS COAL V1 V2 V3 V5 V6
16                0.8829     232.96952   MANUFACT POP70 DAYNUM CARS GAS ROADS MINTEMP MAXTEMP ALTITUDE TRUCKS COAL V1 V2 V3 V5 V6
17                0.8878     234.93011   MANUFACT POP70 DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP FOREST TRUCKS COAL V1 V2 V3 V5 V6
18                0.8884     238.41900   MANUFACT POP70 SPEEDWIN DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP ALTITUDE TRUCKS COAL V1 V2 V3 V5 V6
19                0.8888     241.96613   MANUFACT POP70 PRECIP DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP ALTITUDE TRUCKS COAL V1 V2 V3 V4 V5 V6
20                0.8894     245.46001   MANUFACT POP70 SPEEDWIN PRECIP DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP ALTITUDE FOREST TRUCKS COAL V1 V2 V3 V5 V6
21                0.8899     248.99479   MANUFACT POP70 SPEEDWIN PRECIP DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP ALTITUDE FOREST TRUCKS COAL V1 V2 V3 V4 V5 V6
22                0.8901     252.65771   YEARTEMP MANUFACT POP70 SPEEDWIN PRECIP DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP ALTITUDE FOREST TRUCKS COAL V1 V2 V3 V4 V5 V6

[Figure: Plot of SBC versus number of parameters (P)]

[Figure: Plot of R2 versus number of parameters (P)]

a) Based on this output, about how many variables should be in the final model? Justify your answer briefly.

b) The investigator selected a candidate model, and looked at some of the resulting residual plots. Two typical plots are below. Based on these 2 plots, the investigator decided to make some adjustments to the data and model. What advice would you give about "adjustments" such as transforming variables or removing unusual data values? (Give 2 specific pieces of advice, with supporting evidence from the plots.)

[Figure: residual plot versus MANUFACT]

[Figure: residual plot versus Predicted Value]

c) Another investigator suggested using a stepwise method to select variables for this study. Give 2 reasons why all subsets regression is better for selecting variables in this study than a stepwise method.

d) Other investigators using SO2 in pollution studies transformed it to log(SO2). If this investigator decides to predict log(SO2) instead of SO2, does he need to redo the variable selection, or can he use one of the selected models? Explain your answer.

3. A student looking at the pollution data decided that log(SO2)=LSO2 could probably be predicted from mean windspeed alone, using polynomial regression. Some of the relevant output is below.

[Figure: Plot of LSO2 versus windspeed (SPEEDWIN). The loess curve is plotted using squares.]

Dependent Variable: LSO2

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              4    2.88858         0.72214       1.76      0.1597
Error             35   14.38580         0.41102
Corrected Total   39   17.27438

Root MSE          0.64111    R-Square   0.1672
Dependent Mean    3.11432    Adj R-Sq   0.0720
Coeff Var        20.58592

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Type I SS    Variance Inflation
Intercept   1    10.85596             81.05424          0.13     0.8942     387.95874    0
SPEEDWIN    1    -3.01474             36.34083         -0.08     0.9344     0.05255      259310
SPEEDWIN2   1    0.32819              5.98802           0.05     0.9566     2.16155      2547486
SPEEDWIN3   1    -0.00547             0.43051          -0.01     0.9899     0.67377      2869922
SPEEDWIN4   1    -0.00047328          0.01141          -0.04     0.9671     0.00070716   368171

a) Do an overall F-test of whether any of the regression coefficients are non-zero. State your conclusion clearly.

b) Do sequential unpooled testing to determine the appropriate degree for a polynomial fit to the data. What is the appropriate degree? How does your answer correspond to your response in part a?

c) The investigator is concerned about the very high variance inflation factors. What effect does the variance inflation factor have on your tests in parts a) and b) above?

4. After looking at the results obtained by the student and the variable selection results, the investigator decided to fit a polynomial of degree 2 in YEARTEMP, SPEEDWIN and PRECIP including all first order interactions.
Some of the output is below:

The SAS System

The REG Procedure
Model: MODEL1
Dependent Variable: LSO2

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              9   11.58682         1.28742       6.79      <.0001
Error             30    5.68756         0.18959
Corrected Total   39   17.27438

Root MSE          0.43541    R-Square   0.6708
Dependent Mean    3.11432    Adj R-Sq   0.5720
Coeff Var        13.98104

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Type I SS   Type II SS
Intercept    1    12.50779             10.37681          1.21     0.2375     387.95874   0.27545
YEARTEMP     1    -0.17487             0.24891          -0.70     0.4878     5.12074     0.09357
SPEEDWIN     1    -0.10851             1.12216          -0.10     0.9236     1.13771     0.00177
PRECIP       1    -0.08986             0.17204          -0.52     0.6053     1.94376     0.05172
YEARTEMP2    1    0.00033096           0.00199           0.17     0.8693     0.04366     0.00522
SPEEDWIN2    1    -0.05717             0.03350          -1.71     0.0983     0.86568     0.55198
PRECIP2      1    -0.00016982          0.00088005       -0.19     0.8483     0.31964     0.00706
SPEEDxPREC   1    0.01792              0.01035           1.73     0.0938     2.05591     0.56773
TEMPxSPEED   1    0.00782              0.01079           0.73     0.4740     0.08259     0.09967
TEMPxPREC    1    -0.00057411          0.00191          -0.30     0.7659     0.01712     0.01712

Assume that the regression assumptions are satisfied.

a) Do pooled sequential testing (one term at a time) to determine if any interaction effects are needed in the model.

b) Below is the plot of studentized residual versus MANUFACT. Do you think that MANUFACT should be included in the model? Support your answer briefly referring to the plot.

[Figure: studentized residual versus MANUFACT]
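For part a), the arithmetic behind sequential testing of the interaction terms can be sketched from the Type I sums of squares in the output above. This is a sketch under stated assumptions rather than an official solution: "pooled" is taken here to mean that the sum of squares and degree of freedom of each dropped term are added to the error before the next test, terms are tested from last entered to first, testing stops at the first significant term, and the 5% critical values of F(1, df) are hard-coded from standard tables. The function name and data layout are hypothetical.

```python
# Type I (sequential) sums of squares for the interaction terms, in the
# order they entered the model, copied from the PROC REG output above.
interaction_ss = [
    ("SPEEDxPREC", 2.05591),
    ("TEMPxSPEED", 0.08259),
    ("TEMPxPREC", 0.01712),
]

# Error sum of squares and degrees of freedom for the full model.
FULL_ERROR_SS, FULL_ERROR_DF = 5.68756, 30

# Upper 5% points of F(1, df) from standard tables (assumption: alpha = 0.05).
F_CRIT = {30: 4.17, 31: 4.16, 32: 4.15}

def pooled_sequential_tests(terms, error_ss, error_df, f_crit):
    """Test terms from last entered to first, pooling dropped terms into error.

    Assumes a non-significant term is dropped and its sum of squares and
    1 degree of freedom are added to the error before the next test.
    """
    results = []
    for name, ss in reversed(terms):
        mse = error_ss / error_df
        f_stat = ss / mse
        significant = f_stat > f_crit[error_df]
        results.append((name, error_df, round(f_stat, 3), significant))
        if significant:
            break  # keep this term and every term entered before it
        error_ss += ss
        error_df += 1
    return results

results = pooled_sequential_tests(
    interaction_ss, FULL_ERROR_SS, FULL_ERROR_DF, F_CRIT
)
for name, df, f_stat, significant in results:
    print(f"{name}: F(1, {df}) = {f_stat}, significant at 5%: {significant}")
```

If the course defines pooled testing differently, for example by keeping the full-model mean square error throughout, only the lines that update `error_ss` and `error_df` would change.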