Lab Activity 6 (07/23/2013)

Please finish this lab activity during the class time and submit in the drop box on Angel.

Note: Please include the necessary plot(s) or Minitab output that are used to answer each part of the following question.

Open the dataset "Senic". We have:
Y = InfctRisk, the risk of infection at a hospital
X1 = Stay, average length of stay at the hospital
X2 = Cultures, average number of bacterial cultures per day at the hospital
X3 = Age, average age of patients at hospital
X4 = Beds, the number of beds in the hospital
X5 = Census, the average daily number of patients

Part 1

a. (10pts) Plot the scatter plot of response versus each predictor and each pair of predictors. What do you observe on these plots? (Minitab: Graph>Matrix Plot. Input all variables and click "Matrix Options" and check "Upper Right")

[Matrix Plot of InfctRsk, Stay, Cultures, Age, Beds, Census]

Most of the panels show a random pattern; in other words, most pairs of variables do not appear to be linearly related. The exception is Census and Beds, which have a clear linear relationship in the plot.

b. (10pts) Fit a full model with all predictors in the model. What are the values of VIFs for predictors? Can you explain by the scatter plot in previous part? What action do you want to do? (Minitab: Regression>Regression, Choose Option and check "Variance Inflation factors")

Regression Analysis: InfctRsk versus Stay, Cultures, Age, Beds, Census

The regression equation is
InfctRsk = 0.21 + 0.206 Stay + 0.0590 Cultures + 0.0174 Age + 0.00045 Beds + 0.00103 Census

Predictor  Coef      SE Coef   T     P      VIF
Constant   0.205     1.208     0.17  0.865
Stay       0.20553   0.06609   3.11  0.002  1.814
Cultures   0.05904   0.01031   5.73  0.000  1.266
Age        0.01736   0.02300   0.76  0.452  1.197
Beds       0.000448  0.002678  0.17  0.868  30.323
Census     0.001031  0.003494  0.29  0.769  32.817

S = 0.992559   R-Sq = 47.7%   R-Sq(adj) = 45.2%

VIF for Stay is 1.814. VIF for Cultures is 1.266. VIF for Age is 1.197. VIF for Beds is 30.323. VIF for Census is 32.817. The VIFs of both Beds and Census are larger than 10, which indicates that there might be a linear relationship between them, as we assumed from the graph in the last part. We definitely want to delete Census.

c. (15pts) Suppose you want to delete variable "Census", and refit the model with only four predictors. Can you construct a general linear F-test using SSR(X5 | X1, X2, X3, X4) to test if H0: β5 = 0?

Source    DF  Seq SS
Stay      1   57.305
Cultures  1   33.397
Age       1   0.136
Beds      1   5.043
Census    1   0.086

SSR(X5 | X1, X2, X3, X4) = 0.086
F = [SSR(X5 | X1, X2, X3, X4)/1] / MSE(full) = 0.086 / 0.985 = 0.08731

d. (10pts) Can you find p-value of this F-test (as we did in class)? What is your conclusion? (Minitab: Graph>Probability Distribution Plot>View Probability, as demonstrated in class)

[Distribution Plot: F, df1=1, df2=107; shaded area 0.7682 to the right of X = 0.08731]

The p-value is 0.7682, which is greater than 0.05, so we fail to reject the null hypothesis H0: β5 = 0. Census does not add significantly to the model once the other four predictors are included.

e. (10pts) Go back to the output of the full model, what is the t-value for X5? How does this relate to the general linear F-test you did in part (c)? (Hint: Think about the situation of Simple Linear Regression)

The t-value for X5 is 0.29, and 0.29^2 = 0.0841. This is close to the F value 0.08731 calculated in part (c): for a test of a single coefficient, the squared t statistic equals the partial F statistic, and the small difference here comes from rounding.

f. (10pts) Go back to the model output without predictor "Census", what are VIFs for parameters now?

Predictor  Coef       SE Coef    T     P      VIF
Constant   0.179      1.199      0.15  0.881
Stay       0.21401    0.05925    3.61  0.000  1.471
Cultures   0.05861    0.01016    5.77  0.000  1.241
Age        0.01648    0.02270    0.73  0.470  1.176
Beds       0.0012213  0.0005375  2.27  0.025  1.232

g. (Open, 5pts) Is there any other variable you want to delete now? And do you think we get a perfect model after performing the action (as discussed in class)? What is your final model?
Yes, I want to delete Age. As we can see, the p-value of Age (0.470) is still greater than 0.05. After deleting it we get:

The regression equation is
InfctRsk = 0.975 + 0.228 Stay + 0.0563 Cultures + 0.00116 Beds

Predictor  Coef       SE Coef    T     P      VIF
Constant   0.9749     0.4858     2.01  0.047
Stay       0.22784    0.05598    4.07  0.000  1.319
Cultures   0.056304   0.009634   5.84  0.000  1.120
Beds       0.0011598  0.0005296  2.19  0.031  1.201

Part 2

h. (5pts) Go back to the model with all five predictors. Suppose we want to test if all insignificant predictors have coefficients equal to zero simultaneously, what are the null and alternative hypotheses? (Note: Intercept does not count.)

Ho: β4 = β5 = 0
Ha: at least one of β4, β5 is not 0

i. (15pts) How do you construct the test statistic? What is your conclusion?

F = {[SSR(X1,X2,X3,X4,X5) - SSR(X1,X2,X3)]/2} / MSE(X1,X2,X3,X4,X5)
  = [(95.966 - 90.838)/2] / 0.985
  = 2.603

Analysis of Variance (model with Stay, Cultures, Age)

Source          DF   SS       MS      F      P
Regression      3    90.838   30.279  29.86  0.000
Residual Error  109  110.542  1.014
Total           112  201.380

[Distribution Plot: F, df1=2, df2=107; shaded area 0.07874 to the right of X = 2.603]

Since the p-value of the F test (0.07874) is greater than 0.05, we fail to reject the null hypothesis.

j. (5pts) What is your model now after adopting the decision in i?

X4 and X5 are insignificant, so only X1, X2 and X3 are in the model now.

k. (Open 5pts) Compare the model you get in j and g. Are they the same? If not, which one do you think is better? (Hint: you may think from many perspectives, e.g. R2, number of variables in the model, scatter plot, multicollinearity. Remember that regression is a quite subjective topic!)

They are not the same. I think g is better. In j) the p-value of Age is still larger than 0.05, so it is not a perfect model. In g), however, we deleted Age instead of Beds, and every remaining predictor is significant with a VIF close to 1, so we got a perfect model. That means we should delete Census and Age to get a perfect model, even though the VIFs of both Beds and Census are larger than 10.
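
The F statistics and p-values in parts c, d and i can be cross-checked outside Minitab. The following is only a minimal Python sketch and not part of the Minitab output. The helper f_upper_tail and all variable names are hypothetical, and the inputs are the rounded sums of squares and MSE reported above, so the results should agree with Minitab only to about three decimal places. The helper integrates the F upper tail numerically with Simpson's rule and assumes df2 > 2.

```python
import math


def f_upper_tail(f_value, df1, df2, steps=2000):
    """Approximate P(F > f_value) for an F(df1, df2) distribution.

    Uses P(F > f) = I_x(df2/2, df1/2) with x = df2 / (df2 + df1*f),
    where I_x is the regularized incomplete beta function. The
    substitution u = sqrt(1 - t) removes the endpoint singularity, and
    the integral is evaluated with Simpson's rule (steps must be even).
    """
    a, b = df2 / 2.0, df1 / 2.0
    x = df2 / (df2 + df1 * f_value)
    lo = math.sqrt(1.0 - x)

    def g(u):
        return 2.0 * (1.0 - u * u) ** (a - 1.0) * u ** (2.0 * b - 1.0)

    h = (1.0 - lo) / steps
    total = g(lo) + g(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    integral = total * h / 3.0

    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return integral / math.exp(log_beta)


# Values copied from the Minitab output (rounded as reported).
mse_full = 0.985       # MSE of the five-predictor model
df_error_full = 107    # 113 hospitals minus 6 estimated parameters

# Parts c and d: H0: beta5 = 0, using SSR(X5 | X1, X2, X3, X4).
ssr_census_extra = 0.086
f_census = (ssr_census_extra / 1) / mse_full
p_census = f_upper_tail(f_census, 1, df_error_full)

# Part i: H0: beta4 = beta5 = 0, using SSR(full) - SSR(X1, X2, X3).
ssr_full, ssr_reduced = 95.966, 90.838
f_joint = ((ssr_full - ssr_reduced) / 2) / mse_full
p_joint = f_upper_tail(f_joint, 2, df_error_full)

print(f"Census only:      F = {f_census:.5f}, p = {p_census:.4f}")
print(f"Beds and Census:  F = {f_joint:.3f}, p = {p_joint:.5f}")
```

Both p-values should land close to the shaded areas in the distribution plots (0.7682 and 0.07874), which is why neither null hypothesis is rejected at the 0.05 level.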