Stat 8210 Final Project
Katherine Morgan

Part One

Minitab and SAS were used to find the best model for predicting the length of stay of patients in the hospital. First, indicator variables were created for medical school affiliation (1 = Yes, 2 = No) and region (1 = NE, 2 = NC, 3 = S, 4 = W). Next, a model was fit that included all candidate variables: age, infection risk, routine culturing ratio, routine chest X-ray ratio, number of beds, medical school affiliation, region, average daily census, number of nurses, and available facilities and services.

The multiple regression output from Minitab suggests that length of stay is related to at least one regressor (F = 14.18, p < 0.001; Figure 1). The regressors highlighted in yellow do not contribute significantly to the model given that the other regressors are in the model. The only regressors that contribute significantly are Age (p = 0.006), Infection (p = 0.001), Census (p = 0.001), Nurses (p = 0.009), Region 1 (p < 0.001), and Region 2 (p = 0.009).

Multicollinearity was investigated by examining the variance inflation factor (VIF) of each regressor. VIFs larger than 10 imply serious problems with multicollinearity. This model has two regressors with VIFs larger than 10: Beds (35) and Census (34). This suggests that multicollinearity is a potential problem with this model.

The normal probability plot is not linear (Figure 2); it curves toward the top. Therefore, we cannot be reasonably assured that the residuals are normally distributed (Anderson-Darling p < 0.005). The plot of the residuals shows two unusual points, observations 43 and 100 (Figure 3). These observations were deleted and the model was refit. The adjusted R-squared (R2ADJ) for the new model increased to 58.1% and the PRESS statistic decreased to 113.741, suggesting that deleting observations 43 and 100 improved the model. The previous model was then abandoned, and SAS was used to generate all possible regressions.
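The multicollinearity check described above can be illustrated with a short sketch. It uses the VIF values reported in Figure 1 and the standard identity VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. The cutoff of 10 is the rule of thumb quoted in the text; the variable and function names are illustrative only and are not part of the original Minitab analysis.

```python
# VIF values copied from the Minitab output in Figure 1.
vifs = {
    "Age": 1.176,
    "Infection": 2.155,
    "Culture": 1.979,
    "Xray": 1.416,
    "Beds": 35.699,
    "Census": 34.211,
    "Nurses": 7.056,
    "Services": 3.242,
    "Region_1": 2.743,
    "Region_2": 2.408,
    "Region_3": 2.585,
    "Med School_1": 1.855,
}

VIF_CUTOFF = 10.0  # rule of thumb used in the report


def implied_r_squared(vif):
    """Invert VIF = 1 / (1 - R^2) to get the R^2 of a predictor on the others."""
    return 1.0 - 1.0 / vif


# Predictors whose VIF exceeds the cutoff, in the order listed above.
flagged = [name for name, vif in vifs.items() if vif > VIF_CUTOFF]

for name in flagged:
    r2 = implied_r_squared(vifs[name])
    print(f"{name}: VIF = {vifs[name]:.3f}, implied R^2 = {r2:.3f}")
```

Under these values only Beds and Census are flagged, and each implied R^2 is above 0.97, meaning almost all of the variation in either predictor is explained by the remaining regressors.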
First the stepwise regression method was used, followed by the backward method and finally the forward method. The regressors chosen by each method are shown in Figure 4. A multiple regression model was fit using the regressors from each selection. The adjusted R-squared (R2ADJ) values did not differ notably among the three models, and the forward selection had the smallest PRESS statistic (Figure 5).

To check the normality assumption, normal probability plots and plots of the residuals were examined (Figures 6, 7, and 8). Each of the normal probability plots is nearly linear. Therefore, we can be reasonably assured that the residuals have a normal distribution and that no transformation is necessary. None of the residual plots shows an obvious model defect, because in each plot the residuals can be contained in a horizontal band.

The regressors from the stepwise selection were chosen because they give the simplest model. The plot of the residuals shows two unusual points, observations 98 and 101. These were removed, and a multiple regression model was fit using the variables Infection, Beds, Region 1, Region 4, and Med School 1. The plot of the residuals shows no obvious model defects, because the residuals can be contained in a horizontal band (Figure 10). The normal probability plot is nearly linear, so no transformation is necessary (Figure 10). Every variable except number of beds (p = 0.120) contributes significantly to the model given that the other variables are in the model. Number of beds was removed and the model was refit; because the adjusted R-squared did not change appreciably (56%), the variable was left out. The analysis of variance gives an F-value of 29.04, suggesting that length of stay is related to at least one of the regressors (p < 0.001; Figure 9).
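The summary statistics used above to compare models (R-squared, adjusted R-squared, and the PRESS-based predicted R-squared) can be recomputed from the analysis-of-variance quantities. The sketch below uses the sums of squares, degrees of freedom, and PRESS value reported for the final model in Figure 9, together with the standard textbook definitions; the variable names are illustrative and do not come from Minitab or SAS.

```python
# Quantities copied from the Minitab output for the final model (Figure 9).
ss_residual = 90.493   # residual (error) sum of squares
ss_total = 215.353     # total sum of squares
df_residual = 102      # n - p, with n = 107 observations and p = 5 parameters
df_total = 106         # n - 1
press = 99.4877        # prediction error sum of squares

# R^2: share of the total variation explained by the regression.
r_squared = 1.0 - ss_residual / ss_total

# Adjusted R^2: compares mean squares, so adding regressors is penalized.
r_squared_adj = 1.0 - (ss_residual / df_residual) / (ss_total / df_total)

# Predicted R^2: same form as R^2 with PRESS in place of the residual SS.
r_squared_pred = 1.0 - press / ss_total

print(f"R-Sq       = {100 * r_squared:.1f}%")
print(f"R-Sq(adj)  = {100 * r_squared_adj:.1f}%")
print(f"R-Sq(pred) = {100 * r_squared_pred:.2f}%")
```

Because PRESS is built from leave-one-out prediction errors, a smaller PRESS corresponds directly to a larger predicted R-squared for a fixed total sum of squares, which is why PRESS is a reasonable tiebreaker when the adjusted R-squared values are close.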
Using the parameter estimates, the prediction equation for length of stay in hospital is:

Stay = 6.99 + 0.549 Infection + 0.652 Region_1 - 1.58 Region_4 + 0.900 MS_1

Regression Analysis: Stay versus Age, Infection, ...

* Region_4 is highly correlated with other X variables
* Region_4 has been removed from the equation.
* Med School_2 is highly correlated with other X variables
* Med School_2 has been removed from the equation.

The regression equation is
Stay = 1.18 + 0.0799 Age + 0.440 Infection + 0.0055 Culture + 0.0127 Xray - 0.00485 Beds + 0.0152 Census - 0.00589 Nurses - 0.0122 Services + 1.88 Region_1 + 1.07 Region_2 + 0.722 Region_3 + 0.267 Med School_1

Predictor          Coef   SE Coef      T      P     VIF
Constant          1.175     1.638   0.72  0.475
Age             0.07992   0.02827   2.83  0.006   1.176
Infection        0.4397    0.1273   3.45  0.001   2.155
Culture         0.00555   0.01598   0.35  0.729   1.979
Xray           0.012688  0.007147   1.78  0.079   1.416
Beds          -0.004851  0.003603  -1.35  0.181  35.699
Census         0.015182  0.004424   3.43  0.001  34.211
Nurses        -0.005891  0.002218  -2.66  0.009   7.056
Services       -0.01218   0.01377  -0.88  0.379   3.242
Region_1         1.8806    0.4441   4.23  0.000   2.743
Region_2         1.0676    0.3987   2.68  0.009   2.408
Region_3         0.7223    0.3967   1.82  0.072   2.585
Med School_1     0.2666    0.4411   0.60  0.547   1.855

S = 1.23065   R-Sq = 63.0%   R-Sq(adj) = 58.5%
PRESS = 198.143   R-Sq(pred) = 51.58%

Analysis of Variance

Source           DF       SS      MS      F      P
Regression       12  257.759  21.480  14.18  0.000
Residual Error  100  151.451   1.515
Total           112  409.210

Source        DF   Seq SS
Age            1   14.604
Infection      1  116.356
Culture        1    3.248
Xray           1    8.606
Beds           1   31.087
Census         1   39.696
Nurses         1   14.221
Services       1    0.011
Region_1       1   18.834
Region_2       1    5.958
Region_3       1    4.586
Med School_1   1    0.553

Figure 1: Multiple regression model using all variables as possible regressors

[Normal probability plot (response is Stay): Percent versus Standardized Residual. Mean = -5.91071E-15, StDev = 1.216, N = 113, AD = 1.167, P-Value < 0.005]

Figure 2: Normal probability plot of first model using all variables as regressors

[Versus Fits plot (response is Stay): Standardized Residual versus Fitted Value]

Figure 3: Plot of the residuals from first model using all variables as regressors. Observation 42 and 100 are circled.

Selection  Number of Steps  Variables in Final Model
Stepwise   6                Infection, Beds, Region 1, Region 4, Med School 1
Forward    8                Age, Infection, X Ray, Beds, Region 1, Region 2, Region 4, Med School 1
Backward   2                Age, Infection, Beds, Region 1, Region 2, Region 3, Med School 1

Figure 4: SAS all possible regression methods

A summary of the analysis of these three possible models is listed below (Table 2).

Selection  R2ADJ  F-Statistic  P        PRESS
Stepwise   54.3%  26.63        < 0.001  115.241
Forward    58.0%  19.68        0.000    109.740
Backward   51.5%  17.38        0.000    124.096

Figure 5: Summary statistics for model selections

[Probability Plot of SRES1 (Percent versus Standardized Residual) and Versus Fits plot (Standardized Residual versus Fitted Value). Response is Stay, using predictors from stepwise selection. Mean = -0.0002378, StDev = 1.005, N = 109, AD = 0.429, P-Value = 0.304]

Figure 6: Normal probability plot and plot of residuals for stepwise selection

[Probability Plot of SRES2 (Percent versus SRES2) and Versus Fits plot (Standardized Residual versus Fitted Value). Response is Stay, using forward selection. Mean = 0.0008939, StDev = 1.006, N = 109, AD = 0.262, P-Value = 0.699]

Figure 7: Normal probability plot and plot of the residuals for forward selection

[Probability Plot of SRES3 (Percent versus SRES3) and Versus Fits plot (Standardized Residual versus Fitted Value). Response is Stay, using backward selection. Mean = 0.0005390, StDev = 1.003, N = 109, AD = 0.404, P-Value = 0.349]

Figure 8: Normal probability plot and plot of the residuals for backward selection.

The regression equation is
Stay = 6.99 + 0.549 Infection + 0.652 Region_1 - 1.58 Region_4 + 0.900 MS_1

Predictor     Coef  SE Coef      T      P    VIF
Constant    6.9946   0.3170  22.06  0.000
Infection  0.54882  0.07279   7.54  0.000  1.080
Region_1    0.6523   0.2243   2.91  0.004  1.086
Region_4   -1.5803   0.2704  -5.84  0.000  1.063
MS_1        0.9004   0.2679   3.36  0.001  1.044

S = 0.941907   R-Sq = 58.0%   R-Sq(adj) = 56.3%
PRESS = 99.4877   R-Sq(pred) = 53.80%

Analysis of Variance

Source           DF       SS      MS      F      P
Regression        4  124.859  31.215  35.18  0.000
Residual Error  102   90.493   0.887
Total           106  215.353

Source     DF  Seq SS
Infection   1  66.548
Region_1    1  17.235
Region_4    1  31.059
MS_1        1  10.017

Figure 9: Multiple regression final model deleting observation 98 and 101

[Normal Probability Plot (Percent versus Standardized Residual) and Versus Fits plot (Standardized Residual versus Fitted Value). Response is Stay. Mean = -0.0002849, StDev = 1.003, N = 107, AD = 0.263, P-Value = 0.695]

Figure 10: Normal Probability Plot and Plot of the Residuals of final model

The LOGISTIC Procedure

Model Information
Data Set                     WORK.HW
Response Variable            Damage
Number of Response Levels    2
Model                        binary logit
Optimization Technique       Fisher's scoring

Number of Observations Read  30
Number of Observations Used  30

Response Profile
Ordered Value  Damage  Total Frequency
1              1       22
2              0       8

Probability modeled is Damage=1.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion  Intercept Only  Intercept and Covariates
AIC        36.795          18.930
SC         38.196          24.535
-2 Log L   34.795          10.930

Testing Global Null Hypothesis: BETA=0
Test              Chi-Square  DF  Pr > ChiSq
Likelihood Ratio     23.8651   3      <.0001
Score                10.8631   3      0.0125
Wald                  5.5804   3      0.1339

Analysis of Maximum Likelihood Estimates
Parameter   DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept    1  -82.2972         35.8096           5.2817      0.0216
Load         1   12.0276          5.3673           5.0215      0.0250
Experience   1    0.8784          0.3837           5.2399      0.0221
Load_Exp     1   -0.1227          0.0549           5.0072      0.0252

Odds Ratio Estimates
Effect      Point Estimate  95% Wald Confidence Limits
Load        >999.999        4.517    >999.999
Experience  2.407           1.135    5.106
Load_Exp    0.884           0.794    0.985

Association of Predicted Probabilities and Observed Responses
Percent Concordant  97.2   Somers' D  0.943
Percent Discordant   2.8   Gamma      0.943
Percent Tied         0.0   Tau-a      0.382
Pairs                176   c          0.972

Wald Confidence Interval for Odds Ratios
Effect      Unit    Estimate  95% Confidence Limits
Load        1.0000  >999.999  4.517    >999.999
Experience  1.0000  2.407     1.135    5.106
Load_Exp    1.0000  0.884     0.794    0.985

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square  DF  Pr > ChiSq
1.5444       8      0.9919
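The odds ratio estimates and Wald confidence limits in the SAS output above follow from the maximum likelihood estimates by exponentiation. The sketch below, offered only as an illustration, recomputes them from the reported estimates and standard errors, assuming the usual 95% normal critical value of 1.96; the function name and dictionary are hypothetical and are not part of PROC LOGISTIC.

```python
import math

# (estimate, standard error) pairs copied from the
# "Analysis of Maximum Likelihood Estimates" table.
estimates = {
    "Load": (12.0276, 5.3673),
    "Experience": (0.8784, 0.3837),
    "Load_Exp": (-0.1227, 0.0549),
}

Z_95 = 1.96  # assumed critical value for a 95% Wald interval


def odds_ratio_with_ci(estimate, std_error, z=Z_95):
    """Return (odds ratio, lower limit, upper limit) for a one-unit change."""
    point = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return point, lower, upper


for name, (est, se) in estimates.items():
    point, lower, upper = odds_ratio_with_ci(est, se)
    print(f"{name}: OR = {point:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The recomputed values can differ from the SAS table in the last digit because the printed estimates are rounded, and SAS prints any value above 999.999 as ">999.999". Because the model includes the Load_Exp interaction, the odds ratios for Load and Experience describe a one-unit change only when the other variable equals zero.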