Stat 8210

Final Project
Katherine Morgan
Part One
Minitab and SAS were used to find the best model for predicting the length of stay of patients in the hospital.
First, indicator variables were created for medical school affiliation (1=Yes, 2=No) and region (1=NE, 2=NC,
3=S, 4=W). Next, a model was fit that included all candidate variables: age, infection risk, routine culturing
ratio, routine chest X-ray ratio, number of beds, medical school affiliation, region, average daily census,
number of nurses, and available facilities and services. The multiple regression output from Minitab
suggests that length of stay is related to at least one regressor (F = 14.18, p < 0.001; Figure 1).
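As a quick check on the overall F-test, the F statistic and the R-squared values can be recomputed from the sums of squares and degrees of freedom reported in Figure 1. The sketch below uses Python rather than Minitab, and the function name anova_summary is illustrative only.

```python
def anova_summary(ss_regression, ss_error, df_regression, df_error):
    """Return (F, R-squared, adjusted R-squared) from ANOVA sums of squares."""
    ss_total = ss_regression + ss_error
    df_total = df_regression + df_error
    ms_regression = ss_regression / df_regression  # mean square for regression
    ms_error = ss_error / df_error                 # mean square error
    f_stat = ms_regression / ms_error
    r_sq = ss_regression / ss_total
    r_sq_adj = 1.0 - ms_error / (ss_total / df_total)
    return f_stat, r_sq, r_sq_adj


# Sums of squares and degrees of freedom as reported in Figure 1.
f_stat, r_sq, r_sq_adj = anova_summary(257.759, 151.451, 12, 100)
print(f"F = {f_stat:.2f}, R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
```

With the Figure 1 values this reproduces the reported F of 14.18, R-Sq of 63.0% and R-Sq(adj) of 58.5%.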
The regressors highlighted in yellow in Figure 1 (those with p > 0.05) do not contribute significantly to the
model given that the other regressors remain in the model. The only regressors that contribute significantly are
Age (p=0.006), Infection (p=0.001), Census (p=0.001), Nurses (p=0.009), Region 1 (p<0.001) and Region 2 (p=0.009).
Multicollinearity was investigated by examining the variance inflation factor (VIF) of each regressor. VIFs
larger than 10 imply serious problems with multicollinearity. This model has two regressors with VIFs larger
than 10: Beds (35.7) and Census (34.2). This suggests that multicollinearity is a potential problem with this model.
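To make the VIF rule concrete, the sketch below computes VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing regressor j on the remaining regressors. It is written in Python with NumPy rather than Minitab, and the data are synthetic stand-ins (not the hospital data), built so that "census" tracks "beds" closely. The function and variable names are assumptions for illustration.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus every regressor except column j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r_sq = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r_sq))
    return np.array(vifs)


# Synthetic illustration only: census is nearly a linear function of beds,
# while age is generated independently of both.
rng = np.random.default_rng(0)
beds = rng.normal(250, 100, size=113)
census = 0.8 * beds + rng.normal(0, 10, size=113)
age = rng.normal(53, 4, size=113)
vifs = variance_inflation_factors(np.column_stack([beds, census, age]))
print(dict(zip(["beds", "census", "age"], np.round(vifs, 2))))
```

In this synthetic setup the two nearly collinear columns get VIFs well above 10 while the independent column stays close to 1, which is the same pattern Beds and Census show in Figure 1.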
The normal probability plot is not linear (Figure 2); it curves toward the top. Therefore, we
cannot be reasonably assured that the residuals have a normal distribution (Anderson-Darling p<0.005). According
to the plot of the residuals, there appear to be two odd points: observations 43 and 100 (Figure 3). These
observations were deleted and the model was rerun. The adjusted R² for the new model increased, to 58.1%, and the
PRESS statistic decreased, to 113.741, suggesting that deleting observations 43 and 100 improved the model.
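For readers unfamiliar with PRESS, it is the sum of squared leave-one-out prediction errors, and for ordinary least squares it can be obtained from a single fit as the sum of (e_i / (1 - h_ii))^2, where h_ii are the leverages. The sketch below is a Python/NumPy illustration on synthetic data (not the hospital data). The function name is an assumption, and the last lines only recompute R-Sq(pred) from the PRESS and total sum of squares printed in Figure 1.

```python
import numpy as np


def press_statistic(X, y):
    """PRESS = sum of (e_i / (1 - h_ii))^2 for an OLS fit with an intercept."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages h_ii: diagonal of the hat matrix X (X'X)^-1 X' = X pinv(X).
    hat_diag = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    return float(np.sum((resid / (1.0 - hat_diag)) ** 2))


# Synthetic illustration only.
rng = np.random.default_rng(1)
x = rng.normal(size=(30, 2))
y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=30)
press = press_statistic(x, y)
print(f"PRESS (synthetic data) = {press:.3f}")

# R-Sq(pred) = 1 - PRESS / SS_total, using the values reported in Figure 1.
r_sq_pred_fig1 = 1.0 - 198.143 / 409.210
print(f"R-Sq(pred) from Figure 1 values = {r_sq_pred_fig1:.2%}")
```

A smaller PRESS means the model predicts held-out observations better, which is why it is used above to compare the models before and after deleting observations.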
The previous model was abandoned and SAS was used to generate all possible regressions. First the
stepwise regression method was used, followed by the backward method, and last the forward method. The
regressors chosen by each method are shown in Figure 4. A multiple regression model was fit using the regressors
from each selection. There were no significant differences in the adjusted R² values among the three models, and
the forward selection had the smallest PRESS statistic (Figure 5). To check the normality assumption, normal
probability plots and plots of the residuals were examined (Figures 6, 7 and 8). Each of the normal probability
plots is nearly linear. Therefore, we can be reasonably assured that the residuals have a normal distribution
and no transformation is necessary. None of the residual plots shows an obvious model defect, because in each
plot the residuals can be contained in a horizontal band. The regressors from the stepwise
selection were chosen because they give the simplest model.
From the plot of the residuals there appear to be two odd points: observations 98 and 101. These were
removed and a multiple regression model was fit using the variables Infection, Beds, Region 1, Region 4 and
Med School 1. According to the plot of the residuals there are no obvious model defects, because the
residuals can be contained in a horizontal band (Figure 10). The normal probability plot is nearly linear;
therefore no transformation is necessary (Figure 10).
Every variable except number of beds contributes significantly to the model given that the other
variables are in the model (Beds: p = 0.120). Beds was removed and the model was rerun; there was no
significant change in adjusted R² (56%), so it was left out. The analysis of variance gave an F-value of 29.04,
suggesting that length of stay is related to at least one of the regressors (p < 0.001; Figure 9).
Using the parameter estimates, the prediction equation for length of stay in hospital is:
Stay = 6.99 + 0.549 Infection + 0.652 Region_1 - 1.58 Region_4 + 0.900 MS_1
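The prediction equation can be applied directly once the indicator variables are coded as 0 or 1. The sketch below is a Python illustration; the function name predict_stay and the example hospital are hypothetical, and only the coefficients come from the equation above.

```python
def predict_stay(infection, region_1, region_4, ms_1):
    """Predicted length of stay from the fitted equation.

    region_1, region_4 and ms_1 are 0/1 indicators (Region 1 = NE,
    Region 4 = W, MS_1 = medical school affiliation "Yes").
    """
    return (6.99 + 0.549 * infection + 0.652 * region_1
            - 1.58 * region_4 + 0.900 * ms_1)


# Hypothetical hospital (illustrative values, not a row from the data set):
# infection risk 4.0, located in Region 1, no medical school affiliation.
example = predict_stay(infection=4.0, region_1=1, region_4=0, ms_1=0)
print(round(example, 3))  # 9.838
```

Setting region_4=1 instead of region_1=1 for the same hypothetical hospital lowers the prediction by 0.652 + 1.58 = 2.232, which shows how the regional indicators shift the intercept.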
Regression Analysis: Stay versus Age, Infection, ...
* Region_4 is highly correlated with other X variables
* Region_4 has been removed from the equation.
* Med School_2 is highly correlated with other X variables
* Med School_2 has been removed from the equation.
The regression equation is
Stay = 1.18 + 0.0799 Age + 0.440 Infection + 0.0055 Culture + 0.0127 Xray
- 0.00485 Beds + 0.0152 Census - 0.00589 Nurses - 0.0122 Services
+ 1.88 Region_1 + 1.07 Region_2 + 0.722 Region_3 + 0.267 Med School_1
Predictor          Coef    SE Coef      T      P     VIF
Constant          1.175      1.638   0.72  0.475
Age             0.07992    0.02827   2.83  0.006   1.176
Infection        0.4397     0.1273   3.45  0.001   2.155
Culture         0.00555    0.01598   0.35  0.729   1.979
Xray           0.012688   0.007147   1.78  0.079   1.416
Beds          -0.004851   0.003603  -1.35  0.181  35.699
Census         0.015182   0.004424   3.43  0.001  34.211
Nurses        -0.005891   0.002218  -2.66  0.009   7.056
Services       -0.01218    0.01377  -0.88  0.379   3.242
Region_1         1.8806     0.4441   4.23  0.000   2.743
Region_2         1.0676     0.3987   2.68  0.009   2.408
Region_3         0.7223     0.3967   1.82  0.072   2.585
Med School_1     0.2666     0.4411   0.60  0.547   1.855

S = 1.23065   R-Sq = 63.0%   R-Sq(adj) = 58.5%
PRESS = 198.143   R-Sq(pred) = 51.58%
Analysis of Variance
Source           DF       SS      MS      F      P
Regression       12  257.759  21.480  14.18  0.000
Residual Error  100  151.451   1.515
Total           112  409.210

Source         DF   Seq SS
Age             1   14.604
Infection       1  116.356
Culture         1    3.248
Xray            1    8.606
Beds            1   31.087
Census          1   39.696
Nurses          1   14.221
Services        1    0.011
Region_1        1   18.834
Region_2        1    5.958
Region_3        1    4.586
Med School_1    1    0.553
Figure 1: Multiple regression model using all variables as possible regressors
[Plot: Normal Probability Plot (response is Stay); Percent vs. Standardized Residual. Mean = -5.91071E-15, StDev = 1.216, N = 113, AD = 1.167, P-Value < 0.005]
Figure 2: Normal probability plot of first model using all variables as regressors
[Plot: Versus Fits (response is Stay); Standardized Residual vs. Fitted Value]
Figure 3: Plot of the residuals from first model using all variables as regressors.
Observations 42 and 100 are circled.
Selection  Number of Steps  Variables in Final Model
Stepwise   6                Infection, Beds, Region 1, Region 4, Med School 1
Forward    8                Age, Infection, X Ray, Beds, Region 1, Region 2, Region 4, Med School 1
Backward   2                Age, Infection, Beds, Region 1, Region 2, Region 3, Med School 1
Figure 4: SAS all possible regression methods
A summary of the analysis of these three possible models is listed below (Figure 5).
Selection  R-Sq(adj)  F-Statistic  P        PRESS
Stepwise   54.3%      26.63        < 0.001  115.241
Forward    58.0%      19.68        0.000    109.740
Backward   51.5%      17.38        0.000    124.096
Figure 5: Summary statistics for model selections
[Plot: Probability Plot of SRES1; response is Stay (using predictors from stepwise selection); Percent vs. Standardized Residual. Mean = -0.0002378, StDev = 1.005, N = 109, AD = 0.429, P-Value = 0.304]
[Plot: Versus Fits; response is Stay (using predictors from stepwise selection); Standardized Residual vs. Fitted Value]
Figure 6: Normal probability plot and plot of residuals for stepwise selection
[Plot: Probability Plot of SRES2; response is Stay (using forward selection); Percent vs. SRES2. Mean = 0.0008939, StDev = 1.006, N = 109, AD = 0.262, P-Value = 0.699]
[Plot: Versus Fits; response is Stay (using forward selection); Standardized Residual vs. Fitted Value]
Figure 7: Normal probability plot and plot of the residuals for forward selection
[Plot: Probability Plot of SRES3; response is Stay (using backward selection); Percent vs. SRES3. Mean = 0.0005390, StDev = 1.003, N = 109, AD = 0.404, P-Value = 0.349]
[Plot: Versus Fits; response is Stay (using backward selection); Standardized Residual vs. Fitted Value]
Figure 8: Normal probability plot and plot of the residuals for backward selection.
The regression equation is
Stay = 6.99 + 0.549 Infection + 0.652 Region_1 - 1.58 Region_4 + 0.900 MS_1
Predictor      Coef  SE Coef      T      P    VIF
Constant     6.9946   0.3170  22.06  0.000
Infection   0.54882  0.07279   7.54  0.000  1.080
Region_1     0.6523   0.2243   2.91  0.004  1.086
Region_4    -1.5803   0.2704  -5.84  0.000  1.063
MS_1         0.9004   0.2679   3.36  0.001  1.044

S = 0.941907   R-Sq = 58.0%   R-Sq(adj) = 56.3%
PRESS = 99.4877   R-Sq(pred) = 53.80%
Analysis of Variance
Source           DF       SS      MS      F      P
Regression        4  124.859  31.215  35.18  0.000
Residual Error  102   90.493   0.887
Total           106  215.353

Source     DF  Seq SS
Infection   1  66.548
Region_1    1  17.235
Region_4    1  31.059
MS_1        1  10.017
Figure 9: Multiple regression final model after deleting observations 98 and 101
[Plot: Normal Probability Plot (response is Stay); Percent vs. Standardized Residual. Mean = -0.0002849, StDev = 1.003, N = 107, AD = 0.263, P-Value = 0.695]
[Plot: Versus Fits (response is Stay); Standardized Residual vs. Fitted Value]
Figure 10: Normal probability plot and plot of the residuals of final model
The LOGISTIC Procedure
Model Information
Data Set                     WORK.HW
Response Variable            Damage    Damage
Number of Response Levels    2
Model                        binary logit
Optimization Technique       Fisher's scoring

Number of Observations Read  30
Number of Observations Used  30
Response Profile
Ordered Value  Damage  Total Frequency
1              1       22
2              0       8
Probability modeled is Damage=1.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion  Intercept Only  Intercept and Covariates
AIC        36.795          18.930
SC         38.196          24.535
-2 Log L   34.795          10.930
Testing Global Null Hypothesis: BETA=0
Test              Chi-Square  DF  Pr > ChiSq
Likelihood Ratio  23.8651     3   <.0001
Score             10.8631     3   0.0125
Wald              5.5804      3   0.1339
Analysis of Maximum Likelihood Estimates
Parameter   DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept   1   -82.2972  35.8096         5.2817           0.0216
Load        1   12.0276   5.3673          5.0215           0.0250
Experience  1   0.8784    0.3837          5.2399           0.0221
Load_Exp    1   -0.1227   0.0549          5.0072           0.0252
Odds Ratio Estimates
Effect      Point Estimate  95% Wald Confidence Limits
Load        >999.999        4.517   >999.999
Experience  2.407           1.135   5.106
Load_Exp    0.884           0.794   0.985
Association of Predicted Probabilities and Observed Responses
Percent Concordant  97.2   Somers' D  0.943
Percent Discordant  2.8    Gamma      0.943
Percent Tied        0.0    Tau-a      0.382
Pairs               176    c          0.972
Wald Confidence Interval for Odds Ratios
Effect      Unit    Estimate  95% Confidence Limits
Load        1.0000  >999.999  4.517   >999.999
Experience  1.0000  2.407     1.135   5.106
Load_Exp    1.0000  0.884     0.794   0.985
Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square  DF  Pr > ChiSq
1.5444      8   0.9919