Lab Activity 6 (07/23/2013)
Please finish this lab activity during the class time and submit in the drop box on Angel.
Note: Please include the necessary plot(s) or Minitab output used to answer each part of the
following question.
Open the dataset “Senic”. We have:
Y = InfctRsk, the risk of infection at a hospital
X1 = Stay, the average length of stay at the hospital
X2 = Cultures, the average number of bacterial cultures per day at the hospital
X3 = Age, the average age of patients at the hospital
X4 = Beds, the number of beds in the hospital
X5 = Census, the average daily number of patients
Part 1
a. (10pts) Plot the scatter plot of the response versus each predictor and of each pair of predictors.
What do you observe on these plots?
(Minitab: Graph>Matrix Plot. Input all variables and click “Matrix Options” and check “Upper
Right”)
[Figure: Matrix Plot of InfctRsk, Stay, Cultures, Age, Beds, Census]
Most panels of the matrix plot show a random pattern; in other words, most pairs of variables may not
be linearly related. The exception is Census and Beds, which have a clear linear relationship, as we
can see in the graph.
b. (10pts) Fit a full model with all predictors in the model. What are the values of VIFs for predictors?
Can you explain by the scatter plot in previous part? What action do you want to do?
(Minitab: Regression>Regression, Choose Option and check “Variance Inflation factors”)
Regression Analysis: InfctRsk versus Stay, Cultures, Age, Beds, Census
The regression equation is
InfctRsk = 0.21 + 0.206 Stay + 0.0590 Cultures + 0.0174 Age + 0.00045 Beds
+ 0.00103 Census

Predictor  Coef      SE Coef   T     P      VIF
Constant   0.205     1.208     0.17  0.865
Stay       0.20553   0.06609   3.11  0.002  1.814
Cultures   0.05904   0.01031   5.73  0.000  1.266
Age        0.01736   0.02300   0.76  0.452  1.197
Beds       0.000448  0.002678  0.17  0.868  30.323
Census     0.001031  0.003494  0.29  0.769  32.817

S = 0.992559   R-Sq = 47.7%   R-Sq(adj) = 45.2%

The VIF for Stay is 1.814, for Cultures 1.266, for Age 1.197, for Beds 30.323, and for Census 32.817.
The VIFs of both Beds and Census are larger than 10, which indicates that there might be a linear
relationship between them. This agrees with what we saw in the matrix plot in the previous part.
We definitely want to delete Census.
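The VIF of a predictor equals 1/(1 - R_j^2), where R_j^2 comes from regressing that predictor on all the other predictors. As a rough check (a sketch only, not Minitab output; the dictionary simply copies the VIF values reported for the full model), the relation can be inverted to see how much of each predictor is explained by the others:

```python
# VIF values copied from the Minitab output for the full model.
vifs = {"Stay": 1.814, "Cultures": 1.266, "Age": 1.197,
        "Beds": 30.323, "Census": 32.817}

def implied_r_squared(vif):
    """Invert VIF = 1 / (1 - R_j^2) to recover R_j^2."""
    return 1.0 - 1.0 / vif

implied = {name: implied_r_squared(v) for name, v in vifs.items()}

# Flag predictors using the common "VIF larger than 10" rule of thumb.
flagged = [name for name, v in vifs.items() if v > 10]

for name, r2 in implied.items():
    print(f"{name:9s} VIF = {vifs[name]:7.3f}  implied R_j^2 = {r2:.3f}")
print("VIF > 10:", flagged)
```

Under this relation, roughly 97% of the variation in Beds and in Census is explained by the remaining predictors, which is consistent with the strong Beds-Census pattern in the matrix plot.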
c. (15pts) Suppose you want to delete variable “Census”, and refit the model with only four predictors.
Can you construct a general linear F-test using SSR(X5 | X1, X2, X3, X4) to test if H0: β5 = 0?
Source    DF  Seq SS
Stay      1   57.305
Cultures  1   33.397
Age       1   0.136
Beds      1   5.043
Census    1   0.086
SSR(X5 | X1, X2, X3, X4) = 0.086
F = [SSR(X5 | X1, X2, X3, X4)/1] / MSE(full) = 0.086 / 0.985 = 0.08731
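The arithmetic behind this F statistic can be reproduced in a few lines of Python. This is only a sketch: the two inputs are copied from the Minitab output, and MSE for the full model is taken as S squared, which is an assumption about how the value was obtained.

```python
# Values copied from the Minitab output for the full model.
seq_ss_census = 0.086   # SSR(X5 | X1, X2, X3, X4), last row of the Seq SS column
s_full = 0.992559       # S reported for the full model
df_num = 1              # one coefficient (beta5) is being tested

mse_full = s_full ** 2                        # MSE = S^2
f_stat = (seq_ss_census / df_num) / mse_full  # general linear F statistic

print(f"MSE(full) = {mse_full:.4f}")
print(f"F = {f_stat:.5f}")
```

The result agrees with 0.08731 up to rounding of the MSE.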
d. (10pts) Can you find p-value of this F-test (as we did in class)? What is your conclusion?
(Minitab: Graph>Probability Distribution Plot>View Probability…..as demonstrated in class)
[Figure: Distribution Plot, F with df1 = 1, df2 = 107; the area to the right of X = 0.08731 is 0.7682]
Since the p-value is 0.7682, which is greater than 0.05, we fail to reject the null hypothesis
H0: β5 = 0. Census can be deleted from the model.
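The p-value read from the Minitab distribution plot can be cross-checked without Minitab. The sketch below uses only the Python standard library and relies on the fact that an F(1, df2) variable is the square of a t(df2) variable, so P(F > f) = P(|t| > sqrt(f)). The helper names are made up for this illustration, and the integral is approximated with Simpson's rule.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def simpson(f, a, b, n=200):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def f_upper_tail_df1_one(f_value, df2):
    """P(F > f_value) for F(1, df2), using F = t^2 with t ~ t(df2)."""
    t = math.sqrt(f_value)
    # P(|t| <= t) is twice the area under the t density on [0, t].
    central = 2 * simpson(lambda x: t_pdf(x, df2), 0.0, t)
    return 1.0 - central

p_value = f_upper_tail_df1_one(0.08731, 107)
print(f"p-value = {p_value:.4f}")
```

The value should agree with the 0.7682 shown on the distribution plot to about three decimal places.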
e. (10pts) Go back to the output of the full model. What is the t-value for X5? How does it relate
to the general linear F-test you did in part (c)? (Hint: Think about the situation of Simple Linear
Regression)
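The t-value for X5 is 0.29, and 0.29^2 = 0.0841.
This is close to the F value 0.08731 calculated in part (c). When a single coefficient is tested, the
general linear F statistic equals the square of the t statistic; the small difference here comes from
rounding in the output.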
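Using the less rounded Coef and SE Coef values for Census instead of the two-decimal t-value makes the agreement between t^2 and F easier to see. A minimal sketch, with all numbers copied from the Minitab output for the full model and MSE taken as S squared (an assumption):

```python
# Census row of the full-model coefficient table.
coef_census = 0.001031
se_census = 0.003494

t_census = coef_census / se_census   # t statistic for H0: beta5 = 0
t_squared = t_census ** 2

# Partial F statistic from part (c): SSR(X5 | X1..X4) / MSE(full).
f_partial = 0.086 / 0.992559 ** 2

print(f"t = {t_census:.4f}, t^2 = {t_squared:.4f}, F = {f_partial:.4f}")
```

The two quantities differ only in the fourth decimal place, which is what rounding of the printed values would produce.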
f. (10pts) Go back to the model output without predictor “Census”. What are the VIFs for parameters
now?

Predictor  Coef       SE Coef    T     P      VIF
Constant   0.179      1.199      0.15  0.881
Stay       0.21401    0.05925    3.61  0.000  1.471
Cultures   0.05861    0.01016    5.77  0.000  1.241
Age        0.01648    0.02270    0.73  0.470  1.176
Beds       0.0012213  0.0005375  2.27  0.025  1.232

g. (Open, 5pts) Is there any other variable you want to delete now? And do you think we get a perfect
model after performing the action (as discussed in class)? What is your final model?
Yes, I want to delete Age. As we can see, the p-value of Age (0.470) is still greater than 0.05.
Then we get:
The regression equation is
InfctRsk = 0.975 + 0.228 Stay + 0.0563 Cultures + 0.00116 Beds

Predictor  Coef       SE Coef    T     P      VIF
Constant   0.9749     0.4858     2.01  0.047
Stay       0.22784    0.05598    4.07  0.000  1.319
Cultures   0.056304   0.009634   5.84  0.000  1.120
Beds       0.0011598  0.0005296  2.19  0.031  1.201
Part 2
h. (5pts) Go back to the model with all five predictors. Suppose we want to test if all insignificant
predictors have coefficients equal to zero simultaneously. What are the null and alternative
hypotheses? (Note: Intercept does not count.)
H0: β4 = β5 = 0
Ha: at least one of β4, β5 is not 0
i. (15pts) How do you construct the test statistics? What is your conclusion?
F = {[SSR(X1,X2,X3,X4,X5) - SSR(X1,X2,X3)]/2} / MSE(X1,X2,X3,X4,X5)
= [(95.966 - 90.838)/2] / 0.985
= 2.603
Analysis of Variance
Source          DF   SS       MS      F      P
Regression      3    90.838   30.279  29.86  0.000
Residual Error  109  110.542  1.014
Total           112  201.380
[Figure: Distribution Plot, F with df1 = 2, df2 = 107; the area to the right of X = 2.603 is 0.07874]
Since the p-value of the F-test (0.07874) is greater than 0.05, we fail to reject the null hypothesis.
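The test statistic and its p-value for this two-coefficient test can be checked with a short Python sketch. The inputs are copied from the output above. The p-value line uses a closed form that holds only when the numerator degrees of freedom equal 2, namely P(F > x) = (1 + 2x/df2)^(-df2/2); it is not a general F-distribution routine.

```python
# Values copied from the Minitab output.
ssr_full = 95.966       # SSR(X1, X2, X3, X4, X5)
ssr_reduced = 90.838    # SSR(X1, X2, X3), from the reduced-model ANOVA table
mse_full = 0.985        # MSE of the full model
q = 2                   # number of coefficients tested (beta4 and beta5)
df_error_full = 107     # error degrees of freedom of the full model

# General linear F statistic for dropping X4 and X5 together.
f_stat = ((ssr_full - ssr_reduced) / q) / mse_full

# Upper-tail probability of F(2, df2); this closed form is valid only for q == 2.
p_value = (1 + q * f_stat / df_error_full) ** (-df_error_full / 2)

print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
```

The computed values should match the F = 2.603 and the tail area of about 0.0787 shown on the distribution plot.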
j. (5pts) What is your model now after adopting the decision in i?
Since X4 and X5 are jointly insignificant, only X1, X2 and X3 remain in the model now.
k. (Open, 5pts) Compare the models you get in j and g. Are they the same? If not, which one do you
think is better? (Hint: you may think from many perspectives, e.g. R2, number of variables in the
model, scatter plot, multicollinearity. Remember that regression is a quite subjective topic!)
They are not the same. I think the model in g is better.
In j, the p-value of Age is still larger than 0.05, so it is not a perfect model. In g, we deleted
Age instead of Beds, and every remaining predictor has a p-value below 0.05 and a VIF close to 1.
That means we should delete Census and Age to get a good model, even though the VIFs of both Beds
and Census were larger than 10 in the full model.