Probs 1), 2), and 3)
Correlation
Variable       x1       x2       x3       x4       x5       x6        y
x1         1.0000  -0.1900  -0.0627  -0.3497   0.3863  -0.4302  -0.4336
x2        -0.1900   1.0000   0.9553   0.2379  -0.0324   0.1318   0.6448
x3        -0.0627   0.9553   1.0000   0.2126  -0.0261   0.0421   0.4938
x4        -0.3497   0.2379   0.2126   1.0000  -0.0130   0.1641   0.0947
x5         0.3863  -0.0324  -0.0261  -0.0130   1.0000   0.4961   0.0543
x6        -0.4302   0.1318   0.0421   0.1641   0.4961   1.0000   0.3696
y         -0.4336   0.6448   0.4938   0.0947   0.0543   0.3696   1.0000
• Among the explanatory variables, x2 and x3 are highly correlated, with a positive correlation of 0.96. This shows the strength of the linear relationship displayed in the plot of x2 vs. x3 below. These two variables may cause multicollinearity problems when fitting the full model.
Model fitted with all explanatory variables
The REG Procedure
Model: MODEL1
Dependent Variable: y
Number of Observations Read  41
Number of Observations Used  41
Analysis of Variance
Source            DF  Sum of Squares  Mean Square  F Value  Pr > F
Model              6           14755   2459.10601    11.48  <.0001
Error             34      7283.26641    214.21372
Corrected Total   40           22038
Root MSE        14.63604   R-Square  0.6695
Dependent Mean  30.04878   Adj R-Sq  0.6112
Coeff Var       48.70761
Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation  95% Confidence Limits
Intercept   1           111.72848        47.31810     2.36    0.0241                   0    15.56653  207.89043
x1          1            -1.26794         0.62118    -2.04    0.0491             3.76400    -2.53033   -0.00555
x2          1             0.06492         0.01575     4.12    0.0002            14.70365     0.03291    0.09692
x3          1            -0.03928         0.01513    -2.60    0.0138            14.34083    -0.07003   -0.00852
x4          1            -3.18137         1.81502    -1.75    0.0887             1.25552    -6.86993    0.50720
x5          1             0.51236         0.36276     1.41    0.1669             3.40492    -0.22485    1.24957
x6          1            -0.05205         0.16201    -0.32    0.7500             3.44365    -0.38130    0.27720
• One would expect the estimates of the coefficients for the two variables x3 (Population Size) and x6 (Number of Days with Precipitation) to be positive, because the heating needs of a community (and therefore the coal burning required for heating) are expected to increase with each unit increase in these two variables. In the fitted model, however, both estimates are negative.
• The standard errors of the estimates of the coefficients for the variables x4, x5 and x6 are very high (compared to the magnitude of the estimates). Thus these coefficients are very poorly estimated, as shown by the confidence intervals, which are very wide and include zero. Equivalently, the t-tests are not significant.
• The VIFs for all variables except x4 are large, indicating that multicollinearity exists among the explanatory variables and is causing the above problems. x2 and x3 have the largest VIFs, indicating that they are the two variables causing most of the multicollinearity.
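As a rough check on the VIF discussion, the sketch below computes the smallest VIF that x2 and x3 could have given only their pairwise correlation of 0.9553. It assumes the usual definition VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing x_j on the other explanatory variables. The helper name vif_from_r2 is illustrative and is not part of the SAS program.

```python
def vif_from_r2(r_squared):
    """Variance inflation factor for a predictor whose regression on the
    other predictors has the given R-squared."""
    return 1.0 / (1.0 - r_squared)

# Pairwise correlation between x2 and x3, from the correlation matrix.
r_x2_x3 = 0.9553

# Regressing x2 on x3 alone gives R^2 = r^2. Adding x1, x4, x5 and x6 can
# only raise that R^2, so this is a lower bound on the VIFs of x2 and x3.
vif_lower_bound = vif_from_r2(r_x2_x3 ** 2)

# VIFs reported by PROC REG for the full model.
reported_vif = {"x2": 14.70365, "x3": 14.34083}

# R^2 of each variable on the other five, implied by the reported VIF.
implied_r2 = {name: 1.0 - 1.0 / vif for name, vif in reported_vif.items()}

print(f"VIF lower bound from r alone: {vif_lower_bound:.2f}")
for name, vif in reported_vif.items():
    print(f"{name}: reported VIF {vif:.2f}, implied R^2 {implied_r2[name]:.3f}")
```

Under these assumptions the pairwise correlation alone accounts for a VIF of about 11.4, which is most of the reported values for x2 and x3.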
Model fitted with all explanatory variables
The REG Procedure
Model: MODEL2
Dependent Variable: y
Analysis of Variance
Source            DF  Sum of Squares  Mean Square  F Value  Pr > F
Model              4           13285   3321.25137    13.66  <.0001
Error             36      8752.89698    243.13603
Corrected Total   40           22038
Root MSE        15.59282   R-Square  0.6028
Dependent Mean  30.04878   Adj R-Sq  0.5587
Coeff Var       51.89169
Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation  95% Confidence Limits
Intercept   1           123.11833        31.29070     3.93    0.0004                   0    59.65785  186.57882
x1          1            -1.61144         0.40137    -4.01    0.0003             1.38455    -2.42546   -0.79741
x2          1             0.02548         0.00454     5.62    <.0001             1.07524     0.01627    0.03468
x4          1            -3.63024         1.89234    -1.92    0.0630             1.20243    -7.46809    0.20760
x5          1             0.52423         0.22941     2.29    0.0283             1.19975     0.05898    0.98949
• It is obvious that the fit of this model is better than that of the full model. While the coefficient of determination (R-squared) has decreased from 67% to 60%, the VIFs have all decreased and most of the standard errors are reasonably small. The variable x4 is still not significant at the 5% level in this model (p = 0.0630).
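The R-squared comparison above can be reproduced from the two Analysis of Variance tables. The following is a minimal sketch that assumes the printed (rounded) corrected total sum of squares of 22038; the dictionary keys and function names are illustrative only.

```python
n = 41          # observations used in both fits
sst = 22038.0   # corrected total sum of squares, as printed (rounded)

# Error sum of squares and number of explanatory variables for each model.
models = {
    "full": {"sse": 7283.26641, "k": 6},      # y = x1-x6
    "reduced": {"sse": 8752.89698, "k": 4},   # y = x1 x2 x4 x5
}

def r_square(sse, sst):
    """Proportion of the corrected total variation explained by the model."""
    return 1.0 - sse / sst

def adj_r_square(sse, sst, n, k):
    """R-squared adjusted for the k explanatory variables in the model."""
    return 1.0 - (sse / (n - k - 1)) / (sst / (n - 1))

summary = {
    name: (r_square(m["sse"], sst), adj_r_square(m["sse"], sst, n, m["k"]))
    for name, m in models.items()
}

for name, (r2, adj) in summary.items():
    print(f"{name:8s} R-Square = {r2:.4f}  Adj R-Sq = {adj:.4f}")
```

Both R-Square and Adj R-Sq fall when x3 and x6 are dropped, so the preference for the reduced model rests on the smaller VIFs and standard errors rather than on these two summaries.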
SAS Program:
filename citydat 'C:\Documents and Settings\...\air_pollution.txt';
data auto;
  infile citydat;
  input city $2. y x1-x6;
run;
ods listing close;
ods rtf file="C:\Documents and Settings\...\as7f10n123_out.rtf" style=statistical;
proc sgscatter data=auto;
  title "Scatterplot Matrix for Cement Data";
  matrix x1-x6;
run;
proc reg corr;
  model y = x1-x6 / clb vif;
  model y = x1 x2 x4 x5 / clb vif;
  title 'Model fitted with all explanatory variables';
run;
ods rtf close;
ods listing;
Prob 4)
The descriptions below use the SAS output (attached) from the following SAS program:
filename citydat 'C:\Documents and Settings\...\air_pollution.txt';
data auto;
  infile citydat;
  input city $2. y x1-x6;
run;
ods rtf file="C:\Documents and Settings\...\as7f10n4.rtf";
ods graphics on;
proc reg;
  model y = x1 x2 x4 x5 / r influence;
  title 'Model fitted with selected explanatory variables';
run;
ods graphics off;
ods rtf close;
The output diagnostics RStudent, Hat Diag and Cook's D, the Fit Diagnostics panel, and the plots of Residuals vs. each explanatory variable are used in the discussion below:
• The 5% Bonferroni critical value for RStudent with n=41 and k=4 is 3.52. Only one case among the RStudent values exceeds this: City#31 is a y-outlier. Looking at Cook's D for #31, we see that it is large (0.234, well over the cut-off of 0.1). This is also clearly identified in the Cook's D vs. Observation plot in the Fit Diagnostics panel (in addition to City#1). Thus, City#31 is a y-outlier that is also influential. We might consider setting it aside.
• The cut-off for Hat Diag is 2 x 5/41 = 0.244 (roughly 0.25). City#1 and City#11 exceed this substantially. These two points are clearly identified in the RStudent vs. Leverage plot in the Fit Diagnostics panel: note the two points to the right of the vertical line drawn at the cut-off.
• City#1 is influential (high Cook's D) but is clearly not a y-outlier. Its Cook's D statistic is large because it is an x-outlier (i.e., because its leverage h_ii is large).
• City#11 is an x-outlier but is neither a y-outlier nor an influential case.
• Looking at the plots of Residuals vs. each explanatory variable and the normal probability plot of the residuals (see the Fit Diagnostics panel), we see that these three cases affect the shapes and patterns seen in the plots. Since the y-outlier seems to affect the fit of the model (and the x-outliers do not), it is concluded that this city may not fit this model very well.
• You were not required to do this part. However, I removed City #31, refitted the same model and repeated the same analysis. The diagnostics and all the plots showed vast improvement over those above. This clearly shows how just one real outlier can change a model fit drastically. In particular, the normal probability plot supported the assumption that the errors are normally distributed.
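The classification used in the bullets above can be sketched as a few threshold checks. This is only an illustration: the cut-offs 3.52 and 0.1 are taken from the discussion, the leverage cut-off is computed as 2(k+1)/n, and the statistics for Cities 1, 9, 11 and 31 are copied from the Output Statistics table. City 9 is included as a near miss on leverage. The function and label names are hypothetical.

```python
n, k = 41, 4          # observations and explanatory variables
p = k + 1             # parameters, including the intercept

hat_cutoff = 2 * p / n      # leverage cut-off 2p/n
cooks_cutoff = 0.1          # Cook's D cut-off used in the discussion
rstudent_cutoff = 3.52      # 5% Bonferroni critical value quoted above

# RStudent, Hat Diag H and Cook's D copied from the Output Statistics table.
cases = {
    1:  {"rstudent": 1.1607, "hat": 0.4930, "cooks_d": 0.260},
    9:  {"rstudent": 0.3344, "hat": 0.2426, "cooks_d": 0.007},
    11: {"rstudent": 0.3312, "hat": 0.6862, "cooks_d": 0.049},
    31: {"rstudent": 5.0865, "hat": 0.0711, "cooks_d": 0.234},
}

def classify(stats):
    """Return the labels whose cut-off the case exceeds."""
    labels = []
    if abs(stats["rstudent"]) > rstudent_cutoff:
        labels.append("y-outlier")
    if stats["hat"] > hat_cutoff:
        labels.append("x-outlier")
    if stats["cooks_d"] > cooks_cutoff:
        labels.append("influential")
    return labels

flags = {city: classify(stats) for city, stats in cases.items()}

print(f"Hat Diag cut-off: {hat_cutoff:.4f}")
for city, labels in flags.items():
    print(f"City {city}: {', '.join(labels) if labels else 'no flags'}")
```

With these inputs the flags agree with the bullets: City 1 is an influential x-outlier, City 11 is an x-outlier only, and City 31 is an influential y-outlier.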
Model fitted with selected explanatory variables
The REG Procedure
Model: MODEL1
Dependent Variable: y
Output Statistics
Obs  Dependent  Predicted     Std Error   Residual  Std Error   Student  Cook's  RStudent  Hat Diag     Cov   DFFITS
      Variable      Value  Mean Predict              Residual  Residual       D                   H   Ratio
  1    10.0000    -2.8249       10.9486    12.8249     11.102     1.155   0.260    1.1607    0.4930  1.8801   1.1446
  2    13.0000    22.8069        4.1977    -9.8069     15.017    -0.653   0.007   -0.6478    0.0725  1.1694  -0.1811
  3    12.0000    22.5379        4.5865   -10.5379     14.903    -0.707   0.009   -0.7021    0.0865  1.1752  -0.2161
  4    17.0000    25.1674        5.6259    -8.1674     14.543    -0.562   0.009   -0.5562    0.1302  1.2666  -0.2152
  5    56.0000    44.5566        4.6269    11.4434     14.891     0.768   0.011    0.7640    0.0880  1.1622   0.2374
  6    36.0000    26.5671        3.4135     9.4329     15.215     0.620   0.004    0.6146    0.0479  1.1460   0.1379
  7    29.0000    28.4657        2.5071     0.5343     15.390    0.0347   0.000    0.0342    0.0259  1.1816   0.0056
  8    14.0000    12.9697        5.4771     1.0303     14.599    0.0706   0.000    0.0696    0.1234  1.3124   0.0261
  9    10.0000     5.4054        7.6797     4.5946     13.571     0.339   0.007    0.3344    0.2426  1.4959   0.1892
 10    24.0000    25.6963        3.6143    -1.6963     15.168    -0.112   0.000   -0.1103    0.0537  1.2145  -0.0263
 11   110.0000   107.0704       12.9164     2.9296      8.735     0.335   0.049    0.3312    0.6862  3.6115   0.4897
 12    28.0000    33.1324        2.9569    -5.1324     15.310    -0.335   0.001   -0.3311    0.0360  1.1757  -0.0639
 13    17.0000    22.3213        4.8742    -5.3213     14.811    -0.359   0.003   -0.3549    0.0977  1.2532  -0.1168
 14     8.0000     5.0225        7.4697     2.9775     13.687     0.218   0.003    0.2146    0.2295  1.4843   0.1171
 15    30.0000    33.4046        3.7168    -3.4046     15.143    -0.225   0.001   -0.2218    0.0568  1.2121  -0.0544
 16     9.0000    17.5210        5.6931    -8.5210     14.516    -0.587   0.011   -0.5816    0.1333  1.2660  -0.2281
 17    47.0000    37.2174        2.7730     9.7826     15.344     0.638   0.003    0.6322    0.0316  1.1232   0.1143
 18    35.0000    49.3786        3.9414   -14.3786     15.086    -0.953   0.012   -0.9518    0.0639  1.0823  -0.2487
 19    29.0000    45.9464        4.9404   -16.9464     14.789    -1.146   0.029   -1.1510    0.1004  1.0628  -0.3845
 20    14.0000    28.0955        2.6949   -14.0955     15.358    -0.918   0.005   -0.9157    0.0299  1.0542  -0.1607
 21    56.0000    37.1101        2.8250    18.8899     15.335     1.232   0.010    1.2410    0.0328  0.9597   0.2286
 22    14.0000    20.9922        4.2266    -6.9922     15.009    -0.466   0.003   -0.4607    0.0735  1.2055  -0.1297
 23    11.0000     4.5248        7.3297     6.4752     13.763     0.470   0.013    0.4653    0.2210  1.4329   0.2478
 24    46.0000    33.0772        4.8286    12.9228     14.826     0.872   0.016    0.8686    0.0959  1.1446   0.2829
 25    11.0000    31.0957        6.0650   -20.0957     14.365    -1.399   0.070   -1.4185    0.1513  1.0257  -0.5989
 26    23.0000    42.5619        5.4272   -19.5619     14.618    -1.338   0.049   -1.3536    0.1211  1.0150  -0.5026
 27    65.0000    47.4573        4.0943    17.5427     15.046     1.166   0.020    1.1720    0.0689  1.0200   0.3189
 28    26.0000    35.0877        3.7453    -9.0877     15.136    -0.600   0.004   -0.5950    0.0577  1.1618  -0.1472
 29    69.0000    64.3211        6.0510     4.6789     14.371     0.326   0.004    0.3215    0.1506  1.3355   0.1354
 30    61.0000    35.6055        3.3176    25.3945     15.236     1.667   0.026    1.7108    0.0453  0.8071   0.3725
 31    94.0000    35.2151        4.1579    58.7849     15.028     3.912   0.234    5.0865    0.0711  0.0779   1.4073
 32    10.0000    24.7808        3.7195   -14.7808     15.143    -0.976   0.011   -0.9754    0.0569  1.0675  -0.2396
 33    18.0000    29.8407        4.1121   -11.8407     15.041    -0.787   0.009   -0.7830    0.0695  1.1344  -0.2141
 34     9.0000    12.0425        6.3613    -3.0425     14.236    -0.214   0.002   -0.2109    0.1664  1.3724  -0.0942
 35    10.0000    16.5146        6.7087    -6.5146     14.076    -0.463   0.010   -0.4577    0.1851  1.3712  -0.2182
 36    28.0000    20.7948        5.4927     7.2052     14.593     0.494   0.007    0.4885    0.1241  1.2704   0.1839
 37    31.0000    14.9480        4.3203    16.0520     14.982     1.071   0.019    1.0737    0.0768  1.0605   0.3096
 38    26.0000    29.7333        4.3274    -3.7333     14.980    -0.249   0.001   -0.2459    0.0770  1.2366  -0.0710
 39    29.0000    36.6400        3.2840    -7.6400     15.243    -0.501   0.002   -0.4959    0.0444  1.1632  -0.1068
 40    31.0000    32.8247        6.3455    -1.8247     14.243    -0.128   0.001   -0.1263    0.1656  1.3766  -0.0563
 41    16.0000    36.3740        5.3115   -20.3740     14.660    -1.390   0.051   -1.4086    0.1160  0.9885  -0.5103
Prob #5)
The following SAS Program was used to obtain the analysis discussed below. Not all output is shown from the variable subset selection procedures.
filename citydat 'C:\Documents and Settings\...\air_pollution.txt';
data pollution;
  infile citydat;
  input city $2. y x1-x6;
run;
data new;
  set pollution;
  if _N_ = 31 then delete;
run;
ods listing close;
ods rtf file="C:\Documents and Settings\...\as7f10n5.rtf";
proc reg data=new;
  model y = x1-x6 / selection=b sls=.05;
  model y = x1-x6 / selection=stepwise sle=.1 sls=.05;
run;
ods graphics on;
proc reg data=new plots(only)=(criteria cp(label));
  model y = x1-x6 / selection=rsquare start=2 stop=4 best=4 cp sse mse;
  title 'Models fitted with all explanatory variables and City #31 deleted';
run;
ods graphics off;
ods rtf close;
ods listing;
(i) All possible regressions containing no fewer than 2 and no more than 4 explanatory variables
Model  Number in  R-Square     C(p)        MSE         SSE  Variables in Model
Index      Model
    1          2    0.6433  12.9005  172.02883  6365.06656  x2 x3
    2          2    0.6193  16.0634  183.63036  6794.32327  x1 x2
    3          2    0.6181  16.2248  184.22228  6816.22451  x2 x6
    4          2    0.5578  24.1452  213.27393  7891.13546  x2 x4
    5          3    0.6836   9.6026  156.83520  5646.06719  x2 x3 x6
    6          3    0.6767  10.5169  160.28207  5770.15450  x1 x2 x4
    7          3    0.6730  11.0018  162.11006  5835.96228  x1 x2 x3
    8          3    0.6641  12.1666  166.50116  5994.04167  x2 x3 x4
    9          4    0.7181   7.0704  143.74224  5030.97835  x1 x2 x3 x4
   10          4    0.7180   7.0831  143.79175  5032.71125  x1 x2 x4 x5
   11          4    0.7132   7.7078  146.21378  5117.48229  x2 x3 x4 x6
   12          4    0.7101   8.1272  147.84015  5174.40522  x1 x2 x4 x6
Best two-variable model selected: (y, x2 x3), based on the largest R-Square and the smallest MSE and Cp.
Best three-variable model selected: (y, x1 x2 x4), based on large R-Square and small MSE and Cp. Among the three-variable models with comparable values of these statistics, this is the only one that does not include the highly correlated variables x2 and x3 together. However, Cp is smaller for the model (y, x2 x3 x6).
Best four-variable model selected: (y, x1 x2 x4 x5), based on large R-Square and small MSE and Cp. The four models listed above are all close in the values of these statistics, but this model avoids including the highly correlated variables x2 and x3 together.
The best overall model is (y, x1 x2 x4 x5), based on large R-Square and small MSE and Cp. The estimates for this model show that all coefficients are well estimated (all t-statistics are significant).
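The C(p) column used in this comparison can be checked by hand. The sketch below assumes the usual form of Mallows' statistic, C(p) = SSE_p / MSE_full - (n - 2p), with p counting the intercept. The full-model MSE for the 40 remaining cities is not printed in the output shown, so it is backed out from one table row and then used to recompute the others; treat that value as an inference, not as SAS output.

```python
n = 40  # 41 cities minus City #31

# (variables in model, number of explanatory variables, SSE, reported C(p))
rows = [
    ("x2 x3",       2, 6365.06656, 12.9005),
    ("x2 x3 x6",    3, 5646.06719,  9.6026),
    ("x1 x2 x4",    3, 5770.15450, 10.5169),
    ("x1 x2 x3 x4", 4, 5030.97835,  7.0704),
    ("x1 x2 x4 x5", 4, 5032.71125,  7.0831),
]

def mallows_cp(sse, p, mse_full, n):
    """Mallows' C(p) for a subset model with p parameters (intercept included)."""
    return sse / mse_full - (n - 2 * p)

# Assumption: infer the full-model MSE from the (x1 x2 x3 x4) row by
# inverting the C(p) formula, since it is not shown in the output.
_, k_ref, sse_ref, cp_ref = rows[3]
mse_full = sse_ref / (cp_ref + n - 2 * (k_ref + 1))

recomputed = {
    variables: mallows_cp(sse, k + 1, mse_full, n)
    for variables, k, sse, _ in rows
}

print(f"Implied full-model MSE: {mse_full:.3f}")
for variables, _, _, reported in rows:
    print(f"{variables:12s} reported {reported:8.4f}  recomputed {recomputed[variables]:8.4f}")
```

The recomputed values agree with the printed column to rounding, which suggests the table uses this definition with a full-model MSE of roughly 135.7.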
Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1           118.71324        24.07902     4.93    <.0001
x1          1            -1.38353         0.31190    -4.44    <.0001
x2          1             0.02694         0.00350     7.69    <.0001
x4          1            -4.27312         1.46074    -2.93    0.0060
x5          1             0.40315         0.17802     2.26    0.0298
(ii) The model selected by the backward elimination method with sls=.05 is (y, x1 x2 x3 x4) with R-Square = 0.7181 and C(p) = 7.0704
Analysis of Variance
Source            DF  Sum of Squares  Mean Square  F Value  Pr > F
Model              4           12815   3203.73041    22.29  <.0001
Error             35      5030.97835    143.74224
Corrected Total   39           17846
Variable   Parameter Estimate  Standard Error  Type II SS  F Value  Pr > F
Intercept            97.66579        24.88623  2213.86455    15.40  0.0004
x1                   -0.80241         0.31000   963.06332     6.70  0.0139
x2                    0.05574         0.01307  2612.71906    18.18  0.0001
x3                   -0.02872         0.01266   739.17616     5.14  0.0296
x4                   -3.45721         1.46092   804.98394     5.60  0.0236
(iii) The model selected by the stepwise method with sle=.1 and sls=.05 is (y, x2 x3 x6) with R-Square = 0.6836 and C(p) = 9.6026
Analysis of Variance
Source            DF  Sum of Squares  Mean Square  F Value  Pr > F
Model              3           12200   4066.61094    25.93  <.0001
Error             36      5646.06719    156.83520
Corrected Total   39           17846
Variable   Parameter Estimate  Standard Error  Type II SS  F Value  Pr > F
Intercept             2.59500         9.82495    10.94101     0.07  0.7932
x2                    0.06008         0.01295  3377.10556    21.53  <.0001
x3                   -0.03437         0.01258  1170.15733     7.46  0.0097
x6                    0.16841         0.07865   718.99937     4.58  0.0391
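The F values in the two selection tables above can be reproduced from the other columns. As a small sketch, assuming each F value is the Type II SS (one degree of freedom) divided by the error mean square of the same model, the numbers below are copied from the backward and stepwise output. The variable names in the code are illustrative.

```python
# Error mean square and (Type II SS, reported F) per term, copied from the output.
selected_models = {
    "backward (x1 x2 x3 x4)": {
        "mse": 143.74224,
        "terms": {
            "Intercept": (2213.86455, 15.40),
            "x1": (963.06332, 6.70),
            "x2": (2612.71906, 18.18),
            "x3": (739.17616, 5.14),
            "x4": (804.98394, 5.60),
        },
    },
    "stepwise (x2 x3 x6)": {
        "mse": 156.83520,
        "terms": {
            "Intercept": (10.94101, 0.07),
            "x2": (3377.10556, 21.53),
            "x3": (1170.15733, 7.46),
            "x6": (718.99937, 4.58),
        },
    },
}

# Partial F statistic for each term: Type II SS / MSE (1 numerator df).
partial_f = {
    model: {term: ss / info["mse"] for term, (ss, _) in info["terms"].items()}
    for model, info in selected_models.items()
}

for model, info in selected_models.items():
    print(model)
    for term, (_, reported) in info["terms"].items():
        print(f"  {term:9s} reported F {reported:6.2f}  recomputed {partial_f[model][term]:7.3f}")
```

Each recomputed ratio rounds to the printed F value, so a term's significance at sls=.05 can be read directly from its Type II SS relative to the error mean square.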
Models fitted with all explanatory variables and City #31 deleted
The REG Procedure
Model: MODEL1
Dependent Variable: y
R-Square Selection Method