
Probs 1), 2), and 3)

Correlation

Variable        x1        x2        x3        x4        x5        x6         y
x1          1.0000   -0.1900   -0.0627   -0.3497    0.3863   -0.4302   -0.4336
x2         -0.1900    1.0000    0.9553    0.2379   -0.0324    0.1318    0.6448
x3         -0.0627    0.9553    1.0000    0.2126   -0.0261    0.0421    0.4938
x4         -0.3497    0.2379    0.2126    1.0000   -0.0130    0.1641    0.0947
x5          0.3863   -0.0324   -0.0261   -0.0130    1.0000    0.4961    0.0543
x6         -0.4302    0.1318    0.0421    0.1641    0.4961    1.0000    0.3696
y          -0.4336    0.6448    0.4938    0.0947    0.0543    0.3696    1.0000

Among the explanatory variables, x2 and x3 are highly correlated, with a positive correlation of 0.96 (0.9553 in the table above). This shows the strength of the linear relationship displayed in the plot of x2 vs. x3 below. These two variables may cause multicollinearity problems when fitting the full model.

Model fitted with all explanatory variables

The REG Procedure

Model: MODEL1

Dependent Variable: y

Number of Observations Read 41

Number of Observations Used 41

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               6             14755     2459.10601      11.48    <.0001
Error              34        7283.26641      214.21372
Corrected Total    40             22038

Root MSE          14.63604    R-Square    0.6695
Dependent Mean    30.04878    Adj R-Sq    0.6112
Coeff Var         48.70761

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Variance Inflation    95% Confidence Limits
Intercept    1             111.72848          47.31810       2.36      0.0241                     0     15.56653   207.89043
x1           1              -1.26794           0.62118      -2.04      0.0491               3.76400     -2.53033    -0.00555
x2           1               0.06492           0.01575       4.12      0.0002              14.70365      0.03291     0.09692
x3           1              -0.03928           0.01513      -2.60      0.0138              14.34083     -0.07003    -0.00852
x4           1              -3.18137           1.81502      -1.75      0.0887               1.25552     -6.86993     0.50720
x5           1               0.51236           0.36276       1.41      0.1669               3.40492     -0.22485     1.24957
x6           1              -0.05205           0.16201      -0.32      0.7500               3.44365     -0.38130     0.27720

One would expect the estimates of the coefficients for the two variables x3 (Population Size) and x6 (Number of Days with Precipitation) to be positive, because the heating needs of a community (and therefore the coal burning required for heating) are expected to increase with each unit increase in these two variables. In the fitted model, however, both estimates are negative (-0.03928 and -0.05205).

The standard errors of the estimates of the coefficients for the variables x4, x5 and x6 are very high (compared to the magnitude of the estimates). Thus these coefficients are very poorly estimated, as shown by the confidence intervals, which are very wide and include zero. Equivalently, the t-tests are not significant at the 5% level.
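The link between a large standard error, a small t-value and an interval that covers zero can be checked by hand. The short Python sketch below is not part of the SAS program; it only re-does the arithmetic with the numbers copied from the full-model Parameter Estimates table, and the function names are made up for this illustration.

```python
# (estimate, standard error, lower 95% limit, upper 95% limit) for the three
# poorly estimated coefficients, copied from the full-model output.
full_model = {
    "x4": (-3.18137, 1.81502, -6.86993, 0.50720),
    "x5": (0.51236, 0.36276, -0.22485, 1.24957),
    "x6": (-0.05205, 0.16201, -0.38130, 0.27720),
}


def t_value(estimate, std_error):
    """t statistic for H0: coefficient = 0."""
    return estimate / std_error


def interval_contains_zero(lower, upper):
    """True when the confidence interval covers zero."""
    return lower <= 0.0 <= upper


for name, (est, se, lower, upper) in full_model.items():
    print(name, round(t_value(est, se), 2), interval_contains_zero(lower, upper))
```

Each ratio should agree with the t Value column to two decimals, and all three intervals cover zero, which is the same statement as the t-test not rejecting at the 5% level.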

The VIFs for all variables except x4 are large, indicating that multicollinearity exists among the explanatory variables and is causing the above problems. x2 and x3 have the largest VIFs (14.70 and 14.34), indicating that they are the two variables causing most of the multicollinearity.
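To see where VIFs of this size come from, recall that VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing x_j on the other explanatory variables. The Python sketch below is only an illustration using the reported values; the helper names are hypothetical. It shows that the pairwise correlation between x2 and x3 alone would already give a VIF above 11, and it backs out the implied R_j^2 from each reported VIF.

```python
def vif_from_r2(r2):
    """Variance inflation factor for a predictor whose R-squared on the others is r2."""
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Invert the VIF formula to get the implied R-squared."""
    return 1.0 - 1.0 / vif


# Correlation between x2 and x3 from the correlation table.
r_x2_x3 = 0.9553

# If x2 and x3 were the only two predictors, each would have this VIF.
pairwise_vif = vif_from_r2(r_x2_x3 ** 2)
print("VIF from the x2-x3 correlation alone:", round(pairwise_vif, 2))

# VIFs reported for the full model.
reported_vif = {
    "x1": 3.76400, "x2": 14.70365, "x3": 14.34083,
    "x4": 1.25552, "x5": 3.40492, "x6": 3.44365,
}
for name, vif in reported_vif.items():
    print(name, "implied R-squared on the other predictors:", round(r2_from_vif(vif), 3))
```

The reported VIFs for x2 and x3 are somewhat larger than the pairwise value because the other predictors also explain a little of each variable.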

Model fitted with all explanatory variables

The REG Procedure

Model: MODEL2

Dependent Variable: y

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4             13285     3321.25137      13.66    <.0001
Error              36        8752.89698      243.13603
Corrected Total    40             22038

Root MSE          15.59282    R-Square    0.6028
Dependent Mean    30.04878    Adj R-Sq    0.5587
Coeff Var         51.89169

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Variance Inflation    95% Confidence Limits
Intercept    1             123.11833          31.29070       3.93      0.0004                     0     59.65785   186.57882
x1           1              -1.61144           0.40137      -4.01      0.0003               1.38455     -2.42546    -0.79741
x2           1               0.02548           0.00454       5.62      <.0001               1.07524      0.01627     0.03468
x4           1              -3.63024           1.89234      -1.92      0.0630               1.20243     -7.46809     0.20760
x5           1               0.52423           0.22941       2.29      0.0283               1.19975      0.05898     0.98949

It is obvious that this model behaves better than the full model. While the coefficient of determination (R-squared) has decreased from 67% to 60%, the VIFs have all decreased (all are now below 1.4) and most of the standard errors are reasonably small relative to the estimates. The variable x4 is still not significant in this model at the 5% level (p = 0.0630).
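Because the two models have different numbers of explanatory variables, the adjusted R-squared values are the fairer comparison. As a rough check, the Python sketch below reproduces the Adj R-Sq lines of both outputs from the reported R-Square values; it assumes the usual formula 1 - (1 - R^2)(n - 1)/(n - k - 1) with k explanatory variables, and the function name is made up for this illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k explanatory variables plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


N_OBS = 41  # Number of Observations Used

full = adjusted_r2(0.6695, N_OBS, 6)     # y on x1-x6
reduced = adjusted_r2(0.6028, N_OBS, 4)  # y on x1 x2 x4 x5

print("full model Adj R-Sq:   ", round(full, 4))
print("reduced model Adj R-Sq:", round(reduced, 4))
```

Both values should match the SAS output up to rounding of the reported R-Square.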

SAS Program:

filename citydat 'C:\Documents and Settings\...\air_pollution.txt';
data auto;
  infile citydat;
  input city $2. y x1-x6;
run;
ods listing close;
ods rtf file="C:\Documents and Settings\...\as7f10n123_out.rtf" style=statistical;
proc sgscatter data=auto;
  title "Scatterplot Matrix for Cement Data";
  matrix x1-x6;
run;
proc reg corr;
  model y = x1-x6 / clb vif;
  model y = x1 x2 x4 x5 / clb vif;
  title 'Model fitted with all explanatory variables';
run;
ods rtf close;
ods listing;


Prob 4)

The descriptions below use the SAS output (attached) from the following SAS program:

filename citydat 'C:\Documents and Settings\...\air_pollution.txt';
data auto;
  infile citydat;
  input city $2. y x1-x6;
run;
ods rtf file="C:\Documents and Settings\...\as7f10n4.rtf";
ods graphics on;
proc reg;
  model y = x1 x2 x4 x5 / r influence;
  title 'Model fitted with selected explanatory variables';
run;
ods graphics off;
ods rtf close;

The output diagnostics RStudent, Hat Diag and Cook’s D, the Fit Diagnostics panel, and the plots of Residuals vs. each explanatory variable are used in the discussion below:

The 5% Bonferroni critical value for RStudent with n=41 and k=4 is 3.52. Only one case among the RStudent values exceeds this: City#31 (RStudent = 5.0865) is a y-outlier. Looking at Cook’s D for #31, we see that it is large (0.234, well over the cut-off of .1). This is also clearly identified in the Cook’s D vs. Observation plot in the Fit Diagnostics panel (in addition to City#1). Thus, City#31 is a y-outlier that is also influential. We might consider setting it aside.
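The RStudent value is the Student Residual rescaled so that the case itself is left out of the error variance estimate. As an illustration only, the Python sketch below applies the standard identity RStudent = r * sqrt((n - p - 1)/(n - p - r^2)), with p = 5 parameters including the intercept, to the Student Residual values printed in the Output Statistics; the function name is hypothetical.

```python
import math


def rstudent_from_student(r, n, p):
    """Externally studentized residual from the internally studentized residual r.

    n is the number of observations and p the number of parameters
    (explanatory variables plus the intercept).
    """
    return r * math.sqrt((n - p - 1) / (n - p - r * r))


N_OBS, N_PARAMS = 41, 5

# Student Residual values copied from the Output Statistics table.
student_residual = {1: 1.155, 11: 0.335, 31: 3.912}

for city, r in student_residual.items():
    print(city, round(rstudent_from_student(r, N_OBS, N_PARAMS), 3))
```

For City#31 this gives about 5.09, in line with the reported 5.0865 (the small difference comes from the three-decimal rounding of the Student Residual), and it is the only value above the 3.52 cut-off.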

Cut-off for Hat Diag is 2(k+1)/n = 2x5/41 ≈ .24. City#1 (.4930) and City#11 (.6862) exceed this substantially. These two points are clearly identified in the RStudent vs. Leverage plot in the Fit Diagnostics panel. Note the two points which are to the right of the vertical line drawn at the cut-off.
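The leverage screen is simple enough to redo by hand. The Python sketch below is an illustration that lists only the five largest Hat Diag H values from the Output Statistics (all other cities are smaller) and compares them with the 2(k+1)/n cut-off.

```python
N_OBS = 41        # observations used
N_PREDICTORS = 4  # x1, x2, x4, x5

# Cut-off 2(k+1)/n for the diagonal elements of the hat matrix.
cutoff = 2 * (N_PREDICTORS + 1) / N_OBS

# The five largest Hat Diag H values, keyed by observation number.
hat_diag = {1: 0.4930, 9: 0.2426, 11: 0.6862, 14: 0.2295, 23: 0.2210}

flagged = sorted(city for city, h in hat_diag.items() if h > cutoff)
print("cut-off:", round(cutoff, 4))
print("x-outliers:", flagged)
```

City#9 comes close to the cut-off but stays just under it, so only City#1 and City#11 are flagged.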

City#1 is influential (high Cook’s D) but is clearly not a y-outlier. Its Cook’s D statistic is large because it is an x-outlier (i.e., because hii is large).
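Cook's D can be written as a residual part times a leverage part, D = (r^2/p) * hii/(1 - hii), where r is the Student Residual and p = 5 is the number of parameters. The Python sketch below is only an illustration with values copied from the Output Statistics and a made-up function name; it separates the two parts for the three cases discussed here.

```python
def cooks_d(student_residual, h, p):
    """Cook's D from the studentized residual, the leverage h and p parameters."""
    residual_part = student_residual ** 2 / p
    leverage_part = h / (1.0 - h)
    return residual_part * leverage_part


N_PARAMS = 5

# (Student Residual, Hat Diag H, reported Cook's D) from the Output Statistics.
cases = {
    1: (1.155, 0.4930, 0.260),
    11: (0.335, 0.6862, 0.049),
    31: (3.912, 0.0711, 0.234),
}

for city, (r, h, reported) in cases.items():
    print(city, "leverage part:", round(h / (1.0 - h), 3),
          "Cook's D:", round(cooks_d(r, h, N_PARAMS), 3), "reported:", reported)
```

City#1 has a moderate residual but a leverage factor near 1, while City#31 has a small leverage factor and a very large residual. City#11 has the largest leverage factor of all, but its residual is so small that Cook's D stays low.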

City#11 is an x-outlier but is not a y-outlier nor is it an influential case.

Looking at the plots of Residuals vs. each explanatory variable and normal probability plot of the residuals (see the Fit Diagnostics panel) we see that these three cases affect the shapes and patterns seen in the plots.

Since the y-outlier seems to affect the fit of the model (and the x-outliers do not), it is decided that City #31 may not fit this model very well.

You were not required to do this part. However, I removed City #31, re-fitted the same model and repeated the same analysis. The diagnostics and all the plots showed vast improvement over those above. This clearly shows how just one real outlier can change a model fit drastically. In particular, the normal probability plot supported the assumption that the errors were normally distributed.

Model fitted with selected explanatory variables

The REG Procedure

Model: MODEL1

Dependent Variable: y

Output Statistics

Obs  Dependent  Predicted  Std Error   Residual  Std Error  Student  Cook's  RStudent  Hat Diag  Cov      DFFITS
     Variable   Value      Mean Predict          Residual   Residual D                 H         Ratio
  1   10.0000   -2.8249  10.9486   12.8249  11.102   1.155  0.260   1.1607  0.4930  1.8801   1.1446
  2   13.0000   22.8069   4.1977   -9.8069  15.017  -0.653  0.007  -0.6478  0.0725  1.1694  -0.1811
  3   12.0000   22.5379   4.5865  -10.5379  14.903  -0.707  0.009  -0.7021  0.0865  1.1752  -0.2161
  4   17.0000   25.1674   5.6259   -8.1674  14.543  -0.562  0.009  -0.5562  0.1302  1.2666  -0.2152
  5   56.0000   44.5566   4.6269   11.4434  14.891   0.768  0.011   0.7640  0.0880  1.1622   0.2374
  6   36.0000   26.5671   3.4135    9.4329  15.215   0.620  0.004   0.6146  0.0479  1.1460   0.1379
  7   29.0000   28.4657   2.5071    0.5343  15.390  0.0347  0.000   0.0342  0.0259  1.1816   0.0056
  8   14.0000   12.9697   5.4771    1.0303  14.599  0.0706  0.000   0.0696  0.1234  1.3124   0.0261
  9   10.0000    5.4054   7.6797    4.5946  13.571   0.339  0.007   0.3344  0.2426  1.4959   0.1892
 10   24.0000   25.6963   3.6143   -1.6963  15.168  -0.112  0.000  -0.1103  0.0537  1.2145  -0.0263
 11  110.0000  107.0704  12.9164    2.9296   8.735   0.335  0.049   0.3312  0.6862  3.6115   0.4897
 12   28.0000   33.1324   2.9569   -5.1324  15.310  -0.335  0.001  -0.3311  0.0360  1.1757  -0.0639
 13   17.0000   22.3213   4.8742   -5.3213  14.811  -0.359  0.003  -0.3549  0.0977  1.2532  -0.1168
 14    8.0000    5.0225   7.4697    2.9775  13.687   0.218  0.003   0.2146  0.2295  1.4843   0.1171
 15   30.0000   33.4046   3.7168   -3.4046  15.143  -0.225  0.001  -0.2218  0.0568  1.2121  -0.0544
 16    9.0000   17.5210   5.6931   -8.5210  14.516  -0.587  0.011  -0.5816  0.1333  1.2660  -0.2281
 17   47.0000   37.2174   2.7730    9.7826  15.344   0.638  0.003   0.6322  0.0316  1.1232   0.1143
 18   35.0000   49.3786   3.9414  -14.3786  15.086  -0.953  0.012  -0.9518  0.0639  1.0823  -0.2487
 19   29.0000   45.9464   4.9404  -16.9464  14.789  -1.146  0.029  -1.1510  0.1004  1.0628  -0.3845
 20   14.0000   28.0955   2.6949  -14.0955  15.358  -0.918  0.005  -0.9157  0.0299  1.0542  -0.1607
 21   56.0000   37.1101   2.8250   18.8899  15.335   1.232  0.010   1.2410  0.0328  0.9597   0.2286
 22   14.0000   20.9922   4.2266   -6.9922  15.009  -0.466  0.003  -0.4607  0.0735  1.2055  -0.1297
 23   11.0000    4.5248   7.3297    6.4752  13.763   0.470  0.013   0.4653  0.2210  1.4329   0.2478
 24   46.0000   33.0772   4.8286   12.9228  14.826   0.872  0.016   0.8686  0.0959  1.1446   0.2829
 25   11.0000   31.0957   6.0650  -20.0957  14.365  -1.399  0.070  -1.4185  0.1513  1.0257  -0.5989
 26   23.0000   42.5619   5.4272  -19.5619  14.618  -1.338  0.049  -1.3536  0.1211  1.0150  -0.5026
 27   65.0000   47.4573   4.0943   17.5427  15.046   1.166  0.020   1.1720  0.0689  1.0200   0.3189
 28   26.0000   35.0877   3.7453   -9.0877  15.136  -0.600  0.004  -0.5950  0.0577  1.1618  -0.1472
 29   69.0000   64.3211   6.0510    4.6789  14.371   0.326  0.004   0.3215  0.1506  1.3355   0.1354
 30   61.0000   35.6055   3.3176   25.3945  15.236   1.667  0.026   1.7108  0.0453  0.8071   0.3725
 31   94.0000   35.2151   4.1579   58.7849  15.028   3.912  0.234   5.0865  0.0711  0.0779   1.4073
 32   10.0000   24.7808   3.7195  -14.7808  15.143  -0.976  0.011  -0.9754  0.0569  1.0675  -0.2396
 33   18.0000   29.8407   4.1121  -11.8407  15.041  -0.787  0.009  -0.7830  0.0695  1.1344  -0.2141
 34    9.0000   12.0425   6.3613   -3.0425  14.236  -0.214  0.002  -0.2109  0.1664  1.3724  -0.0942
 35   10.0000   16.5146   6.7087   -6.5146  14.076  -0.463  0.010  -0.4577  0.1851  1.3712  -0.2182
 36   28.0000   20.7948   5.4927    7.2052  14.593   0.494  0.007   0.4885  0.1241  1.2704   0.1839
 37   31.0000   14.9480   4.3203   16.0520  14.982   1.071  0.019   1.0737  0.0768  1.0605   0.3096
 38   26.0000   29.7333   4.3274   -3.7333  14.980  -0.249  0.001  -0.2459  0.0770  1.2366  -0.0710
 39   29.0000   36.6400   3.2840   -7.6400  15.243  -0.501  0.002  -0.4959  0.0444  1.1632  -0.1068
 40   31.0000   32.8247   6.3455   -1.8247  14.243  -0.128  0.001  -0.1263  0.1656  1.3766  -0.0563
 41   16.0000   36.3740   5.3115  -20.3740  14.660  -1.390  0.051  -1.4086  0.1160  0.9885  -0.5103


Prob #5)

The following SAS Program was used to obtain the analysis discussed below. Not all output is shown from the variable subset selection procedures.

filename citydat 'C:\Documents and Settings\...\air_pollution.txt';
data pollution;
  infile citydat;
  input city $2. y x1-x6;
run;
data new;
  set pollution;
  if _N_ = 31 then delete;
run;
ods listing close;
ods rtf file="C:\Documents and Settings\...\as7f10n5.rtf";
proc reg data=new;
  model y = x1-x6 / selection=b sls=.05;
  model y = x1-x6 / selection=stepwise sle=.1 sls=.05;
run;
ods graphics on;
proc reg data=new plots(only)=(criteria cp(label));
  model y = x1-x6 / selection=rsquare start=2 stop=4 best=4 cp sse mse;
  title 'Models fitted with all explanatory variables and City #31 deleted';
run;
ods graphics off;
ods rtf close;
ods listing;

(i) All Possible regressions containing no less than 2 and no more than 4 explanatory variables

Model    Number in
Index    Model        R-Square       C(p)          MSE           SSE    Variables in Model
  1        2            0.6433    12.9005    172.02883    6365.06656    x2 x3
  2        2            0.6193    16.0634    183.63036    6794.32327    x1 x2
  3        2            0.6181    16.2248    184.22228    6816.22451    x2 x6
  4        2            0.5578    24.1452    213.27393    7891.13546    x2 x4
  5        3            0.6836     9.6026    156.83520    5646.06719    x2 x3 x6
  6        3            0.6767    10.5169    160.28207    5770.15450    x1 x2 x4
  7        3            0.6730    11.0018    162.11006    5835.96228    x1 x2 x3
  8        3            0.6641    12.1666    166.50116    5994.04167    x2 x3 x4
  9        4            0.7181     7.0704    143.74224    5030.97835    x1 x2 x3 x4
 10        4            0.7180     7.0831    143.79175    5032.71125    x1 x2 x4 x5
 11        4            0.7132     7.7078    146.21378    5117.48229    x2 x3 x4 x6
 12        4            0.7101     8.1272    147.84015    5174.40522    x1 x2 x4 x6
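The C(p) column can be tied back to the SSE column. Assuming the usual definition C(p) = SSE_p/MSE_full - (n - 2p), with p counting the intercept and n = 40 after deleting City #31, the Python sketch below backs out the full-model MSE from one row (the full-model fit itself is not shown in this output) and then checks that the same value reproduces C(p) for the other rows. The function names are hypothetical.

```python
N_OBS = 40  # 41 cities minus City #31

# (SSE, number of explanatory variables, reported C(p)) from the table above.
models = {
    "x2 x3": (6365.06656, 2, 12.9005),
    "x2 x3 x6": (5646.06719, 3, 9.6026),
    "x1 x2 x3 x4": (5030.97835, 4, 7.0704),
    "x1 x2 x4 x5": (5032.71125, 4, 7.0831),
}


def mallows_cp(sse, k, mse_full, n=N_OBS):
    """Mallows' C(p) for a subset model with k explanatory variables."""
    p = k + 1  # parameters including the intercept
    return sse / mse_full - (n - 2 * p)


def implied_full_mse(sse, k, cp, n=N_OBS):
    """Solve the C(p) formula for the full-model MSE."""
    p = k + 1
    return sse / (cp + n - 2 * p)


sse, k, cp = models["x1 x2 x3 x4"]
mse_full = implied_full_mse(sse, k, cp)
print("implied full-model MSE:", round(mse_full, 2))

for name, (sse, k, cp) in models.items():
    print(name, round(mallows_cp(sse, k, mse_full), 4), "reported:", cp)
```

All four-variable candidates still have C(p) well above p = 5, which suggests some bias remains in every subset listed, so C(p) here is mainly useful for ranking the candidates against each other.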

Best two variable model selected: (y, x2 x3) based on largest R-Square, smallest MSE and Cp.

Best three variable model selected: (y, x1 x2 x4) based on large R-Square, small MSE and Cp. This is the only model with comparable values for these statistics that does not include the highly correlated variables x2 and x3 together. However, Cp is smaller for the model (y, x2 x3 x6).

Best four variable model selected: (y, x1 x2 x4 x5) based on large R-Square, small MSE and Cp. The four-variable models listed above are all pretty close in the values for the above statistics, but this model avoids including the highly correlated variables x2 and x3 together.

The best overall model is (y, x1 x2 x4 x5) based on large R-Square, small MSE and Cp. The estimates for this model are all well estimated (all t-statistics are significant).

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1             118.71324          24.07902       4.93      <.0001
x1           1              -1.38353           0.31190      -4.44      <.0001
x2           1               0.02694           0.00350       7.69      <.0001
x4           1              -4.27312           1.46074      -2.93      0.0060
x5           1               0.40315           0.17802       2.26      0.0298

(ii) The model selected by the backward elimination method with sls=.05 is (y, x1 x2 x3 x4) with R-Square = 0.7181 and C(p) = 7.0704

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4             12815     3203.73041      22.29    <.0001
Error              35        5030.97835      143.74224
Corrected Total    39             17846

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              97.66579          24.88623    2213.86455      15.40    0.0004
x1                     -0.80241           0.31000     963.06332       6.70    0.0139
x2                      0.05574           0.01307    2612.71906      18.18    0.0001
x3                     -0.02872           0.01266     739.17616       5.14    0.0296
x4                     -3.45721           1.46092     804.98394       5.60    0.0236
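Backward elimination stops when every remaining variable has a partial F-test p-value below sls. Each partial F in the table is the variable's Type II SS divided by the error mean square of the current model. The Python sketch below is only an illustration with the values copied from the output above; it redoes that division.

```python
MSE_ERROR = 143.74224  # error mean square of the selected model (35 df)

# (Type II SS, reported F Value, reported Pr > F) for each term.
terms = {
    "Intercept": (2213.86455, 15.40, 0.0004),
    "x1": (963.06332, 6.70, 0.0139),
    "x2": (2612.71906, 18.18, 0.0001),
    "x3": (739.17616, 5.14, 0.0296),
    "x4": (804.98394, 5.60, 0.0236),
}


def partial_f(type2_ss, mse):
    """Partial F statistic (1 numerator df) from a Type II sum of squares."""
    return type2_ss / mse


SLS = 0.05
for name, (ss, reported_f, p_value) in terms.items():
    stays = p_value < SLS
    print(name, round(partial_f(ss, MSE_ERROR), 2), "reported:", reported_f, "stays:", stays)
```

Every term passes the sls=.05 threshold, so no further variable is removed and the procedure ends with (y, x1 x2 x3 x4).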

(iii) The model selected by the stepwise method with sle=.1 and sls=.05 is (y, x2 x3 x6) with R-Square = 0.6836 and C(p) = 9.6026

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3             12200     4066.61094      25.93    <.0001
Error              36        5646.06719      156.83520
Corrected Total    39             17846

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept               2.59500           9.82495      10.94101       0.07    0.7932
x2                      0.06008           0.01295    3377.10556      21.53    <.0001
x3                     -0.03437           0.01258    1170.15733       7.46    0.0097
x6                      0.16841           0.07865     718.99937       4.58    0.0391

Models fitted with all explanatory variables and City #31 deleted

The REG Procedure

Model: MODEL1

Dependent Variable: y

R-Square Selection Method
