7. Regression Analysis

7.1 Simple linear regression for normal-theory Gauss-Markov models

Model: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, where $\epsilon_i \sim NID(0, \sigma^2)$ for $i = 1, \ldots, n$.

Matrix formulation:
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}, \qquad \text{or} \qquad \mathbf{Y} = X\beta + \epsilon.$$

The OLS estimator (b.l.u.e.) for $\beta$ is $b = (X^TX)^{-1}X^T\mathbf{Y}$. (When does this exist? Exactly when $X^TX$ is nonsingular, i.e., when at least two of the $X_i$ differ.) Here
$$X^TX = \begin{bmatrix} n & \sum_{i=1}^n X_i \\ \sum_{i=1}^n X_i & \sum_{i=1}^n X_i^2 \end{bmatrix} \qquad \text{and} \qquad X^T\mathbf{Y} = \begin{bmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n X_iY_i \end{bmatrix}.$$

Then
$$(X^TX)^{-1} = \frac{1}{n\sum_{i=1}^n X_i^2 - \left(\sum_{i=1}^n X_i\right)^2} \begin{bmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{bmatrix} = \frac{1}{n\sum_{i=1}^n (X_i - \bar X)^2} \begin{bmatrix} \sum X_i^2 & -n\bar X \\ -n\bar X & n \end{bmatrix}$$

and
$$b = (X^TX)^{-1}X^T\mathbf{Y} = \frac{1}{n\sum_{i=1}^n (X_i - \bar X)^2} \begin{bmatrix} \left(\sum X_i^2\right)\left(\sum Y_i\right) - n\bar X \sum X_iY_i \\ -n\bar X \sum Y_i + n\sum X_iY_i \end{bmatrix} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} \bar Y - b_1\bar X \\[4pt] \dfrac{\sum_{i=1}^n (X_i - \bar X)Y_i}{\sum_{i=1}^n (X_i - \bar X)^2} \end{bmatrix}.$$

Covariance matrix:
$$Var(b) = Var\!\left((X^TX)^{-1}X^T\mathbf{Y}\right) = (X^TX)^{-1}X^T(\sigma^2 I)X(X^TX)^{-1} = \sigma^2(X^TX)^{-1} = \sigma^2 \begin{bmatrix} \frac{1}{n} + \frac{\bar X^2}{\sum (X_i - \bar X)^2} & \frac{-\bar X}{\sum (X_i - \bar X)^2} \\[4pt] \frac{-\bar X}{\sum (X_i - \bar X)^2} & \frac{1}{\sum (X_i - \bar X)^2} \end{bmatrix}.$$

Estimate the covariance matrix for $b$ as $S_b = MSE\,(X^TX)^{-1}$, where $MSE = SSE/(n-2)$ and $SSE = \mathbf{Y}^T(I - P_X)\mathbf{Y}$.

Analysis of Variance:
$$\mathbf{Y}^T\mathbf{Y} = \sum_{i=1}^n Y_i^2 = \mathbf{Y}^T(I - P_X + P_X - P_1 + P_1)\mathbf{Y} = \mathbf{Y}^T(I - P_X)\mathbf{Y} + \mathbf{Y}^T(P_X - P_1)\mathbf{Y} + \mathbf{Y}^TP_1\mathbf{Y},$$
where $P_X = X(X^TX)^{-1}X^T$ and $P_1 = \mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T$ are the orthogonal projections onto the column spaces of $X$ and $\mathbf{1}$. The first term is $SSE$, the second is the "corrected model" sum of squares (call this $R(\beta_1|\beta_0)$), and the third is the correction for the "mean" (call this $R(\beta_0)$).

(i) By Cochran's Theorem, these three sums of squares are multiples of independent chi-squared random variables.
(ii) By Result 4.7, $\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-2)}$ if the model is correctly specified.

Correction for the overall mean:
$$R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y} = \mathbf{Y}^T(I - I + P_1)\mathbf{Y} = \mathbf{Y}^TI\mathbf{Y} - \mathbf{Y}^T(I - P_1)\mathbf{Y} = \sum_{i=1}^n (Y_i - 0)^2 - \sum_{i=1}^n (Y_i - \bar Y)^2,$$
where $\sum_{i=1}^n (Y_i - \bar Y)^2$ is the sum of squared residuals from fitting the model $Y_i = \mu + \epsilon_i$. The OLS estimator for $\mu = E(Y_i)$ is
$$\hat\mu = (\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T\mathbf{Y} = n^{-1}\sum_{i=1}^n Y_i = \bar Y.$$

An alternative formula:
$$R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y} = \mathbf{Y}^T\mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T\mathbf{Y} = \left(\sum_{i=1}^n Y_i\right) n^{-1} \left(\sum_{i=1}^n Y_i\right) = n\bar Y^2,$$
with $df = rank(P_1) = rank(\mathbf{1}) = 1$.

Reduction in residual sum of squares: partition $X = [\,X_1\,|\,X_2\,]$, where the columns of $X_1$ correspond to $\beta_0, \beta_1, \ldots, \beta_k$ and the columns of $X_2$ correspond to $\beta_{k+1}, \ldots, \beta_{k+q}$. Then
$$R(\beta_{k+1}, \ldots, \beta_{k+q}\,|\,\beta_0, \beta_1, \ldots, \beta_k) = \mathbf{Y}^T(I - P_{X_1})\mathbf{Y} - \mathbf{Y}^T(I - P_X)\mathbf{Y},$$
the sum of squared residuals for the smaller model minus the sum of squared residuals for the larger model.

In particular, the reduction in the residual sum of squares for regression on $X$ is
$$R(\beta_1|\beta_0) = \mathbf{Y}^T(P_X - P_1)\mathbf{Y} = \mathbf{Y}^T(P_X - I + I - P_1)\mathbf{Y} = \mathbf{Y}^T(I - P_1)\mathbf{Y} - \mathbf{Y}^T(I - P_X)\mathbf{Y},$$
the sum of squared residuals from fitting $Y_i = \mu + \epsilon_i$ minus the sum of squared residuals from fitting $Y_i = \beta_0 + \beta_1X_i + \epsilon_i$.

ANOVA table:

Source of variation        d.f.    Sum of squares
Regression on X            1       $R(\beta_1|\beta_0) = \mathbf{Y}^T(P_X - P_1)\mathbf{Y}$
Residuals                  n-2     $\mathbf{Y}^T(I - P_X)\mathbf{Y}$
Corrected total            n-1     $\mathbf{Y}^T(I - P_1)\mathbf{Y}$
Correction for the mean    1       $\mathbf{Y}^TP_1\mathbf{Y} = n\bar Y^2$

F-tests: From Result 4.7 we have
$$\frac{1}{\sigma^2}R(\beta_0) = \frac{1}{\sigma^2}\mathbf{Y}^TP_1\mathbf{Y} \sim \chi^2_1(\delta^2), \qquad \delta^2 = \frac{1}{\sigma^2}\beta^TX^TP_1X\beta = \frac{1}{\sigma^2}(P_1X\beta)^T(P_1X\beta) = \frac{n}{\sigma^2}(\beta_0 + \beta_1\bar X)^2.$$

Also use Result 4.7 to show that $\frac{1}{\sigma^2}SSE = \frac{1}{\sigma^2}\mathbf{Y}^T(I - P_X)\mathbf{Y} \sim \chi^2_{(n-2)}$, and use Result 4.8 to show that $SSE$ is distributed independently of $R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y}$; this follows from $(I - P_X)P_1 = 0$. Consequently,
$$F = \frac{R(\beta_0)}{MSE} \sim F_{(1, n-2)}(\delta^2),$$
and this becomes a central F-distribution when the null hypothesis is true. Hypothesis test: reject $H_0: \beta_0 + \beta_1\bar X = 0$ if $F = R(\beta_0)/MSE > F_{(1, n-2), \alpha}$.
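Before turning to the test of $H_0: \beta_1 = 0$ developed next, it may help to see the decomposition numerically. The following is a small sketch in the S/R dialect of the session at the end of this chapter; the data vectors x and y are made up purely for illustration.

   # Sketch: simple linear regression via projections (made-up data)
   x <- c(1, 2, 3, 4, 5)
   y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
   n <- length(y)
   X  <- cbind(1, x)                           # model matrix [1 | X]
   b  <- solve(crossprod(X), crossprod(X, y))  # b = (X'X)^{-1} X'Y
   PX <- X %*% solve(crossprod(X)) %*% t(X)    # projection onto C(X)
   P1 <- matrix(1/n, n, n)                     # projection onto C(1)
   R.b0 <- drop(t(y) %*% P1 %*% y)             # R(beta0) = n * ybar^2
   R.b1 <- drop(t(y) %*% (PX - P1) %*% y)      # R(beta1 | beta0)
   SSE  <- drop(t(y) %*% (diag(n) - PX) %*% y)
   MSE  <- SSE / (n - 2)
   R.b0 / MSE     # F for H0: beta0 + beta1*xbar = 0
   R.b1 / MSE     # F for H0: beta1 = 0 (derived next)

The three quadratic forms add up to sum(y^2), as the decomposition above requires.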
Test the null hypothesis $H_0: \beta_1 = 0$ with
$$F = \frac{R(\beta_1|\beta_0)/1}{MSE} = \frac{[\mathbf{Y}^T(P_X - P_1)\mathbf{Y}]/[1\,\sigma^2]}{[\mathbf{Y}^T(I - P_X)\mathbf{Y}]/[(n-2)\,\sigma^2]} \sim F_{(1, n-2)}(\delta^2),$$
where
$$\delta^2 = \frac{1}{\sigma^2}\beta^TX^T(P_X - P_1)X\beta = \frac{1}{\sigma^2}\beta^TX^T(P_X - P_1)^T(P_X - P_1)X\beta.$$
The null hypothesis being tested is $H_0: (P_X - P_1)X\beta = \mathbf{0}$.

Here, writing $\mathbf{X} = (X_1, \ldots, X_n)^T$ for the second column of $X$,
$$(P_X - P_1)X = (P_X - P_1)[\,\mathbf{1}\,|\,\mathbf{X}\,] = [\,P_X\mathbf{1} - P_1\mathbf{1}\,|\,P_X\mathbf{X} - P_1\mathbf{X}\,] = [\,\mathbf{1} - \mathbf{1}\,|\,\mathbf{X} - \bar X\mathbf{1}\,] = \begin{bmatrix} 0 & X_1 - \bar X \\ 0 & X_2 - \bar X \\ \vdots & \vdots \\ 0 & X_n - \bar X \end{bmatrix},$$
so $(P_X - P_1)X\beta = \beta_1(\mathbf{X} - \bar X\mathbf{1})$. If any $X_i \ne X_j$, then the $X_i$ cannot all equal $\bar X$, so $\mathbf{X} - \bar X\mathbf{1} \ne \mathbf{0}$ and $(P_X - P_1)X\beta = \mathbf{0}$ if and only if $\beta_1 = 0$. Hence, the null hypothesis is $H_0: \beta_1 = 0$. Note that
$$\delta^2 = \frac{1}{\sigma^2}\beta_1^2 \sum_{i=1}^n (X_i - \bar X)^2.$$
Maximize the power of the F-test for $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \ne 0$ by maximizing $\delta^2 = \frac{1}{\sigma^2}\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2$, i.e., by spreading out the values of the $X_i$'s.

Reparameterize the model:
$$Y_i = \alpha + \beta_1(X_i - \bar X) + \epsilon_i, \qquad \epsilon_i \sim NID(0, \sigma^2), \quad i = 1, \ldots, n.$$

Interpretation of parameters: $\alpha = E(Y)$ when $X = \bar X$, and $\beta_1$ is the change in $E(Y)$ when $X$ is increased by one unit.

Matrix formulation:
$$\begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 - \bar X \\ \vdots & \vdots \\ 1 & X_n - \bar X \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}, \qquad \text{or} \qquad \mathbf{Y} = W\gamma + \epsilon.$$
Note that
$$W = X\begin{bmatrix} 1 & -\bar X \\ 0 & 1 \end{bmatrix} = XF \qquad \text{and} \qquad X = W\begin{bmatrix} 1 & \bar X \\ 0 & 1 \end{bmatrix} = WG.$$

For this reparameterization, the columns of $W$ are orthogonal:
$$W^TW = \begin{bmatrix} n & 0 \\ 0 & \sum (X_i - \bar X)^2 \end{bmatrix}, \qquad (W^TW)^{-1} = \begin{bmatrix} \frac{1}{n} & 0 \\ 0 & \frac{1}{\sum (X_i - \bar X)^2} \end{bmatrix}, \qquad W^T\mathbf{Y} = \begin{bmatrix} \sum Y_i \\ \sum (X_i - \bar X)Y_i \end{bmatrix}.$$
Then
$$\hat\gamma = \begin{bmatrix} \hat\alpha \\ \hat\beta_1 \end{bmatrix} = (W^TW)^{-1}W^T\mathbf{Y} = \begin{bmatrix} \bar Y \\[4pt] \dfrac{\sum (X_i - \bar X)Y_i}{\sum (X_i - \bar X)^2} \end{bmatrix}$$
and
$$Var(\hat\gamma) = \sigma^2(W^TW)^{-1} = \begin{bmatrix} \frac{\sigma^2}{n} & 0 \\ 0 & \frac{\sigma^2}{\sum (X_i - \bar X)^2} \end{bmatrix}.$$
Hence, $\bar Y$ and $\hat\beta_1 = \sum (X_i - \bar X)Y_i / \sum (X_i - \bar X)^2$ are uncorrelated (independent for the normal-theory Gauss-Markov model).

Analysis of variance: the reparameterization does not change the ANOVA table. Note that
$$P_X = X(X^TX)^{-1}X^T = W(W^TW)^{-1}W^T = P_W$$
and
$$\mathbf{Y}^T\mathbf{Y} = R(\beta_0) + R(\beta_1|\beta_0) + SSE = \mathbf{Y}^TP_1\mathbf{Y} + \mathbf{Y}^T(P_X - P_1)\mathbf{Y} + \mathbf{Y}^T(I - P_X)\mathbf{Y} = \mathbf{Y}^TP_1\mathbf{Y} + \mathbf{Y}^T(P_W - P_1)\mathbf{Y} + \mathbf{Y}^T(I - P_W)\mathbf{Y} = R(\alpha) + R(\beta_1|\alpha) + SSE.$$
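A quick numerical check of this reparameterization, as a sketch with the same made-up data used earlier: the centered model matrix has orthogonal columns, projects onto the same space as $X$, and returns $(\bar Y, \hat\beta_1)$.

   # Sketch: centered reparameterization (made-up data)
   x <- c(1, 2, 3, 4, 5)
   y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
   X <- cbind(1, x)                         # original parameterization
   W <- cbind(1, x - mean(x))               # centered parameterization
   crossprod(W)                             # diagonal: columns of W are orthogonal
   PX <- X %*% solve(crossprod(X)) %*% t(X)
   PW <- W %*% solve(crossprod(W)) %*% t(W)
   all.equal(PX, PW)                        # TRUE: same projection, same ANOVA table
   solve(crossprod(W), crossprod(W, y))     # (alpha-hat, beta1-hat) = (ybar, slope)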
7.2 Multiple regression analysis for the normal-theory Gauss-Markov model

Model:
$$Y_i = \beta_0 + \beta_1X_{1i} + \cdots + \beta_rX_{ri} + \epsilon_i, \qquad \epsilon_i \sim NID(0, \sigma^2), \quad i = 1, \ldots, n.$$

Matrix formulation: $\mathbf{Y} = X\beta + \epsilon$ with $\epsilon \sim N(\mathbf{0}, \sigma^2 I)$, where
$$X = \begin{bmatrix} 1 & X_{11} & X_{21} & \cdots & X_{r1} \\ 1 & X_{12} & X_{22} & \cdots & X_{r2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{1n} & X_{2n} & \cdots & X_{rn} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r \end{bmatrix},$$
and the columns of $X$ are $\mathbf{1}, \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_r$.

Suppose $rank(X) = r + 1$. Then
(i) the OLS estimator (b.l.u.e.) for $\beta$ is $b = (X^TX)^{-1}X^T\mathbf{Y}$;
(ii) $Var(b) = \sigma^2(X^TX)^{-1}$;
(iii) $\hat{\mathbf{Y}} = Xb = X(X^TX)^{-1}X^T\mathbf{Y} = P_X\mathbf{Y}$;
(iv) $e = \mathbf{Y} - \hat{\mathbf{Y}} = (I - P_X)\mathbf{Y}$;
(v) by Result 4.7, $\frac{1}{\sigma^2}SSE = \frac{1}{\sigma^2}e^Te = \frac{1}{\sigma^2}\mathbf{Y}^T(I - P_X)\mathbf{Y} \sim \chi^2_{(n-r-1)}$;
(vi) $MSE = \frac{SSE}{n-r-1}$ is an unbiased estimator of $\sigma^2$.

ANOVA:

Source of variation                     d.f.     Sum of squares
Model (regression on X1, ..., Xr)       r        $R(\beta_1, \ldots, \beta_r|\beta_0) = \mathbf{Y}^T(P_X - P_1)\mathbf{Y}$
Error (or residuals)                    n-r-1    $\mathbf{Y}^T(I - P_X)\mathbf{Y}$
Corrected total                         n-1      $\mathbf{Y}^T(I - P_1)\mathbf{Y}$
Correction for the mean                 1        $R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y} = n\bar Y^2$

The reduction in the residual sum of squares obtained by regression on $X_1, X_2, \ldots, X_r$ is denoted
$$R(\beta_1, \beta_2, \ldots, \beta_r|\beta_0) = \mathbf{Y}^T(I - P_1)\mathbf{Y} - \mathbf{Y}^T(I - P_X)\mathbf{Y} = \mathbf{Y}^T(P_X - P_1)\mathbf{Y} = SS_{model}.$$

Use Cochran's theorem or Results 4.7 and 4.8 to show that $SSE$ is distributed independently of $R(\beta_1, \ldots, \beta_r|\beta_0)$ and that
$$\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-r-1)} \qquad \text{and} \qquad \frac{1}{\sigma^2}R(\beta_1, \ldots, \beta_r|\beta_0) \sim \chi^2_{(r)}(\delta^2).$$
Then
$$F = \frac{R(\beta_1, \ldots, \beta_r|\beta_0)/r}{MSE} \sim F_{(r, n-r-1)}(\delta^2),$$
where
$$\delta^2 = \frac{1}{\sigma^2}\beta^TX^T(P_X - P_1)X\beta = \frac{1}{\sigma^2}\beta^TX^T(P_X - I + I - P_1)X\beta = \frac{1}{\sigma^2}\left[\beta^TX^T(I - P_1)X\beta - \beta^TX^T(I - P_X)X\beta\right].$$
Since $(I - P_X)X$ is a matrix of zeros,
$$\delta^2 = \frac{1}{\sigma^2}\beta^TX^T(I - P_1)X\beta = \frac{1}{\sigma^2}\beta^TX^T(I - P_1)(I - P_1)X\beta = \frac{1}{\sigma^2}[(I - P_1)X\beta]^T(I - P_1)X\beta.$$
Here
$$(I - P_1)X = [\,(I - P_1)\mathbf{1}\ \ (I - P_1)\mathbf{X}_1\ \cdots\ (I - P_1)\mathbf{X}_r\,] = [\,\mathbf{0}\ \ \mathbf{X}_1 - \bar X_1\mathbf{1}\ \cdots\ \mathbf{X}_r - \bar X_r\mathbf{1}\,],$$
so
$$\delta^2 = \frac{1}{\sigma^2}\left[\sum_{j=1}^r \beta_j^2(\mathbf{X}_j - \bar X_j\mathbf{1})^T(\mathbf{X}_j - \bar X_j\mathbf{1}) + \sum_{j \ne k} \beta_j\beta_k(\mathbf{X}_j - \bar X_j\mathbf{1})^T(\mathbf{X}_k - \bar X_k\mathbf{1})\right] = \frac{1}{\sigma^2}\tilde\beta^T\left[\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T\right]\tilde\beta,$$
where $\mathbf{x}_i = (X_{1i}, \ldots, X_{ri})^T$, $\bar{\mathbf{x}} = (\bar X_1, \ldots, \bar X_r)^T$, and $\tilde\beta = (\beta_1, \ldots, \beta_r)^T$.

If $\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$ is positive definite, then the null hypothesis corresponding to $\delta^2 = 0$ is
$$H_0: \tilde\beta = \mathbf{0} \quad (\text{or } \beta_1 = \beta_2 = \cdots = \beta_r = 0).$$
Reject $H_0: \tilde\beta = \mathbf{0}$ if
$$F = \frac{\mathbf{Y}^T(P_X - P_1)\mathbf{Y}/r}{\mathbf{Y}^T(I - P_X)\mathbf{Y}/(n - r - 1)} > F_{(r, n-r-1), \alpha}.$$

Sequential sums of squares (Type I sums of squares in PROC GLM or PROC REG in SAS). Let
$$X_0 = \mathbf{1}, \quad X_1 = [\,\mathbf{1}\,|\,\mathbf{X}_1\,], \quad X_2 = [\,\mathbf{1}\,|\,\mathbf{X}_1\,|\,\mathbf{X}_2\,], \quad \ldots, \quad X_r = [\,\mathbf{1}\,|\,\mathbf{X}_1\,|\cdots|\,\mathbf{X}_r\,]$$
and, for $j = 0, 1, \ldots, r$, let $P_j = X_j(X_j^TX_j)^{-1}X_j^T$ (so here $P_0$ is the projection onto the column space of $\mathbf{1}$). Then
$$\mathbf{Y}^T\mathbf{Y} = \mathbf{Y}^TP_0\mathbf{Y} + \mathbf{Y}^T(P_1 - P_0)\mathbf{Y} + \mathbf{Y}^T(P_2 - P_1)\mathbf{Y} + \cdots + \mathbf{Y}^T(P_r - P_{r-1})\mathbf{Y} + \mathbf{Y}^T(I - P_r)\mathbf{Y}$$
$$= R(\beta_0) + R(\beta_1|\beta_0) + R(\beta_2|\beta_0, \beta_1) + \cdots + R(\beta_r|\beta_0, \beta_1, \ldots, \beta_{r-1}) + SSE.$$

Use Cochran's theorem to show that these sums of squares are distributed independently of each other and that each $\frac{1}{\sigma^2}R(\beta_i|\beta_0, \ldots, \beta_{i-1})$ has a chi-squared distribution with one degree of freedom. Use Result 4.7 to show $\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-r-1)}$. Then
$$F = \frac{R(\beta_j|\beta_0, \ldots, \beta_{j-1})/1}{MSE} \sim F_{(1, n-r-1)}(\delta^2),$$
where
$$\delta^2 = \frac{1}{\sigma^2}\beta^TX^T(P_j - P_{j-1})X\beta = \frac{1}{\sigma^2}\beta^TX^T(P_j - P_{j-1})^T(P_j - P_{j-1})X\beta = \frac{1}{\sigma^2}[(P_j - P_{j-1})X\beta]^T(P_j - P_{j-1})X\beta.$$
Hence, this is a test of $H_0: (P_j - P_{j-1})X\beta = \mathbf{0}$ vs. $H_A: (P_j - P_{j-1})X\beta \ne \mathbf{0}$. Note that
$$(P_j - P_{j-1})X = [\,(P_j - P_{j-1})\mathbf{1}\ \ (P_j - P_{j-1})\mathbf{X}_1\ \cdots\ (P_j - P_{j-1})\mathbf{X}_r\,] = [\,O_{n \times j}\ \ (P_j - P_{j-1})\mathbf{X}_j\ \cdots\ (P_j - P_{j-1})\mathbf{X}_r\,],$$
because the first $j$ columns of $X$ lie in the column spaces of both $X_{j-1}$ and $X_j$. Then
$$(P_j - P_{j-1})X\beta = \beta_j(P_j - P_{j-1})\mathbf{X}_j + \sum_{k=j+1}^r \beta_k(P_j - P_{j-1})\mathbf{X}_k,$$
and the null hypothesis is
$$H_0: \mathbf{0} = \beta_j(P_j - P_{j-1})\mathbf{X}_j + \sum_{k=j+1}^r \beta_k(P_j - P_{j-1})\mathbf{X}_k.$$

Type II sums of squares in SAS (these are also the Type III and Type IV sums of squares for regression problems). From the previous discussion,
$$R(\beta_j|\beta_0 \text{ and all other } \beta_k\text{'s}) = \mathbf{Y}^T(P_X - P_{-j})\mathbf{Y},$$
where $P_{-j} = X_{-j}(X_{-j}^TX_{-j})^{-1}X_{-j}^T$ and $X_{-j}$ is obtained by deleting the $(j+1)$-th column of $X$. Then
$$F = \frac{\mathbf{Y}^T(P_X - P_{-j})\mathbf{Y}/1}{MSE} \sim F_{(1, n-r-1)}(\delta^2), \qquad \delta^2 = \frac{1}{\sigma^2}\beta^TX^T(P_X - P_{-j})X\beta = \frac{1}{\sigma^2}\beta_j^2\mathbf{X}_j^T(P_X - P_{-j})\mathbf{X}_j.$$
This F-test provides a test of $H_0: \beta_j = 0$ vs. $H_A: \beta_j \ne 0$ if $(P_X - P_{-j})\mathbf{X}_j \ne \mathbf{0}$.

Variable     Type I sums of squares                                                          Type II sums of squares
X1           $R(\beta_1|\beta_0) = \mathbf{Y}^T(P_1 - P_0)\mathbf{Y}$                        $R(\beta_1|\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X - P_{-1})\mathbf{Y}$
X2           $R(\beta_2|\beta_0, \beta_1) = \mathbf{Y}^T(P_2 - P_1)\mathbf{Y}$               $R(\beta_2|\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X - P_{-2})\mathbf{Y}$
...          ...                                                                             ...
Xr           $R(\beta_r|\beta_0, \ldots, \beta_{r-1}) = \mathbf{Y}^T(P_r - P_{r-1})\mathbf{Y}$   $R(\beta_r|\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X - P_{-r})\mathbf{Y}$
Residuals        $SSE = \mathbf{Y}^T(I - P_X)\mathbf{Y}$
Corrected total  $\mathbf{Y}^T(I - P_1)\mathbf{Y}$

When $X_1, X_2, \ldots, X_r$ are all uncorrelated:
(i) $R(\beta_j|\beta_0 \text{ and any other subset of } \beta\text{'s}) = R(\beta_j|\beta_0)$, and there is only one ANOVA table;
(ii) $R(\beta_j|\beta_0) = \hat\beta_j^2 \sum_{i=1}^n (X_{ji} - \bar X_{j\cdot})^2$;
(iii) $F = \frac{R(\beta_j|\beta_0)}{MSE} \sim F_{(1, n-r-1)}(\delta^2)$, where $\delta^2 = \frac{1}{\sigma^2}\beta_j^2 \sum_{i=1}^n (X_{ji} - \bar X_{j\cdot})^2$, and this F-statistic provides a test of $H_0: \beta_j = 0$ versus $H_A: \beta_j \ne 0$.

Testable hypotheses: for any testable hypothesis, reject $H_0: C\beta = \mathbf{d}$ in favor of the general alternative $H_A: C\beta \ne \mathbf{d}$ if
$$F = \frac{(Cb - \mathbf{d})^T[C(X^TX)^-C^T]^{-1}(Cb - \mathbf{d})/m}{\mathbf{Y}^T(I - P_X)\mathbf{Y}/(n - rank(X))} > F_{(m, n - rank(X)), \alpha},$$
where $m$ = number of rows in $C$ = $rank(C)$ and $b = (X^TX)^-X^T\mathbf{Y}$.

Confidence interval for an estimable function $\mathbf{c}^T\beta$:
$$\mathbf{c}^Tb \pm t_{(n - rank(X)), \alpha/2}\sqrt{MSE\;\mathbf{c}^T(X^TX)^-\mathbf{c}}.$$
Use $\mathbf{c}^T = (0, \ldots, 0, 1, 0, \ldots, 0)$, with the 1 in the $(j+1)$-th position, to construct a confidence interval for $\beta_j$. Use $\mathbf{c}^T = (1, x_1, x_2, \ldots, x_r)$ to construct a confidence interval for
$$E(Y|X_1 = x_1, \ldots, X_r = x_r) = \beta_0 + \beta_1x_1 + \cdots + \beta_rx_r.$$

Prediction intervals: predict a future observation at $X_1 = x_1, \ldots, X_r = x_r$, i.e., predict
$$Y = \beta_0 + \beta_1x_1 + \cdots + \beta_rx_r + \epsilon.$$
Estimate the conditional mean $\beta_0 + \beta_1x_1 + \cdots + \beta_rx_r$ with $b_0 + b_1x_1 + \cdots + b_rx_r$, and estimate $\epsilon$ with its mean $E(\epsilon) = 0$. A $(1 - \alpha) \times 100\%$ prediction interval is
$$(\mathbf{c}^Tb + 0) \pm t_{(n - rank(X)), \alpha/2}\sqrt{MSE\,[1 + \mathbf{c}^T(X^TX)^-\mathbf{c}]}, \qquad \mathbf{c}^T = (1, x_1, \ldots, x_r).$$
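The projection algebra above translates directly into code; the typeII.SS( ) function in the S-PLUS session at the end of this chapter does exactly this computation for the cement data. As a compact sketch with made-up data:

   # Sketch: Type I and Type II sums of squares via nested projections
   set.seed(1)
   n  <- 10
   x1 <- rnorm(n); x2 <- rnorm(n)
   y  <- 1 + 2*x1 + 0.5*x2 + rnorm(n)          # made-up response
   proj <- function(X) X %*% solve(crossprod(X)) %*% t(X)
   P0 <- proj(matrix(1, n, 1))                 # projection for X0 = 1
   P1 <- proj(cbind(1, x1))                    # projection for X1 = [1 | x1]
   P2 <- proj(cbind(1, x1, x2))                # projection for X2 = [1 | x1 | x2]
   drop(t(y) %*% (P1 - P0) %*% y)              # Type I:  R(beta1 | beta0)
   drop(t(y) %*% (P2 - P1) %*% y)              # Type I:  R(beta2 | beta0, beta1)
   Pm1 <- proj(cbind(1, x2))                   # delete the x1 column from X
   drop(t(y) %*% (P2 - Pm1) %*% y)             # Type II: R(beta1 | beta0, beta2)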
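The testable-hypothesis F-statistic and the interval formulas can be assembled the same way. Continuing the sketch above (the matrix Cmat and the prediction point cc are arbitrary choices for illustration):

   # Sketch: general linear hypothesis, confidence and prediction intervals
   X    <- cbind(1, x1, x2)
   b    <- solve(crossprod(X), crossprod(X, y))
   MSE  <- drop(t(y) %*% (diag(n) - proj(X)) %*% y) / (n - ncol(X))
   Cmat <- rbind(c(0, 1, 0), c(0, 0, 1))       # H0: beta1 = beta2 = 0
   d    <- c(0, 0); m <- nrow(Cmat)
   Fstat <- drop(t(Cmat %*% b - d) %*%
            solve(Cmat %*% solve(crossprod(X)) %*% t(Cmat)) %*%
            (Cmat %*% b - d)) / m / MSE        # compare to F(m, n - rank(X))
   cc  <- c(1, 0.5, -1)                        # c' = (1, x1, x2) at a new point
   est <- drop(crossprod(cc, b))
   v   <- drop(t(cc) %*% solve(crossprod(X)) %*% cc)
   est + c(-1, 1) * qt(0.975, n - ncol(X)) * sqrt(MSE * v)        # 95% CI for c'beta
   est + c(-1, 1) * qt(0.975, n - ncol(X)) * sqrt(MSE * (1 + v))  # 95% prediction interval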
/* A SAS program to perform a regression analysis of
   the effects of the composition of Portland cement
   on the amount of heat given off as the cement
   hardens. Posted as cement.sas */

data set1;
  input run x1 x2 x3 x4 y;
  /* label y  = evolved heat (calories)
           x1 = tricalcium aluminate
           x2 = tricalcium silicate
           x3 = tetracalcium aluminate ferrate
           x4 = dicalcium silicate; */
cards;
 1  7 26  6 60  78.5
 2  1 29 15 52  74.3
 3 11 56  8 20 104.3
 4 11 31  8 47  87.6
 5  7 52  6 33  95.9
 6 11 55  9 22 109.2
 7  3 71 17  6 102.7
 8  1 31 22 44  72.5
 9  2 54 18 22  93.1
10 21 47  4 26 115.9
11  1 40 23 34  83.8
12 11 66  9 12 113.2
13 10 68  8 12 109.4
run;

/* Regress y on all four explanatory variables and
   check residual plots and collinearity diagnostics */

proc reg data=set1 corr;
  model y = x1 x2 x3 x4 / p r ss1 ss2 covb collin;
  output out=set2 residual=r predicted=yhat;
run;

proc print data=set1 uniform split='*';
  var y x1 x2 x3 x4;
  label y  = 'Evolved*heat*(calories)'
        x1 = 'Percent*tricalcium*aluminate'
        x2 = 'Percent*tricalcium*silicate'
        x3 = 'Percent*tetracalcium*aluminate*ferrate'
        x4 = 'Percent*dicalcium*silicate';
run;

/* Examine smaller regression models corresponding
   to subsets of the explanatory variables */

proc reg data=set1;
  model y = x1 x2 x3 x4 / selection=rsquare cp aic sbc mse stop=4 best=6;
run;

/* Regress y on two of the explanatory variables and
   check residual plots and collinearity diagnostics */

proc reg data=set1 corr;
  model y = x1 x2 / p r ss1 ss2 covb collin;
  output out=set2 residual=r predicted=yhat;
run;

/* Use the GLM procedure to identify all estimable functions */

proc glm data=set1;
  model y = x1 x2 x3 x4 / ss1 ss2 e1 e2 e p;
run;

Output from the proc print step:

        Evolved      Percent      Percent      Percent        Percent
        heat         tricalcium   tricalcium   tetracalcium   dicalcium
Obs     (calories)   aluminate    silicate     aluminate      silicate
                                               ferrate
  1        78.5          7           26            6             60
  2        74.3          1           29           15             52
  3       104.3         11           56            8             20
  4        87.6         11           31            8             47
  5        95.9          7           52            6             33
  6       109.2         11           55            9             22
  7       102.7          3           71           17              6
  8        72.5          1           31           22             44
  9        93.1          2           54           18             22
 10       115.9         21           47            4             26
 11        83.8          1           40           23             34
 12       113.2         11           66            9             12
 13       109.4         10           68            8             12

Correlation (from the corr option):

Variable       x1        x2        x3        x4         y
x1         1.0000    0.2286   -0.8241   -0.2454    0.7309
x2         0.2286    1.0000   -0.1392   -0.9730    0.8162
x3        -0.8241   -0.1392    1.0000    0.0295   -0.5348
x4        -0.2454   -0.9730    0.0295    1.0000   -0.8212
y          0.7309    0.8162   -0.5348   -0.8212    1.0000
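The collin option in the model statements above requests the eigenvalue and condition-index diagnostics shown in the REG output that follows. As a rough cross-check outside SAS, here is a sketch (assuming the cement data frame created in the S-PLUS session at the end of this chapter, and assuming SAS's convention of scaling each column of the model matrix, intercept included, to unit length before extracting eigenvalues); it should agree with the printed diagnostics up to rounding:

   # Sketch: condition indices for the cement model matrix
   X  <- cbind(1, as.matrix(cement[, c("X1", "X2", "X3", "X4")]))
   Z  <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))  # unit-length columns
   ev <- eigen(crossprod(Z), symmetric = TRUE)$values
   cbind(eigenvalue = ev, condition.index = sqrt(ev[1] / ev))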
The REG Procedure
Model: MODEL1
Dependent Variable: y

Analysis of Variance

Source             DF   Sum of Squares   Mean Square
Model               4       2664.52051     666.13013
Error               8         47.67641       5.95955
Corrected Total    12       2712.19692

Root MSE            2.44122    R-Square    0.9824
Dependent Mean     95.41538    Adj R-Sq    0.9736
Coeff Var           2.55852

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             63.16602         69.93378      0.90     0.3928
x1           1              1.54305          0.74331      2.08     0.0716
x2           1              0.50200          0.72237      0.69     0.5068
x3           1              0.09419          0.75323      0.13     0.9036
x4           1             -0.15152          0.70766     -0.21     0.8358

Variable    DF      Type I SS   Type II SS
Intercept    1         118353      4.86191
x1           1     1448.75413     25.68225
x2           1     1205.70283      2.87801
x3           1        9.79033      0.09319
x4           1        0.27323      0.27323

Collinearity Diagnostics

Number    Eigenvalue   Condition Index
1            4.11970           1.00000
2            0.55389           2.72721
3            0.28870           3.77753
4            0.03764          10.46207
5         0.00006614         249.57825

Collinearity Diagnostics
          ---------------Proportion of Variation---------------
Number    Intercept         x1        x2        x3        x4
1          0.000005    0.00037   0.00002   0.00021   0.00036
2          8.812E-8    0.01004   0.00001   0.00266   0.00010
3          3.060E-7   0.000581   0.00032   0.00159   0.00168
4          0.000127    0.05745   0.00278   0.04569   0.00088
5           0.99987    0.93157   0.99687   0.94985   0.99730

Output Statistics

         Dep Var    Predicted     Std Error      Student
Obs         y         Value     Mean Predict    Residual   Cook's D
  1      78.5000     78.4929        1.8109       0.00432      0.000
  2      74.3000     72.8005        1.4092       0.752        0.057
  3     104.3000    105.9744        1.8543      -1.054        0.303
  4      87.6000     89.3333        1.3265      -0.846        0.060
  5      95.9000     95.6360        1.4598       0.135        0.002
  6     109.2000    105.2635        0.8602       1.723        0.084
  7     102.7000    104.1289        1.4791      -0.736        0.063
  8      72.5000     75.6760        1.5604      -1.692        0.395
  9      93.1000     91.7218        1.3244       0.672        0.038
 10     115.9000    115.6010        2.0431       0.224        0.023
 11      83.8000     81.8034        1.5924       1.079        0.172
 12     113.2000    112.3007        1.2519       0.429        0.013
 13     109.4000    111.6675        1.3454      -1.113        0.108

(The character plot of the student residuals on a -2 to 2 scale, printed between the Student Residual and Cook's D columns, is omitted here.)

The REG Procedure
Model: MODEL1
R-Square Selection Method

Regression Models for Dependent Variable: y

Number in
Model     R-Square        AIC         SBC    Variables in Model
 1          0.6744     58.8383    59.96815   x4
 1          0.6661     59.1672    60.29712   x2
 1          0.5342     63.4964    64.62630   x1
 1          0.2860     69.0481    70.17804   x3
------------------------------------------------------
 2          0.9787     25.3830    27.07785   x1 x2
 2          0.9726     28.6828    30.37766   x1 x4
 2          0.9353     39.8308    41.52565   x3 x4
 2          0.8470     51.0247    52.71951   x2 x3
 2          0.6799     60.6172    62.31201   x2 x4
 2          0.5484     65.0933    66.78816   x1 x3
------------------------------------------------------
 3          0.9824     24.9187    27.17852   x1 x2 x4
 3          0.9823     24.9676    27.22742   x1 x2 x3
 3          0.9814     25.6553    27.91511   x1 x3 x4
 3          0.9730     30.4953    32.75514   x2 x3 x4
------------------------------------------------------
 4          0.9824     26.8933    29.71808   x1 x2 x3 x4

This output was produced by the e option in the model statement of the GLM procedure. It indicates that all five regression parameters are estimable.

The GLM Procedure
General Form of Estimable Functions

Effect       Coefficients
Intercept    L1
x1           L2
x2           L3
x3           L4
x4           L5

This output was produced by the e1 option in the model statement of the GLM procedure. It describes the null hypotheses that are tested with the sequential Type I sums of squares.

Type I Estimable Functions

             --------------------Coefficients--------------------
Effect       x1           x2           x3           x4
Intercept    0            0            0            0
x1           L2           0            0            0
x2           0.6047*L2    L3           0            0
x3           -0.8974*L2   0.0213*L3    L4           0
x4           -0.6984*L2   -1.0406*L3   -1.0281*L4   L5

Type II Estimable Functions

             ----Coefficients----
Effect       x1    x2    x3    x4
Intercept    0     0     0     0
x1           L2    0     0     0
x2           0     L3    0     0
x3           0     0     L4    0
x4           0     0     0     L5

> # The commands are posted as: cement.spl
> # The data file is stored under the name cement.dat.
> # It has variable names on the first line. We will
> # enter the data into a data frame.
> cement <- read.table("cement.txt", header=T)
> cement
   run X1 X2 X3 X4     Y
1    1  7 26  6 60  78.5
2    2  1 29 15 52  74.3
3    3 11 56  8 20 104.3
4    4 11 31  8 47  87.6
5    5  7 52  6 33  95.9
6    6 11 55  9 22 109.2
7    7  3 71 17  6 102.7
8    8  1 31 22 44  72.5
9    9  2 54 18 22  93.1
10  10 21 47  4 26 115.9
11  11  1 40 23 34  83.8
12  12 11 66  9 12 113.2
13  13 10 68  8 12 109.4

> # Compute correlations and round the results
> # to four significant digits
> round(cor(cement[-1]),4)
        X1      X2      X3      X4       Y
X1  1.0000  0.2286 -0.8241 -0.2454  0.7309
X2  0.2286  1.0000 -0.1392 -0.9730  0.8162
X3 -0.8241 -0.1392  1.0000  0.0295 -0.5348
X4 -0.2454 -0.9730  0.0295  1.0000 -0.8212
Y   0.7309  0.8162 -0.5348 -0.8212  1.0000
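A side note before continuing with the S-PLUS session: the AIC and SBC columns in the R-Square selection output above are computed from the error sum of squares as $n\log(SSE/n) + 2p$ and $n\log(SSE/n) + p\log(n)$, where $p$ is the number of estimated coefficients. A quick check for the full model, using the SSE from the REG output:

   # Check of the AIC/SBC formulas against the selection output above
   n <- 13; p <- 5; SSE <- 47.67641
   n * log(SSE/n) + 2 * p        # 26.8933, the AIC printed for x1 x2 x3 x4
   n * log(SSE/n) + p * log(n)   # 29.7181, the SBC printed for x1 x2 x3 x4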
Unix users should first use > # motif( ) to open a graphics wundow > points.lines <- function(x, y) + { + points(x, y) + lines(loess.smooth(x, y, 0.90)) + } > par(din=c(7,7),pch=18,mkh=.15,cex=1.2,lwd=3) > pairs(cement[ ,-1], panel=points.lines) 425 10 30 50 15 20 30 40 50 60 70 426 > cement.out <- lm(Y~X1+X2+X3+X4, cement) > summary(cement.out) 70 5 10 X1 > # Fit a linear regression model (Venables > # and Ripley, Chapter 6) 50 60 Call: lm(formula = Y ~ X1+X2+X3+X4, data=cement) Residuals: Min 1Q Median 3Q Max -3.176 -1.674 0.264 1.378 3.936 15 20 30 40 X2 40 50 60 5 10 X3 Value Std. Error t value Pr(>|t|) (Intercept) 63.1660 69.9338 0.9032 0.3928 X1 1.5431 0.7433 2.0759 0.0716 X2 0.5020 0.7224 0.6949 0.5068 X3 0.0942 0.7532 0.1250 0.9036 X4 -0.1515 0.7077 -0.2141 0.8358 100 110 10 20 30 X4 Coefficients: 80 90 Y 5 10 15 20 5 10 15 20 80 90 110 Residual standard error: 2.441 on 8 d.f. Multiple R-Squared: 0.9824 F-statistic: 111.8 on 4 and 8 degrees of freedom, the p-value is 4.707e-007 427 Correlation of (Intercept) X1 -0.9678 X2 -0.9978 X3 -0.9769 X4 -0.9983 Coefficients: X1 X2 X3 > > > > > 0.9510 0.9861 0.9624 0.9568 0.9979 0.9659 > anova(cement.out) # # # # # Create a function to evaluate an orthogonal projection matrix. Then create a function to compute type II sums of squares. This uses the ginv( ) function in the MASS library, so you must attach the MASS library > library(MASS) Analysis of Variance Table Response: Y Terms added sequentially (first Df Sum of Sq Mean Sq X1 1 1448.754 1448.754 X2 1 1205.703 1205.703 X3 1 9.790 9.790 X4 1 0.273 0.273 Residuals 8 47.676 5.960 to last) F Value 243.0978 202.3144 1.6428 0.0458 Pr(F) 0.0000 0.0000 0.2358 0.8358 > > > > > > + > #======================================= # project( ) #-------------# calculate orthogonal projection matrix #======================================= project <- function(X) { X%*%ginv(crossprod(X))%*%t(X) } #======================================= 428 > > > > > > > > > > > + + + + + + + + + + + + + + #======================================== # typeII.SS( ) #-----------------# calculate Type II sum of squares # # input lmout = object made by the # lm( ) function # y = dependent variable #======================================== typeII.SS <- function(lmout,y) { # generate the model matrix model <- model.matrix(lmout) # create list of parameter names par.name <- dimnames(model)[[2]] # compute number of parameters n.par <- dim(model)[2] # Compute residual mean square SS.res <- deviance(lmout) df2 <- lmout$df.resid MS.res <- SS.res/df2 430 429 + + + + + + + + + + + + + + + + + + + + + + > result <- NULL # store results # Compute Type II SS for (i in 1:n.par) { A <- project(model)-project(model[,-i]) SS.II <- t(y) %*% A %*% y df1 <- qr(project(model))$rank qr(project(model[ ,-i]))$rank MS.II <- SS.II/df1 F.stat <- MS.II/MS.res p.val <- 1-pf(F.stat,df1,df2) temp <- cbind(df1,SS.II,MS.II,F.stat,p.val) result <- rbind(result,temp) } result<-rbind(result,c(df2,SS.res,MS.res,NA,NA)) dimnames(result)<-list(c(par.name,"Residual"), c("Df","Sum of Sq","Mean Sq","F Value","Pr(F)")) cat("Analysis of Variance (TypeII Sum of Squares) \n") round(result,6) } #========================================== 431 > > > > > > typeII.SS(cement.out, cement$Y) Analysis of Df (Inter.) 
> typeII.SS(cement.out, cement$Y)
Analysis of Variance (TypeII Sum of Squares)
         Df  Sum of Sq   Mean Sq  F Value    Pr(F)
(Inter.)  1   4.861907  4.861907 0.815818 0.392790
X1        1  25.682254 25.682254 4.309427 0.071568
X2        1   2.878010  2.878010 0.482924 0.506779
X3        1   0.093191  0.093191 0.015637 0.903570
X4        1   0.273229  0.273229 0.045847 0.835810
Residual  8  47.676412  5.959551       NA       NA

> # Venables and Ripley have supplied functions
> # studres( ) and stdres( ) to compute studentized
> # and standardized residuals. You must attach the
> # MASS library before using these functions.
> cement.res <- cbind(cement$Y,cement.out$fitted,
+                     cement.out$resid,
+                     studres(cement.out),
+                     stdres(cement.out))
> dimnames(cement.res) <- list(cement$run,
+   c("Response","Predicted","Residual",
+     "Stud. Res.","Std. Res."))
> round(cement.res,4)
   Response Predicted Residual Stud. Res. Std. Res.
1      78.5   78.4929   0.0071     0.0040    0.0043
2      74.3   72.8005   1.4995     0.7299    0.7522
3     104.3  105.9744  -1.6744    -1.0630   -1.0545
4      87.6   89.3333  -1.7333    -0.8291   -0.8458
5      95.9   95.6360   0.2640     0.1264    0.1349
6     109.2  105.2635   3.9365     2.0324    1.7230
7     102.7  104.1289  -1.4289    -0.7128   -0.7358
8      72.5   75.6760  -3.1760    -1.9745   -1.6917
9      93.1   91.7218   1.3782     0.6472    0.6721
10    115.9  115.6010   0.2990     0.2100    0.2237
11     83.8   81.8034   1.9966     1.0919    1.0790
12    113.2  112.3007   0.8993     0.4061    0.4291
13    109.4  111.6675  -2.2675    -1.1326   -1.1131

> # Produce plots for model diagnostics including
> # Cook's D. Unix users should first use motif()
> # to open a graphics window
> par(mfrow=c(3,2))
> plot(cement.out)

[Figure: diagnostic plots from plot(cement.out) -- residuals vs. fitted values, sqrt(abs(residuals)) vs. fitted values, the response vs. fitted values, a normal Q-Q plot of the residuals, a residual-fit spread plot, and Cook's distance by case index; cases 6, 8, and 13 are flagged in several panels.]

> # Search for a simpler model
> cement.stp <- step(cement.out,
+   scope=list(upper = ~X1 + X2 + X3 + X4,
+              lower = ~ 1), trace=F)
> cement.stp$anova
Stepwise Model Path
Analysis of Deviance Table

Initial Model: Y ~ X1 + X2 + X3 + X4
Final Model:   Y ~ X1 + X2

  Step Df Deviance Resid. Df Resid. Dev      AIC
1                          8   47.67641 107.2719
2 - X3  1 0.093191         9   47.76960  95.4460
3 - X4  1 9.970363        10   57.73997  93.4973
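The stepwise search ends at Y ~ X1 + X2, the same two-variable model examined with PROC REG earlier. Note that the AIC column in this table is on the $SSE + 2p\,\hat\sigma^2$ scale used by the S-PLUS step( ) function (e.g., $47.676 + 2(5)(5.9596) \approx 107.27$), not the log-likelihood scale of the PROC REG selection output. As a closing sketch, the selected model can be refit directly:

   # Fit the model chosen by step( ) (added sketch)
   cement.final <- lm(Y ~ X1 + X2, cement)
   summary(cement.final)$r.squared   # about 0.9787, as in the selection output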