7. Regression Analysis

7.1 Simple linear regression for normal-theory Gauss-Markov models

Model 1:  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i \sim NID(0, \sigma^2)$ for $i = 1, \ldots, n$.

Matrix formulation:
\[
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\qquad\text{or}\qquad
\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon} .
\]

The OLS estimator (b.l.u.e.) for $\boldsymbol{\beta}$ is
\[
\mathbf{b} = (X^TX)^{-1}X^T\mathbf{Y} .
\]
(When does this inverse exist?)

Here
\[
X^TX = \begin{bmatrix} n & \sum_{i=1}^n X_i \\ \sum_{i=1}^n X_i & \sum_{i=1}^n X_i^2 \end{bmatrix},
\qquad
X^T\mathbf{Y} = \begin{bmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n X_iY_i \end{bmatrix}.
\]

Then
\[
(X^TX)^{-1}
= \frac{1}{n\sum_{i=1}^n X_i^2 - \big(\sum_{i=1}^n X_i\big)^2}
  \begin{bmatrix} \sum_{i=1}^n X_i^2 & -\sum_{i=1}^n X_i \\ -\sum_{i=1}^n X_i & n \end{bmatrix}
= \frac{1}{n\sum_{i=1}^n (X_i - \bar X)^2}
  \begin{bmatrix} \sum_{i=1}^n X_i^2 & -n\bar X \\ -n\bar X & n \end{bmatrix},
\]
so
\[
\mathbf{b} = (X^TX)^{-1}X^T\mathbf{Y}
= \frac{1}{n\sum_{i=1}^n (X_i - \bar X)^2}
  \begin{bmatrix}
    \big(\sum_{i=1}^n X_i^2\big)\big(\sum_{i=1}^n Y_i\big) - n\bar X\sum_{i=1}^n X_iY_i \\[4pt]
    -n\bar X\sum_{i=1}^n Y_i + n\sum_{i=1}^n X_iY_i
  \end{bmatrix}
\]
and
\[
\mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}
= \begin{bmatrix}
    \bar Y - b_1\bar X \\[4pt]
    \dfrac{\sum_{i=1}^n (X_i - \bar X)Y_i}{\sum_{i=1}^n (X_i - \bar X)^2}
  \end{bmatrix}.
\]

Covariance matrix:
\[
Var(\mathbf{b}) = Var\big((X^TX)^{-1}X^T\mathbf{Y}\big)
= (X^TX)^{-1}X^T(\sigma^2 I)X(X^TX)^{-1}
= \sigma^2(X^TX)^{-1}
= \sigma^2
  \begin{bmatrix}
    \frac{1}{n} + \frac{\bar X^2}{\sum_i(X_i-\bar X)^2} & \frac{-\bar X}{\sum_i(X_i-\bar X)^2} \\[4pt]
    \frac{-\bar X}{\sum_i(X_i-\bar X)^2} & \frac{1}{\sum_i(X_i-\bar X)^2}
  \end{bmatrix}.
\]
Estimate the covariance matrix for $\mathbf{b}$ as $S_b = MSE\,(X^TX)^{-1}$, where
\[
MSE = SSE/(n-2) = \tfrac{1}{n-2}\,\mathbf{Y}^T(I - P_X)\mathbf{Y}.
\]

Analysis of variance:
\[
\sum_{i=1}^n Y_i^2 = \mathbf{Y}^T\mathbf{Y}
= \mathbf{Y}^T(I - P_X + P_X - P_1 + P_1)\mathbf{Y}
= \underbrace{\mathbf{Y}^T(I-P_X)\mathbf{Y}}_{SSE}
  + \underbrace{\mathbf{Y}^T(P_X-P_1)\mathbf{Y}}_{\text{``corrected model'' SS, call this } R(\beta_1\mid\beta_0)}
  + \underbrace{\mathbf{Y}^TP_1\mathbf{Y}}_{\text{correction for the ``mean'', call this } R(\beta_0)}.
\]

(i) By Cochran's theorem, these three sums of squares are independent chi-squared random variables.
(ii) By Result 4.7, $\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-2)}$ if the model is correctly specified.

Reduction in residual sum of squares:
\[
R(\beta_{k+1},\ldots,\beta_{k+q}\mid\beta_0,\beta_1,\ldots,\beta_k)
= \underbrace{\mathbf{Y}^T(I-P_{X_1})\mathbf{Y}}_{\text{SS residuals, smaller model}}
  - \underbrace{\mathbf{Y}^T(I-P_X)\mathbf{Y}}_{\text{SS residuals, larger model}},
\]
where $X = [\,X_1 \mid X_2\,]$, the columns of $X_1$ correspond to $\beta_0,\beta_1,\ldots,\beta_k$, and the columns of $X_2$ correspond to $\beta_{k+1},\ldots,\beta_{k+q}$.

Correction for the overall mean:
\[
R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y}
= \mathbf{Y}^T(I - I + P_1)\mathbf{Y}
= \mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T(I-P_1)\mathbf{Y}
= \sum_{i=1}^n(Y_i - 0)^2 - \underbrace{\sum_{i=1}^n(Y_i - \bar Y)^2}_{\substack{\text{SS residuals from fitting}\\ Y_i = \mu + \varepsilon_i}} .
\]
The OLS estimator for $\mu = E(Y_i)$ is $\hat\mu = (\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T\mathbf{Y} = \bar Y$, so $P_1\mathbf{Y} = \bar Y\mathbf{1}$. Note that
\[
R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y}
= \mathbf{Y}^T\mathbf{1}(\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T\mathbf{Y}
= \Big(\sum_{i=1}^n Y_i\Big)(n)^{-1}\Big(\sum_{i=1}^n Y_i\Big)
= n\bar Y^2 .
\]

Reduction in the residual sum of squares for regression on $X_i$:
\[
R(\beta_1\mid\beta_0) = \mathbf{Y}^T(P_X - P_1)\mathbf{Y}
= \mathbf{Y}^T(-I + P_X + I - P_1)\mathbf{Y}
= \underbrace{\mathbf{Y}^T(I-P_1)\mathbf{Y}}_{\substack{\text{SS residuals from fitting}\\ Y_i=\mu+\varepsilon_i}}
  - \underbrace{\mathbf{Y}^T(I-P_X)\mathbf{Y}}_{\substack{\text{SS residuals from fitting}\\ Y_i=\beta_0+\beta_1X_i+\varepsilon_i}} .
\]

ANOVA table:

  Source of variation    d.f.    Sum of squares
  Regression on X         1      $\mathbf{Y}^T(P_X-P_1)\mathbf{Y}$
  Residuals              n-2     $\mathbf{Y}^T(I-P_X)\mathbf{Y}$
  Corrected total        n-1     $\mathbf{Y}^T(I-P_1)\mathbf{Y}$
  Correction for mean     1      $\mathbf{Y}^TP_1\mathbf{Y} = n\bar Y^2$

Consider the following F-tests. From Result 4.7 we have
\[
\frac{1}{\sigma^2}R(\beta_0) = \frac{1}{\sigma^2}\mathbf{Y}^TP_1\mathbf{Y} \sim \chi^2_1(\delta^2),
\qquad\text{where}\qquad
\delta^2 = \frac{1}{2\sigma^2}\boldsymbol{\beta}^TX^TP_1X\boldsymbol{\beta}
= \frac{1}{2\sigma^2}(P_1X\boldsymbol{\beta})^T(P_1X\boldsymbol{\beta})
= \frac{n}{2\sigma^2}(\beta_0 + \beta_1\bar X)^2 .
\]
Hypothesis test: reject $H_0\!:\ \beta_0 + \beta_1\bar X = 0$ if
\[
F = \frac{R(\beta_0)}{MSE} > F_{(1,\,n-2),\,\alpha} .
\]

Test the null hypothesis $H_0\!:\ \beta_1 = 0$:
\[
F = \frac{R(\beta_1\mid\beta_0)/1}{MSE}
= \frac{\mathbf{Y}^T(P_X-P_1)\mathbf{Y}/1}{\mathbf{Y}^T(I-P_X)\mathbf{Y}/(n-2)}
\sim F_{(1,\,n-2)}(\delta^2),
\]
where
\[
\delta^2 = \frac{1}{2\sigma^2}\boldsymbol{\beta}^TX^T(P_X-P_1)X\boldsymbol{\beta}
= \frac{1}{2\sigma^2}\boldsymbol{\beta}^TX^T(P_X-P_1)^T(P_X-P_1)X\boldsymbol{\beta}.
\]
The null hypothesis actually being tested is $H_0\!:\ (P_X-P_1)X\boldsymbol{\beta} = \mathbf{0}$. Here
\[
(P_X-P_1)X = (P_X-P_1)[\,\mathbf{1}\mid\mathbf{X}\,]
= [\,P_X\mathbf{1} - P_1\mathbf{1} \mid P_X\mathbf{X} - P_1\mathbf{X}\,]
= [\,\mathbf{1}-\mathbf{1} \mid \mathbf{X} - \bar X\mathbf{1}\,]
= \begin{bmatrix} 0 & X_1 - \bar X \\ 0 & X_2 - \bar X \\ \vdots & \vdots \\ 0 & X_n - \bar X \end{bmatrix}.
\]
If any $X_i \neq X_j$, then $(P_X-P_1)X\boldsymbol{\beta} = \mathbf{0}$ if and only if $\beta_1 = 0$. Hence the null hypothesis is $H_0\!:\ \beta_1 = 0$. Note that
\[
\delta^2 = \frac{\beta_1^2}{2\sigma^2}\sum_{i=1}^n(X_i - \bar X)^2 .
\]
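As a quick numerical companion to these formulas (an aside, not part of the original notes), the R sketch below computes $b_0$, $b_1$, $R(\beta_1\mid\beta_0)$, $MSE$, and the F statistic directly for a small simulated data set and checks them against lm(). The simulated data and all object names are illustrative.

# Minimal sketch (simulated data; names are illustrative):
# simple linear regression computed "by hand" and checked against lm()
set.seed(10)
n <- 25
X <- runif(n, 0, 4)
Y <- 1 + 2*X + rnorm(n, sd = 0.5)

Sxx <- sum((X - mean(X))^2)
b1  <- sum((X - mean(X)) * Y) / Sxx        # b1 = sum(Xi - Xbar)Yi / sum(Xi - Xbar)^2
b0  <- mean(Y) - b1 * mean(X)              # b0 = Ybar - b1*Xbar

fits <- b0 + b1*X
SSE  <- sum((Y - fits)^2)                  # Y'(I - PX)Y
MSE  <- SSE / (n - 2)
R1.0 <- sum((fits - mean(Y))^2)            # R(beta1 | beta0) = Y'(PX - P1)Y
Fstat <- (R1.0 / 1) / MSE                  # compare with F(1, n-2)

c(b0 = b0, b1 = b1, F = Fstat)
coef(lm(Y ~ X))                            # same b0, b1
anova(lm(Y ~ X))                           # same F statistic on the regression row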
Reparameterize the model:
\[
Y_i = \alpha + \beta_1(X_i - \bar X) + \varepsilon_i, \qquad \varepsilon_i \sim NID(0,\sigma^2),\ i = 1,\ldots,n.
\]
Interpretation of parameters: $\alpha = E(Y)$ when $X = \bar X$, and $\beta_1$ is the change in $E(Y)$ when $X$ is increased by one unit. The power of the F-test for $H_0\!:\ \beta_1 = 0$ vs. $H_A\!:\ \beta_1 \neq 0$ is maximized by maximizing
\[
\delta^2 = \frac{\beta_1^2}{2\sigma^2}\sum_{i=1}^n(X_i - \bar X)^2 .
\]

Matrix formulation:
\[
\begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}
= \begin{bmatrix} 1 & X_1 - \bar X \\ \vdots & \vdots \\ 1 & X_n - \bar X \end{bmatrix}
  \begin{bmatrix} \alpha \\ \beta_1 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\qquad\text{or}\qquad
\mathbf{Y} = W\boldsymbol{\gamma} + \boldsymbol{\varepsilon}.
\]
Clearly,
\[
W = X\begin{bmatrix} 1 & -\bar X \\ 0 & 1 \end{bmatrix} = XF
\qquad\text{and}\qquad
X = W\begin{bmatrix} 1 & \bar X \\ 0 & 1 \end{bmatrix} = WG .
\]
For this reparameterization the columns of $W$ are orthogonal, and
\[
W^TW = \begin{bmatrix} n & 0 \\ 0 & \sum_{i=1}^n(X_i-\bar X)^2 \end{bmatrix},
\qquad
(W^TW)^{-1} = \begin{bmatrix} \frac{1}{n} & 0 \\ 0 & \frac{1}{\sum_{i=1}^n(X_i-\bar X)^2} \end{bmatrix},
\qquad
W^T\mathbf{Y} = \begin{bmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n(X_i-\bar X)Y_i \end{bmatrix}.
\]
Then
\[
\hat{\boldsymbol{\gamma}} = \begin{bmatrix} \hat\alpha \\ \hat\beta_1 \end{bmatrix}
= (W^TW)^{-1}W^T\mathbf{Y}
= \begin{bmatrix} \bar Y \\[4pt] \dfrac{\sum_i(X_i-\bar X)Y_i}{\sum_i(X_i-\bar X)^2} \end{bmatrix}
\qquad\text{and}\qquad
Var(\hat{\boldsymbol{\gamma}}) = \sigma^2(W^TW)^{-1}
= \begin{bmatrix} \frac{\sigma^2}{n} & 0 \\ 0 & \frac{\sigma^2}{\sum_i(X_i-\bar X)^2} \end{bmatrix}.
\]
Hence $\bar Y$ and $\hat\beta_1 = \sum_i(X_i-\bar X)Y_i / \sum_i(X_i-\bar X)^2$ are uncorrelated (independent for the normal-theory Gauss-Markov model).

Analysis of variance: the reparameterization does not change the ANOVA table. Note that
\[
P_X = X(X^TX)^{-1}X^T = W(W^TW)^{-1}W^T = P_W
\]
and
\[
R(\beta_0) + R(\beta_1\mid\beta_0) + SSE
= \mathbf{Y}^TP_1\mathbf{Y} + \mathbf{Y}^T(P_X-P_1)\mathbf{Y} + \mathbf{Y}^T(I-P_X)\mathbf{Y}
= \mathbf{Y}^TP_1\mathbf{Y} + \mathbf{Y}^T(P_W-P_1)\mathbf{Y} + \mathbf{Y}^T(I-P_W)\mathbf{Y}
= R(\alpha) + R(\beta_1\mid\alpha) + SSE .
\]

7.2 Multiple regression analysis for the normal-theory Gauss-Markov model

\[
Y_i = \beta_0 + \beta_1X_{1i} + \cdots + \beta_rX_{ri} + \varepsilon_i,
\qquad\text{where } \varepsilon_i \sim NID(0,\sigma^2) \text{ for } i = 1,\ldots,n .
\]
Matrix formulation: $\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where
\[
X = \begin{bmatrix}
1 & X_{11} & X_{21} & \cdots & X_{r1} \\
1 & X_{12} & X_{22} & \cdots & X_{r2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{rn}
\end{bmatrix},
\qquad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r \end{bmatrix}.
\]
Suppose rank$(X) = r+1$. Then
(i) the OLS estimator (b.l.u.e.) for $\boldsymbol{\beta}$ is $\mathbf{b} = (X^TX)^{-1}X^T\mathbf{Y}$;
(ii) $Var(\mathbf{b}) = \sigma^2(X^TX)^{-1}$;
(iii) $\hat{\mathbf{Y}} = X\mathbf{b} = X(X^TX)^{-1}X^T\mathbf{Y} = P_X\mathbf{Y}$;
(iv) $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = (I-P_X)\mathbf{Y} \sim N(\mathbf{0},\ \sigma^2(I-P_X))$;
(v) by Result 4.7, $\frac{1}{\sigma^2}SSE = \frac{1}{\sigma^2}\mathbf{e}^T\mathbf{e} = \frac{1}{\sigma^2}\mathbf{Y}^T(I-P_X)\mathbf{Y} \sim \chi^2_{(n-r-1)}$;
(vi) $MSE = \frac{SSE}{n-r-1}$ is an unbiased estimator of $\sigma^2$.

The reduction in the residual sum of squares obtained by regression on $X_1, X_2, \ldots, X_r$ is denoted by
\[
R(\beta_1,\beta_2,\ldots,\beta_r\mid\beta_0)
= \mathbf{Y}^T(I-P_1)\mathbf{Y} - \mathbf{Y}^T(I-P_X)\mathbf{Y}
= \mathbf{Y}^T(P_X-P_1)\mathbf{Y} .
\]

ANOVA:

  Source of variation                   d.f.     Sum of squares
  Model (regression on X_1,...,X_r)      r       $R(\beta_1,\ldots,\beta_r\mid\beta_0) = \mathbf{Y}^T(P_X-P_1)\mathbf{Y}$
  Error (or residuals)                 n-r-1     $SSE = \mathbf{Y}^T(I-P_X)\mathbf{Y}$
  Corrected total                       n-1      $\mathbf{Y}^T(I-P_1)\mathbf{Y}$
  Correction for the mean                1       $R(\beta_0) = \mathbf{Y}^TP_1\mathbf{Y} = n\bar Y^2$

Use Cochran's theorem, or Results 4.7 and 4.8, to show that $SSE$ is distributed independently of $SS_{model}$ and that
\[
\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-r-1)}
\qquad\text{and}\qquad
\frac{1}{\sigma^2}R(\beta_1,\ldots,\beta_r\mid\beta_0) = \frac{1}{\sigma^2}SS_{model} \sim \chi^2_{(r)}(\delta^2).
\]
Then
\[
F = \frac{R(\beta_1,\ldots,\beta_r\mid\beta_0)/r}{MSE} \sim F_{(r,\,n-r-1)}(\delta^2),
\]
where
\[
\delta^2 = \frac{1}{2\sigma^2}\boldsymbol{\beta}^TX^T(P_X-P_1)X\boldsymbol{\beta}
= \frac{1}{2\sigma^2}\boldsymbol{\beta}^TX^T(P_X - I + I - P_1)X\boldsymbol{\beta}
= \frac{1}{2\sigma^2}\Big[\boldsymbol{\beta}^TX^T(I-P_1)X\boldsymbol{\beta}
  - \underbrace{\boldsymbol{\beta}^TX^T(I-P_X)X\boldsymbol{\beta}}_{\text{a matrix of zeros}}\Big]
= \frac{1}{2\sigma^2}\boldsymbol{\beta}^T[(I-P_1)X]^T(I-P_1)X\boldsymbol{\beta}.
\]
Since
\[
(I-P_1)X = [\,(I-P_1)\mathbf{1}\mid(I-P_1)\mathbf{X}_1\mid\cdots\mid(I-P_1)\mathbf{X}_r\,]
= [\,\mathbf{0}\mid\mathbf{X}_1-\bar X_1\mathbf{1}\mid\cdots\mid\mathbf{X}_r-\bar X_r\mathbf{1}\,],
\]
we have
\[
(I-P_1)X\boldsymbol{\beta} = \sum_{j=1}^r\beta_j(\mathbf{X}_j - \bar X_j\mathbf{1})
\]
and
\[
\delta^2 = \frac{1}{2\sigma^2}\Big[\sum_{j=1}^r\beta_j^2(\mathbf{X}_j-\bar X_j\mathbf{1})^T(\mathbf{X}_j-\bar X_j\mathbf{1})
+ \sum_{j\neq k}\beta_j\beta_k(\mathbf{X}_j-\bar X_j\mathbf{1})^T(\mathbf{X}_k-\bar X_k\mathbf{1})\Big]
= \frac{1}{2\sigma^2}\boldsymbol{\beta}_*^T\Big[\sum_{i=1}^n(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T\Big]\boldsymbol{\beta}_* ,
\]
where
\[
\boldsymbol{\beta}_* = \begin{bmatrix}\beta_1\\ \vdots\\ \beta_r\end{bmatrix},
\qquad
\mathbf{x}_i = \begin{bmatrix}X_{1i}\\ \vdots\\ X_{ri}\end{bmatrix},\ i = 1,\ldots,n,
\qquad
\bar{\mathbf{x}} = \begin{bmatrix}\bar X_1\\ \vdots\\ \bar X_r\end{bmatrix}.
\]
If $\sum_{i=1}^n(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T$ is positive definite, then the null hypothesis corresponding to $\delta^2 = 0$ is $H_0\!:\ \boldsymbol{\beta}_* = \mathbf{0}$ (or $\beta_1 = \beta_2 = \cdots = \beta_r = 0$). Reject $H_0\!:\ \boldsymbol{\beta}_* = \mathbf{0}$ if
\[
F = \frac{\mathbf{Y}^T(P_X-P_1)\mathbf{Y}/r}{\mathbf{Y}^T(I-P_X)\mathbf{Y}/(n-r-1)} > F_{(r,\,n-r-1),\,\alpha}.
\]

Sequential sums of squares (Type I sums of squares in PROC GLM or PROC REG in SAS). Let
\[
X_0 = \mathbf{1},\quad X_1 = [\,\mathbf{1}\mid\mathbf{X}_1\,],\quad X_2 = [\,\mathbf{1}\mid\mathbf{X}_1\mid\mathbf{X}_2\,],\ \ldots,\
X_r = [\,\mathbf{1}\mid\mathbf{X}_1\mid\cdots\mid\mathbf{X}_r\,]
\]
and
\[
P_0 = X_0(X_0^TX_0)^{-1}X_0^T,\quad P_1 = X_1(X_1^TX_1)^{-}X_1^T,\quad P_2 = X_2(X_2^TX_2)^{-}X_2^T,\ \ldots,\
P_r = X_r(X_r^TX_r)^{-}X_r^T.
\]
Then
\[
\mathbf{Y}^T\mathbf{Y}
= \mathbf{Y}^TP_0\mathbf{Y} + \mathbf{Y}^T(P_1-P_0)\mathbf{Y} + \mathbf{Y}^T(P_2-P_1)\mathbf{Y} + \cdots + \mathbf{Y}^T(P_r-P_{r-1})\mathbf{Y} + \mathbf{Y}^T(I-P_r)\mathbf{Y}
= R(\beta_0) + R(\beta_1\mid\beta_0) + R(\beta_2\mid\beta_0,\beta_1) + \cdots + R(\beta_r\mid\beta_0,\beta_1,\ldots,\beta_{r-1}) + SSE .
\]
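A minimal numerical check of this decomposition (an aside, not from the notes): the projection matrices $P_0, P_1, P_2$ can be built directly, with MASS::ginv() supplying the generalized inverse, and the sequential sums of squares verified to add up to $\mathbf{Y}^T\mathbf{Y}$. The simulated data and object names below are illustrative.

# Minimal sketch (simulated data; names are illustrative):
# build P0, P1, P2 and verify Y'Y = R(b0) + R(b1|b0) + R(b2|b0,b1) + SSE
library(MASS)                                           # for ginv()
set.seed(1)
n  <- 20
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 2 + 1.5*X1 - 0.5*X2 + rnorm(n)

proj <- function(M) M %*% ginv(crossprod(M)) %*% t(M)   # projection onto C(M)

P0 <- proj(matrix(1, n, 1))        # X0 = 1
P1 <- proj(cbind(1, X1))           # X1 = [1 | X1]
P2 <- proj(cbind(1, X1, X2))       # X2 = [1 | X1 | X2]

R0  <- t(Y) %*% P0 %*% Y                   # R(beta0) = n*Ybar^2
R10 <- t(Y) %*% (P1 - P0) %*% Y            # R(beta1 | beta0)
R21 <- t(Y) %*% (P2 - P1) %*% Y            # R(beta2 | beta0, beta1)
SSE <- t(Y) %*% (diag(n) - P2) %*% Y       # residual sum of squares

c(total = sum(Y^2), pieces = R0 + R10 + R21 + SSE)   # the two totals agree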
Use Cochran's theorem to show that
  - these sums of squares are distributed independently of each other, and
  - each $\frac{1}{\sigma^2}R(\beta_i\mid\beta_0,\ldots,\beta_{i-1})$ has a chi-squared distribution with one degree of freedom.

Use Result 4.7 to show $\frac{1}{\sigma^2}SSE \sim \chi^2_{(n-r-1)}$. Then
\[
F = \frac{R(\beta_j\mid\beta_0,\beta_1,\ldots,\beta_{j-1})/1}{MSE} \sim F_{(1,\,n-r-1)}(\delta^2),
\]
where
\[
\delta^2 = \frac{1}{2\sigma^2}\boldsymbol{\beta}^TX^T(P_j-P_{j-1})X\boldsymbol{\beta}
= \frac{1}{2\sigma^2}\boldsymbol{\beta}^TX^T(P_j-P_{j-1})^T(P_j-P_{j-1})X\boldsymbol{\beta}
= \frac{1}{2\sigma^2}[(P_j-P_{j-1})X\boldsymbol{\beta}]^T(P_j-P_{j-1})X\boldsymbol{\beta}.
\]
Hence this is a test of
\[
H_0\!:\ (P_j-P_{j-1})X\boldsymbol{\beta} = \mathbf{0}
\qquad\text{vs.}\qquad
H_A\!:\ (P_j-P_{j-1})X\boldsymbol{\beta} \neq \mathbf{0}.
\]
Note that
\[
(P_j-P_{j-1})X
= (P_j-P_{j-1})[\,\mathbf{1}\mid\mathbf{X}_1\mid\cdots\mid\mathbf{X}_{j-1}\mid\mathbf{X}_j\mid\cdots\mid\mathbf{X}_r\,]
= [\,O_{n\times j}\mid(P_j-P_{j-1})\mathbf{X}_j\mid\cdots\mid(P_j-P_{j-1})\mathbf{X}_r\,],
\]
since $\mathbf{1},\mathbf{X}_1,\ldots,\mathbf{X}_{j-1}$ lie in the column space of $X_{j-1}$ and are therefore unchanged by both $P_{j-1}$ and $P_j$. Then
\[
(P_j-P_{j-1})X\boldsymbol{\beta}
= \sum_{k=j}^r\beta_k(P_j-P_{j-1})\mathbf{X}_k
= \beta_j(P_j-P_{j-1})\mathbf{X}_j + \sum_{k=j+1}^r\beta_k(P_j-P_{j-1})\mathbf{X}_k ,
\]
and the null hypothesis is
\[
H_0\!:\ \beta_j(P_j-P_{j-1})\mathbf{X}_j + \sum_{k=j+1}^r\beta_k(P_j-P_{j-1})\mathbf{X}_k = \mathbf{0}.
\]

Type II sums of squares in SAS (there are also Type III and Type IV sums of squares for regression problems). From the previous discussion,
\[
R(\beta_j\mid\beta_0\text{ and all other }\beta_k\text{'s}) = \mathbf{Y}^T(P_X-P_{-j})\mathbf{Y},
\]
where $P_{-j} = X_{-j}(X_{-j}^TX_{-j})^{-}X_{-j}^T$ and $X_{-j}$ is obtained by deleting the $j$-th column of $X$. Then
\[
F = \frac{\mathbf{Y}^T(P_X-P_{-j})\mathbf{Y}/1}{MSE} \sim F_{(1,\,n-r-1)}(\delta^2),
\qquad\text{where}\qquad
\delta^2 = \frac{1}{2\sigma^2}\boldsymbol{\beta}^TX^T(P_X-P_{-j})X\boldsymbol{\beta}
= \frac{1}{2\sigma^2}\beta_j^2\,\mathbf{X}_j^T(P_X-P_{-j})\mathbf{X}_j .
\]
This F-test provides a test of $H_0\!:\ \beta_j = 0$ vs. $H_A\!:\ \beta_j \neq 0$ if $(P_X-P_{-j})\mathbf{X}_j \neq \mathbf{0}$.

  Variable          Type I sums of squares                                                 Type II sums of squares
  X_1               $R(\beta_1\mid\beta_0) = \mathbf{Y}^T(P_1-P_0)\mathbf{Y}$              $R(\beta_1\mid\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X-P_{-1})\mathbf{Y}$
  X_2               $R(\beta_2\mid\beta_0,\beta_1) = \mathbf{Y}^T(P_2-P_1)\mathbf{Y}$      $R(\beta_2\mid\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X-P_{-2})\mathbf{Y}$
  ...               ...                                                                    ...
  X_r               $R(\beta_r\mid\beta_0,\ldots,\beta_{r-1}) = \mathbf{Y}^T(P_r-P_{r-1})\mathbf{Y}$   $R(\beta_r\mid\text{other }\beta\text{'s}) = \mathbf{Y}^T(P_X-P_{-r})\mathbf{Y}$
  Residuals         $SSE = \mathbf{Y}^T(I-P_X)\mathbf{Y}$
  Corrected total   $\mathbf{Y}^T(I-P_1)\mathbf{Y}$

When $X_1, X_2, \ldots, X_r$ are all uncorrelated:
(i) $R(\beta_j\mid\beta_0\text{ and any other subset of }\beta\text{'s}) = R(\beta_j\mid\beta_0)$, and there is only one ANOVA table;
(ii) $R(\beta_j\mid\beta_0) = \hat\beta_j^2\sum_{i=1}^n(X_{ji}-\bar X_{j\cdot})^2$;
(iii) $F = \dfrac{R(\beta_j\mid\beta_0)}{MSE} \sim F_{(1,\,n-r-1)}(\delta^2)$, where $\delta^2 = \dfrac{\beta_j^2}{2\sigma^2}\sum_{i=1}^n(X_{ji}-\bar X_{j\cdot})^2$, and this F-statistic provides a test of $H_0\!:\ \beta_j = 0$ versus $H_A\!:\ \beta_j \neq 0$.

Confidence interval for an estimable function $\mathbf{c}^T\boldsymbol{\beta}$:
\[
\mathbf{c}^T\mathbf{b}\ \pm\ t_{(n-\mathrm{rank}(X)),\,\alpha/2}\,\sqrt{MSE\ \mathbf{c}^T(X^TX)^{-}\mathbf{c}} .
\]
Use $\mathbf{c}^T = (0\ 0\ \cdots\ 0\ 1\ 0\ \cdots\ 0)$, with the 1 in the $j$-th position, to construct a confidence interval for $\beta_{j-1}$. Use $\mathbf{c}^T = (1, x_1, x_2, \ldots, x_r)$ to construct a confidence interval for
\[
E(Y\mid X_1 = x_1, \ldots, X_r = x_r) = \beta_0 + \beta_1x_1 + \cdots + \beta_rx_r .
\]

For any testable hypothesis, reject $H_0\!:\ C\boldsymbol{\beta} = \mathbf{d}$ in favor of the general alternative $H_A\!:\ C\boldsymbol{\beta} \neq \mathbf{d}$ if
\[
F = \frac{(C\mathbf{b}-\mathbf{d})^T[C(X^TX)^{-}C^T]^{-1}(C\mathbf{b}-\mathbf{d})/m}{\mathbf{Y}^T(I-P_X)\mathbf{Y}/(n-\mathrm{rank}(X))}
> F_{(m,\,n-\mathrm{rank}(X)),\,\alpha},
\]
where $m$ = number of rows in $C$ = rank$(C)$ and $\mathbf{b} = (X^TX)^{-}X^T\mathbf{Y}$.

Prediction intervals: predict a future observation at $X_1 = x_1, \ldots, X_r = x_r$, i.e., predict
\[
Y = \underbrace{\beta_0 + \beta_1x_1 + \cdots + \beta_rx_r}_{\text{estimate the conditional mean with } b_0+b_1x_1+\cdots+b_rx_r}
+ \underbrace{\varepsilon}_{\text{estimate this with its mean } E(\varepsilon)=0} .
\]
A $(1-\alpha)\times100\%$ prediction interval is
\[
(\mathbf{c}^T\mathbf{b} + 0)\ \pm\ t_{(n-\mathrm{rank}(X)),\,\alpha/2}\,\sqrt{MSE\,[1 + \mathbf{c}^T(X^TX)^{-}\mathbf{c}]},
\qquad\text{where } \mathbf{c}^T = (1,\ x_1,\ \ldots,\ x_r).
\]
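As an illustration of these interval formulas (not in the original notes), the following R sketch computes the confidence interval for $E(Y\mid x)$ and the prediction interval for a new observation from the matrix expressions above, and then checks them against R's predict(). The simulated data, the evaluation point $x_0 = 5$, and the object names are assumptions made for the example.

# Minimal sketch (simulated data; names are illustrative):
# confidence interval for E(Y | x = 5) and prediction interval for a new Y at x = 5
set.seed(2)
n <- 15
x <- runif(n, 0, 10)
y <- 3 + 0.8*x + rnorm(n)
fit <- lm(y ~ x)

X    <- model.matrix(fit)
MSE  <- deviance(fit) / df.residual(fit)
cvec <- c(1, 5)                                     # c' = (1, x0) with x0 = 5
est  <- sum(cvec * coef(fit))                       # c'b
tq   <- qt(0.975, df.residual(fit))
q    <- t(cvec) %*% solve(crossprod(X)) %*% cvec    # c'(X'X)^{-1}c

c(est - tq*sqrt(MSE*q),       est + tq*sqrt(MSE*q))        # CI for the conditional mean
c(est - tq*sqrt(MSE*(1 + q)), est + tq*sqrt(MSE*(1 + q)))  # prediction interval

# The same intervals from predict():
predict(fit, newdata = data.frame(x = 5), interval = "confidence")
predict(fit, newdata = data.frame(x = 5), interval = "prediction")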
/*  A SAS program to perform a regression
    analysis of the effects of the
    composition of Portland cement on the
    amount of heat given off as the cement
    hardens.  */

data set1;
  input run x1 x2 x3 x4 y;
/*  label y  = evolved heat (calories)
          x1 = tricalcium aluminate
          x2 = tricalcium silicate
          x3 = tetracalcium aluminate ferrate
          x4 = dicalcium silicate;  */
  cards;
 1   7  26   6  60   78.5
 2   1  29  15  52   74.3
 3  11  56   8  20  104.3
 4  11  31   8  47   87.6
 5   7  52   6  33   95.9
 6  11  55   9  22  109.2
 7   3  71  17   6  102.7
 8   1  31  22  44   72.5
 9   2  54  18  22   93.1
10  21  47   4  26  115.9
11   1  40  23  34   83.8
12  11  66   9  12  113.2
13  10  68   8  12  109.4
run;

proc print data=set1 uniform split='*';
  var y x1 x2 x3 x4;
/*  label y  = 'Evolved*heat*(calories)'
          x1 = 'Percent*tricalcium*aluminate'
          x2 = 'Percent*tricalcium*silicate'
          x3 = 'Percent*tetracalcium*aluminate*ferrate'
          x4 = 'Percent*dicalcium*silicate';  */
run;

/* Regress y on all four explanatory variables
   and check residual plots and collinearity
   diagnostics */

proc reg data=set1 corr;
  model y = x1 x2 x3 x4 / p r ss1 ss2 covb collin;
  output out=set2 residual=r predicted=yhat;
run;

/* Examine smaller regression models
   corresponding to subsets of the
   explanatory variables */

proc reg data=set1;
  model y = x1 x2 x3 x4 / selection=rsquare cp aic sbc mse stop=4 best=6;
run;

/* Regress y on two of the explanatory variables
   and check residual plots and collinearity
   diagnostics */

proc reg data=set1 corr;
  model y = x1 x2 / p r ss1 ss2 covb collin;
  output out=set2 residual=r predicted=yhat;
run;

/* Use the GLM procedure to identify all
   estimable functions */

proc glm data=set1;
  model y = x1 x2 x3 x4 / ss1 ss2 e1 e2 e p;
run;

Output from PROC PRINT:

  Obs   Evolved      Percent      Percent      Percent        Percent
        heat         tricalcium   tricalcium   tetracalcium   dicalcium
        (calories)   aluminate    silicate     aluminate      silicate
                                               ferrate
    1       78.5          7           26            6             60
    2       74.3          1           29           15             52
    3      104.3         11           56            8             20
    4       87.6         11           31            8             47
    5       95.9          7           52            6             33
    6      109.2         11           55            9             22
    7      102.7          3           71           17              6
    8       72.5          1           31           22             44
    9       93.1          2           54           18             22
   10      115.9         21           47            4             26
   11       83.8          1           40           23             34
   12      113.2         11           66            9             12
   13      109.4         10           68            8             12

The REG Procedure
Model: MODEL1
Dependent Variable: y

Correlation

  Variable       x1        x2        x3        x4         y
  x1         1.0000    0.2286   -0.8241   -0.2454    0.7309
  x2         0.2286    1.0000   -0.1392   -0.9730    0.8162
  x3        -0.8241   -0.1392    1.0000    0.0295   -0.5348
  x4        -0.2454   -0.9730    0.0295    1.0000   -0.8212
  y          0.7309    0.8162   -0.5348   -0.8212    1.0000

Analysis of Variance

  Source             DF   Sum of Squares   Mean Square
  Model               4       2664.52051     666.13013
  Error               8         47.67641       5.95955
  Corrected Total    12       2712.19692

  Root MSE            2.44122    R-Square   0.9824
  Dependent Mean     95.41538    Adj R-Sq   0.9736
  Coeff Var           2.55852

Parameter Estimates

  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1             63.16602         69.93378      0.90     0.3928
  x1           1              1.54305          0.74331      2.08     0.0716
  x2           1              0.50200          0.72237      0.69     0.5068
  x3           1              0.09419          0.75323      0.13     0.9036
  x4           1             -0.15152          0.70766     -0.21     0.8358

  Variable    DF      Type I SS     Type II SS
  Intercept    1         118353        4.86191
  x1           1     1448.75413       25.68225
  x2           1     1205.70283        2.87801
  x3           1        9.79033        0.09319
  x4           1        0.27323        0.27323

Collinearity Diagnostics

                             Condition   -------------Proportion of Variation-------------
  Number     Eigenvalue         Index    Intercept        x1        x2        x3        x4
  1             4.11970       1.00000     0.000005   0.00037   0.00002   0.00021   0.00036
  2             0.55389       2.72721     8.812E-8   0.01004   0.00001   0.00266   0.00010
  3             0.28870       3.77753     3.060E-7   0.000581  0.00032   0.00159   0.00168
  4             0.03764      10.46207     0.000127   0.05745   0.00278   0.04569   0.00088
  5          0.00006614     249.57825     0.99987    0.93157   0.99687   0.94985   0.99730

Output Statistics

  Obs   Dep Var y   Predicted Value   Std Error Mean Predict   Student Residual   Cook's D
   1      78.5000           78.4929                   1.8109            0.00432      0.000
   2      74.3000           72.8005                   1.4092              0.752      0.057
   3     104.3000          105.9744                   1.8543             -1.054      0.303
   4      87.6000           89.3333                   1.3265             -0.846      0.060
   5      95.9000           95.6360                   1.4598              0.135      0.002
   6     109.2000          105.2635                   0.8602              1.723      0.084
   7     102.7000          104.1289                   1.4791             -0.736      0.063
   8      72.5000           75.6760                   1.5604             -1.692      0.395
   9      93.1000           91.7218                   1.3244              0.672      0.038
  10     115.9000          115.6010                   2.0431              0.224      0.023
  11      83.8000           81.8034                   1.5924              1.079      0.172
  12     113.2000          112.3007                   1.2519              0.429      0.013
  13     109.4000          111.6675                   1.3454             -1.113      0.108

(SAS also prints a bar chart of the studentized residuals next to these columns; it is omitted here.)
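For readers working in R rather than SAS, the sketch below reproduces, approximately, the kind of collinearity diagnostics requested by the COLLIN option above: eigenvalues and condition indices of the unit-length-scaled cross-product matrix (intercept column included) and the variance-decomposition proportions. It assumes the cement data frame created in the S-PLUS session later in this section; the object names are illustrative, and this is a sketch rather than SAS's exact implementation.

# Minimal sketch (R; assumes the 'cement' data frame read in below):
# collinearity diagnostics in the spirit of PROC REG's COLLIN option
X <- cbind(Intercept = 1, as.matrix(cement[, c("X1", "X2", "X3", "X4")]))
Z <- sweep(X, 2, sqrt(colSums(X^2)), "/")        # scale each column to unit length
eg  <- eigen(crossprod(Z), symmetric = TRUE)
lam <- eg$values
V   <- eg$vectors

cond.index <- sqrt(max(lam) / lam)               # condition indices

term <- sweep(V^2, 2, lam, "/")                  # term[j, k] = v_jk^2 / lambda_k
prop <- sweep(term, 1, rowSums(term), "/")       # variance-decomposition proportions
collin <- cbind(Eigenvalue = lam, "Condition Index" = cond.index, t(prop))
colnames(collin)[-(1:2)] <- colnames(X)
round(collin, 5)                                 # compare with the COLLIN table above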
The REG Procedure
Model: MODEL1
R-Square Selection Method

Regression Models for Dependent Variable: y

  Number in
  Model      R-Square        AIC        SBC    Variables in Model
  1            0.6744    58.8383   59.96815    x4
  1            0.6661    59.1672   60.29712    x2
  1            0.5342    63.4964   64.62630    x1
  1            0.2860    69.0481   70.17804    x3
  ------------------------------------------------------
  2            0.9787    25.3830   27.07785    x1 x2
  2            0.9726    28.6828   30.37766    x1 x4
  2            0.9353    39.8308   41.52565    x3 x4
  2            0.8470    51.0247   52.71951    x2 x3
  2            0.6799    60.6172   62.31201    x2 x4
  2            0.5484    65.0933   66.78816    x1 x3
  ------------------------------------------------------
  3            0.9824    24.9187   27.17852    x1 x2 x4
  3            0.9823    24.9676   27.22742    x1 x2 x3
  3            0.9814    25.6553   27.91511    x1 x3 x4
  3            0.9730    30.4953   32.75514    x2 x3 x4
  ------------------------------------------------------
  4            0.9824    26.8933   29.71808    x1 x2 x3 x4

The GLM Procedure

This output was produced by the e option in the model statement of the GLM procedure. It indicates that all five regression parameters are estimable.

General Form of Estimable Functions

  Effect       Coefficients
  Intercept    L1
  x1           L2
  x2           L3
  x3           L4
  x4           L5

This output was produced by the e1 option in the model statement of the GLM procedure. It describes the null hypotheses that are tested with the sequential Type I sums of squares.

Type I Estimable Functions

               ----------------Coefficients----------------
  Effect               x1            x2            x3      x4
  Intercept             0             0             0       0
  x1                   L2             0             0       0
  x2            0.6047*L2            L3             0       0
  x3           -0.8974*L2     0.0213*L3            L4       0
  x4           -0.6984*L2    -1.0406*L3    -1.0281*L4      L5

Type II Estimable Functions

               ----Coefficients----
  Effect        x1    x2    x3    x4
  Intercept      0     0     0     0
  x1            L2     0     0     0
  x2             0    L3     0     0
  x3             0     0    L4     0
  x4             0     0     0    L5

> #  The commands are stored in:  cement.spl
> #
> #  The data file is stored under the name
> #  cement.dat.  It has variable names on the
> #  first line.  We will enter the data into
> #  a data frame.

> cement <- read.table("cement.txt", header=T)
> cement
   run X1 X2 X3 X4     Y
1    1  7 26  6 60  78.5
2    2  1 29 15 52  74.3
3    3 11 56  8 20 104.3
4    4 11 31  8 47  87.6
5    5  7 52  6 33  95.9
6    6 11 55  9 22 109.2
7    7  3 71 17  6 102.7
8    8  1 31 22 44  72.5
9    9  2 54 18 22  93.1
10  10 21 47  4 26 115.9
11  11  1 40 23 34  83.8
12  12 11 66  9 12 113.2
13  13 10 68  8 12 109.4

> # Compute correlations and round the results
> # to four significant digits
> round(cor(cement[-1]),4)
        X1      X2      X3      X4       Y
X1  1.0000  0.2286 -0.8241 -0.2454  0.7309
X2  0.2286  1.0000 -0.1392 -0.9730  0.8162
X3 -0.8241 -0.1392  1.0000  0.0295 -0.5348
X4 -0.2454 -0.9730  0.0295  1.0000 -0.8212
Y   0.7309  0.8162 -0.5348 -0.8212  1.0000
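The correlations above already point to strong collinearity: X2 and X4 have correlation -0.9730 and X1 and X3 have correlation -0.8241. As an aside (not in the original notes), variance inflation factors give another summary of this. The helper below is a hypothetical base-R sketch, with VIF_j = 1/(1 - R_j^2) taken from regressing each explanatory variable on the remaining ones.

# Minimal sketch (hypothetical helper; base R only):
# variance inflation factors, VIF_j = 1 / (1 - R_j^2)
vif.lm <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]     # drop the intercept column
  vifs <- sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
  names(vifs) <- colnames(X)
  vifs
}
# After cement.out is created below:  round(vif.lm(cement.out), 1)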
> # Create a scatterplot matrix with smooth curves.
> # Unix users should first use motif( )
> # to open a graphics window

> points.lines <- function(x, y)
+ {
+   points(x, y)
+   lines(loess.smooth(x, y, 0.90))
+ }

> par(din=c(7,7), pch=18, mkh=.15, cex=1.2, lwd=3)
> pairs(cement[ ,-1], panel=points.lines)

[Figure: scatterplot matrix of X1, X2, X3, X4, and Y with a loess smooth in each panel.]

> # Fit a linear regression model (Venables
> # and Ripley, Chapter 6)

> cement.out <- lm(Y~X1+X2+X3+X4, cement)
> summary(cement.out)

Call: lm(formula = Y ~ X1+X2+X3+X4, data=cement)

Residuals:
    Min     1Q Median    3Q   Max
 -3.176 -1.674  0.264 1.378 3.936

Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept)  63.1660    69.9338  0.9032   0.3928
X1            1.5431     0.7433  2.0759   0.0716
X2            0.5020     0.7224  0.6949   0.5068
X3            0.0942     0.7532  0.1250   0.9036
X4           -0.1515     0.7077 -0.2141   0.8358

Residual standard error: 2.441 on 8 degrees of freedom
Multiple R-Squared: 0.9824
F-statistic: 111.8 on 4 and 8 degrees of freedom,
  the p-value is 4.707e-007

Correlation of Coefficients:
     (Intercept)      X1      X2      X3
X1       -0.9678
X2       -0.9978  0.9510
X3       -0.9769  0.9861  0.9624
X4       -0.9983  0.9568  0.9979  0.9659

> anova(cement.out)

Analysis of Variance Table
Response: Y
Terms added sequentially (first to last)
          Df Sum of Sq  Mean Sq  F Value  Pr(F)
X1         1  1448.754 1448.754 243.0978 0.0000
X2         1  1205.703 1205.703 202.3144 0.0000
X3         1     9.790    9.790   1.6428 0.2358
X4         1     0.273    0.273   0.0458 0.8358
Residuals  8    47.676    5.960
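Aside (an R cross-check, not in the original notes): marginal F-tests for dropping each term, which are the tests behind the Type II sums of squares, can also be obtained with drop1(). It omits the intercept row but should reproduce the X1 through X4 lines of the typeII.SS() analysis constructed next.

# Minimal sketch (output not shown here)
drop1(cement.out, test = "F")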
> # Create a function to evaluate an orthogonal
> # projection matrix.  Then create a function
> # to compute type II sums of squares.
> # This uses the ginv( ) function in the MASS
> # library, so you must attach the MASS library

> library(MASS)

> #=======================================
> # project( )
> #--------------
> # calculate orthogonal projection matrix
> #=======================================
> project <- function(X)
+   { X%*%ginv(crossprod(X))%*%t(X) }
> #=======================================

> #========================================
> # typeII.SS( )
> #------------------
> # calculate Type II sum of squares
> #
> # input  lmout = object made by the
> #                lm( ) function
> #        y     = dependent variable
> #========================================
> typeII.SS <- function(lmout,y)
+ {
+   # generate the model matrix
+   model <- model.matrix(lmout)
+   # create list of parameter names
+   par.name <- dimnames(model)[[2]]
+   # compute number of parameters
+   n.par <- dim(model)[2]
+   # Compute residual mean square
+   SS.res <- deviance(lmout)
+   df2 <- lmout$df.resid
+   MS.res <- SS.res/df2
+
+   result <- NULL   # store results
+
+   # Compute Type II SS
+   for (i in 1:n.par) {
+     A <- project(model)-project(model[,-i])
+     SS.II <- t(y) %*% A %*% y
+     df1 <- qr(project(model))$rank -
+            qr(project(model[ ,-i]))$rank
+     MS.II <- SS.II/df1
+     F.stat <- MS.II/MS.res
+     p.val <- 1-pf(F.stat,df1,df2)
+     temp <- cbind(df1,SS.II,MS.II,F.stat,p.val)
+     result <- rbind(result,temp)
+   }
+
+   result<-rbind(result,c(df2,SS.res,MS.res,NA,NA))
+   dimnames(result)<-list(c(par.name,"Residual"),
+     c("Df","Sum of Sq","Mean Sq","F Value","Pr(F)"))
+   cat("Analysis of Variance (TypeII Sum of Squares) \n")
+   round(result,6)
+ }
> #==========================================

> typeII.SS(cement.out, cement$Y)
Analysis of Variance (TypeII Sum of Squares)
          Df Sum of Sq   Mean Sq  F Value    Pr(F)
(Inter.)   1  4.861907  4.861907 0.815818 0.392790
X1         1 25.682254 25.682254 4.309427 0.071568
X2         1  2.878010  2.878010 0.482924 0.506779
X3         1  0.093191  0.093191 0.015637 0.903570
X4         1  0.273229  0.273229 0.045847 0.835810
Residual   8 47.676412  5.959551       NA       NA

> # Venables and Ripley have supplied functions
> # studres( ) and stdres( ) to compute
> # studentized and standardized residuals.
> # You must attach the MASS library before
> # using these functions.

> cement.res <- cbind(cement$Y,cement.out$fitted,
+                     cement.out$resid,
+                     studres(cement.out),
+                     stdres(cement.out))
> dimnames(cement.res) <- list(cement$run,
+     c("Response","Predicted","Residual",
+       "Stud. Res.","Std. Res."))
> round(cement.res,4)
   Response Predicted Residual Stud. Res. Std. Res.
1      78.5   78.4929   0.0071     0.0040    0.0043
2      74.3   72.8005   1.4995     0.7299    0.7522
3     104.3  105.9744  -1.6744    -1.0630   -1.0545
4      87.6   89.3333  -1.7333    -0.8291   -0.8458
5      95.9   95.6360   0.2640     0.1264    0.1349
6     109.2  105.2635   3.9365     2.0324    1.7230
7     102.7  104.1289  -1.4289    -0.7128   -0.7358
8      72.5   75.6760  -3.1760    -1.9745   -1.6917
9      93.1   91.7218   1.3782     0.6472    0.6721
10    115.9  115.6010   0.2990     0.2100    0.2237
11     83.8   81.8034   1.9966     1.0919    1.0790
12    113.2  112.3007   0.8993     0.4061    0.4291
13    109.4  111.6675  -2.2675    -1.1326   -1.1131

> # Produce plots for model diagnostics including
> # Cook's D.  Unix users should first use motif()
> # to open a graphics window
> par(mfrow=c(3,2))
> plot(cement.out)

[Figure: diagnostic plots for cement.out - residuals and sqrt(|residuals|) against fitted values, the response against fitted values, a normal Q-Q plot of the residuals, a residual-fit spread plot, and Cook's distance by observation; observations 6, 8, and 13 are labeled as the most extreme cases.]

> # Search for a simpler model
> cement.stp <- step(cement.out,
+     scope=list(upper = ~X1 + X2 + X3 + X4,
+                lower = ~ 1), trace=F)
> cement.stp$anova

Stepwise Model Path
Analysis of Deviance Table

Initial Model: Y ~ X1 + X2 + X3 + X4
Final Model:   Y ~ X1 + X2

    Step Df Deviance Resid. Df Resid. Dev      AIC
  1                          8   47.67641 107.2719
  2 - X3  1 0.093191          9   47.76960  95.4460
  3 - X4  1 9.970363         10   57.73997  93.4973
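The stepwise search ends at the two-variable model in X1 and X2, in agreement with the best two-variable model in the earlier SAS R-square selection output. A natural follow-up, sketched here rather than taken from the notes, is to refit that model and examine its summary and diagnostics (object name is illustrative).

# Minimal sketch: refit the model chosen by step() and examine it
cement.fin <- lm(Y ~ X1 + X2, data = cement)
summary(cement.fin)          # coefficient estimates and overall F-test
anova(cement.fin)            # sequential (Type I) sums of squares
par(mfrow = c(2, 2))
plot(cement.fin)             # residual diagnostics for the reduced model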