Lecture 9: ANOVA Tables and F-tests
BMTRY 701 Biostatistical Methods II

ANOVA (Analysis of Variance)

Similar in derivation to the ANOVA that generalizes the two-sample t-test. The total variation in the outcome is partitioned into several parts:
• that due to the 'model': SSR
• that due to 'error': SSE
The sum of the two parts is the total sum of squares: SST.

[Figure: scatterplots of data$logLOS versus data$BEDS in three panels, showing the total deviations $Y_i - \bar{Y}$, the regression deviations $\hat{Y}_i - \bar{Y}$, and the error deviations $Y_i - \hat{Y}_i$]

Definitions

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$
$$SST = \sum_i (Y_i - \bar{Y})^2, \quad SSR = \sum_i (\hat{Y}_i - \bar{Y})^2, \quad SSE = \sum_i (Y_i - \hat{Y}_i)^2$$
$$SST = SSR + SSE$$

Example: logLOS ~ BEDS

> ybar <- mean(data$logLOS)
> yhati <- reg$fitted.values
> sst <- sum((data$logLOS - ybar)^2)
> ssr <- sum((yhati - ybar)^2)
> sse <- sum((data$logLOS - yhati)^2)
> sst
[1] 3.547454
> ssr
[1] 0.6401715
> sse
[1] 2.907282
> sse + ssr
[1] 3.547454

Degrees of Freedom

Degrees of freedom for SST: n - 1
• one df is lost because it is used to estimate the mean of Y
Degrees of freedom for SSR: 1
• only one df because all fitted values lie on the same fitted regression line
Degrees of freedom for SSE: n - 2
• two df are lost to estimating the regression line (slope and intercept)

Mean Squares

A "scaled" version of a sum of squares: Mean Square = SS/df
MSR = SSR/1
MSE = SSE/(n-2)
Notes:
• mean squares are not additive! That is, MSR + MSE ≠ SST/(n-1)
• MSE is the same quantity we saw previously

Standard ANOVA Table

             SS    df    MS
Regression   SSR   1     MSR
Error        SSE   n-2   MSE
Total        SST   n-1

ANOVA for logLOS ~ BEDS

> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619

Inference?

What is of interest, and how do we interpret it? We would like to know whether BEDS is related to logLOS. How do we do that using the ANOVA table? We need the expected values of the MSR and MSE:
$$E(MSE) = \sigma^2$$
$$E(MSR) = \sigma^2 + \beta_1^2 \sum_i (X_i - \bar{X})^2$$

Implications

• the mean of the sampling distribution of the MSE is $\sigma^2$ regardless of whether or not β1 = 0
• if β1 = 0, then E(MSE) = E(MSR)
• if β1 ≠ 0, then E(MSE) < E(MSR)
So to test the significance of β1, we can test whether MSR and MSE are of the same magnitude.

F-test

Derived naturally from the arguments just made.
Hypotheses:
• H0: β1 = 0
• H1: β1 ≠ 0
Test statistic: F* = MSR/MSE
Based on the argument above, we expect F* > 1 when H1 is true, which implies a one-sided test.
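To see the test statistic assembled from its pieces, here is a minimal sketch using the sst, ssr, and sse objects computed above for the logLOS ~ BEDS fit (n = 113 hospitals, so the error df is 111, as in the ANOVA table):

n <- length(data$logLOS)     # n = 113, so error df = n - 2 = 111
msr <- ssr / 1               # mean square for regression
mse <- sse / (n - 2)         # mean square error
fstar <- msr / mse           # F* = MSR/MSE, about 24.44 here
qf(0.95, 1, n - 2)           # critical value F(0.95; 1, n-2)
1 - pf(fstar, 1, n - 2)      # one-sided p-value, matching anova(reg)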
F-test (continued)

The distribution of F* under the null has two sets of degrees of freedom (df):
• numerator degrees of freedom
• denominator degrees of freedom
These correspond to the df shown in the ANOVA table:
• numerator df = 1
• denominator df = n - 2
The test is based on
$$F^* = \frac{MSR}{MSE} \sim F(1, n-2)$$

Implementing the F-test

The decision rule:
• If F* > F(1-α; 1, n-2), then reject H0
• If F* ≤ F(1-α; 1, n-2), then fail to reject H0

[Figure: densities of the F(1,10), F(1,1000), F(5,10), and F(5,1000) distributions]

ANOVA for logLOS ~ BEDS

> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619

> qf(0.95, 1, 111)
[1] 3.926607
> 1 - pf(24.44, 1, 111)
[1] 2.739016e-06

More interesting: MLR

In multiple linear regression, the F-test lets you test that several coefficients are zero at the same time. Otherwise, the F-test gives the same result as a t-test. That is, for testing the significance of ONE covariate in a linear regression model, an F-test and a t-test give the same result:
• H0: β1 = 0
• H1: β1 ≠ 0

General F testing approach

The previous case seems simple, and it is, but it can be generalized to be more useful. Imagine a more general test:
• H0: small model
• Ha: large model
Constraint: the small model must be 'nested' in the large model. That is, the small model must be a 'subset' of the large model.

Example of 'nested' models

Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$
Model 2: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$
Model 3: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + e_i$

Models 2 and 3 are nested in Model 1. Model 2 is not nested in Model 3, and Model 3 is not nested in Model 2.

Testing: models must be nested!

To test Model 1 vs. Model 2:
• we are testing that β2 = 0
• H0: β2 = 0 vs. Ha: β2 ≠ 0
• if we reject the null hypothesis, we conclude that the larger Model 1 is superior
• if we fail to reject, the data are consistent with β2 = 0 and the smaller Model 2 is adequate

Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$
Model 2: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$

R

reg1 <- lm(LOS ~ INFRISK + ms + NURSE + nurse2, data=data)
reg2 <- lm(LOS ~ INFRISK + NURSE + nurse2, data=data)
reg3 <- lm(LOS ~ INFRISK + ms, data=data)

> anova(reg1)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.4043 8.115e-10 ***
ms          1  12.897  12.897  5.0288   0.02697 *
NURSE       1   1.097   1.097  0.4277   0.51449
nurse2      1   1.789   1.789  0.6976   0.40543
Residuals 108 276.981   2.565

> anova(reg2)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 44.8865 9.507e-10 ***
NURSE       1   8.212   8.212  3.1653     0.078 .
nurse2      1   1.782   1.782  0.6870     0.409
Residuals 109 282.771   2.594

> anova(reg1, reg2)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + NURSE + nurse2
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    109 282.771 -1    -5.789 2.2574 0.1359

> summary(reg1)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.355e+00  5.266e-01  12.068  < 2e-16 ***
INFRISK      6.289e-01  1.339e-01   4.696 7.86e-06 ***
ms           7.829e-01  5.211e-01   1.502    0.136
NURSE        4.136e-03  4.093e-03   1.010    0.315
nurse2      -5.676e-06  6.796e-06  -0.835    0.405
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.601 on 108 degrees of freedom
Multiple R-squared: 0.3231, Adjusted R-squared: 0.2981
F-statistic: 12.89 on 4 and 108 DF, p-value: 1.298e-08

Note that the t-test for ms in summary(reg1) gives p = 0.136, the same answer as the partial F-test p-value of 0.1359 from anova(reg1, reg2): when a single coefficient is tested, t² = F.
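To see what anova(reg1, reg2) is computing, here is a minimal sketch (assuming the reg1 and reg2 fits above are in the workspace) of the general F statistic, which compares the increase in error sum of squares, per dropped coefficient, to the MSE of the large model:

sse_small <- sum(resid(reg2)^2)   # RSS of the smaller (reduced) model: 282.771
sse_large <- sum(resid(reg1)^2)   # RSS of the larger (full) model: 276.981
df_small  <- df.residual(reg2)    # 109
df_large  <- df.residual(reg1)    # 108
fstar <- ((sse_small - sse_large) / (df_small - df_large)) /
         (sse_large / df_large)
fstar                                          # 2.2574, as in anova(reg1, reg2)
1 - pf(fstar, df_small - df_large, df_large)   # p-value 0.1359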
Testing more than two covariates

To test Model 1 vs. Model 3:
• we are testing that β3 = 0 AND β4 = 0
• H0: β3 = β4 = 0 vs. Ha: β3 ≠ 0 or β4 ≠ 0
• if we reject the null hypothesis, we conclude that the larger Model 1 is superior
• if we fail to reject, the data are consistent with β3 = β4 = 0 and the smaller Model 3 is adequate

Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$
Model 3: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + e_i$

R

> anova(reg3)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.7683 6.724e-10 ***
ms          1  12.897  12.897  5.0691   0.02634 *
Residuals 110 279.867   2.544

> anova(reg1, reg3)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + ms
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    110 279.867 -2    -2.886 0.5627 0.5713

(R always labels the two fits "Model 1" and "Model 2" in its output, regardless of our own model numbering.)

> summary(reg3)

Call:
lm(formula = LOS ~ INFRISK + ms, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9037 -0.8739 -0.1142  0.5965  8.5568

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
INFRISK       0.6998     0.1156   6.054    2e-08 ***
ms            0.9717     0.4316   2.251   0.0263 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10

Testing multiple coefficients simultaneously

Region is a 'factor' variable with 4 categories:
$$LOS_i = \beta_0 + \beta_1 I(R_i = 2) + \beta_2 I(R_i = 3) + \beta_3 I(R_i = 4) + e_i$$
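A minimal sketch of how that simultaneous test could be run in R, assuming the data frame holds the region variable in a column named REGION (that column name is an assumption for illustration, not from the lecture): fit the model with and without the factor and let the nested F-test judge all three indicator coefficients at once.

reg_region <- lm(LOS ~ factor(REGION), data = data)  # fits beta1-beta3 for regions 2-4
reg_null   <- lm(LOS ~ 1, data = data)               # intercept-only model (REGION name assumed)
anova(reg_null, reg_region)   # partial F-test of H0: beta1 = beta2 = beta3 = 0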