Lecture 9: ANOVA Tables and F-tests
BMTRY 701 Biostatistical Methods II

ANOVA (Analysis of Variance)

Similar in derivation to the ANOVA that generalizes the two-sample t-test. The total variation in the outcome is partitioned into several parts:
• that due to the 'model': SSR
• that due to 'error': SSE
The sum of the two parts is the total sum of squares: SST.

[Figure: scatterplots of data$logLOS versus data$BEDS in three panels, showing the total deviations $Y_i - \bar{Y}$, the regression deviations $\hat{Y}_i - \bar{Y}$, and the error deviations $Y_i - \hat{Y}_i$]

Definitions

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$
$$SST = \sum_i (Y_i - \bar{Y})^2, \quad SSR = \sum_i (\hat{Y}_i - \bar{Y})^2, \quad SSE = \sum_i (Y_i - \hat{Y}_i)^2$$
$$SST = SSR + SSE$$

Example: logLOS ~ BEDS

> ybar <- mean(data$logLOS)
> yhati <- reg$fitted.values
> sst <- sum((data$logLOS - ybar)^2)
> ssr <- sum((yhati - ybar)^2)
> sse <- sum((data$logLOS - yhati)^2)
> sst
[1] 3.547454
> ssr
[1] 0.6401715
> sse
[1] 2.907282
> sse + ssr
[1] 3.547454

Degrees of Freedom

Degrees of freedom for SST: n - 1
• one df is lost because it is used to estimate the mean of Y
Degrees of freedom for SSR: 1
• only one df because all fitted values lie on the same fitted regression line
Degrees of freedom for SSE: n - 2
• two df are lost to estimating the regression line (slope and intercept)

Mean Squares

A "scaled" version of a sum of squares: Mean Square = SS/df
MSR = SSR/1
MSE = SSE/(n-2)
Notes:
• mean squares are not additive! That is, MSR + MSE ≠ SST/(n-1)
• MSE is the same quantity we saw previously

Standard ANOVA Table

             SS    df    MS
Regression   SSR   1     MSR
Error        SSE   n-2   MSE
Total        SST   n-1

ANOVA for logLOS ~ BEDS

> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619

Inference?

What is of interest, and how do we interpret it? We would like to know whether BEDS is related to logLOS. How do we do that using the ANOVA table? We need the expected values of the MSR and MSE:
$$E(MSE) = \sigma^2$$
$$E(MSR) = \sigma^2 + \beta_1^2 \sum_i (X_i - \bar{X})^2$$

Implications

• the mean of the sampling distribution of the MSE is $\sigma^2$ regardless of whether or not β1 = 0
• if β1 = 0, then E(MSE) = E(MSR)
• if β1 ≠ 0, then E(MSE) < E(MSR)
So to test the significance of β1, we can test whether MSR and MSE are of the same magnitude.

F-test

Derived naturally from the arguments just made.
Hypotheses:
• H0: β1 = 0
• H1: β1 ≠ 0
Test statistic: F* = MSR/MSE
Based on the argument above, we expect F* > 1 when H1 is true, which implies a one-sided test.
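To see the test statistic assembled from its pieces, here is a minimal sketch using the sst, ssr, and sse objects computed above for the logLOS ~ BEDS fit (n = 113 hospitals, so the error df is 111, as in the ANOVA table):

n <- length(data$logLOS)     # n = 113, so error df = n - 2 = 111
msr <- ssr / 1               # mean square for regression
mse <- sse / (n - 2)         # mean square error
fstar <- msr / mse           # F* = MSR/MSE, about 24.44 here
qf(0.95, 1, n - 2)           # critical value F(0.95; 1, n-2)
1 - pf(fstar, 1, n - 2)      # one-sided p-value, matching anova(reg)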
F-test (continued)

The distribution of F* under the null has two sets of degrees of freedom (df):
• numerator degrees of freedom
• denominator degrees of freedom
These correspond to the df shown in the ANOVA table:
• numerator df = 1
• denominator df = n - 2
The test is based on
$$F^* = \frac{MSR}{MSE} \sim F(1, n-2)$$

Implementing the F-test

The decision rule:
• If F* > F(1-α; 1, n-2), then reject H0
• If F* ≤ F(1-α; 1, n-2), then fail to reject H0

[Figure: densities of the F(1,10), F(1,1000), F(5,10), and F(5,1000) distributions]

ANOVA for logLOS ~ BEDS

> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619

> qf(0.95, 1, 111)
[1] 3.926607
> 1 - pf(24.44, 1, 111)
[1] 2.739016e-06

More interesting: MLR

In multiple linear regression, the F-test lets you test that several coefficients are zero at the same time. Otherwise, the F-test gives the same result as a t-test. That is, for testing the significance of ONE covariate in a linear regression model, an F-test and a t-test give the same result:
• H0: β1 = 0
• H1: β1 ≠ 0

General F testing approach

The previous case seems simple, and it is, but it can be generalized to be more useful. Imagine a more general test:
• H0: small model
• Ha: large model
Constraint: the small model must be 'nested' in the large model. That is, the small model must be a 'subset' of the large model.

Example of 'nested' models

Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$
Model 2: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$
Model 3: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + e_i$

Models 2 and 3 are nested in Model 1. Model 2 is not nested in Model 3, and Model 3 is not nested in Model 2.

Testing: models must be nested!

To test Model 1 vs. Model 2:
• we are testing that β2 = 0
• H0: β2 = 0 vs. Ha: β2 ≠ 0
• if we reject the null hypothesis, we conclude that the larger Model 1 is superior
• if we fail to reject, the data are consistent with β2 = 0 and the smaller Model 2 is adequate

Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$
Model 2: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$

R

reg1 <- lm(LOS ~ INFRISK + ms + NURSE + nurse2, data=data)
reg2 <- lm(LOS ~ INFRISK + NURSE + nurse2, data=data)
reg3 <- lm(LOS ~ INFRISK + ms, data=data)

> anova(reg1)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.4043 8.115e-10 ***
ms          1  12.897  12.897  5.0288   0.02697 *
NURSE       1   1.097   1.097  0.4277   0.51449
nurse2      1   1.789   1.789  0.6976   0.40543
Residuals 108 276.981   2.565

> anova(reg2)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 44.8865 9.507e-10 ***
NURSE       1   8.212   8.212  3.1653     0.078 .
nurse2      1   1.782   1.782  0.6870     0.409
Residuals 109 282.771   2.594

> anova(reg1, reg2)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + NURSE + nurse2
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    109 282.771 -1    -5.789 2.2574 0.1359

> summary(reg1)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.355e+00  5.266e-01  12.068  < 2e-16 ***
INFRISK      6.289e-01  1.339e-01   4.696 7.86e-06 ***
ms           7.829e-01  5.211e-01   1.502    0.136
NURSE        4.136e-03  4.093e-03   1.010    0.315
nurse2      -5.676e-06  6.796e-06  -0.835    0.405
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.601 on 108 degrees of freedom
Multiple R-squared: 0.3231, Adjusted R-squared: 0.2981
F-statistic: 12.89 on 4 and 108 DF, p-value: 1.298e-08

Note that the t-test for ms in summary(reg1) gives p = 0.136, the same answer as the partial F-test p-value of 0.1359 from anova(reg1, reg2): when a single coefficient is tested, t² = F.
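To see what anova(reg1, reg2) is computing, here is a minimal sketch (assuming the reg1 and reg2 fits above are in the workspace) of the general F statistic, which compares the increase in error sum of squares, per dropped coefficient, to the MSE of the large model:

sse_small <- sum(resid(reg2)^2)   # RSS of the smaller (reduced) model: 282.771
sse_large <- sum(resid(reg1)^2)   # RSS of the larger (full) model: 276.981
df_small  <- df.residual(reg2)    # 109
df_large  <- df.residual(reg1)    # 108
fstar <- ((sse_small - sse_large) / (df_small - df_large)) /
         (sse_large / df_large)
fstar                                          # 2.2574, as in anova(reg1, reg2)
1 - pf(fstar, df_small - df_large, df_large)   # p-value 0.1359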
Testing more than two covariates

To test Model 1 vs. Model 3:
• we are testing that β3 = 0 AND β4 = 0
• H0: β3 = β4 = 0 vs. Ha: β3 ≠ 0 or β4 ≠ 0
• if we reject the null hypothesis, we conclude that the larger Model 1 is superior
• if we fail to reject, the data are consistent with β3 = β4 = 0 and the smaller Model 3 is adequate

Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$
Model 3: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + e_i$

R

> anova(reg3)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.7683 6.724e-10 ***
ms          1  12.897  12.897  5.0691   0.02634 *
Residuals 110 279.867   2.544

> anova(reg1, reg3)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + ms
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    110 279.867 -2    -2.886 0.5627 0.5713

(R always labels the two fits "Model 1" and "Model 2" in its output, regardless of our own model numbering.)

> summary(reg3)

Call:
lm(formula = LOS ~ INFRISK + ms, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9037 -0.8739 -0.1142  0.5965  8.5568

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
INFRISK       0.6998     0.1156   6.054    2e-08 ***
ms            0.9717     0.4316   2.251   0.0263 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10

Testing multiple coefficients simultaneously

Region is a 'factor' variable with 4 categories:
$$LOS_i = \beta_0 + \beta_1 I(R_i = 2) + \beta_2 I(R_i = 3) + \beta_3 I(R_i = 4) + e_i$$
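A minimal sketch of how that simultaneous test could be run in R, assuming the data frame holds the region variable in a column named REGION (that column name is an assumption for illustration, not from the lecture): fit the model with and without the factor and let the nested F-test judge all three indicator coefficients at once.

reg_region <- lm(LOS ~ factor(REGION), data = data)  # fits beta1-beta3 for regions 2-4
reg_null   <- lm(LOS ~ 1, data = data)               # intercept-only model (REGION name assumed)
anova(reg_null, reg_region)   # partial F-test of H0: beta1 = beta2 = beta3 = 0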