Analysis of Covariance (Chapter 16)

• A procedure for comparing treatment means that incorporates information on a quantitative explanatory variable, X, sometimes called a covariate.
• The procedure, ANCOVA, is a combination of ANOVA with regression.

Example: Calf Weight Gain

• An animal scientist wishes to examine the impact of a pair of new dietary supplements on calf weight gain (response).
• Three treatments are defined: standard diet, standard diet + supplement Q, and standard diet + supplement R.
• All new calves from a large herd are available for use as study units. She selects 30 calves for study. Calves are assigned to the three diets at random (completely randomized design).
• Initial weights are recorded, then calves are placed on the diets. At the end of four weeks the final weight is taken and weight gain is computed.
• Simple analysis of variance and the associated multiple comparisons procedures indicate no significant difference in weight gain between the two supplemented diets, but large differences between the supplemented diets and the standard diet.
• Is this the end of the story? …

ANOVA Results

[Figure: dot plot of average weight gain (response, g/day) for the Standard Diet, + Supplement Q, and + Supplement R groups.]

A simple one-way ANOVA would suggest no difference between Supplements Q and R, but that both differ from the Standard Diet.

Initial Weights

[Figure: dot plot of initial weight for the Standard Diet, + Supplement Q, and + Supplement R groups.]

Plotting the initial weights by group shows that the groups did not start with equal initial weights.

Weight Gain to Initial Weight

[Figure: growth curve for the Standard Diet, weight (kg) against age, with the initial weight (w_i), final weight (w_F), and weight gain (w_gain) marked for two animals (1 and 2) entering at different ages.]

If animals come into the study at different ages, they have different initial weights and are at different points on the growth curve. Expected weight gains will differ depending on age at entry into the study.

Regression of Initial Weight to Weight Gain

[Figure: weight gain (g/day) (Y) against initial weight (x), with the initial weights and weight gains of animals 1 and 2 marked along a straight line.]

If we disregard the age of the animal and instead focus on the initial weight, we see that there is a linear relationship between initial weight and the expected weight gain.

Covariates

Initial weight in the previous example is a covariable or covariate. A covariate is a disturbing variable (confounder), that is, it is known to have an effect on the response. Usually, the covariate can be measured, but often we may not be able to control its effect through blocking.

In the example, had the animal scientist known that the calves were very variable in initial weight (or age), she could have:
• Created blocks of 3 or 6 equal-weight animals, and randomized treatments to calves within these blocks.
• This would have entailed some cost in terms of time spent sorting the calves and then keeping track of block membership over the life of the study.
• It was much easier to simply record the calf initial weight and then use analysis of covariance for the final analysis.
• In many cases, due to the continuous nature of the covariate, blocking is just not feasible.

Expectations under H0

Under H0: no treatment effects. If all animals had come in with the same initial weight, all three treatments would produce the same weight gain.

[Figure: expected weight gain (g/day) (Y) against initial weight (x), a single regression line with the average-weight animal marked.]

Expectations under HA

Under HA: significant treatment effects. Different treatments produce different weight gains for animals of the same initial weight.

[Figure: expected weight gain (g/day) (Y) against initial weight (x), with parallel lines for + Supplement Q (q), + Supplement R (r), and Standard Diet (c); the expected gains WGQ, WGR, and WGs are read off at the average-weight animal.]

Different Initial Weights

Under H0: no treatment effects. If the average initial weights in the treatment groups differ, the observed weight gains will be different, even if treatments have no effect.
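
The statement above can be made explicit with a short calculation. This is only a sketch: it assumes a straight-line relationship between expected gain and initial weight, with an intercept $\mu$ and a common slope $\beta$ (symbols chosen here for illustration), and writes $\bar{x}_{i\cdot}$ for the mean initial weight in treatment group $i$.

```latex
% Under H0 every calf follows the same line, whatever its treatment:
E(y_{ij}) = \mu + \beta x_{ij}

% Averaging over the n calves in treatment group i:
E(\bar{y}_{i\cdot}) = \mu + \beta \bar{x}_{i\cdot}

% Expected difference between two unadjusted group means:
E(\bar{y}_{i\cdot} - \bar{y}_{k\cdot}) = \beta \, (\bar{x}_{i\cdot} - \bar{x}_{k\cdot})
```

The last line is nonzero whenever $\beta \neq 0$ and the two groups have different mean initial weights, even though no treatment has any effect.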

[Figure: expected weight gain (g/day) (Y) against initial weight (x); the c, q, and r animals lie along one common line but at different initial weights, so the group expected gains WGR, WGs, and WGQ differ.]

Observed Responses under HA

Suppose now that different supplements actually do increase weight gain. This translates to animals in different treatment groups following different, but parallel, regression lines on initial weight.

[Figure: weight gain (g/day) (Y) against initial weight (x) under HA (significant treatment effects), with parallel lines for + Supplement Q, + Supplement R, and Standard Diet, and WGR, WGQ, and WGs marked.]

What difference in weight gain is due to initial weight and what is due to treatment?

Observed Group Means

Simple one-way classification ANOVA (without accounting for initial weight) gives us the wrong answer!

[Figure: weight gain (g/day) (Y) against initial weight (x) for the three groups, with the unadjusted treatment means $\bar{y}_r$, $\bar{y}_q$, and $\bar{y}_c$ marked on the Y axis.]

Predicted Average Responses

Expected weight gain is computed for each treatment at the average initial weight, and comparisons are then made.

[Figure: weight gain (g/day) (Y) against initial weight (x) for the three groups, with the adjusted treatment means $\bar{y}_q \mid X = \bar{x}$, $\bar{y}_r \mid X = \bar{x}$, and $\bar{y}_c \mid X = \bar{x}$ read off each treatment line at $X = \bar{x}$.]

ANCOVA: Objectives

The objective of an analysis of covariance is to compare the treatment means after adjusting for differences among the treatments due to differences in the covariate levels for the treatment groups. The analysis proceeds by combining a regression model with an analysis of variance model.

Model

$$E(y_{ij}) = \mu + \alpha_i + \beta x_{ij}$$

The $\alpha_i$, $i = 1, \dots, t$, describe how each of the $t$ treatments modifies the overall mean response. (The index $j = 1, \dots, n$ runs over the $n$ replicates for each treatment.)
The slope coefficient, $\beta$, is a measure of how the average response changes as the value of the covariate changes.

The analysis proceeds by fitting a linear regression model with dummy variables to code for the different treatment levels.

A Priori Assumptions

• The covariate is related to the response, and can account for variation in the response. Check with a scatterplot of Y vs. X.
• The covariate is NOT related to the treatments. If X is related to the treatments, then the variance of the treatment differences is increased relative to that obtained from an ANOVA model without X, which results in a loss of precision.
• The treatments' regression equations are linear in the covariate. Check with a scatterplot of Y vs. X for each treatment. Non-linearity can be accommodated (e.g. polynomial terms, transforms), but the analysis may be more complex.
• The regression lines for the different treatments are parallel. This means there is only one slope in the Y vs. X plots. Non-parallel lines can be accommodated, but this complicates the analysis, since differences in treatments will now depend on the value of the covariate X.

Example

Four different formulations of an industrial glue are being tested. The tensile strength (response) of the glue is known to be related to the thickness as applied. Five observations on strength (Y), in pounds, and thickness (X), in 0.01 inches, are made for each formulation.

Here:
• There are t = 4 treatments (formulations of glue).
• The covariate X is the thickness of the applied glue.
• Each treatment is replicated n = 5 times at different values of X.

    Formulation   Strength   Thickness
         1          46.5        13
         1          45.9        14
         1          49.8        12
         1          46.1        12
         1          44.3        14
         2          48.7        12
         2          49.0        10
         2          50.1        11
         2          48.5        12
         2          45.2        14
         3          46.3        15
         3          47.1        14
         3          48.9        11
         3          48.2        11
         3          50.3        10
         4          44.7        16
         4          43.0        15
         4          51.0        10
         4          48.1        12
         4          46.8        11

Formulation Profiles

[Figure: strength (Y), axis from 40.0 to 52.0, against thickness (X), axis from 10 to 16, with one profile per formulation (Form_1, Form_2, Form_3, Form_4).]

SAS Program

The basic model is a combination of regression and one-way classification.

    /* Read the glue data: formulation, strength (lb), thickness (0.01 in) */
    data glue;
      input Formulation Strength Thickness;
      datalines;
    1 46.5 13
    1 45.9 14
    1 49.8 12
    1 46.1 12
    1 44.3 14
    2 48.7 12
    2 49.0 10
    2 50.1 11
    2 48.5 12
    2 45.2 14
    3 46.3 15
    3 47.1 14
    3 48.9 11
    3 48.2 11
    3 50.3 10
    4 44.7 16
    4 43.0 15
    4 51.0 10
    4 48.1 12
    4 46.8 11
    ;
    run;

    /* ANCOVA: covariate (thickness) plus class effect (formulation) */
    proc glm;
      class formulation;
      model strength = thickness formulation / solution;
      lsmeans formulation / stderr pdiff;  /* adjusted means and pairwise p-values */
    run;

Output: use the Type III SS to test the significance of each variable.

    Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model               4      66.31065753   16.57766438     10.17   0.0003
    Error              15      24.44684247    1.62978950
    Corrected Total    19      90.75750000

The Error mean square (1.62978950) is the MSE.

    R-Square   Coeff Var   Root MSE   Strength Mean
    0.730636    2.691897   1.276632        47.42500

    Source         DF     Type I SS   Mean Square   F Value   Pr > F
    Thickness       1   63.50120135   63.50120135     38.96   <.0001
    Formulation     3    2.80945618    0.93648539      0.57   0.6405

    Source         DF   Type III SS   Mean Square   F Value   Pr > F
    Thickness       1   53.20115753   53.20115753     32.64   <.0001
    Formulation     3    2.80945618    0.93648539      0.57   0.6405

Regression on thickness is significant. No formulation differences.

    Parameter           Estimate       Standard Error   t Value   Pr > |t|
    Intercept        58.93698630 B       2.21321008       26.63     <.0001
    Thickness        -0.95445205         0.16705494       -5.71     <.0001
    Formulation 1    -0.00910959 B       0.80810401       -0.01     0.9912
    Formulation 2     0.62554795 B       0.82451389        0.76     0.4598
    Formulation 3     0.86732877 B       0.81361075        1.07     0.3033
    Formulation 4     0.00000000 B        .                 .        .

Divide each sum of squares by its DF to get the mean square, then divide the mean square by the MSE to get the F value.
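
The SAS figures above can be cross-checked by hand. The following is a minimal sketch in plain Python (not part of the original SAS, Minitab, or R workflow) that fits the same dummy-variable regression by solving the normal equations, with formulation 4 as the reference level as in the SAS solution. The helper names `solve`, `fit`, and `dummies` are hypothetical, introduced only for this illustration. Each Type III sum of squares is obtained as the increase in error sum of squares when that term is dropped from the full model.

```python
# Glue data from the example: (formulation, strength, thickness)
DATA = [
    (1, 46.5, 13), (1, 45.9, 14), (1, 49.8, 12), (1, 46.1, 12), (1, 44.3, 14),
    (2, 48.7, 12), (2, 49.0, 10), (2, 50.1, 11), (2, 48.5, 12), (2, 45.2, 14),
    (3, 46.3, 15), (3, 47.1, 14), (3, 48.9, 11), (3, 48.2, 11), (3, 50.3, 10),
    (4, 44.7, 16), (4, 43.0, 15), (4, 51.0, 10), (4, 48.1, 12), (4, 46.8, 11),
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(rows, y):
    """Least squares via the normal equations; returns (coefficients, SSE)."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, sse


def dummies(f):
    """Indicator columns for formulations 1-3; formulation 4 is the reference."""
    return [1.0 if f == k else 0.0 for k in (1, 2, 3)]


y = [s for _, s, _ in DATA]
full_rows = [[1.0, float(x)] + dummies(f) for f, _, x in DATA]
thick_rows = [[1.0, float(x)] for _, _, x in DATA]    # formulation dropped
form_rows = [[1.0] + dummies(f) for f, _, _ in DATA]  # thickness dropped

beta_full, sse_full = fit(full_rows, y)
_, sse_thick = fit(thick_rows, y)
_, sse_form = fit(form_rows, y)

df_error = len(y) - len(beta_full)      # 20 observations - 5 parameters = 15
mse = sse_full / df_error
ss_thickness = sse_form - sse_full      # Type III SS for Thickness (1 df)
ss_formulation = sse_thick - sse_full   # Type III SS for Formulation (3 df)
f_thickness = (ss_thickness / 1) / mse
f_formulation = (ss_formulation / 3) / mse

print("intercept, slope, formulations 1-3:", [round(b, 5) for b in beta_full])
print("MSE:", round(mse, 5))
print("F thickness:", round(f_thickness, 2))
print("F formulation:", round(f_formulation, 2))
```

The normal equations are adequate for a data set this small; for larger or ill-conditioned problems a QR-based solver would be the safer choice.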

Least Squares Means

(Adjusted formulation means computed at the average value of Thickness [= 12.45])

    The GLM Procedure
    Least Squares Means

    Formulation   Strength LSMEAN   Standard Error   Pr > |t|   LSMEAN Number
         1           47.0449486        0.5782732      <.0001          1
         2           47.6796062        0.5811616      <.0001          2
         3           47.9213870        0.5724527      <.0001          3
         4           47.0540582        0.5739134      <.0001          4

    Least Squares Means for effect Formulation
    Pr > |t| for H0: LSMean(i)=LSMean(j)
    Dependent Variable: Strength

    i/j        1        2        3        4
     1               0.4574   0.3011   0.9912
     2     0.4574             0.7695   0.4598
     3     0.3011   0.7695             0.3033
     4     0.9912   0.4598   0.3033

ANCOVA in Minitab

Stat > ANOVA > General Linear Model …
> Responses: Strength
> Model: Formulation
> Covariates: Thickness
> Options: Adjusted (Type III) Sums of Squares

(The worksheet holds the same Formulation, Strength, and Thickness columns as the data table in the Example.)

    General Linear Model: Strength versus Formulation

    Factor     Type    Levels   Values
    Formulat   fixed        4   1 2 3 4

    Source     DF   Seq SS   Adj SS   Adj MS       F       P
    Thicknes    1   63.501   53.201   53.201   32.64   0.000
    Formulat    3    2.809    2.809    0.936    0.57   0.640
    Error      15   24.447   24.447    1.630
    Total      19   90.758

    Term          Coef      SE Coef       T       P
    Constant     59.308      2.099    28.25   0.000
    Thicknes     -0.9545     0.1671   -5.71   0.000
    Formulat
      1          -0.3801     0.5029   -0.76   0.462
      2           0.2546     0.5062    0.50   0.622
      3           0.4964     0.4962    1.00   0.333

Factor Plots… > Main Effects Plot > Formulation

[Figure: Main Effects Plot - LS Means for Strength; strength (axis from 47.0 to 47.9) against formulation (1 to 4).]

ANCOVA in R

    > glue <- read.table("glue.txt", header=TRUE)
    > glue$Formulation <- as.factor(glue$Formulation)
    > # fit linear models: full, thickness only, formulation only
    > full.lm <- lm(Strength ~ Formulation + Thickness, data=glue)
    > thick.lm <- lm(Strength ~ Thickness, data=glue)
    > formu.lm <- lm(Strength ~ Formulation, data=glue)
    >
    > anova(thick.lm, full.lm)
    Analysis of Variance Table

    Model 1: Strength ~ Thickness
    Model 2: Strength ~ Formulation + Thickness
      Res.Df     RSS Df Sum of Sq      F Pr(>F)
    1     18 27.2563
    2     15 24.4468  3    2.8095 0.5746 0.6405

Test for Formulation differences.

    > anova(formu.lm, full.lm)
    Analysis of Variance Table

    Model 1: Strength ~ Formulation
    Model 2: Strength ~ Formulation + Thickness
      Res.Df    RSS Df Sum of Sq      F    Pr(>F)
    1     16 77.648
    2     15 24.447  1    53.201 32.643 4.105e-05 ***

Test for significance of Thickness.

    > summary(full.lm)

    Call:
    lm(formula = Strength ~ Formulation + Thickness, data = glue)

    Residuals:
        Min      1Q  Median      3Q     Max
    -1.6380 -1.0398  0.1873  0.6966  2.3255

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  58.92788    2.24551  26.243 5.97e-14 ***
    Formulation2  0.63466    0.83193   0.763    0.457
    Formulation3  0.87644    0.81840   1.071    0.301
    Formulation4  0.00911    0.80810   0.011    0.991
    Thickness    -0.95445    0.16706  -5.713 4.11e-05 ***

Full model (can be refined by omitting formulation).

    > summary(thick.lm)

    Call:
    lm(formula = Strength ~ Thickness, data = glue)

    Residuals:
        Min      1Q  Median      3Q     Max
    -2.0813 -0.7324  0.1274  0.9090  1.9230

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  59.9294     1.9504  30.726  < 2e-16 ***
    Thickness    -1.0044     0.1551  -6.476 4.32e-06 ***

    Residual standard error: 1.231 on 18 degrees of freedom
    Multiple R-Squared: 0.6997, Adjusted R-squared: 0.683
    F-statistic: 41.94 on 1 and 18 DF, p-value: 4.317e-06

Reduced model (formulation omitted).

[Figure: fitted lines for the full model, one per formulation, with the single reduced-model line in blue.]

Plot lines for the full model; these can all be replaced by a single line for the reduced model (blue).

[Figure: fit of the reduced model.]

Check fit of reduced model (with just thickness).
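
A numerical counterpart to the fit check can be sketched without any plotting. The block below is an illustrative cross-check in plain Python, not the diagnostic shown on the original slide. It refits the reduced model (strength on thickness only) with the closed-form simple-regression formulas and recomputes the residual range, the residual standard error, and R-squared for comparison with the `summary(thick.lm)` output.

```python
import math

# Glue data from the example, in the same row order as the data table
strength = [46.5, 45.9, 49.8, 46.1, 44.3, 48.7, 49.0, 50.1, 48.5, 45.2,
            46.3, 47.1, 48.9, 48.2, 50.3, 44.7, 43.0, 51.0, 48.1, 46.8]
thickness = [13, 14, 12, 12, 14, 12, 10, 11, 12, 14,
             15, 14, 11, 11, 10, 16, 15, 10, 12, 11]

n = len(strength)
x_bar = sum(thickness) / n
y_bar = sum(strength) / n

# Corrected sums of squares and cross-products
sxx = sum((x - x_bar) ** 2 for x in thickness)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(thickness, strength))
syy = sum((y - y_bar) ** 2 for y in strength)

# Closed-form least squares estimates for a single predictor
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residuals and summary measures of fit
residuals = [y - (intercept + slope * x) for x, y in zip(thickness, strength)]
rss = sum(e ** 2 for e in residuals)
rse = math.sqrt(rss / (n - 2))   # residual standard error on n - 2 = 18 df
r_squared = 1 - rss / syy

print("intercept:", round(intercept, 4), " slope:", round(slope, 4))
print("residual range:", round(min(residuals), 4), "to", round(max(residuals), 4))
print("residual standard error:", round(rse, 3))
print("R-squared:", round(r_squared, 4))
```

Residuals that stay within roughly two residual standard errors of zero, with no pattern against thickness, are consistent with the single-line model; a residual plot remains the more informative check.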