Chapter 17: Analysis of Covariance (ANCOVA)

ANCOVA is useful when we are interested in comparing treatment effects, but our response is also affected by another numerical variable that we cannot effectively control in our design. It's like having a numerical BLOCK.

Example: A study of the weekly sales, Y, of some item under different advertising strategies (treatments) applied to different stores will be more successful if each store's sales of the item the week before, X, are included in the study. Similarly, physicians studying the effects of diets on renal function will want to use the age of the patients as a covariate, since age may have a large effect on renal function too.

The data for the simplest ANCOVA will be of the following form: $n_i$ observations from the $i$th treatment, recorded as pairs $(Y_{ij}, X_{ij})$, $j = 1, \dots, n_i$ and $i = 1, \dots, r$.

The FULL model, or unequal slopes model, for an ANCOVA simply says that each of the r treatments has its own regression line for Y vs. X, but with the same amount of variability about each line.

Data Example

A common clinical method to evaluate an individual's cardiovascular capacity is treadmill exercise testing. One of the measures obtained during treadmill testing, maximal oxygen uptake, is considered the best index of work capacity and maximal cardiovascular function. The maximal oxygen uptake measured for an individual depends on a number of factors, including the mode of testing, the test protocol, and the subject's physical condition and age. A common test protocol on the treadmill is the inclined protocol, where grade and speed are incrementally increased until exhaustion occurs.

Two treatments were of interest to the researcher: a 12-week step aerobic training program and a 12-week outdoor running regimen on flat terrain. It was thought that the step aerobic training better simulated the treadmill inclined protocol than the flat terrain running regimen. Twelve healthy males who did not participate in a regular exercise program were selected.
Six individuals were randomly assigned to the step aerobic treatment and six to the flat terrain running treatment. Various respiratory measurements were made on the subjects while on the treadmill before the 12-week period. There were no differences in the respiratory measurements of the two groups of subjects prior to the treatment. The measurement of interest for this example is the change in maximal ventilation (liters/minute) of oxygen over the 12-week period. The observations on the 12 subjects and their ages are shown in the following table:

    Aerobic group
      Age      31      23      27      28      22      24
      Change   17.05    4.96   10.40   11.05    0.26    2.51

    Running group
      Age      23      22      22      25      27      20
      Change   -0.87  -10.74   -3.27   -1.97    7.50   -7.25

The experimental design is completely randomized with a one-way treatment structure. However, we need to control for the fact that the subjects had different ages. We also believe that age affects the response variable BUT is not itself affected by the treatment.

A model with a linear predictor for adjusting for age (X) effects is:

    $Y_{ij} = \mu + \tau_i + \beta X_{ij} + (\tau\beta)_i X_{ij} + e_{ij}$

The analysis based on this model is known as analysis of covariance (ANCOVA).

Assumptions:

    $Y_{ij}$ is the response for the (ij)th observation
    $\mu$ is the grand mean
    $\tau_i$ is the ith treatment effect
    $\beta$ is the common slope of the linear predictor
    $(\tau\beta)_i$ is the interaction between the ith treatment effect and the linear predictor
    $X_{ij}$ is the (ij)th value of the linear predictor
    $e_{ij}$ is the (ij)th error; the errors are independent and identically normally distributed with mean zero and variance $\sigma^2$

OR we can write it as:

    $Y_{ij} = \mu + \tau_i + \beta_i X_{ij} + \varepsilon_{ij}$

Where:

    $\mu$ is the overall constant (an average Y intercept over the r regression lines)
    $\tau_i$ is an adjustment to the Y intercept for the ith regression line
    $\beta_i$ is the slope of the ith regression line (it combines $\beta$ and $(\tau\beta)_i$, so $\beta_i = \beta + (\tau\beta)_i$)
    $X_{ij}$ is the covariate, assumed to be measured without error
    $\varepsilon_{ij}$ are independently, normally distributed with mean 0 and variance $\sigma^2$

The true regression line for treatment 1 is $(\mu + \tau_1) + \beta_1 X$, for treatment 2 it is $(\mu + \tau_2) + \beta_2 X$, and so on.
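To connect this notation to software output, here is a small worked illustration using the estimates that SAS reports for the exercise data in the "Model With an Interaction" output later in this chapter. It assumes the SAS convention, visible in that output, of setting the last treatment level (running) to zero, so the reported treatment and interaction estimates are differences from the running line. Values are rounded to five decimals.

```latex
\begin{align*}
\text{Running:}\quad \hat{Y} &= -51.29395 + 2.09470\,X \\
\text{Aerobic:}\quad \hat{Y} &= (-51.29395 + 13.10709) + (2.09470 - 0.31824)\,X \\
                             &= -38.18686 + 1.77646\,X
\end{align*}
```

Each fitted line has its own intercept and its own slope, which is exactly what the unequal slopes model allows.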
In most situations we are interested in comparing the mean responses between treatments at a specified value of X, say $X_0$. Such a difference is labeled D (here for treatments 1 and 2):

    $D = (\tau_1 - \tau_2) + (\beta_1 - \beta_2) X_0$

Obviously, if we try to do this for all possible values of $X_0$ it is going to be a lot of work. Hence it would be much easier for us if the lines were parallel ($\beta_1 = \beta_2 = \dots = \beta_r = \beta$), because then D does not depend on $X_0$ and it is a straight comparison of $(\tau_1 - \tau_2)$. Then our model is of the form:

The Parallel Lines Model:

    $Y_{ij} = \mu + \tau_i + \beta X_{ij} + \varepsilon_{ij}$

So when comparing the mean responses among treatments is of primary interest:

    1. Fit the first (unequal slopes) model.
    2. Check for equality of slopes.
    3. If the test is clearly NON-SIGNIFICANT, fit the second (parallel lines) model and proceed with the comparison of means.

For our data: test whether the interaction $(\tau\beta)$ between the treatment effect and the linear predictor is significant. Recall that interactions measure the parallel nature of the treatment means across the levels of the second factor, which in this case is the linear predictor. If the interaction is significant, this indicates that at least one of the regression lines for a treatment has a different slope. If the interaction is non-significant, there is no evidence that the slopes differ, and we treat the regression lines as having the same slope for each treatment.

How to do this in SAS:

Do a plot to check for the equality of slopes.

    PROC PLOT;
      PLOT Y*X=TRT;
    RUN;

For the oxygen uptake data, the regression lines for the two treatments appear as follows:

    [Figure: Regression lines for the two treatments (Aerobic and Running)]

This should give you a rough idea of whether the lines are indeed parallel. To do a formal test, we want to check for equality of slopes.

    PROC GLM;
      CLASS TRT;
      MODEL Y = TRT X X*TRT;
    RUN;

REMEMBER: You should only interpret the TYPE III F test for X*TRT, which tests for equal slopes. Do not interpret anything else, ESPECIALLY the TRT effects. (In this model the TRT test is a test for equality of the Y intercepts among the treatments, and if X=0 is not in your data range, this test is neither useful nor relevant.)
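The equal-slopes test can also be read as a comparison of the two fitted models. The sketch below works this out for the exercise data, using the error sums of squares and degrees of freedom that appear in the SAS output later in this chapter (unequal slopes model: SSE = 68.3398137 on 8 df; parallel lines model: SSE = 70.3887703 on 9 df). The subscripts F (full) and R (reduced) are notation introduced here for illustration only.

```latex
\begin{align*}
F &= \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F}
   = \frac{(70.3887703 - 68.3398137)/(9 - 8)}{68.3398137/8} \\
  &= \frac{2.0489566}{8.5424767} \approx 0.24
\end{align*}
```

This reproduces the Type III F value for age*treatment in the output (F = 0.24, Pr > F = 0.6375), so there is no evidence against equal slopes for these data.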
Equal Slopes Model:

If you have decided that the slopes are indeed equal, you can use the following statements:

    PROC GLM;
      CLASS TRT;
      MODEL Y = TRT X;
    RUN;

Hypotheses of interest:

    No treatment effects (lines coincide): $H_0: \tau_1 = \tau_2 = \dots = \tau_r$   (TYPE III F TEST FOR TRT)
    No X effect (slope = 0): $H_0: \beta = 0$   (TYPE III F TEST FOR X)

SAS builds a vector of parameters for the model (described after the program below), and you can estimate any estimable combination of them with ESTIMATE statements.

SAS code for the Exercise Data:

    title3 "With Covariate the Age of the Individual";
    data ancova;
      input age oxygen treatment $;
      cards;
    31  17.05 aerobic
    23   4.96 aerobic
    27  10.40 aerobic
    28  11.05 aerobic
    22   0.26 aerobic
    24   2.51 aerobic
    23  -0.87 running
    22 -10.74 running
    22  -3.27 running
    25  -1.97 running
    27   7.50 running
    20  -7.25 running
    ;
    proc print data = ancova;
    run;

    title4 "Model With an Interaction";
    proc glm data = ancova;
      class treatment;
      model oxygen = treatment age treatment*age / solution;
    run;

    title4 "Model Without an Interaction";
    proc glm data = ancova;
      class treatment;
      model oxygen = treatment age / solution;
      lsmeans treatment / pdiff stderr;
    run;

Consider a situation with 3 treatments and 1 covariate. The vector created by SAS is

    (Intercept, TRT 1, TRT 2, TRT 3, slope) = ($\mu, \tau_1, \tau_2, \tau_3, \beta$)

How to use it (each ESTIMATE statement needs a quoted label, and the statements go after the MODEL statement):

    Intercept of treatment 1:
      ESTIMATE 'Intercept, treatment 1' INTERCEPT 1 TRT 1 0 0;
    Common slope:
      ESTIMATE 'Common slope' X 1;
    Distance between lines 1 and 2:
      ESTIMATE 'Line 1 minus line 2' TRT 1 -1 0;
    Mean response in treatment 1 at X=50:
      ESTIMATE 'Treatment 1 mean at X=50' INTERCEPT 1 TRT 1 0 0 X 50;

AND SO ON.

The LSMEANS statement calculates the adjusted means, that is, the treatment means evaluated at the most typical value of X, which is the overall mean $\bar{X}_{..}$. If that is of interest to you, add the following statement after the MODEL statement:

    LSMEANS TRT / STDERR PDIFF;

It gives you the estimates of the adjusted means, their standard errors, and the p-values for the non-simultaneous (unadjusted) pairwise differences among the means. You can use these results to do BONFERRONI-type comparisons.

HOWEVER, NEVER, NEVER USE THE MEANS STATEMENT IN SAS WITH ANCOVA. It reports the raw treatment means, which are not adjusted for the covariate.
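To make "adjusted mean" concrete, the following worked sketch recomputes the LSMEANS values for the exercise data by hand. The group means (7.705 and -2.76667 for the change in oxygen, 25.83333 and 23.16667 for age) and the overall mean age (24.5) are computed here from the data table; the common slope 1.88589 is the age estimate from the parallel lines output shown below. The bar notation is introduced only for this illustration.

```latex
\begin{align*}
\bar{Y}_{i}^{\,adj} &= \bar{Y}_{i\cdot} - \hat{\beta}\,(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot}) \\
\text{Aerobic:}\quad \bar{Y}_{1}^{\,adj} &= 7.705 - 1.88589\,(25.83333 - 24.5) \approx 5.190 \\
\text{Running:}\quad \bar{Y}_{2}^{\,adj} &= -2.76667 - 1.88589\,(23.16667 - 24.5) \approx -0.252
\end{align*}
```

These agree with the LSMEAN values in the output (5.19047708 and -0.25214374). The raw means differ by about 10.47, while the adjusted means differ by about 5.44, because the aerobic group happened to be older on average.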
SAS output for the Exercise Data:

    CRD with Repeated Measures
    Sample Unit = Subject (1 to 8), Treatment = Drug (AX23, BWW9, CONTROL)
    With Covariate the Age of the Individual

    Obs    age    oxygen    treatment
      1     31     17.05    aerobic
      2     23      4.96    aerobic
      3     27     10.40    aerobic
      4     28     11.05    aerobic
      5     22      0.26    aerobic
      6     24      2.51    aerobic
      7     23     -0.87    running
      8     22    -10.74    running
      9     22     -3.27    running
     10     25     -1.97    running
     11     27      7.50    running
     12     20     -7.25    running

    Model With an Interaction

    The GLM Procedure

    Class Level Information
    Class        Levels    Values
    treatment         2    aerobic running

    Number of Observations Read    12
    Number of Observations Used    12

    Dependent Variable: oxygen

    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               3       649.9238779    216.6412926      25.36    0.0002
    Error               8        68.3398137      8.5424767
    Corrected Total    11       718.2636917

    R-Square    Coeff Var    Root MSE    oxygen Mean
    0.904854     118.3700    2.922752       2.469167

    Source           DF      Type I SS     Mean Square    F Value    Pr > F
    treatment         1    328.9674083    328.9674083      38.51    0.0003
    age               1    318.9075130    318.9075130      37.33    0.0003
    age*treatment     1      2.0489566      2.0489566       0.24    0.6375

    Source           DF    Type III SS     Mean Square    F Value    Pr > F
    treatment         1      5.9071100      5.9071100       0.69    0.4298
    age               1    303.1764867    303.1764867      35.49    0.0003
    age*treatment     1      2.0489566      2.0489566       0.24    0.6375

    Parameter                     Estimate        Standard Error    t Value    Pr > |t|
    Intercept                 -51.29394595 B         12.25221255      -4.19      0.0031
    treatment aerobic          13.10709042 B         15.76197619       0.83      0.4298
    treatment running           0.00000000 B                   .          .           .
    age                         2.09470270 B          0.52635853       3.98      0.0041
    age*treatment aerobic      -0.31824378 B          0.64980859      -0.49      0.6375
    age*treatment running       0.00000000 B                   .          .           .

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations.
Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

    Model Without an Interaction

    The GLM Procedure

    Class Level Information
    Class        Levels    Values
    treatment         2    aerobic running

    Number of Observations Read    12
    Number of Observations Used    12

    Dependent Variable: oxygen

    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               2       647.8749214    323.9374607      41.42    <.0001
    Error               9        70.3887703      7.8209745
    Corrected Total    11       718.2636917

    R-Square    Coeff Var    Root MSE    oxygen Mean
    0.902001     113.2609    2.796601       2.469167

    Source       DF      Type I SS     Mean Square    F Value    Pr > F
    treatment     1    328.9674083    328.9674083      42.06    0.0001
    age           1    318.9075130    318.9075130      40.78    0.0001

    Source       DF    Type III SS     Mean Square    F Value    Pr > F
    treatment     1     71.7869428     71.7869428       9.18    0.0143
    age           1    318.9075130    318.9075130      40.78    0.0001

    Parameter                 Estimate        Standard Error    t Value    Pr > |t|
    Intercept             -46.45650248 B          6.93653144      -6.70      <.0001
    treatment aerobic       5.44262082 B          1.79645269       3.03      0.0143
    treatment running       0.00000000 B                   .          .           .
    age                     1.88589219            0.29533500       6.39      0.0001

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
    Model Without an Interaction

    The GLM Procedure
    Least Squares Means

                                                    H0:LSMEAN=0    H0:LSMean1=LSMean2
    treatment    oxygen LSMEAN    Standard Error       Pr > |t|              Pr > |t|
    aerobic         5.19047708        1.20770793         0.0020                0.0143
    running        -0.25214374        1.20770793         0.8

Let us consider the following example: We want to see whether there is a difference in the mean prices of cars (which are roughly the same age and have similar mileage) from 4 different car makers: Chevrolet, Pontiac, Saab and Buick. To look at this, we randomly select 10 sedans from each of the four makers and record the blue book price. Since we cannot get exactly the same mileage for each car, we also record each car's mileage. Based on the data given below, do you see a difference in the mean price by maker? Does mileage matter in terms of price? Does the effect of mileage on price depend on the make?

Here is your data:

    Price      Mileage    Make
    17314.1     8221      Buick
    17542.0     9135      Buick
    16218.8    13196      Buick
    16336.9    16342      Buick
    16339.2    19832      Buick
    15709.1    22236      Buick
    15048.0    22964      Buick
    14862.1    24021      Buick
    15295.0    27325      Buick
    21335.9    10237      Buick
    12649.1     3629      Chevrolet
    12314.6     4142      Chevrolet
    11318.0    11156      Chevrolet
    12409.9    11981      Chevrolet
    11555.3    13404      Chevrolet
    11700.1    15253      Chevrolet
    11215.0    19945      Chevrolet
    10145.0    23963      Chevrolet
     9954.1    37345      Chevrolet
    11918.5     7278      Chevrolet
    25452.5    11892      Pontiac
    23449.3    17273      Pontiac
    23578.2    19148      Pontiac
    22525.3    19521      Pontiac
    21982.6    20472      Pontiac
    22231.6    21929      Pontiac
    22189.1    25651      Pontiac
    21765.1    25794      Pontiac
    21403.8    27168      Pontiac
    21200.7    31197      Pontiac
    26337.8    16068      SAAB
    26775.0    16688      SAAB
    25300.0    19569      SAAB
    24896.6    21266      SAAB
    25996.8    21433      SAAB
    24801.6    26345      SAAB
    24063.0    27674      SAAB
    23249.8    27686      SAAB
    19244.9    30387      SAAB
    26841.1    10003      SAAB

SAS output for the unequal slopes model:

    The SAS System

    The GLM Procedure

    Class Level Information
    Class    Levels    Values
    Make          4    Buick Chevrole Pontiac SAAB
    Number of Observations Read    40
    Number of Observations Used    40

    Dependent Variable: Price

    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               7        1140738140      162962591     151.15    <.0001
    Error              32          34501975        1078187
    Corrected Total    39        1175240115

    R-Square    Coeff Var    Root MSE    Price Mean
    0.970643     5.505130    1038.358      18861.64

    Source          DF    Sum of Squares    Mean Square    F Value    Pr > F
    Make             3       256816406.2     85605468.7      79.40    <.0001
    Mileage          1        61813903.6     61813903.6      57.33    <.0001
    Mileage*Make     3        13247454.2      4415818.1       4.10    0.0144

Since the Mileage*Make interaction is significant (p = 0.0144), the slopes differ among the makes, so the price differences depend on the mileage. We can therefore compare the prices at a specific mileage and see whether they differ. For example, the difference in price between Buick and Chevy at 20,000 miles is:

    Parameter        Estimate      Standard Error    t Value    Pr > |t|
    buick-chevy    3462.45821          1106.15924       3.13      0.0037

This is the ANOVA way of approaching this problem. For the Regression Approach we would need to write MAKE as a set of numerical dummy variables. For example:

    X1 = 1 if make = Buick,   0 otherwise
    X2 = 1 if make = Pontiac, 0 otherwise
    X3 = 1 if make = SAAB,    0 otherwise
    X4 = MILEAGE

Then write your model as:

    Price = B0 + B1 X1 + B2 X2 + B3 X3 + B4 X4 + B5 X1*MILEAGE + B6 X2*MILEAGE + B7 X3*MILEAGE

So here I used Chevy as the base category and am comparing everything to Chevy.
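To see what each coefficient in the dummy-variable model means, it may help to write out the regression line it implies for each make. The derivation below only substitutes the 0/1 values of X1, X2 and X3 into the model above; M stands for MILEAGE and is shorthand introduced here.

```latex
\begin{align*}
\text{Chevrolet } (X_1 = X_2 = X_3 = 0):\quad E[\text{Price}] &= B_0 + B_4 M \\
\text{Buick } (X_1 = 1):\quad E[\text{Price}] &= (B_0 + B_1) + (B_4 + B_5) M \\
\text{Pontiac } (X_2 = 1):\quad E[\text{Price}] &= (B_0 + B_2) + (B_4 + B_6) M \\
\text{SAAB } (X_3 = 1):\quad E[\text{Price}] &= (B_0 + B_3) + (B_4 + B_7) M \\
\text{Buick} - \text{Chevrolet at mileage } M:\quad D &= B_1 + B_5 M
\end{align*}
```

So B1, B2 and B3 are intercept differences from Chevrolet, B5, B6 and B7 are slope differences from Chevrolet, and testing B5 = B6 = B7 = 0 is the regression version of the equal-slopes test.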
    Parameter Estimates

                        Parameter       Standard
    Variable     DF      Estimate          Error    t Value    Pr > |t|
    Intercept     1         12708      600.14716      21.17      <.0001
    x1            1    7336.70734     1109.64057       6.61      <.0001
    x2            1         14441     1526.91391       9.46      <.0001
    x3            1         18301     1361.18979      13.44      <.0001
    x4            1      -0.08035        0.03392      -2.37      0.0241
    x5            1      -0.11817        0.06071      -1.95      0.0604
    x6            1      -0.12740        0.07070      -1.80      0.0810
    x7            1      -0.20788        0.06394      -3.25      0.0027

The SAS program:

    data cars;
      input Price Mileage Make $;
      datalines;
    17314.1  8221 Buick
    17542.0  9135 Buick
    16218.8 13196 Buick
    16336.9 16342 Buick
    16339.2 19832 Buick
    15709.1 22236 Buick
    15048.0 22964 Buick
    14862.1 24021 Buick
    15295.0 27325 Buick
    21335.9 10237 Buick
    12649.1  3629 Chevrolet
    12314.6  4142 Chevrolet
    11318.0 11156 Chevrolet
    12409.9 11981 Chevrolet
    11555.3 13404 Chevrolet
    11700.1 15253 Chevrolet
    11215.0 19945 Chevrolet
    10145.0 23963 Chevrolet
     9954.1 37345 Chevrolet
    11918.5  7278 Chevrolet
    25452.5 11892 Pontiac
    23449.3 17273 Pontiac
    23578.2 19148 Pontiac
    22525.3 19521 Pontiac
    21982.6 20472 Pontiac
    22231.6 21929 Pontiac
    22189.1 25651 Pontiac
    21765.1 25794 Pontiac
    21403.8 27168 Pontiac
    21200.7 31197 Pontiac
    26337.8 16068 SAAB
    26775.0 16688 SAAB
    25300.0 19569 SAAB
    24896.6 21266 SAAB
    25996.8 21433 SAAB
    24801.6 26345 SAAB
    24063.0 27674 SAAB
    23249.8 27686 SAAB
    19244.9 30387 SAAB
    26841.1 10003 SAAB
    ;

    /* Scatter plot of price against mileage, one symbol per make */
    proc gplot data=cars;
      plot price*mileage=make;
    run;

    /* ANOVA approach: unequal slopes model */
    proc glm data=cars;
      class make;
      model price = make mileage make*mileage;
      /* Buick minus Chevy at 20,000 miles: the difference in intercepts   */
      /* plus 20000 times the difference in slopes. The common mileage     */
      /* term cancels in a difference, so its coefficient is zero.         */
      estimate "buick-chevy" make 1 -1 0 0 make*mileage 20000 -20000 0 0;
    run;

    /* Regression approach: dummy variables with Chevrolet as the base */
    data dummy;
      set cars;
      if make="Buick"   then x1=1; else x1=0;
      if make="Pontiac" then x2=1; else x2=0;
      if make="SAAB"    then x3=1; else x3=0;
      x4=mileage;
      x5=x1*x4;
      x6=x2*x4;
      x7=x3*x4;
    run;

    proc reg data=dummy;
      model price = x1 x2 x3 x4 x5 x6 x7;
    run;
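As a possible complement to writing ESTIMATE coefficients by hand, here is a minimal sketch that compares all four makes at one chosen mileage. It assumes the cars data set created in the program above and reuses the 20,000-mile comparison point from the example; the AT option of the LSMEANS statement in PROC GLM evaluates the least squares means at the stated covariate value instead of at the covariate mean. It is meant as an illustration, not as the analysis that produced the output shown earlier.

```sas
/* Sketch: least squares means of price for each make at Mileage = 20000 */
/* under the unequal slopes model (assumes the cars data set above).     */
proc glm data=cars;
   class make;
   model price = make mileage make*mileage;
   /* AT fixes the covariate value used for the LS-means;                */
   /* PDIFF prints unadjusted p-values for all pairwise differences;     */
   /* STDERR prints the standard error of each LS-mean.                  */
   lsmeans make / at mileage=20000 pdiff stderr;
run;
quit;
```

Because the p-values from PDIFF are not adjusted for multiplicity, a Bonferroni-type correction can be applied to them in the same way as described for the exercise data.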