Analysis of Covariance
Chapter 17 Analysis of Covariance (ANCOVA)
ANCOVA is useful when we want to compare treatment effects, but the
response is also affected by another numerical variable that we
cannot effectively control in our design. The covariate acts like a
numerical BLOCK.
Examples:
• A study of weekly sales, Y, of some item under different advertising
strategies (treatments) across stores will be more successful if each
store's sales of the item the week before, X, is included in the study.
• Physicians studying the effects of diets on renal function will
want to use the age of the patients as a covariate, since age
may have a large effect on renal function too.
The data for the simplest ANCOVA will be of the following form:
$n_i$ observations from the $i$th treatment, recorded as pairs
$(Y_{ij}, X_{ij})$, $j = 1, \dots, n_i$ and $i = 1, \dots, r$.
The FULL model, or the unequal slopes model, for an ANCOVA is
simply that each of the r treatments has its own regression
line for Y vs. X, but with the same variance $\sigma^2$ about
each line.
Data Example
A common clinical method to evaluate an individual's cardiovascular
capacity is through treadmill exercise testing. One of the measures obtained
during treadmill testing, maximal oxygen uptake, is considered the best
index of work capacity and maximal cardiovascular function. The measured
maximal oxygen uptake by an individual depends on a number of factors
including the mode of testing, test protocol, and the subject’s physical
condition and age. A common test protocol on the treadmill is the inclined
protocol where grade and speed are incrementally increased until exhaustion
occurs.
Two treatments were of interest to the researcher: a 12-week step aerobic
training program and a 12-week outdoor running regimen on flat terrain. It
was thought that the step aerobic training better simulated the treadmill
inclined protocol than the flat terrain running regimen.
12 healthy males who did not participate in a regular exercise program were
selected. Six individuals were randomly assigned to the step aerobic
treatment and six to the flat terrain running treatment. Various respiratory
measurements were made on the subjects while on the treadmill before the
12-week period. There were no differences in the respiratory measurements
of the two groups of subjects prior to the treatment.
The measurement of interest for this example is the change in maximal
ventilation (liters/minute) of oxygen for the 12-week period. The
observations on the 12 subjects and their ages are shown in the following
table:
Aerobic group          Running group
Age    Change          Age    Change
31      17.05          23      -0.87
23       4.96          22     -10.74
27      10.40          22      -3.27
28      11.05          25      -1.97
22       0.26          27       7.50
24       2.51          20      -7.25
The experimental design is completely randomized with a one-way treatment
structure. However, we need to control for the fact that the subjects had
different ages. We also believe age will affect the response variable BUT
is not affected by the treatment.
A model with a linear predictor for
adjusting for age (X) effects is:
$$Y_{ij} = \mu + \tau_i + \beta X_{ij} + (\tau\beta)_i X_{ij} + e_{ij}$$
The analysis based on this model is known as analysis of covariance
(ANCOVA).
Notation and assumptions:
$Y_{ij}$ is the response for the $(ij)$th observation
$\mu$ is the grand mean
$\tau_i$ is the $i$th treatment effect
$\beta$ is the common slope of the linear predictor
$(\tau\beta)_i$ is the interaction between the treatment effect and the linear predictor
$X_{ij}$ is the $(ij)$th value of the linear predictor
$e_{ij}$ is the $(ij)$th error; the errors are independent and identically normally
distributed with mean zero and variance $\sigma^2$.
OR we can write it as:
$$Y_{ij} = \mu + \tau_i + \beta_i X_{ij} + e_{ij}$$
Where:
• $\mu$ is the overall constant (an average Y intercept over the r
regression lines)
• $\tau_i$: an adjustment to the Y intercept for the $i$th regression line
• $\beta_i$: slope of the $i$th regression line (combines $\beta$ and $(\tau\beta)_i$, so $\beta_i = \beta + (\tau\beta)_i$)
• $X_{ij}$: covariate, assumed to be measured without error
• $e_{ij}$ are independently, normally distributed with mean 0 and
variance $\sigma^2$.
The true regression line for treatment 1 is $(\mu + \tau_1) + \beta_1 X$, for
treatment 2 it is $(\mu + \tau_2) + \beta_2 X$, and so on.
In most situations we are interested in comparing the mean
responses between treatments at a specified value of X, say $X_0$.
Such a difference is labeled D (for treatments 1 and 2):
$$D = (\tau_1 - \tau_2) + (\beta_1 - \beta_2) X_0$$
Obviously, if we try to do this for all possible values of $X_0$ it's
going to be a lot of work. Hence it would be much easier for us if
the lines were parallel ($\beta_1 = \beta_2 = \dots = \beta_r$), because then D no
longer depends on $X_0$ and it's a straight comparison of $(\tau_1 - \tau_2)$.
Then our model is of the form:
The Parallel Lines Model:
$$Y_{ij} = \mu + \tau_i + \beta X_{ij} + e_{ij}$$
So when comparing the mean responses among treatments is of
primary interest:
1. Fit the first (unequal slopes) model.
2. Check for equality of slopes.
3. If the test for unequal slopes is clearly NOT significant, fit the
second (parallel lines) model and proceed with the comparison of means.
For our data
Test whether the interaction $(\tau\beta)_i$ between the treatment effect and the linear
predictor is significant. Recall that interactions measure how parallel
the treatment means are across the levels of the second factor, which in this
case is the linear predictor. If the interaction is significant, this indicates that
at least one of the regression lines for a treatment has a different slope. If the
interaction is non-significant, there is no evidence against a common slope, and
we treat the regression lines as parallel.
How to do this in SAS:
First plot the data to check visually whether the slopes look equal.
PROC PLOT;
PLOT Y*X=TRT;
RUN;
For the oxygen uptake data the regression lines for the two treatments
appear as follows:
[Figure: Regression lines for the two treatments (Aerobic and Running)]
This should give you a rough idea of whether the lines are indeed
parallel.
To do a formal test, we want to check for equality of slopes.
PROC GLM;
CLASS TRT;
MODEL Y=TRT X X*TRT;
RUN;
REMEMBER: In this model you should only interpret the TYPE III F test for
X*TRT, which tests for equal slopes. Do not interpret anything
else, ESPECIALLY the TRT effect. (It tests for equality of the Y
intercepts among the treatments, and if X=0 is not in your data
range this test is neither useful nor relevant.)
Equal Slopes Model:
If you have decided that the slopes are indeed equal, you can use
the following statements:
PROC GLM;
CLASS TRT;
MODEL Y=TRT X;
RUN;
Hypotheses of interest:
No treatment effects (lines coincide): $\tau_1 = \tau_2 = \dots = \tau_r$ (TYPE III F TEST
FOR TRT)
No X effect (slope = 0): $\beta = 0$ (TYPE III F TEST FOR X)
SAS builds a vector of model parameters (illustrated after the code
below for three treatments), and you can estimate any linear
combination of them with ESTIMATE statements in SAS.
SAS code for the Exercise Data:
title3 "With Covariate the Age of the Individual";
data ancova;
  input age oxygen treatment $;
  cards;
31 17.05 aerobic
23 4.96 aerobic
27 10.40 aerobic
28 11.05 aerobic
22 0.26 aerobic
24 2.51 aerobic
23 -0.87 running
22 -10.74 running
22 -3.27 running
25 -1.97 running
27 7.50 running
20 -7.25 running
;
proc print data = ancova;
run;
title4 "Model With an Interation";
proc glm data = ancova;
  class treatment;
  model oxygen = treatment age treatment*age / solution;
run;
title4 "Model Without an Interation";
proc glm data = ancova;
  class treatment;
  model oxygen = treatment age / solution;
  lsmeans treatment / pdiff stderr;
run;
Consider a situation with 3 treatments and 1 covariate.
The parameter vector created by SAS is
$(\mu, \tau_1, \tau_2, \tau_3, \beta)$
that is, the Intercept, the three TRT effects, and the slope.
How to do this (each ESTIMATE statement needs a quoted label):
• If you are interested in the intercept of treatment 1
ESTIMATE 'intercept, trt 1' INTERCEPT 1 TRT 1 0 0;
• Common slope
ESTIMATE 'common slope' X 1;
• Distance between lines 1 and 2
ESTIMATE 'trt 1 - trt 2' TRT 1 -1 0;
• Mean response in treatment 1 at X=50
ESTIMATE 'trt 1 at X=50' INTERCEPT 1 TRT 1 0 0 X 50;
AND SO ON.
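As a sketch of how these ESTIMATE statements might look for the exercise data (two treatments rather than three), the block below refits the equal slopes model and requests three estimates. It assumes the `ancova` data set from the program above has already been created; the labels and the age value of 25 are illustrative choices of ours, not part of the original notes.

```sas
proc glm data = ancova;
  class treatment;                 /* levels sort as aerobic, running */
  model oxygen = treatment age / solution;
  /* vertical distance between the two parallel lines */
  estimate 'aerobic - running' treatment 1 -1;
  /* common slope for age */
  estimate 'common slope' age 1;
  /* mean response for the aerobic group at age 25 (illustrative age) */
  estimate 'aerobic at age 25' intercept 1 treatment 1 0 age 25;
run;
```

Under parallel lines the gap between treatments is the same at every age, so the first estimate needs no coefficient for age.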
The LSMEANS (adjusted means) statement calculates the mean of each
treatment at the most typical value of X, which is the overall mean $\bar{X}_{..}$.
If that is of interest to you, you can use the following statement
after the MODEL statement:
LSMEANS TRT / STDERR PDIFF;
It gives you the estimates of the adjusted means, their standard errors, and
the p-values for the non-simultaneous differences among the means. You can
use these results to do BONFERRONI-type comparisons.
HOWEVER, NEVER NEVER USE THE MEANS STATEMENT
IN SAS WITH ANCOVA: it reports raw treatment means that are not adjusted
for the covariate.
CRD with Repeated Measures
Sample Unit = Subject (1 to 8), Treatment = Drug (AX23, BWW9, CONTROL)
With Covariate the Age of the Individual

Obs    age    oxygen    treatment
  1     31     17.05    aerobic
  2     23      4.96    aerobic
  3     27     10.40    aerobic
  4     28     11.05    aerobic
  5     22      0.26    aerobic
  6     24      2.51    aerobic
  7     23     -0.87    running
  8     22    -10.74    running
  9     22     -3.27    running
 10     25     -1.97    running
 11     27      7.50    running
 12     20     -7.25    running
Model With an Interation
The GLM Procedure
Class Level Information

Class        Levels    Values
treatment         2    aerobic running

Number of Observations Read    12
Number of Observations Used    12
Model With an Interation
The GLM Procedure
Dependent Variable: oxygen

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3       649.9238779    216.6412926      25.36    0.0002
Error               8        68.3398137      8.5424767
Corrected Total    11       718.2636917

R-Square    Coeff Var    Root MSE    oxygen Mean
0.904854     118.3700    2.922752       2.469167

Source           DF      Type I SS    Mean Square    F Value    Pr > F
treatment         1    328.9674083    328.9674083      38.51    0.0003
age               1    318.9075130    318.9075130      37.33    0.0003
age*treatment     1      2.0489566      2.0489566       0.24    0.6375

Source           DF    Type III SS    Mean Square    F Value    Pr > F
treatment         1      5.9071100      5.9071100       0.69    0.4298
age               1    303.1764867    303.1764867      35.49    0.0003
age*treatment     1      2.0489566      2.0489566       0.24    0.6375

Parameter                    Estimate       Standard Error    t Value    Pr > |t|
Intercept                -51.29394595 B        12.25221255      -4.19      0.0031
treatment aerobic         13.10709042 B        15.76197619       0.83      0.4298
treatment running          0.00000000 B         .                 .         .
age                        2.09470270 B         0.52635853       3.98      0.0041
age*treatment aerobic     -0.31824378 B         0.64980859      -0.49      0.6375
age*treatment running      0.00000000 B         .                 .         .

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve
the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely
estimable.
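To read the solution table for the interaction model, it may help to turn the estimates into the two fitted lines. Running is the reference level (its estimates are set to 0), so the running line uses the Intercept and age estimates directly, and the aerobic line adds the two aerobic adjustments. The arithmetic below uses only the printed estimates, rounded to four decimals:

```latex
\begin{aligned}
\text{Running:}\quad \hat{Y} &= -51.2939 + 2.0947\,\text{age} \\
\text{Aerobic:}\quad \hat{Y} &= (-51.2939 + 13.1071) + (2.0947 - 0.3182)\,\text{age} \\
                             &= -38.1869 + 1.7765\,\text{age}
\end{aligned}
```

The two slopes differ by only 0.3182, and the age*treatment test (F = 0.24, p = 0.6375) gives no evidence that this difference is real, which is why we move on to the equal slopes model.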
Model Without an Interation
The GLM Procedure
Class Level Information

Class        Levels    Values
treatment         2    aerobic running

Number of Observations Read    12
Number of Observations Used    12
Model Without an Interation
The GLM Procedure
Dependent Variable: oxygen

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2       647.8749214    323.9374607      41.42    <.0001
Error               9        70.3887703      7.8209745
Corrected Total    11       718.2636917

R-Square    Coeff Var    Root MSE    oxygen Mean
0.902001     113.2609    2.796601       2.469167

Source       DF      Type I SS    Mean Square    F Value    Pr > F
treatment     1    328.9674083    328.9674083      42.06    0.0001
age           1    318.9075130    318.9075130      40.78    0.0001

Source       DF    Type III SS    Mean Square    F Value    Pr > F
treatment     1     71.7869428     71.7869428       9.18    0.0143
age           1    318.9075130    318.9075130      40.78    0.0001

Parameter                Estimate       Standard Error    t Value    Pr > |t|
Intercept            -46.45650248 B         6.93653144      -6.70      <.0001
treatment aerobic      5.44262082 B         1.79645269       3.03      0.0143
treatment running      0.00000000 B         .                 .         .
age                    1.88589219           0.29533500       6.39      0.0001

NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve
the normal equations. Terms whose estimates are followed by the letter 'B' are not
uniquely estimable.
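The equal slopes test can also be recovered by comparing the error sums of squares of the two fitted models (a general linear test). The sketch below plugs in the Error lines from the two GLM outputs: SSE = 70.3887703 on 9 df for the equal slopes model and SSE = 68.3398137 on 8 df for the unequal slopes model. The subscripts "equal" and "unequal" are our own labels.

```latex
F = \frac{(\mathrm{SSE}_{\text{equal}} - \mathrm{SSE}_{\text{unequal}}) / (df_{\text{equal}} - df_{\text{unequal}})}
         {\mathrm{SSE}_{\text{unequal}} / df_{\text{unequal}}}
  = \frac{(70.3887703 - 68.3398137)/(9 - 8)}{68.3398137 / 8}
  = \frac{2.0489566}{8.5424767}
  \approx 0.24
```

This agrees with the F value of 0.24 reported for age*treatment in the interaction model, so dropping the interaction costs almost nothing in fit.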
Model Without an Interation
The GLM Procedure
Least Squares Means

                                                H0:LSMEAN=0    H0:LSMean1=LSMean2
treatment    oxygen LSMEAN    Standard Error       Pr > |t|              Pr > |t|
aerobic         5.19047708        1.20770793         0.0020                0.0143
running        -0.25214374        1.20770793         0.8
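The adjusted means can be reproduced by hand from the equal slopes solution, which shows what "the mean at the typical value of X" amounts to here. The twelve ages in the data sum to 294, so the overall mean age is 24.5. Plugging that into each fitted line, with estimates rounded to five decimals:

```latex
\begin{aligned}
\bar{X}_{..} &= 294 / 12 = 24.5 \\
\text{Aerobic:}\quad & -46.45650 + 5.44262 + 1.88589 \times 24.5 \approx 5.190 \\
\text{Running:}\quad & -46.45650 + 0 + 1.88589 \times 24.5 \approx -0.252 \\
\text{Difference:}\quad & 5.190 - (-0.252) \approx 5.443
\end{aligned}
```

These match the LSMEAN column, and the difference equals the `treatment aerobic` estimate from the solution, which is why the p-value for LSMean1 = LSMean2 (0.0143) is the same as the Type III p-value for treatment.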
Let us consider the following example:
We want to see whether mean car prices differ among 4 car makers
(Chevrolet, Pontiac, Saab and Buick) for cars that are roughly the
same age and have similar mileage. To look at this, we randomly
select 10 sedans from each of the four makers and record the blue
book price. Since we cannot get exactly the same mileage for every
car, we also record each car's mileage.
Based on the data given below, do you see a difference in the mean
price among the makers? Does mileage matter for price? Does the
effect of mileage on price depend on the make?
Here is your data:
Price      Mileage    Make
17314.1       8221    Buick
17542.0       9135    Buick
16218.8      13196    Buick
16336.9      16342    Buick
16339.2      19832    Buick
15709.1      22236    Buick
15048.0      22964    Buick
14862.1      24021    Buick
15295.0      27325    Buick
21335.9      10237    Buick
12649.1       3629    Chevrolet
12314.6       4142    Chevrolet
11318.0      11156    Chevrolet
12409.9      11981    Chevrolet
11555.3      13404    Chevrolet
11700.1      15253    Chevrolet
11215.0      19945    Chevrolet
10145.0      23963    Chevrolet
 9954.1      37345    Chevrolet
11918.5       7278    Chevrolet
25452.5      11892    Pontiac
23449.3      17273    Pontiac
23578.2      19148    Pontiac
22525.3      19521    Pontiac
21982.6      20472    Pontiac
22231.6      21929    Pontiac
22189.1      25651    Pontiac
21765.1      25794    Pontiac
21403.8      27168    Pontiac
21200.7      31197    Pontiac
26337.8      16068    SAAB
26775.0      16688    SAAB
25300.0      19569    SAAB
24896.6      21266    SAAB
25996.8      21433    SAAB
24801.6      26345    SAAB
24063.0      27674    SAAB
23249.8      27686    SAAB
19244.9      30387    SAAB
26841.1      10003    SAAB
The SAS System
The GLM Procedure
Class Level Information

Class    Levels    Values
Make          4    Buick Chevrole Pontiac SAAB

Number of Observations Read    40
Number of Observations Used    40

Dependent Variable: Price

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               7        1140738140      162962591     151.15    <.0001
Error              32          34501975        1078187
Corrected Total    39        1175240115

R-Square    Coeff Var    Root MSE    Price Mean
0.970643     5.505130    1038.358      18861.64

Source          DF             SS    Mean Square    F Value    Pr > F
Make             3    256816406.2     85605468.7      79.40    <.0001
Mileage          1     61813903.6     61813903.6      57.33    <.0001
Mileage*Make     3     13247454.2      4415818.1       4.10    0.0144
Since the Mileage*Make interaction is significant (F = 4.10, p = 0.0144),
the lines are not parallel, so we compare the prices at a specific
mileage and see if they are different.
For example, for Buick versus Chevy at 20,000 miles, the ESTIMATE
statement in the SAS program at the end of this example reports:

Parameter        Estimate    Standard Error    t Value    Pr > |t|
buick-chevy    3462.45821        1106.15924       3.13      0.0037
This is the ANOVA way of approaching this problem.
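One way to make the target of such a comparison explicit is to write out the `make*mileage` coefficients in the ESTIMATE statement instead of leaving them unspecified. With separate slopes, the Buick minus Chevrolet difference at 20,000 miles is the intercept difference plus 20,000 times the slope difference, and the coefficients below encode exactly that. This is a sketch: it assumes the `cars` data set from the SAS program at the end of this example and the level order Buick, Chevrolet, Pontiac, SAAB shown in the Class Level Information; the label text is our own.

```sas
proc glm data=cars;
  class make;
  model price = make mileage make*mileage;
  /* (Buick intercept - Chevrolet intercept)
     + 20000 * (Buick slope - Chevrolet slope) */
  estimate 'Buick - Chevy at 20000 miles'
           make 1 -1 0 0
           make*mileage 20000 -20000 0 0;
run;
```

The common `mileage` term is the same for both makes and cancels in the difference, so it gets no coefficient here.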
For the Regression Approach we need to code MAKE numerically by
defining dummy variables:
For example
X1 = 1 if make=Buick, 0 otherwise
X2 = 1 if make=Pontiac, 0 otherwise
X3 = 1 if make=SAAB, 0 otherwise
X4 = MILEAGE
Then write your model as:
Price = B0 + B1 X1 + B2 X2 + B3 X3 + B4 X4 + B5 X1*MILEAGE +
B6 X2*MILEAGE + B7 X3*MILEAGE
Here Chevy is the base category and everything is compared to
Chevy: B0 and B4 are the Chevy intercept and slope, B1 to B3 shift
the intercept, and B5 to B7 shift the slope for the other makes.
Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1                 12708         600.14716      21.17      <.0001
x1            1            7336.70734        1109.64057       6.61      <.0001
x2            1                 14441        1526.91391       9.46      <.0001
x3            1                 18301        1361.18979      13.44      <.0001
x4            1              -0.08035           0.03392      -2.37      0.0241
x5            1              -0.11817           0.06071      -1.95      0.0604
x6            1              -0.12740           0.07070      -1.80      0.0810
x7            1              -0.20788           0.06394      -3.25      0.0027
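The regression estimates can be turned into one fitted line per make, which gives a direct check on any comparison at a fixed mileage. The arithmetic below uses only the printed estimates (Chevy is the base category, M is mileage):

```latex
\begin{aligned}
\text{Chevrolet:}\quad \widehat{\text{Price}} &= 12708 - 0.08035\,M \\
\text{Buick:}\quad \widehat{\text{Price}} &= (12708 + 7336.71) + (-0.08035 - 0.11817)\,M = 20044.71 - 0.19852\,M \\
\text{Pontiac:}\quad \widehat{\text{Price}} &= (12708 + 14441) + (-0.08035 - 0.12740)\,M = 27149 - 0.20775\,M \\
\text{SAAB:}\quad \widehat{\text{Price}} &= (12708 + 18301) + (-0.08035 - 0.20788)\,M = 31009 - 0.28823\,M \\[4pt]
\text{Buick} - \text{Chevrolet at } M = 20000:\quad & 7336.71 + (-0.11817)(20000) = 7336.71 - 2363.40 = 4973.31
\end{aligned}
```

This does not match the 3462.46 reported earlier for "buick-chevy". That figure is reproduced instead by 7336.71 + 20000 × (−0.19371), where −0.19371 is the average of the four slopes above. Our reading, which the output itself does not state, is that the ESTIMATE statement as written (with no `make*mileage` coefficients) is estimating something other than the gap between the Buick and Chevrolet lines at 20,000 miles, so the 3462.46 value should be interpreted with care.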
The SAS program:
data cars;
  /* Make $ stores 8 characters, so Chevrolet is kept as Chevrole,
     as seen in the Class Level Information */
  input Price Mileage Make $;
  datalines;
17314.1 8221 Buick
17542.0 9135 Buick
16218.8 13196 Buick
16336.9 16342 Buick
16339.2 19832 Buick
15709.1 22236 Buick
15048.0 22964 Buick
14862.1 24021 Buick
15295.0 27325 Buick
21335.9 10237 Buick
12649.1 3629 Chevrolet
12314.6 4142 Chevrolet
11318.0 11156 Chevrolet
12409.9 11981 Chevrolet
11555.3 13404 Chevrolet
11700.1 15253 Chevrolet
11215.0 19945 Chevrolet
10145.0 23963 Chevrolet
9954.1 37345 Chevrolet
11918.5 7278 Chevrolet
25452.5 11892 Pontiac
23449.3 17273 Pontiac
23578.2 19148 Pontiac
22525.3 19521 Pontiac
21982.6 20472 Pontiac
22231.6 21929 Pontiac
22189.1 25651 Pontiac
21765.1 25794 Pontiac
21403.8 27168 Pontiac
21200.7 31197 Pontiac
26337.8 16068 SAAB
26775.0 16688 SAAB
25300.0 19569 SAAB
24896.6 21266 SAAB
25996.8 21433 SAAB
24801.6 26345 SAAB
24063.0 27674 SAAB
23249.8 27686 SAAB
19244.9 30387 SAAB
26841.1 10003 SAAB
;
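/* scatter plot of price against mileage, one symbol per make */
proc gplot data=cars;
  plot price*mileage=make;
run;
/* unequal slopes model: separate intercept and mileage slope per make */
proc glm data=cars;
  class make;
  model price=make mileage make*mileage;
  estimate "buick-chevy" make 1 -1 0 0 mileage 20000;
run;
/* indicator variables with Chevrolet as the base category */
data dummy;
  set cars;
  if make="Buick" then x1=1;
  else x1=0;
  if make="Pontiac" then x2=1;
  else x2=0;
  if make="SAAB" then x3=1;
  else x3=0;
  x4=mileage;
  /* make-by-mileage interaction terms */
  x5=x1*x4;
  x6=x2*x4;
  x7=x3*x4;
run;
/* same unequal slopes model, fitted as a regression */
proc reg data=dummy;
  model price = x1 x2 x3 x4 x5 x6 x7;
run;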