The Use of Dummy Variables

Dummy variables are artificially defined variables designed to convert a model including
categorical independent variables to the standard multiple regression model.
Comparison of Slopes of k Regression Lines with Common Intercept
Situation:
- k treatments or k populations are being compared.
- For each of the k treatments we have measured both Y (the response variable)
and X (an independent variable)
- Y is assumed to be linearly related to X with the slope dependent on treatment
(population), while the intercept is the same for each treatment
The Model:
$Y = \beta_0 + \beta_1^{(i)} X + \varepsilon$ for treatment i (i = 1, 2, ... , k)
Graphical Illustration of the above Model
[Figure: y plotted against x for Treat 1, Treat 2, Treat 3, ..., Treat k; the lines have different slopes and a common intercept.]
This model can be artificially put into the form of the Multiple Regression model by the
use of dummy variables to handle the categorical variable Treatments. Dummy variables
are variables that are artificially defined:
In this case we define a new variable for each category of the categorical variable.
That is, we define $X_i$ for each treatment category as follows:
$X_i = X$ if the subject receives treatment i, and $X_i = 0$ otherwise.

Then the model can be written as follows:
The Complete Model: (in Multiple Regression Format)
$Y = \beta_0 + \beta_1^{(1)} X_1 + \beta_1^{(2)} X_2 + \dots + \beta_1^{(k)} X_k + \varepsilon$
Dependent Variable: Y
Independent Variables: X1, X2, ... , Xk
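The construction of the dummy variables can be sketched in a few lines of Python. This is a minimal illustration only: the function name `slope_dummies` and the four sample rows are hypothetical, chosen to show the rule $X_i = X$ for treatment i and 0 otherwise.

```python
def slope_dummies(treatments, x, levels):
    """Return one row per case, with one column per treatment level.

    Column i holds x when the case received treatment levels[i], else 0.
    """
    return [[xv if t == level else 0.0 for level in levels]
            for t, xv in zip(treatments, x)]


# Hypothetical mini data set: treatment label and covariate value per case.
treatments = ["A", "B", "C", "A"]
x = [2.0, 2.0, 4.0, 8.0]

rows = slope_dummies(treatments, x, levels=["A", "B", "C"])
for t, xv, row in zip(treatments, x, rows):
    print(t, xv, row)
```

Each row has exactly one non-zero entry, so the dummy variables of a case always add up to its original X value.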
In the above situation we would likely be interested in testing the equality of the slopes.
Namely the Null Hypothesis
$H_0: \beta_1^{(1)} = \beta_1^{(2)} = \dots = \beta_1^{(k)} = \beta_1$ (q = k-1)
In this situation the model would become as follows
The Reduced Model: $Y = \beta_0 + \beta_1 X + \varepsilon$
Dependent Variable: Y
Independent Variables: X = X1 + X2 + ... + Xk
The Anova Table to carry out this test would take on the following form:
The Anova Table:
Source                                    df       Sum of Squares   Mean Square            F
Regression (for the reduced model)        1        SSReg            MSReg = SSReg/1        MSReg/$s^2$
Departure from H0 (Equality of Slopes)    k-1      SSH0             MSH0 = SSH0/(k-1)      MSH0/$s^2$
Residual (Error)                          N-k-1    SSError          $s^2$
Total                                     N-1      SSTotal
(N= The total number of cases = n1 + n2 + ... + nk and ni = the number of cases for treatment i)
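The table does not spell out how SSH0 is obtained. A standard way to compute it, sketched here with the labels "reduced" and "complete" introduced only for this note, is as the drop in residual sum of squares when moving from the reduced model to the complete model:

```latex
\begin{align*}
SS_{H_0} &= SS_{\text{Error}}^{\text{reduced}} - SS_{\text{Error}}^{\text{complete}}
          = SS_{\text{Reg}}^{\text{complete}} - SS_{\text{Reg}}^{\text{reduced}}, \\
s^2 &= \frac{SS_{\text{Error}}^{\text{complete}}}{N-k-1}, \qquad
F = \frac{SS_{H_0}/(k-1)}{s^2}.
\end{align*}
```

Under H0, and with the usual assumption of independent normal errors with constant variance, this F statistic has k-1 and N-k-1 degrees of freedom.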
Example
In the following example we measure Yield Y as it depends on the amount of
pesticide X. Again we assume that the dependence is linear. (I should point out
that the concepts used in this discussion can easily be adapted to the non-linear
situation.) Suppose that the experiment is repeated for three brands of
pesticide - A, B and C. The quantity, X, of pesticide in this experiment was set at 3
different levels: 2 units/hectare, 4 units/hectare and 8 units/hectare. Four test plots
were randomly assigned to each of the nine combinations of brand and level of
pesticide. Note that we would expect a common intercept for each brand of pesticide,
since when the amount of pesticide, X, is zero the three brands of pesticide would be
equivalent.
The data for this experiment is given in the following table:
Yield Y by brand and amount of pesticide X (four test plots per combination):
Brand    X = 2                        X = 4                        X = 8
A        29.63 31.87 28.02 35.24      28.16 33.48 28.13 28.25      28.45 37.21 35.06 33.99
B        32.95 24.74 23.38 32.08      29.55 34.97 36.35 38.38      44.38 38.78 34.92 27.45
C        28.68 28.70 22.67 30.02      33.79 43.95 36.89 33.56      46.26 50.77 50.21 44.14
A graph of the data is displayed below:
[Figure: plot of the data for brands A, B and C; horizontal axis from 0 to 8, vertical axis from 0 to 60.]
The data as they would appear in a data file are shown below. The variables X1, X2 and X3
are the “dummy” variables.
Pesticide   X (Amount)   X1   X2   X3   Y
A           2            2    0    0    29.63
A           2            2    0    0    31.87
A           2            2    0    0    28.02
A           2            2    0    0    35.24
B           2            0    2    0    32.95
B           2            0    2    0    24.74
B           2            0    2    0    23.38
B           2            0    2    0    32.08
C           2            0    0    2    28.68
C           2            0    0    2    28.70
C           2            0    0    2    22.67
C           2            0    0    2    30.02
A           4            4    0    0    28.16
A           4            4    0    0    33.48
A           4            4    0    0    28.13
A           4            4    0    0    28.25
B           4            0    4    0    29.55
B           4            0    4    0    34.97
B           4            0    4    0    36.35
B           4            0    4    0    38.38
C           4            0    0    4    33.79
C           4            0    0    4    43.95
C           4            0    0    4    36.89
C           4            0    0    4    33.56
A           8            8    0    0    28.45
A           8            8    0    0    37.21
A           8            8    0    0    35.06
A           8            8    0    0    33.99
B           8            0    8    0    44.38
B           8            0    8    0    38.78
B           8            0    8    0    34.92
B           8            0    8    0    27.45
C           8            0    0    8    46.26
C           8            0    0    8    50.77
C           8            0    0    8    50.21
C           8            0    0    8    44.14
Fitting the complete model
ANOVA
              df    SS             MS             F              Significance F
Regression    3     1095.815813    365.2719378    18.33114788    4.19538E-07
Residual      32    637.6415754    19.92629923
Total         35    1733.457389

              Coefficients
Intercept     26.24166667
X1            0.981388889
X2            1.422638889
X3            2.602400794
Fitting the Reduced model
ANOVA
              df    SS             MS             F              Significance F
Regression    1     623.8232508    623.8232508    19.11439978    0.000110172
Residual      34    1109.634138    32.63629818
Total         35    1733.457389

              Coefficients
Intercept     26.24166667
X             1.668809524
The Anova Table for testing the equality of slopes
                     df    SS             MS             F              Significance F
common slope zero    1     623.8232508    623.8232508    31.3065283     3.51448E-06
Slope comparison     2     471.9925627    235.9962813    11.84345766    0.000141367
Residual             32    637.6415754    19.92629923
Total                35    1733.457389
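The slope comparison for the pesticide example can be checked with a short script. The sketch below is not the software that produced the output in these notes. It is a minimal standard-library implementation in which the helper names `solve` and `ols` are hypothetical and least squares is done through the normal equations. In practice a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(columns, y):
    """Least squares with an intercept; returns (coefficients, residual sum of squares)."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yv for r, yv in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    sse = sum((yv - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yv in zip(rows, y))
    return beta, sse


# Data in the same order as the data file: brands A, B, C (4 plots each) at X = 2, 4, 8.
brands = (["A"] * 4 + ["B"] * 4 + ["C"] * 4) * 3
x = [2.0] * 12 + [4.0] * 12 + [8.0] * 12
y = [29.63, 31.87, 28.02, 35.24, 32.95, 24.74, 23.38, 32.08, 28.68, 28.70, 22.67, 30.02,
     28.16, 33.48, 28.13, 28.25, 29.55, 34.97, 36.35, 38.38, 33.79, 43.95, 36.89, 33.56,
     28.45, 37.21, 35.06, 33.99, 44.38, 38.78, 34.92, 27.45, 46.26, 50.77, 50.21, 44.14]

# Dummy variables: X_i = X for brand i, 0 otherwise.
x1, x2, x3 = ([xv if b == brand else 0.0 for b, xv in zip(brands, x)]
              for brand in "ABC")

beta_c, sse_c = ols([x1, x2, x3], y)   # complete model: one slope per brand
beta_r, sse_r = ols([x], y)            # reduced model: common slope

q = 2                                  # k - 1 with k = 3 brands
df_error = len(y) - 3 - 1              # N - k - 1
ss_h0 = sse_r - sse_c
f = (ss_h0 / q) / (sse_c / df_error)

print("complete model coefficients:", beta_c)
print("reduced model coefficients: ", beta_r)
print("SSH0 =", ss_h0, " s^2 =", sse_c / df_error, " F =", f)
```

The printed coefficients and F statistic should agree with the tables above up to rounding. The significance level is not computed here because the standard library has no F distribution.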
Comparison of Intercepts of k Regression Lines with a Common Slope (One-way
Analysis of Covariance)
Situation:
- k treatments or k populations are being compared.
- For each of the k treatments we have measured both Y (the response variable)
and X (an independent variable)
- Y is assumed to be linearly related to X with the intercept dependent on treatment
(population), while the slope is the same for each treatment.
- Y is called the response variable, while X is called the covariate.
The Model:
$Y = \beta_0^{(i)} + \beta_1 X + \varepsilon$ for treatment i (i = 1, 2, ... , k)
Graphical Illustration of the One-way Analysis of Covariance Model
[Figure: y plotted against x for Treat 1, Treat 2, Treat 3, ..., Treat k; the lines have common slopes.]
Equivalent Forms of the Model:
1) $Y = \mu_i + \beta_1 (X - \bar{X}) + \varepsilon$ (treatment i), where
$\mu_i$ = the adjusted mean for treatment i
2) $Y = \mu + \alpha_i + \beta_1 (X - \bar{X}) + \varepsilon$ (treatment i), where
$\mu$ = the overall adjusted mean response
$\alpha_i$ = the adjusted effect for treatment i
$\mu_i = \mu + \alpha_i$
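The link between these forms and the intercept form of the model follows by adding and subtracting $\beta_1 \bar{X}$. The derivation below is a short sketch. It assumes that $\bar{X}$ denotes the overall mean of the covariate, which the notes do not state explicitly.

```latex
\begin{align*}
Y &= \beta_0^{(i)} + \beta_1 X + \varepsilon \\
  &= \bigl(\beta_0^{(i)} + \beta_1 \bar{X}\bigr) + \beta_1 (X - \bar{X}) + \varepsilon \\
  &= \mu_i + \beta_1 (X - \bar{X}) + \varepsilon,
  \qquad \text{with } \mu_i = \beta_0^{(i)} + \beta_1 \bar{X}.
\end{align*}
```

So $\mu_i$ is the expected response for treatment i at $X = \bar{X}$, and because $\beta_1 \bar{X}$ is the same for every treatment, the intercepts are equal exactly when the adjusted means are equal.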
The Complete Model: (in Multiple Regression Format)
$Y = \beta_0 + \delta_1 X_1 + \delta_2 X_2 + \dots + \delta_{k-1} X_{k-1} + \beta_1 X + \varepsilon$
where $X_i = 1$ if the subject receives treatment i, and $X_i = 0$ otherwise.
Comment: $\beta_0^{(i)} = \beta_0 + \delta_i$ for treatment i = 1, 2, 3, ..., k-1; and $\beta_0^{(k)} = \beta_0$.
Dependent Variable: Y
Independent Variables: X1, X2, ... , Xk-1, X
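Unlike the slope dummies of the previous section, these dummy variables are plain 0/1 indicators, and only k-1 of them are needed because treatment k serves as the baseline. A minimal sketch follows, in which the function name `indicator_dummies` and the six sample cases are hypothetical:

```python
def indicator_dummies(treatments, levels):
    """Return one row per case with a 0/1 column for each listed treatment level.

    The treatment left out of `levels` is the baseline: its rows are all zeros.
    """
    return [[1 if t == level else 0 for level in levels] for t in treatments]


# Hypothetical cases from k = 4 treatments; dummies are built for treatments 1..3 only.
treatments = [1, 2, 3, 4, 4, 1]
rows = indicator_dummies(treatments, levels=[1, 2, 3])
for t, row in zip(treatments, rows):
    print(t, row)
```

A case from treatment 4 has all dummies equal to zero, so its intercept is $\beta_0$ alone, while a case from treatment i < 4 picks up the extra term $\delta_i$.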
Testing for the Equality of Intercepts (Treatments)
$H_0: \beta_0^{(1)} = \beta_0^{(2)} = \dots = \beta_0^{(k)}$ ($= \beta_0$, say) (q = k-1)
(or $\delta_1 = \delta_2 = \dots = \delta_{k-1} = 0$)
The Reduced Model: $Y = \beta_0 + \beta_1 X + \varepsilon$
Dependent Variable: Y
Independent Variables: X
The Anova Table (Analysis of Covariance Table):
Source                                                     df       Sum of Squares   Mean Square            F
Regression (for the reduced model)                         1        SSReg            MSReg = SSReg/1        MSReg/$s^2$
Departure from H0 (Equality of Intercepts (Treatments))    k-1      SSH0             MSH0 = SSH0/(k-1)      MSH0/$s^2$
Residual (Error)                                           N-k-1    SSError          $s^2$
Total                                                      N-1      SSTotal
where
N = the total number of cases = n1 + n2 + ... + nk
and
ni = the number of cases for treatment i
An Example
In this example we are comparing four treatments for reducing blood pressure in patients
whose blood pressure is abnormally high. Ten patients are randomly assigned to each of
the four treatment groups. In addition to the drop in blood pressure (Y) during the test
period, the initial blood pressure (X) prior to the test period was also recorded. It was
thought that this would be correlated with Y. The data for this experiment are given below.
                   Case
Treatment          1     2     3     4     5     6     7     8     9     10
1            X     186   185   199   167   187   168   183   176   158   190
             Y     34    36    41    34    36    38    39    34    37    35
2            X     183   202   149   187   182   139   167   192   160   185
             Y     29    36    27    29    27    28    22    32    26    30
3            X     182   168   175   174   183   182   181   148   205   188
             Y     27    30    28    31    28    25    27    25    32    25
4            X     176   202   159   164   176   173   159   167   174   175
             Y     26    26    20    18    27    20    24    22    22    25
The data as it would appear in a data file:
X      Y     Treatment   X1   X2   X3
186    34    1           1    0    0
185    36    1           1    0    0
199    41    1           1    0    0
167    34    1           1    0    0
187    36    1           1    0    0
168    38    1           1    0    0
183    39    1           1    0    0
176    34    1           1    0    0
158    37    1           1    0    0
190    35    1           1    0    0
183    29    2           0    1    0
202    36    2           0    1    0
149    27    2           0    1    0
187    29    2           0    1    0
182    27    2           0    1    0
139    28    2           0    1    0
167    22    2           0    1    0
192    32    2           0    1    0
160    26    2           0    1    0
185    30    2           0    1    0
182    27    3           0    0    1
168    30    3           0    0    1
175    28    3           0    0    1
174    31    3           0    0    1
183    28    3           0    0    1
182    25    3           0    0    1
181    27    3           0    0    1
148    25    3           0    0    1
205    32    3           0    0    1
188    25    3           0    0    1
176    26    4           0    0    0
202    26    4           0    0    0
159    20    4           0    0    0
164    18    4           0    0    0
176    27    4           0    0    0
173    20    4           0    0    0
159    24    4           0    0    0
167    22    4           0    0    0
174    22    4           0    0    0
175    25    4           0    0    0
The Complete Model
ANOVA
              df    SS             MS             F              Significance F
Regression    4     1000.862103    250.2155258    36.6366318     4.66264E-12
Residual      35    239.0378966    6.829654189
Total         39    1239.9

              Coefficients
Intercept     6.360395468
X1            12.68618508
X2            5.397430901
X3            4.211584999
X             0.096461476

The Reduced Model
ANOVA
              df    SS             MS             F              Significance F
Regression    1     187.7440297    187.7440297    6.78062315     0.013076205
Residual      38    1052.15597     27.68831501
Total         39    1239.9

              Coefficients
Intercept     2.991349082
X             0.147157885
The Anova Table for comparing intercepts:
ANOVA
                            df    SS             MS             F              Significance F
Testing for slope           1     187.7440297    187.7440297    27.48953674    7.68771E-06
Comparison of intercepts    3     813.1180737    271.0393579    39.68566349    2.32981E-11
Residual                    35    239.0378966    6.829654189
Total                       39    1239.9
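The comparison of intercepts for the blood pressure example can be reproduced in the same spirit. The sketch below is not the software that produced the output in these notes. The helper names `solve` and `ols` are hypothetical, and least squares is done through the normal equations using only the standard library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(columns, y):
    """Least squares with an intercept; returns (coefficients, residual sum of squares)."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yv for r, yv in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    sse = sum((yv - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yv in zip(rows, y))
    return beta, sse


# Initial blood pressure X and drop Y, ten patients per treatment, treatments 1 to 4.
x = [186, 185, 199, 167, 187, 168, 183, 176, 158, 190,
     183, 202, 149, 187, 182, 139, 167, 192, 160, 185,
     182, 168, 175, 174, 183, 182, 181, 148, 205, 188,
     176, 202, 159, 164, 176, 173, 159, 167, 174, 175]
y = [34, 36, 41, 34, 36, 38, 39, 34, 37, 35,
     29, 36, 27, 29, 27, 28, 22, 32, 26, 30,
     27, 30, 28, 31, 28, 25, 27, 25, 32, 25,
     26, 26, 20, 18, 27, 20, 24, 22, 22, 25]
treatment = [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10

# Indicator dummies for treatments 1 to 3; treatment 4 is the baseline.
d1, d2, d3 = ([1.0 if t == level else 0.0 for t in treatment] for level in (1, 2, 3))

beta_c, sse_c = ols([d1, d2, d3, x], y)   # complete model: separate intercepts
beta_r, sse_r = ols([x], y)               # reduced model: common intercept

q = 3                                     # k - 1 with k = 4 treatments
df_error = len(y) - 4 - 1                 # N - k - 1
ss_h0 = sse_r - sse_c
f = (ss_h0 / q) / (sse_c / df_error)

# Intercept for each treatment: beta0 + delta_i for i = 1..3, beta0 for treatment 4.
intercepts = [beta_c[0] + d for d in beta_c[1:4]] + [beta_c[0]]

print("complete model coefficients:", beta_c)
print("treatment intercepts:       ", intercepts)
print("SSH0 =", ss_h0, " s^2 =", sse_c / df_error, " F =", f)
```

The coefficients and the F statistic for the comparison of intercepts should agree with the tables above up to rounding.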