Dummy variables

advertisement
Regression: Dummy variables (discrete independent variables)
Model 1: shift
Consider the regression model
yi   o  1 xi   0 Di   i
where Di has the value 1 when the ith observation has a given attribute, and 0 otherwise. The variables x
and y are assumed to be continuous measurements. The variable D is called a dummy variable for the
attribute. The interpretation of  can be obtained by specializing the regression model to the two levels of
D:
D  0: yi   o  1xi   i
D  1: yi   o   0   1xi   i
The coefficient 0 represents a shift in the intercept, and thus the entire regression line, due to the
presence of the given attribute. To estimate the parameters with least squares, in addition to the usual
regression assumptions, one assumes that the presence of the attribute does not alter the regression slope
(the effect of the independent variable), nor the regression variance. The hypothesis of no relation
between the attribute and the dependent variable, given the independent variable, can then be written
 0  0 and tested with the t-test of the statistic ˆ0 sˆ with degrees of freedom n  3 . This model can be
0
extended with the inclusion of additional independent variables.
Model 2: interaction
The assumption of common slope can be relaxed by including an interaction term in the model
yi   o  1 xi   0 Di  1 xi Di   i
which when specialized to the cases of D yields:
D  0: yi   o  1xi   i
D  1: yi   o   0   1  1 xi   i
Now, the coefficient 1 represents the change in slope in the presence of the attribute; unless this change
has the value zero, the presence of the attribute alters the effect of x on y (an interaction between attribute
and independent variable). The hypothesis that the attribute is irrelevant,  0  1  0 , is now tested with
SSER  SSEU  2 involving these two restrictions, the restricted
the usual F-test of the statistic F 
SSEU n  4
model (R) being yi   o  1 xi   i . In the event the hypothesis is rejected, one can proceed to test for a
common slope with the t-test for 1.
Model 3: polychotomous category
The model can be extended to a category with more than two levels, say d levels, by defining dummy
variables for all but one of the levels (the base level):
yi   o  1xi   01Di1   02Di 2     0d 1Di d 1   i
which specializes to the various cases as follows:
D1  D2    Dd 1  0 :
yi   o  1 xi   i
D1  1: yi   o   01   1 xi   i
D2  1: yi   o   02   1 xi   i

Dd 1  1: yi   o   0d 1  1 xi   i


Note that any of the dummies having the value one implies the others have the value zero. The dummy
coefficients have the interpretation of shifts in the linear regression relation compared to the base level.
The hypothesis that the category is irrelevant is tested with the F-test of the d  1 restrictions
1   2     d 1  0 . If rejected, differences between the base and any other level can be tested with
the t-test for the respective dummy coefficient. Other differences between levels can be tested by
changing the level considered as the base level, i.e., by inserting a dummy variable for the old base and
removing the dummy variable for the new base. One must be cautious about “over-testing” and thereby
impugning the overall probability of type I error. This model can also be extended by including
interaction terms between the category levels and the independent variable.
Model 4: multiple categories
Dummy variables can be used to model the inclusion of multiple categories. Consider the model:
yi   o  1 xi   0Ci   0 Di   i
with C  1 coding the presence of an attribute in the first (dichotomous) category, and D  1 coding the
presence of another attribute in the second (dichotomous) category. The four cases specialize as follows:
C  0, D  0: yi   o  1 xi   i
C  0, D  1: yi   o   0   1 xi   i
C  1, D  0: yi   o   0   1 xi   i
C  1, D  1: yi   o   0   0   1 xi   i
Thus, the dummy coefficients represent shifts due to the presence of the respective attributes, which are
additive when both attributes are present. The base group is the group lacking both attributes.
Hypothesis can be tested as above, and the model can be extended as above with additional levels,
categories, independent variables, and various forms of interaction.
Confidence intervals for the mean
Obtaining confidence intervals for the regression mean for particular values of the independent variables
and particular levels of the categories requires point estimates of the mean as well as the estimated
standard error of the mean. This can be obtained using standard statistical software by reparameterizing
the independent variables so that the particular values become zero, and arranging that the particular
levels of the categories constitute the base group. For example in model 4 above, with particular values
x  x * and C  D  0 , one can create the transformed variable x  x  x * , which has the property that
x  0 when x  x * . Consider then the transformed model yi   o  1xi   0Ci   0 Di   i . At the
particular values of interest, this model specializes to yi   o   i , in which the intercept  o represents
the desired mean. A confidence interval for the mean with confidence coefficient 1   is then
constructed in the usual way, namely ˆo  t1  n  4   s ˆ .
o