More on Indicator Variables

When a regression model includes a quantitative variable and a categorical variable with
more than two levels, one may ask: “If we are simply interested in the effect the
categorical variable has in predicting the response when the other variable(s) are in the
model, why should we create indicator variables? That is, why not simply regress Y on the
quantitative and categorical variables without creating an indicator variable for each
categorical level?” The explanation follows.
We use indicator variables to account for the effect that each level of the categorical
variable may have on the response. When the categorical variable is the sole predictor, we
showed that the intercept is the mean of the baseline level and the slope(s) are the
differences between the mean of each level used in the regression and the mean of the
baseline level. But why do this when we add a quantitative predictor to the model?
As an example, let Y be the total electricity consumption during summer months and X1
represent the size of the house in square feet. Included in the analysis are four types of
air conditioning systems: No A/C; window units; heat pump; and central A/C. These
four levels can be modeled by three indicator variables X2 (window units); X3 (heat
pump); and X4 (central A/C). The regression model is as follows:
Y = Bo + B1X1 + B2X2 + B3X3 +B4X4 + e (equation 1)
If the house has no A/C (i.e. X2 through X4 are 0) then this equation becomes:
Y = Bo + B1X1 + e
If the house has window units then equation 1 becomes:
Y = (Bo + B2 ) + B1X1 + e
If the house has heat pump then equation 1 becomes:
Y = (Bo + B3 ) + B1X1 + e
And if the house has central A/C then equation 1 becomes:
Y = (Bo + B4 ) + B1X1 + e
Thus, equation 1 assumes that the relationship between summer electricity consumption
and the size of the house is linear and that the slope does not depend on the type of A/C
system used. The parameters B2, B3, and B4 shift the height, or intercept, of the
regression line for the different types of A/C relative to the no-A/C baseline. Other
effects can be determined by directly comparing the appropriate regression coefficients.
For instance, B3 - B4 is the difference in mean consumption between a heat pump and
central A/C for a house of a given size, which reflects their relative efficiency. Note
also the assumption that the variance of energy consumption does not depend on the type
of A/C system used. The common-slope assumption may not be appropriate, in which case an
interaction term between the size of the house and each indicator variable should be
considered, but that is not relevant for this discussion.
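As a minimal sketch of equation 1, the Python below encodes the four A/C types as three
indicator variables and evaluates the mean response for each type. The coefficient values
are made up purely for illustration (they are not estimates from any data), and the names
AC_TYPES, indicators, BETA, and mean_consumption are hypothetical.

```python
# Indicator (0/1) coding for a four-level factor, with "none" as the baseline.
AC_TYPES = ["none", "window", "heat_pump", "central"]

def indicators(ac_type):
    """Return (X2, X3, X4): one 0/1 indicator per non-baseline level."""
    if ac_type not in AC_TYPES:
        raise ValueError(f"unknown A/C type: {ac_type}")
    return tuple(1 if ac_type == level else 0 for level in AC_TYPES[1:])

# Hypothetical coefficients (B0, B1, B2, B3, B4), chosen only for illustration.
BETA = (200.0, 0.5, 300.0, 450.0, 600.0)

def mean_consumption(size_sqft, ac_type, beta=BETA):
    """E(Y | X1, A/C type) under equation 1."""
    b0, b1, b2, b3, b4 = beta
    x2, x3, x4 = indicators(ac_type)
    return b0 + b1 * size_sqft + b2 * x2 + b3 * x3 + b4 * x4

for ac in AC_TYPES:
    # The intercept is the mean at X1 = 0; the slope is the change per extra square foot.
    intercept = mean_consumption(0, ac)
    slope = mean_consumption(1, ac) - intercept
    print(f"{ac:10s} intercept = {intercept:6.1f}  slope = {slope:.1f}")
```

Each A/C type gets its own intercept (Bo, Bo + B2, Bo + B3, Bo + B4) while the slope on
house size is the same for all four, which is the parallel-lines structure described above.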
Back to the original question, what if we simply used one variable to represent A/C
systems with allocated coding such as 1 for no A/C; 2 for window units; 3 for heat pump;
and 4 for central A/C? That is, we would have the following regression equation where
X2 represents A/C system and has values of 1, 2, 3, or 4:
Y = Bo + B1X1 + B2X2 + e (equation 2)
This model implies that:
No A/C: Y = Bo + B1X1 + B2 + e
Window Units: Y = Bo + B1X1 + 2B2 + e
Heat Pump: Y = Bo + B1X1 + 3B2 + e
Central A/C: Y = Bo + B1X1 + 4B2 + e
A direct consequence of this is that B2 is simply equal to:
E(Y|X1, central A/C) - E(Y|X1, heat pump)
= E(Y|X1, heat pump) - E(Y|X1, window units)
= E(Y|X1, window units) - E(Y|X1, no A/C)
This might be quite unrealistic. For example, say B2 is estimated to be 3.7. The above
then says that, for a given house size X1, the difference in mean summer electricity
consumption between central A/C and a heat pump is 3.7, between a heat pump and window
units is 3.7, and between window units and no A/C is 3.7. That is, the difference in mean
electricity consumption is the same for each step between system types, which seems quite
unrealistic!
Indicator variables are more informative for this type of problem because they do not
force any particular metric on the levels of the qualitative factor. Furthermore, regression
using indicator variables always leads to an R-squared value at least as large as that from
regression using allocated codes, since the allocated-code model is a special case of the
indicator model (for reference see: Searle, S.R. & Udell, J.G. [1970], “The use of
regression on dummy variables in market research,” Management Science B, 16, 397-409).
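The R-squared comparison can be checked numerically. The sketch below is an illustration
under stated assumptions, not a reproduction of any analysis in this handout: the
consumption readings are invented, and all houses are taken to be the same size so that
the A/C type is the only predictor. With allocated codes the fit is a straight line in the
code 1 to 4; with indicator variables the fitted value for each house is its group mean.

```python
# Made-up summer consumption readings for houses of the same size,
# grouped by A/C type (illustrative numbers only, not real data).
data = {
    "none": [10.0, 12.0, 11.0],
    "window": [14.0, 15.0, 16.0],
    "heat_pump": [30.0, 29.0, 31.0],
    "central": [33.0, 35.0, 34.0],
}
codes = {"none": 1, "window": 2, "heat_pump": 3, "central": 4}

y = [v for vals in data.values() for v in vals]
x = [codes[k] for k, vals in data.items() for _ in vals]
n = len(y)
y_bar = sum(y) / n
sst = sum((v - y_bar) ** 2 for v in y)  # total sum of squares

# Allocated codes: simple least-squares line of y on the code 1..4.
x_bar = sum(x) / n
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
sse_codes = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Indicator variables: with the factor as the only predictor, the fitted
# value for each observation is its own group mean.
sse_ind = sum(
    (v - sum(vals) / len(vals)) ** 2 for vals in data.values() for v in vals
)

r2_codes = 1 - sse_codes / sst
r2_ind = 1 - sse_ind / sst
print(f"R-sq, allocated codes:     {r2_codes:.3f}")
print(f"R-sq, indicator variables: {r2_ind:.3f}")
```

In this invented data the jump from window units to a heat pump is much larger than the
other two steps, so the equally spaced codes fit worse than the indicator variables. The
two R-squared values would coincide only if the group means happened to fall exactly on a
straight line in the codes.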
Example with Cargo Type
Variable: Cost

Cargo Type       Mean
Durable          3.26
Fragile          13.00
Semifragile      8.700
Cost(overall)    8.320
Regression Analysis: Cost versus Fragile, Semi
Coding: 0,1 with Durable as Baseline

Predictor    Coef     SE Coef    T       P
Constant     3.260    1.075      3.03    0.010
Fragile      9.740    1.521      6.41    0.000
Semi         5.440    1.521      3.58    0.004

S = 2.40437    R-Sq = 77.4%    R-Sq(adj) = 73.7%

Analysis of Variance
Source            DF    SS        MS        F        P
Regression        2     238.25    119.13    20.61    0.000
Residual Error    12    69.37     5.78
Total             14    307.62
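The 0,1-coded output can be checked against the table of means with simple arithmetic.
The sketch below uses only numbers printed in this example; the variable names are
hypothetical. It recovers the coefficients as the baseline mean and the differences from
it, and recomputes F, S, R-Sq, and R-Sq(adj) from the sums of squares and degrees of
freedom.

```python
# Group means and ANOVA quantities copied from the output in this example.
means = {"Durable": 3.26, "Fragile": 13.00, "Semifragile": 8.70}
ss_regression, ss_error, ss_total = 238.25, 69.37, 307.62
df_regression, df_error, df_total = 2, 12, 14

# 0/1 coding with Durable as baseline: the constant is the baseline mean and
# each slope is that level's mean minus the baseline mean.
constant = means["Durable"]
coef_fragile = means["Fragile"] - means["Durable"]
coef_semi = means["Semifragile"] - means["Durable"]
print(f"Constant = {constant:.3f}, Fragile = {coef_fragile:.3f}, Semi = {coef_semi:.3f}")

# Summary statistics recomputed from the ANOVA table.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_stat = ms_regression / ms_error
s = ms_error ** 0.5                               # residual standard deviation
r_sq = ss_regression / ss_total
r_sq_adj = 1 - ms_error / (ss_total / df_total)
print(f"F = {f_stat:.2f}, S = {s:.3f}")
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
```

The recomputed values agree with the printed output up to rounding of the sums of squares.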
Regression Analysis: Cost versus Frag-effect, Semi-effect
Coding: -1,1 with Durable as Baseline

Predictor      Coef      SE Coef    T        P
Constant       8.3200    0.6208     13.40    0.000
Frag-effect    4.6800    0.8780     5.33     0.000
Semi-effect    0.3800    0.8780     0.43     0.673

S = 2.40437    R-Sq = 77.4%    R-Sq(adj) = 73.7%

Analysis of Variance
Source            DF    SS        MS        F        P
Regression        2     238.25    119.13    20.61    0.000
Residual Error    12    69.37     5.78
Total             14    307.62
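The -1,1-coded coefficients can also be recovered from the table of means. In the sketch
below (variable names are hypothetical; the numbers come from this example), the constant
is the unweighted average of the three cargo-type means, which matches the overall mean of
8.320 reported above, and each coefficient is a level's mean minus that average.

```python
# Group means copied from the table of means in this example.
means = {"Durable": 3.26, "Fragile": 13.00, "Semifragile": 8.70}

# Under effect coding the constant is the unweighted average of the group means.
grand_mean = sum(means.values()) / len(means)

# Each coefficient is that level's mean minus the grand mean.
frag_effect = means["Fragile"] - grand_mean
semi_effect = means["Semifragile"] - grand_mean

# Durable is coded -1 on both variables, so its effect is minus the sum of
# the other two effects; it gets no coefficient of its own in the output.
durable_effect = -(frag_effect + semi_effect)

print(f"Constant       = {grand_mean:.4f}")
print(f"Frag-effect    = {frag_effect:.4f}")
print(f"Semi-effect    = {semi_effect:.4f}")
print(f"Durable effect = {durable_effect:.4f}")
```

Both codings describe the same three fitted means, which is why S, R-Sq, and the analysis
of variance table are identical in the two outputs. Only the interpretation of the
coefficients changes: differences from the Durable mean under 0,1 coding, and differences
from the overall mean under -1,1 coding.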