More on Indicator Variables

When a regression model includes a quantitative variable and a categorical variable with more than two levels, one may ask: "If we are simply interested in the effect the categorical variable has in predicting the response when the other variable(s) are in the model, why should we create indicator variables? That is, why not simply regress Y on the quantitative and categorical variables without creating an indicator variable for each categorical level?" The explanation follows.

We create indicator variables to account for the effect that each level of the categorical variable may have on the response. With the categorical variable as the sole predictor, we showed that the intercept is the mean of the baseline level and each slope is the difference between the mean of the level used in the regression and the mean of the baseline level. But why do this when we add a quantitative predictor to the model?

As an example, let Y be the total electricity consumption during summer months and X1 represent the size of the house in square feet. Included in the analysis are four types of air conditioning systems: no A/C; window units; heat pump; and central A/C. These four levels can be modeled by three indicator variables: X2 (window units); X3 (heat pump); and X4 (central A/C). The regression model is as follows:

Y = B0 + B1X1 + B2X2 + B3X3 + B4X4 + e (equation 1)

If the house has no A/C (i.e. X2 through X4 are 0), then this equation becomes:

Y = B0 + B1X1 + e

If the house has window units, then equation 1 becomes:

Y = (B0 + B2) + B1X1 + e

If the house has a heat pump, then equation 1 becomes:

Y = (B0 + B3) + B1X1 + e

And if the house has central A/C, then equation 1 becomes:

Y = (B0 + B4) + B1X1 + e

Thus, equation 1 assumes that the relationship between summer electricity consumption and the size of the house is linear and that the slope does not depend on the type of A/C system used.
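The four parallel lines implied by equation 1 can be sketched in Python with NumPy. This is only an illustration: the coefficient values, house sizes, and level names below are made up, and the data are generated without noise so that least squares recovers the coefficients exactly.

```python
import numpy as np

# Hypothetical level names; "none" (no A/C) is the baseline, so all indicators are 0.
LEVELS = ["none", "window", "heat_pump", "central"]

def design_matrix(size_sqft, ac_type):
    """Columns: intercept, X1 (size), X2 (window), X3 (heat pump), X4 (central)."""
    size_sqft = np.asarray(size_sqft, dtype=float)
    ac_type = np.asarray(ac_type)
    indicators = [(ac_type == level).astype(float) for level in LEVELS[1:]]
    return np.column_stack([np.ones_like(size_sqft), size_sqft] + indicators)

# Made-up values of (B0, B1, B2, B3, B4), chosen only for illustration.
true_beta = np.array([200.0, 0.5, 150.0, 300.0, 400.0])

rng = np.random.default_rng(0)
size = rng.uniform(800, 3000, size=40)   # hypothetical house sizes
ac = np.array(LEVELS * 10)               # 10 houses of each A/C type
X = design_matrix(size, ac)
y = X @ true_beta                        # noise-free response

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Intercept of each line: B0 for the baseline, B0 + Bj for the other levels.
intercepts = {"none": beta_hat[0]}
for j, level in enumerate(LEVELS[1:], start=2):
    intercepts[level] = beta_hat[0] + beta_hat[j]
slope = beta_hat[1]                      # one slope shared by all four lines

for level, b in intercepts.items():
    print(f"{level:>10}: Y = {b:.1f} + {slope:.2f} * X1")
```

Each A/C type gets its own intercept while the coefficient on X1 is common to all four lines, which is exactly the equal-slopes assumption stated above.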
The parameters B2, B3, and B4 modify the height, or intercept, of the regression model for the different types of A/C. Other effects can be determined by directly comparing the appropriate regression coefficients. For instance, B3 - B4 reflects the relative efficiency of a heat pump compared to central A/C. Note also the assumption that the variance of energy consumption does not depend on the type of A/C system used. Neither this assumption nor the common-slope assumption need hold; if the slope differs by system, an interaction term between the size of the house and each indicator variable should be considered, but that is not relevant for this discussion.

Back to the original question: what if we simply used one variable to represent A/C systems, with allocated codes such as 1 for no A/C; 2 for window units; 3 for heat pump; and 4 for central A/C? That is, we would have the following regression equation, where X2 represents the A/C system and takes the values 1, 2, 3, or 4:

Y = B0 + B1X1 + B2X2 + e (equation 2)

This model implies that:

No A/C: Y = B0 + B1X1 + B2 + e
Window Units: Y = B0 + B1X1 + 2B2 + e
Heat Pump: Y = B0 + B1X1 + 3B2 + e
Central A/C: Y = B0 + B1X1 + 4B2 + e

A direct consequence of this is that B2 is simply equal to:

E(Y|X1, central A/C) - E(Y|X1, heat pump)
= E(Y|X1, heat pump) - E(Y|X1, window units)
= E(Y|X1, window units) - E(Y|X1, no A/C)

which might be quite unrealistic. For example, say B2 is estimated to be 3.7. Then for a given house size, X1, the difference in mean summer electricity consumption between central A/C and a heat pump is 3.7; between a heat pump and window units is 3.7; and between window units and no A/C is 3.7. That is, the difference in mean electricity consumption is forced to be the same for every pair of adjacent system types, and the size of that difference depends on the arbitrary order and spacing of the codes.

Indicator variables are more informative for this type of problem because they do not force any particular metric on the levels of the qualitative factor.
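The contrast between the two codings can be sketched numerically. The data below are simulated under made-up assumptions (unequally spaced shifts between systems, normal noise with standard deviation 20); they are not from any real study. Because the column of allocated codes is a linear combination of the intercept and the three indicator columns, the indicator model cannot have a larger residual sum of squares than the allocated-code model.

```python
import numpy as np

def fit(X, y):
    """Return least-squares coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

# Allocated codes: 1 = no A/C, 2 = window units, 3 = heat pump, 4 = central A/C.
rng = np.random.default_rng(1)
code = np.tile([1, 2, 3, 4], 10)
size = rng.uniform(800, 3000, size=code.size)   # hypothetical house sizes

# Made-up, unequally spaced shifts in mean consumption for the four systems.
shift = np.array([0.0, 150.0, 300.0, 700.0])[code - 1]
y = 200.0 + 0.5 * size + shift + rng.normal(0, 20, size=code.size)

ones = np.ones(code.size)
# Equation 2: one column holding the allocated codes.
X_alloc = np.column_stack([ones, size, code])
# Equation 1: three indicator columns, with no A/C as the baseline.
X_ind = np.column_stack([ones, size] + [(code == k).astype(float) for k in (2, 3, 4)])

beta_alloc, sse_alloc = fit(X_alloc, y)
beta_ind, sse_ind = fit(X_ind, y)

# Allocated codes: every adjacent pair of systems differs by the same B2.
steps_alloc = [beta_alloc[2]] * 3
# Indicators: the three adjacent differences are estimated separately.
steps_ind = [beta_ind[2], beta_ind[3] - beta_ind[2], beta_ind[4] - beta_ind[3]]

print("adjacent steps, allocated codes:", np.round(steps_alloc, 1))
print("adjacent steps, indicators:     ", np.round(steps_ind, 1))
print("residual SS (allocated, indicators):", round(sse_alloc, 1), round(sse_ind, 1))
```

With these simulated data, the allocated-code fit reports a single compromise step for all three comparisons, while the indicator fit is free to pick up the larger jump to central A/C.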
Furthermore, regression using indicator variables never leads to a smaller R-squared value than regression using allocated codes, and it usually leads to a larger one, because the allocated-code model is a special case of the indicator-variable model (for reference see: Searle, S.R. & Udell, J.G. [1970], "The use of regression on dummy variables in market research," Management Science B, 16, 397-409).

Example with Cargo Type

Variable         Cargo Type     Mean
Cost             Durable        3.26
                 Fragile        13.00
                 Semifragile    8.700
Cost (overall)                  8.320

Regression Analysis: Cost versus Fragile, Semi
Coding: 0,1 with Durable as Baseline

Predictor    Coef     SE Coef   T      P
Constant     3.260    1.075     3.03   0.010
Fragile      9.740    1.521     6.41   0.000
Semi         5.440    1.521     3.58   0.004

S = 2.40437   R-Sq = 77.4%   R-Sq(adj) = 73.7%

Analysis of Variance
Source            DF   SS       MS       F       P
Regression         2   238.25   119.13   20.61   0.000
Residual Error    12    69.37     5.78
Total             14   307.62

Regression Analysis: Cost versus Frag-effect, Semi-effect
Coding: -1,1 with Durable as Baseline

Predictor      Coef     SE Coef   T       P
Constant       8.3200   0.6208    13.40   0.000
Frag-effect    4.6800   0.8780     5.33   0.000
Semi-effect    0.3800   0.8780     0.43   0.673

S = 2.40437   R-Sq = 77.4%   R-Sq(adj) = 73.7%

Analysis of Variance
Source            DF   SS       MS       F       P
Regression         2   238.25   119.13   20.61   0.000
Residual Error    12    69.37     5.78
Total             14   307.62
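The coefficients in both fits can be checked by hand from the group means and the ANOVA table. The short Python sketch below uses only numbers copied from the output above; the variable names are ours. It assumes equal group sizes, which is consistent with the overall mean of 8.320 being the simple average of the three group means.

```python
# Group means and sums of squares copied from the output above.
means = {"Durable": 3.26, "Fragile": 13.00, "Semifragile": 8.70}
ss_regression, ss_error, ss_total = 238.25, 69.37, 307.62
df_error, df_total = 12, 14

# 0/1 coding, Durable as baseline:
# constant = baseline mean, each slope = level mean minus baseline mean.
constant_01 = means["Durable"]
fragile_01 = means["Fragile"] - means["Durable"]
semi_01 = means["Semifragile"] - means["Durable"]

# Effect coding: constant = average of the group means,
# each slope = level mean minus that average.
grand = sum(means.values()) / len(means)
frag_effect = means["Fragile"] - grand
semi_effect = means["Semifragile"] - grand
# The Durable effect is not printed; it is minus the sum of the other two.
durable_effect = -(frag_effect + semi_effect)

# Fit statistics are the same under either coding.
r_squared = ss_regression / ss_total
r_squared_adj = 1 - (ss_error / df_error) / (ss_total / df_total)

print(f"0/1 coding:    {constant_01:.2f}, {fragile_01:.2f}, {semi_01:.2f}")
print(f"effect coding: {grand:.2f}, {frag_effect:.2f}, {semi_effect:.2f}")
print(f"R-Sq = {100 * r_squared:.1f}%, R-Sq(adj) = {100 * r_squared_adj:.1f}%")
```

Both codings describe the same three group means, so S, R-Sq, and the ANOVA table are identical; only the interpretation of the constant and slopes changes. The t-test for Semi-effect (P = 0.673) asks whether the Semifragile mean differs from the overall mean, not from the Durable mean.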