Regression: Dummy variables (discrete independent variables) Model 1: shift Consider the regression model yi o 1 xi 0 Di i where Di has the value 1 when the ith observation has a given attribute, and 0 otherwise. The variables x and y are assumed to be continuous measurements. The variable D is called a dummy variable for the attribute. The interpretation of can be obtained by specializing the regression model to the two levels of D: D 0: yi o 1xi i D 1: yi o 0 1xi i The coefficient 0 represents a shift in the intercept, and thus the entire regression line, due to the presence of the given attribute. To estimate the parameters with least squares, in addition to the usual regression assumptions, one assumes that the presence of the attribute does not alter the regression slope (the effect of the independent variable), nor the regression variance. The hypothesis of no relation between the attribute and the dependent variable, given the independent variable, can then be written 0 0 and tested with the t-test of the statistic ˆ0 sˆ with degrees of freedom n 3 . This model can be 0 extended with the inclusion of additional independent variables. Model 2: interaction The assumption of common slope can be relaxed by including an interaction term in the model yi o 1 xi 0 Di 1 xi Di i which when specialized to the cases of D yields: D 0: yi o 1xi i D 1: yi o 0 1 1 xi i Now, the coefficient 1 represents the change in slope in the presence of the attribute; unless this change has the value zero, the presence of the attribute alters the effect of x on y (an interaction between attribute and independent variable). The hypothesis that the attribute is irrelevant, 0 1 0 , is now tested with SSER SSEU 2 involving these two restrictions, the restricted the usual F-test of the statistic F SSEU n 4 model (R) being yi o 1 xi i . In the event the hypothesis is rejected, one can proceed to test for a common slope with the t-test for 1. Model 3: polychotomous category The model can be extended to a category with more than two levels, say d levels, by defining dummy variables for all but one of the levels (the base level): yi o 1xi 01Di1 02Di 2 0d 1Di d 1 i which specializes to the various cases as follows: D1 D2 Dd 1 0 : yi o 1 xi i D1 1: yi o 01 1 xi i D2 1: yi o 02 1 xi i Dd 1 1: yi o 0d 1 1 xi i Note that any of the dummies having the value one implies the others have the value zero. The dummy coefficients have the interpretation of shifts in the linear regression relation compared to the base level. The hypothesis that the category is irrelevant is tested with the F-test of the d 1 restrictions 1 2 d 1 0 . If rejected, differences between the base and any other level can be tested with the t-test for the respective dummy coefficient. Other differences between levels can be tested by changing the level considered as the base level, i.e., by inserting a dummy variable for the old base and removing the dummy variable for the new base. One must be cautious about “over-testing” and thereby impugning the overall probability of type I error. This model can also be extended by including interaction terms between the category levels and the independent variable. Model 4: multiple categories Dummy variables can be used to model the inclusion of multiple categories. Consider the model: yi o 1 xi 0Ci 0 Di i with C 1 coding the presence of an attribute in the first (dichotomous) category, and D 1 coding the presence of another attribute in the second (dichotomous) category. The four cases specialize as follows: C 0, D 0: yi o 1 xi i C 0, D 1: yi o 0 1 xi i C 1, D 0: yi o 0 1 xi i C 1, D 1: yi o 0 0 1 xi i Thus, the dummy coefficients represent shifts due to the presence of the respective attributes, which are additive when both attributes are present. The base group is the group lacking both attributes. Hypothesis can be tested as above, and the model can be extended as above with additional levels, categories, independent variables, and various forms of interaction. Confidence intervals for the mean Obtaining confidence intervals for the regression mean for particular values of the independent variables and particular levels of the categories requires point estimates of the mean as well as the estimated standard error of the mean. This can be obtained using standard statistical software by reparameterizing the independent variables so that the particular values become zero, and arranging that the particular levels of the categories constitute the base group. For example in model 4 above, with particular values x x * and C D 0 , one can create the transformed variable x x x * , which has the property that x 0 when x x * . Consider then the transformed model yi o 1xi 0Ci 0 Di i . At the particular values of interest, this model specializes to yi o i , in which the intercept o represents the desired mean. A confidence interval for the mean with confidence coefficient 1 is then constructed in the usual way, namely ˆo t1 n 4 s ˆ . o