Linear Regression with Dummy Variables

Prepared by Steven Prus, Carleton University

To accompany Statistics: A Tool for Social Research, Second Canadian Edition, by Joseph F. Healey and Steven G. Prus

Copyright © 2013 by Nelson Education Ltd. Nelson is a registered trademark used herein under licence. All rights reserved.

Introduction

We saw in Chapters 13 and 14 that linear regression analysis requires that variables be measured at the interval-ratio level. One method commonly used by researchers to overcome this restriction when the independent variable is measured at the nominal or ordinal level is a process called "dummy" variable coding. (To learn how we can deal with a nominal or ordinal level dependent variable, see the other online chapter, "Regression with a Dichotomous Dependent Variable: An Introduction to Logistic Regression.") This technique involves converting nominal or ordinal variables into dichotomous variables, that is, variables with just two values: 1s and 0s. We call these new dichotomous variables "dummy" variables; they are also known as indicator variables. We then use dummy variables in a linear regression model to examine differences between categories or groups, such as males and females, or young, middle-aged, and older persons.

A dummy variable is appropriate for linear regression for several reasons. First, it can be treated like an interval-ratio variable. We can think of a dummy variable as a numeric variable (with values 0 and 1) that has a "mean," equal to the proportion of cases coded 1 in the distribution, and a "standard deviation," equal to the square root of p*q, where p is the proportion of cases coded 1 and q is (1 - p). (A quick numeric check of these properties appears at the end of this introduction.) Second, a dummy variable can take only one shape: it can have only a linear relationship with the dependent variable, assuming that there is indeed an actual relationship. Third, linear regression results with a dummy independent variable are interpreted in exactly the same way as regression results with an interval-ratio one.

In this chapter we will provide a basic understanding of linear regression with dummy variables. We will consider three basic situations of dummy variables in linear regression models: 1) a single dummy variable; 2) a set of dummy variables; and 3) a single dummy variable together with an interval-ratio variable, with and without an interaction between them. What you will learn by the end of this chapter is that linear regression with dummy variables is essentially the same as linear regression with only interval-ratio variables: it is conducted and interpreted in the same way. Thus, you may wish to review Chapters 13 and 14 before reading on.
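As a quick numeric check of the "mean" and "standard deviation" claim above, the short Python sketch below computes both for a small, invented vector of 0-1 codes. The values and variable name are purely illustrative; the chapter's own analyses are run in SPSS.

    import numpy as np

    # A hypothetical dummy variable: 0-1 codes for ten invented cases
    sex = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

    p = sex.mean()            # proportion of cases coded 1 -> 0.4
    q = 1 - p
    print(p)                  # the dummy's "mean"
    print(np.sqrt(p * q))     # its "standard deviation", sqrt(p*q) = 0.49
    print(sex.std())          # numpy's std() (dividing by N) gives the same value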
Dummy Variables in Linear Regression

A "Natural" Dummy Variable

The simplest case of regression analysis with dummy variables is regression with a single dichotomous independent variable. Consider the following example. Suppose that we are interested in examining the effect of sex on income. In this situation we have an interval-ratio dependent variable, income, and a nominal independent variable, sex. A variable like sex, with just two categories, is considered a "natural" dummy variable since it does not need to be converted and thus can be entered directly into a linear regression model.

As an illustration, let's make use of data from the 2006 Canadian Census supplied with your textbook and compute a linear regression of income on sex using SPSS. To make the findings more comparable, we selected only respondents who worked mainly full-time weeks. Table 1 shows the regression coefficients from the SPSS output, where sex is coded as 1 for males and 0 for females (note that we call the category coded as 1 the target category, or category of interest, and the category coded as 0 the reference or comparison category).[1] For convenience of discussion, the regression coefficients have been rounded to the nearest 1,000 (e.g., the actual coefficient for sex is 17,766, which we rounded to 18,000).

Footnote 1: See Chapter 13 of your textbook, namely SPSS Demonstration 13.2, for details on how to use the SPSS linear regression command. The same routine is used whether the independent variable is an interval-ratio or a dummy variable. However, because the variable sex in the Census data file is coded with values other than 0 and 1 (sex is coded with the values 1 and 2), we first recoded it using the 0-1 scheme, then entered the recoded sex variable into the regression analysis. We should point out that other coding schemes (e.g., 1 and 2) produce fundamentally the same results as the 0-1 scheme.

Table 1 SPSS Linear Regression Output of Income, Y, on Sex, X

Model         B        Std. Error   Beta    t        Sig.
(Constant)    38000    3100                 12.258   .000
sex           18000    4200         .148    4.048    .000
Dependent variable: Total income of individual. B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Using the results in Table 1 we can write the model for our data as:

FORMULA 1
Y = a + bX
Y = 38,000 + (18,000)(X)

where
Y = score on the dependent variable
a = mean of the category coded X = 0
b = mean of the category coded X = 1 minus the mean of the category coded X = 0
X = score on the independent variable

Note that SPSS puts the a and b linear regression coefficients in the column labeled "B."

Recall from Chapter 13 of your textbook that we used the regression line, Y = a + bX, as a way of describing the linear relationship between an independent and a dependent interval-ratio variable. With a dummy independent variable, the model is interpreted in the same way. The Y intercept, a, is equal to the point where the regression line crosses the Y axis, or the expected value (mean) of Y when X is zero. The slope, b, is the average amount of change produced in Y for each unit change in X. When the value of b is positive, it tells us how much Y is expected to increase as X increases by one unit; when the value of b is negative, it tells us how much Y is expected to decrease as X increases by one unit.[2]

In the present example, the intercept, a, represents the mean income of females, $38,000, since females are coded as 0. The slope, b, tells us how much the dependent variable changes on average for each unit change in the independent variable.[3] However, a dummy independent variable can only change from 0 to 1. Thus, as we move from the category coded as 0 (female) to the category coded as 1 (male), income increases by $18,000. In other words, the value of b equals the mean of Y for the category X = 1 minus the mean of Y for the category X = 0, or what we call simply the "mean difference": the difference in mean income between males and females is $18,000.[4]

Footnote 2: It may interest you to know that linear regression analysis with dummy variables produces the same results as the t-test and Analysis of Variance (ANOVA) discussed in Chapters 8 and 9 of your textbook: linear regression with a single dummy variable is a t-test for two categories, and linear regression with a set of dummy variables (you will learn what a set of dummy variables is later in this chapter) is an ANOVA with three or more categories. These techniques differ mainly in the way the results are expressed. Linear regression expresses results in the form of the regression line Y = a + bX, while differences in means between one category and another require additional calculation (post-hoc tests) in ANOVA.

Footnote 3: The b coefficient of a dummy variable more appropriately measures the "distance" from the dummy category to the reference category, as opposed to a "slope." Nonetheless, we do not wish to cause undue confusion and will continue to call the b coefficient of a dummy variable a "slope."

Footnote 4: When the value of b is greater than 0, as in our example, the mean response of the category coded as 1 is higher than the mean response of the category coded as 0. Conversely, when the value of b is less than 0, the mean response of the category coded as 1 is lower than the mean response of the category coded as 0.
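The arithmetic behind Table 1 can also be sketched outside SPSS. The short Python sketch below is only an illustration, not the textbook's SPSS routine: it uses six invented income values chosen so that the group means reproduce the rounded figures in Table 1, recodes a hypothetical 1/2 sex variable to the 0-1 scheme, and fits the regression, recovering a as the female mean and b as the mean difference.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented data: sex coded 1 and 2 in the raw file (here, 1 = male, 2 = female
    # purely for illustration), with incomes chosen to give means of 56,000 and 38,000
    df = pd.DataFrame({
        "income":  [52000, 61000, 55000, 34000, 41000, 39000],
        "sex_raw": [1, 1, 1, 2, 2, 2],
    })

    # Recode to the 0-1 scheme: 1 = male (target), 0 = female (reference)
    df["sex"] = (df["sex_raw"] == 1).astype(int)

    fit = smf.ols("income ~ sex", data=df).fit()
    print(fit.params)                          # Intercept = 38000 (= a), sex = 18000 (= b)
    print(df.groupby("sex")["income"].mean())  # female mean 38000 (= a), male mean 56000 (= a + b)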
What these results imply is that if a is the mean income of females and b is the difference in mean income between males and females, then a + b (38,000 + 18,000 = $56,000) must equal the mean income of males.

The results can also be illustrated algebraically by constructing category-specific regression models, that is, separate equations for each category of the dummy variable. We do this by setting the dummy variable sex to the appropriate value. The mean income for females (X = 0) is given by the equation:

Y = a + bX
Y = a + b(0)
Y = a

thus,

Y = 38,000 + (18,000)X
Y = 38,000 + (18,000)(0)
Y = 38,000

The mean income for males (X = 1) is given by the equation:

Y = a + bX
Y = a + b(1)
Y = a + b

thus,

Y = 38,000 + (18,000)X
Y = 38,000 + (18,000)(1)
Y = 38,000 + 18,000 = 56,000

The mean income of females is $38,000, which is exactly equal to the intercept, a, and the difference in mean income between males and females ($56,000 - $38,000) is $18,000, which is exactly equal to the slope, b.

As a final comment, while we arbitrarily coded sex as 1 = male and 0 = female in the example above, we could alternatively have coded 1 = female and 0 = male, in which case the computed regression model would be:

Y = a + bX
Y = 56,000 + (-18,000)(X)

As you can see, the way the categories of a dummy variable are coded affects the regression coefficients in two ways. First, the sign of the slope b is reversed. That is, the direction of the relationship between the dummy variable and the dependent variable will differ depending on the way the dummy variable is coded. While the magnitude of the slope b is the same ($18,000), its sign is now negative: as we move from the category coded as 0 (male) to the category coded as 1 (female), income decreases by $18,000. Second, since males are now coded as 0, a is equal to the mean income of males, or $56,000 (remember, the intercept, a, represents the value of Y when X is zero).
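A minimal sketch of this reversed coding, reusing the six invented income values from the sketch above: the intercept moves to the male mean, the slope changes sign, and the rest of the output is unchanged.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Same invented six cases, now coded 1 = female (target), 0 = male (reference)
    df = pd.DataFrame({
        "income": [52000, 61000, 55000, 34000, 41000, 39000],
        "female": [0, 0, 0, 1, 1, 1],
    })

    fit = smf.ols("income ~ female", data=df).fit()
    print(fit.params)    # Intercept = 56000 (male mean), female = -18000
    print(fit.rsquared)  # identical to r-squared under the 1 = male, 0 = female coding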
Overall, it is not of great importance which category is coded as 1 and which is coded as 0, mainly because the magnitude of the slope b is not affected. Other results of the regression analysis, such as r2 and the significance test for the slope b (not shown here), remain the same as well.

Dummy vs. Effect Coding

The interpretation of the regression results is straightforward using dummy-variable (1 and 0) coding: one of the categories (0) is the reference category to which the other category (1) is compared. Nonetheless, other coding schemes are possible. Here we will look at one alternative to dummy coding called effect or deviation coding (another commonly used alternative is contrast or orthogonal coding, though we will not consider it). Effect coding uses a 1 and -1 scheme. Using our example problem, we can code sex as 1 for males and -1 for females, where the category coded 1 (males) is the target category. In such a situation, we interpret the regression model with a single dummy variable as follows:

Y = a + bX

where
Y = score on the dependent variable
a = grand mean
b = mean of the category coded X = 1 minus the grand mean
X = score on the independent variable

With effect coding the reference "category" is the entire sample, or all cases, as opposed to dummy coding, where the reference category is the category coded as 0. Thus, the intercept a is equal to the grand mean of Y (i.e., the mean of the means of the categories of X), and the slope b is the mean difference in Y between the target category (the category coded as 1) and the grand mean.

Overall, the coding scheme (e.g., dummy vs. effect coding) affects the magnitude and meaning of the coefficients. The decision to use one coding method over another must therefore be given serious consideration; however, dummy variable coding is the most commonly used method.

Polytomous Variables

When we have a "natural" dummy variable with just two categories, the interpretation of linear regression results is straightforward, as we have just seen. For a nominal or ordinal independent variable with more than two categories, often referred to as a polytomous variable, the regression analysis is a little more complicated because we need to create and interpret a "set" of dummy variables. The exact number of variables in the set is equal to the number of categories in the original variable minus one:

k - 1 dummy variables

where k = the number of categories of the independent variable

By creating k - 1 dummy variables, all information on the original independent variable is retained.[5] As before, each dummy variable in the set has only two values: 0 and 1. Since each dummy variable in the set is considered an independent variable, we use the multiple regression model (see Chapter 14 of your textbook) to specify the relationship:

FORMULA 2
Y = a + b1X1 + b2X2 + b3X3 + ... + bk-1Xk-1

where
Y = score on the dependent variable
a = mean of the reference category
b1 = mean of the category identified by X1 minus the mean of the reference category
b2 = mean of the category identified by X2 minus the mean of the reference category
b3 = mean of the category identified by X3 minus the mean of the reference category
bk-1 = mean of the category identified by Xk-1 minus the mean of the reference category
X1, X2, X3, ..., Xk-1 = scores on the k - 1 dummy variables

Footnote 5: If we attempted to create a dummy variable for each category of the original variable, an irreparable problem called perfect multicollinearity would arise and the regression coefficients could not be calculated. For example, if we had a variable with four categories and created a fourth dummy variable, it would be an exact linear function of the other three dummy variables, resulting in perfect multicollinearity. To avoid this problem, k - 1 dummy variables need to be created.
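Footnote 5's point can be checked numerically. The Python sketch below, using an invented four-category variable, shows that a full set of k dummies plus the intercept is perfectly collinear (the dummy columns always sum to the intercept column), while k - 1 dummies are not. The variable and category names are hypothetical.

    import numpy as np
    import pandas as pd

    # A hypothetical four-category nominal variable
    region = pd.Series(["East", "West", "North", "South", "East", "West"])

    # All k = 4 dummies: with the intercept the design matrix is singular,
    # because the four dummy columns sum to 1 in every row (the intercept column)
    all_k = pd.get_dummies(region).astype(int)
    X_bad = np.column_stack([np.ones(len(region)), all_k])
    print(np.linalg.matrix_rank(X_bad), X_bad.shape[1])   # rank 4 < 5 columns

    # k - 1 = 3 dummies (one category dropped as the reference) avoid the problem
    k_minus_1 = pd.get_dummies(region, drop_first=True).astype(int)
    X_ok = np.column_stack([np.ones(len(region)), k_minus_1])
    print(np.linalg.matrix_rank(X_ok), X_ok.shape[1])     # rank 4 = 4 columns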
We interpret this model (Formula 2) the same way we interpret any multiple regression equation. X1 identifies the first dummy variable, X2 the second, and so on. The intercept a is the expected value (mean) of Y when all of the dummy variables are zero. The b coefficients are subscripted to identify the dummy variable associated with each, and they represent the change in Y for the "included" categories relative to the "excluded" category. More loosely speaking, the b coefficient for a dummy variable is the difference in means between the two categories of the dummy variable, 1 and 0, where 1 is the "included" and 0 the "excluded" category. Since all other categories are compared to it, the excluded category is the reference category.

As an example of creating and interpreting a set of dummy variables from a single polytomous variable, let's examine the effect of language on income. In the Canadian Census, the variable "official language spoken" has four categories: English; French; English and French (bilingual); and neither English nor French. To use this variable in a linear regression analysis, we must first convert it into a set of three (k - 1, or 4 - 1 = 3) dummy variables, which we will call French, X1, Eng&Fr, X2, and Neither, X3. Our coding is described below and illustrated in Table 2:

French, X1, is coded as 1 = French vs. 0 = otherwise
Eng&Fr, X2, is coded as 1 = English and French vs. 0 = otherwise
Neither, X3, is coded as 1 = neither English nor French vs. 0 = otherwise

Table 2 Dummy Coding for the Variable Official Language Spoken

Category of Original Variable     X1 (French)   X2 (Eng&Fr)   X3 (Neither)
French                            1             0             0
English and French                0             1             0
Neither English nor French        0             0             1
English                           0             0             0

According to this coding:

1. Persons who speak French only are uniquely identified by 1 on the dummy variable French and 0 on the other two dummy variables;
2. Persons who speak English and French are uniquely identified by 1 on the dummy variable Eng&Fr and 0 on the other two dummy variables;
3. Persons who speak neither English nor French are uniquely identified by 1 on the dummy variable Neither and 0 on the other two dummy variables;
4. Persons who speak English only are uniquely identified by 0 on all three dummy variables.

Hence, we need only three dummy variables to capture all information about language. We do not need to create a fourth dummy variable for "English" because this category is already represented by the other three dummy variables: when each of the dummy variables is 0, it must be the case that the person speaks English only. This category is the excluded or reference category, and the b coefficient for each dummy variable is compared against it.

The selection of the reference category is entirely arbitrary, though it is often the category with the most cases, since this will likely produce more stable comparisons in the regression analysis. In our case, English is the largest category of the variable "official language spoken," so we selected it as the reference.
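As an illustration of how such a recode might look outside SPSS (the chapter itself uses the SPSS recode command), the Python sketch below builds the three 0-1 columns of Table 2 from a hypothetical language variable; the data values are invented.

    import pandas as pd

    # Hypothetical responses to "official language spoken"
    df = pd.DataFrame({"language": ["English", "French", "English and French",
                                    "Neither", "English", "French"]})

    # Three dummies, with English (all zeros) as the reference category
    df["French"]  = (df["language"] == "French").astype(int)
    df["Eng&Fr"]  = (df["language"] == "English and French").astype(int)
    df["Neither"] = (df["language"] == "Neither").astype(int)

    print(df)   # each row shows the 0-1 pattern from Table 2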
Now that our independent variable, language, has been converted into a set of dummy variables, we can proceed to the regression analysis.[6] Using data from the 2006 Canadian Census, our linear regression of income on language is shown in Table 3 (again, we selected only respondents who worked mainly full-time weeks and rounded the regression coefficients to the nearest 1,000).

Footnote 6: The linear regression function in SPSS unfortunately does not automatically recode nominal or ordinal variables into dummy variables; you must do this yourself using the recode command. Once the dummy variables are created, they are entered together as a group in the regression model. See Chapter 14 of your textbook and SPSS Demonstration 14.1 for instruction on how to use the SPSS linear regression command.

Table 3 SPSS Linear Regression Output of Income, Y, on French, X1, Eng&Fr, X2, and Neither, X3

Model          B        Std. Error   Beta    t        Sig.
(Constant)     50000    2500                 20.000   .000
French, X1     -7000    4800         -.052   -1.458   .144
Eng&Fr, X2     3000     19000        .006    .158     .870
Neither, X3    -26000   22700        -.041   -1.145   .251
Dependent variable: Total income of individual. B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Using the results in Table 3 we can write the model for our data as:

Y = a + b1X1 + b2X2 + b3X3
Y = 50,000 + (-7,000)X1 + (3,000)X2 + (-26,000)X3

We interpret the b coefficient for each dummy variable much as we did in the regression example of sex and income: it is the difference in means between a given category and the reference category. The b coefficient of -7,000 for the dummy variable French, X1, tells us that the expected income of French-only speaking persons is $7,000 less than the mean income of English-only speaking persons, the reference category. Likewise, the b coefficient of 3,000 for the dummy variable Eng&Fr, X2, tells us that those who speak both English and French can expect an income $3,000 higher than that of English-only speaking persons. Finally, the b coefficient of -26,000 for the dummy variable Neither, X3, tells us that the mean income of those who speak neither English nor French is $26,000 less than that of English-only speaking persons.

Once again, we can better understand this regression model if we construct category-specific regression models.

The mean income for English-only speaking persons is given by the equation:

Y = a + b1X1 + b2X2 + b3X3
Y = a + b1(0) + b2(0) + b3(0)
Y = a

thus,

Y = 50,000 + (-7,000)X1 + (3,000)X2 + (-26,000)X3
Y = 50,000 + (-7,000)(0) + (3,000)(0) + (-26,000)(0)
Y = 50,000

The mean income for French-only speaking persons is given by the equation:

Y = a + b1X1 + b2X2 + b3X3
Y = a + b1(1) + b2(0) + b3(0)
Y = a + b1

thus,

Y = 50,000 + (-7,000)X1 + (3,000)X2 + (-26,000)X3
Y = 50,000 + (-7,000)(1) + (3,000)(0) + (-26,000)(0)
Y = 50,000 + (-7,000) = 43,000

The mean income for persons who speak both English and French is given by the equation:

Y = a + b1X1 + b2X2 + b3X3
Y = a + b1(0) + b2(1) + b3(0)
Y = a + b2

thus,

Y = 50,000 + (-7,000)X1 + (3,000)X2 + (-26,000)X3
Y = 50,000 + (-7,000)(0) + (3,000)(1) + (-26,000)(0)
Y = 50,000 + 3,000 = 53,000

The mean income for persons who speak neither English nor French is given by the equation:

Y = a + b1X1 + b2X2 + b3X3
Y = a + b1(0) + b2(0) + b3(1)
Y = a + b3

thus,

Y = 50,000 + (-7,000)X1 + (3,000)X2 + (-26,000)X3
Y = 50,000 + (-7,000)(0) + (3,000)(0) + (-26,000)(1)
Y = 50,000 + (-26,000) = 24,000

The results are summarized in Table 4.

Table 4 Income by First Official Language Spoken

First official language spoken    Mean
English only                      $50,000
French only                       $43,000
Both English and French           $53,000
Neither English nor French        $24,000
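A Python sketch of the same kind of model, for illustration only and with invented data: with an intercept and k - 1 dummies, the fitted coefficients reproduce the category means exactly, which is what the category-specific equations above show.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented incomes for eight hypothetical respondents
    df = pd.DataFrame({
        "language": ["English", "English", "English", "French", "French",
                     "English and French", "English and French", "Neither"],
        "income":   [48000, 52000, 50000, 42000, 44000, 51000, 55000, 24000],
    })

    # Build the k - 1 = 3 dummies with English as the reference category
    df["French"]  = (df["language"] == "French").astype(int)
    df["Eng_Fr"]  = (df["language"] == "English and French").astype(int)
    df["Neither"] = (df["language"] == "Neither").astype(int)

    fit = smf.ols("income ~ French + Eng_Fr + Neither", data=df).fit()
    print(fit.params)                                # a, b1, b2, b3
    print(df.groupby("language")["income"].mean())   # a, a+b1, a+b2, a+b3 match these means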
As a final word of caution, when we have a set of dummy variables the unstandardized regression slope, b, has a straightforward interpretation, but the standardized slope (beta), b*, does not (see Table 3). The typical interpretation of a standardized regression coefficient, as discussed in Chapter 14 of your textbook, is not valid for a set of dummy variables, so the interpretation should focus on the unstandardized regression coefficient, b. The same applies to the partial correlation, r. It is fine, however, to interpret the R2 value, which tells us how much of the variance in the dependent variable is explained by the original independent variable.

Dummy with Interval-Ratio Variables in Linear Regression

In practice, we often perform linear regression with more than one independent variable. We may have a linear regression model with more than one nominal or ordinal variable, or a combination of nominal, ordinal, and interval-ratio independent variables. Here we will consider an example of the latter situation, where we have one nominal and one interval-ratio independent variable. In the example to follow, we will not use any data or compute an actual regression analysis.

Let's consider a multiple regression of income, Y, on sex, X1, and number of years of work experience, X2:

Y = a + b1X1 + b2X2

Again, we can better understand this model by constructing separate equations for each category of the dummy variable sex (note that sex is coded as 1 = male and 0 = female).

The mean income for females (X1 = 0) is given by the equation:

Y = a + b1X1 + b2X2
Y = a + b1(0) + b2X2
Y = a + b2X2

The mean income for males (X1 = 1) is given by the equation:

Y = a + b1X1 + b2X2
Y = a + b1(1) + b2X2
Y = (a + b1) + b2X2

We also graph these two regression models in Figure 1, where

a is the intercept for category X1 = 0 (i.e., for women);
b1 is the mean difference between the categories of X1 at any given value of X2 (i.e., the difference in the intercepts of the two regression lines for males and females; note that since the regression lines are parallel, the mean difference in income between males and females is constant across all values of X2);
b2 is the slope, or the average amount of change produced in Y for each one-unit change in X2 within each category of X1 (i.e., the amount of change in income for each year of experience, for both males and females).

To summarize, Figure 1 shows us that the relationship between X2, experience, and Y, income, does not change across the categories of the dummy variable sex, X1: the two regression lines are parallel. While the slopes of the two lines are the same, their intercepts are not. That is, the slope, b2, of the regression line is the same for males and females, yet the mean income of males and females differs by b1.[7]

Footnote 7: If the independent variable had three or more categories, then there would be three or more parallel lines in the graph.

Figure 1 Linear Regressions of Income, Y, on Number of Years of Work Experience, X2, for Males and Females: An Additive Model. [The figure plots income against experience: two parallel lines, the line for males with intercept a + b1 and the line for females with intercept a, both with slope b2.]

Additive vs. Interaction Model

The regression model above, Y = a + b1X1 + b2X2, is an example of what is called an additive model, since each independent variable has an "additive" effect on the dependent variable. This is illustrated by the parallel lines in Figure 1: the effect of experience on income is the same regardless of sex, and vice versa (i.e., the effect of sex on income is the same regardless of the specific value of experience).
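Although the chapter does not fit this model to data, the additive model is easy to sketch with invented numbers. In the Python sketch below, the predictions for males and females form two parallel lines: the same coefficient b2 on experience for both groups, with intercepts a and a + b1. All values are hypothetical.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 200
    sex = rng.integers(0, 2, n)                 # 1 = male, 0 = female (hypothetical coding)
    exper = rng.uniform(0, 30, n)               # years of work experience
    income = 30000 + 15000 * sex + 800 * exper + rng.normal(0, 5000, n)

    df = pd.DataFrame({"income": income, "sex": sex, "exper": exper})
    fit = smf.ols("income ~ sex + exper", data=df).fit()
    print(fit.params)                           # a (Intercept), b1 (sex), b2 (exper)

    # Predicted income at 10 years of experience for each sex:
    new = pd.DataFrame({"sex": [0, 1], "exper": [10, 10]})
    print(fit.predict(new))                     # the gap equals b1 at every experience level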
There are times, however, when the independent variables interact to affect the dependent variable. In this situation, which we call an interaction model, the regression lines are not parallel; that is, the slope differs across the categories of the dummy variable. To account for the interaction, an interaction variable is included in the regression model.

Let's reconsider the relationship between experience and income for males and females, but this time we will assume that the independent variables interact to affect the dependent variable. This model includes the dependent variable income, Y, and the independent variables sex, X1, and experience, X2, plus the interaction variable, X1*X2, which is the product (X1 multiplied by X2) of the other two independent variables:

Y = a + b1X1 + b2X2 + b3X1*X2

Once again we can better understand this regression model by constructing regression models for each category of the dummy variable sex.

The mean income for females (X1 = 0) is given by the equation:

Y = a + b1X1 + b2X2 + b3X1*X2
Y = a + b1(0) + b2X2 + b3(0)*X2
Y = a + b2X2

The mean income for males (X1 = 1) is given by the equation:

Y = a + b1X1 + b2X2 + b3X1*X2
Y = a + b1(1) + b2X2 + b3(1)*X2
Y = a + b1 + b2X2 + b3X2
Y = (a + b1) + (b2 + b3)X2

For further clarity, these regression models are graphed in Figure 2, where

a and b2 are, respectively, the intercept and slope of the regression line for females;
b1 and b3 are, respectively, the differences in intercept and slope between males and females.

The interaction effect is seen in Figure 2, where the intercepts of the regression lines for males and females differ and the lines continually diverge as experience increases: the effect of work experience on income differs for males and females, and it is males who are getting the higher return in income for work experience.

Figure 2 Linear Regressions of Income on Number of Years of Work Experience for Males and Females: An Interaction Model. [The figure plots income against experience: the line for males has intercept a + b1 and slope b2 + b3, while the line for females has intercept a and slope b2, so the two lines diverge as experience increases.]
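To close, here is a matching Python sketch of the interaction model, again with invented values: adding the product term lets the slope on experience differ by sex, so the fitted group-specific lines have intercepts a and a + b1 and slopes b2 and b2 + b3.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 200
    sex = rng.integers(0, 2, n)                       # 1 = male, 0 = female (hypothetical coding)
    exper = rng.uniform(0, 30, n)
    # Invented "truth": males get a larger return to experience than females
    income = 30000 + 8000 * sex + 600 * exper + 700 * sex * exper + rng.normal(0, 5000, n)

    df = pd.DataFrame({"income": income, "sex": sex, "exper": exper})

    # sex * exper expands to sex + exper + sex:exper (the interaction term)
    fit = smf.ols("income ~ sex * exper", data=df).fit()
    print(fit.params)        # Intercept = a, sex = b1, exper = b2, sex:exper = b3

    # Slope of the female line is b2; slope of the male line is b2 + b3
    b = fit.params
    print(b["exper"], b["exper"] + b["sex:exper"])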