Linear Regression with Dummy Variables

to accompany
Statistics: A Tool for Social Research, Second Canadian Edition
By Joseph F. Healey and Steven G. Prus

Prepared by
Steven Prus
Carleton University

COPYRIGHT © 2013 by Nelson Education Ltd. Nelson is a registered trademark used herein under licence. All rights reserved.

For more information, contact Nelson, 1120 Birchmount Road, Toronto, ON M1K 5G4. Or you can visit our Internet site at www.nelson.com.

ALL RIGHTS RESERVED. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means (graphic, electronic, or mechanical, including photocopying, recording, taping, web distribution or information storage and retrieval systems) without the written permission of the publisher.
Linear Regression with Dummy Variables
Introduction
We saw in Chapters 13 and 14 that linear regression analysis requires that variables be measured at the interval-ratio level. One method commonly used by researchers to overcome this restriction when the independent variable is measured at the nominal or ordinal level is a process called "dummy" variable coding. (To learn how we can deal with a nominal or ordinal level dependent variable, see the other online chapter titled "Regression with a Dichotomous Dependent Variable: An Introduction to Logistic Regression.")
This technique involves converting nominal or ordinal variables into dichotomous variables,
variables with just two values: 1s and 0s. We call these new dichotomous variables “dummy”
variables; they are also known as indicator variables. We then use dummy variables in a linear
regression model to examine differences between categories or groups such as males and
females, young, middle-aged, and older persons, and so on.
A dummy variable is appropriate for linear regression for several reasons. First, it can be treated like an interval-ratio variable. We can think of a dummy variable as a numeric variable (with values 0 and 1) that has a "mean," equal to the proportion of cases coded 1 in the distribution, and a "standard deviation," equal to the square root of p*q, where p is the proportion of cases coded 1 and q is (1 - p). Second, a dummy variable can take only one shape: it can have only a linear relationship with the dependent variable, assuming that there is indeed an actual relationship. Third, linear regression results with a dummy independent variable are interpreted in exactly the same way as regression results with an interval-ratio one.
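To make the first point concrete, here is a minimal sketch (in Python rather than SPSS, using a small hypothetical 0-1 variable) confirming that the mean of a dummy variable equals p and its standard deviation equals the square root of p*q:

import numpy as np

# Hypothetical dummy-coded variable: 1 = male, 0 = female, for ten cases
sex = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

p = sex.mean()         # proportion of cases coded 1
q = 1 - p              # proportion of cases coded 0

print(p)               # the "mean" of the dummy variable is p (here 0.6)
print(np.sqrt(p * q))  # the "standard deviation" is sqrt(p*q)
print(sex.std())       # computed directly (population SD); matches sqrt(p*q)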
In this chapter we will provide a basic understanding of linear regression with dummy variables.
We will consider three basic situations of dummy variables in linear regression models: 1) a
single dummy variable; 2) a set of dummy variables; and 3) a single dummy variable with an
interval-ratio variable, with and without interactions between them.
What you will learn by the end of this chapter is that linear regression with dummy variables is
essentially the same as linear regression with only interval-ratio variables—regression with
dummy variables is conducted and interpreted in the same way as it is with interval-ratio level
variables. Thus, you may wish to review Chapters 13 and 14 before reading on.
Dummy Variables in Linear Regression
A "Natural" Dummy Variable
The simplest case of regression analysis with dummy
variables is with a single dichotomous variable. Consider the following example. Suppose that
we are interested in examining the effect of sex on income. In this situation we have an intervalratio dependent variable, income, and a nominal independent variable, sex. A variable like sex
with just two categories is considered a "natural" dummy variable since it does not need to be
converted and thus can be directly entered in a linear regression model.
As an illustration, let’s make use of data from the 2006 Canadian Census supplied with your
textbook, and compute a linear regression of income on sex using SPSS. To make the findings
more comparable, we selected only respondents who worked mainly full-time weeks. Table 1
shows the regression coefficients from the SPSS output, where sex is coded as 1 for males and 0
for females (note, we call the category coded as 1 the target category, or category of interest, and the category coded as 0 the reference or comparison category). 1 For convenience of discussion, the regression
coefficients have been rounded to the nearest 1000 (e.g., the actual coefficient for sex is 17766,
which we rounded to 18000).
Table 1 SPSS Linear Regression Output of Income, Y, on Sex, X

Coefficients(a)

                      Unstandardized Coefficients    Standardized Coefficients
Model                 B          Std. Error          Beta           t         Sig.
1   (Constant)        38000      3100                               12.258    .000
    sex               18000      4200                .148           4.048     .000

a. Dependent Variable: Total Income of individual
Using the results in Table 1 we can write the model for our data as:
FORMULA 1
Y = a + bX
Y = 38,000 + (18,000)(X)
where Y = score on the dependent variable
a = mean of category X = 0
b = mean of category X = 1 minus mean of X = 0
X = score on the independent variable
Note, SPSS puts the a and b linear regression coefficients in the column labeled “B”.
Recall from Chapter 13 of your textbook that we used the regression line, Y = a + bX, as a way of describing the linear relationship between an independent and a dependent interval-ratio variable. With a dummy independent variable, the model is interpreted in the same way.
The Y intercept, a, is equal to the point where the regression line crosses the Y axis, or the expected value (mean) of Y when X is zero. The slope, b, is the amount of change produced in Y, on average, for each unit change in X. When the value of b is positive, it tells us how much Y is expected to increase as X increases by one unit; when the value of b is negative, it tells us how much Y is expected to decrease as X increases by one unit. 2

1 See Chapter 13 of your textbook, namely SPSS Demonstration 13.2, for details on how to use the SPSS linear regression command. The same routine is used whether the independent variable is an interval-ratio or dummy variable. However, because the variable sex in the Census data file is coded with values other than 0 and 1 (sex is coded with the values 1 and 2), we first recoded it using the 0-1 scheme, then entered the recoded sex variable into the regression analysis. We should point out that other coding schemes (e.g., 1 and 2) produce fundamentally the same results as the 0-1 scheme.
In the present example, the intercept, a, represents the mean income of females, $38,000, since
females are coded as 0. The slope, b, tells us how much the dependent variable changes on
average for each unit change in the independent variable. 3 However, a dummy independent
variable can only change from 0 to 1. Thus, as we move from the category coded as 0 (female) to
the category coded as 1 (male), income increases by $18,000. In other words, the value of b
equals the mean of Y for category X = 1 minus the mean of Y for X = 0, or what we call simply
the “mean difference”—the difference in mean income between males and female is $18,000. 4
What these results imply is that if a is the mean income of females and b is the difference in
mean income between males and females, then a + b ($38,000 + $18,000 = $56,000) must be equal
to the mean income of males. The results can also be illustrated algebraically by constructing
category-specific regression models—separate equations for each category of the dummy
variable. We do this by setting the dummy variable sex to the appropriate value:
The mean income for females (X = 0) is given by the equation:
Y = a + bX
Y = a + b(0)
Y = a
thus,
Y = 38,000 + (18,000)X
Y = 38,000 + (18,000)(0)
Y = 38,000
The mean income for males (X = 1) is given by the equation:
Y = a + bX
Y = a + b(1)
Y = a + b
thus,
Y = 38,000 + (18,000)X
Y = 38,000 + (18,000)(1)
Y = 38,000 + 18,000 = 56,000

The mean income of females is $38,000, which is exactly equal to the intercept, a, and the difference in mean income between males and females ($56,000 - $38,000) is $18,000, which is exactly equal to the slope, b.

2 It may interest you to know that linear regression analysis with dummy variables produces the same results as the t-test and Analysis of Variance (ANOVA) discussed in Chapters 8 and 9 of your textbook: linear regression with a single dummy variable is a t-test for two categories, and linear regression with a set of dummy variables (you will learn what a set of dummy variables is later in this chapter) is an ANOVA with three or more categories. These techniques differ mainly in the way the results are expressed, with linear regression expressing results in the form of the regression line Y = a + bX, while differences in means between one category and another require additional calculation (post-hoc tests) in ANOVA.

3 The b coefficient of a dummy variable more appropriately measures the "distance" from the dummy category to the reference category, as opposed to a "slope." Nonetheless, we do not wish to cause undue confusion and will continue to call the b coefficient of a dummy variable a "slope."

4 When the value of b is greater than 0, as in our example, the mean response of the category coded as 1 is higher than the mean response of the category coded as 0. Conversely, when the value of b is less than 0, the mean response of the category coded as 1 is lower than the mean response of the category coded as 0.
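This arithmetic can also be checked numerically. The following sketch (Python, with simulated incomes rather than the actual Census data; the group means and spread are assumptions for illustration) regresses income on a 0-1 sex dummy and confirms that the intercept equals the mean of the category coded 0 and the slope equals the difference in means:

import numpy as np

# Simulated full-time incomes; the real Census microdata are not reproduced here
rng = np.random.default_rng(0)
income_f = rng.normal(38_000, 20_000, 200)   # females, coded X = 0
income_m = rng.normal(56_000, 20_000, 200)   # males, coded X = 1

y = np.concatenate([income_f, income_m])
x = np.concatenate([np.zeros(200), np.ones(200)])

# Ordinary least squares with an intercept column and the dummy variable
X = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

print(a, income_f.mean())                     # intercept = mean income of females
print(b, income_m.mean() - income_f.mean())   # slope = male mean minus female mean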
As a final comment, while we arbitrarily coded sex as 1 = males and 0 = females in the example above, we could have alternatively coded 1 = females and 0 = males, in which case the computed regression model would be:
Y = a + bX
Y = 56,000 + (-18,000)(X)
As you can see, the way the categories of a dummy variable are coded affects the regression coefficients in two ways. First, the sign of the slope b is reversed. That is, the direction of the relationship between the dummy variable and the dependent variable will differ depending on the way the dummy variable is coded. While the magnitude of the slope b is the same ($18,000), its sign is reversed; it is negative (as we move from the category coded as 0, male, to the category coded as 1, female, income decreases by $18,000). Second, since males are now coded as 0, a is equal to the mean income of males, or $56,000 (remember, the intercept, a, represents the value of Y when X is zero).
Overall, it is not of great importance which category is coded as 1 and which is coded as 0, namely because the magnitude of the slope b is not affected. Other results of the regression analysis, such as r2 and the significance test for the slope b (not shown here), remain the same as well.
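A second small sketch (Python, with each hypothetical case set exactly to its group mean) shows that reversing the coding simply moves the intercept to the other category's mean and flips the sign of the slope:

import numpy as np

# Tiny hypothetical data set: every case sits exactly at its group mean
y = np.array([38_000.0] * 3 + [56_000.0] * 3)        # three females, then three males
male1 = np.array([0, 0, 0, 1, 1, 1], dtype=float)    # coding 1: 1 = male, 0 = female
female1 = 1 - male1                                  # coding 2: 1 = female, 0 = male

for x in (male1, female1):
    X = np.column_stack([np.ones_like(x), x])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    print(a, b)
# prints roughly (38000, 18000) with males coded 1, and (56000, -18000) with females coded 1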
Dummy vs. Effect Coding

The interpretation of the regression results is straightforward using dummy-variable (1 and 0) coding: one of the categories (0) is the reference category to which the other category (1) is compared. Nonetheless, other coding schemes are possible. Here we will look at one alternative to dummy coding called effect or deviation coding (another commonly used alternative is contrast or orthogonal coding, though we will not consider it).
Effect coding uses a 1 and -1 scheme. Using our example problem, we can code sex as 1 for
males and -1 for females, where the category coded 1 (males) is the target category. In such a
situation, we interpret the regression model with a single dummy variable as follows:
Y = a + bX
where Y = score on the dependent variable
a = grand mean
b = mean of category X = 1 minus grand mean
X = score on the independent variable
Using effect coding, the reference "category" is the entire sample or all cases, as opposed to dummy coding, where the reference category is the category coded as 0. Thus, the intercept a is equal to the grand mean of Y (i.e., the mean of the means of the categories of X) and the slope b is the mean difference in Y between the target category (the category coded as 1) and the grand mean.
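As an illustration of effect coding, the sketch below (Python, with hypothetical cases placed exactly at the group means from the earlier sex example, and equal group sizes) recovers the grand mean as the intercept and the male mean minus the grand mean as the slope:

import numpy as np

# Hypothetical data placed exactly at the two group means from the sex example
income = np.array([38_000.0] * 3 + [56_000.0] * 3)      # females, then males
effect = np.array([-1, -1, -1, 1, 1, 1], dtype=float)   # effect coding: -1 = female, 1 = male

X = np.column_stack([np.ones_like(effect), effect])
a, b = np.linalg.lstsq(X, income, rcond=None)[0]

print(a)   # 47000: the grand mean (the mean of the two category means)
print(b)   # 9000: the male mean (56000) minus the grand mean (47000)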
Overall, coding (e.g., dummy vs. effect coding) will affect the magnitude of the coefficients. The decision to use one coding method over another must therefore be given serious consideration; however, dummy variable coding is the most commonly used method.
Polytomous Variables

When we have a "natural" dummy variable with just two categories, the interpretation of linear regression results is straightforward, as we have just seen. For a nominal or ordinal independent variable with more than two categories, often referred to as a polytomous variable, the regression analysis is a little more complicated because we need to create and interpret a "set" of dummy variables. The exact number of variables in the set is equal to the number of categories in the original variable minus one:

k - 1 dummy variables

where k = the number of categories of the independent variable
By creating k-1 dummy variables, all information on the original independent variable is
retained. 5 As before, each dummy variable in the set has only two categories: 0 and 1.
Since each dummy variable in the set is considered an independent variable, we use the multiple regression model (see Chapter 14 of your textbook) to specify the relationship as follows:

FORMULA 2

Y = a + b1X1 + b2X2 + b3X3 + ... + bkXk

where Y = score on the dependent variable
a = mean of the reference category
b1 = mean of category X1 minus mean of the reference category
b2 = mean of category X2 minus mean of the reference category
b3 = mean of category X3 minus mean of the reference category
bk = mean of the kth category minus mean of the reference category
X1, X2, X3, ..., Xk = scores on the k independent variables
5
If we attempted to create a dummy variable for each category of the original variable, an
irreparable problem called perfect multicollinearity would arise and the regression coefficients
could not be calculated. For example, if we had a variable with four categories and created a
fourth dummy variable, it would be an exact linear function of the other three dummy variables,
resulting in perfect multicollinearity. To avoid this problem, k-1 dummy variables need to be
created.
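The rank problem described in this footnote can be illustrated with a short sketch (Python, hypothetical cases): adding a dummy for every category makes the columns of the design matrix perfectly collinear with the intercept, while k - 1 dummies leave the matrix at full column rank:

import numpy as np

# Hypothetical cases: two English, two French, one bilingual, one neither
english = np.array([1, 1, 0, 0, 0, 0], dtype=float)
french  = np.array([0, 0, 1, 1, 0, 0], dtype=float)
eng_fr  = np.array([0, 0, 0, 0, 1, 0], dtype=float)
neither = np.array([0, 0, 0, 0, 0, 1], dtype=float)

X_all = np.column_stack([np.ones(6), french, eng_fr, neither, english])  # dummies for every category
X_k1  = np.column_stack([np.ones(6), french, eng_fr, neither])           # k - 1 dummies only

print(np.linalg.matrix_rank(X_all))  # 4, not 5: the four dummies sum to the intercept column
print(np.linalg.matrix_rank(X_k1))   # 4: full column rank, so the coefficients can be estimated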
We interpret this model (Formula 2) the same way we interpret any multivariate regression
equation. X1 identifies the first independent variable, X2 the second independent variable, and so
on. The intercept a is the expected value (mean) of Y when all of the dummy variables are zero. The b
coefficients are subscripted to identify the independent variable associated with each, and
represent the change in Y for the “included” categories relative to the “excluded” category. More
loosely speaking, the b coefficient for a dummy variable is the difference in means between the
two categories of the dummy variable, 1 and 0, where 1 is the “included” and 0 the “excluded”
category. Since all other categories are compared to it, the excluded category is the reference
category.
As an example of creating and interpreting a set of dummy variables from a single polytomous
variable, let’s examine the effect of language on income. In the Canadian Census, the variable
“official language spoken” has four categories: English; French; English and French (bilingual);
and neither English nor French. To use this variable in a linear regression analysis, we must first
convert it into a set of three (k – 1, or 4 - 1 = 3) dummy variables, which we will call French, X1,
Eng&Fr, X2, and Neither, X3. Our coding is illustrated below as well as in Table 2:
French, X1, is coded as 1 = French vs. 0 = otherwise
Eng&Fr, X2, is coded as 1 = English and French vs. 0 = otherwise
Neither, X3, is coded as 1 = Neither English nor French vs. 0 = otherwise
Table 2 Dummy Coding for the Variable Official Language Spoken

Categories of Original Variable    X1 (French)   X2 (Eng&Fr)   X3 (Neither)   Description of Dummy Variable
French                             1             0             0              1 = French, 0 = otherwise
English and French                 0             1             0              1 = English and French, 0 = otherwise
neither English nor French         0             0             1              1 = neither English nor French, 0 = otherwise
English                            0             0             0              (reference category)
According to this coding:
1. Persons who speak French only are uniquely identified by 1 on the dummy variable French and 0 on the other two dummy variables;
2. Persons who speak English and French are uniquely identified by 1 on the dummy variable Eng&Fr and 0 on the other two dummy variables;
3. Persons who speak neither English nor French are uniquely identified by 1 on the dummy variable Neither and 0 on the other two dummy variables;
4. Persons who speak English only are uniquely identified by 0 on all three dummy variables.

Hence, we need only three dummy variables to capture all information about language. We do not need to create a fourth dummy variable for "English" because this category is already represented by the other three dummy variables: when each of the dummy variables is 0, then it must be the case that the person is in the "English" category. This category is the excluded or reference category, and the b coefficient for each dummy variable is compared against it.
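If you are preparing the data outside SPSS, the three dummies can be built directly from the original variable. The sketch below (Python, hypothetical category labels) creates French, Eng&Fr, and Neither, leaving English as the all-zeros reference category:

import numpy as np

# Hypothetical responses to "official language spoken"
language = np.array(["English", "French", "Eng&Fr", "English", "Neither", "French"])

# Build the k - 1 = 3 dummy variables; English is left out as the reference category
french  = (language == "French").astype(int)    # X1
eng_fr  = (language == "Eng&Fr").astype(int)    # X2
neither = (language == "Neither").astype(int)   # X3

print(np.column_stack([french, eng_fr, neither]))
# English speakers are the rows with 0 on all three columns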
The selection of the reference category is entirely arbitrary, though it is often the category with
the most cases. This will likely produce more stable comparisons in the regression analysis. Case
in point, English is the largest category in the variable “Official Language Spoken,” so we
selected it as the reference.
Now that our independent variable language has been converted into a set of dummy variables,
we can proceed to the regression analysis. 6 Using data from the 2006 Canadian Census, our
linear regression of income on language is shown in Table 3 (again, we selected only
respondents who worked mainly full-time weeks and rounded the regression coefficients to the
nearest 1000).
Table 3 SPSS Linear Regression Output of Income, Y, on French, X1, Eng&Fr, X2, and Neither, X3

Coefficients(a)

                      Unstandardized Coefficients    Standardized Coefficients
Model                 B          Std. Error          Beta           t         Sig.
1   (Constant)        50000      2500                               20        .000
    French, X1        -7000      4800                -.052          -1.458    .144
    Eng&Fr, X2        3000       19000               .006           .158      .870
    Neither, X3       -26000     22700               -.041          -1.145    .251

a. Dependent Variable: Total income of individual
Using the results in Table 3 we can write the model for our data as:
Y = a + b1X1+ b2X2+ b3X3
Y = 50000 + (-7000)X1 + (3000)X2 + (-26000)X3
We interpret the b coefficient for the dummy variables much like we did in the regression example of sex and income: it is the difference in means between a given category and the reference category. The b coefficient of -7000 for the dummy variable French, X1, tells us that the expected income of French-only speaking persons is $7,000 less than the mean of English-only speaking persons, the reference category. On the other hand, with a b coefficient of 3000 for the dummy variable Eng&Fr, X2, those who speak both English and French can expect an income $3,000 higher than that of English-only speaking persons. Finally, the mean income of those who speak neither English nor French is $26,000 less than that of those who speak English only.
6
The linear regression function in SPSS unfortunately does not automatically recode nominal or
ordinal variables into dummy variables; you must do this yourself using the recode command.
Once dummy variables are created, they are then inserted together as a group in the regression
model—see Chapter 14 of your textbook and SPSS Demonstration 14.1 for instruction on how to
use the SPSS linear regression command.
Once again, we can better understand this regression model if we construct category-specific
regression models:
The mean income for English-only speaking persons is given by the equation:
Y = a + b1X1+ b2X2+ b3X3
Y = a + b1(0)+ b2(0)+ b3(0)
Y = a
thus,
Y = 50000 + (-7000)X1 + (3000)X2 + (-26000)X3
Y = 50000 + (-7000)(0) + (3000)(0) + (-26000)(0)
Y = 50,000
The mean income for French-only speaking persons is given by the equation:
Y = a + b1X1+ b2X2+ b3X3
Y = a + b1(1)+ b2(0)+ b3(0)
Y = a + b1
thus,
Y = 50000 + (-7000)X1 + (3000)X2 + (-26000)X3
Y = 50000 + (-7000)(1) + (3000)(0) + (-26000)(0)
Y = 50,000 + (-7,000) = 43,000
The mean income for English and French speaking persons is given by the equation:
Y = a + b1X1+ b2X2+ b3X3
Y = a + b1(0)+ b2(1)+ b3(0)
Y = a + b2
thus,
Y = 50000 + (-7000)X1 + (3000)X2 + (-26000)X3
Y = 50000 + (-7000)(0) + (3000)(1) + (-26000)(0)
Y = 50,000 + 3000 = 53,000
The mean income for neither English nor French speaking persons is given by the equation:
Y = a + b1X1+ b2X2+ b3X3
Y = a + b1(0)+ b2(0)+ b3(1)
Y = a + b3
thus,
Y = 50000 + (-7000)X1 + (3000)X2 + (-26000)X3
Y = 50000 + (-7000)(0) + (3000)(0) + (-26000)(1)
Y = 50,000 + (-26,000) = 24,000
The results are summarized in Table 4.

Table 4 Income by First Official Language Spoken

First official language spoken    Mean
English Only                      $50,000
French Only                       $43,000
Both English and French           $53,000
Neither English nor French        $24,000
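To check the logic of Formula 2 numerically, the sketch below (Python, with hypothetical cases set exactly to the group means in Table 4) fits the three-dummy regression and recovers the intercept and b coefficients reported above:

import numpy as np

# Hypothetical cases set exactly to the group means reported in Table 4
groups = ["English"] * 4 + ["French"] * 4 + ["Eng&Fr"] * 4 + ["Neither"] * 4
means = {"English": 50_000, "French": 43_000, "Eng&Fr": 53_000, "Neither": 24_000}
y = np.array([means[g] for g in groups], dtype=float)

x1 = np.array([g == "French" for g in groups], dtype=float)    # French dummy
x2 = np.array([g == "Eng&Fr" for g in groups], dtype=float)    # Eng&Fr dummy
x3 = np.array([g == "Neither" for g in groups], dtype=float)   # Neither dummy

X = np.column_stack([np.ones_like(y), x1, x2, x3])
a, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

print(a)            # 50000: the mean of the reference category (English only)
print(b1, b2, b3)   # -7000, 3000, -26000: each category's difference from the English mean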
As a final word of caution, when we have a set of dummy variables the unstandardized regression slope, b, has a straightforward interpretation but the standardized slope (beta), b*, does not (see Table 3). The typical interpretation of a standardized regression coefficient, as discussed in Chapter 14 of your textbook, is not valid for a set of dummy variables, and the interpretation should focus on the unstandardized regression coefficient, b. The same applies to partial r; however, it is fine to interpret the R2 value, which tells us how much of the variance in the dependent variable is explained by the original independent variable.
Dummy and Interval-Ratio Variables in Linear Regression
In practice, we often perform linear regression with more than one independent variable. We
may have a linear regression model with more than one nominal or ordinal variable or a
combination of nominal, ordinal, and interval-ratio independent variables. Here, we will consider
an example of the latter situation, where we have a nominal and interval-ratio independent
variable. In the example to follow, we will not use any data or compute an actual regression
analysis.
Let’s consider a multivariate regression of income, Y, on sex, X1, and number of years of work
experience, X2:
Y = a + b1X1+ b2X2
Again, we can better understand this model by constructing separate equations for each category
of the dummy variable sex (note, sex is coded as 1 = males and 0 = females):
The mean income for females (X1 = 0) is given by the equation:
Y = a + b1X1+ b2X2
Y = a + b1(0) + b2X2
Y = a + b2X2
The mean income for males (X1 = 1) is given by the equation:
Y = a + b1X1+ b2X2
Y = a + b1(1) + b2X2
Y = (a + b1) + b2X2
We also graph these two regression models in Figure 1, where

a is the intercept for category X1 = 0 (i.e., for females);

b1 is the mean difference between the categories of X1 at each value of X2 (i.e., the difference in intercepts of the two regression lines for males and females. Note that since the regression lines are parallel, the mean difference in income between males and females is constant across all values of X2);

b2 is the slope, or the average amount of change produced in Y for each one-unit change in X2 within each category of X1 (i.e., the amount of change in income for each year of experience for both males and females).
To summarize, Figure 1 shows us that the relationship between X2, experience, and Y, income, does not change across the categories of the dummy variable sex, X1: the two regression lines are parallel. While the slopes of the two lines are the same, their intercepts are not. That is, the slope, b2, of the regression line is the same for males and females, yet mean income differs between males and females by the constant amount b1. 7
Figure 1 Linear Regressions of Income, Y, on Number of Years of Work Experience, X2, for
Males and Females: An Additive Model
[Figure 1 plots income (vertical axis) against experience (horizontal axis) with two parallel lines: the line for females has intercept a, the line for males has intercept a + b1, and both lines have slope b2.]
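A brief sketch of the additive model (Python, simulated data with assumed coefficients chosen purely for illustration) shows that the fitted equation gives two parallel lines, one for each category of the dummy variable:

import numpy as np

# Simulated additive data: income = 30000 + 15000*sex + 2000*experience + noise
# (the coefficient values are assumptions chosen only for illustration)
rng = np.random.default_rng(1)
n = 400
sex = rng.integers(0, 2, n).astype(float)    # 1 = male, 0 = female
exp_yrs = rng.uniform(0, 30, n)              # years of work experience
income = 30_000 + 15_000 * sex + 2_000 * exp_yrs + rng.normal(0, 5_000, n)

X = np.column_stack([np.ones(n), sex, exp_yrs])
a, b1, b2 = np.linalg.lstsq(X, income, rcond=None)[0]

# Category-specific lines: females  Y = a + b2*experience
#                          males    Y = (a + b1) + b2*experience  (same slope, higher intercept)
print(a, b1, b2)   # roughly 30000, 15000, 2000: two parallel lines, b1 apart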
Additive vs. Interaction Model

The regression model above, Y = a + b1X1 + b2X2, is an example of what is called an additive model, since each independent variable has an "additive" effect on the dependent variable. This is illustrated by the parallel lines in Figure 1: the effect of experience on income is the same regardless of sex, and vice versa (i.e., the effect of sex on income is the same regardless of the specific value of experience).
7 If the independent variable had three or more categories, then there would be three or more parallel lines in the graph.
There are times, however, when independent variables interact to affect the dependent variable.
In this situation, which we call an interaction model, the regression lines are not parallel; i.e., the
slopes differ for each category of the dummy variable. To account for the interaction, an
interaction variable is included in the regression model.
Let’s reconsider the relationship between experience and income for males and females, but this
time we will assume that the independent variables interact to affect the dependent variable. This
model includes the dependent variable income, Y, and the independent variables sex, X1,
experience, X2 , and the interaction variable, X1*X2, which is the product (X1 multiplied by X2) of
the other two independent variables:
Y = a + b1X1+ b2X2 + b3X1*X2
Once again we can better understand this regression model by constructing regression models for
each category of the dummy variable sex:
The mean income for females (X1 = 0) is given by the equation:
Y = a + b1X1+ b2X2 + b3X1*X2
Y = a + b1(0) + b2X2 + b3(0)*X2
Y = a + b2X2
The mean income for males (X1 = 1) is given by the equation:
Y = a + b1X1+ b2X2 + b3X1*X2
Y = a + b1(1) + b2X2 + b3(1)*X2
Y = a + b1 + b2X2 + b3X2
Y = (a + b1) + (b2 + b3)X2
For further clarity, these regression models are graphed in Figure 2. Here,

a and b2 are the intercept and slope of the regression line, respectively, for females;

b1 and b3 are the differences in intercepts and slopes, respectively, between males and females.

The interaction effect is seen in Figure 2, where the intercepts of the regression lines for males and females differ and the lines continually diverge as experience increases: the effect of work experience differs for males and females, and it is males who get the higher return in income for work experience.
Figure 2 Linear Regressions of Income on Number of Years of Work Experience for Males and
Females: An Interaction Model
[Figure 2 plots income (vertical axis) against experience (horizontal axis) with two diverging lines: the line for females has intercept a and slope b2, while the line for males has intercept a + b1 and slope b2 + b3.]
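Finally, a parallel sketch of the interaction model (Python, simulated data with assumed coefficients) shows how including the product term X1*X2 lets the experience slope differ between males and females:

import numpy as np

# Simulated interaction data: the experience slope is steeper for males
# (coefficient values are assumptions chosen only for illustration)
rng = np.random.default_rng(2)
n = 400
sex = rng.integers(0, 2, n).astype(float)    # 1 = male, 0 = female
exp_yrs = rng.uniform(0, 30, n)
income = (30_000 + 5_000 * sex + 1_500 * exp_yrs
          + 800 * sex * exp_yrs + rng.normal(0, 5_000, n))

# Include the product term sex * experience as the interaction variable
X = np.column_stack([np.ones(n), sex, exp_yrs, sex * exp_yrs])
a, b1, b2, b3 = np.linalg.lstsq(X, income, rcond=None)[0]

print(b2)        # roughly 1500: slope of experience for females
print(b2 + b3)   # roughly 2300: slope of experience for males, so the lines diverge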