

UNC-Wilmington

Department of Economics and Finance

ECN 377

Dr. Chris Dumas

Regression Analysis: Using Dummy Variables to Represent Categorical X Variables

The basic OLS Regression method requires that the dependent (Y) and independent (X) variables be numerical measurement variables. However, it is often the case that important X variables are categorical rather than measurement variables. For example, gender (male/female) and product color (red, blue, yellow, etc.) are categorical variables. Often, we want to know how Y is affected by a categorical X variable. In order to include a categorical X variable in an OLS Regression Analysis, we convert it into one or more “Dummy Variables,” and then we include the Dummy Variables (instead of the original categorical variable) in the regression equation. In this handout we will first learn how to convert categorical variables into Dummy Variables, and then we will see how to include Dummy Variables in a regression equation. (Note: The use of Dummy Variables typically applies to X variables, not the Y variable. If the Y variable in a regression analysis is a categorical variable, we must use a different, modified type of regression analysis. We will cover this topic in a future handout.)

Dummy Variables, like a mannequin or “dummy,” take the place of, or represent, something else, namely, the categorical variable that they stand for. Dummy Variables are also called “Presence/Absence Variables” or “0/1 Variables.” They are called “Presence/Absence Variables” because they indicate whether a particular category is “present for” (that is, applies to) a particular individual (observation/row) in the dataset. They are called “0/1 Variables” because they can take only two possible values, zero or one.

Creating a Dummy Variable from a Categorical Variable

To explain how a Dummy Variable is created from a categorical variable, let’s consider an example. Suppose we have a categorical variable named “Xgender,” and suppose Xgender can take on only two possible values, “male” or “female.” In our dataset, the data for Xgender look like this (there would be other variables in the dataset, of course):

Xgender
male
female
female
male
male
female
...

Now let’s create a new variable, which we will name “Dmale,” that will indicate whether the individual (row of data) has “male” or “female” for Xgender. If Xgender = “male,” then Dmale = 1. If Xgender = “female,” then Dmale = 0. The Dummy Variable indicates the presence of the category for which the Dummy Variable value equals one.

So, Dmale indicates the presence of males, because when the person is male, Dmale = 1. In other words, Dmale indicates whether “maleness” is present or absent for the individual—if Dmale = 1, maleness is present, and if Dmale = 0, then maleness is absent. After we create the Dmale variable and add it to the dataset, we would have (in addition to the other variables in the dataset):

Xgender   Dmale
male      1
female    0
female    0
male      1
male      1
female    0
...       ...


Notice that we are using zeros and ones to establish a “code” for male and female. Notice, too, that Dmale is a numerical measurement variable (it’s measuring “maleness”), so we can include it in a regression analysis equation.

Also notice that the category “female” does not have a Dummy Variable to represent its presence. Nonetheless, we know that “female” is present whenever “male” is absent. The category without a Dummy Variable, “female,” is the “baseline” category against which “male” will be compared in the regression analysis.

The categorical variable Xgender had only two categories, male and female. What happens if the categorical variable has more than two categories? For example, suppose we had a categorical X variable named “Xcolor,” and suppose Xcolor had five categories: blue, red, green, yellow and brown. (Xcolor might describe the different colors of a product that we sell, such as cars.) When we use Dummy Variables to represent categorical variables, we must follow one simple rule:

You need a number of Dummy Variables equal to one less than the number of categories of the categorical variable represented. For example, the categorical variable Xcolor has 5 categories, so we need 4 (one less than 5) Dummy Variables to represent it.

Each category, except one, of the categorical variable is represented by a Dummy Variable. The category without a Dummy Variable to represent it is the “baseline” category against which the other categories will eventually be compared in the regression analysis. For example, suppose that for the categorical variable Xcolor, we would like “blue” to be our baseline category. If we want blue to be the baseline, then we do NOT create a dummy variable for blue; however, we DO create Dummy Variables for the other categories. When we do so, our dataset might look something like this:

Xcolor     Dred       Dgreen     Dyellow    Dbrown
red        1          0          0          0
blue       0          0          0          0
red        1          0          0          0
yellow     0          0          1          0
yellow     0          0          1          0
brown      0          0          0          1
green      0          1          0          0
blue       0          0          0          0
red        1          0          0          0
brown      0          0          0          1
brown      0          0          0          1
(etc.)...  (etc.)...  (etc.)...  (etc.)...  (etc.)...

Notice that each category, except the baseline category “blue,” has a Dummy Variable to represent its presence. However, we know that the baseline category “blue” is present whenever the other colors are absent; the baseline category is present when the values of all the Dummy Variables are zero. (In the table above, notice that every row with “blue” for Xcolor has zeros for all the Dummy Variables.)

In a similar way, we could construct Dummy Variables to represent a categorical variable with any number of categories.
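The "one less than the number of categories" rule can be sketched in a few lines of code. The example below uses Python rather than SAS, and the helper name `make_dummies`, the `D` prefix argument, and the short list of colors are assumptions made only for illustration. It builds one 0/1 column for every category except the chosen baseline.

```python
def make_dummies(values, baseline, prefix="D"):
    """Return {dummy_name: [0/1, ...]} for every category except the baseline."""
    # Collect the categories in first-seen order so the column order is predictable.
    categories = []
    for v in values:
        if v not in categories:
            categories.append(v)
    if baseline not in categories:
        raise ValueError(f"baseline {baseline!r} does not appear in the data")
    # One dummy per non-baseline category: 1 where the category is present, else 0.
    return {
        prefix + cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != baseline
    }


# Hypothetical data: the first eight rows of an Xcolor column.
xcolor = ["red", "blue", "red", "yellow", "yellow", "brown", "green", "blue"]
dummies = make_dummies(xcolor, baseline="blue")
for name, column in dummies.items():
    print(name, column)
```

With five categories and "blue" as the baseline, the sketch produces four dummy columns, and every "blue" row has zeros in all four.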


How to Include Dummy Variables in Regression Equations


We use Dummy Variables to represent categorical X variables in a regression equation. That is, we leave the categorical X variable out of the regression equation, and we put the Dummy Variables in the equation in place of the categorical variable.

For example, suppose we thought that dependent variable Y was affected by numerical measurement variable X1 and categorical variable Xgender. Y might be the number of pizza slices purchased per year by the typical consumer, X1 might be the price per slice of pizza, and Xgender might be either “male” or “female,” depending on the gender of the consumer. To investigate the relationship between Y, X1 and Xgender, we might begin with the following sample regression equation:

$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_{1i} + \hat{\beta}_2 \cdot \mathit{Xgender}_i + \hat{e}_i$$

However, because Xgender is a categorical variable, we cannot include it directly in the regression equation. Instead, we create a Dummy Variable, Dmale, to represent Xgender: if Xgender = “male,” then Dmale = 1, and if Xgender = “female,” then Dmale = 0.

There are several ways that we could include Dmale in the regression equation, depending on how we think Dmale might affect Y, or the relationship between X1 and Y.

When the Dummy Variable Affects the Intercept of the Regression Equation

Suppose that we think that the Dummy Variable might affect the intercept (but not the slope) of the regression equation. In this case, we would include the Dummy Variable in the regression as shown below (notice that we drop Xgender from the regression equation, because we are using Dmale to represent Xgender):

$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_{1i} + \hat{\beta}_2 \cdot \mathit{Dmale}_i + \hat{e}_i$$

In the equation above, when Dmale = 0, the $\hat{\beta}_2 \cdot \mathit{Dmale}_i$ term drops out of the equation, and the intercept of the equation is simply $\hat{\beta}_0$; this is the intercept when the equation is modeling females (because Dmale = 0). On the other hand, when Dmale = 1, $\hat{\beta}_2$ remains in the equation (it does not drop out), and the intercept of the equation is now $\hat{\beta}_0 + \hat{\beta}_2$; this is the intercept when the equation is modeling males (because Dmale = 1). The effect of Dmale on the graph of the equation is shown in the figure below:

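[Figure: Y (vertical axis) plotted against X1 (horizontal axis). Two parallel regression lines, both with slope $\hat{\beta}_1$: the line for Dmale = 0 has intercept $\hat{\beta}_0$, and the line for Dmale = 1 has intercept $\hat{\beta}_0 + \hat{\beta}_2$.]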

Note: $\hat{\beta}_2$ can be positive or negative; if $\hat{\beta}_2$ is positive, it shifts the intercept upward, and if $\hat{\beta}_2$ is negative, it shifts the intercept downward.


Notice that when Dmale is included in the regression equation in the manner shown above, it affects the intercept of the graph, but it does not affect the slope of the graph.
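To see the intercept shift in numbers, here is a minimal sketch in Python (not the SAS workflow used in this course) that fits the intercept-dummy equation by ordinary least squares on made-up, noise-free data. The data values, the coefficient values 50, -4 and 6 used to generate Y, and the helper name `ols` are all assumptions chosen purely for illustration.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    rows: one list of regressor values per observation, without the constant;
    a column of ones is added here, so b[0] is the intercept.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)]
         + [sum(xi[a] * yi for xi, yi in zip(X, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical data: price per slice (X1) and the Dmale dummy for 8 consumers.
price = [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5]
dmale = [0, 0, 0, 0, 1, 1, 1, 1]
# Noise-free Y generated from assumed coefficients b0 = 50, b1 = -4, b2 = 6.
slices = [50 - 4 * p + 6 * d for p, d in zip(price, dmale)]

b0, b1, b2 = ols([[p, d] for p, d in zip(price, dmale)], slices)
print("intercept for females (Dmale = 0):", round(b0, 6))
print("intercept for males   (Dmale = 1):", round(b0 + b2, 6))
print("common slope:", round(b1, 6))
```

Because the made-up data contain no noise, the fitted coefficients reproduce the assumed values up to rounding error: the two groups share one slope and differ only in their intercepts.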

When the Dummy Variable Affects the Slope of the Regression Equation

Now suppose that we think that the Dummy Variable might affect the slope (but not the intercept) of the regression equation. In this case, we would include the Dummy Variable in the regression equation as shown below (notice that Xgender is missing from the regression equation, because we are using Dmale to represent Xgender):

$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_{1i} + \hat{\beta}_2 \cdot \mathit{Dmale}_i \cdot X_{1i} + \hat{e}_i$$

(Be sure to compare the equation above with the equation in the previous subsection.) In the equation above, when Dmale = 0, then $\hat{\beta}_2$·Dmale·X1 drops out of the equation, and the slope of the equation is simply $\hat{\beta}_1$; this is the slope of the equation when the equation is modeling females (because Dmale = 0). On the other hand, when Dmale = 1, then $\hat{\beta}_2$·Dmale·X1 remains in the equation (it does not drop out), and the slope of the equation is now $\hat{\beta}_1 + \hat{\beta}_2$; this is the slope of the equation when the equation is modeling males (because Dmale = 1). The effect of Dmale on the graph of the equation is shown in the figure below:

[Figure: Y (vertical axis) plotted against X1 (horizontal axis). Two regression lines share the intercept $\hat{\beta}_0$: the line for Dmale = 0 has slope $\hat{\beta}_1$, and the line for Dmale = 1 has slope $\hat{\beta}_1 + \hat{\beta}_2$.]

Note: again, $\hat{\beta}_2$ can be positive or negative; here, if $\hat{\beta}_2$ is positive, it shifts the slope upward, and if $\hat{\beta}_2$ is negative, it shifts the slope downward.

Notice that when Dmale is included in the regression equation in the manner shown above, it affects the slope of the graph, but it does not affect the intercept of the graph.
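Substituting the two possible values of Dmale into the slope-dummy equation makes the two regression lines explicit. This is only a restatement of that equation, with the X1 terms collected; it introduces no new assumptions.

```latex
\begin{aligned}
\mathit{Dmale}_i = 0:&\quad Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_{1i} + \hat{e}_i \\
\mathit{Dmale}_i = 1:&\quad Y_i = \hat{\beta}_0 + (\hat{\beta}_1 + \hat{\beta}_2) \cdot X_{1i} + \hat{e}_i
\end{aligned}
```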

When the Dummy Variable Affects both the Intercept and the Slope of the Regression Equation

Now suppose that we think that the Dummy Variable might affect the intercept and/or the slope of the regression equation. In this case, we would include the Dummy Variable in the regression equation as shown below (again, Xgender is missing from the regression equation, because we are using Dmale to represent Xgender):

$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_{1i} + \hat{\beta}_2 \cdot \mathit{Dmale}_i + \hat{\beta}_3 \cdot \mathit{Dmale}_i \cdot X_{1i} + \hat{e}_i$$

(Be sure to compare the equation above with the equations in the previous subsections.) In the equation above, when Dmale = 0, then $\hat{\beta}_2$·Dmale and $\hat{\beta}_3$·Dmale·X1 drop out of the equation; in this case, the intercept of the equation is simply $\hat{\beta}_0$ and the slope of the equation is simply $\hat{\beta}_1$. These are the intercept and slope when the equation is modeling females (because Dmale = 0). On the other hand, when Dmale = 1, then $\hat{\beta}_2$·Dmale and $\hat{\beta}_3$·Dmale·X1 remain in the equation (they do not drop out); in this case, the intercept of the equation is now $\hat{\beta}_0 + \hat{\beta}_2$ and the slope of the equation is now $\hat{\beta}_1 + \hat{\beta}_3$. These are the intercept and slope when the equation is modeling males (because Dmale = 1). The effect of Dmale on the graph of the equation is shown in the figure below:

[Figure: Y (vertical axis) plotted against X1 (horizontal axis). The regression line for Dmale = 0 has intercept $\hat{\beta}_0$ and slope $\hat{\beta}_1$; the regression line for Dmale = 1 has intercept $\hat{\beta}_0 + \hat{\beta}_2$ and slope $\hat{\beta}_1 + \hat{\beta}_3$.]

Notice that when Dmale is included in the regression equation in the manner shown above, it may affect both the intercept and the slope of the graph.
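The bookkeeping for this case can be sketched as a small Python function that returns the intercept and slope implied for each group. The function name `line_for_group` and the four coefficient values are hypothetical, chosen only to illustrate the arithmetic; they are not estimates from any real dataset.

```python
def line_for_group(b0, b1, b2, b3, dmale):
    """Return (intercept, slope) of Y = b0 + b1*X1 + b2*Dmale + b3*Dmale*X1
    for one value of the dummy variable (0 or 1)."""
    intercept = b0 + b2 * dmale  # b2 shifts the intercept only when Dmale = 1
    slope = b1 + b3 * dmale      # b3 shifts the slope only when Dmale = 1
    return intercept, slope


# Hypothetical coefficient estimates, chosen only for illustration.
b0, b1, b2, b3 = 50.0, -4.0, 6.0, 1.5

female_line = line_for_group(b0, b1, b2, b3, dmale=0)
male_line = line_for_group(b0, b1, b2, b3, dmale=1)
print("females: intercept, slope =", female_line)
print("males:   intercept, slope =", male_line)
```

With these assumed values, the female line has intercept 50 and slope -4, while the male line has intercept 56 and slope -2.5.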

Interaction among Dummy Variables

Sometimes Dummy Variables may interact with one another when it comes to their effects on the regression equation. An Interaction Effect occurs when the effect of one dummy variable differs depending on the value of another dummy variable. Interaction Effects are also called Synergistic Effects or “Double-Whammy” Effects.

The Interaction Effects may be either positive or negative; that is, the presence of one dummy variable may either increase or decrease the effect of the other dummy variable. For example, one dummy variable might normally shift the intercept of the regression equation by +6, but, when another dummy variable is present, the first dummy variable might shift the intercept by only +3. In this example, the second dummy variable “interacted” with the first dummy variable, reducing the effect of the first dummy variable on the intercept.

We can model Interaction Effects between two dummy variables by adding additional terms to the regression equation. The additional terms feature the two potentially interacting dummy variables multiplied together. For example, suppose we have two dummy variables: Dmale, which is a dummy variable for gender, as before, and Dsenior, which indicates whether or not a consumer is a senior citizen (65 years of age or older). Dsenior equals one if the consumer is 65 years of age or older, and Dsenior equals zero if the consumer is less than 65 years old.

We can modify our regression equations to allow for the possibility of interaction effects between Dmale and Dsenior as shown in the examples below, depending on whether we think the interaction effects might affect the intercept, the slope, or both the intercept and the slope, of the regression equation:

Interaction that Might Affect the Intercept

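$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_{1i} + \hat{\beta}_2 \cdot \mathit{Dmale}_i + \hat{\beta}_3 \cdot \mathit{Dsenior}_i + \hat{\beta}_4 \cdot \mathit{Dmale}_i \cdot \mathit{Dsenior}_i + \hat{e}_i$$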

In the equation above, each dummy variable is allowed to have its own effect on the intercept (the $\hat{\beta}_2 \cdot \mathit{Dmale}_i$ and $\hat{\beta}_3 \cdot \mathit{Dsenior}_i$ terms). Then, the $\hat{\beta}_4 \cdot \mathit{Dmale}_i \cdot \mathit{Dsenior}_i$ term adds the interaction effect.
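Plugging the four possible combinations of Dmale and Dsenior into the intercept-interaction equation shows what each coefficient contributes to the intercept. The block below only restates that equation group by group.

```latex
\begin{array}{ll}
\mathit{Dmale}_i = 0,\ \mathit{Dsenior}_i = 0: & \text{intercept} = \hat{\beta}_0 \\
\mathit{Dmale}_i = 1,\ \mathit{Dsenior}_i = 0: & \text{intercept} = \hat{\beta}_0 + \hat{\beta}_2 \\
\mathit{Dmale}_i = 0,\ \mathit{Dsenior}_i = 1: & \text{intercept} = \hat{\beta}_0 + \hat{\beta}_3 \\
\mathit{Dmale}_i = 1,\ \mathit{Dsenior}_i = 1: & \text{intercept} = \hat{\beta}_0 + \hat{\beta}_2 + \hat{\beta}_3 + \hat{\beta}_4
\end{array}
```

So the effect of Dmale on the intercept is $\hat{\beta}_2$ for non-seniors and $\hat{\beta}_2 + \hat{\beta}_4$ for seniors. As an assumed numerical illustration that reuses the +6 and +3 example from earlier in this section, $\hat{\beta}_2 = 6$ and $\hat{\beta}_4 = -3$ would give a shift of +6 for non-seniors and +3 for seniors.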


Interaction that Might Affect the Slope

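$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_{1i} + \hat{\beta}_2 \cdot \mathit{Dmale}_i \cdot X_{1i} + \hat{\beta}_3 \cdot \mathit{Dsenior}_i \cdot X_{1i} + \hat{\beta}_4 \cdot \mathit{Dmale}_i \cdot \mathit{Dsenior}_i \cdot X_{1i} + \hat{e}_i$$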


In the equation above, each dummy variable is allowed to have its own effect on the slope (the $\hat{\beta}_2 \cdot \mathit{Dmale}_i \cdot X_{1i}$ and $\hat{\beta}_3 \cdot \mathit{Dsenior}_i \cdot X_{1i}$ terms), and then the $\hat{\beta}_4 \cdot \mathit{Dmale}_i \cdot \mathit{Dsenior}_i \cdot X_{1i}$ term adds the interaction effect.

Interaction that Might Affect both the Intercept and the Slope

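$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot X_{1i} + \hat{\beta}_2 \cdot \mathit{Dmale}_i + \hat{\beta}_3 \cdot \mathit{Dsenior}_i + \hat{\beta}_4 \cdot \mathit{Dmale}_i \cdot \mathit{Dsenior}_i + \hat{\beta}_5 \cdot \mathit{Dmale}_i \cdot X_{1i} + \hat{\beta}_6 \cdot \mathit{Dsenior}_i \cdot X_{1i} + \hat{\beta}_7 \cdot \mathit{Dmale}_i \cdot \mathit{Dsenior}_i \cdot X_{1i} + \hat{e}_i$$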

In the equation above, each dummy variable is allowed to have its own effect on both the intercept and the slope, and then the $\hat{\beta}_4 \cdot \mathit{Dmale}_i \cdot \mathit{Dsenior}_i$ and $\hat{\beta}_7 \cdot \mathit{Dmale}_i \cdot \mathit{Dsenior}_i \cdot X_{1i}$ terms add the interaction effects, the first for an interaction effect involving the intercept, and the second for an interaction effect involving the slope.
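With eight coefficients, it can help to tabulate the intercept and slope implied for each of the four (Dmale, Dsenior) groups. The Python sketch below does that arithmetic. The function name `group_line` and the eight coefficient values are hypothetical and chosen only for illustration.

```python
def group_line(b, dmale, dsenior):
    """Return (intercept, slope) implied by coefficients b[0]..b[7] of the
    full interaction equation for one (Dmale, Dsenior) combination."""
    # Intercept terms: constant, Dmale, Dsenior, and Dmale*Dsenior.
    intercept = b[0] + b[2] * dmale + b[3] * dsenior + b[4] * dmale * dsenior
    # Slope terms: X1, Dmale*X1, Dsenior*X1, and Dmale*Dsenior*X1.
    slope = b[1] + b[5] * dmale + b[6] * dsenior + b[7] * dmale * dsenior
    return intercept, slope


# Hypothetical estimates b0..b7, chosen only for illustration.
b = [50.0, -4.0, 6.0, 2.0, -3.0, 1.5, 0.5, -1.0]

lines = {(m, s): group_line(b, m, s) for m in (0, 1) for s in (0, 1)}
for (m, s), (intercept, slope) in lines.items():
    print(f"Dmale={m} Dsenior={s}: intercept={intercept}, slope={slope}")
```

The baseline group (Dmale = 0, Dsenior = 0) uses only b[0] and b[1]. The group with both dummies equal to one picks up every coefficient, including the two interaction terms b[4] and b[7].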

Dummy Variables in SAS

Consider again the Xgender variable described earlier. Suppose Xgender is a variable in our dataset, and suppose we would like to create a dummy variable Dmale to represent Xgender in a regression equation. In SAS, in the DATA STEP portion of the SAS program, we would include the following commands to create the Dmale variable from the Xgender variable:

    if Xgender = 'male' then Dmale = 1; else Dmale = 0;

Now consider the Xcolor variable described earlier, and suppose it, too, is in our dataset, and we would like to create dummy variables Dred, Dgreen, Dyellow, and Dbrown from Xcolor. In SAS, in the DATA STEP portion of the SAS program, we would include the following commands to create the Dred, Dgreen, Dyellow, and Dbrown variables from the Xcolor variable:

    if Xcolor = 'red' then Dred = 1; else Dred = 0;
    if Xcolor = 'green' then Dgreen = 1; else Dgreen = 0;
    if Xcolor = 'yellow' then Dyellow = 1; else Dyellow = 0;
    if Xcolor = 'brown' then Dbrown = 1; else Dbrown = 0;

If the original categorical variable uses numerical labels, like a product number, then you don’t need the single quotation marks in the commands above, because numbers are numerical data, not text data. (Recall that you only need quotation marks in SAS when you are referring to text data.)
