5 Regression with qualitative explanatory variables

(5.1) Introduction
Explanatory variables are often qualitative in nature (e.g. male versus female, full-time versus
part-time work, east versus west versus south versus north), so that some proxy must be
constructed to represent them in a regression. Dummy variables are used for this purpose. A
dummy variable is an artificial variable constructed such that it takes the value 1 whenever the
qualitative phenomenon it represents occurs, and is zero otherwise. Once created, these
proxies, or dummies as they are often called, are used in regression analyses just like other
explanatory variables, yielding standard OLS results. The econometric literature abounds with
examples of their use. For example:
(1) The demand for a specific good, e.g. tobacco, may vary between urban and rural
districts.
(2) The demand for labour from an industry may show seasonal variations.
(3) The wages paid to men and women may reveal variations which are not explained
by variables like education, experience, skill, age etc.
The exposition below is phrased in terms of examples designed to illustrate the roles dummy
variables can play, clarify the interpretation of their coefficients, and give insight into how their
coefficients are estimated in a regression.
Consider a data set on the incomes of teachers, nurses and doctors exhibited in figure (5.1) (the
data have been ordered so as to group observations by profession). Suppose it is
postulated that an individual’s income depends solely on her/his profession, a qualitative
variable. Thus, we specify the regression:
(5.1.1) $Y = \beta_T D_T + \beta_N D_N + \beta_D D_D + \varepsilon$
where $D_T$ is a dummy variable taking the value one if the observation in question is a
teacher, and zero otherwise. $D_N$ is a dummy variable equal to one if the observation in
question is a nurse, and zero otherwise. Finally, $D_D$ is the dummy variable taking the value
one if the observation in question is a doctor, and zero otherwise. Notice that the equation
states that an individual’s income is given by the coefficient of her/his related dummy variable
plus a disturbance term. For a nurse the income is given by $Y = \beta_N + \varepsilon$, since for nurses
$D_T = D_D = 0$ and $D_N = 1$. Since the expected value of $\varepsilon$ is zero, we simply have that
$\beta_N = \mathrm{E}(Y \mid D_T = 0, D_N = 1, D_D = 0)$, implying that $\beta_N$ is equal to the mean income of nurses.
The interpretations of $\beta_T$ and $\beta_D$ are similar, that is $\beta_T = \mathrm{E}(Y \mid D_T = 1, D_N = 0, D_D = 0)$ and
$\beta_D = \mathrm{E}(Y \mid D_T = 0, D_N = 0, D_D = 1)$.
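The claim that each coefficient in (5.1.1) equals the mean income of its profession can be checked numerically. The sketch below is only an illustration: the income figures are made up (they are not the data of figure (5.1)), and NumPy's least-squares routine stands in for whatever regression software is actually used.

```python
import numpy as np

# Made-up incomes, purely for illustration (not the data of figure 5.1).
income = np.array([30.0, 32.0, 34.0, 28.0, 30.0, 60.0, 70.0, 80.0])
profession = np.array(["teacher"] * 3 + ["nurse"] * 2 + ["doctor"] * 3)

# One dummy per profession and no intercept column, as in (5.1.1).
D_T = (profession == "teacher").astype(float)
D_N = (profession == "nurse").astype(float)
D_D = (profession == "doctor").astype(float)
X = np.column_stack([D_T, D_N, D_D])

# OLS estimates of (beta_T, beta_N, beta_D).
beta_hat = np.linalg.lstsq(X, income, rcond=None)[0]

# Sample mean income within each profession, for comparison.
group_means = np.array([income[profession == p].mean()
                        for p in ("teacher", "nurse", "doctor")])

print(beta_hat)
print(group_means)
```

With these assumed numbers the two printed vectors coincide, which is the sample counterpart of the conditional expectations stated above.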
(5.2) Interpretation
The specification of equation (5.1.1) does not contain an intercept term. If it did, perfect multicollinearity would result. For instance, if we include an intercept in the regression equation
(5.1.1) we will have:
(5.2.1) $Y = \beta_0 B + \beta_T D_T + \beta_N D_N + \beta_D D_D + \varepsilon$
where $B$ is a variable equal to one for all observations. We have written $\beta_0 = \beta_0 B$ for the
intercept to emphasize that the constant term in a regression equation can be thought of as
multiplied by a variable equal to one for all observations. But for any observation we will also
have $D_T + D_N + D_D = 1 = B$, which explains the occurrence of perfect multi-collinearity in the
regression equation (5.2.1). When there is perfect multi-collinearity between explanatory
variables in a regression equation, the regression cannot be run (OLS breaks down). But
don’t despair, this problem is easily overcome. In order to include an intercept term in a
regression with dummy variables we can simply omit one of the dummies. It is easy to see
that this trick will get around the multi-collinearity problem. If we drop $D_T$, for example,
from the regression (5.2.1), the re-specified version of this equation will look like:
(5.2.2) $Y = \beta_0 + \beta_N D_N + \beta_D D_D + \varepsilon$
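One way to see the perfect multi-collinearity in (5.2.1) is to look at the rank of the matrix of explanatory variables. The following sketch uses a hypothetical sample of three teachers, two nurses and three doctors. The names `X_trap` and `X_ok` are illustrative only.

```python
import numpy as np

# Hypothetical sample: 3 teachers, 2 nurses, 3 doctors.
profession = np.array(["teacher"] * 3 + ["nurse"] * 2 + ["doctor"] * 3)

B = np.ones(len(profession))  # the "variable" behind the intercept
D_T = (profession == "teacher").astype(float)
D_N = (profession == "nurse").astype(float)
D_D = (profession == "doctor").astype(float)

# The three dummies sum to B for every observation.
dummies_sum_to_B = np.allclose(D_T + D_N + D_D, B)

# (5.2.1): intercept plus all three dummies -> 4 columns.
X_trap = np.column_stack([B, D_T, D_N, D_D])
# (5.2.2): intercept plus two dummies -> 3 columns.
X_ok = np.column_stack([B, D_N, D_D])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(dummies_sum_to_B)
print(X_trap.shape[1], rank_trap)  # columns versus rank for (5.2.1)
print(X_ok.shape[1], rank_ok)      # columns versus rank for (5.2.2)
```

In (5.2.1) the rank falls short of the number of columns, so the normal equations have no unique solution. Dropping one dummy restores full column rank.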
The specification (5.2.2), including an intercept, is the usual specification of regression
equations, also when dummy variables are used to account for qualitative variables. In the
present case, for a teacher, $D_N$ and $D_D$ are zero, so that a teacher’s expected income, i.e.
$\mathrm{E}(Y \mid D_N = 0, D_D = 0) = \beta_0$, is simply given by the intercept $\beta_0$. Thus the estimate of the
intercept is an estimate of the teachers’ mean income. A nurse’s expected income is by (5.2.2)
given as $\mathrm{E}(Y \mid D_N = 1, D_D = 0) = \beta_0 + \beta_N$. Thus an estimate of $\beta_N$ is the difference between
the nurses’ average income and the teachers’ average income. Similarly, the estimate of $\beta_D$ is
the difference between doctors’ average income and teachers’ average income since again
$\mathrm{E}(Y \mid D_N = 0, D_D = 1) = \beta_0 + \beta_D$. We should notice that when the specification is changed from
equation (5.1.1) to equation (5.2.2), the interpretation of the dummy variable coefficients
changes quite dramatically. To be precise, marking the coefficients of (5.1.1) with the superscript
(5.1.1) on the left-hand side and writing the coefficients of (5.2.2) on the right-hand side, we note that:
$\beta_T^{(5.1.1)} = \beta_0$, $\beta_N^{(5.1.1)} = \beta_0 + \beta_N$, $\beta_D^{(5.1.1)} = \beta_0 + \beta_D$.
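The relation between the two sets of coefficients can be illustrated with a small numerical sketch. The incomes are again made up for the purpose, and teachers are taken as the omitted (base) category as in (5.2.2).

```python
import numpy as np

# Made-up incomes, purely for illustration.
income = np.array([30.0, 32.0, 34.0, 28.0, 30.0, 60.0, 70.0, 80.0])
profession = np.array(["teacher"] * 3 + ["nurse"] * 2 + ["doctor"] * 3)

D_N = (profession == "nurse").astype(float)
D_D = (profession == "doctor").astype(float)

# (5.2.2): intercept, nurse dummy, doctor dummy; teachers are the base.
X = np.column_stack([np.ones(len(income)), D_N, D_D])
b0, bN, bD = np.linalg.lstsq(X, income, rcond=None)[0]

# Sample means per profession.
mean_T = income[profession == "teacher"].mean()
mean_N = income[profession == "nurse"].mean()
mean_D = income[profession == "doctor"].mean()

print(b0, mean_T)            # intercept versus teachers' mean
print(bN, mean_N - mean_T)   # nurse coefficient versus difference in means
print(bD, mean_D - mean_T)   # doctor coefficient versus difference in means
```

Each printed pair agrees: the intercept reproduces the base category's mean and each dummy coefficient reproduces a difference from that mean.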
With no intercept, the dummy variable coefficients reflect the expected income for the
respective professions. With an intercept included, the omitted category (profession) becomes
a base or benchmark to which the others are compared. The dummy variable coefficients for
the remaining categories measure the extent to which they differ from the base category. In
the example above the teachers are the base category. Thus, the coefficient $\beta_N$ shows the
difference between the mean income of a nurse and the mean income of a teacher, and the
coefficient $\beta_D$ shows the difference between the mean income of a doctor and the mean
income of a teacher. Most research workers, therefore, find the regression with an intercept
term more convenient and natural, since this equation allows them to address directly the
question in which they usually have the most interest, namely whether or not the proposed
categorization makes a difference and, if so, by how much. If the categorization matters,
the size of the difference is measured directly by the coefficients of the dummy variables. Testing
whether or not the categorization is relevant can be done by conducting a t-test of a dummy
variable coefficient. For instance, we could test the null hypothesis
$H_0: \beta_N = 0$ against $H_1: \beta_N \neq 0$. More generally we use an F-test to test a joint null
hypothesis on the dummy variable coefficients.
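As a rough sketch of how these tests are computed, the code below forms the t-statistic for the nurse coefficient and the F-statistic for the joint null hypothesis that both dummy coefficients in (5.2.2) are zero. The data are made up, the helper `ols` is a hypothetical convenience function, and the comparison with critical values from the t and F tables is left out.

```python
import numpy as np

# Made-up incomes, purely for illustration.
income = np.array([30.0, 32.0, 34.0, 28.0, 30.0, 60.0, 70.0, 80.0])
profession = np.array(["teacher"] * 3 + ["nurse"] * 2 + ["doctor"] * 3)
D_N = (profession == "nurse").astype(float)
D_D = (profession == "doctor").astype(float)
n = len(income)


def ols(X, y):
    """Return OLS coefficients and the residual sum of squares."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return b, float(resid @ resid)


# Unrestricted model (5.2.2) and restricted model (intercept only).
X_u = np.column_stack([np.ones(n), D_N, D_D])
b_u, rss_u = ols(X_u, income)
_, rss_r = ols(np.ones((n, 1)), income)

k = X_u.shape[1]       # number of estimated coefficients
q = 2                  # number of restrictions under the joint null
s2 = rss_u / (n - k)   # estimate of the disturbance variance

# t-statistic for H0: beta_N = 0.
cov = s2 * np.linalg.inv(X_u.T @ X_u)
t_N = b_u[1] / np.sqrt(cov[1, 1])

# F-statistic for H0: beta_N = beta_D = 0.
F = ((rss_r - rss_u) / q) / s2

print(t_N, F)
```

Under the usual assumptions, `t_N` is compared with a t distribution with n - k degrees of freedom and `F` with an F distribution with (q, n - k) degrees of freedom.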
(5.3) Adding another qualitative variable.
Above we have seen how to account for the qualitative variable profession, consisting of three
categories, in a regression analysis. Let us now suppose that gender may also have a role to
play in determining income. Gender is a qualitative variable consisting of two categories:
females and males. Following the guidelines set out above we specify the regression:
(5.3.1) $Y = \beta_0 + \beta_N D_N + \beta_D D_D + \beta_F D_F + \varepsilon$
Note that although we have two categories for gender, only one dummy variable ($D_F$) has
been included in the regression equation. The variable $D_F$ takes the value one if the observation
relates to a woman and zero otherwise. The dummy variable $D_M$ for men has been omitted
from the regression equation. If $D_M$ had been included we would again run into the problem
of perfect multi-collinearity. This means that in the present specification male teachers will be
the base category. The specification (5.3.1) implies that the difference in income between
males and females will be the same for all professions. This seems to be very restrictive and
should be relaxed. A simple relaxation is to include interaction terms in the regression
function. For example, we can augment specification (5.3.1) to:
(5.3.2) $Y = \beta_0 + \beta_N D_N + \beta_{FN} (D_F D_N) + \beta_D D_D + \beta_{FD} (D_F D_D) + \beta_F D_F + \varepsilon$
The expected income of a female nurse is given by:
(5.3.3) $\mathrm{E}(Y \mid D_N = 1, D_D = 0, D_F = 1) = \beta_0 + \beta_N + \beta_{FN} + \beta_F$
This can be compared with the corresponding relation for a male nurse:
(5.3.4) $\mathrm{E}(Y \mid D_N = 1, D_D = 0, D_F = 0) = \beta_0 + \beta_N$
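Subtracting (5.3.4) from (5.3.3), and repeating the same step for the other two professions, gives the female-male income difference implied by (5.3.2) for each profession. The short derivation below is a sketch that uses only the coefficients already defined in (5.3.2).

```latex
\begin{align*}
\text{teachers:}\quad
  & \mathrm{E}(Y \mid D_N = 0, D_D = 0, D_F = 1) - \mathrm{E}(Y \mid D_N = 0, D_D = 0, D_F = 0)
    = \beta_F \\
\text{nurses:}\quad
  & \mathrm{E}(Y \mid D_N = 1, D_D = 0, D_F = 1) - \mathrm{E}(Y \mid D_N = 1, D_D = 0, D_F = 0)
    = \beta_F + \beta_{FN} \\
\text{doctors:}\quad
  & \mathrm{E}(Y \mid D_N = 0, D_D = 1, D_F = 1) - \mathrm{E}(Y \mid D_N = 0, D_D = 1, D_F = 0)
    = \beta_F + \beta_{FD}
\end{align*}
```

Setting $\beta_{FN} = \beta_{FD} = 0$ returns specification (5.3.1), in which the difference equals $\beta_F$ in every profession.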
The lesson we learn from this section is that dummy variables are a flexible instrument to
account for qualitative variables in regression analysis.
(5.4) Interacting with quantitative variables.
The foregoing examples are somewhat unrealistic in that they are regressions in which all the
explanatory variables are dummy variables. In general, however, quantitative variables
determine the dependent variable as well as qualitative variables. For example, income will
normally also depend on years of experience, $E$, so that a more reliable specification might
be:
(5.4.1) $Y = \beta_0 + \beta_N D_N + \beta_D D_D + \beta_E E + \varepsilon$
In this model the coefficient $\beta_N$ must be interpreted as reflecting the difference between
nurses’ and teachers’ expected incomes, taking account of years of experience (i.e.
assuming equal years of experience).
Specification (5.4.1) is in essence a model in which income is expressed as a linear function
of experience, with a different intercept term for each profession. In a graph of income against
experience, this would be reflected by three different parallel lines, one for each profession.
Perhaps the most common use of dummy variables is to effect an intercept shift of this
nature. However, in many applications it may also be the case that the slope coefficient $\beta_E$
varies between the professions, either in addition to or in place of different intercepts.
We handle this case by adding special dummies to account for the different slopes. Thus,
equation (5.4.1) now reads:
(5.4.2) $Y = \beta_0 + \beta_N D_N + \beta_{EN} (D_N E) + \beta_D D_D + \beta_{ED} (D_D E) + \beta_E E + \varepsilon$
Here $(D_N E)$ and $(D_D E)$ are variables formed as the product of the variables indicated.
The mean income for a teacher is by (5.4.2) given by:
(5.4.3) $\mathrm{E}(Y \mid D_N = 0, D_D = 0) = \beta_0 + \beta_E E$
Similarly, the mean incomes for nurses and doctors are given by:
(5.4.4) $\mathrm{E}(Y \mid D_N = 1, D_D = 0) = \beta_0 + \beta_N + (\beta_{EN} + \beta_E) E$
(5.4.5) $\mathrm{E}(Y \mid D_N = 0, D_D = 1) = \beta_0 + \beta_D + (\beta_{ED} + \beta_E) E$
(note that $\mathrm{E}(Y \mid \ldots)$ on the left-hand side of these equations denotes the expectation operator,
while the $E$ on the right-hand side denotes the variable experience). Comparing teachers with
nurses shows that $\beta_N$ is the difference in the two professions’ intercepts, and that $\beta_{EN}$ is the
difference in their slope coefficients. Thus, this special “product” dummy variable $(D_N E)$
allows the slope coefficient to change from one group of data to another and thereby captures a
different kind of interaction effect.
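The construction of the product variables can be sketched in a few lines. The example assumes that the regression contains the common experience term with coefficient beta_E, as the expected values (5.4.3) to (5.4.5) imply. The income lines are hypothetical and noise-free, so that the coefficients are recovered exactly.

```python
import numpy as np

# Hypothetical sample: three experience levels for each profession.
experience = np.array([1.0, 5.0, 10.0] * 3)
profession = np.array(["teacher"] * 3 + ["nurse"] * 3 + ["doctor"] * 3)
D_N = (profession == "nurse").astype(float)
D_D = (profession == "doctor").astype(float)

# Assumed noise-free income lines (made up for illustration):
#   teachers: 20 + 1.0 * E, nurses: 18 + 0.5 * E, doctors: 40 + 3.0 * E
income = np.where(D_N == 1, 18.0 + 0.5 * experience,
                  np.where(D_D == 1, 40.0 + 3.0 * experience,
                           20.0 + 1.0 * experience))

# Regressors: intercept, D_N, D_D, E and the products (D_N E), (D_D E).
X = np.column_stack([np.ones(len(income)), D_N, D_D, experience,
                     D_N * experience, D_D * experience])
b0, bN, bD, bE, bEN, bED = np.linalg.lstsq(X, income, rcond=None)[0]

# Implied intercept and slope for each profession, as in (5.4.3)-(5.4.5).
lines = {
    "teacher": (b0, bE),
    "nurse": (b0 + bN, bE + bEN),
    "doctor": (b0 + bD, bE + bED),
}
for name, (intercept, slope) in lines.items():
    print(name, intercept, slope)
```

The coefficients on the dummies and on the products are differences relative to the teachers' line, so each profession's own intercept and slope are obtained by adding them to the base values.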
Regression (5.4.2) is specified in such a way that each profession has its own intercept and its
own slope coefficient. A graph of income versus experience will show three lines which, of
course, need not be parallel. Because of this there will be no difference between the coefficient estimates
resulting from running this regression and the estimates obtained by running three separate
regressions, each using just the data for one particular profession. Using dummy variables in
this case is of little value. The dummy variable technique is of value whenever restrictions of
some kind are imposed on the model. The regression equation (5.4.1) reflects such a
restriction. The slope coefficient $\beta_E$ is assumed to be the same for all professions while the
intercepts are allowed to vary between the professions. By running (5.4.1) as a single
regression, this restriction is imposed and more efficient estimates of all parameters result.
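The equivalence between the fully interacted regression and three separate regressions can be verified on any data set. The sketch below uses made-up numbers and assumes the interacted regression contains an intercept, both profession dummies, experience, and both product variables. It also fits the restricted specification (5.4.1) for comparison, without attempting to demonstrate the efficiency gain.

```python
import numpy as np

# Made-up data, purely for illustration.
experience = np.array([1.0, 4.0, 9.0, 2.0, 6.0, 8.0, 3.0, 5.0, 10.0])
income = np.array([21.0, 25.0, 28.0, 19.0, 22.0, 22.0, 50.0, 54.0, 71.0])
profession = np.array(["teacher"] * 3 + ["nurse"] * 3 + ["doctor"] * 3)

D_N = (profession == "nurse").astype(float)
D_D = (profession == "doctor").astype(float)
ones = np.ones_like(experience)

# Single regression with profession-specific intercepts and slopes.
X_full = np.column_stack([ones, D_N, D_D, experience,
                          D_N * experience, D_D * experience])
b0, bN, bD, bE, bEN, bED = np.linalg.lstsq(X_full, income, rcond=None)[0]
pooled = {
    "teacher": (b0, bE),
    "nurse": (b0 + bN, bE + bEN),
    "doctor": (b0 + bD, bE + bED),
}

# Three separate regressions, one per profession.
separate = {}
for p in ("teacher", "nurse", "doctor"):
    mask = profession == p
    X_p = np.column_stack([ones[mask], experience[mask]])
    intercept, slope = np.linalg.lstsq(X_p, income[mask], rcond=None)[0]
    separate[p] = (intercept, slope)

for p in pooled:
    print(p, pooled[p], separate[p])

# Restricted specification (5.4.1): one common slope, shifted intercepts.
X_restricted = np.column_stack([ones, D_N, D_D, experience])
b_restricted = np.linalg.lstsq(X_restricted, income, rcond=None)[0]
print(b_restricted)
```

The intercept and slope pairs from the single regression match those from the separate regressions, whereas (5.4.1) estimates four coefficients instead of six because the common slope is imposed.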