5 Regression with qualitative explanatory variables.

(5.1) Introduction

Explanatory variables are often qualitative in nature (e.g. male versus female, full-time versus part-time work, east versus west versus south versus north), so that some proxy must be constructed to represent them in a regression. Dummy variables are used for this purpose. A dummy variable is an artificial variable constructed so that it takes the value 1 whenever the qualitative phenomenon it represents occurs, and 0 otherwise. Once created, these proxies, or dummies as they are often called, are used in regression analyses just like other explanatory variables, yielding standard OLS results.

The econometric literature abounds with examples of their use. For example:

(1) The demand for a specific good, e.g. tobacco, may vary between urban and rural districts.
(2) The demand for labour from an industry may show seasonal variations.
(3) The wages paid to men and women may reveal variations which are not explained by variables like education, experience, skill, age etc.

The exposition below is phrased in terms of examples designed to illustrate the roles dummy variables can play, to clarify the interpretation of their coefficients, and to give insight into how their coefficients are estimated in a regression.

Consider a data set on the incomes of teachers, nurses and doctors exhibited in figure (5.1) (the data have been ordered so as to group the observations by profession). Suppose it is postulated that an individual's income depends solely on her/his profession, a qualitative variable. Thus, we specify the regression:

(5.1.1)  $Y = \alpha_T D_T + \alpha_N D_N + \alpha_D D_D + \varepsilon$

where $D_T$ is a dummy variable taking the value one if the observation in question is a teacher, and zero otherwise. $D_N$ is a dummy variable equal to one if the observation in question is a nurse, and zero otherwise. Finally, $D_D$ is the dummy variable taking the value one if the observation in question is a doctor, and zero otherwise.
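As an illustration, the construction of the three dummies and the no-intercept regression (5.1.1) might be sketched as follows. The income figures and the variable names are invented for the example and are not the data of figure (5.1); NumPy's least-squares routine stands in for whatever regression package is actually used.

```python
import numpy as np

# Hypothetical incomes (arbitrary units), ordered by profession.
# These numbers are made up for illustration only.
profession = np.array(["teacher"] * 3 + ["nurse"] * 3 + ["doctor"] * 3)
income = np.array([30.0, 32.0, 34.0, 26.0, 28.0, 30.0, 58.0, 60.0, 62.0])

# One dummy per category: 1 if the observation belongs to it, 0 otherwise.
D_T = (profession == "teacher").astype(float)
D_N = (profession == "nurse").astype(float)
D_D = (profession == "doctor").astype(float)

# Regression (5.1.1): three dummies and no intercept.
X = np.column_stack([D_T, D_N, D_D])
alpha_hat, *_ = np.linalg.lstsq(X, income, rcond=None)

# Sample mean income within each profession, for comparison.
group_means = np.array([income[D_T == 1].mean(),
                        income[D_N == 1].mean(),
                        income[D_D == 1].mean()])

print("estimated coefficients:", alpha_hat)
print("group mean incomes:    ", group_means)
```

In this sketch every observation belongs to exactly one profession, so the three dummy columns add up to one in each row, and the fitted coefficients can be compared directly with the within-profession averages.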
Notice that the equation states that an individual's income is given by the coefficient of her/his related dummy variable plus a disturbance term. For a nurse the income is given by $Y = \alpha_N + \varepsilon$, since for nurses $D_T = D_D = 0$ and $D_N = 1$. Since the expected value of $\varepsilon$ is zero, we simply have that $\alpha_N = E(Y \mid D_T = 0, D_N = 1, D_D = 0)$, implying that $\alpha_N$ is equal to the mean income of nurses. The interpretations of $\alpha_T$ and $\alpha_D$ are similar, that is $\alpha_T = E(Y \mid D_T = 1, D_N = 0, D_D = 0)$ and $\alpha_D = E(Y \mid D_T = 0, D_N = 0, D_D = 1)$.

(5.2) Interpretation

The specification of equation (5.1.1) does not contain an intercept term. If it did, perfect multicollinearity would result. For instance, if we include an intercept in the regression equation (5.1.1) we will have:

(5.2.1)  $Y = \beta_0 B + \beta_T D_T + \beta_N D_N + \beta_D D_D + \varepsilon$

where $B$ is a variable equal to one for all observations. We have written $\beta_0 = \beta_0 B$ for the intercept to emphasize that the constant term in a regression equation can be thought of as multiplied by a variable equal to one for all observations. But for any observation we will also have $D_T + D_N + D_D = 1 = B$, which explains the occurrence of perfect multicollinearity in the regression equation (5.2.1). When there is perfect multicollinearity between the explanatory variables in a regression equation, the regression cannot be run (OLS breaks down).

But don't despair, this problem is easily overcome. In order to include an intercept term in a regression with dummy variables we can simply omit one of the dummies. It is easy to see that this trick gets around the multicollinearity problem. If we drop $D_T$, for example, from the regression (5.2.1), the re-specified version of this equation will look like:

(5.2.2)  $Y = \beta_0 + \beta_N D_N + \beta_D D_D + \varepsilon$

The specification (5.2.2), including an intercept, is the usual specification of regression equations, also when dummy variables are used to account for qualitative variables. In the present case, for a teacher, $D_N$ and $D_D$ are zero, so that a teacher's expected income, i.e. $E(Y \mid D_N = 0, D_D = 0) = \beta_0$, is simply given by the intercept $\beta_0$.
Thus the estimate of the intercept is an estimate of the teachers' mean income. A nurse's expected income is by (5.2.2) given as $E(Y \mid D_N = 1, D_D = 0) = \beta_0 + \beta_N$. Thus an estimate of $\beta_N$ is the difference between the nurses' average income and the teachers' average income. Similarly, the estimate of $\beta_D$ is the difference between the doctors' average income and the teachers' average income, since again $E(Y \mid D_N = 0, D_D = 1) = \beta_0 + \beta_D$.

We should notice that when the specification is changed from equation (5.1.1) to equation (5.2.2), the interpretation of the dummy variable coefficients changes quite dramatically. To be precise, we note that:

$\alpha_T = \beta_0, \quad \alpha_N = \beta_0 + \beta_N, \quad \alpha_D = \beta_0 + \beta_D.$

With no intercept, the dummy variable coefficients reflect the expected income for the respective professions. With an intercept included, the omitted category (profession) becomes a base or benchmark to which the others are compared. The dummy variable coefficients for the remaining categories measure the extent to which they differ from the base category. In the example above the teachers are the base category. Thus, the coefficient $\beta_N$ shows the difference between the mean income of a nurse and the mean income of a teacher, and the coefficient $\beta_D$ shows the difference between the mean income of a doctor and the mean income of a teacher.

Most research workers therefore find the regression with an intercept term more convenient and natural, since this equation allows them to address directly the question in which they are usually most interested, namely whether or not the proposed categorization makes a difference and, if so, by how much. If the categorization matters, the size of the difference is measured directly by the coefficients of the dummy variables. Testing whether or not the categorization is relevant can be done by conducting a t-test of a dummy variable coefficient; for instance, we could test the null hypothesis $H_0: \beta_N = 0$ against $H_1: \beta_N \neq 0$.
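The rank problem in (5.2.1) and the relations between the coefficients of (5.1.1) and (5.2.2) noted above can be checked numerically. The following is only a sketch: the income figures and variable names are hypothetical, and NumPy is used as a convenient stand-in for a regression package.

```python
import numpy as np

# Hypothetical incomes (arbitrary units), ordered by profession.
profession = np.array(["teacher"] * 3 + ["nurse"] * 3 + ["doctor"] * 3)
income = np.array([30.0, 32.0, 34.0, 26.0, 28.0, 30.0, 58.0, 60.0, 62.0])

D_T = (profession == "teacher").astype(float)
D_N = (profession == "nurse").astype(float)
D_D = (profession == "doctor").astype(float)
B = np.ones_like(income)  # the "variable" multiplying the intercept

# (5.2.1): intercept plus all three dummies. Because D_T + D_N + D_D = B,
# the four columns are linearly dependent.
X_full = np.column_stack([B, D_T, D_N, D_D])
rank_full = np.linalg.matrix_rank(X_full)

# (5.2.2): drop D_T, so that teachers become the base category.
X_base = np.column_stack([B, D_N, D_D])
rank_base = np.linalg.matrix_rank(X_base)
beta_0, beta_N, beta_D = np.linalg.lstsq(X_base, income, rcond=None)[0]

# (5.1.1): no intercept, all three dummies.
X_noint = np.column_stack([D_T, D_N, D_D])
alpha_T, alpha_N, alpha_D = np.linalg.lstsq(X_noint, income, rcond=None)[0]

print("rank with intercept and all dummies:", rank_full, "of", X_full.shape[1])
print("rank after dropping D_T:            ", rank_base, "of", X_base.shape[1])
print("alpha_T = beta_0:          ", np.isclose(alpha_T, beta_0))
print("alpha_N = beta_0 + beta_N: ", np.isclose(alpha_N, beta_0 + beta_N))
print("alpha_D = beta_0 + beta_D: ", np.isclose(alpha_D, beta_0 + beta_D))
```

The design matrix of (5.2.1) has fewer independent columns than coefficients, which is the numerical face of perfect multicollinearity; once one dummy is dropped the matrix has full column rank and the two parameterizations describe the same fitted group means.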
More generally, we use an F-test to test a joint null hypothesis on the dummy variable coefficients.

(5.3) Adding another qualitative variable.

Above we have seen how to account for the qualitative variable profession, consisting of three categories, in a regression analysis. Let us now suppose that gender may also have a role to play in determining income. Gender is a qualitative variable consisting of two categories: females and males. Following the guidelines set out above we specify the regression:

(5.3.1)  $Y = \beta_0 + \beta_N D_N + \beta_D D_D + \beta_F D_F + \varepsilon$

Note that although we have two categories for gender, only one dummy variable ($D_F$) has been included in the regression equation. The variable $D_F$ takes the value one if the observation relates to a woman and zero otherwise. The dummy variable $D_M$ for men has been omitted from the regression equation. If $D_M$ had been included we would again run into the problem of perfect multicollinearity. This means that in the present specification male teachers are the base category.

The specification (5.3.1) implies that the difference in income between males and females is the same for all professions. This seems very restrictive and should be relaxed. A simple relaxation is to include interaction terms in the regression function. For example, we can augment specification (5.3.1) to:

(5.3.2)  $Y = \beta_0 + \beta_N D_N + \beta_{FN} (D_F D_N) + \beta_D D_D + \beta_{FD} (D_F D_D) + \beta_F D_F + \varepsilon$

The expected income of a female nurse is given by:

(5.3.3)  $E(Y \mid D_N = 1, D_D = 0, D_F = 1) = \beta_0 + \beta_N + \beta_{FN} + \beta_F$

This relation can be compared to the corresponding relation for a male nurse, that is:

(5.3.4)  $E(Y \mid D_N = 1, D_D = 0, D_F = 0) = \beta_0 + \beta_N$

The lesson we learn from this section is that dummy variables are a flexible instrument to account for qualitative variables in regression analysis.

(5.4) Interacting with quantitative variables.

The foregoing examples are somewhat unrealistic in that they are regressions in which all the explanatory variables are dummy variables.
In general, however, quantitative variables determine the dependent variable as well as qualitative variables. For example, income will normally also depend on years of experience, $E$, so that a more realistic specification might be:

(5.4.1)  $Y = \beta_0 + \beta_N D_N + \beta_D D_D + \beta_E E + \varepsilon$

In this model the coefficient $\beta_N$ must be interpreted as reflecting the difference between nurses' and teachers' expected incomes, taking account of years of experience (i.e. assuming equal years of experience). Specification (5.4.1) is in essence a model in which income is expressed as a linear function of experience, with a different intercept term for each profession. In a graph of income against experience, this would be reflected by three different parallel lines, one for each profession. Perhaps the most common use of dummy variables is to effect an intercept shift of this nature.

However, in many applications it may also be the case that the slope coefficient $\beta_E$ varies between the professions, either in addition to or in place of different intercepts. We handle this case by adding special dummies to account for the different slopes. Thus, equation (5.4.1) now reads:

(5.4.2)  $Y = \beta_0 + \beta_N D_N + \beta_{EN} (D_N E) + \beta_D D_D + \beta_{ED} (D_D E) + \beta_E E + \varepsilon$

Here $(D_N E)$ and $(D_D E)$ are variables formed as the product of the variables indicated. The mean income for a teacher is by (5.4.2) given by:

(5.4.3)  $E(Y \mid D_N = 0, D_D = 0) = \beta_0 + \beta_E E$

Similarly, the mean incomes for nurses and doctors are given by:

(5.4.4)  $E(Y \mid D_N = 1, D_D = 0) = \beta_0 + \beta_N + (\beta_{EN} + \beta_E) E$

(5.4.5)  $E(Y \mid D_N = 0, D_D = 1) = \beta_0 + \beta_D + (\beta_{ED} + \beta_E) E$

(Note that $E(Y \mid \ldots)$ on the left-hand side of these equations denotes the expectation operator, while the $E$ on the right-hand side denotes the variable experience.) Comparing teachers to nurses shows that $\beta_N$ is the difference in the two professions' intercepts, and that $\beta_{EN}$ is the difference in their slope coefficients.
Thus, this special "product" dummy variable $(D_N E)$ allows the slope coefficient to change from one group of data to another and thereby captures a different kind of interaction effect.

Regression (5.4.2) is specified in such a way that each profession has its own intercept and its own slope coefficient. A graph of income against experience will show three lines which, of course, need not be parallel. Because of this there will be no difference between the coefficient estimates resulting from running this regression and the estimates obtained by running three separate regressions, each using just the data for one particular profession. Using dummy variables in this case is of little value.

The dummy variable technique is of value whenever restrictions of some kind are imposed on the model. The regression equation (5.4.1) reflects such a restriction. The slope coefficient $\beta_E$ is assumed to be the same for all professions, while the intercepts are allowed to vary between the professions. By running (5.4.1) as a single regression, this restriction is imposed and more efficient estimates of all parameters result.
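The claim that the fully interacted regression (5.4.2) reproduces the three profession-by-profession regressions can be illustrated with simulated data. This is a sketch only: the sample size, the "true" parameter values and all variable names are assumptions made for the example, and NumPy's least-squares routine is used in place of a regression package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # hypothetical number of observations per profession

# Observations ordered as teachers, nurses, doctors.
D_N = np.repeat([0.0, 1.0, 0.0], n)
D_D = np.repeat([0.0, 0.0, 1.0], n)
experience = rng.uniform(0.0, 30.0, size=3 * n)

# Simulated incomes with profession-specific intercepts and slopes
# (parameter values are invented for illustration).
income = (30.0 + 0.5 * experience
          + D_N * (-4.0 + 0.1 * experience)
          + D_D * (20.0 + 1.0 * experience)
          + rng.normal(0.0, 2.0, size=3 * n))

# Single regression (5.4.2) with intercept dummies and product dummies.
X = np.column_stack([np.ones(3 * n), D_N, D_N * experience,
                     D_D, D_D * experience, experience])
b0, bN, bEN, bD, bED, bE = np.linalg.lstsq(X, income, rcond=None)[0]

def fit_line(mask):
    """Regress income on a constant and experience within one profession."""
    Z = np.column_stack([np.ones(mask.sum()), experience[mask]])
    return np.linalg.lstsq(Z, income[mask], rcond=None)[0]

teacher_fit = fit_line((D_N == 0) & (D_D == 0))
nurse_fit = fit_line(D_N == 1)
doctor_fit = fit_line(D_D == 1)

# Intercepts and slopes implied by (5.4.3) to (5.4.5).
print(np.allclose(teacher_fit, [b0, bE]))
print(np.allclose(nurse_fit, [b0 + bN, bE + bEN]))
print(np.allclose(doctor_fit, [b0 + bD, bE + bED]))
```

Dropping the two product columns from `X` gives the restricted model (5.4.1), whose common slope is then estimated from all three professions together.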