ECON 309 Lecture 9: Multiple Regression

I. The Functional Form of the Relationship

Now we'll be looking at relationships in which there is more than one explanatory variable. The usual hypothesized relationship is this:

y = α + β₁x₁ + β₂x₂ + β₃x₃ + ... + ε

In order to have all the coefficients look the same, sometimes we use this instead:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + ε

But the idea is the same regardless: we are trying to estimate an intercept term plus the coefficient on each explanatory variable. As with simple regression, the functional form here assumes a linear relationship, and the same assumptions about the error term apply. The best-fit line, which is our predicted relationship, is:

ŷ = a + b₁x₁ + b₂x₂ + b₃x₃ + ...

There is a formula for the OLS estimates of the parameters, but we won't learn it here. It is similar to the one for simple regression, but more complicated.

II. The Regression Output

The regression output can be interpreted in the same way as with simple regression. We can perform hypothesis tests and construct confidence intervals in the same way.

As with simple regression, we can look at the R-squared to find out how much of the variation in the dependent variable is explained by variation in the independent variables. However, you should note that the R-squared will always increase when we add more explanatory variables, so you should not assume that a higher R-squared means the explanatory variables you've added are necessarily important ones. The adjusted R-squared, also given in the regression output, takes this effect into account. Even so, you need to look at the significance of each variable to see its effect.

There is another hypothesis test you might be interested in. What if you want to know whether the whole predicted relationship (not just the effect of one variable) is significant? The null hypothesis is that all the coefficients are equal to zero; the alternative is that at least one of them is not equal to zero. The statistic to consider is called the F-statistic. We won't discuss how this statistic is constructed, but it works much the same as a t-statistic: the larger its value, the greater the significance. The "Significance F" in the regression output is the F-statistic's equivalent of the t-statistic's p-value. It tells you the smallest Type I error probability (alpha) that would still allow you to reject the null hypothesis.

Example: The publicexpend data set gives the amount of public expenditures per capita in each of the lower-48 states, as well as various possible explanatory variables. Run a regression of expenditures on the economic ability index, metropolitan population percentage, population growth rate, youth population percentage, and elderly population percentage.
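If you want to reproduce this output in code rather than in the regression tool we use in class, here is a minimal sketch using Python's statsmodels. The file name and column names (publicexpend.csv, EX, ECAB, GROW, YOUNG, OLD) are assumptions about how your copy of the data is labeled, so adjust them to match; only MET and WEST are named in these notes.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Load the publicexpend data (file and column names assumed here).
    df = pd.read_csv("publicexpend.csv")

    # Regress per-capita expenditures (EX) on the economic ability index (ECAB),
    # metropolitan population percentage (MET), population growth rate (GROW),
    # youth population percentage (YOUNG), and elderly percentage (OLD).
    model = smf.ols("EX ~ ECAB + MET + GROW + YOUNG + OLD", data=df).fit()

    print(model.summary())               # coefficients, t-statistics, p-values
    print(model.rsquared)                # R-squared
    print(model.rsquared_adj)            # adjusted R-squared
    print(model.fvalue, model.f_pvalue)  # F-statistic and its "Significance F"

The residuals from this fit (model.resid) are what we inspect in the next section.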
III. Dummy Variables

Some data – what we have called nominal data – does not take numerical form. Yet we might think such variables have an important effect on the dependent variable. How can we take these into account?

We use what is called a "dummy" variable. The dummy is set equal to 1 if the observation has a particular characteristic, and it's set equal to 0 otherwise. For example, if your observations are people, you could have a FEMALE dummy that equals 1 for a female observation and 0 for a male observation. What you're essentially doing is treating the 0-type as the default. The regression results tell you the relationship for that default group. The coefficient on the dummy tells you the effect that membership in the dummied group has on the dependent variable. You can think of this as the change in the intercept for that group. That is, the coefficient on "intercept" is the vertical intercept for the default group; to get the intercept for the other group, add the coefficient on the dummy to the "intercept" coefficient.

Example: The publicexpend data set again. Look at the residuals from the previous regression; you might notice that Western states seem more likely to have positive residuals. So we create a dummy variable, WEST, that codes for Western states. Then we run the regression with this as an additional variable. The coefficient on WEST is 35.47, meaning that Western states tend to spend $35.47 more per capita than non-Western states (controlling for the other variables we're looking at).

You can have multiple dummies. For instance, you might deal with race by coding for BLACK and ASIAN. If an observation is neither black nor Asian, the value of both dummies is 0.

Note 1: You must pick one group as your default. It does not matter which. But your regression will not work if you create a dummy for every group.

Note 2: For m different groups, you need m – 1 dummies. You can't do it with a single variable that takes on m different values. If, for example, you create a race dummy coded as 1 for blacks and 2 for Asians, you're effectively assuming an Asian is "twice" a black person (whatever that means).
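Continuing the sketch above, here is one way the WEST regression might look in code. It assumes the data set already has (or that you have added) a WEST column coded 1 for Western states and 0 otherwise; the commented-out line uses a hypothetical REGION column only to illustrate how a variable with m group labels gets expanded into m – 1 dummies automatically.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("publicexpend.csv")   # assumed file name, as before

    # WEST is assumed to be coded 1 for Western states and 0 otherwise.
    # Its coefficient is the shift in the intercept for Western states.
    west_model = smf.ols("EX ~ ECAB + MET + GROW + YOUNG + OLD + WEST",
                         data=df).fit()
    print(west_model.params["WEST"])

    # A column holding m group labels (a hypothetical REGION column, say)
    # wrapped in C() would generate m - 1 dummies, with one group left out
    # as the default (exactly the rule in Notes 1 and 2):
    # region_model = smf.ols("EX ~ ECAB + MET + C(REGION)", data=df).fit()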
IV. Quadratic Functional Forms

While the functional form given earlier is linear, we can actually use it to estimate some non-linear relationships. All we have to do is transform the explanatory variables appropriately. For instance, consider the quadratic function:

y = 2x² + 10x + 5

This is not linear. But what if we just think of x² as another variable? We could rewrite the above like so:

y = 2x₁ + 10x₂ + 5

where x₁ is the squared value of x and x₂ is just the unmodified value of x. Notice that this is a linear relationship. Similarly, we can do the same thing with a quadratic function whose coefficients we don't know, and use OLS to estimate those coefficients.

Example: The mileage1 data set. This data set has the miles per gallon and weight of 38 vehicles. Run a simple regression first. We get a statistically significant negative relationship. But looking at the residual and line-fit plots, we observe a U-shape, which implies a possible quadratic relationship. How can we estimate this relationship? Create a new variable that is the weight squared. Run a new regression on both weight and weight-squared. The results are significant for both coefficients, and the residual and line-fit plots don't display as much of a pattern.

(However, be careful in your interpretation and extension of the results. The coefficient on weight squared is positive, which means that for high enough weight values it could appear that weight increases miles per gallon – hard to believe. Notice that we don't actually have any values for weight in the range that would allegedly produce this result. The quadratic form is a good fit for the data range we have, but probably not outside that range.)

Example: The publicexpend data set again. Look at the residual and line-fit plots for MET. Note the U-shape. It appears that public spending is highest in states with very low and very high metropolitanization, and lowest in states with moderate metropolitanization. (Why might this be? Maybe there are economies and diseconomies of scale in dealing with metropolitan populations. Or maybe states with both urban and rural populations are less likely to reach the legislative consensus that would lead to greater public spending.) How can we estimate this relationship? We create a new variable that is MET squared and run the regression again. In the results, notice that both MET and MET-squared have significant coefficients, and the residual and line-fit plots don't display as much of a pattern.
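To estimate a quadratic form in code, you only need to add the squared term as another explanatory variable. Here is a minimal sketch for the mileage1 example; the file name (mileage1.csv) and column names (MPG, WEIGHT) are assumptions, and the same pattern works for MET and MET-squared in the publicexpend data.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed file and column names; rename to match your copy of mileage1.
    cars = pd.read_csv("mileage1.csv")

    # Simple (linear) regression of miles per gallon on weight.
    linear = smf.ols("MPG ~ WEIGHT", data=cars).fit()

    # Quadratic form: treat weight-squared as just another variable.
    # I(WEIGHT**2) tells the formula parser to square WEIGHT before fitting.
    quadratic = smf.ols("MPG ~ WEIGHT + I(WEIGHT**2)", data=cars).fit()

    print(linear.summary())
    print(quadratic.summary())

    # Compare the residuals from each fit: the U-shape in the linear
    # residuals should mostly disappear in the quadratic residuals, e.g.
    # cars.assign(resid=linear.resid).plot.scatter(x="WEIGHT", y="resid")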