ECON 309 Lecture 10: Regression Difficulties

I. Simultaneous Determination

Suppose you have a data set with price and quantity data for a particular product. How would you estimate the equation for demand? The natural thing to do is regress quantity on price. But when we do this, we get a positive estimate of the slope – very strange. And how would we estimate the equation for supply? Again, the natural thing to do is to regress quantity on price. But that can't be right, because we'll be getting exactly the same results as we did for demand. Demand and supply can't be the same curve!

[Demonstrate using agriculture.xls data set. Note that the coefficient on price is positive, which doesn't even make sense for a demand curve.]

The problem here results from what is called simultaneous determination. Price and quantity are simultaneously determined by the interaction of supply and demand. So we can't look at equilibrium data and suppose it represents just one curve; it represents the interaction of two curves. We say that price and quantity are both endogenous, which means they are not "given" but are determined within the model. Variables that are given (determined outside the model) are called exogenous.

Graphically, if we only had shifts in demand, then we would be "tracing out" the supply curve; every price-quantity pair would be a point on the supply curve. And if we only had shifts in supply, then we'd be "tracing out" the demand curve in the same way. But if both demand and supply are shifting, then we're not tracing out either curve. Instead, we get a "cloud" of points from many different (short-run) equilibria.

This problem occurs in many other contexts. Just one example: gun control may affect the amount of crime (gun control advocates believe it reduces crime, opponents say the opposite), but it may also be affected by crime (because states and cities with higher crime rates might be more inclined to pass gun control legislation). In general, we'll have a problem whenever there is (or may be) two-way causation.

How can we deal with this problem? The most common solution is to make use of an instrumental variable. An instrumental variable is an exogenous variable that only affects one of the underlying relationships. For instance, we expect that consumers' income affects demand, but not supply. Technological productivity affects supply, but not demand. When we use an instrumental variable, we're basically looking for something that allows us to track shifts in one curve that don't affect the other curve. This allows us to see the "tracing out" effect described earlier.

Suppose demand and supply are represented by the following functional forms:

Qs = α0 + α1P + εs
Qd = β0 + β1P + β2I + εd

That is, quantity demanded and quantity supplied are both linear functions of price; quantity demanded is also affected by income. These two equations are called the structural equations; they are the underlying relationships we're interested in. (I'm now using alphas to represent the intercept and coefficient on the supply curve, and betas to represent the intercept and coefficients on the demand curve. This is just a matter of convenience to keep the two equations distinct in our minds.)

If we solve the system of equations (for simplicity, ignore the error terms), we get the following equilibrium values for price and quantity:

P* = (β0 − α0)/(α1 − β1) + [β2/(α1 − β1)]I
Q* = (α1β0 − α0β1)/(α1 − β1) + [α1β2/(α1 − β1)]I

These may look complicated, but really they're just equations of lines – each has an intercept, and each has a slope on income (I).

We can simplify them by just defining some new coefficients, like so:

P* = α′ + β′I
Q* = α″ + β″I

We can estimate both of these equations: we run a regression of price on income, and then another regression of quantity on income. We will then have estimates of both β′ and β″. These are not interesting in and of themselves, but if you look back at how these parameters were defined, it turns out the ratio of the two is useful:

β″/β′ = [α1β2/(α1 − β1)] / [β2/(α1 − β1)] = α1

And that's the slope of the original supply curve, that is, the effect of price on quantity supplied. So we can use the ratio of the estimated coefficients from these two regressions to find an estimate of the slope of the supply curve.

[Demonstrate using agriculture.xls data set. Run a regression of output on income, then a regression of price on income. Then find the ratio of the coefficient estimates. A simulated version of the same procedure is sketched below.]

But what if we wanted an estimate of the slope of the demand curve, rather than the supply curve? In that case, you'd want to find an instrumental variable that tends to shift the supply curve but not the demand curve.
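If you'd like to see the whole argument in one place outside of Excel, here is a minimal Python sketch (my own illustration, not part of the agriculture.xls exercise). It simulates data from structural equations like the ones above, using made-up parameter values, and then recovers the supply slope from the ratio of the two reduced-form regressions.

```python
# Minimal simulation of the instrumental-variable / ratio idea described above.
# All parameter values and variable names are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# "True" structural parameters (made up for the example)
alpha0, alpha1 = 2.0, 0.8              # supply:  Qs = alpha0 + alpha1*P + es
beta0, beta1, beta2 = 20.0, -1.2, 0.5  # demand:  Qd = beta0 + beta1*P + beta2*I + ed

income = rng.uniform(10, 50, n)   # exogenous instrument: consumers' income
es = rng.normal(0, 3, n)          # supply shocks
ed = rng.normal(0, 3, n)          # demand shocks

# Equilibrium price and quantity implied by the structural equations
price = (beta0 - alpha0 + beta2 * income + ed - es) / (alpha1 - beta1)
quantity = alpha0 + alpha1 * price + es

# Naive regression of quantity on price: the slope is neither the supply
# slope (0.8) nor the demand slope (-1.2)
naive_slope = np.polyfit(price, quantity, 1)[0]

# Reduced-form regressions on the instrument (income)
b_q = np.polyfit(income, quantity, 1)[0]   # slope of Q on I  (beta'')
b_p = np.polyfit(income, price, 1)[0]      # slope of P on I  (beta')

print("naive OLS slope of Q on P:", round(naive_slope, 3))
print("ratio b_q / b_p (estimate of alpha1):", round(b_q / b_p, 3))
```

With these made-up numbers, the naive slope lands well away from both true slopes, while the ratio of the two reduced-form slopes should come out close to the true supply slope of 0.8, just as the algebra above says it should.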
II. Multicollinearity

Multicollinearity is a fancy term that refers to having two (or more) explanatory variables that are very similar to each other (that is, they are highly correlated). This can cause a number of problems with your regression results. The main problem is that if two variables really measure approximately the same thing, then the effect of that thing will be spread out over two different coefficients. Your coefficients are likely to be small and have weak significance.

For example, suppose you want to explain the suicide rate as a function of both welfare participation and poverty. You might think both of these things are likely to cause a greater incidence of suicide. The problem is that welfare and poverty are highly correlated; people who are poor are also likely to be on welfare. So these variables would both measure how many people are in economic distress. If you included both variables, their coefficients would probably be small and not very significant. It would be better to include just one.

[Use mileage2.xls to demonstrate the problem. A simple regression shows that weight has a significant effect on MPG. But throw in horsepower and engine size, and we get insignificant effects for all three. That's because cars with greater horsepower and larger engines also tend to weigh more. As with the earlier mileage data set, it would also be a good idea to include weight-squared as an explanatory variable to allow for a non-linear effect of weight on MPG.]

A simulated illustration of the same problem is sketched below.
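Here is a small sketch of what multicollinearity does to your estimates, using made-up data rather than mileage2.xls. The variable names (distress, poverty, welfare) and all the numbers are my own assumptions, chosen only to mimic the welfare/poverty story above; statsmodels is used just to get p-values.

```python
# Hypothetical illustration of multicollinearity with simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60

# One underlying factor ("economic distress"), measured by two noisy,
# highly correlated proxies (think poverty rate and welfare participation).
distress = rng.normal(0, 1, n)
poverty = distress + rng.normal(0, 0.1, n)
welfare = distress + rng.normal(0, 0.1, n)

# The outcome depends on the underlying factor, not on either proxy separately.
y = 5 + 2 * distress + rng.normal(0, 1, n)

# Regression with only one proxy: a large, highly significant coefficient.
one = sm.OLS(y, sm.add_constant(poverty)).fit()
print("poverty alone: coef =", round(one.params[1], 2),
      "p =", round(one.pvalues[1], 4))

# Regression with both proxies: the effect gets split across two coefficients,
# the standard errors blow up, and neither coefficient looks significant.
both = sm.OLS(y, sm.add_constant(np.column_stack([poverty, welfare]))).fit()
print("both included: coefs =", both.params[1:].round(2),
      "p-values =", both.pvalues[1:].round(3))
```

With one proxy, the estimated effect is large and clearly significant; with both proxies in the regression, roughly the same total effect is split between two small, imprecise coefficients, which is exactly the pattern described above.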
III. Heteroskedasticity and Autocorrelation

We won't be doing examples of these, but I want you to know what they are.

Heteroskedasticity occurs when the error term (ε) doesn't have the same variance over all observations. This happens most often in cross-sectional data, which is data in which observations correspond to different regions or individuals at the same moment in time. For instance, if you're using a data set where each observation is a U.S. state's homicide rate, the problem is that some states might have much greater variance in their homicide rate than others. This can happen because in a state with a low population, a single homicide could have a big effect on the homicide rate, while a single homicide would have a smaller effect in a high-population state.

Autocorrelation occurs when the errors of different observations are not independent of each other, but tend to be correlated. This happens most often in time series data, which is data in which each observation corresponds to a different point in time (e.g., a year or quarter). The problem is that there are certain trends which are likely to carry over from one period to the next. For example, a "shock" to energy supplies caused by adverse weather conditions might be expected to have a lingering effect even after the period in which it occurred.

Both heteroskedasticity and autocorrelation cause problems for your regression results; in particular, the usual standard errors and significance tests can be biased in one direction or another (and it can be hard to tell which direction). Econometricians have sophisticated methods for trying to control for them, but we won't discuss them further.

(There is also a third kind of data: panel data, which is a combination of cross-sectional and time series data. Each observation corresponds to a different region or individual and a different time period. The number of observations is the number of regions/individuals multiplied by the number of time periods.)

IV. Correlation and Causation, Once Again

Regression is a fancy enough tool that you might think it's immune from basic errors of logic and reasoning. It's not. You have to think very carefully about your underlying assumptions, or you could reach seriously misleading conclusions.

The most important thing to realize is that regression is a sophisticated way of finding a correlation between two or more variables. Unlike simple correlation, we're estimating more than one parameter (both intercept and slope). But we could still be mistaking correlation for causation; that is, we could be assuming A causes B simply because they happen to go together. Huff gives the example of the correlation between wages of ministers and prices of rum. You might assume, given this correlation, that ministers must be using their higher wages to buy more rum (thus driving up demand). But more likely, the correlation is driven by the fact that wages and prices of almost everything have risen over time because of inflation. Running a regression, instead of just looking at the correlation between the two variables, won't avoid this error of logic.

Also, you should not be misled by high R-squared values and significance levels. You can have these even if there's no actual causation. And you can have these even if your functional form (which represents your assumption about the underlying relationship) is off. We saw this when we looked at the effect of ad spending on consumer impressions, and when we looked at the effect of weight on gas mileage. In both cases, the simple linear regression resulted in a high R-squared and significance levels. But in both cases, a different functional form generated a better fit to the data. There is no substitute for looking at scatterplots to make yourself aware of possible sources of error.

[Use Anscombe's four data sets [anscombe.xls] to demonstrate, if there's time. A quick version of that demonstration is sketched below.]
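If you want to try the Anscombe demonstration yourself outside of class, here is a minimal sketch (my own, not part of the course materials). It uses the copy of Anscombe's quartet that ships with the seaborn plotting library instead of anscombe.xls; note that seaborn's load_dataset downloads the data from the internet.

```python
# Quick look at Anscombe's quartet: identical regression statistics,
# very different scatterplots.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")   # columns: dataset ('I'..'IV'), x, y

for name, grp in df.groupby("dataset"):
    slope, intercept = np.polyfit(grp["x"], grp["y"], 1)
    r2 = np.corrcoef(grp["x"], grp["y"])[0, 1] ** 2
    # All four data sets give nearly identical slopes, intercepts, and R-squared...
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.2f}")

# ...but the scatterplots tell four very different stories.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
plt.show()
```

All four data sets produce essentially the same fitted line and R-squared, yet the plots show a reasonable linear relationship, a curved relationship, and two cases where a single unusual observation drives the fit. That is exactly why there is no substitute for looking at the scatterplot.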