ECON 309
Lecture 10: Regression Difficulties
I. Simultaneous Determination
Suppose you have a data set with price and quantity data for a particular product. How
would you estimate the equation for demand? The natural thing to do is regress quantity
on price. But when we do this, we get a positive estimate of the slope – very strange.
And how would we estimate the equation for supply? Again, the natural thing to do is to
regress quantity on price. But that can’t be right, because we’ll be getting exactly the
same results as we did for demand. Demand and supply can’t be the same curve!
[Demonstrate using agriculture.xls data set. Note that the coefficient on price is positive,
which doesn’t even make sense for a demand curve.]
The problem here results from what is called simultaneous determination. Price and
quantity are simultaneously determined by the interaction of supply and demand. So we
can’t look at equilibrium data and suppose it represents just one curve; it represents the
interaction of two curves. We say that price and quantity are both endogenous, which
means they are not “given” but are determined within the model. Variables that are given
(determined outside the model) are called exogenous.
Graphically, if we only had shifts in demand, then we would be “tracing out” the supply
curve; every price-quantity pair would be a point on the supply curve. And if we only
had shifts in supply, then we’d be “tracing out” the demand curve in the same way. But
if both demand and supply are shifting, then we’re not tracing out either curve. Instead,
we get a “cloud” of points from many different (short-run) equilibria.
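To make this concrete, here is a minimal simulation in Python (invented data, not the agriculture.xls set; the curves, slopes, and shock sizes are all made up for illustration). Both curves shift randomly from period to period, and the naive regression of quantity on price recovers neither the supply slope nor the demand slope:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200

    # True structural curves: supply Qs = 2 + 1.5*P, demand Qd = 20 - 1.0*P,
    # each hit by its own random shift each period.
    supply_shock = rng.normal(0, 2, n)
    demand_shock = rng.normal(0, 2, n)

    # Each period's equilibrium: solve 2 + 1.5*P + s = 20 - 1.0*P + d for P.
    P = (20 - 2 + demand_shock - supply_shock) / (1.5 + 1.0)
    Q = 2 + 1.5 * P + supply_shock

    # Naive regression of quantity on price (slope = cov/var).
    slope = np.cov(P, Q)[0, 1] / np.var(P, ddof=1)
    print(f"estimated slope: {slope:.2f}")  # near 0.25: neither 1.5 nor -1.0

The estimated slope is a blend of the two curves, which is exactly the "cloud of points" problem.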
This problem occurs in many other contexts. Just one example: gun control may affect
the amount of crime (gun control advocates believe it reduces crime, opponents say the
opposite), but it may also be affected by crime (because states and cities with higher
crime rates might be more inclined to pass gun control legislation). In general, we’ll
have a problem whenever there is (possibly) two-way causation.
How can we deal with this problem? The most common solution is to make use of an
instrumental variable. An instrumental variable is an exogenous variable that only affects
one of the underlying relationships. For instance, we expect that consumers’ income
affects demand, but not supply. Technological productivity affects supply but not
demand. When we use an instrumental variable, we’re basically looking for something
that allows us to track shifts in one curve that don’t affect the other curve. This allows us
to see the “tracing out” effect described earlier.
Suppose demand and supply are represented by the following functional forms:
Q_s = \alpha_0 + \alpha_1 P + \varepsilon_s
Q_d = \beta_0 + \beta_1 P + \beta_2 I + \varepsilon_d
That is, quantity demanded and quantity supplied are both linear functions of price;
quantity demanded is also affected by income. These two equations are called the
structural equations; they are the underlying relationships we’re interested in. (I’m now
using alphas to represent intercept & coefficients on the supply curve, and betas to
represent intercept & coefficients on the demand curve. This is just a matter of
convenience to keep the two equations distinct in our minds.)
If we set Q_s = Q_d and solve the system of equations (for simplicity, ignore the error
terms), we get the following equilibrium values for price and quantity:
P^* = \frac{\beta_0 - \alpha_0}{\alpha_1 - \beta_1} + \frac{\beta_2}{\alpha_1 - \beta_1} I

Q^* = \alpha_0 + \frac{\alpha_1 (\beta_0 - \alpha_0)}{\alpha_1 - \beta_1} + \frac{\alpha_1 \beta_2}{\alpha_1 - \beta_1} I
These may look complicated, but really they're just equations of lines: each has an
intercept, and each has a slope on income (I). We can simplify them by defining some
new coefficients like so:
P^* = \beta_0' + \beta_1' I
Q^* = \beta_0'' + \beta_1'' I
We can estimate both of these equations: run a regression of price on income, and
then another regression of quantity on income. We will then have estimates of all four
new coefficients. These are not interesting in and of themselves, but if you look back at
how they were defined, it turns out the ratio of the two slope coefficients is useful:
1 2
   1  1

 1
2
'
 1  1
And that’s the slope of the original supply curve, that is, the effect of price on quantity
supplied. We can use the ratio of the estimated coefficients from these two regressions to
find an estimate of the slope of the supply curve.
[Demonstrate using agriculture.xls data set. Run regression of output on income, then
price on income. Then find the ratio of the coefficient estimates.]
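For the same idea in Python, here is a sketch using simulated data rather than agriculture.xls (every parameter value below is made up for illustration). We run the two reduced-form regressions by ordinary least squares and take the ratio of the income slopes, which recovers the slope of the supply curve:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500

    # Structural model (invented parameters):
    #   supply: Qs = a0 + a1*P + es          with a1 = 1.5
    #   demand: Qd = b0 + b1*P + b2*I + ed   with b1 = -1.0, b2 = 0.5
    a0, a1 = 2.0, 1.5
    b0, b1, b2 = 20.0, -1.0, 0.5
    I = rng.normal(50, 10, n)   # income: exogenous, shifts demand only
    es = rng.normal(0, 2, n)
    ed = rng.normal(0, 2, n)

    # Equilibrium price and quantity implied by the structural equations.
    P = (b0 - a0 + b2 * I + ed - es) / (a1 - b1)
    Q = a0 + a1 * P + es

    # Reduced-form regressions: P on I, then Q on I.
    X = np.column_stack([np.ones(n), I])
    slope_P = np.linalg.lstsq(X, P, rcond=None)[0][1]  # estimates b2/(a1-b1)
    slope_Q = np.linalg.lstsq(X, Q, rcond=None)[0][1]  # estimates a1*b2/(a1-b1)

    print(f"supply slope estimate: {slope_Q / slope_P:.2f}")  # near a1 = 1.5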
But what if we wanted to find an estimate of the slope of the demand curve rather than
the supply curve? In that case, you'd want an instrumental variable that tends to shift
the supply curve but not the demand curve.
II. Multicollinearity
Multicollinearity is a fancy term that refers to having two (or more) explanatory variables
that are very similar to each other (that is, they are highly correlated). This can cause a
number of problems with your regression results. The main problem is that if two
variables really measure approximately the same thing, then the effect of that thing will
be spread out over two different coefficients. Your coefficients are likely to be small and
have weak significance.
For example, suppose you want to explain the suicide rate as a function of both welfare
participation and poverty. You might think both of these things are likely to cause
greater incidence of suicide. The problem is that welfare and poverty are highly
correlated; people who are poor are also likely to be on welfare. So these variables would
both measure how many people are in economic distress. If you included both variables,
their coefficients would probably be small and not very significant. It would be better to
include just one.
[Use mileage2.xls to demonstrate the problem. A simple regression shows that weight
has a significant effect on MPG. But throw in horsepower and engine size, and we get
insignificant effects for all three. That’s because cars with greater horsepower and larger
engines also tend to weigh more. As with the earlier mileage data set, it would also be a
good idea to include weight-squared as an explanatory variable to allow for a non-linear
effect of weight on MPG.]
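A simulated version of the same phenomenon (made-up numbers, not the mileage2.xls data): two nearly identical regressors split a single real effect between them, and both standard errors blow up; dropping one regressor tightens the estimate right back up:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100

    # Two nearly identical regressors (think weight and engine size).
    x1 = rng.normal(0, 1, n)
    x2 = x1 + rng.normal(0, 0.05, n)     # correlation with x1 around 0.999
    y = 3.0 * x1 + rng.normal(0, 1, n)   # only the shared "thing" matters

    # OLS with both regressors included.
    X = np.column_stack([np.ones(n), x1, x2])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    print("coefficients:", np.round(beta, 2))  # effect split across x1 and x2
    print("std errors:  ", np.round(se, 2))    # huge SEs on both slopes

    # Drop x2 and the estimate of the common effect is sharp again.
    X1 = np.column_stack([np.ones(n), x1])
    b = np.linalg.lstsq(X1, y, rcond=None)[0]
    print("x1-only slope:", round(b[1], 2))    # close to 3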
III. Heteroskedasticity and Autocorrelation
We won’t be doing examples of these, but I want you to know what they are.
Heteroskedasticity occurs when the error term (ε) doesn’t have the same variance over all
observations. This happens most often in cross-sectional data, which is data in which
each observation corresponds to a different region or individual at the same moment in time.
For instance, if you’re using a data set where each observation is a U.S. state’s homicide
rate, the problem is that some states might have much greater variance in their homicide
rate than others. This can happen because in a state with low population, a single
homicide could have a big effect on the homicide rate, while a single homicide would
have a smaller effect in a high-population state.
Autocorrelation occurs when the errors of different observations are not independent of
each other, but tend to be correlated. This happens most often in time series data, which
is data for which each observation corresponds to a different point in time (e.g., a year or
quarter). The problem is that there are certain trends which are likely to carry over from
one period to the next. For example, a “shock” to energy supplies caused by adverse
weather conditions might be expected to have a lingering effect even after the period in
which it occurred.
Both heteroskedasticity and autocorrelation leave the coefficient estimates themselves
unbiased (under the other standard assumptions), but they bias the estimated standard
errors in one direction or another (and it can be hard to tell which direction), which
makes significance tests unreliable. Econometricians have sophisticated methods for
dealing with these problems, but we won't discuss them further.
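We won't work examples of these in class, but a quick simulation (entirely invented numbers) shows what each error pattern looks like:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    x = np.linspace(1, 10, n)

    # Heteroskedastic errors: the spread grows with x instead of staying constant.
    e_het = rng.normal(0, 1, n) * x

    # Autocorrelated (AR(1)) errors: each period's shock carries over into the next.
    e_ar = np.zeros(n)
    for t in range(1, n):
        e_ar[t] = 0.8 * e_ar[t - 1] + rng.normal(0, 1)

    print("variance, first vs. last 20 het. errors:",
          round(e_het[:20].var(), 1), round(e_het[-20:].var(), 1))
    print("correlation of AR errors with their lag:",
          round(np.corrcoef(e_ar[1:], e_ar[:-1])[0, 1], 2))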
(There is also a third kind of data: panel data, which is a combination of cross-sectional
and time series data. Each observation corresponds to a different region or individual
and a different time period. The number of observations is the number of
regions/individuals multiplied by the number of time periods.)
IV. Correlation and Causation, Once Again
Regression is a fancy enough tool that you might think it’s immune from basic errors of
logic and reasoning. It’s not. You have to think very carefully about your underlying
assumptions, or you could reach seriously misleading conclusions.
The most important thing to realize is that regression is a sophisticated way of finding a
correlation between two or more variables. Unlike simple correlation, we’re estimating
more than one parameter (both intercept and slope). But we could still be mistaking
correlation for causation; that is, we could be assuming A causes B simply because they
happen to go together.
Huff gives the example of the correlation between wages of ministers and prices of rum.
You might assume, given this correlation, that ministers must be using their higher wages
to buy more rum (thus driving up demand). But more likely, the correlation is driven by
the fact that wages and prices for almost everything have risen over time because of
inflation. Running a regression, instead of just looking at the correlation between the two
variables, won’t avoid this error of logic.
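A small simulation makes the point (the numbers are invented, not Huff's): two series that merely share an upward trend look almost perfectly correlated, and the correlation largely vanishes once the trend is stripped out:

    import numpy as np

    rng = np.random.default_rng(4)
    t = np.arange(50)  # 50 years

    # Two unrelated series that both drift upward over time (as with inflation).
    minister_wages = 100 + 5.0 * t + rng.normal(0, 5.0, 50)
    rum_prices = 10 + 0.5 * t + rng.normal(0, 0.5, 50)

    # The shared trend produces a near-perfect correlation with no causal link.
    print("correlation of levels:",
          round(np.corrcoef(minister_wages, rum_prices)[0, 1], 2))

    # Year-to-year changes strip out the trend, and the correlation disappears.
    dw, dr = np.diff(minister_wages), np.diff(rum_prices)
    print("correlation of changes:", round(np.corrcoef(dw, dr)[0, 1], 2))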
Also, you should not be misled by high R-squared values and significance levels. You
can have these even if there’s no actual causation. And you can have these even if your
functional form (which represents your assumption about the underlying relationship) is
off. We saw this when we looked at the effect of ad spending on consumer impressions,
and when we looked at the effect of weight on gas mileage. In both cases, the simple
linear regression resulted in high R-squared and significance levels. But in both cases, a
different functional form generated a better fit to the data.
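Here is a sketch of that pitfall with made-up data: the true relationship is quadratic, yet a straight-line fit still earns a high R-squared, so a high R-squared alone doesn't certify the functional form:

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.linspace(1, 10, 50)
    y = 2 * x ** 2 + rng.normal(0, 5, 50)   # true relationship is quadratic

    # A straight-line fit still produces a high R-squared...
    b, a = np.polyfit(x, y, 1)              # returns slope, then intercept
    r2_lin = 1 - np.sum((y - (a + b * x)) ** 2) / np.sum((y - y.mean()) ** 2)

    # ...but the correct (quadratic) functional form fits better still.
    c2, c1, c0 = np.polyfit(x, y, 2)
    fit2 = c0 + c1 * x + c2 * x ** 2
    r2_quad = 1 - np.sum((y - fit2) ** 2) / np.sum((y - y.mean()) ** 2)

    print(f"linear R^2:    {r2_lin:.3f}")   # high despite the wrong form
    print(f"quadratic R^2: {r2_quad:.3f}")  # higher still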
There is no substitute for looking at scatterplots to make yourself aware of possible
sources of error. [Use Anscombe's four data sets (anscombe.xls) to demonstrate, if
there’s time.]