Simple Regression

ECON 309
Lecture 8: Simple Regression
I. Studying Relationships
So far we have only been interested in the parameters (or characteristics) of a single
variable: What is the mean number of customers in this store per day? What is the mean
number of defective products per production run? And so on.
But often what we really want to know is the relationship between two variables; for
example:

- How does the number of workers affect a firm’s output?
- How does the level of advertising affect a firm’s sales?
- How does the level of the minimum wage affect unemployment?
And often we want to know the relationship between more than two variables; for
example:

- How do screen size, HD capability, and number of tuners affect the price of a TV?
- How do the price of a good, the price of other goods, and income affect the quantity demanded of a product?
For now, we will be focusing on finding the relationship between just two variables; we
call this doing a simple regression. When we get to three or more variables (that is, how
two or more variables affect a third), we will call it multiple regression.
II. The Functional Form of the Relationship
When looking at a single variable, we usually supposed that it had certain fixed
characteristics called parameters, and we came up with estimates of those parameters.
E.g., a population has mean μ and standard deviation σ, which we estimate with x̄ and s,
respectively.
Similarly, we assume that the relationship between two variables x and y is defined by
fixed parameters. The usual hypothesized relationship is this:
y = α + βx + ε
The parameters here are α and β; these are the items we want to estimate.
Notice that the functional form is linear; we are supposing there is a linear relationship
between the two variables. [Diagram the actual relationship; show that alpha is the y-intercept, beta the slope on x.]
In this function, the dependent variable is y and the independent variable is x. That is, we
think that the value of y results from the value of x, which is determined independently.
If you think causation runs the other way (y affects x), then reverse the variables.
If you think causation runs both ways (x affects y and y affects x), then the problem
becomes a lot more complex. We will discuss problems of simultaneous determination
more later, but for now notice that many economic relationships are like that: price
depends on quantity available, but quantity available depends on the expected price; the
number of policemen can affect the crime rate, but the crime rate can affect the city’s
decision about how many police to hire; etc.
We’ve also added ε at the end of the equation; this is the error term. It represents the fact
that the relationship is not perfect; sometimes the actual y will differ from the y that
would result from the linear relationship. We generally assume that ε is normally
distributed with mean of zero. (If it were not distributed with mean of zero, we could just
add that mean to the α term, leaving an error that does have a mean of zero. So the big
assumption here is not that the mean is zero, but that the error is normally distributed.)
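A tiny simulation can make the error term concrete. Here is a sketch (plain Python, with made-up "true" parameter values α = 1 and β = 2) that draws normal errors with mean zero and builds y from the linear model:

```python
import random

# Simulate the model y = α + βx + ε with ε ~ Normal(0, 1).
# alpha = 1 and beta = 2 are made-up "true" parameters for illustration.
random.seed(42)                      # reproducible draws
alpha, beta = 1.0, 2.0
x = [i / 10 for i in range(200)]
y = [alpha + beta * xi + random.gauss(0, 1) for xi in x]

# Each observed y misses the line alpha + beta*x by its error; on average
# the errors wash out because epsilon has mean zero.
errors = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
mean_error = sum(errors) / len(errors)   # close to 0 for large samples
```

Each observed y scatters around the line, but the average deviation shrinks toward zero as the sample grows.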
We will usually call our estimates of these parameters a and b, respectively. Or in other
words, we’ll estimate the line above with the following equation, which we call a best-fit
line:
ŷ = a + bx
The y-with-a-hat is called the predicted value of y. It will differ from the observed y that
corresponds to given x in our sample for two reasons: first, because there’s an error term;
and second, because our estimates a and b won’t be exactly right.
III. Ordinary Least Squares
So the question is, how do we find a best-fit line? That is, how do we find our estimates
a and b?
The best-fit line gives us predicted values of y. As noted earlier, these will differ from
the observed y’s most of the time. So we want to pick a and b in a way that minimizes
the differences between observed and predicted y’s. Or, more accurately, we want to
minimize the squared differences between observed and predicted y’s.
[Draw a scatter-plot. Then show the best-fit line going through it. Then show the
distance between the predicted and the observed value of y for each value of x.]
So the method we’ll use is called Least Squares or Ordinary Least Squares. It means
minimizing the sum of squared differences between observed and predicted y’s. That is,
we minimize the following:
Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
It can be proven (using calculus) that this is minimized by the following formulas for a
and b:
b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)

a = ȳ − b x̄
Fortunately, Excel can do all of that for you.
[Use adspendingkey.xls to demonstrate the hard way; then show the easy way.
Emphasize the estimates of intercept and slope.]
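If you want to check the answer by hand, the two formulas can be applied directly. Here is a minimal sketch in Python (made-up data; the variable names mirror the lecture's notation):

```python
# Apply the least-squares formulas above to some made-up (x, y) data.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.9]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope: b = (n*Σxy − Σx*Σy) / (n*Σx² − (Σx)²)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept: a = ȳ − b*x̄
a = sum_y / n - b * sum_x / n
# For this data, b = 1.95 and a = 0.23
```

Running Excel's regression tool on the same two columns should reproduce these intercept and slope estimates.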
It’s important to be able to interpret the results of a regression. Put the coefficient
estimates into the linear functional form. Then say what it means, by stating
this-much-increase-in-x will lead to that-much-increase-in-y.
[In this case: A $1 million increase in ad spending corresponds to about 0.363 million (or
363,000) more retained impressions by consumers per week.]
Another aspect of the regression you should look at is the R-squared value, also known as
the coefficient of determination. Put simply, the R-squared is the percent of the variation
in the dependent variable that is explained by variation in the independent variable.
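A rough sketch, with made-up numbers, of where R² comes from: it compares the variation the fitted line fails to explain (SSE) against the total variation in y (SST).

```python
# Compute R² = 1 − SSE/SST for a best-fit line on made-up data.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.9]
n = len(x)

# Fit the line with the least-squares formulas from earlier.
b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = sum(y) / n - b * sum(x) / n

y_hat = [a + b * xi for xi in x]
y_bar = sum(y) / n
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
r_squared = 1 - sse / sst                               # ≈ 0.997 here
```

An R² near 1 means almost all of the variation in y tracks the variation in x; an R² near 0 means the line explains very little.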
IV. Hypothesis Testing With Regressions
Just as we tested hypotheses about the parameters of a single variable before, now we’ll
test hypotheses about parameters of relationships.
Most often, we’ll be interested in the slope coefficient. And typically the most important
question is whether it is different from zero. If it is, then we have support for the
existence of a relationship. So a common set of hypotheses is:
H0: β = 0
H1: β ≠ 0
Note that this is a two-tail test. Because this is such a common test, Excel’s regression
analysis does most of the work for us. For each estimated coefficient, it tells us the
t-value and the p-value. By looking at the p-value, we can see the most stringent
significance level that would allow us to reject the null hypothesis. [Use adspending.
We can support the hypothesis that spending affects the number of impressions on
consumers at any commonly used level of significance, including 1%.]
However, if you want to do any other hypothesis test (e.g., a one-tail test, or a test to see
if the coefficient is significantly different from some number besides zero), you’ll have to
set it up yourself. You can do this using the standard error. Use the following formula:
t = (b − βH0) / se
Where se = the standard error given in the regression output. For the appropriate
comparison t, use your t-table, with df = n – 2. Why df = n – 2? Because degrees of
freedom are equal to n minus the number of things you’re estimating. In this case, you’re
estimating the y-intercept and the slope on x.
[Replicate the Excel output. The coefficient estimate divided by the standard error gives
you the t-value.]
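The arithmetic behind that replication can be sketched in a few lines. The slope below is the lecture's ad-spending estimate; the standard error is a made-up value purely for illustration, since the notes don't print one:

```python
# Sketch of the t-test t = (b − βH0) / se against H0: β = 0.
b = 0.363            # ad-spending slope estimate from the lecture
se = 0.0805          # hypothetical standard error, for illustration only
beta_h0 = 0.0        # null hypothesis: no relationship

t = (b - beta_h0) / se
t_critical = 2.093   # two-tailed, 5% level, df = 19, from a t-table
reject_null = abs(t) > t_critical
```

Since |t| exceeds the critical value, we would reject the null of no relationship at the 5% level in this illustrative case.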
You can also form confidence intervals around your estimates. The regression output
already includes a 95% confidence interval. But you can form other confidence intervals.
The general formula is:
b ± tc·se
[Replicate the Excel output. For df = 21 − 2 = 19, and significance level of 5%, the t-critical is 2.093. Multiply this by the standard error from the regression; add and subtract
this from the coefficient estimate to get the given confidence interval.]
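That add-and-subtract step can be sketched directly. As before, the slope is the lecture's ad-spending estimate and the standard error is a hypothetical stand-in:

```python
# Sketch of the confidence-interval formula b ± tc·se.
b = 0.363          # ad-spending slope estimate from the lecture
se = 0.0805        # hypothetical standard error, for illustration only
t_c = 2.093        # critical t for 95% confidence, df = 21 − 2 = 19

lower = b - t_c * se
upper = b + t_c * se   # the 95% CI is (lower, upper)
```

Note that this interval excludes zero, which is another way of seeing that the slope is significant at the 5% level.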
V. Finding and Correcting Non-Linearities
It’s important to realize when your relationship might be non-linear. Often you can see
this by looking at a scatterplot. If the relationship doesn’t look like a straight line, it’s
likely you have a non-linear relationship.
In the ad spending example, the scatterplot seems to show a curved shape, getting flatter
as the amount of spending gets larger. This makes sense: there are probably diminishing
marginal returns to advertising.
We can also see this by looking at the plots of residuals and line fit in the regression
output. The line fit plot shows that the predicted value of y is often greater than the
observed value for low values of x, and less than the observed value for moderate-to-high
values of x. The residuals plot shows this more directly. Residuals are the differences
between observed and predicted values. Notice that in this picture, we have negative
residuals for low values of x, positive residuals for higher values of x. These are all good
signs that we have a non-linear relationship.
How can we correct for non-linearities? Here is where economic theory can be helpful.
We already have an economic explanation: diminishing marginal returns. Do we know
any functional forms that generate that kind of relationship?
It turns out that an exponential function will do it. Suppose that x and y are related like
so:
y = αx^β
With β < 1 (for example, ½). If you graph this, you’ll get a curve that starts steep but
gets flatter and flatter [show graph]. But how can we estimate the parameters of this non-linear relationship? We can make use of the properties of logarithms. Take the log of
both sides to get:
ln(y) = ln(αx^β)
ln(y) = ln α + β ln x
Now we can transform our variables. Take the y series and create a new series called
ln(y); take the x series and create a new series called ln(x). The relationship between
these transformed variables is linear; the vertical intercept is lnα and the slope is β.
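A quick way to convince yourself the transformation works: generate noiseless data from y = αx^β with known parameters and check that regressing ln(y) on ln(x) recovers β exactly. A sketch in plain Python (standing in for Excel), with made-up values α = 2 and β = 0.5:

```python
import math

# Generate data from y = α·x^β with made-up "true" values α = 2, β = 0.5,
# then recover β by least squares on the log-transformed series.
alpha_true, beta_true = 2.0, 0.5
x = list(range(1, 41))
y = [alpha_true * xi ** beta_true for xi in x]   # no error term, so exact fit

lx = [math.log(xi) for xi in x]                  # transformed x series
ly = [math.log(yi) for yi in y]                  # transformed y series

n = len(lx)
b = (n * sum(u * v for u, v in zip(lx, ly)) - sum(lx) * sum(ly)) / \
    (n * sum(u ** 2 for u in lx) - sum(lx) ** 2)
a = sum(ly) / n - b * sum(lx) / n                # this is ln(α)
# b recovers 0.5 (the elasticity) and e^a recovers 2.0
```

With real data the error term keeps the fit from being exact, but the same two transformed columns fed into Excel's regression tool give the estimates of ln α and β.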
[Do this in Excel with adspending. Notice that we get an even more significant
coefficient on the slope, and a higher R-squared value. Also look at the line fit and
residuals plots; the differences don’t seem systematically related to the size of x.]
How should we interpret the results of a non-linear regression like this? The slope
coefficient tells us that when the log-of-x goes up by one, the log-of-y goes up by the
amount of the coefficient. Not very useful information. But it turns out, for reasons I
won’t explain, that when a function has the exponential form given earlier, the exponent
on x can be interpreted as the elasticity of y with respect to x. Thus, the slope coefficient
tells us that when x goes up by 1%, y will increase by a percentage equal to the
coefficient.
[For adspending, this means a 1% increase in your advertising budget will increase your
number of consumer impressions by 0.6%.]