Lecture 8: Ordinary Least Squares
BUEC 333
Professor David Jacks
A lot of the discussion last week surrounded the population regression function:

Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + \varepsilon_i

We said that because the coefficients (the \beta's) and the errors (\varepsilon_i) are population quantities, we do not and cannot observe them.

This led us to consider the sample analog of the regression function above.
Overview of regression analysis

[Slides 2-6 presented a graphical overview of regression analysis; the figures are not reproduced here.]
Most of the time, our primary interest is in the coefficients themselves: \beta_k measures the marginal effect of independent variable X_{ki} on the dependent variable Y_i, holding the values of the other independent variables constant.

Sometimes we are more interested in predicting Y_i; given sample data, we can calculate predicted values \hat{Y}_i.

Overview of regression analysis
In either case, we need some way to estimate the unknown \beta's. That is, we need a way to compute \hat{\beta}'s from a sample of data.

Unsurprisingly, there are lots and lots of ways to estimate the \beta's (that is, compute \hat{\beta}'s). By far the most common, and one of the most intuitive, methods is called ordinary least squares, or OLS.

Overview of regression analysis
Recall that we can write the following:

Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i
    = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki} + e_i
    = \hat{Y}_i + e_i

where the e_i are known as residuals. Think of residuals as: 1.) the sample counterpart to \varepsilon_i; 2.) a measure of how far our predicted values are from the observed values.

What does OLS do?
We want to estimate the \beta's in a way that makes the residuals as small as possible. That is, we want our predicted values to be as close to "the truth" as possible. Or in other words, we want to minimize our "prediction mistakes".

To accomplish this, OLS minimizes the sum of squared residuals.

What does OLS do?
Computationally, OLS is "easy": computers can perform it in a fraction of a second, and you could do it by hand if necessary (albeit slowly).

Comparatively, OLS estimates are not only unbiased but also the most efficient in the class of linear unbiased estimators (more on this later).

Conceptually, minimizing squared residuals is intuitive.

But why "least squares"?
If one were to minimize the sum (or average) of residuals, the positive and negative residuals would only serve to cancel one another out. Consequently, we may end up with really inaccurate predicted values.

Thus, squaring penalizes "big" mistakes (large e_i) more heavily than "small" ones.

But why "least squares"?
Suppose you have a linear regression model with one independent variable:

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

The OLS estimates of \beta_0 and \beta_1 are the values that minimize:

\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

How to determine this minimum? Differentiate w.r.t. \hat{\beta}_0 and \hat{\beta}_1, set to zero, and solve.

How does OLS work?
At the end of the day, the solutions to this minimization problem are:

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}

But where did these equations come from?

How does OLS work?
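To make these formulas concrete, here is a minimal Stata sketch that computes \hat{\beta}_0 and \hat{\beta}_1 "by hand" from the summation formulas; it assumes a dataset with the variables salary and points used in the example later in the lecture:

    * OLS slope and intercept from the summation formulas
    quietly summarize points
    scalar xbar = r(mean)
    quietly summarize salary
    scalar ybar = r(mean)
    generate double dxdy = (points - xbar) * (salary - ybar)
    generate double dx2  = (points - xbar)^2
    quietly summarize dxdy
    scalar num = r(sum)
    quietly summarize dx2
    scalar den = r(sum)
    scalar b1 = num / den          // slope: cross-deviations over squared deviations
    scalar b0 = ybar - b1 * xbar   // intercept: Ybar minus slope times Xbar
    display "b0 = " b0 "   b1 = " b1

The two displayed scalars should match the coefficients that the regress command reports.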
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i

First, define your residual as e_i = Y_i - \hat{Y}_i.

Next, set up a minimization problem for the linear regression model with one independent variable. That is, let \hat{\beta} be defined as the set of estimators that solves:

How does OLS work?
\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2

As usual, this involves taking the first derivatives with respect to the betas and setting the two equations equal to zero. Once we have the two first-order conditions (FOCs), we solve for the values of the betas where both conditions hold simultaneously.

How does OLS work?
In this case, we need to apply the chain rule: if y is a differentiable function of u and u is a differentiable function of x, then:

\frac{dy}{dx} = \frac{dy}{du} * \frac{du}{dx}

Here, y and u are the "outside" and "inside" functions:

y = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} u_i^2, \qquad u_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i

How does OLS work?
Applying the chain rule to the sum of squared residuals, for \hat{\beta}_0:

\frac{d \sum_{i=1}^{n} e_i^2}{d \hat{\beta}_0} = \frac{d \sum_{i=1}^{n} u_i^2}{d u_i} * \frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_0} = 0

\frac{d \sum_{i=1}^{n} e_i^2}{d \hat{\beta}_0} = \sum_{i=1}^{n} 2 u_i * \frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_0}

\frac{d \sum_{i=1}^{n} e_i^2}{d \hat{\beta}_0} = \sum_{i=1}^{n} 2 (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) * \frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_0} = 0

How does OLS work?
Solve for the partial derivatives:

\frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_0} = -1, \qquad \frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_1} = -X_i

Now, substitute in the partial derivatives:

\sum_{i=1}^{n} -2 (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0

How does OLS work?
The FOCs actually tell us a few useful things:

1.) \sum_{i=1}^{n} -2 (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0

2.) \sum_{i=1}^{n} -2 X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0

From 1.), the mean of the residuals is going to be equal to zero; from 2.), the covariance of the residuals and X is going to be equal to zero.

How does OLS work?
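Both properties are easy to verify in Stata after any regression; a minimal sketch, again assuming the salary and points variables from the later example:

    * verify the two FOC properties of OLS residuals
    quietly regress salary points
    predict double ehat, residuals
    summarize ehat          // mean of the residuals: zero (up to rounding)
    correlate ehat points   // correlation (hence covariance) with X: zero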
Finally, solve for the values of the coefficients. For \hat{\beta}_0:

\sum_{i=1}^{n} -2 (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0

\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} (\hat{\beta}_0 + \hat{\beta}_1 X_i) = n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} X_i

How does OLS work?
Dividing through by n:

\frac{1}{n} \sum_{i=1}^{n} Y_i = \hat{\beta}_0 + \hat{\beta}_1 \frac{1}{n} \sum_{i=1}^{n} X_i

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}

Great, but what about the slope coefficient?

How does OLS work?
For \hat{\beta}_1:

\sum_{i=1}^{n} -2 X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0

Substituting in \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}:

\sum_{i=1}^{n} X_i (Y_i - \bar{Y} + \hat{\beta}_1 \bar{X} - \hat{\beta}_1 X_i) = 0

\sum_{i=1}^{n} X_i [ (Y_i - \bar{Y}) - \hat{\beta}_1 (X_i - \bar{X}) ] = 0

How does OLS work?
\sum_{i=1}^{n} X_i (Y_i - \bar{Y}) - \hat{\beta}_1 \sum_{i=1}^{n} X_i (X_i - \bar{X}) = 0

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i (Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})}

Great, but what do we do with this?

How does OLS work?
Next, we need to work some mathemagics:

1.) Expand the previous expression:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i}{\sum_{i=1}^{n} X_i^2 - \bar{X} \sum_{i=1}^{n} X_i}

2.) Note that \sum_{i=1}^{n} X_i = n \bar{X}, so that:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X} \bar{X}}

How does OLS work?
3.) Finally, make use of the facts that:

\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})

\sum_{i=1}^{n} X_i^2 - n \bar{X} \bar{X} = \sum_{i=1}^{n} (X_i - \bar{X})^2

so that:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}

How does OLS work?
An important interpretation: the estimated coefficients are simply weighted averages of the Y_i:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \sum_{i=1}^{n} \left[ \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right] Y_i

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \sum_{i=1}^{n} \left[ \frac{1}{n} - \bar{X} \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right] Y_i

Thus, each OLS estimate is a special kind of weighted sample mean of the Y_i.

How does OLS work?
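As a sanity check on the weighted-average form, the sketch below rebuilds \hat{\beta}_1 as a weighted sum of the Y_i in Stata (salary and points are again the assumed variable names):

    * beta1_hat as a weighted sum of Y
    quietly summarize points
    scalar ssx = r(Var) * (r(N) - 1)               // sum of squared deviations of X
    generate double w  = (points - r(mean)) / ssx  // weight on each observation
    generate double wy = w * salary
    quietly summarize wy
    display "b1 (weighted sum of Y) = " r(sum)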
A second important interpretation:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) / (n-1)}{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)}

\hat{\beta}_1 = \frac{\widehat{Cov}(X, Y)}{\widehat{Var}(X)}

That is, the slope is the sample covariance of X and Y divided by the sample variance of X.

How does OLS work?
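This ratio interpretation can also be confirmed directly; a minimal sketch using Stata's covariance matrix (same assumed variable names):

    * slope as sample covariance over sample variance
    quietly correlate salary points, covariance
    matrix C = r(C)                    // 2x2 sample covariance matrix
    display "b1 = " C[2,1] / C[2,2]    // Cov(points, salary) / Var(points)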
For the basic regression model Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, picture a Venn diagram with two circles: one circle represents variation in y; the other represents variation in x. The overlap between the two (shown in green on the slide) represents variation that y and x have in common.

Another way of "seeing" OLS
Knowing the summation formulas for the OLS estimates is useful for understanding how the method works. But once we add more than one independent variable, these formulas become unwieldy.

In practice, we rarely do least squares calculations by hand; this is why God invented computers. Time for an example via Stata.

OLS in practice
Suppose we are interested in how an NHL hockey player's salary varies with the number of points they score. That is, variation in salary is related to variation in points scored (with causality presumably running from the latter to the former).

The dependent variable (Y_i) will be SALARY. The independent variable (X_i) will be POINTS.

An example
A few helpful steps:
1.) open NHL 1601.xlsx
2.) copy all columns
3.) open Stata > type “edit” in command window
An example
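Once the data are pasted into the editor, the regression itself is a single command; a minimal sketch, assuming the pasted columns are named salary and points:

    * simple regression of salary on points
    regress salary points
    * fitted (predicted) salary for each player
    predict double salary_hat, xb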
Stata now does the heavy lifting; your results should look like the following (output screenshot not reproduced here).

An example
The column labeled "Coef." gives the least squares estimates of the regression coefficients. So our estimated model is:

SALARY = 365,739.80 + (40,546.89)*POINTS

Players who scored zero points earned $365,739.80 on average. For each point scored, players were paid an additional $40,546.89, with the "average" 100-point player being paid $4,420,428.80.

What the results mean
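That $4,420,428.80 figure is just the fitted value from the estimated equation; after running the regression, it can be reproduced with Stata's stored coefficients:

    * predicted salary for a 100-point player
    display _b[_cons] + _b[points] * 100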
The column labeled "Std. Err." gives the standard error (that is, the square root of the sampling variance) of the regression coefficients.

Remember: the OLS estimates are a function of the sampled data and are, therefore, RVs. Every RV has a sampling distribution, so "Std. Err." tells us how spread out the estimates are.

What the results mean
In particular, the column labeled "t-Statistic" is a test statistic for the null hypothesis that the corresponding regression coefficient is zero. The column labeled "Prob." is the p-value associated with this test; it is the probability of making a type I error if we reject the null.

We can ignore the rest for the time being. Now, add a player's age and years of NHL experience to our model, as sketched below.

What the results mean
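Adding the two regressors is a one-line change; a minimal sketch, assuming the new columns are named age and years_exp:

    * multiple regression: salary on points, age, and experience
    regress salary points age years_exp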
An example

[Slide 37 showed the Stata output for this expanded regression; screenshot not reproduced here.]
First thing to notice: the estimated coefficient on POINTS and the intercept have changed; this is because they now measure different things.

In our original model, the intercept (_cons) measured the average SALARY when POINTS was zero ($365,739.80). That is, the intercept originally estimated E(SALARY | POINTS = 0), which put no restriction on AGE or YEARS_EXP.

What the results mean
In the new model, the intercept measures the average SALARY when POINTS, AGE, and YEARS_EXP are all zero ($309,516.30). That is, the new intercept estimates E(SALARY | POINTS = 0, AGE = 0, YEARS_EXP = 0).

The point is that what your estimated regression coefficients measure depends on what else is in the model.

What the results mean
Originally, the coefficient on POINTS was an estimate of the marginal effect of POINTS on SALARY:

\frac{d(SALARY)}{d(POINTS)} = 40,546.89

Now, the coefficient on POINTS measures the marginal effect of POINTS on SALARY, holding AGE and YEARS_EXP constant:

\frac{\partial(SALARY)}{\partial(POINTS)} = 35,150.35

What the results mean
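In a linear model these marginal effects are just the coefficients, and Stata can report them directly; a minimal sketch, run after the multiple regression above:

    * marginal effect of points, holding age and years_exp constant
    margins, dydx(points)   // in a linear model, reproduces the coefficient on points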