Chapter 2: Ordinary Least Squares

ECO391 Lecture Handout over Section 15.3
Spring 2003, G. Hoyt
I. What is the Method of Least Squares (Ordinary Least Squares)?
A. Theory
B. Formula
C. Application
II. Standard Error of the Estimate
I. Ordinary Least Squares (OLS) (also called The Method of Least Squares)
A. The Theory
Ordinary least squares is a statistical technique that uses sample data to estimate the true population relationship
between two variables.
Recall that:
1) E(Yi|Xi) = β0 + β1Xi is the population regression line
2) Ŷi = b0 + b1Xi is the sample regression equation
OLS allows us to find b0 and b1.
Consider the following scatter plot diagram that shows the actual, observed data points in a sample:
[Scatter plot: observed data points, with Y on the vertical axis and X on the horizontal axis]
Many lines could fit through these data points. We want to determine the line with the "best fit."
What does it mean to say a line fits the data the best?
Recall that êi, the residual, represents the distance between the sample regression line and the observed
data point, (Xi, Yi). The line that minimizes the sum of these distances is the one that gives us the best fit.
However, some of the values of the residuals are negative in sign while others are positive. If we sum the
residuals, positive values will cancel out negative values so the sum will not accurately reflect the total
amount of error.
To solve this problem we square the residuals before we add them together.
The method of least squares (OLS) produces a line that minimizes the sum of the squared vertical distances
from the line to the observed data points.
i.e., it minimizes Σêi² = ê1² + ê2² + ê3² + ... + ên², where n is the sample size.
The sum of the residuals (unsquared) is exactly zero. (Later, you can use this bit of information to check your
work.)
B. Formulas - How does OLS get estimates of the coefficients?
Σêi² is also called the residual sum of squares (SSE). This is the quantity that we want to minimize.
SSE = Σêi²                      (1)
    = Σ(Yi − Ŷi)²               (2)
    = Σ(Yi − b0 − b1Xi)²        (3)
Now consider equation (3). I am going to ask you to remember a little calculus. We can treat (3) as a
mathematical function of b0 and b1, f(b0, b1).
We want to minimize the sum of the squared error terms, so we want to minimize equation (3). In terms of calculus,
this means we want to find the critical points of the function: the values of b0 and b1 that minimize it.
To do this we take the partial derivative of (3) with respect to b0, set it equal to zero, and
solve for b0. When we do this we get:
b0 = (ΣYi − b1ΣXi) / n        (4)
If we take the partial derivative of (3) with respect to b1, set it equal to zero, and then solve for b1, we get the
following equation:
b1 = (nΣXiYi − ΣXiΣYi) / (nΣXi² − (ΣXi)²)        (5)
Equations (4) and (5) give us the formulas that we need to find the values of b0 and b1 that estimate the true
population relationship between X and Y. If we plug (5), the formula for b1, into (4), the formula for b0, we may
also write b0 as follows:
b0 = (ΣXi²ΣYi − ΣXiΣXiYi) / (nΣXi² − (ΣXi)²)        (6)
(Note that ΣXi² ≠ (ΣXi)².) Equations (4)–(6) give us b0 and b1 solely in terms of X and Y (the sample data).
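Equations (4)–(6) translate directly into code. Here is a minimal sketch in Python; the function name, variable names, and the toy data are my own, chosen only for illustration:

```python
def ols_coefficients(x, y):
    """Estimate b0 and b1 from sample data using equations (4)-(6)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    # Equation (5): the slope
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Equation (4): the intercept
    b0 = (sum_y - b1 * sum_x) / n
    # Equation (6) gives the same intercept directly from the sums
    b0_alt = (sum_x2 * sum_y - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)
    assert abs(b0 - b0_alt) < 1e-9  # (4) and (6) must agree
    return b0, b1

# Toy data (hypothetical, for illustration only)
b0, b1 = ols_coefficients([1, 2, 3, 4], [2, 4, 5, 8])
print(b0, b1)
```

Only the raw sums ΣXi, ΣYi, ΣXiYi, and ΣXi² are needed, which is exactly what the worksheet table in the next section asks you to tabulate.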
C. EXAMPLE:
Let's try an example. Consider the following dependent and independent variables:
Xi = the number of children in a family
Yi = the number of loaves of bread consumed by a family in a given three week period
Family #   Xi   Yi   XiYi   Xi²   Ŷi   êi   êi²
   1        2    4
   2        3    7
   3        1    3
   4        5    9
   5        9   17
 (n = 5)  ΣXi =    ΣYi =    ΣXiYi =    ΣXi² =         Σêi =    Σêi² =

1) Find b0:
2) Find b1:
3) Write out the full sample regression line. Interpret the coefficients, b0 and b1.
4) Prediction:
Given that Xi = 6, predict the value that we expect Yi to take given our sample regression line (i.e., find Ŷi).
Complete the sixth column of the table.
5) Calculate the residuals:
Recall êi = Yi − Ŷi. (Fill in the seventh column of the table.)
Check: the sum of the residuals should be approximately zero, Σêi = 0.
6) Find Σêi², or SSE: (Complete the eighth column of the table.)
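Once you have filled in the table by hand, a short Python sketch like the following can confirm your numbers (the variable names are my own; the data are from the table above):

```python
x = [2, 3, 1, 5, 9]   # Xi: number of children in each family
y = [4, 7, 3, 9, 17]  # Yi: loaves of bread consumed

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # the XiYi column, summed
sum_x2 = sum(xi ** 2 for xi in x)               # the Xi² column, summed

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # equation (5)
b0 = (sum_y - b1 * sum_x) / n                                  # equation (4)

y_hat = [b0 + b1 * xi for xi in x]                   # fitted values Ŷi (sixth column)
residuals = [yi - yh for yi, yh in zip(y, y_hat)]    # êi = Yi − Ŷi (seventh column)
sse = sum(e ** 2 for e in residuals)                 # Σêi² (eighth column)

print(b0, b1)          # estimated coefficients
print(b0 + b1 * 6)     # step 4: prediction at Xi = 6
print(sum(residuals))  # step 5 check: should be approximately zero
print(sse)             # step 6: SSE
```

Running it reproduces every column of the worksheet, so any disagreement points to an arithmetic slip in the hand calculation.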
Alternative Formulas:
The formulas for the estimated coefficients can be manipulated and written in a variety of ways. Here are a few
other alternatives; one set is the formulas given in the text.
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

b1 = (ΣXiYi − nX̄Ȳ) / (ΣXi² − nX̄²), where X̄ = ΣXi / n

b0 = Ȳ − b1X̄, where Ȳ = ΣYi / n
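All three versions of the slope formula are algebraically equivalent, so they must produce the same number on any data set. A quick numerical check (the function names and toy data are my own, for illustration only):

```python
def slope_sums(x, y):
    # Equation (5): in terms of raw sums
    n = len(x)
    return (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) / \
           (n * sum(a ** 2 for a in x) - sum(x) ** 2)

def slope_deviations(x, y):
    # Deviation form: sum of (Xi − X̄)(Yi − Ȳ) over sum of (Xi − X̄)²
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = sum((a - x_bar) ** 2 for a in x)
    return num / den

def slope_means(x, y):
    # Mean form: (ΣXiYi − nX̄Ȳ) over (ΣXi² − nX̄²)
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    den = sum(a ** 2 for a in x) - n * x_bar ** 2
    return num / den

x, y = [1, 3, 4, 7], [2, 5, 4, 10]
print(slope_sums(x, y), slope_deviations(x, y), slope_means(x, y))
```

The deviation form is usually easiest to interpret (it is the sample covariance of X and Y divided by the sample variance of X), while the raw-sums form is easiest to compute by hand from a worksheet table.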
15.3 The Standard Error of the Estimate (Se)
Se = √( Σêi² / (n − 2) )
This measure approximates the average distance of the real data points from the estimated regression line.
i) Se is measured in the same unit of measure as the Y variable. So if the Y variable is measured in dollars and
Se = $9.23, then on average our actual data points vary from their estimated values by about $9.23.
ii) Se can be used as a measure of the quality of fit of the sample regression line. The smaller the Se, the better
the fit.
iii) An alternative formula for Se:
iii) An alternative formula for Se:
Se = √( (ΣYi² − b0ΣYi − b1ΣXiYi) / (n − 2) )