Theory of Best Fit Lines

STEM 698 Theory of Lines of Best Fit
Suppose we have a data set of two related variables:
$$\begin{array}{c|ccccc}
x & x_1 & x_2 & x_3 & \cdots & x_n \\ \hline
y & y_1 & y_2 & y_3 & \cdots & y_n
\end{array}$$
We could make a scatter plot of the data as follows.
[Figure: scatter plot of $y$ versus $x$]
If the data seem to follow a linear relationship, it is reasonable to try to fit a line to the data. How do we go about this?
[Figure: scatter plot with a candidate best fit line]
There are many ways one could go about finding a “best fit” line to the data.
Traditionally, the approach to fitting the line focuses on the vertical errors, depicted in
red below.
[Figure: scatter plot with fitted line; vertical errors shown in red]
However, there are other possibilities. For example, one could consider the horizontal errors:
[Figure: scatter plot with fitted line; horizontal errors]
Or one could focus on the perpendicular errors:
[Figure: scatter plot with fitted line; perpendicular errors]
While one might use the perpendicular errors in some specialized settings, long tradition focuses on the vertical errors.
If the line is given by $y = mx + b$, then the vertical error for the point $(x_i, y_i)$ is

Vertical error = (actual $y$ value associated with $x_i$) − (predicted value from the line for $x_i$),

that is,

$$\text{Vertical error} = y_i - (mx_i + b).$$
The vertical error for the point $(x_i, y_i)$ is so important that it has a special name: the vertical error for $x_i$ is called the $i$th residual.
Now, even after we have decided on what errors we would like to focus on, there are
many different approaches we could use to choose a best fit line. We could try to
minimize the maximum absolute value of the error or minimize the sum of the absolute
values of the errors. For rather technical reasons, we minimize the sum of the squares
of the vertical errors.
The sum of the squares of the vertical errors is

$$S = \sum_{i=1}^{n} \bigl(y_i - (mx_i + b)\bigr)^2.$$
The easiest way to find the $m$ and $b$ that minimize this sum is to take the partial derivatives with respect to $m$ and $b$ and set them equal to zero.
$$\frac{\partial S}{\partial m} = \sum_{i=1}^{n} -2\bigl(y_i - (mx_i + b)\bigr)(x_i)$$

$$\frac{\partial S}{\partial b} = \sum_{i=1}^{n} -2\bigl(y_i - (mx_i + b)\bigr)(1)$$
We set them equal to 0:

$$0 = \sum_{i=1}^{n} -2\bigl(y_i - (mx_i + b)\bigr)(x_i)$$

$$0 = \sum_{i=1}^{n} -2\bigl(y_i - (mx_i + b)\bigr)(1)$$
And simplify:

$$0 = -\sum_{i=1}^{n} \bigl(x_i y_i - m x_i^2 - b x_i\bigr)$$

$$0 = -\sum_{i=1}^{n} \bigl(y_i - m x_i - b\bigr)$$
Break up the sums:

$$0 = -\sum_{i=1}^{n} x_i y_i + m\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i$$

$$0 = -\sum_{i=1}^{n} y_i + m\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} b$$

Note that $\sum_{i=1}^{n} b = nb$.
Moving the first sums to the other side of each equation, we get:

$$\sum_{i=1}^{n} x_i y_i = m\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i$$

$$\sum_{i=1}^{n} y_i = m\sum_{i=1}^{n} x_i + nb$$
This is a system of two linear equations in the two unknowns $m$ and $b$. Using any method for solving a system of two equations in two unknowns, we get
$$m = \frac{\displaystyle n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\displaystyle n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$b = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - m\sum_{i=1}^{n} x_i\right)$$
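As a check on the algebra, here is a short Python sketch (with the same made-up data) that evaluates these closed-form expressions, solves the normal equations directly as a $2 \times 2$ linear system, and compares both against numpy.polyfit.

```python
# A minimal sketch of the closed-form least squares solution,
# checked two ways. Data are made up for illustration.
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(xs)

# Closed-form expressions for m and b derived above.
m = (n * np.sum(xs * ys) - np.sum(xs) * np.sum(ys)) / (
    n * np.sum(xs ** 2) - np.sum(xs) ** 2)
b = (np.sum(ys) - m * np.sum(xs)) / n

# Equivalently, solve the two normal equations as a 2x2 system.
A = np.array([[np.sum(xs ** 2), np.sum(xs)],
              [np.sum(xs), n]])
rhs = np.array([np.sum(xs * ys), np.sum(ys)])
m2, b2 = np.linalg.solve(A, rhs)

print(m, b)                   # closed form
print(m2, b2)                 # linear solve; should agree
print(np.polyfit(xs, ys, 1))  # numpy's fit: [slope, intercept]
```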
Using the symbols $\bar{x}$ and $\bar{y}$ to denote the average of the $x_i$'s and of the $y_i$'s respectively, the equations for $m$ and $b$ can be written in the more convenient form:
$$m = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\displaystyle\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and

$$b = \bar{y} - m\bar{x}.$$
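The two forms of $m$ above are algebraically equal; the following quick numerical check (same made-up data) confirms that they agree.

```python
# Check that the "raw sums" form and the centered form of m agree.
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(xs)
xbar, ybar = xs.mean(), ys.mean()

m_raw = (np.sum(xs * ys) - n * xbar * ybar) / (np.sum(xs ** 2) - n * xbar ** 2)
m_centered = np.sum((xs - xbar) * (ys - ybar)) / np.sum((xs - xbar) ** 2)

print(m_raw, m_centered)    # should agree
print(ybar - m_raw * xbar)  # b = ybar - m*xbar
```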
We can also explicitly calculate the minimal sum of squared vertical errors $S$. Substituting the formula for $b$ into the expression $S = \sum_{i=1}^{n} \bigl(y_i - (mx_i + b)\bigr)^2$,
$$\begin{aligned}
S &= \sum_{i=1}^{n} \bigl(y_i - (mx_i + (\bar{y} - m\bar{x}))\bigr)^2 \\
&= \sum_{i=1}^{n} \bigl((y_i - \bar{y}) - m(x_i - \bar{x})\bigr)^2 \\
&= \sum_{i=1}^{n} \Bigl[(y_i - \bar{y})^2 - 2m(y_i - \bar{y})(x_i - \bar{x}) + m^2(x_i - \bar{x})^2\Bigr] \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 - 2m\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) + m^2\sum_{i=1}^{n} (x_i - \bar{x})^2
\end{aligned}$$
Now substitute the formula for m in this expression:
$$\begin{aligned}
S &= \sum_{i=1}^{n} (y_i - \bar{y})^2 - 2\,\frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} + \frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^2}\,\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 - 2\,\frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} + \frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 - \frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}$$
If we use the symbols

$$s_{xx} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad s_{yy} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad s_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

and

$$s_{xy} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$

we can summarize the formulas as
$$m = \frac{s_{xy}}{s_{xx}}, \qquad b = \bar{y} - m\bar{x}, \qquad S = n\,s_{yy} - n\,\frac{s_{xy}^2}{s_{xx}} = n\left(s_{yy} - \frac{s_{xy}^2}{s_{xx}}\right).$$
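In code, the summary formulas are especially compact. Here is a sketch (same made-up data) that computes $s_{xx}$, $s_{yy}$, and $s_{xy}$ and checks the formula for $S$ against the sum of squared errors computed directly.

```python
# A minimal sketch of the summary formulas m = s_xy/s_xx,
# b = ybar - m*xbar, S = n*(s_yy - s_xy**2/s_xx).
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(xs)

xbar, ybar = xs.mean(), ys.mean()
s_xx = np.mean((xs - xbar) ** 2)           # variance of the x's
s_yy = np.mean((ys - ybar) ** 2)           # variance of the y's
s_xy = np.mean((xs - xbar) * (ys - ybar))  # covariance of x and y

m = s_xy / s_xx
b = ybar - m * xbar
S = n * (s_yy - s_xy ** 2 / s_xx)

# Check S against the sum of squared errors computed directly.
print(S, np.sum((ys - (m * xs + b)) ** 2))  # should agree
```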
The quantity

$$r = \frac{s_{xy}}{s_x s_y}$$

is defined as the correlation coefficient. It is the covariance of the $x$'s and the $y$'s divided by the product of the standard deviations of the $x$'s and $y$'s.
Cancelling the $n$'s, we can write:

$$r = \frac{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\displaystyle\sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$

Note that

$$m = \frac{s_{xy}}{s_{xx}} = \frac{s_{xy}}{s_x s_y}\cdot\frac{s_y}{s_x} = r\,\frac{s_y}{s_x}.$$
Also,

$$S = n\left(s_{yy} - \frac{s_{xy}^2}{s_{xx}}\right) = n\left(s_{yy} - \frac{s_{xy}^2}{s_{xx} s_{yy}}\,s_{yy}\right) = n\,s_{yy}(1 - r^2),$$

so

$$\frac{S}{n} = s_{yy}(1 - r^2).$$

Since $S/n \ge 0$ and $s_{yy} > 0$, this means $1 - r^2 \ge 0$, or $-1 \le r \le 1$.
It also means that $S/n \le s_{yy}$, the variance of the $y$'s. The quantity $S/n$ can be interpreted as the "variance of the errors."
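Here is a quick numerical check (same made-up data) that $r$ agrees with numpy's correlation coefficient and that $S/n = s_{yy}(1 - r^2)$.

```python
# Check r against np.corrcoef and verify S/n = s_yy*(1 - r**2).
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(xs)

xbar, ybar = xs.mean(), ys.mean()
s_xx = np.mean((xs - xbar) ** 2)
s_yy = np.mean((ys - ybar) ** 2)
s_xy = np.mean((xs - xbar) * (ys - ybar))

r = s_xy / np.sqrt(s_xx * s_yy)
print(r, np.corrcoef(xs, ys)[0, 1])   # should agree

m = s_xy / s_xx
b = ybar - m * xbar
S = np.sum((ys - (m * xs + b)) ** 2)  # minimal sum of squared errors
print(S / n, s_yy * (1 - r ** 2))     # should agree
```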
A consequence of all these formulas and relationships is that we can think of the variance of $y$ as being "partitioned" as follows.
$$s_{yy} = \frac{S}{n} + r^2 s_{yy}$$
We will interpret each term on the right side as a “variance.”
$$\frac{S}{n} = \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - (mx_i + b)\bigr)^2,$$

which we can think of as the "variance associated with the error."
And note the following:
$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} \bigl((mx_i + b) - \bar{y}\bigr)^2 &= \frac{1}{n}\sum_{i=1}^{n} (mx_i + \bar{y} - m\bar{x} - \bar{y})^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} (mx_i - m\bar{x})^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} \bigl(m(x_i - \bar{x})\bigr)^2 \\
&= m^2\,\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= m^2 s_{xx} \\
&= r^2\,\frac{s_{yy}}{s_{xx}}\,s_{xx} \\
&= r^2 s_{yy}
\end{aligned}$$
Notice also that

$$\bar{y} = m\bar{x} + b.$$

So

$$\frac{1}{n}\sum_{i=1}^{n} \bigl((mx_i + b) - \bar{y}\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n} \bigl((mx_i + b) - (m\bar{x} + b)\bigr)^2 = \text{variance of the predicted values.}$$
This means

$$r^2 = \frac{\text{variance of the predicted } y\text{-values}}{\text{variance of the } y\text{-values}}.$$
Hence we can say colloquially that "$r^2$ gives the percentage of the variance accounted for by the line." $r^2$ is called "the coefficient of determination."
Another way of thinking of this relation is to think of the variance as being "partitioned" as follows:

$$\underbrace{s_{yy}}_{\text{variance of the } y\text{'s}} = \underbrace{\frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - (mx_i + b)\bigr)^2}_{\text{variance of the error}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n} \bigl((mx_i + b) - \bar{y}\bigr)^2}_{\text{variance ``accounted for'' by the best fit line}}$$
In this interpretation, the variance of the $y$'s is "partitioned" between a sum of squares related to the error and the variance of the predicted values. The latter is equal to $r^2 s_{yy}$, so $r^2$ represents the percentage of the variance accounted for by the best fit line.
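The partition can also be verified numerically. Here is a final sketch (same made-up data) confirming that $s_{yy}$ splits into the variance of the errors plus the variance of the predicted values, with the latter equal to $r^2 s_{yy}$.

```python
# Verify the variance partition
# s_yy = (variance of errors) + (variance of predicted values),
# and that the second term equals r**2 * s_yy.
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = xs.mean(), ys.mean()
s_xx = np.mean((xs - xbar) ** 2)
s_yy = np.mean((ys - ybar) ** 2)
s_xy = np.mean((xs - xbar) * (ys - ybar))

m = s_xy / s_xx
b = ybar - m * xbar
r = s_xy / np.sqrt(s_xx * s_yy)

predicted = m * xs + b
var_error = np.mean((ys - predicted) ** 2)       # S/n
var_predicted = np.mean((predicted - ybar) ** 2)

print(s_yy, var_error + var_predicted)  # should agree
print(var_predicted, r ** 2 * s_yy)     # should agree
```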