STEM 698 Theory of Lines of Best Fit

Suppose we have a data set of two related variables:

$$\begin{array}{c|c}
x & y \\ \hline
x_1 & y_1 \\
x_2 & y_2 \\
x_3 & y_3 \\
\vdots & \vdots \\
x_n & y_n
\end{array}$$

We could make a scatter plot of the data as follows.

[Figure: scatter plot of the data, with axes x and y]

If the data seem to follow a linear relationship, it seems reasonable to try to fit a line to the data. How do we go about this?

[Figure: plot with axes x and y]

There are many ways one could go about finding a "best fit" line to the data. Traditionally, the approach to fitting the line focuses on the vertical errors, depicted in red below.

[Figure: scatter plot with a fitted line; the vertical errors are shown in red]

However, there are other possibilities one could consider. For example, one could consider the horizontal errors:

[Figure: scatter plot with a fitted line and the horizontal errors]

Or one could focus on the perpendicular errors:

[Figure: scatter plot with a fitted line and the perpendicular errors]

While in some specialized settings one might use the perpendicular errors, long tradition focuses on the vertical errors. If the line is given by $y = mx + b$, then the vertical error for the point $(x_i, y_i)$ is

Vertical error = (actual $y$ value associated with $x_i$) − (predicted value from the line for $x_i$)

$$\text{Vertical error} = y_i - (m x_i + b)$$

The vertical error for the point $(x_i, y_i)$ is so important that it has a special name: the vertical error for $x_i$ is called the $i$th residual.

Now, even after we have decided which errors we would like to focus on, there are many different approaches we could use to choose a best fit line. We could try to minimize the maximum absolute value of the error, or minimize the sum of the absolute values of the errors. For rather technical reasons, we minimize the sum of the squares of the vertical errors. The sum of the squares of the vertical errors is

$$S = \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2$$

The easiest way to find the $m$ and $b$ that minimize this sum is to take the partial derivatives with respect to $m$ and $b$ and set them equal to zero.
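$$\frac{\partial S}{\partial m} = \sum_{i=1}^{n} 2\bigl(y_i - (m x_i + b)\bigr)(-x_i)$$

$$\frac{\partial S}{\partial b} = \sum_{i=1}^{n} 2\bigl(y_i - (m x_i + b)\bigr)(-1)$$

We set them equal to 0:

$$0 = \sum_{i=1}^{n} 2\bigl(y_i - (m x_i + b)\bigr)(-x_i)$$

$$0 = \sum_{i=1}^{n} 2\bigl(y_i - (m x_i + b)\bigr)(-1)$$

And simplify, dividing each equation by $-2$:

$$0 = \sum_{i=1}^{n} \bigl(x_i y_i - m x_i^2 - b x_i\bigr)$$

$$0 = \sum_{i=1}^{n} \bigl(y_i - m x_i - b\bigr)$$

Break up the sums:

$$0 = \sum_{i=1}^{n} x_i y_i - m \sum_{i=1}^{n} x_i^2 - b \sum_{i=1}^{n} x_i$$

$$0 = \sum_{i=1}^{n} y_i - m \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} b$$

Note that $\sum_{i=1}^{n} b = nb$.

Putting the first sums on the other side of each equation, we get:

$$\sum_{i=1}^{n} x_i y_i = m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i$$

$$\sum_{i=1}^{n} y_i = m \sum_{i=1}^{n} x_i + nb$$

This is a system of two linear equations in the two unknowns $m$ and $b$. Using any method for solving a system of two equations in two unknowns, we get

$$m = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$b = \frac{1}{n} \sum_{i=1}^{n} y_i - m \, \frac{1}{n} \sum_{i=1}^{n} x_i$$

Using the symbols $\bar{x}$ and $\bar{y}$ to denote the average of the $x_i$'s and of the $y_i$'s respectively, the equations for $m$ and $b$ can be written in the more convenient form:

$$m = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and

$$b = \bar{y} - m \bar{x}.$$

We can also explicitly calculate the minimal sum of the squares of the vertical errors, $S$. Substituting the formula for $b$ in the expression $S = \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2$,

$$\begin{aligned}
S &= \sum_{i=1}^{n} \bigl(y_i - (m x_i + (\bar{y} - m \bar{x}))\bigr)^2 \\
&= \sum_{i=1}^{n} \bigl((y_i - \bar{y}) - m (x_i - \bar{x})\bigr)^2 \\
&= \sum_{i=1}^{n} \Bigl[(y_i - \bar{y})^2 - 2m (y_i - \bar{y})(x_i - \bar{x}) + m^2 (x_i - \bar{x})^2\Bigr] \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 - 2m \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) + m^2 \sum_{i=1}^{n} (x_i - \bar{x})^2
\end{aligned}$$

Now substitute the formula for $m$ in this expression:

$$\begin{aligned}
S &= \sum_{i=1}^{n} (y_i - \bar{y})^2 - 2\,\frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} + \frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 - 2\,\frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} + \frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 - \frac{\left(\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}$$

If we use the symbols

$$s_{xx} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2},$$

$$s_{yy} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad s_y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2},$$

and

$$s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$

we can summarize the formulas as

$$m = \frac{s_{xy}}{s_{xx}}, \qquad b = \bar{y} - m \bar{x}, \qquad S = n s_{yy} - n \frac{s_{xy}^2}{s_{xx}} = n \left(s_{yy} - \frac{s_{xy}^2}{s_{xx}}\right).$$

The quantity $r = \dfrac{s_{xy}}{s_x s_y}$ is defined as the correlation coefficient.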
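It is the covariance of the x's and the y's divided by the product of the standard deviations of the x's and the y's. Cancelling the $n$'s, we can write:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$

Note that

$$m = \frac{s_{xy}}{s_{xx}} = \frac{s_{xy}}{s_x s_y} \cdot \frac{s_y}{s_x} = r \, \frac{s_y}{s_x}.$$

Also,

$$S = n \left(s_{yy} - \frac{s_{xy}^2}{s_{xx}}\right) = n s_{yy} \left(1 - \frac{s_{xy}^2}{s_{xx} s_{yy}}\right) = n s_{yy} (1 - r^2).$$

So

$$\frac{S}{n} = s_{yy} (1 - r^2).$$

Since $\frac{S}{n} \ge 0$ and $s_{yy} > 0$ (which is required for $r$ to be defined), this means $1 - r^2 \ge 0$, or $-1 \le r \le 1$. It also means that $\frac{S}{n} \le s_{yy}$, the variance of the y's. The quantity $\frac{S}{n}$ can be interpreted as the "variance of the errors."

A consequence of all these formulas and relationships is that we can think of the variance of $y$ as being "partitioned" as follows.

$$s_{yy} = \frac{S}{n} + r^2 s_{yy}$$

We will interpret each term on the right side as a "variance."

$$\frac{S}{n} = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2,$$

which we can think of as the "variance associated with the error."

And note the following:

$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} \bigl((m x_i + b) - \bar{y}\bigr)^2 &= \frac{1}{n} \sum_{i=1}^{n} (m x_i + \bar{y} - m \bar{x} - \bar{y})^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} (m x_i - m \bar{x})^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \bigl(m (x_i - \bar{x})\bigr)^2 \\
&= m^2 \, \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= m^2 s_{xx} \\
&= r^2 \, \frac{s_{yy}}{s_{xx}} \, s_{xx} \\
&= r^2 s_{yy}
\end{aligned}$$

Notice also that $\bar{y} = m \bar{x} + b$, so $\bar{y}$ is the average of the predicted values $m x_i + b$. So

$$\frac{1}{n} \sum_{i=1}^{n} \bigl((m x_i + b) - \bar{y}\bigr)^2 = \frac{1}{n} \sum_{i=1}^{n} \bigl((m x_i + b) - (m \bar{x} + b)\bigr)^2 = \text{Variance of the predicted values}.$$

This means

$$r^2 = \frac{\text{Variance of predicted } y\text{-values}}{\text{Variance of } y\text{-values}}$$

Hence we can say colloquially that "$r^2$ gives the percentage of the variance accounted for by the line." $r^2$ is called "the coefficient of determination."

Another way of thinking of this relation is to think of the variance as being "partitioned" as follows.

$$\underbrace{s_{yy}}_{\text{Variance of the } y\text{'s}} = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2}_{\text{Variance of the error}} + \underbrace{\frac{1}{n} \sum_{i=1}^{n} \bigl((m x_i + b) - \bar{y}\bigr)^2}_{\text{Variance "accounted for" by the best fit line}}$$

In this interpretation, the variance of the y's is "partitioned" between an averaged sum of squares related to the error and the variance of the predicted values. The latter is equal to $r^2 s_{yy}$, so $r^2$ represents the percentage of the variance accounted for by the best fit line.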
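
As a numerical check of the formulas in these notes, the following is a minimal Python sketch, not part of the original handout. The function name `fit_summary` and the sample data are made up for illustration. It computes $m$, $b$, and $r$ from $s_{xx}$, $s_{yy}$, and $s_{xy}$ (all with the $1/n$ convention), then computes the variance of the error and the variance of the predicted values directly from their definitions, so that the partition of $s_{yy}$ can be checked rather than assumed.

```python
import math


def fit_summary(xs, ys):
    """Least-squares line y = m*x + b and related quantities.

    Hypothetical helper for illustration. Uses the 1/n convention for
    s_xx, s_yy, s_xy and assumes the x's are not all equal and the
    y's are not all equal (so s_xx > 0 and s_yy > 0).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    s_yy = sum((y - y_bar) ** 2 for y in ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

    m = s_xy / s_xx                      # slope
    b = y_bar - m * x_bar                # intercept
    r = s_xy / math.sqrt(s_xx * s_yy)    # correlation coefficient

    # Computed directly from the definitions, not from r.
    predicted = [m * x + b for x in xs]
    error_var = sum((y - p) ** 2 for y, p in zip(ys, predicted)) / n  # S / n
    predicted_var = sum((p - y_bar) ** 2 for p in predicted) / n

    return {
        "m": m, "b": b, "r": r, "s_yy": s_yy,
        "error_var": error_var, "predicted_var": predicted_var,
    }


# Made-up sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = fit_summary(xs, ys)

# Partition of the variance: s_yy = S/n + (variance of predicted values).
assert math.isclose(fit["s_yy"], fit["error_var"] + fit["predicted_var"])
# The variance of the predicted values equals r^2 * s_yy.
assert math.isclose(fit["predicted_var"], fit["r"] ** 2 * fit["s_yy"])

print("m =", fit["m"], " b =", fit["b"], " r^2 =", fit["r"] ** 2)
```

Because `error_var` and `predicted_var` are computed from the residuals and the predicted values rather than from $r$, the two assertions are a genuine check of the identities $s_{yy} = \frac{S}{n} + r^2 s_{yy}$ and $\frac{1}{n} \sum ((m x_i + b) - \bar{y})^2 = r^2 s_{yy}$, up to floating-point rounding.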