252regr 2/26/07 (Open this document in 'Outline' view!) Roger Even Bove

G. LINEAR REGRESSION-Curve Fitting

1. Exact vs. Inexact Relations

2. The Ordinary Least Squares Formula

We wish to estimate the coefficients in $Y = \beta_0 + \beta_1 X$. Our 'prediction' will be $\hat Y = b_0 + b_1 X$ and our error will be $e = Y - \hat Y$, so that $Y = b_0 + b_1 X + e$. (See appendix for derivation.)

$$b_1 = \frac{\sum XY - n\bar X \bar Y}{\sum X^2 - n\bar X^2}, \qquad b_0 = \bar Y - b_1 \bar X$$

3. Example

  i     Y     X     XY    X^2   Y^2
  1     0     0      0     0     0
  2     2     1      2     1     4
  3     1     2      2     4     1
  4     3     1      3     1     9
  5     1     0      0     0     1
  6     3     3      9     9     9
  7     4     4     16    16    16
  8     2     2      4     4     4
  9     1     2      2     4     1
 10     2     1      2     1     4
sum    19    16     40    40    49

First copy $n = 10$, $\sum Y = 19$, $\sum X = 16$, $\sum XY = 40$, $\sum X^2 = 40$ and $\sum Y^2 = 49$. Then compute means: $\bar X = \frac{\sum X}{n} = \frac{16}{10} = 1.60$ and $\bar Y = \frac{\sum Y}{n} = \frac{19}{10} = 1.90$. Use these to compute 'Spare Parts':

$$SS_x = \sum X^2 - n\bar X^2 = 40 - 10(1.60)^2 = 14.40$$
$$SS_y = \sum Y^2 - n\bar Y^2 = 49 - 10(1.9)^2 = 12.90 \quad\text{(Total Sum of Squares, } SST\text{)}$$
$$S_{xy} = \sum XY - n\bar X\bar Y = 40 - 10(1.6)(1.9) = 9.60$$

Note that $SS_x$ and $SS_y$ must be positive, while $S_{xy}$ can be either positive or negative. We can compute the coefficients:

$$b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum XY - n\bar X\bar Y}{\sum X^2 - n\bar X^2} = \frac{9.60}{14.40} = 0.6667$$
$$b_0 = \bar Y - b_1\bar X = 1.90 - 0.6667(1.60) = 0.8333$$

So our regression equation is $\hat Y = 0.8333 + 0.6667X$, or $Y = 0.8333 + 0.6667X + e$.

4. $R^2$, the Coefficient of Determination

$SST = SSR + SSE$, where $SSR = b_1 S_{xy}$ is the Regression (Explained) Sum of Squares. $SSE = SST - SSR$ is the Error (Unexplained or Residual) Sum of Squares, and is defined as $\sum(Y - \hat Y)^2$, a formula that should never be used for computation.

$$R^2 = \frac{SSR}{SST} = \frac{b_1 S_{xy}}{SS_y} = \frac{S_{xy}^2}{SS_x SS_y} = \frac{9.6^2}{14.40(12.90)} = .4961$$

An alternate formula, if no spare parts have been computed, is

$$R^2 = \frac{b_0\sum Y + b_1\sum XY - n\bar Y^2}{\sum Y^2 - n\bar Y^2}$$

$R^2 = r^2$. The coefficient of determination is the square of the correlation. Note that $S_{xy}$, $b_1$, and $r$ all have the same sign.

H. LINEAR REGRESSION-Simple Regression

1. Fitting a Line

2. The Gauss-Markov Theorem

OLS is BLUE (the Best Linear Unbiased Estimator).

3. Standard Errors

The standard error is defined as $s_e^2 = \frac{SSE}{n-2} = \frac{\sum(Y - \hat Y)^2}{n-2}$.

$$s_e^2 = \frac{SSE}{n-2} = \frac{SST - SSR}{n-2} = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{SS_y - b_1^2 SS_x}{n-2}$$

Or, if no spare parts are available, $s_e^2 = \frac{\sum Y^2 - b_0\sum Y - b_1\sum XY}{n-2}$. Note also that if $R^2$ is available, $s_e^2 = \frac{SS_y\left(1 - R^2\right)}{n-2}$.

Using data from G3, and using our spare parts $SS_x = 14.40$, $SS_y = 12.90 = SST$, and $S_{xy} = 9.60$:

$$s_e^2 = \frac{\sum Y^2 - n\bar Y^2 - b_1\left(\sum XY - n\bar X\bar Y\right)}{n-2} = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{12.90 - 0.6667(9.60)}{8} = 0.8125$$

4. The Variance of $b_0$ and $b_1$

$$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar X^2}{SS_x}\right) \qquad\text{and}\qquad s_{b_1}^2 = s_e^2\left(\frac{1}{SS_x}\right)$$

I. LINEAR REGRESSION-Confidence Intervals and Tests

1. Confidence Intervals for $b_1$

$\beta_1 = b_1 \pm t_{\alpha/2} s_{b_1}$ with $df = n - 2$. The interval can be made smaller by increasing either $n$ or the amount of variation in $x$.

2. Tests for $b_1$

To test $H_0: \beta_1 = \beta_{10}$ against $H_1: \beta_1 \ne \beta_{10}$, use $t = \frac{b_1 - \beta_{10}}{s_{b_1}}$. Remember $\beta_{10}$ is most often zero - and if the null hypothesis is false in that case we say that $\beta_1$ is significant. To continue the example in G3:

$$R^2 = \frac{S_{xy}^2}{SS_x SS_y} = \frac{9.6^2}{14.40(12.90)} = .4961 \quad\text{or}\quad R^2 = \frac{b_0\sum Y + b_1\sum XY - n\bar Y^2}{SS_y} = \frac{0.8333(19) + 0.6667(40) - 10(1.90)^2}{12.90} = .4961$$

$SSR = b_1 S_{xy} = 0.6667(9.60) = 6.400$. We have already computed $s_e^2 = 0.8125$, which implies that $s_{b_1}^2 = s_e^2\left(\frac{1}{SS_x}\right) = 0.8125\left(\frac{1}{14.40}\right) = 0.0564$ and $s_{b_1} = \sqrt{0.0564} = 0.2374$.

The significance test is now $H_0: \beta_1 = 0$ against $H_1: \beta_1 \ne 0$ with $df = n - 2 = 10 - 2 = 8$. Assume that $\alpha = .05$, so that for a 2-sided test $t_{\alpha/2}^{n-2} = t_{.025}^{8} = 2.306$, and we reject the null hypothesis if $t$ is below $-2.306$ or above $2.306$. Since $t = \frac{b_1 - 0}{s_{b_1}} = \frac{0.6667}{0.2374} = 2.809$ is in the rejection region, we say that $\beta_1$ is significant. A further test says that $\beta_1$ is not significantly different from 1. If we want a confidence interval, $\beta_1 = b_1 \pm t_{\alpha/2} s_{b_1} = 0.6667 \pm 2.306(0.2374) = 0.667 \pm 0.547$. Note that this interval includes 1, but not zero.
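The arithmetic in G3 through I2 is easy to mistype, so here is a minimal Python sketch that recomputes it from the raw data. This is my check, not part of the original handout; the variable names simply mirror the handout's 'spare parts' notation, and the critical value 2.306 is copied from a t table as in the text.

```python
# A check of sections G3-I2: recompute the spare parts, coefficients,
# R-squared, and the slope test from the ten raw data points.
from math import sqrt

y = [0, 2, 1, 3, 1, 3, 4, 2, 1, 2]
x = [0, 1, 2, 1, 0, 3, 4, 2, 2, 1]
n = len(x)

x_bar = sum(x) / n                                     # 1.60
y_bar = sum(y) / n                                     # 1.90

# 'Spare parts'
ss_x = sum(xi**2 for xi in x) - n * x_bar**2           # 14.40
ss_y = sum(yi**2 for yi in y) - n * y_bar**2           # 12.90 (= SST)
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar  # 9.60

# Coefficients and R-squared
b1 = s_xy / ss_x                                       # 0.6667
b0 = y_bar - b1 * x_bar                                # 0.8333
r_sq = s_xy**2 / (ss_x * ss_y)                         # 0.4961

# Standard error and the significance test for the slope
s_e_sq = (ss_y - b1 * s_xy) / (n - 2)                  # 0.8125
s_b1 = sqrt(s_e_sq / ss_x)                             # about 0.2375
t_ratio = b1 / s_b1                                    # about 2.81 (text: 2.809)
t_crit = 2.306        # t(.025, df = 8), read from a t table as in section I2

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}, R^2 = {r_sq:.4f}")
print(f"t = {t_ratio:.3f}; reject H0: beta1 = 0 since |t| > {t_crit}")
```

The last decimal place can differ slightly from the handout, which rounds intermediate results before reusing them.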
Note that since $s_x^2 = \frac{SS_x}{n-1} = \frac{\sum X^2 - n\bar X^2}{n-1}$, we can write $s_{b_1}^2 = s_e^2\left(\frac{1}{\sum X^2 - n\bar X^2}\right) = \frac{s_e^2}{(n-1)s_x^2}$. This indicates that both a large sample size, $n$, and a large variance of $x$ will tend to make $s_{b_1}^2$ smaller and thus decrease the size of a confidence interval for $\beta_1$ or increase the size (and significance) of the t-ratio. To put it more negatively, small amounts of variation in $x$ or small sample sizes will tend to produce values of $b_1$ that are not significant. The common-sense interpretation of this statement is that we need a lot of experience with what happens to $y$ when we vary $x$ to be able to put any confidence in our estimate of the slope of the equation that relates them.

3. Confidence Intervals and Tests for $b_0$

We are now testing $H_0: \beta_0 = \beta_{00}$ against $H_1: \beta_0 \ne \beta_{00}$ with $t = \frac{b_0 - \beta_{00}}{s_{b_0}}$.

$$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar X^2}{SS_x}\right) = 0.8125\left(\frac{1}{10} + \frac{1.60^2}{14.40}\right) = 0.8125(0.2778) = 0.2256$$

So $s_{b_0} = \sqrt{0.2256} = 0.4749$. If we are testing $H_0: \beta_0 = 0$ against $H_1: \beta_0 \ne 0$, then $t = \frac{b_0 - 0}{s_{b_0}} = \frac{0.8333}{0.4749} = 1.754$. Since the rejection region is the same as in I2, we accept the null hypothesis and say that $\beta_0$ is not significant. A confidence interval would be $\beta_0 = b_0 \pm t_{\alpha/2} s_{b_0} = 0.8333 \pm 2.306(0.4749) = 0.833 \pm 1.095$.

A common way to summarize our results is

$$\hat Y = \underset{(0.4749)}{0.8333} + \underset{(0.2374)}{0.6667}\,X$$

The equation is written with the standard deviations of the coefficients below the equation. For a Minitab printout example of a simple regression problem, see 252regrex1.

4. Prediction and Confidence Intervals for $y$

The Confidence Interval is $Y_0 = \hat Y_0 \pm t\, s_{\hat Y}$, where $s_{\hat Y}^2 = s_e^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{SS_x}\right]$, and the Prediction Interval is $Y_0 = \hat Y_0 \pm t\, s_Y$, where $s_Y^2 = s_e^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{SS_x} + 1\right]$. In these two formulas, for some specific $X_0$, $\hat Y_0 = b_0 + b_1 X_0$. For example, assume that $X_0 = 5$, so that for the results in G3, $\hat Y_0 = 0.8333 + 0.6667(5) = 4.168$. Then

$$s_{\hat Y}^2 = s_e^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{SS_x}\right] = 0.8125\left[\frac{1}{10} + \frac{(5 - 1.6)^2}{14.40}\right] = 0.733$$

and $s_{\hat Y} = \sqrt{0.733} = 0.856$, so that the confidence interval is $Y_0 = \hat Y_0 \pm t\, s_{\hat Y} = 4.168 \pm 2.306(0.856) = 4.168 \pm 1.974$. This represents a confidence interval for the average value that $Y$ will take when $X_0 = 5$. For the same data,

$$s_Y^2 = s_e^2\left[\frac{1}{n} + \frac{(X_0 - \bar X)^2}{SS_x} + 1\right] = 0.8125\left[\frac{1}{10} + \frac{(5 - 1.6)^2}{14.40} + 1\right] = 1.545$$

and $s_Y = \sqrt{1.545} = 1.243$, so that the prediction interval is $Y_0 = \hat Y_0 \pm t\, s_Y = 4.168 \pm 2.306(1.243) = 4.168 \pm 2.866$. This is a confidence interval for the value that $Y$ will take in a particular instance when $X_0 = 5$.

Ignore the remainder of this document unless you have had calculus!
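As with the slope test, the interval arithmetic in I4 can be checked by machine. The sketch below is again my addition rather than the handout's; it hard-codes the quantities already computed in G3 and H3 so that it stands alone, and evaluates both intervals at $X_0 = 5$. Last digits may differ slightly from the text, which rounds intermediate values.

```python
# A standalone check of the section I4 confidence and prediction intervals.
from math import sqrt

n, x_bar, ss_x = 10, 1.60, 14.40     # sample size, mean of X, and SS_x from G3
b0, b1 = 0.8333, 0.6667              # fitted coefficients from G3
s_e_sq = 0.8125                      # error variance from H3
t_crit = 2.306                       # t(.025, df = 8)

x0 = 5.0
y_hat0 = b0 + b1 * x0                # about 4.167 (text: 4.168)

# Variance of the estimated mean of Y at x0 (confidence interval) ...
s_yhat_sq = s_e_sq * (1/n + (x0 - x_bar)**2 / ss_x)       # about 0.733
# ... and of a single new observation at x0 (prediction interval)
s_y_sq = s_e_sq * (1/n + (x0 - x_bar)**2 / ss_x + 1)      # about 1.546

print(f"CI for mean Y at X0=5: {y_hat0:.3f} +/- {t_crit * sqrt(s_yhat_sq):.3f}")
print(f"PI for one Y at X0=5:  {y_hat0:.3f} +/- {t_crit * sqrt(s_y_sq):.3f}")
```

The only difference between the two formulas is the extra "+ 1" in the prediction variance, which is why the prediction interval for a single observation is always wider than the confidence interval for the mean.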
Appendix to G2 - Explanation of OLS Formula

Assume that we have three points: $(X_1, Y_1)$, $(X_2, Y_2)$ and $(X_3, Y_3)$. We wish to fit a regression line to these points, with the equation $\hat Y = b_0 + b_1 X$ and the characteristic that the sum of squares, $SS = \sum(Y - \hat Y)^2$, is a minimum. If we imagine that there is a 'true' regression line $Y = \beta_0 + \beta_1 X$, we can consider $b_0$ and $b_1$ to be estimates of $\beta_0$ and $\beta_1$. Let us make the definition $e = Y - \hat Y$. Note that if we substitute our equation for $\hat Y$, we find that $e = Y - \hat Y = Y - (b_0 + b_1 X)$, or $Y = b_0 + b_1 X + e$. This has two consequences: first, the sum of squares can be written as $SS = \sum(Y - \hat Y)^2 = \sum(Y - b_0 - b_1 X)^2 = \sum e^2$; and second, that if we fit the line so that $\sum e = 0$, or the mean of $Y$ and $\hat Y$ is the same, we have $\bar Y = b_0 + b_1\bar X$.

Now if we subtract the equation for $\bar Y$ from the equation for $Y$, we find $Y - \bar Y = b_1(X - \bar X) + e$. Now let us measure $X$ and $Y$ as deviations from the mean, replacing $X$ with $\tilde X = X - \bar X$ and $Y$ with $\tilde Y = Y - \bar Y$. This means that $\tilde Y = b_1\tilde X + e$, or $e = \tilde Y - b_1\tilde X$. If we substitute this expression in our sum of squares, we find that $SS = \sum e^2 = \sum\left(\tilde Y - b_1\tilde X\right)^2$. Now write this expression out in terms of our three points and differentiate it to minimize $SS$ with respect to $b_1$.

To do this, recall that $b_1$ is our unknown and that the $\tilde X$s and $\tilde Y$s are numbers (constants!), so that $\frac{d}{db_1}\left(b_1\sum\tilde X\tilde Y\right) = \sum\tilde X\tilde Y$ and $\frac{d}{db_1}\left(b_1^2\sum\tilde X^2\right) = 2b_1\sum\tilde X^2$.

$$SS = \sum e^2 = \left(\tilde Y_1 - b_1\tilde X_1\right)^2 + \left(\tilde Y_2 - b_1\tilde X_2\right)^2 + \left(\tilde Y_3 - b_1\tilde X_3\right)^2$$
$$= \left(\tilde Y_1^2 - 2b_1\tilde X_1\tilde Y_1 + b_1^2\tilde X_1^2\right) + \left(\tilde Y_2^2 - 2b_1\tilde X_2\tilde Y_2 + b_1^2\tilde X_2^2\right) + \left(\tilde Y_3^2 - 2b_1\tilde X_3\tilde Y_3 + b_1^2\tilde X_3^2\right)$$

If we now take a derivative of this expression with respect to $b_1$ and set it equal to zero to find a minimum, we find that:

$$\frac{d}{db_1}SS = 0 = \left(-2\tilde X_1\tilde Y_1 + 2b_1\tilde X_1^2\right) + \left(-2\tilde X_2\tilde Y_2 + 2b_1\tilde X_2^2\right) + \left(-2\tilde X_3\tilde Y_3 + 2b_1\tilde X_3^2\right) = -2\sum\tilde X\tilde Y + 2b_1\sum\tilde X^2 = -2\left(\sum\tilde X\tilde Y - b_1\sum\tilde X^2\right)$$

But if $-2\left(\sum\tilde X\tilde Y - b_1\sum\tilde X^2\right) = 0$, then $\sum\tilde X\tilde Y - b_1\sum\tilde X^2 = 0$, so $\sum\tilde X\tilde Y = b_1\sum\tilde X^2$, so that if we solve for $b_1$, we find $b_1 = \frac{\sum\tilde X\tilde Y}{\sum\tilde X^2}$. But if we remember that $\tilde X = X - \bar X$ and $\tilde Y = Y - \bar Y$, we can write this as

$$b_1 = \frac{\sum\left(X - \bar X\right)\left(Y - \bar Y\right)}{\sum\left(X - \bar X\right)^2} = \frac{\sum XY - n\bar X\bar Y}{\sum X^2 - n\bar X^2}$$

Of course, we still need $b_0$, but remember that $\bar Y = b_0 + b_1\bar X$, so that $b_0 = \bar Y - b_1\bar X$.
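To connect the appendix back to section G3, here is a brief numeric check (my addition, not the handout's) that the deviations form $b_1 = \sum\tilde X\tilde Y / \sum\tilde X^2$ gives the same slope as the computational formula on the ten data points.

```python
# Verify numerically that the appendix's deviations formula for b1
# matches the computational formula used in section G2.
y = [0, 2, 1, 3, 1, 3, 4, 2, 1, 2]
x = [0, 1, 2, 1, 0, 3, 4, 2, 2, 1]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Deviations form: b1 = sum(X~ Y~) / sum(X~^2)
x_dev = [xi - x_bar for xi in x]
y_dev = [yi - y_bar for yi in y]
b1_dev = sum(xd * yd for xd, yd in zip(x_dev, y_dev)) / sum(xd**2 for xd in x_dev)

# Computational form: b1 = (sum XY - n XbarYbar) / (sum X^2 - n Xbar^2)
b1_comp = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) \
          / (sum(xi**2 for xi in x) - n * x_bar**2)

print(b1_dev, b1_comp)   # both 0.6667, as in section G3
```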