SIMPLE REGRESSION CALCULATIONS

Here are the calculations needed to do a simple regression.

Aside: The word simple here refers to the use of just one x to predict y. Problems in which two or more variables are used to predict y are called multiple regression.

The input data are $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. The outputs in which we are interested (so far) are the values of $b_1$ (estimated regression slope) and $b_0$ (estimated regression intercept). These will allow us to write the fitted regression line $\hat{Y} = b_0 + b_1 x$.

(1)  Find the five sums
     $\sum_{i=1}^{n} x_i$,  $\sum_{i=1}^{n} y_i$,  $\sum_{i=1}^{n} x_i^2$,  $\sum_{i=1}^{n} y_i^2$,  $\sum_{i=1}^{n} x_i y_i$.

(2)  Find the five expressions
     $\bar{x}$,  $\bar{y}$,
     $S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$,
     $S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$,
     $S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$.

(3)  Give the slope estimate as $b_1 = \frac{S_{xy}}{S_{xx}}$ and the intercept estimate as $b_0 = \bar{y} - b_1 \bar{x}$.

(4)  For later use, record $S_{yy|x} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$.

Virtually all the calculations for simple regression are based on the five quantities found in step (2).

The regression fitting procedure is known as least squares. It gets this name because the resulting values of $b_0$ and $b_1$ minimize the expression $\sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2$. This is a good criterion to optimize for many reasons, but understanding these reasons will force us to go into the regression model.

As an example, consider a data set with n = 10 and with

     $\sum x_i = 200$          $\sum x_i^2 = 4{,}250$
     $\sum y_i = 1{,}000$      $\sum y_i^2 = 106{,}250$
     $\sum x_i y_i = 20{,}750$

It follows that

     $\bar{x} = \frac{200}{10} = 20$          $\bar{y} = \frac{1{,}000}{10} = 100$

     $S_{xx} = 4{,}250 - \frac{200^2}{10} = 250$          $S_{xy} = 20{,}750 - \frac{200 \times 1{,}000}{10} = 750$

It follows next that $b_1 = \frac{S_{xy}}{S_{xx}} = \frac{750}{250} = 3$ and $b_0 = \bar{y} - b_1 \bar{x} = 100 - 3(20) = 40$.

The fitted regression line would be given as $\hat{Y} = 40 + 3x$.

We could note also $S_{yy} = 106{,}250 - \frac{1{,}000^2}{10} = 6{,}250$. Then $S_{yy|x} = 6{,}250 - \frac{750^2}{250} = 4{,}000$.

We use $S_{yy|x}$ to get $s$, the estimate of the noise standard deviation. The relationship is $s = \sqrt{\frac{S_{yy|x}}{n-2}}$, and here that value is $\sqrt{\frac{4{,}000}{10-2}} = \sqrt{500} \approx 22.36$.

In fact, we can use these simple quantities to compute the regression analysis of variance table. The table is built on the identity

     SStotal = SSregression + SSresidual

The quantity SSresidual is often named SSerror. The subscripts are often abbreviated. Thus, you will see reference to SStot, SSregr, SSresid, and SSerr. For the simple regression case, these are computed as

     $SS_{tot} = S_{yy}$          $SS_{regr} = \frac{S_{xy}^2}{S_{xx}}$          $SS_{resid} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$

The analysis of variance table for simple regression is set up as follows:

Source of Variation   Degrees of freedom   Sum of Squares                 Mean Squares                            F
Regression            1                    $S_{xy}^2 / S_{xx}$            $S_{xy}^2 / S_{xx}$                     $MS_{Regression} / MS_{Resid}$
Residual              n - 2                $S_{yy} - S_{xy}^2 / S_{xx}$   $(S_{yy} - S_{xy}^2 / S_{xx}) / (n-2)$
Total                 n - 1                $S_{yy}$

For the data set used here, the analysis of variance table would be

Source of Variation   Degrees of freedom   Sum of Squares   Mean Squares   F
Regression            1                    2,250            2,250          4.50
Residual              8                    4,000            500
Total                 9                    6,250

Just for the record, let's note some other computations commonly done for regression. The information given next applies to regressions with K predictors. To see the forms for simple regression, just use K = 1 as needed.

The estimate for the noise standard deviation is the square root of the mean square in the residual line. This is $\sqrt{500} \approx 22.36$, as noted previously. The symbol $s$ is frequently used for this, as are $s_{Y|X}$ and $s_\varepsilon$.
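To make the arithmetic concrete, here is a minimal Python sketch that reproduces the worked example directly from the five sums in step (1). The function name and variable names are illustrative, not part of the handout.

```python
import math

def simple_regression(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Compute the simple-regression quantities from the five sums."""
    xbar, ybar = sum_x / n, sum_y / n
    Sxx = sum_x2 - sum_x ** 2 / n
    Syy = sum_y2 - sum_y ** 2 / n
    Sxy = sum_xy - sum_x * sum_y / n
    b1 = Sxy / Sxx                       # slope estimate
    b0 = ybar - b1 * xbar                # intercept estimate
    Syy_x = Syy - Sxy ** 2 / Sxx         # S_{yy|x}, the residual sum of squares
    s = math.sqrt(Syy_x / (n - 2))       # noise standard deviation estimate
    SSregr = Sxy ** 2 / Sxx              # regression sum of squares
    F = SSregr / (Syy_x / (n - 2))       # F statistic from the ANOVA table
    return b0, b1, Syy, Syy_x, s, SSregr, F

# Worked example: n = 10 and the five sums given in the handout
print(simple_regression(10, 200, 1_000, 4_250, 106_250, 20_750))
# b0 = 40.0, b1 = 3.0, Syy = 6250.0, Syy|x = 4000.0,
# s ≈ 22.36, SSregr = 2250.0, F = 4.5
```

Running this reproduces the fitted line $\hat{Y} = 40 + 3x$ and the entries of the numeric analysis of variance table above.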
The $R^2$ statistic is the ratio $\frac{SS_{regr}}{SS_{tot}}$, which is here $\frac{2{,}250}{6{,}250} = 0.36$.

The standard deviation of Y can be given as $\sqrt{\frac{SS_{tot}}{n-1}}$, which is here $\sqrt{\frac{6{,}250}{9}} = \sqrt{694.4444} \approx 26.35$.

It is sometimes interesting to compare $s$ (the estimate for the noise standard deviation) to $s_Y$ (the standard deviation of Y). It can be shown that the ratio of these is

     $\frac{s}{s_Y} = \sqrt{\frac{(n-1)\left(1 - R^2\right)}{n - 1 - K}}$

The quantity $1 - \left(\frac{s}{s_Y}\right)^2 = 1 - \frac{(n-1)\left(1 - R^2\right)}{n - 1 - K}$ is called the adjusted $R^2$ statistic, $R^2_{adj}$.
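As a check on these summary statistics, here is a short continuation of the illustrative Python sketch above, using the quantities already computed for the worked example (K = 1 for simple regression).

```python
import math

n, K = 10, 1
SSregr, SStot = 2_250, 6_250            # from the ANOVA table above
s = math.sqrt(500)                      # noise standard deviation estimate

R2 = SSregr / SStot                             # R-squared = 0.36
sY = math.sqrt(SStot / (n - 1))                 # standard deviation of Y ≈ 26.35
ratio = math.sqrt((n - 1) * (1 - R2) / (n - 1 - K))   # s / sY ≈ 0.85
R2_adj = 1 - (n - 1) * (1 - R2) / (n - 1 - K)   # adjusted R-squared = 0.28

print(R2, sY, s / sY, ratio, R2_adj)
```

The printed value of `ratio` matches `s / sY`, confirming the identity relating $s$, $s_Y$, and $R^2$.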