Formulas and Relationships from Simple & Multiple Linear Regression

I. Basics for Simple Linear Regression

Let $(X, Y)$ be sample data from a bivariate normal population $(X', Y')$ (technically we have $(x_i, y_i)$, $i = 1, \dots, n$, where $n$ is the sample size, and we will use the notation $\sum x$ for $\sum_{i=1}^{n} x_i$). Then we have the following sample statistics:

$$\bar{x} = \frac{\sum x}{n} \quad \text{(sample mean for } X\text{)}, \qquad s_X^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad \text{(sample variance for } X\text{)}$$

$$\bar{y} = \frac{\sum y}{n} \quad \text{(sample mean for } Y\text{)}, \qquad s_Y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \quad \text{(sample variance for } Y\text{)}$$

We will also use the following "sums of squares":

$$SSX = \sum (x - \bar{x})^2 \qquad \text{and} \qquad SSY = \sum (y - \bar{y})^2$$

Note: Sometimes we refer to $SSY$ as $SST$. We have the following relationships between the sample statistics and the sums of squares:

$$SSX = (n - 1)s_X^2 \;\text{ or }\; s_X^2 = \frac{SSX}{n - 1}, \qquad SSY = (n - 1)s_Y^2 \;\text{ or }\; s_Y^2 = \frac{SSY}{n - 1}$$

The sample covariance between $X$ and $Y$ is defined by

$$COV(X, Y) = s_{XY} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

The sample correlation coefficient is defined as

$$r = COV(z_X, z_Y) = \frac{\sum z_X z_Y}{n - 1}, \qquad \text{where } z_X = \frac{x - \bar{x}}{s_X} \text{ and } z_Y = \frac{y - \bar{y}}{s_Y}.$$

By plugging the formulas for $z_X$ and $z_Y$ into the formula for $r$ we can easily derive that

$$r = \frac{s_{XY}}{s_X s_Y}.$$

The least squares regression line for the data has the form $\hat{y} = bx + a$, where

$$b = r\,\frac{s_Y}{s_X} \qquad \text{and} \qquad a = \bar{y} - b\,\bar{x}.$$

Associated with the regression we have some additional "sums of squares":

$$SSresid = \sum (y - \hat{y})^2 \qquad \text{and} \qquad SSregress = \sum (\hat{y} - \bar{y})^2$$

where, for each value of $x$ in the sample data, $y$ is the corresponding $y$-coordinate and $\hat{y}$ is the predicted value from the regression line when $x$ is used as the predictor (input) to the line. It is easy to see, both symbolically and graphically, that $y - \bar{y} = (y - \hat{y}) + (\hat{y} - \bar{y})$. However, the more interesting result is that $SST = SSY = SSresid + SSregress$, i.e.

$$SSY = \sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2 = SSresid + SSregress.$$

This decomposes the total variation in $Y$ into the sum of the "explained variation" ($SSregress$) and the "unexplained variation" ($SSresid$). Remember, the least squares regression line is the line that fits the data in a way that minimizes the unexplained variation. Also note that if the data fit a line perfectly, then $SSresid = 0$. One can derive that the square of the correlation coefficient can be written in terms of these sums of squares:

$$r^2 = \frac{SST - SSresid}{SST} = \frac{SSregress}{SST}$$

So here again we see that $r$ measures how strongly the data fit a line, since if the data are perfectly linear, then $r^2 = 1$. $r^2$ is called the coefficient of determination. Another relationship that can be easily derived from the formula for $r^2$ is

$$1 - r^2 = \frac{SSresid}{SST}$$

We will define the standard error of the linear regression to be

$$s_e = \sqrt{\frac{SSresid}{n - 2}}$$

We will use this standard error to define the standard errors of random variables associated with the linear regression and thus to test hypotheses about these random variables. We will also use the standard error of the linear regression to make inferences about the predicted values of $y$ for a fixed value of $x$.

II. Testing the correlation coefficient.

The population correlation coefficient is represented by the symbol $\rho$. We will use the sample correlation coefficient $r$, obtained from the sample $(X, Y)$ data, to test whether there is a significant linear relationship between the population variables $X'$ and $Y'$. If one assumes that $\rho = 0$, then it can be shown that the standard error for the random variable $r$ is given by

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

and that the test statistic

$$t(r) = \frac{r}{s_r} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

has a Student-$t$ distribution with $n - 2$ degrees of freedom. Thus we can test the null hypothesis $H_0: \rho = 0$ against appropriate alternative hypotheses using this test statistic.
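To make these formulas concrete, here is a minimal computational sketch (not part of the original handout) that checks the Section I identities and carries out the Section II $t$-test for $\rho = 0$. The data values are hypothetical, chosen only for illustration, and the sketch assumes numpy and scipy are available; the variable names (ss_x, ss_resid, etc.) simply mirror the notation above.

```python
import numpy as np
from scipy import stats

# Illustrative sample data (hypothetical values, for checking the formulas only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)             # sample standard deviations
ss_x = np.sum((x - x_bar) ** 2)                     # SSX = (n-1) * s_x^2
ss_y = np.sum((y - y_bar) ** 2)                     # SSY = SST = (n-1) * s_y^2
s_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)  # sample covariance

r = s_xy / (s_x * s_y)                              # sample correlation coefficient
b = r * s_y / s_x                                   # slope of the least squares line
a = y_bar - b * x_bar                               # intercept

y_hat = b * x + a
ss_resid = np.sum((y - y_hat) ** 2)                 # unexplained variation
ss_regress = np.sum((y_hat - y_bar) ** 2)           # explained variation

print(np.isclose(ss_y, ss_resid + ss_regress))      # SST = SSresid + SSregress
print(np.isclose(r ** 2, ss_regress / ss_y))        # r^2 = SSregress / SST

# Standard error of the regression and the t-test for H0: rho = 0
s_e = np.sqrt(ss_resid / (n - 2))
s_r = np.sqrt((1 - r ** 2) / (n - 2))
t_r = r / s_r
p_value = 2 * stats.t.sf(abs(t_r), df=n - 2)        # two-sided p-value
print(t_r, p_value)
```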
The standard error for $r$, $s_r$, can also be written in terms of the standard error of the linear regression and the total sum of squares:

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}} = \sqrt{\frac{SSresid/SST}{n - 2}} = \frac{\sqrt{SSresid/(n - 2)}}{\sqrt{SST}} = \frac{s_e}{\sqrt{SST}} = \frac{s_e}{\sqrt{SSY}}$$

III. Testing the slope of the regression line.

The population regression line slope is represented by the symbol $\beta$. We will use the slope of the regression line that fits the sample $(X, Y)$ data, $b$, to make inferences about the slope of the regression line that would fit the population variables $X'$ and $Y'$, i.e. $\beta$. If one assumes that $\beta = 0$, then it can be shown that the standard error for the random variable $b$ is given by

$$s_b = \frac{s_e}{\sqrt{SSX}}$$

and that the test statistic $t(b) = \dfrac{b}{s_b}$ has a Student-$t$ distribution with $n - 2$ degrees of freedom. Thus we can test the null hypothesis $H_0: \beta = 0$ against appropriate alternative hypotheses using this test statistic. When conducting this test we are essentially determining whether the "independent effect" of the variable $X'$ on the predicted (or criterion) variable $Y'$ is significant. The independent effect is measured by the slope; that is, a change in $X'$ produces a proportional change in $Y'$ as measured by $\beta$, i.e. $\beta = \Delta Y' / \Delta X'$.

Now here is a very interesting result. You should be able to follow this derivation by seeing how we are using the formulas above:

$$t(b) = \frac{b}{s_b} = \frac{b\sqrt{SSX}}{s_e} = r\,\frac{s_Y}{s_X}\cdot\frac{\sqrt{SSX}}{s_e} = r\,\frac{\sqrt{SSY/(n-1)}}{\sqrt{SSX/(n-1)}}\cdot\frac{\sqrt{SSX}}{s_e} = \frac{r\sqrt{SSY}}{s_e} = \frac{r}{s_e/\sqrt{SSY}} = \frac{r}{s_r} = t(r)$$

Thus when testing $\rho = 0$ or $\beta = 0$ the value of the $t$-statistic will be the same!

IV. Testing the coefficient of determination.

Another test that we will use to test the significance of the linear regression is the $F$-test. The $F$-test is a test about the population coefficient of determination ($\rho^2$). The null hypothesis for the $F$-test is $H_0: \rho^2 = 0$ with alternative hypothesis $H_a: \rho^2 > 0$. The $F$ distribution is used to test whether two variances from independent populations are equal. This is done by testing the ratio of the two variances. The $F$ distribution has two degrees-of-freedom parameters: the degrees of freedom of the numerator variance and the degrees of freedom of the denominator variance. In the case of linear regression we will use the following $F$-test statistic:

$$F = \frac{SSregress / k}{SSresid / (n - k - 1)}$$

where $k$ is the number of independent predictors (so far $k = 1$, but this will change once we get to multiple regression) and $n$ is the number of data points in the sample. The numerator of the $F$-test statistic is the variance in the regression ($SSregress / k$) and is also called the mean square regression. The denominator of the $F$-test statistic is the variance in the residual ($SSresid / (n - k - 1)$) and is also called the mean square residual. We think of the $F$-test statistic as the ratio of the explained variance to the unexplained variance. Since in regression our goal is to minimize the unexplained variance, to have a significant result we would expect the $F$-test statistic to be greater than 1.

Consider the case when $k = 1$ (i.e. simple, or single variable, linear regression). Then

$$F = \frac{SSregress / k}{SSresid / (n - k - 1)} = \frac{SSregress}{SSresid / (n - 2)} = \frac{SSregress}{s_e^2}.$$

Another way to think of what the $F$-test statistic measures is to consider a comparison of how well the regression line ($\hat{y} = bx + a$) fits the data versus how well the line $\hat{y} = \bar{y}$ fits the data. The more the regression line explains the variance in $Y$ (i.e. the higher the ratio of the variance in the regression to the variance in the residual), the more significant the result.
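The following sketch (again an illustration, not part of the handout, using the same hypothetical data as above) checks the Section III slope test and the Section IV $F$-test numerically: $t(b) = t(r)$, $F = SSregress / s_e^2$ when $k = 1$, and, anticipating the next section, $F = t(b)^2$ with matching $p$-values.

```python
import numpy as np
from scipy import stats

# Same hypothetical data as in the previous sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(x)

# Least squares slope/intercept and the sums of squares from Section I
b, a = np.polyfit(x, y, deg=1)
y_hat = b * x + a
ss_x = np.sum((x - x.mean()) ** 2)
ss_y = np.sum((y - y.mean()) ** 2)                   # SST
ss_resid = np.sum((y - y_hat) ** 2)
ss_regress = np.sum((y_hat - y.mean()) ** 2)
s_e = np.sqrt(ss_resid / (n - 2))
r = np.corrcoef(x, y)[0, 1]

# t statistic for the slope equals the t statistic for r
s_b = s_e / np.sqrt(ss_x)
t_b = b / s_b
t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
print(np.isclose(t_b, t_r))                          # t(b) = t(r)

# F statistic for k = 1 and its equivalence to the slope t-test
k = 1
F = (ss_regress / k) / (ss_resid / (n - k - 1))
print(np.isclose(F, ss_regress / s_e ** 2))          # F = SSregress / s_e^2
print(np.isclose(F, t_b ** 2))                       # F = t(b)^2
p_f = stats.f.sf(F, dfn=k, dfd=n - k - 1)            # F-test p-value
p_t = 2 * stats.t.sf(abs(t_b), df=n - 2)             # two-sided slope t-test p-value
print(np.isclose(p_f, p_t))                          # same p-value
```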
Note that the line $\hat{y} = \bar{y}$ has slope equal to zero, so the $F$-test is essentially a test about the slope of the regression line (i.e. if the slope of the regression line is zero, then the $F$-test statistic will be zero since $SSregress = 0$). Recall the $t$-test statistic for testing $H_0: \beta = 0$ against appropriate alternative hypotheses:

$$t(b) = \frac{b}{s_b} = \frac{r\sqrt{SSY}}{s_e} = t(r)$$

Thus we have

$$t(b)^2 = \frac{r^2\, SST}{s_e^2} = \frac{SSregress}{SST}\cdot\frac{SST}{s_e^2} = \frac{SSregress}{s_e^2} = F$$

So the $F$-test statistic is the square of the $t$-test statistic for the slope of the regression line. Thus it turns out that in single variable regression, using the $F$-test statistic is equivalent to testing whether the slope of the regression line is zero! When doing simple linear regression, if you check your computer output, the $p$-value for the $F$-test statistic and the $p$-value for the $t$-test statistic for the slope will be the same value! There will be an analogy to this test in multivariable regression.

V. Basics for Multiple Regression

Let $(X_1, X_2, \dots, X_k, Y)$ be sample data from a multivariate normal population $(X_1', X_2', \dots, X_k', Y')$ (technically we have $(x_{i1}, x_{i2}, \dots, x_{ik}, y_i)$, $i = 1, \dots, n$, where $n$ is the sample size, and we will use the notation $\sum x_k$ for $\sum_{i=1}^{n} x_{ik}$). We will again perform linear regression on the data: i.e. we will use the data to find $b_i$, $i = 1, \dots, k$, and $a$ such that

$$\hat{y} = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

As before we will consider the sum of the squares of the residuals: $SSresid = \sum (y - \hat{y})^2$, where $y$ is from the data (i.e. $(x_1, x_2, \dots, x_k, y)$) and $\hat{y}$ is the predicted value of $y$ using the inputs $x_1, x_2, \dots, x_k$. The values that we get for $b_i$, $i = 1, \dots, k$, and $a$ are such that $SSresid$ is minimized (note this is the same criterion we had for single variable, i.e. simple, regression). The formulas for $b_i$, $i = 1, \dots, k$, and $a$ are somewhat complicated and for now we will compute these via computer technology.

Now, just as in simple linear regression, we can determine how well the regression equation explains the variance in $Y$ by considering the coefficient of determination:

$$r^2 = \frac{SST - SSresid}{SST} = \frac{SSregress}{SST}$$

where $SSregress$ is defined as in simple linear regression, and the relationship $SST = SSY = SSresid + SSregress$ is still true!

Testing the coefficient of determination in multiple regression. Again we will use the $F$-test. The null hypothesis for the $F$-test is $H_0: \rho^2 = 0$ with alternative hypothesis $H_a: \rho^2 > 0$. Our $F$-test statistic will again be

$$F = \frac{SSregress / k}{SSresid / (n - k - 1)}$$

where $k$ is the number of independent predictors and $n$ is the number of data points in the sample. The numerator of the $F$-test statistic is the variance in the regression ($SSregress / k$) and is also called the mean square regression. The denominator of the $F$-test statistic is the variance in the residual ($SSresid / (n - k - 1)$) and is also called the mean square residual. We think of the $F$-test statistic as the ratio of the explained variance to the unexplained variance. Since in regression our goal is to minimize the unexplained variance and have most of the variance in $Y$ explained by the regression equation, to have a significant result we would expect the $F$-test statistic to be greater than 1. Note that an alternative formula for the $F$-test statistic is

$$F = \frac{r^2 / k}{(1 - r^2) / (n - k - 1)}$$

Can you show this is true?

In the case of simple linear regression we saw that the above hypothesis test was equivalent to the hypothesis test $H_0: \beta = 0$ with $H_a: \beta \neq 0$. In the case of multiple regression, the above hypothesis test is equivalent to the following hypothesis test: $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ with alternative hypothesis $H_a$: at least one coefficient is not equal to zero, where $\beta_1, \beta_2, \dots, \beta_k$ are the slopes of the regression equation that fits the population.
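Since the coefficients are to be "computed via computer technology", here is a minimal sketch of how that computation might look: an ordinary least squares fit on hypothetical data with $k = 2$ predictors. It checks that $SST = SSresid + SSregress$ still holds, that the two formulas for the $F$-test statistic agree, and it computes the $p$-value for $H_0: \beta_1 = \dots = \beta_k = 0$. The data and variable names are illustrative, not from the handout.

```python
import numpy as np
from scipy import stats

# Hypothetical data: n = 8 observations, k = 2 predictors
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8])
n, k = X.shape

# Least squares fit of y = a + b1*x1 + b2*x2 (the column of ones gives the intercept a)
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b = coef[0], coef[1:]

y_hat = design @ coef
ss_total = np.sum((y - y.mean()) ** 2)               # SST = SSY
ss_resid = np.sum((y - y_hat) ** 2)
ss_regress = np.sum((y_hat - y.mean()) ** 2)
print(np.isclose(ss_total, ss_resid + ss_regress))   # SST = SSresid + SSregress

r2 = ss_regress / ss_total                           # coefficient of determination

# F statistic, computed two equivalent ways
F1 = (ss_regress / k) / (ss_resid / (n - k - 1))
F2 = (r2 / k) / ((1 - r2) / (n - k - 1))
print(np.isclose(F1, F2))
print(stats.f.sf(F1, dfn=k, dfd=n - k - 1))          # p-value for H0: beta_1 = ... = beta_k = 0
```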
So acceptance of the null hypothesis implies that the explanatory (i.e. predictor, i.e. independent) variables do not have any significant impact on explaining the variance in the dependent variable $Y$. So to estimate the average value of $Y$ one would not take into account the values of the independent variables and would simply use the sample mean $\bar{y}$ as a point estimate. Rejection of the null hypothesis does not necessarily mean that all of the coefficients are significantly different from zero, but that at least one of the coefficients is significantly different from zero. Rejection does imply that the regression equation is useful for predicting values of $Y$ given values of the independent variables. In particular, given values of $X_i$, $i = 1, \dots, k$, a point estimate for the average value of $Y$ is $\hat{y}$ (i.e. we would take into account the deviations of the $X_i$, $i = 1, \dots, k$, from their averages to compute the average value of $Y$).

Once we have determined that the regression equation significantly explains the variance in $Y$, we can run a $t$-test on each of the independent variable coefficients to test whether they are significantly different from zero. We will then be interested in seeing if we can get a better model by reducing the full model, dropping independent variables whose coefficients are not significantly different from zero.

More statistics associated with multiple regression. We have the following sample statistics:

$$\bar{x}_m = \frac{\sum x_m}{n} \quad \text{(sample mean for each predictor variable } X_m,\ m = 1, \dots, k\text{)}$$

$$\bar{y} = \frac{\sum y}{n} \quad \text{(sample mean for } Y\text{)}$$

$$s_{X_m}^2 = \frac{\sum (x_m - \bar{x}_m)^2}{n - 1} \quad \text{(sample variance for each predictor variable } X_m,\ m = 1, \dots, k\text{)}$$

$$s_Y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \quad \text{(sample variance for } Y\text{)}$$

We will also use the following "sums of squares":

$$SSX_m = \sum (x_m - \bar{x}_m)^2 \qquad \text{and} \qquad SSY = \sum (y - \bar{y})^2$$

Note: Sometimes we refer to $SSY$ as $SST$. We have the following relationships between the sample statistics and the sums of squares:

$$SSX_m = (n - 1)s_{X_m}^2 \;\text{ or }\; s_{X_m}^2 = \frac{SSX_m}{n - 1} \quad \text{for each predictor variable } X_m,\ m = 1, \dots, k$$

$$SSY = (n - 1)s_Y^2 \;\text{ or }\; s_Y^2 = \frac{SSY}{n - 1}$$

The sample covariance between $X_m$ and $Y$ is defined by

$$COV(X_m, Y) = s_{X_m Y} = \frac{\sum (x_m - \bar{x}_m)(y - \bar{y})}{n - 1}$$

We will also define the covariance between $X_i$ and $X_j$ for $i \neq j$ as follows:

$$COV(X_i, X_j) = s_{X_i X_j} = \frac{\sum (x_i - \bar{x}_i)(x_j - \bar{x}_j)}{n - 1}$$

Note that when $i = j$, $s_{X_i X_j} = COV(X_i, X_j) = COV(X_i, X_i) = s_{X_i}^2$.
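As a final illustration (not part of the handout), these covariance definitions can be checked against numpy's built-in covariance matrix, which also uses the $n - 1$ denominator. The data are the same hypothetical values used in the multiple regression sketch above.

```python
import numpy as np

# Hypothetical data: two predictors and a response (illustrative values only)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 15.8])
n = len(y)

def sample_cov(u, v):
    """COV(U, V) = sum((u - u_bar)(v - v_bar)) / (n - 1)."""
    return np.sum((u - u.mean()) * (v - v.mean())) / (n - 1)

# The covariance of a variable with itself is its sample variance
print(np.isclose(sample_cov(x1, x1), x1.var(ddof=1)))

# Agreement with numpy's covariance matrix (rows/columns ordered x1, x2, y)
C = np.cov(np.vstack([x1, x2, y]))               # uses the n - 1 denominator by default
print(np.isclose(C[0, 1], sample_cov(x1, x2)))   # s_{X1 X2}
print(np.isclose(C[0, 2], sample_cov(x1, y)))    # s_{X1 Y}
```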