Mathematical Properties of the Least Squares Regression

The least squares regression line obeys certain mathematical properties which are useful to know in practice. The following properties can be established algebraically:

a) The least squares regression line passes through the point of sample means of Y and X. This can easily be seen from (4.9), which can be rewritten as follows:

\bar{Y} = b_1 + b_2 \bar{X}     (4.12)

b) The mean of the fitted (predicted) values of Y is equal to the mean of the observed Y values. Let \hat{Y}_i = b_1 + b_2 X_i; then we have:

\bar{\hat{Y}} = \frac{1}{n} \sum (b_1 + b_2 X_i) = \frac{1}{n} \sum (\bar{Y} - b_2 \bar{X} + b_2 X_i) = \bar{Y} - b_2 \bar{X} + b_2 \bar{X} = \bar{Y}     (4.13)

c) The residuals of the regression line sum to zero:

\frac{1}{n} \sum e_i = \frac{1}{n} \sum (Y_i - \hat{Y}_i) = \bar{Y} - \bar{\hat{Y}} = 0     (4.14)

d) The residuals e_i are uncorrelated with the X_i values:

\sum e_i X_i = \sum e_i X_i - \bar{X} \sum e_i          (since \sum e_i = 0)
             = \sum e_i (X_i - \bar{X})
             = \sum (Y_i - \bar{Y})(X_i - \bar{X}) - b_2 \sum (X_i - \bar{X})^2          (since b_1 = \bar{Y} - b_2 \bar{X})
             = 0          (since b_2 = \sum (Y_i - \bar{Y})(X_i - \bar{X}) / \sum (X_i - \bar{X})^2)

e) The residuals e_i are uncorrelated with the fitted values \hat{Y}_i. This property follows directly from the previous one, since each fitted value \hat{Y}_i is a linear function of the corresponding X_i value.

f) The least squares regression splits the variation in the Y variable into two components, the explained variation due to the variation in X_i and the residual variation:

TSS = RSS + ESS     (4.15)

where,

TSS = \sum (Y_i - \bar{Y})^2
ESS = \sum (\hat{Y}_i - \bar{\hat{Y}})^2     (4.16)
RSS = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2

TSS is the total variation observed in the dependent variable Y; it is called the total sum of squares. ESS, the explained sum of squares, is the variation of the predicted values (b_1 + b_2 X_i). This is the variation in Y accounted for by the variation in the explanatory variable X. What is left is the RSS, the residual sum of squares. The reason why ESS and RSS add up exactly to TSS is that the residuals are uncorrelated with the fitted Y values and, hence, there is no cross-product term involving the sum of covariances.

This last property suggests a useful way to measure the goodness of fit of the estimated sample regression. This is done as follows:

R^2 = ESS / TSS     (4.17)

where R², called R-square, is the coefficient of determination. It gives us the proportion of the total sum of squares of the dependent variable explained by the variation in the explanatory variable. In fact, R² equals the square of the linear correlation coefficient between the observed and the predicted values of the dependent variable Y, computed as follows:

r = \frac{Cov(Y, \hat{Y})}{\sqrt{V(Y)\,V(\hat{Y})}} = \frac{\sum (Y_i - \bar{Y})(\hat{Y}_i - \bar{\hat{Y}})}{\sqrt{\sum (Y_i - \bar{Y})^2 \, \sum (\hat{Y}_i - \bar{\hat{Y}})^2}}     (4.17a)

A correlation coefficient measures the degree of linear association between two variables. Note, however, that if the underlying relation between the variables is non-linear, the correlation coefficient may perform poorly, notwithstanding the fact that a strong non-linear association exists between the two variables.
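Because these are algebraic identities, they hold exactly for any least squares fit and are easy to verify numerically. The following minimal Python sketch (the data values and variable names are invented purely for illustration, not taken from the text) fits a regression line using the formulas above and checks properties (b), (c), (d) and (f), together with the R-square of (4.17):

import numpy as np

# Illustrative data (hypothetical values chosen only to demonstrate the algebra)
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
Y = np.array([3.1, 5.4, 6.0, 8.2, 9.9, 13.5])

# Least squares estimates: b2 = Sxy / Sxx, b1 = Ybar - b2 * Xbar
Xbar, Ybar = X.mean(), Y.mean()
b2 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b1 = Ybar - b2 * Xbar

Yhat = b1 + b2 * X          # fitted values
e = Y - Yhat                # residuals

# Property (b): mean of fitted values equals mean of Y
print(np.isclose(Yhat.mean(), Ybar))                     # True
# Property (c): residuals sum to zero
print(np.isclose(e.sum(), 0.0))                          # True
# Property (d): residuals are uncorrelated with X
print(np.isclose(np.sum(e * X), 0.0))                    # True

# Property (f): TSS = RSS + ESS, and R^2 = ESS/TSS = r(Y, Yhat)^2
TSS = np.sum((Y - Ybar) ** 2)
ESS = np.sum((Yhat - Yhat.mean()) ** 2)
RSS = np.sum(e ** 2)
R2 = ESS / TSS
print(np.isclose(TSS, ESS + RSS))                        # True
print(np.isclose(R2, np.corrcoef(Y, Yhat)[0, 1] ** 2))   # True

Any small dataset would do; the identities hold up to floating-point rounding, which is why np.isclose is used rather than exact equality.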
Statistical Properties of LS Linear Regression

We briefly review the main points without much further elaboration, apart from a few specific points which concern regression only. We shall merely remind you of the results of formal derivations without bothering about proofs, which can be found in most introductory texts on statistics or econometrics.

Standard Errors

Given the assumptions of the classical linear regression model, the variances of the least squares estimators are given by:

var(b_1) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right]     (4.19)

var(b_2) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}     (4.20)

Furthermore, an unbiased estimator of σ² is given by s² as follows:

s^2 = \frac{\sum (Y_i - b_1 - b_2 X_i)^2}{n - 2}     (4.21)

where s (the square root of s²) is called the standard error of the regression; σ² is the variance of the error term, which measures the deviation of individual points from the regression line. Replacing σ² by s² in (4.19) and (4.20), we get unbiased estimates of the variances of b_1 and b_2. The estimated standard errors are, of course, the square roots of these estimated variances.

The total sum of squares of X, \sum (X_i - \bar{X})^2, which features in the denominator of the variances of both the intercept and the slope coefficient, is a measure of the total variation in the X values. Thus, other things being equal, the greater the variation in the X values, the smaller the variances of the estimators, and hence the greater the precision of estimation. In other words, the range over which X is observed plays a crucial role in the reliability of the estimates. Think about this: it would indeed be difficult to measure the response of Y to X if X hardly varies at all. The greater the range over which X varies, the easier it is to capture its impact on the variation in Y.

Sampling Distributions

To construct confidence intervals and to perform tests of hypotheses we need the probability distribution of the errors, which is why we invoke the normality assumption for the error terms. Under this assumption, the least squares estimators b_1 and b_2 each follow a normal distribution. However, since we generally do not know the variance of the error term, we cannot use the normal distribution directly. Instead, we use the t-distribution, defined as follows in the case of b_2:

t = \frac{b_2 - \beta_2}{se(b_2)} \sim t(n-2)     (4.22)

where se(b_2), the standard error of b_2, is given by:

se(b_2) = \frac{s}{\left[ \sum (X_i - \bar{X})^2 \right]^{1/2}}     (4.23)

using (4.20) and (4.21). The notation t(n-2) denotes Student's t-distribution with (n-2) degrees of freedom. The reason why we now have only (n-2) degrees of freedom is that, in simple regression, we use the sample data to estimate two coefficients: the slope and the intercept of the line. In the case of the sample mean, in contrast, we only estimated one parameter (the mean itself) from the sample. Similarly, for b_1, we get:

t = \frac{b_1 - \beta_1}{se(b_1)} \sim t(n-2)     (4.24)

where se(b_1), the standard error of b_1, is given by:

se(b_1) = s \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right]^{1/2}     (4.25)

using (4.19) and (4.21).

Confidence Intervals for the Parameters β1 and β2

The confidence limits for β2 and β1 with confidence coefficient (1-α) (say, 95 per cent, in which case α = 0.05) are given by:

b_2 \pm t_{(n-2,\,\alpha/2)} \, se(b_2)     (4.26)

b_1 \pm t_{(n-2,\,\alpha/2)} \, se(b_1)     (4.27)

respectively, where t_{(n-2, α/2)} is the (1-α/2) percentile of a t-distribution with (n-2) degrees of freedom, and se(b_2) and se(b_1) are given by (4.23) and (4.25) respectively.
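As a concrete illustration of how (4.21) to (4.27) are used in computation, the sketch below continues with the hypothetical X, Y, b1, b2 and residuals e from the earlier snippet; it computes the standard error of the regression, the standard errors of the two coefficients, and their 95 per cent confidence intervals, with scipy.stats supplying the t-percentile:

from scipy import stats

n = len(X)
s2 = np.sum(e ** 2) / (n - 2)              # unbiased estimate of sigma^2, eq. (4.21)
s = np.sqrt(s2)                            # standard error of the regression

Sxx = np.sum((X - Xbar) ** 2)              # total sum of squares of X
se_b2 = s / np.sqrt(Sxx)                                  # eq. (4.23)
se_b1 = s * np.sqrt(1.0 / n + Xbar ** 2 / Sxx)            # eq. (4.25)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)             # t(n-2, alpha/2)

# 95 per cent confidence intervals, eqs. (4.26) and (4.27)
ci_b2 = (b2 - t_crit * se_b2, b2 + t_crit * se_b2)
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# t-statistic for the hypothesis beta2 = 0, a special case of eq. (4.22)
t_b2 = b2 / se_b2
print(ci_b1, ci_b2, t_b2)

The same t_crit is reused for every interval in the remainder of this section, since all of them rest on the t-distribution with (n-2) degrees of freedom.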
Confidence Interval for the Conditional Mean of Y

At times, we may wish to construct a confidence interval for the conditional mean. For example, after fitting a regression of household savings on income, we may want to construct a confidence interval for average savings at a given level of income in order to assess the savings potential of a certain type of household. Suppose,

\mu_0 = \beta_1 + \beta_2 X_0     (4.28)

i.e. μ0 is the conditional mean of Y given X = X0. The point estimate of μ0 is \hat{\mu}_0 = b_1 + b_2 X_0, while its 100(1-α) per cent confidence interval can be obtained as follows:

\hat{\mu}_0 \pm t_{(n-2,\,\alpha/2)} \, se(\hat{\mu}_0)     (4.29)

where,

se(\hat{\mu}_0) = s \left[ \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]^{1/2}     (4.30)

Confidence Interval for the Predicted Y Values

On other occasions we might be interested in the uncertainty of a prediction made on the basis of the estimated regression. For example, when estimating a regression of paddy yield (physical output per unit area) on annual rainfall, we may want to predict next year's yield given the anticipated rainfall. In this case, our interest is not in a confidence interval for the conditional mean of the yield, i.e. the mean yield at a given level of rainfall. Rather, we want to find a confidence interval for the yield Y0 itself, given the rainfall X0. Obviously, in this case,

Y_0 = \beta_1 + \beta_2 X_0 + \varepsilon = \mu_0 + \varepsilon

where μ0 is given by (4.28). The 100(1-α) per cent confidence interval for Y0 given X = X0 is then obtained as follows:

\hat{Y}_0 \pm t_{(n-2,\,\alpha/2)} \, se(Y_0)     (4.31)

where \hat{Y}_0 = b_1 + b_2 X_0 is the point prediction and,

se(Y_0) = s \left[ 1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]^{1/2}     (4.32)

In this case, therefore, the standard error of Y0 is larger than that of μ0, since the latter corresponds to the conditional mean of the yield at a given level of rainfall, while the former corresponds to the predicted value of the yield itself, which also contains the error term. In both cases, (4.30) and (4.32), the confidence interval will be wider, the farther the X0 value lies from the sample mean of X.

Standard Error of a Residual

Finally, the residuals e_i are the estimators of the errors ε_i (see (4.7) and (4.8)). The standard error of e_i is obtained as follows:

se(e_i) = s \sqrt{1 - h_i}, \quad \text{where} \quad h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}     (4.33)

where s is the standard error of the regression given by (4.21). Note that while the standard deviation of the error term is assumed to be homoscedastic, equation (4.33) shows that the residuals of the regression line are heteroscedastic in nature: the standard error of each residual depends on the value of h_i. The statistic h_i is called the hat statistic; h_i will be larger, the greater the distance of X_i from its mean. A value of X which is far away from its mean (for example, an outlier in the univariate analysis of X) will produce a large hat statistic which, as we shall see in section 4.7, can exert undue influence on the location of a regression line. A data point with a large hat statistic is said to exert leverage on the least squares regression line, the importance of which will be shown in section 4.7.
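To make the distinction between the interval for the conditional mean, the wider prediction interval, and the residual standard errors concrete, the sketch below continues with the hypothetical quantities computed in the previous snippets, together with an arbitrary illustrative value X0, and implements (4.29) to (4.33):

X0 = 8.0   # an arbitrary illustrative value of the explanatory variable

mu0_hat = b1 + b2 * X0                      # point estimate of mu0 (also the point prediction of Y0)

# Standard errors, eqs. (4.30) and (4.32)
se_mu0 = s * np.sqrt(1.0 / n + (X0 - Xbar) ** 2 / Sxx)
se_Y0 = s * np.sqrt(1.0 + 1.0 / n + (X0 - Xbar) ** 2 / Sxx)

# 95 per cent intervals, eqs. (4.29) and (4.31): the prediction interval is always wider
ci_mean = (mu0_hat - t_crit * se_mu0, mu0_hat + t_crit * se_mu0)
ci_pred = (mu0_hat - t_crit * se_Y0, mu0_hat + t_crit * se_Y0)

# Hat statistics and residual standard errors, eq. (4.33): the residuals are
# heteroscedastic even though the errors are assumed homoscedastic
h = 1.0 / n + (X - Xbar) ** 2 / Sxx
se_e = s * np.sqrt(1.0 - h)
print(ci_mean, ci_pred)
print(h, se_e)

Printing h and se_e for the illustrative data shows the pattern described in the text: observations whose X value lies far from the sample mean have the largest hat statistics and the smallest residual standard errors.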