Heteroscedasticity

I. What it is and where to find it
A. The variance of Y changes with the level of one or more independent variables.
B. It is often a problem in time series data and when a measure is aggregated over individuals.
1) Example: average college expenses measured by sampling 1% (.01) of the students at each of several institutions that differ in size. Because the size of the student sample changes with institution size, and because an average based on n students has variance σ²/n, as institution size grows, n grows and σ²/n shrinks.

II. How to know you have it
A. Plot the data.
B. Plot the residuals.
C. With a categorical independent variable, one can perform a test for the homogeneity of variance (e.g., Box's test; cf. Winer, 1971).

III. What to do about it
A. Conceptually, one might want to give observations with greater variance less weight, because they give a less precise indication of the path of the regression line.
B. Instead of minimizing Σ(yi − a − bxi)², minimize

   Σ(1/σi²)(yi − a − bxi)².   [1]

This is called weighted least squares (WLS) because the ordinary least squares (OLS) expression is "weighted" by the inverse of the variance. Note that when σi² = σ², that is, when the variances are all equal (homoscedastic), this equation gives the ordinary least squares solution for a and b. In the heteroscedastic case, it gives the maximum likelihood estimates (MLE) of a and b.
C. In general the σi² are unknown, so [1] cannot be minimized directly, and one must rely on computer programs that find the minimum by iterative fitting algorithms.
D. However, there is a simple solution whenever σi is proportional to the values of a variable, for example the predictor xi, i.e., whenever σi = kxi. In this case, one can obtain the weighted least squares solution by minimizing

   Σ(1/(kxi)²)(yi − a − bxi)² = (1/k²) Σ((yi/xi) − (a/xi) − (bxi/xi))².

Because the constant multiplier (1/k²) does not affect the location of the minimum, one can find the appropriate estimates of a and b by minimizing

   Σ((yi/xi) − (a/xi) − (bxi/xi))² = Σ((yi/xi) − (a/xi) − b)².

Therefore, the weighted least squares estimates of the regression parameters can be obtained by performing an ordinary least squares regression on the transformed variables obtained by dividing the original variables by xi:

   yi/xi = a(1/xi) + b + ei/xi.

Note that the constant in this equation (b) corresponds to the regression coefficient for xi in the original model, and that the regression coefficient for the new independent variable (1/xi) corresponds to the constant term in the original equation. Also note that, since the residuals are also divided by xi, the transformed residuals will have constant variance if the standard deviations of the original ei are proportional to xi as assumed. (A minimal code sketch of this transformation follows the initial regression output in IV.A below.)

IV. Example: Airline transport accidents predicted by the proportion of all flights flown by the airline.
A. Initial regression (dependent variable: INJURIES; predictor: Proportion of Total Flights)

   Model Summary
   R = .698, R Square = .487, Adjusted R Square = .414, Std. Error of the Estimate = 4.20085

   ANOVA
   Regression: SS = 117.359, df = 1, MS = 117.359, F = 6.650, Sig. = .037
   Residual:   SS = 123.530, df = 7, MS = 17.647
   Total:      SS = 240.889, df = 8

   Coefficients
   (Constant):                  B = -.140,  Std. Error = 3.141,               t = -.045, Sig. = .966
   Proportion of Total Flights: B = 64.975, Std. Error = 25.196, Beta = .698, t = 2.579, Sig. = .037
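The following is a minimal Python sketch of the transformation described in III.D, using simulated data rather than the airline data above; the generating values (intercept 2, slope 3, k = 0.5) and sample size are illustrative assumptions only. It checks that OLS on the variables divided through by xi reproduces the weighted least squares solution with weights 1/xi².

    import numpy as np

    # Simulated data (NOT the airline data): sd of the error is k*x with k = 0.5.
    rng = np.random.default_rng(0)
    n = 50
    x = rng.uniform(1.0, 10.0, n)
    y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5 * x)

    # Weighted least squares: minimize sum (1/sigma_i^2)(y_i - a - b*x_i)^2,
    # which with sigma_i = k*x_i means weights proportional to 1/x_i^2.
    X = np.column_stack([np.ones(n), x])          # design matrix: constant, x
    W = np.diag(1.0 / x**2)
    a_wls, b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # Equivalent route: divide everything by x_i and run ordinary least squares on
    #   y_i/x_i = a*(1/x_i) + b + e_i/x_i.
    Xt = np.column_stack([1.0 / x, np.ones(n)])   # columns: 1/x (gets a), constant (gets b)
    a_t, b_t = np.linalg.solve(Xt.T @ Xt, Xt.T @ (y / x))

    print(a_wls, b_wls)   # weighted least squares estimates
    print(a_t, b_t)       # same values, from OLS on the transformed variables

The two printed pairs of estimates agree because the two minimization problems differ only by the constant factor 1/k², which does not move the minimum.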
B. The plot of the residuals indicates heteroscedasticity.

   [Scatterplot: regression standardized residuals vs. regression standardized predicted values; dependent variable: INJURIES.]

C. So new variables are created by dividing the old variables by the proportion of total flights: NEWINJ = injuries / proportion of total flights; NEWA = 1 / proportion of total flights; and proportion of total flights / proportion of total flights = 1, which becomes the constant.

   Model Summary (dependent variable: NEWINJ; predictor: NEWA)
   R = .150, R Square = .022, Adjusted R Square = -.117, Std. Error of the Estimate = 35.49718

   ANOVA
   Regression: SS = 202.575,  df = 1, MS = 202.575,  F = .161, Sig. = .700
   Residual:   SS = 8820.350, df = 7, MS = 1260.050
   Total:      SS = 9022.925, df = 8

   Coefficients
   (Constant): B = 73.122, Std. Error = 27.879,               t = 2.623, Sig. = .034
   NEWA:       B = -.883,  Std. Error = 2.202,  Beta = -.150, t = -.401, Sig. = .700

This gives the WLS solution:

   Number of incidents = -.883 + 73.122 × (proportion of total flights)

Recall (or see III.D above) that the coefficients for the constant and the predictor are switched. The R² for this model can be obtained by squaring the correlation between the estimated and actual number of incidents: (.698)² = .487. The other statistics can be read from the output above (remembering that the coefficient labeled "(Constant)" is the coefficient for the original independent variable). Notice that the t value for the independent variable has increased slightly (from 2.579 to 2.623), reflecting the added precision of this model.

D. The plot of the residuals indicates that the heteroscedasticity problem has disappeared.

   [Scatterplot: regression standardized residuals vs. regression standardized predicted values; dependent variable: NEWINJ.]

V. Multivariate Weighted Least Squares
A. Recall that the ordinary least squares solution is

   B = (X'X)⁻¹X'Y.

The WLS solution is

   B = (X'U⁻¹X)⁻¹X'U⁻¹Y,

where U is the n × n diagonal matrix of the error variances,

   U = diag(σ1², σ2², ..., σn²)   and   U⁻¹ = diag(1/σ1², 1/σ2², ..., 1/σn²).

That is, the ordinary least squares solution is weighted by the inverse of the variances. One would obtain the same result by multiplying the original regression equation Y = XB + e through by

   D = diag(1/σ1, 1/σ2, ..., 1/σn)

and applying OLS to the transformed equation DY = DXB + De. This yields the solution

   B = [(DX)'(DX)]⁻¹(DX)'(DY) = (X'D'DX)⁻¹X'D'DY.

Because D'D = U⁻¹, this solution is identical to the one above. (A minimal code sketch of both forms follows.)
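Here is a minimal Python sketch of the matrix form in V.A, again with simulated data and assumed (known) error variances; nothing in it comes from the airline example. It computes B = (X'U⁻¹X)⁻¹X'U⁻¹Y directly and then checks that OLS on the D-transformed data gives the same estimates, since D'D = U⁻¹.

    import numpy as np

    # Simulated multivariate data with known error variances sigma_i^2 (assumed values).
    rng = np.random.default_rng(1)
    n = 40
    X = np.column_stack([np.ones(n),               # constant
                         rng.uniform(0.0, 5.0, n),
                         rng.uniform(0.0, 5.0, n)])
    beta = np.array([1.0, 2.0, -0.5])
    sigma2 = rng.uniform(0.5, 4.0, n)              # error variances sigma_i^2
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2))

    # WLS: B = (X' U^-1 X)^-1 X' U^-1 Y, with U = diag(sigma_1^2, ..., sigma_n^2).
    U_inv = np.diag(1.0 / sigma2)
    B_wls = np.linalg.solve(X.T @ U_inv @ X, X.T @ U_inv @ y)

    # Same thing via the transformation: multiply through by D = diag(1/sigma_i),
    # so that D'D = U^-1, and apply OLS to DY = DXB + De.
    D = np.diag(1.0 / np.sqrt(sigma2))
    DX, Dy = D @ X, D @ y
    B_ols_on_transformed = np.linalg.solve(DX.T @ DX, DX.T @ Dy)

    print(B_wls)
    print(B_ols_on_transformed)   # identical up to floating-point error

In practice the σi² are rarely known and must be estimated or modeled, which is why iterative computer routines are typically used (see III.C); the sketch only illustrates the algebra.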