Chapter 11

Remedial Measures for Non-constant Error Variance

Sometimes a transformation does not result in an easily interpretable or applicable regression model (e.g., it can be difficult to explain a model built on various transformations). However, if an appropriate model is found using ordinary least squares regression but the error variances are not constant, a weighted least squares method can be employed. The text outlines a few methods for when the variances are known, but since this is usually unlikely, we will restrict our discussion to occasions when the error variances are unknown. Two functions will be reviewed: 1. the variance function and 2. the standard deviation function.

Basic Method:
1. Regress Y on the relevant predictor(s) and store the residuals.
2. Regress either
   a. e_i² on the relevant predictor(s) and store the fits (variance function), or
   b. |e_i| on the relevant predictor(s) and store the fits (S.D. function).
3. Create the weights w_i by:
   a. w_i = 1/v̂_i, where the v̂_i are the fitted values stored in step 2 (variance function)
   b. w_i = 1/ŝ_i², where the ŝ_i are the fitted values stored in step 2 (S.D. function)
4. Regress Y on the relevant predictor(s). Click Options, enter the column containing the w_i under "Weights", and store the residuals.
5. If the estimated coefficients differ substantially from the OLS estimates (there is no rule of thumb for determining "substantially"), then repeat (i.e., iterate) the weighted least squares process by repeating steps 2 through 4 using the residuals stored in step 4. Usually only one or two iterations are sufficient. This iteration process is referred to as iteratively reweighted least squares.

NOTE: The R² produced by weighted least squares does not have a clear-cut meaning, so it should be viewed cautiously.

Example: Blood Pressure (text page 427)
1. Regress Y on X and store the residuals.
2. Create the scatterplots in Fig. 11.1: Y vs. X (11.1a); e vs. X (11.1b); |e| vs. X with the regression line (11.1c). Hover the mouse over the fitted line in plot 11.1c to get the regression equation.
3. Obtain the fitted values from regressing |e| on X, then compute the weights as in 3b above, w_i = 1/fits², using the calculator function in Minitab.
4. Regress Y on X, click Options, and enter the column containing the weights under Weights.

Which function to use? Your text on page 425 illustrates some scenarios. To summarize (a code sketch of the overall procedure follows this list):
1. If the residual plot produces a megaphone shape, use the S.D. function. The weights come from the absolute residuals regressed on whichever variable was used on the horizontal axis (e.g., the predictor(s) or the fitted values Ŷ from OLS).
2. If a plot of the squared residuals against the predictor(s) exhibits an upward tendency, use the variance function and regress the squared residuals against these predictor(s).
3. If a plot of the residuals against the predictor(s) increases steadily and then slows (think of a logarithmic pattern), use the S.D. function and regress the absolute residuals against the first- and second-order predictor(s).
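For readers working outside Minitab, here is a minimal sketch of the basic method above (the S.D.-function version) in Python with statsmodels. The simulated data and all variable names are illustrative assumptions, not the text's blood pressure data.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(20, 60, 100)
    y = 50 + 0.6 * x + rng.normal(0, 0.1 * x)   # error S.D. grows with x
    X = sm.add_constant(x)

    # Step 1: ordinary least squares; keep the residuals
    ols_fit = sm.OLS(y, X).fit()

    # Step 2b: regress |e_i| on the predictor and keep the fits (S.D. function)
    s_hat = sm.OLS(np.abs(ols_fit.resid), X).fit().fittedvalues

    # Step 3b: weights w_i = 1 / s_hat_i^2
    w = 1.0 / s_hat**2

    # Step 4: weighted least squares with those weights
    wls_fit = sm.WLS(y, X, weights=w).fit()
    print(ols_fit.params, wls_fit.params)

    # Step 5 (if the coefficients moved substantially): repeat steps 2-4
    # using wls_fit.resid until they stabilize (iteratively reweighted LS).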
Remedial Measures for Multicollinearity

The text discusses ridge regression methods, which are cumbersome in many stat packages, especially Minitab. Another option, not mentioned in the text but available in Minitab under Stat > Regression, is Partial Least Squares. PLS is useful when there is high correlation between the predictor variables. PLS will not provide a regression equation, but it will provide a best model using best-subsets methods, plus offer cross-validation options (a code sketch follows).
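Outside Minitab, the same idea can be sketched with scikit-learn's PLSRegression. The simulated collinear predictors and the choice of a single component are assumptions for illustration, not the text's example.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = 3 + 2 * x1 + rng.normal(size=200)

    # Fit PLS with one component; cross-validated R^2 helps pick the number
    pls = PLSRegression(n_components=1)
    print(cross_val_score(pls, X, y, cv=5))   # R^2 for each fold
    pls.fit(X, y)
    print(pls.coef_)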
Remedial Measures for Influential Cases – Robust Regression

The most popular method is to use iteratively reweighted least squares (similar to the process described for non-constant variance) with Huber weights, where the weights are determined by a manipulation of the median of the residuals. Minitab does not perform this easily: several iterations may be required, and each one has to be set up and run by hand. (A rough sketch of this iteration appears at the end of these notes.)

Once outliers have been identified (e.g., by deleted studentized residuals, leverages, Cook's distance), we turn to what remedial actions, if any, are appropriate.

Example 1: SAT–HS Rank
- If we were simply to report the results of the regression model with no indication that there is an outlier, we would essentially be reporting a model which is determined by a single observation. To report findings that did not address the outlier would be misleading, since you would be pretending that the model is really based on n, the total number of observations. Somehow, especially in the social sciences, reporting results with outliers included has come to be viewed as the "honest" thing to do, and reporting results with outliers removed is sometimes viewed as "cheating".
- As a result, reporting a model with the outlier(s) omitted, with the explicit admission in the report that there were observation(s) which were not understood (e.g., one could not conclude that the data were improperly recorded), and thus that the final model represents the data that were understood, is a practice that has become acceptable.
- A possibly better solution is to report both models, one with and one without the outlying observation(s), and let readers make their own decision about the adequacy of the models.
- Importance: to ignore outliers by failing to detect and report them is dishonest and misleading.

Example 2: Anscombe Data (as presented in Anscombe, F. J. (1973), "Graphs in Statistical Analysis", The American Statistician)

Activity: The data contain four pairs of Y and X variables. Note that the X variable is the same for X1–X3. Fit a simple regression model for each pair of Y and X and see which model is best (a code check of this activity appears at the end of these notes).

Importance of graphs: Each model produces the same parameter estimates, R², and F* statistic for the test of including X in the model. Obviously R² and F* are not sufficient to distinguish among these four very different data sets. To see visually how different these data sets are, prepare a scatterplot, including the regression line, by going to Graph > Scatterplot > Simple. Enter each Y in the Y column and its corresponding X in the X column. Click Multiple Graphs, select "In separate panels of the same graph", and click OK. Then click Data View > Regression, select Linear, and be sure the box for the intercept is checked. Click OK twice.

Set 1 displays the expected linear regression picture, with the points scattered about but reasonably close to the regression line.

In Set 2 it is clear that there is a strong relationship between x2 and y2, but it is a curvilinear relationship instead of the linear one presumed by linear regression. The initial linear analysis understates the strength of the true relationship between the two variables because it does not account for the curvilinear part of that relationship (a possible remediation would be to include a power term).

For Set 3 there is an obvious outlier. Except for this one observation there would be a perfect linear relationship between x3 and y3. The R² for this analysis is 66.6%, which considerably understates the linear relationship among most of the observations in this data set.

Finally, in Set 4 there is also an obvious outlier (influential!). Except for this one outlying observation there would be no relationship between x4 and y4, because all other observations have the same value of x4, making it a useless predictor. Yet this one influential observation fools the analyst into reporting a linear relationship with R² = 66.6%. Thus in this data set the outlier considerably overstates the true relationship between x4 and y4.

Note: from the Minitab output there are no large residuals or influential outliers for Sets 1 or 2. Lesson: do not rely completely on the output to identify problem observation(s).

[Figure: Scatterplots of Y1 vs X1, Y2 vs X2, Y3 vs X3, and Y4 vs X4, each with its fitted regression line, in separate panels.]
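To verify the activity's claim numerically outside Minitab, here is a minimal sketch using scipy; the data values are Anscombe's (1973) published quartet.

    from scipy import stats

    # Anscombe's quartet (Anscombe, 1973); X is shared by sets 1-3
    x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
    y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
    y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
    y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
    x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
    y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

    for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
        res = stats.linregress(x, y)
        print(f"b0={res.intercept:.2f}  b1={res.slope:.3f}  R2={res.rvalue**2:.3f}")
    # All four sets print roughly b0=3.00, b1=0.500, R2=0.666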
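Finally, the Huber-weight iteration mentioned under robust regression above can be sketched as follows. This is an illustrative implementation, assuming the usual tuning constant c = 1.345 and a median-based (MAD) scale for the residuals; it is not Minitab's exact procedure.

    import numpy as np
    import statsmodels.api as sm

    def huber_weights(resid, c=1.345):
        # Robust scale from the median of the residuals (MAD / 0.6745)
        s = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = resid / s
        return np.where(np.abs(u) <= c, 1.0, c / np.abs(u))

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 50)
    y = 2 + 3 * x + rng.normal(size=50)
    y[0] += 25                       # plant one gross outlier

    X = sm.add_constant(x)
    fit = sm.OLS(y, X).fit()         # start from the OLS fit
    for _ in range(10):              # iterate until coefficients stabilize
        w = huber_weights(fit.resid)
        fit = sm.WLS(y, X, weights=w).fit()
    print(fit.params)                # near (2, 3) despite the outlier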