Lecture 31

Chapter 11
Remedial Measures for Non-constant Error Variance
Sometimes a transformation does not result in an easily interpretable or applicable
regression model (e.g., a model with several transformations can be difficult to explain).
However, if an appropriate model is found using ordinary least squares regression but the
error variance is not constant, a weighted least squares method can be employed. A few
methods are outlined in the text for the case where the variances are known, but since this
is usually not the case we will restrict our discussion to occasions when the error variances
are unknown. Two functions will be reviewed: 1. the variance function and 2. the standard
deviation function.
Basic Method:
1. Regress Y on relevant predictor(s) and store residuals
2. Regress either
a. the squared residuals ei² on the relevant predictor(s) and store the fits (variance function), or
b. the absolute residuals |ei| on the relevant predictor(s) and store the fits (S.D. function)
3. Create weights wi by:
a. wi = 1/v̂i, where v̂i are the fitted values stored in step 2 (variance function)
b. wi = 1/ŝi², where ŝi are the fitted values stored in step 2 (S.D. function)
4. Regress Y on the relevant predictor(s). Click Options, enter the column containing
wi for “Weights”, and store the residuals.
5. If estimated coefficients differ substantially from OLS estimates (no rule of
thumb to determine “substantially”) then repeat (i.e. iterate) the weighted least
squares process by repeating steps 2 through 4 using the residuals stored in step 4.
Usually only one or two iterations are sufficient. This iteration process is referred
to as iteratively reweighted least squares.
NOTE: The R2 produced by weighted least squares does not have a clear-cut meaning so
should be viewed cautiously.
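The same procedure can be sketched outside Minitab. Below is a minimal Python version using statsmodels with the variance-function weights of step 3a; the simulated data and variable names are illustrative assumptions, not the text's example.

import numpy as np
import statsmodels.api as sm

# Illustrative heteroscedastic data: the error S.D. grows with X
rng = np.random.default_rng(0)
x = rng.uniform(2, 10, 100)
y = 2 + 3 * x + rng.normal(scale=0.5 * x)
X = sm.add_constant(x)

# Step 1: ordinary least squares, keeping the residuals
ols = sm.OLS(y, X).fit()

# Step 2a: variance function -- regress the squared residuals on X
var_fit = sm.OLS(ols.resid ** 2, X).fit()
v_hat = np.clip(var_fit.fittedvalues, 1e-8, None)  # fitted variances must stay positive

# Step 3a: weights are the reciprocals of the fitted variances
w = 1.0 / v_hat

# Step 4: weighted least squares with those weights
wls = sm.WLS(y, X, weights=w).fit()
print("OLS:", ols.params, " WLS:", wls.params)

# Step 5: if the estimates changed substantially, repeat steps 2-4 using
# wls.resid in place of ols.resid (iteratively reweighted least squares)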
Example: Blood Pressure Page 427
1. Regress Y on X and store residuals
2. Create the scatterplots in Fig 11.1 of Y vs X (11.1a); e vs X (11.1b); |e| vs X with regression
line (11.1c). Hover your mouse over the fitted line in plot 11.1c to get the regression equation.
3. Obtain the fitted values from regressing |e| on X, then compute the weights for 3b
above as wi = 1/fits² using the calculator function in Minitab.
4. Regress Y on X, click Options and in Weights enter the column containing the
weights.
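Continuing the Python sketch above (with ols, X, and y already defined), the S.D.-function version used in this example would be:

# Steps 2-3 of the example: regress |e| on X and square the fits
sd_fit = sm.OLS(np.abs(ols.resid), X).fit()
w = 1.0 / sd_fit.fittedvalues ** 2     # wi = 1 / fits^2

# Step 4: weighted regression of Y on X
wls = sm.WLS(y, X, weights=w).fit()
print(wls.params)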
Which function to use?
Your text on page 425 illustrates some scenarios. To summarize:
1. If the residual plot produces a megaphone shape, use the S.D. function. The
weights will come from the absolute residuals regressed on whichever variable
was used on the horizontal axis (e.g., the predictor(s) or the fitted values from OLS, Ŷ).
2. If a plot of the squared residuals against the predictor(s) exhibits an upward tendency,
use the variance function and regress the squared residuals against these predictor(s).
3. If a plot of the residuals against the predictor(s) increases steadily and then slows
(think of a logarithmic pattern), use the S.D. function and regress the absolute
residuals against the first- and second-order predictor(s).
Remedial Measures for Multicollinearity
The text discusses Ridge Regression methods, which are cumbersome in many stat
packages, especially Minitab. Another option not mentioned in the text but available in
Minitab is Partial Least Squares under the Stat > Regression option. PLS is useful
when there is high correlation between the predictor variables. PLS will not provide a
regression equation, but it will provide a best model using best-subsets-style methods
and offers cross-validation options.
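Minitab's PLS is menu-driven, but the same idea can be sketched in Python with scikit-learn. The data, the nearly collinear predictors, and the component counts below are illustrative assumptions:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

# Illustrative data: two nearly collinear predictors
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2])
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=200)

# PLS replaces the correlated predictors with a few latent components;
# choose the number of components by cross-validation
for k in (1, 2):
    pls = PLSRegression(n_components=k)
    r2 = cross_val_score(pls, X, y, cv=5).mean()  # default scoring is R^2
    print(f"{k} component(s): mean CV R^2 = {r2:.3f}")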
Remedial Measures for Influential Cases – Robust Regression
The most popular method is to use iteratively reweighted least squares (similar to the
method described above for non-constant variance) with Huber weights, where the
weights are determined by a manipulation of the median of the residuals. Minitab does
not perform this easily, as several iterations may be required and each iteration has to
be set up and run manually.
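Other packages automate the iteration. For example, here is a minimal sketch using statsmodels' robust linear model with the Huber weight function; the data are an illustrative assumption, not the text's example:

import numpy as np
import statsmodels.api as sm

# Illustrative data with one gross outlier
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 1 + 2 * x + rng.normal(size=50)
y[0] += 30                      # corrupt one observation
X = sm.add_constant(x)

# RLM runs the reweighting iterations internally; the Huber norm
# downweights cases with large residuals relative to a robust
# (median-based) scale estimate
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("OLS:  ", sm.OLS(y, X).fit().params)
print("Huber:", huber.params)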
Once outliers have been identified (e.g. Deleted Studentized Residuals, Leverages,
Cook’s Distance) we turn to what remedial actions, if any, are appropriate.
Example 1: SAT-HS Rank
- If we were simply to report the results of the regression model with no indication that
there is an outlier, we would essentially be reporting a model which is determined by
a single observation. To report findings that did not address the outlier would be
misleading, since you would be pretending that the model is really based on n, the
total number of observations. Somehow, especially in the social sciences, the
reporting of results with outliers included has come to be viewed as the “honest”
thing to do, and the reporting of results with outliers removed is sometimes viewed as
“cheating”.
- As a result, reporting a model with the outlier(s) omitted, with the explicit admission
in the report that there were observation(s) which were not understood (e.g., one could
not conclude that the data were improperly recorded) and thus that the final model
represents the data that was understood, is a practice that has become acceptable.
- A possible better solution is to report both models, one with and one without the
outlying observation(s), and let the reader make their own decision about the
adequacy of the models.
- Importance: to ignore outliers by failing to detect and report them is dishonest
and misleading.
Example 2: Anscombe Data (as presented in 1973 in “Graphs in statistical analysis”,
American Statistician)
Activity: The data display four sets of Y and X variables. Note that the X variable is the
same for X1 – X3. Fit a simple regression model for each Y and X pair and see which
model is best.
Importance of graphs
Each model produces the same parameter estimates, R2, and F* statistic for the test of
including X in the model. Obviously R2 and F* are not sufficient for distinguishing
among these four very different data sets. To see how different these data sets
are, prepare a scatterplot, including the regression line, by going to Graph > Scatterplot >
Simple. Enter each Y in the Y column and its corresponding X in the X column. Click
Multiple graphs, select “In separate panes in same graph”, and click OK. Then click
Data View > Regression, select Linear, and be sure the box for the intercept is checked.
Click OK twice.
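The same fits and plots can be reproduced outside Minitab. A minimal Python sketch, assuming the quartet as shipped with seaborn's example datasets:

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Anscombe's quartet as shipped with seaborn's example datasets
df = sns.load_dataset("anscombe")

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (name, g) in zip(axes.flat, df.groupby("dataset")):
    fit = sm.OLS(g["y"], sm.add_constant(g["x"])).fit()
    # every set gives roughly b0 = 3.0, b1 = 0.5, R2 = 66.6%
    print(name, fit.params.values.round(2), round(fit.rsquared, 3))
    ax.scatter(g["x"], g["y"])
    ax.plot(g["x"], fit.fittedvalues, "r-")   # fitted line (linear in x)
    ax.set_title(f"Set {name}")
plt.tight_layout()
plt.show()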
Set 1 displays an expected linear regression display, with the points scattered about but
reasonably close to the regression line.
In Set 2 it is clear that there is a strong relationship between x2 and y2, but it is a
curvilinear relationship instead of the linear one presumed by linear regression. The initial
linear statistical analysis understates the strength of the true relationship between the two
variables because it does not capture the curvilinear part of that relationship (a possible
remediation would be to include a power term).
For Set 3 there is an obvious outlier. Except for this one observation there would be a
perfect linear relationship between x3 and y3. The R2 for this analysis is 66.6%, which
considerably understates the linear relationship among most of the observations in this
data set.
Finally, in Set 4 there is also an obvious outlier (an influential one!). Except for this one
outlying observation there would be no relationship between x4 and y4, because all other
observations have the same value of x4, making it a useless predictor. Yet this one
influential observation fools the analyst into reporting a linear relationship with
R2 = 66.6%. Thus in this data set the outlier considerably overstates the true relationship
between x4 and y4.
Note: from the Minitab output there are no large residuals or influential outliers for Sets
1 or 2. Lesson: Do not completely rely on the output to identify problem
observation(s).
[Figure: Scatterplot of Y1 vs X1, Y2 vs X2, Y3 vs X3, Y4 vs X4 — four panels, each with its fitted regression line]