Heteroscedasticity
I. What it is and where to find it
A. Variance in Y changes with levels of one or more independent variables.
B. It is often a problem in time series data and when a measure is aggregated over
individuals.
1) Example: average college expenses measured by sampling 1% of students
at each of several institutions differing in size. Because the size of the
sample of students grows with institution size, and because the average
college expense has variance $\sigma^2/n$, as institution size grows, n grows and
$\sigma^2/n$ shrinks.
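The shrinking variance of the sample mean can be checked with a small simulation. This is only a sketch under assumed values: the population mean (10,000), the student-level standard deviation (3,000), the sample sizes, and the helper name `variance_of_sample_mean` are illustrative and do not come from the example above.

```python
import random
import statistics


def variance_of_sample_mean(n, sigma, reps=2000, seed=0):
    """Empirical variance of the mean of n draws from N(mu, sigma^2)."""
    rng = random.Random(seed)
    mu = 10_000.0  # assumed average expense; the result does not depend on it
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.pvariance(means)


sigma = 3_000.0  # assumed student-level standard deviation
# 1% samples from hypothetical schools of 1,000, 4,000 and 16,000 students
for n in (10, 40, 160):
    empirical = variance_of_sample_mean(n, sigma)
    theoretical = sigma**2 / n
    print(n, round(empirical), round(theoretical))
```

The empirical variance should track $\sigma^2/n$, so the averages from large institutions are measured more precisely than those from small ones.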
II. How to know you have it
A. Plot the data
B. Plot the residuals
C. With a categorical independent variable, one can perform a test for the
homogeneity of variance (e.g., Box’s test; cf. Winer, 1971).
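As a rough numeric stand-in for plotting the residuals, one can fit an OLS line and compare the spread of the residuals at low and high values of the predictor. The sketch below uses simulated data; the true coefficients (2 and 3), the error standard deviation of 0.5 times x, and the split into halves are all assumptions made for illustration, and the split comparison is not one of the tests named above.

```python
import random
import statistics

rng = random.Random(1)

# Simulated data in which the error SD grows with x (assumed: sd = 0.5 * x)
x = [1 + 9 * i / 999 for i in range(1000)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

# Ordinary least squares fit of y = a + b*x
xbar, ybar = statistics.fmean(x), statistics.fmean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x
)
a = ybar - b * xbar
residuals = [yi - a - b * xi for xi, yi in zip(x, y)]

# x is sorted, so the first half holds the low-x cases, the second the high-x cases
half = len(x) // 2
sd_low = statistics.pstdev(residuals[:half])
sd_high = statistics.pstdev(residuals[half:])
print(round(sd_low, 2), round(sd_high, 2))
```

A residual spread that is clearly larger in one half than in the other is the same fan-shaped pattern one would look for in a residual plot.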
III. What to do about it
A. Conceptually, one might want to give observations with greater variance
less weight, because they give a less precise indication of the path of the regression
line.
B. Instead of minimizing $\sum_i (y_i - a - b x_i)^2$, minimize
$$\sum_i \frac{1}{\sigma_i^2}\,(y_i - a - b x_i)^2. \qquad [1]$$
This is called weighted least squares because the ordinary least squares (OLS)
expression is “weighted” (by the inverse of the variance). Note that when
$\sigma_i^2 = \sigma^2$, that is, when the variances are all equal (homoscedastic), this
equation gives the ordinary least squares (OLS) solution for a and b. In the
heteroscedastic case, assuming normally distributed errors, this equation gives the
maximum likelihood estimates (MLE) of a and b.
C. In general, when the $\sigma_i^2$ are not known in advance, it is not possible to
solve [1] in closed form, and one must rely on computer programs that find the
minimum by iterative fitting algorithms.
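D. However, there is a simple solution whenever $\sigma_i$ is proportional to the values of
a variable (e.g., $X_i$), i.e., whenever $\sigma_i = kX_i$. In this case, one can obtain the
weighted least squares solution by minimizing
$$\sum_i \frac{1}{k^2 X_i^2}\,(y_i - a - b x_i)^2 = \frac{1}{k^2} \sum_i \left( \frac{y_i}{X_i} - \frac{a}{X_i} - \frac{b x_i}{X_i} \right)^2.$$
Because the constant multiplier $1/k^2$ does not affect the location of the minimum,
one can find the appropriate estimates of a and b by minimizing
$$\sum_i \left( \frac{y_i}{X_i} - \frac{a}{X_i} - \frac{b x_i}{X_i} \right)^2 = \sum_i \left( \frac{y_i}{X_i} - \frac{a}{X_i} - b \right)^2,$$
where the last step uses $x_i = X_i$ (the standard deviation is proportional to the
predictor itself).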
Therefore, weighted least squares estimates of the regression parameters can be
obtained by performing an ordinary least squares regression on the transformed
variables obtained by dividing the original variables by $X_i$:
$$\frac{y_i}{X_i} = a\,\frac{1}{X_i} + b + \frac{e_i}{X_i}$$
Note that the constant in this equation (b) corresponds to the regression coefficient
for $X_i$ in the original model, and that the regression coefficient for the new
independent variable ($1/X_i$) corresponds to the constant term (a) in the original equation.
Also, note that because the residuals are also divided by $X_i$, they will have
constant variance if the standard deviation of the original $e_i$ is proportional to
$X_i$, as assumed.
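The equivalence between weighting by $1/X_i^2$ and running OLS on the divided variables can be verified numerically. The following is a minimal sketch on made-up data; the helper name `weighted_fit` and the data values are hypothetical, and the weighted simple-regression formulas used are the standard closed-form ones for known weights.

```python
def weighted_fit(x, y, w):
    """Minimize sum(w_i * (y_i - a - b*x_i)**2) and return (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b


# Made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.7, 6.8, 7.1, 12.0, 10.4]

# Direct weighted least squares with weights 1/X_i^2 (k drops out)
a_wls, b_wls = weighted_fit(x, y, [1.0 / xi**2 for xi in x])

# OLS (equal weights) on the transformed variables y/X and 1/X
new_y = [yi / xi for xi, yi in zip(x, y)]
new_a = [1.0 / xi for xi in x]
const, coef = weighted_fit(new_a, new_y, [1.0] * len(x))

print(a_wls, coef)   # original intercept a vs. coefficient of 1/X
print(b_wls, const)  # original slope b vs. constant of the transformed fit
```

The two printed pairs should agree up to rounding error, with the roles of constant and coefficient switched exactly as described above.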
IV. Example: Airline transport accidents predicted by proportion of all flights flown by
airline.
A. Initial regression
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .698(a)   .487       .414                4.20085

a. Predictors: (Constant), Proportion of Total Flights

ANOVA(b)

Model 1       Sum of Squares   df   Mean Square   F       Sig.
Regression    117.359          1    117.359       6.650   .037(a)
Residual      123.530          7    17.647
Total         240.889          8

a. Predictors: (Constant), Proportion of Total Flights
b. Dependent Variable: INJURIES

Coefficients(a)

Model 1                       B (unstandardized)   Std. Error   Beta (standardized)   t       Sig.
(Constant)                    -.140                3.141                              -.045   .966
Proportion of Total Flights   64.975               25.196       .698                  2.579   .037

a. Dependent Variable: INJURIES
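The summary statistics in these tables are tied together by simple arithmetic on the ANOVA sums of squares. The sketch below recomputes them from the reported values; the variable names are illustrative, and the formulas are the usual textbook definitions rather than anything specific to the program that produced the output.

```python
import math

# Values copied from the ANOVA table of the initial regression
ss_regression, ss_residual = 117.359, 123.530
df_regression, df_residual = 1, 7

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1          # number of airlines

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
f_ratio = ms_regression / ms_residual
std_error = math.sqrt(ms_residual)           # std. error of the estimate

print(round(r_square, 3), round(adj_r_square, 3))
print(round(f_ratio, 3), round(std_error, 5))
```

The recomputed values should match the Model Summary and ANOVA entries to the printed precision.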
B. Plot of residuals indicates heteroscedasticity
[Scatterplot of Regression Standardized Residual against Regression Standardized Predicted Value. Dependent Variable: INJURIES]
C. New variables are therefore created by dividing the old variables by the proportion of total
flights: newinj = injuries / proportion of total flights and newa = 1 / proportion of total
flights. The predictor itself becomes proportion of total flights / proportion of total
flights = 1, which is absorbed into the constant.
Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .150(a)   .022       -.117               35.49718

a. Predictors: (Constant), NEWA
b. Dependent Variable: NEWINJ

ANOVA(b)

Model 1       Sum of Squares   df   Mean Square   F      Sig.
Regression    202.575          1    202.575       .161   .700(a)
Residual      8820.350         7    1260.050
Total         9022.925         8

a. Predictors: (Constant), NEWA
b. Dependent Variable: NEWINJ

Coefficients(a)

Model 1       B (unstandardized)   Std. Error   Beta (standardized)   t       Sig.
(Constant)    73.122               27.879                             2.623   .034
NEWA          -.883                2.202        -.150                 -.401   .700

a. Dependent Variable: NEWINJ
This gives the WLS solution:
Number of incidents = -.883 + 73.122 * p(total flights)
Recall (or see above) that the coefficients for the constant and the predictor are switched.
The $R^2$ for this model can be obtained by squaring the correlation between the estimated
and actual number of incidents: $(.698)^2 = .487$. The variable statistics can be obtained from
the above results (remembering that the coefficient labeled constant is the coefficient for
the independent variable). Notice that the t value for the independent variable has
increased slightly (from 2.579 to 2.623), reflecting the added precision in this model.
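The bookkeeping for the switched coefficients can be written out explicitly. This sketch only rearranges the numbers reported in the two Coefficients tables; the dictionary layout, the `predict` helper, and the example proportion of 0.10 are hypothetical.

```python
# Coefficients table of the transformed regression (NEWINJ on NEWA)
transformed = {
    "(Constant)": {"B": 73.122, "SE": 27.879},
    "NEWA": {"B": -0.883, "SE": 2.202},
}

# Roles are switched: the constant is the slope for proportion of total
# flights, and the NEWA coefficient is the intercept of the original model.
wls_slope = transformed["(Constant)"]["B"]
wls_intercept = transformed["NEWA"]["B"]

# t = B / Std. Error for each term
t_slope_wls = transformed["(Constant)"]["B"] / transformed["(Constant)"]["SE"]
t_intercept_wls = transformed["NEWA"]["B"] / transformed["NEWA"]["SE"]
t_slope_ols = 64.975 / 25.196  # from the initial (unweighted) regression


def predict(proportion):
    """WLS prediction of the number of incidents for a given flight share."""
    return wls_intercept + wls_slope * proportion


print(round(t_slope_ols, 3), round(t_slope_wls, 3), round(t_intercept_wls, 3))
print(predict(0.10))  # hypothetical airline flying 10% of all flights
```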
D. The plot of the residuals indicates that the heteroscedasticity problem has disappeared.
[Scatterplot of Regression Standardized Residual against Regression Standardized Predicted Value. Dependent Variable: NEWINJ]
V. Multivariate Weighted Least Squares
A. Recall that the ordinary least squares solution is:
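$$B = (X'X)^{-1}X'Y$$
The WLS solution is $B = (X'U^{-1}X)^{-1}X'U^{-1}Y$, where
$$U = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix} \quad \text{and} \quad U^{-1} = \begin{pmatrix} 1/\sigma_1^2 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sigma_n^2 \end{pmatrix}$$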
That is, the ordinary least squares solution is weighted by the inverse of the
variances. The regression equation has the form: $U^{-1}Y = U^{-1}XB + U^{-1}e$.
Note that one would obtain the same result if one multiplied the original
regression equation by D, where
$$D = \begin{pmatrix} 1/\sigma_1 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sigma_n \end{pmatrix}$$
This would yield the solution
$$B = [(DX)'DX]^{-1}(DX)'(DY) = (X'D'DX)^{-1}X'D'DY$$
Because $D'D = U^{-1}$, this solution is identical to the one above.
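The identity between the two matrix forms can be checked on a small case. The sketch below uses plain Python lists for a model with a constant and one predictor; the data, the assumed-known standard deviations in `sigma`, and the helper functions are all hypothetical and written only for this illustration.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def diagonal(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


# Made-up data; sigma holds the (assumed known) error SD of each case
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.9, 4.3, 5.2, 9.1, 8.4]
sigma = [0.5, 1.0, 1.5, 2.0, 2.5]

X = [[1.0, xi] for xi in x]          # column of ones plus the predictor
Y = [[yi] for yi in y]
U_inv = diagonal([1.0 / s**2 for s in sigma])
D = diagonal([1.0 / s for s in sigma])

# B = (X'U^-1 X)^-1 X'U^-1 Y
Xt_Uinv = matmul(transpose(X), U_inv)
B_wls = matmul(inverse_2x2(matmul(Xt_Uinv, X)), matmul(Xt_Uinv, Y))

# B = [(DX)'DX]^-1 (DX)'(DY), i.e. OLS on the rescaled data
DX, DY = matmul(D, X), matmul(D, Y)
DXt = transpose(DX)
B_transformed = matmul(inverse_2x2(matmul(DXt, DX)), matmul(DXt, DY))

print(B_wls)
print(B_transformed)
```

Both printed vectors should agree up to rounding error, which is the numerical counterpart of $D'D = U^{-1}$.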