Stat 404
Variance Stabilizing Transformations
A. Recall that when the homoscedasticity assumption, E(ẽẽᵀ) = σ²Iₙ, is not met, MSE·(XᵀX)⁻¹ is biased as an estimator of Var(b̃). As a consequence, significance tests for slope estimates would be of questionable accuracy.
1. Many times the variance of the dependent variable differs across values of an independent variable.
2. This can be readily detected in a diagnostic plot in which the residual values ( êi ) for each
unit of analysis are plotted by the corresponding estimated value ( Yˆi ) for the unit, or by
any of the independent variables ( X i ) used in obtaining this estimate.
3. Thus, a diagnostic plot looks as follows:

[Scatterplot: residuals êᵢ = Yᵢ − Ŷᵢ on the vertical axis, plotted against Ŷᵢ (or Xᵢ) on the horizontal axis.]
4. If (with the aid of a box plot) an examination of such a plot shows the variance to change
notably for different Y- or X-values, a variance stabilizing strategy is called for.
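This check can be sketched numerically as well as visually. Below is a minimal Python sketch using simulated, hypothetical data in which the error standard deviation grows with X by construction: it fits OLS, then compares the residuals' spread between low and high fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heteroscedastic data: the spread of Y grows with X.
x = rng.uniform(1, 10, 500)
y = 2.0 + 3.0 * x + rng.normal(0, x)  # error SD proportional to X

# OLS fit via least squares.
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ b
resid = y - fitted

# Split residuals by low vs. high fitted values, as on the diagnostic plot.
lo = resid[fitted < np.median(fitted)]
hi = resid[fitted >= np.median(fitted)]
# hi.std() should be notably larger than lo.std()
```

A notably larger residual spread in the upper range of fitted values is the numerical counterpart of the fanning-out pattern on the plot.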
B. Weighted Least Squares (WLS)
1. WLS is a “brute force” method of ensuring constant variances. Whereas OLS estimates minimize ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² = SS_ERROR, WLS estimates minimize ∑ᵢ₌₁ⁿ [(Yᵢ − Ŷᵢ)/σ̂ᵢ]², where σ̂ᵢ² is the expected value of the i,i cell of the ẽẽᵀ matrix.
2. How this is done in practice:
a. Compute the standard deviation of ei for different ranges of values of Yˆ or X .
b. Divide each Yi by the appropriate standard deviation.
3. An illustration: It is hard to imagine anyone who (ignoring women for the moment)
sincerely agrees that “all men are created equal” and who simultaneously is very racially
prejudiced. On the other hand, there may be many reasons why someone might disagree
that “all men are created equal.” Thus, there might be a lot of variation in racial
prejudice among people with such a belief.
a. Imagine that you have the following plot:
[Scatterplot: Racial Prejudice Score (PREJUDICE), low to high on the vertical axis, by “All men are created equal.” (EQUAL) on the horizontal axis: 1 = Strongly Agree, 2 = Agree, 3 = Undecided, 4 = Disagree, 5 = Strongly Disagree. The scatter widens from left to right, with within-category standard deviations σ̂ = 1, 2, 3, 4, and 5, respectively.]
b. In SPSS you could compute five standard deviations with statements such as …
select if (EQUAL EQ 1).
frequencies VARS=prejudice/stddev.
c. Then with the values on these standard deviations (assuming that they equal the integers from 1 to 5) you could precede your regression command with …
if (EQUAL EQ 1) prejudice=prejudice/1.
if (EQUAL EQ 2) prejudice=prejudice/2.
if (EQUAL EQ 3) prejudice=prejudice/3.
if (EQUAL EQ 4) prejudice=prejudice/4.
if (EQUAL EQ 5) prejudice=prejudice/5.
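For readers working outside SPSS, the same conditional division can be sketched in Python. The data below are simulated and hypothetical, with within-category standard deviations set equal to the EQUAL value, as in the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: EQUAL category (1-5) and a PREJUDICE score whose
# within-category standard deviation equals the EQUAL value.
equal = rng.integers(1, 6, 1000)
prejudice = 5.0 * equal + rng.normal(0, equal)

# The five SPSS "if" statements amount to dividing each score by the
# standard deviation of its EQUAL category (assumed here to be 1..5).
weighted = prejudice / equal

# Within-category spreads are now roughly constant (each near 1).
sds = [weighted[equal == k].std() for k in range(1, 6)]
```

After the division, each category's standard deviation is approximately 1, which is exactly the stabilization the WLS weighting is meant to achieve.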
d. Note how values on the diagnostic plot would change after dividing by these standard deviations.
i. Before transformation of PREJUDICE (P):
[Scatterplot: residuals e = P − P̂ (vertical axis, roughly −10 to 10) by EQUAL (1 = Strongly Agree, 2 = Agree, 3 = Undecided, 4 = Disagree, 5 = Strongly Disagree). The residuals' spread increases across categories, with σ̂ = 1, 2, 3, 4, and 5, respectively.]
ii. After transformation of PREJUDICE (P):
[Scatterplot: weighted residuals eᵢ = (P − P̂)/σ̂ᵢ (vertical axis, roughly −10 to 10) by EQUAL (1 = Strongly Agree, 2 = Agree, 3 = Undecided, 4 = Disagree, 5 = Strongly Disagree). The residuals' spread is now constant across categories.]
iii. Note how the variances of the dependent variable are now homoscedastic (i.e.,
constant for different values taken by the independent variable).
4. A special case (based on Wonnacott and Wonnacott 1970, pp. 132-135).
a. Actually, we have a special case in this illustration. In particular, note how the
standard deviation of Y increases proportionately with the magnitude of X.
b. For this reason you might have wondered why the series of SPSS “IF statements”
were not combined into the single statement, …
compute newPREJUDICE=PREJUDICE/EQUAL.
c. This is perfectly legitimate in this special case. But if EQUAL is an independent
variable in the regression, note how there must be changes made in the following
“usual” regression model (or Model 1):
Ŷ = â + b̂X
d. In this case, variances are stabilized by estimating the following model (Model 2):
Ŷ/X = â′(1/X) + b̂′(X/X) = â′(1/X) + b̂′
e. Notice that the constant, b̂′, estimated in Model 2 is an improved estimate of the slope, b̂, in Model 1. Likewise the slope, â′, estimated in Model 2 is an improved estimate of the constant, â, in Model 1.
f. In practice, one would estimate such a model using the following SPSS commands:
compute newY=Y/X.
compute newX=1/X.
regression vars=newY, newX/dep=newY/enter.
g. After obtaining the slope and constant from this regression, one can express the
results in the form of Model 1, where the slope found via SPSS is one’s estimate of
â , and the constant found via SPSS is one’s estimate of b̂ .
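A brief Python sketch of steps f and g, using simulated data with hypothetical values a = 4 and b = 2, and an error standard deviation that grows proportionately with X by construction:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 4.0, 2.0                          # hypothetical Model 1 parameters
x = rng.uniform(1, 10, 2000)
y = a + b * x + rng.normal(0, 0.5 * x)   # error SD proportional to X

# Model 2: regress Y/X on 1/X.  Dividing through by X makes the error
# term homoscedastic (SD = 0.5 everywhere).
design = np.column_stack([np.ones_like(x), 1.0 / x])
const_hat, slope_hat = np.linalg.lstsq(design, y / x, rcond=None)[0]

# const_hat estimates the Model 1 slope (b), and slope_hat estimates
# the Model 1 constant (a), illustrating the swap of roles in step e.
```

With the simulated data, const_hat lands near b and slope_hat near a, so the results can be re-expressed in Model 1 form exactly as step g describes.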
5. But WLS has problems: Dividing the dependent variable by a variety of standard deviations leaves the researcher with no units on the thus-transformed dependent variable. Worse, it is difficult to interpret the slopes: A “one unit” increase in the transformed dependent variable corresponds to a different number of units on the original PREJUDICE measure, depending on the value of EQUAL. Fortunately, in the just-discussed case the ability to reexpress the regression model back into its “usual” form allows (in this special case only) the original units to be preserved.
C. Oftentimes heteroscedasticity is a consequence of the nature of the dependent variable itself.
You have already experienced this when working with the recall data.
1. The recall data are in proportions.
2. Sample proportions have variances that differ both according to the magnitudes of the
population proportions that they estimate, and according to the number of cases on which
they are based:
a. That is, σ̂²(p̂ᵢ) = pᵢ(1 − pᵢ)/nᵢ.
b. Note that the standard error of a proportion is smallest when the corresponding
population proportion is near zero (0) or unity (1), and when the number of trials on
which it is based is relatively large.
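A quick simulation check of the formula in 2a, using a hypothetical population proportion and sample size:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population proportion and number of trials.
p, n = 0.3, 50

# Simulate many sample proportions and compare their variance with theory.
p_hats = rng.binomial(n, p, 100_000) / n
theory = p * (1 - p) / n       # p(1 - p)/n = 0.0042
observed = p_hats.var()        # should closely match the formula
```

Rerunning with p near 0 or 1, or with a larger n, shrinks both the theoretical and the observed variance, matching point b above.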
3. If an independent variable (e.g., year of event (YOE)), is positively associated with the
proportion of age groups’ recalling each among a set of political events (namely,
RECALL), then the variance of RECALL will be different across different levels of
YOE. (Note that these differences have nothing to do with the independent variable, but
everything to do with the inherent nature of the dependent variable.)
4. Such heteroscedasticity can be detected in two types of scatterplots:
a. RECALL (Y) by YOE (X)
[Scatterplot: RECALL (vertical axis, 0 to 1) by YOE (horizontal axis). The vertical spread of points is narrow where RECALL is near 0 or 1 and wider at intermediate values.]
b. Residuals (ê) by Estimates (Ŷ)
i. After regressing RECALL on YOE, you can find…
1) the estimated values of RECALL: Yˆ = aˆ + bˆX
2) the errors from the regression: Y − Yˆ = eˆ
ii. If there is homoscedasticity, the variance of the ê should remain constant
throughout all (here, estimated) values of RECALL.
iii. When one’s dependent variable takes values of proportions (or percentages), then
the diagnostic plot will likely appear as follows:
[Scatterplot: residuals ê (vertical axis, centered at 0) by estimates Ŷ (horizontal axis, 0 to 1). The residuals' spread is smallest near Ŷ = 0 and Ŷ = 1 and largest in between.]
5. A commonly used transformation for proportions is the arc-sine-square root
transformation.
a. This is sometimes written arcsin(√p) or sin⁻¹(√p).
b. In SPSS one combines the arc-sine and square root functions as arsin(sqrt(p)).
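The same transformation in Python combines numpy's arcsin and sqrt functions:

```python
import numpy as np

# Arc-sine-square-root transformation of a vector of proportions.
p = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
transformed = np.arcsin(np.sqrt(p))
# arcsin(sqrt(0.5)) = pi/4 and arcsin(sqrt(1.0)) = pi/2
```

The transformed values range from 0 to π/2 radians, as described in point c below.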
c. Like p, √p takes values ranging from zero (0) to unity (1). However, values of y = sin⁻¹(x) for 0 ≤ x ≤ 1 range from 0 to π/2 radians.

[Plot: y = sin⁻¹(x) (radians) against x for 0 ≤ x ≤ 1, rising from 0 to π/2; the curved line is a segment from a sine wave.]

Defining x = √p, selected values of p and sin⁻¹(√p) are listed below:

p                  0    .1    .2    .3    .4    .5    .6    .7    .8    .9   1.0
sin⁻¹(√p)          0   .32   .46   .58   .68   .79   .89   .99  1.11  1.25  1.57
change between
adjacent values        .32   .14   .12   .10   .11   .10   .10   .12   .15   .32

Note how the transformation stretches the scale near p = 0 and p = 1 (where sample proportions have the smallest variances) relative to the middle of the range.
6. The arcsin-square root transformation will suffice if each pi is calculated based on the
same number of observations. When this is not the case, the transformation should be as
follows:
(1/√nᵢ) · sin⁻¹(√pᵢ), where pᵢ is a sample proportion calculated from nᵢ observations.
D. Another common pattern of heteroscedasticity is one in which the magnitude of the
residuals’ variances is positively associated with the estimated values of the dependent
variable (i.e., with the Yˆi ). This pattern is thus evident when one’s diagnostic plot looks as
follows:
[Scatterplot: residuals ê (vertical axis, centered at 0) by estimates Ŷ (horizontal axis). The residuals' spread fans out steadily as Ŷ increases.]
Three variable transformations are commonly used to correct this pattern. The choice of one
among these transformations depends on how quickly the residual variance increases as a
function of the conditional means, Yˆ .
1. Square root transformation: Y′ = √Y
When your dependent variable consists of random observations from a Poisson
distribution, this is the transformation usually called for.
a. A Poisson random variable is always a nonnegative integer representing a number of counts over a specified time interval. For example, …
i. the number of aggressive acts per minute
ii. the number of labor strikes per year
iii. the number of stock purchases per day
b. If some values of Y equal zero (e.g., no aggressive acts for a one minute period), use
the following transformation instead:
Y′ = √(Y + 1) + √Y
c. With any Poisson random variable, the variance of the distribution equals its mean.
The square root transformation should therefore be used whenever the variance of
your data increases as a linear function of the conditional means of your dependent
variable.
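The claim in 1c can be checked by simulation. The sketch below draws Poisson counts at three hypothetical means; the raw variances track the means, while the variances of the square roots are roughly constant:

```python
import numpy as np

rng = np.random.default_rng(4)
means = [4.0, 25.0, 100.0]                # hypothetical Poisson means

# For a Poisson variable, the variance equals the mean, so the raw
# variances grow linearly with the conditional means (~4, 25, 100).
raw_vars = [rng.poisson(m, 200_000).var() for m in means]

# After the square root transformation, each variance is roughly 1/4,
# regardless of the mean.
stab_vars = [np.sqrt(rng.poisson(m, 200_000)).var() for m in means]
```

The stabilized variances cluster near 0.25 (the approximation is slightly looser at small means), while the raw variances span two orders of magnitude.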
2. Logarithmic transformation: Y′ = log₁₀(Y) or Y′ = ln(Y)
When your dependent variable has a distribution that is severely skewed to the right
(having a few very large values of Y in comparison to many smaller values of Y), this is
the appropriate transformation.
a. Some examples are…
i. annual household income
ii. annual human rights violations per million population
iii. monthly advertising expenditures
b. If some values of Y equal zero (e.g., no income), use the following transformation:
Y ′ = log(Y + 1)
c. Like the square root transformation (and the inverse transformation to be discussed
shortly), this transformation often is a corrective for nonnormality as well as for
heteroscedasticity, since it reduces the magnitude of extremely large Ys much more
than of moderate to small Ys. For example, consider a distribution with a median of
125 but with a range from 3 to 5,000:
              Y      log₁₀(Y)
minimum       3        0.5
median      125        2.1
maximum   5,000        3.7
Notice how the ratio of “the distance from the maximum to the median” to “the
distance from the median to the minimum” is reduced from 40:1 to 1:1 with this
transformation.
d. The logarithmic transformation is needed whenever the variance of your data
increases as a function of the square of the conditional means of your dependent
variable. That is, it should be used when the standard deviation increases as a
function of the conditional means.
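A simulation sketch of point 2d, using hypothetical lognormal data in which the standard deviation is proportional to the conditional mean (i.e., the variance is proportional to the squared mean):

```python
import numpy as np

rng = np.random.default_rng(5)
means = [10.0, 100.0, 1000.0]             # hypothetical conditional means

# Lognormal errors make the SD proportional to the mean.
samples = [m * np.exp(rng.normal(0.0, 0.3, 100_000)) for m in means]

raw_sds = [s.std() for s in samples]          # grows ~tenfold per step
log_sds = [np.log(s).std() for s in samples]  # each ≈ 0.3, constant
```

Taking logs converts the multiplicative spread into a constant additive spread, which is exactly the condition needed for homoscedastic residuals.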
3. Inverse transformation: Y′ = 1/Y
When your dependent variable takes many (but not all) values near zero, taking the
inverse of the variable might be advisable.
a. The inverse transformation often involves reconceptualizing the transformed
variable’s units. For example…
i. in a sample of census tracts, children per household might be transformed to households per thousand children
ii. in a survey of families, daily vacation expenses might be transformed to the number of vacation days that $500 would purchase
b. If some values of Y equal zero (e.g., no children living in a particular census tract),
use the following transformation:
Y′ = 1/(Y + 1)
c. The inverse transformation is used when the variance of your data increases as a
function of the fourth power of the conditional means of the dependent variable. That
is, it is used when the standard deviation increases as a function of the square of the
conditional means.
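A simulation sketch of point 3c, with hypothetical normal data whose standard deviation equals a constant times the square of the conditional mean (so the variance grows as the fourth power of the mean):

```python
import numpy as np

rng = np.random.default_rng(6)
means = [5.0, 10.0, 20.0]                 # hypothetical conditional means
c = 0.005                                 # hypothetical scale constant

# SD equals c times the squared mean (variance ∝ mean⁴).
samples = [rng.normal(m, c * m**2, 100_000) for m in means]

raw_sds = [s.std() for s in samples]          # grows as the mean squared
inv_sds = [(1.0 / s).std() for s in samples]  # each ≈ c, constant
```

After the inverse transformation the spread no longer depends on the conditional mean, which is the stabilization described above.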
4. The square root, logarithm, and inverse transformations are all special cases of weighted
least squares.
a. Note that by multiplying a random variable, X, by a constant, k, a new variable is created with a mean equal to kμ_X and a variance equal to k²σ²_X. For example, if X is normally distributed, …
kX ~ N(kμ_X, k²σ²_X).
When weighted least squares is used, one multiplies the values of a random variable by the inverse of their corresponding conditional standard deviations. Variable transformations work the same way, except that they adjust for the variances' systematic dependence on the value(s) of one or more independent variables or, as in the examples below, on the conditional mean(s) of the dependent variable.
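The scaling fact in 4a is easy to verify numerically (hypothetical values k = 3, μ_X = 10, σ_X = 2):

```python
import numpy as np

rng = np.random.default_rng(7)
k = 3.0
x = rng.normal(10.0, 2.0, 200_000)   # X ~ N(10, 4)
kx = k * x

# kx.mean() ≈ k * 10 = 30, and kx.var() ≈ k² * 4 = 36
```

The same logic, with k replaced by the inverse of a conditional standard deviation, is what makes the WLS weighting (and the transformations below) work.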
b. Consider the case in which Y (possibly a Poisson random variable) has a variance that
is directly proportional to its mean. That is, consider…
Yᵢ ~ (a + bXᵢ, [a + bXᵢ]σ²).
i. Noting that each observed Yᵢ is an estimator of a + bXᵢ, it follows that a square root transformation, Yᵢ × 1/√(a + bXᵢ) ≈ Yᵢ × 1/√Yᵢ = √Yᵢ, would be appropriate.
ii. Consequently, …
√Yᵢ ~ (√(a + bXᵢ), σ²).
iii. IMPORTANT: Notice how multiplying through by 1/√(a + bXᵢ) serves to “cancel out” the dependence of the variance of Yᵢ on a + bXᵢ, through a process much like that of multiplying a random variable by a constant.
c. Now consider a random variable, Y, with variance that is directly proportional to the
square of its mean. That is to say, …
Yᵢ ~ (a + bXᵢ, [a + bXᵢ]²σ²).
i. As before, note that each observed Yᵢ is an estimator of a + bXᵢ, such that…
Yᵢ × 1/(a + bXᵢ) ≈ Yᵢ × 1/Yᵢ = 1.
ii. Although this may appear to be a dead end, it is not. Note that the observed Yᵢ almost surely do not exactly equal their conditional means, a + bXᵢ. Instead, Yᵢ/(a + bXᵢ) = Yᵢ^λ, where λ ≈ 0.
iii. Following Draper & Smith (1981, p. 289), “(I)f we take a small positive or negative power of Y (say, Y^.01 or Y^−.01), it will plot against log Y very nearly as a straight line, and linearity will be more and more nearly achieved the smaller the power we take.” That is, the closer the observed values of your dependent variable (i.e., the Yᵢ) are to their true conditional means (i.e., their means conditional on each's corresponding value of Xᵢ), the closer that log(Yᵢ) comes to comprising a perfect linear mapping of Yᵢ/(a + bXᵢ). For the purpose of regression, measures are interchangeable whenever they (like feet and inches) comprise a perfect linear mapping of each other.
iv. Accordingly, in this case…
log(Yᵢ) ~ (log[a + bXᵢ], σ²).
d. Finally, consider a random variable, Y, with variance that is directly proportional to
the fourth power of its mean. In other words, …
Yᵢ ~ (a + bXᵢ, [a + bXᵢ]⁴σ²).
i. Once again, note that each observed Yi is an estimator of a + bXi , such that…
Yᵢ × 1/(a + bXᵢ)² ≈ Yᵢ × 1/Yᵢ² = 1/Yᵢ.
ii. The distribution of this inverse transformation will have the following mean and variance:
1/Yᵢ ~ (1/(a + bXᵢ), σ²).
E. Some references:
Box, George E.P., and Norman R. Draper. 1987. Empirical Model Building and Response Surfaces. New York: Wiley.
Draper, N.R., and H. Smith. 1981. Applied Regression Analysis, 2nd ed. New York: Wiley. See pp. 237-241.
Neter, John, and William Wasserman. 1974. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Design. Homewood, IL: Irwin. See pp. 131-136, 506-508.
Neter, John, William Wasserman, and Michael H. Kutner. 1985. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Design, 2nd ed. Homewood, IL: Irwin. See pp. 134-141, 615-617.
Weisberg, Sanford. 1980. Applied Linear Regression. New York: Wiley. See pp. 122-131. In 2nd edition (published in 1985) see pp. 133-146.
Wonnacott, Ronald J., and Thomas H. Wonnacott. 1970. Econometrics. New York: Wiley. See pp. 132-136.