Stat 404
Variance Stabilizing Transformations

A. Recall that when the homoscedasticity assumption, $E(\tilde{e}\tilde{e}^T) = \sigma^2 I_n$, is not met, $MSE \cdot (X^T X)^{-1}$ is biased as an estimator of $\operatorname{Var}(\tilde{b})$. As a consequence, significance tests for slope estimates would be of questionable accuracy.

1. Many times the variance of the dependent variable differs across values of another variable.

2. This can be readily detected in a diagnostic plot in which the residual ($\hat{e}_i$) for each unit of analysis is plotted against the corresponding estimated value ($\hat{Y}_i$) for the unit, or against any of the independent variables ($X_i$) used in obtaining this estimate.

3. Thus, a diagnostic plot looks as follows:

[Diagnostic plot: the residuals, $\hat{e}_i = Y_i - \hat{Y}_i$, on the vertical axis, plotted against $\hat{Y}_i$ (or $X_i$) on the horizontal axis.]

4. If (with the aid of a box plot) an examination of such a plot shows the variance to change notably for different Y- or X-values, a variance stabilizing strategy is called for.

B. Weighted Least Squares (WLS)

1. WLS is a "brute force" method of ensuring constant variances. Whereas OLS estimates minimize

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = SS_{ERROR},$$

WLS estimates minimize

$$\sum_{i=1}^{n} \left( \frac{Y_i - \hat{Y}_i}{\hat{\sigma}_i} \right)^2,$$

where $\hat{\sigma}_i^2$ is the expected value of the $i,i$ cell of the $\tilde{e}\tilde{e}^T$ matrix.

2. How this is done in practice:

a. Compute the standard deviation of $e_i$ for different ranges of values of $\hat{Y}$ or $X$.

b. Divide each $Y_i$ by the appropriate standard deviation.

3. An illustration: It is hard to imagine anyone who (ignoring women for the moment) sincerely agrees that "all men are created equal" and who simultaneously is very racially prejudiced. On the other hand, there may be many reasons why someone might disagree that "all men are created equal." Thus, there might be a lot of variation in racial prejudice among people with such a belief.

a. Imagine that you have the following plot:

[Plot: Racial Prejudice Score (PREJUDICE), low to high, by "All men are created equal" (EQUAL), where 1 = Strongly Agree, 2 = Agree, 3 = Undecided, 4 = Disagree, and 5 = Strongly Disagree. The scatter fans out, with conditional standard deviations $\hat{\sigma} = 1, 2, 3, 4, 5$ at EQUAL = 1 through 5, respectively.]

b. In SPSS you could compute five standard deviations with statements such as ...

temporary.
select if (EQUAL EQ 1).
frequencies variables=prejudice
  /statistics=stddev.

(The temporary command keeps select if from permanently deleting cases, so the same pair of commands can be rerun for EQUAL values 2 through 5.)

c. Then with the values of these standard deviations (assuming that they equal the integers from 1 to 5) you could precede your regression command with ...

if (EQUAL EQ 1) prejudice=prejudice/1.
if (EQUAL EQ 2) prejudice=prejudice/2.
if (EQUAL EQ 3) prejudice=prejudice/3.
if (EQUAL EQ 4) prejudice=prejudice/4.
if (EQUAL EQ 5) prejudice=prejudice/5.

d. Note how values on the diagnostic plot would change after dividing by these standard deviations.

i. Before transformation of PREJUDICE (P):

[Plot: residuals $e = P - \hat{P}$, ranging from -10 to 10, by EQUAL (1 = Strongly Agree through 5 = Strongly Disagree). The residuals' spread widens as EQUAL increases, with $\hat{\sigma} = 1, 2, 3, 4, 5$ at EQUAL = 1 through 5.]

ii. After transformation of PREJUDICE (P):

[Plot: residuals $e_i = (P - \hat{P})/\hat{\sigma}_i$, ranging from -10 to 10, by EQUAL. The residuals' spread is now the same at every value of EQUAL.]

iii. Note how the variances of the dependent variable are now homoscedastic (i.e., constant for different values taken by the independent variable).

4. A special case (based on Wonnacott and Wonnacott 1970, pp. 132-135).

a. Actually, we have a special case in this illustration. In particular, note how the standard deviation of Y increases proportionately with the magnitude of X.

b. For this reason you might have wondered why the series of SPSS "if statements" was not combined into the single statement, ...

compute newPREJUDICE=PREJUDICE/EQUAL.

c. This is perfectly legitimate in this special case. But if EQUAL is an independent variable in the regression, note how there must be changes made in the following "usual" regression model (or Model 1):

$$\hat{Y} = \hat{a} + \hat{b}X$$

d. In this case, variances are stabilized by estimating the following model (Model 2):

$$\frac{\hat{Y}}{X} = \hat{a}'\,\frac{1}{X} + \hat{b}'\,\frac{X}{X} = \hat{a}'\,\frac{1}{X} + \hat{b}'$$

e. Notice that the constant, $\hat{b}'$, estimated in Model 2 is an improved estimate of the slope, $\hat{b}$, in Model 1. Likewise the slope, $\hat{a}'$, estimated in Model 2 is an improved estimate of the constant, $\hat{a}$, in Model 1.

f. In practice, one would estimate such a model using the following SPSS commands:

compute newY=Y/X.
compute newX=1/X.
regression
  /variables=newY newX
  /dependent=newY
  /method=enter newX.

g. After obtaining the slope and constant from this regression, one can express the results in the form of Model 1, where the slope found via SPSS is one's estimate of $\hat{a}$, and the constant found via SPSS is one's estimate of $\hat{b}$.

5. But WLS has problems: Dividing the dependent variable by a variety of standard deviations leaves the researcher with no units on the thus-transformed dependent variable. Worse, it is difficult to interpret the slopes: A "one unit" increase in the transformed dependent variable corresponds to a different number of units on the original PREJUDICE measure, with the number depending on the value of EQUAL. Fortunately, in the just-discussed case the ability to reexpress the regression model back into its "usual" form allows (in this special case only) the original units to be preserved. (A one-step SPSS alternative to the hand computations in B.2 and B.3 is sketched below.)
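As an aside, the hand weighting above can also be left to SPSS itself: the REGRESSION procedure's regwgt subcommand performs weighted least squares, weighting each case by a variable proportional to $1/\hat{\sigma}_i^2$. A minimal sketch under the assumptions of the illustration in B.3 (conditional standard deviations equal to the values of EQUAL); the weight variable wt is a name introduced here for illustration:

* One-step WLS: weight each case by the inverse of its conditional variance.
* Assumes, as in the illustration above, that sigma-hat equals EQUAL.
compute wt=1/(EQUAL**2).
regression
  /variables=prejudice EQUAL
  /regwgt=wt
  /dependent=prejudice
  /method=enter EQUAL.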
C. Oftentimes heteroscedasticity is a consequence of the nature of the dependent variable itself. You have already experienced this when working with the recall data.

1. The recall data are in proportions.

2. Sample proportions have variances that differ both according to the magnitudes of the population proportions that they estimate, and according to the number of cases on which they are based:

a. That is, $\hat{\sigma}^2_{\hat{p}_i} = \dfrac{p_i(1 - p_i)}{n_i}$.

b. Note that the standard error of a proportion is smallest when the corresponding population proportion is near zero (0) or unity (1), and when the number of trials on which it is based is relatively large.

3. If an independent variable (e.g., year of event (YOE)) is positively associated with the proportion of age groups' recalling each among a set of political events (namely, RECALL), then the variance of RECALL will be different across different levels of YOE. (Note that these differences have nothing to do with the independent variable, but everything to do with the inherent nature of the dependent variable.)

4. Such heteroscedasticity can be detected in two types of scatterplots:

a. RECALL (Y) by YOE (X)

[Scatterplot: RECALL, ranging from 0 to 1, by YOE. The points trend upward, with vertical spread narrowest where RECALL is near 0 or 1 and widest at intermediate values.]

b. Residuals ($\hat{e}$) by Estimates ($\hat{Y}$)

i. After regressing RECALL on YOE, you can find ...

1) the estimated values of RECALL: $\hat{Y} = \hat{a} + \hat{b}X$

2) the errors from the regression: $Y - \hat{Y} = \hat{e}$

ii. If there is homoscedasticity, the variance of the $\hat{e}$ should remain constant throughout all (here, estimated) values of RECALL.

iii. When one's dependent variable takes values of proportions (or percentages), the diagnostic plot will likely appear as follows (SPSS commands for producing such a plot are sketched at the end of this section):

[Diagnostic plot: residuals $\hat{e}$ by estimates $\hat{Y}$ between 0 and 1. The residuals' spread is narrow near $\hat{Y} = 0$ and $\hat{Y} = 1$ and widest at intermediate values.]

5. A commonly used transformation for proportions is the arc-sine-square root transformation.

a. This is sometimes written $\arcsin(\sqrt{p})$ or $\sin^{-1}(\sqrt{p})$.

b. In SPSS one combines the arc-sine and square root functions as arsin(sqrt(p)).

c. Like $p$, $\sqrt{p}$ takes values ranging from zero (0) to unity (1). However, values of $y = \sin^{-1}(\sqrt{x})$ for $0 \le x \le 1$ are as follows:

[Graph: $y = \sin^{-1}(\sqrt{x})$, in radians, rising from 0 at $x = 0$ to $\pi/2$ at $x = 1$. The curved line is a segment from a sine wave.]

Defining $x = p$, selected values of $p$ and $\sin^{-1}(\sqrt{p})$ are listed below (the third row gives the change from the value at the preceding $p$):

p                  0    .1   .2   .3   .4   .5   .6   .7   .8    .9    1.0
sin^-1(sqrt(p))    0    .32  .46  .58  .68  .79  .89  .99  1.11  1.25  1.57
change             --   .32  .14  .12  .10  .11  .10  .10  .12   .15   .32

6. The arcsin-square root transformation will suffice if each $p_i$ is calculated from the same number of observations. When this is not the case, the transformation should be as follows: $\sqrt{n_i}\,\sin^{-1}(\sqrt{p_i})$, where $p_i$ is a sample proportion calculated from $n_i$ observations.
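A sketch of how the diagnostic plot in C.4.b and the transformation in C.5 and C.6 might be carried out in SPSS; recall and yoe are the variables from the example above, while yhat, ehat, newrecall, and the count variable n are placeholder names introduced here:

* Regress RECALL on YOE, saving the estimates (yhat) and residuals (ehat).
regression
  /variables=recall yoe
  /dependent=recall
  /method=enter yoe
  /save pred(yhat) resid(ehat).
* Plot the residuals against the estimates.
graph /scatterplot(bivar)=yhat with ehat.
* Weighted arc-sine-square root transformation, where n is the number of
* observations on which each proportion is based.
compute newrecall=sqrt(n)*arsin(sqrt(recall)).
execute.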
D. Another common pattern of heteroscedasticity is one in which the magnitude of the residuals' variances is positively associated with the estimated values of the dependent variable (i.e., with the $\hat{Y}_i$). This pattern is thus evident when one's diagnostic plot looks as follows:

[Diagnostic plot: residuals $\hat{e}$ by estimates $\hat{Y}$. The residuals fan out, their spread increasing steadily as $\hat{Y}$ increases.]

Three variable transformations are commonly used to correct this pattern. The choice among these transformations depends on how quickly the residual variance increases as a function of the conditional means, $\hat{Y}$. (SPSS computations of all three are sketched after item 3 below.)

1. Square root transformation: $Y' = \sqrt{Y}$

When your dependent variable consists of random observations from a Poisson distribution, this is the transformation usually called for.

a. A Poisson random variable is always a nonnegative integer representing a number of counts over a specified time interval. For example, ...

i. the number of aggressive acts per minute

ii. the number of labor strikes per year

iii. the number of stock purchases per day

b. If some values of Y equal zero (e.g., no aggressive acts for a one-minute period), use the following transformation instead: $Y' = \sqrt{Y+1} + \sqrt{Y}$

c. With any Poisson random variable, the variance of the distribution equals its mean. The square root transformation should therefore be used whenever the variance of your data increases as a linear function of the conditional means of your dependent variable.

2. Logarithmic transformation: $Y' = \log_{10}(Y)$ or $Y' = \ln(Y)$

When your dependent variable has a distribution that is severely skewed to the right (having a few very large values of Y in comparison to many smaller values of Y), this is the appropriate transformation.

a. Some examples are ...

i. annual household income

ii. annual human rights violations per million population

iii. monthly advertising expenditures

b. If some values of Y equal zero (e.g., no income), use the following transformation: $Y' = \log(Y + 1)$

c. Like the square root transformation (and the inverse transformation to be discussed shortly), this transformation often is a corrective for nonnormality as well as for heteroscedasticity, since it reduces the magnitude of extremely large Ys much more than of moderate to small Ys. For example, consider a distribution with a median of 125 but with a range from 3 to 5,000:

             Y        log10(Y)
minimum      3        0.5
median       125      2.1
maximum      5,000    3.7

Notice how the ratio of "the distance from the maximum to the median" to "the distance from the median to the minimum" is reduced from 40:1 to 1:1 with this transformation.

d. The logarithmic transformation is needed whenever the variance of your data increases as a function of the square of the conditional means of your dependent variable. That is, it should be used when the standard deviation increases as a linear function of the conditional means.

3. Inverse transformation: $Y' = 1/Y$

When your dependent variable takes many (but not all) values near zero, taking the inverse of the variable might be advisable.

a. The inverse transformation often involves reconceptualizing the transformed variable's units. For example ...

i. in a sample of census tracts, children per household might be transformed to households per thousand children

ii. in a survey of families, daily vacation expenses might be transformed to the number of vacation days that $500 purchases

b. If some values of Y equal zero (e.g., no children living in a particular census tract), use the following transformation: $Y' = \dfrac{1}{Y + 1}$

c. The inverse transformation is used when the variance of your data increases as a function of the fourth power of the conditional means of the dependent variable. That is, it is used when the standard deviation increases as a function of the square of the conditional means.
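As promised above, a sketch of the corresponding SPSS computations; y is a placeholder for one's dependent variable, ysqrt, ylog, and yinv are names introduced here, and the +1 terms are the zero-value adjustments noted in 1.b, 2.b, and 3.b:

* Square root transformation: variance rises linearly with the conditional mean.
compute ysqrt=sqrt(y+1)+sqrt(y).
* Logarithmic transformation: variance rises with the square of the mean.
compute ylog=lg10(y+1).
* Inverse transformation: variance rises with the fourth power of the mean.
compute yinv=1/(y+1).
execute.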
4. The square root, logarithm, and inverse transformations are all special cases of weighted least squares.

a. Note that by multiplying a random variable, X, times a constant, k, a new variable is created with a mean equal to $k\mu_X$ and a variance equal to $k^2\sigma_X^2$. For example, if X is normally distributed,

$$kX \sim N\!\left(k\mu_X,\; k^2\sigma_X^2\right).$$

When weighted least squares is used, one multiplies the values of a random variable by the inverse of their corresponding conditional standard deviations. Variable transformations work the same way, except that they adjust for the variances' systematic dependence on the value(s) of an independent variable(s) or, as in the examples below, on the conditional mean(s) of the dependent variable. (A worked check of this logic appears at the end of this section.)

b. Consider the case in which Y (possibly a Poisson random variable) has a variance that is directly proportional to its mean. That is, consider ...

$$Y_i \sim \left( a + bX_i,\; [a + bX_i]\,\sigma^2 \right).$$

i. Noting that each observed $Y_i$ is an estimator of $a + bX_i$, it follows that a square root transformation,

$$Y_i \times \frac{1}{\sqrt{a + bX_i}} \approx Y_i \times \frac{1}{\sqrt{Y_i}} = \sqrt{Y_i},$$

would be appropriate.

ii. Consequently, ...

$$\sqrt{Y_i} \sim \left( \sqrt{a + bX_i},\; \sigma^2 \right).$$

iii. IMPORTANT: Notice how multiplying through by $\dfrac{1}{\sqrt{a + bX_i}}$ serves to "cancel out" the dependence of the variance of $Y_i$ on $a + bX_i$, through a process much like that of multiplying a random variable by a constant.

c. Now consider a random variable, Y, with variance that is directly proportional to the square of its mean. That is to say, ...

$$Y_i \sim \left( a + bX_i,\; [a + bX_i]^2\sigma^2 \right).$$

i. As before, note that each observed $Y_i$ is an estimator of $a + bX_i$, such that ...

$$Y_i \times \frac{1}{a + bX_i} \approx Y_i \times \frac{1}{Y_i} = 1.$$

ii. Although this may appear to be a dead end, it is not. Note that the observed $Y_i$ almost surely do not exactly equal their conditional means, $a + bX_i$. Instead,

$$\frac{Y_i}{a + bX_i} = Y_i^{\lambda}, \quad \text{where } \lambda \approx 0.$$

iii. Following Draper & Smith (1981, p. 289), "(I)f we take a small positive or negative power of Y (say, $Y^{.01}$ or $Y^{-.01}$), it will plot against log Y very nearly as a straight line, and linearity will be more and more nearly achieved the smaller the power we take." That is, the closer the observed values of your dependent variable (i.e., the $Y_i$) are to their true conditional means (i.e., their means conditional on the corresponding values of $X_i$), the closer $\log(Y_i)$ comes to comprising a perfect linear mapping of $\dfrac{Y_i}{a + bX_i}$. For the purpose of regression, measures are interchangeable whenever they (like feet and inches) comprise a perfect linear mapping of each other.

iv. Accordingly, in this case ...

$$\log(Y_i) \sim \left( \log[a + bX_i],\; \sigma^2 \right).$$

d. Finally, consider a random variable, Y, with variance that is directly proportional to the fourth power of its mean. In other words, ...

$$Y_i \sim \left( a + bX_i,\; [a + bX_i]^4\sigma^2 \right).$$

i. Once again, note that each observed $Y_i$ is an estimator of $a + bX_i$, such that ...

$$Y_i \times \frac{1}{(a + bX_i)^2} \approx Y_i \times \frac{1}{Y_i^2} = \frac{1}{Y_i}.$$

ii. The distribution of this inverse transformation will have the following mean and variance:

$$\frac{1}{Y_i} \sim \left( \frac{1}{a + bX_i},\; \sigma^2 \right).$$
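The promised worked check: using the constant-multiplication rule from item a, with the weight $k_i = 1/\sqrt{a + bX_i}$ treated as approximately constant for the $i$th case (a sketch in the notation above):

$$\operatorname{Var}(k_i Y_i) = k_i^2 \operatorname{Var}(Y_i) = \frac{1}{a + bX_i} \times [a + bX_i]\,\sigma^2 = \sigma^2.$$

The same arithmetic with $k_i = 1/(a + bX_i)$ cancels the $[a + bX_i]^2$ factor in case c, and with $k_i = 1/(a + bX_i)^2$ it cancels the $[a + bX_i]^4$ factor in case d, leaving a constant variance of $\sigma^2$ in each instance.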
E. Some references:

Box, George E.P., and Norman R. Draper. 1987. Empirical Model Building and Response Surfaces. New York: Wiley.

Draper, N.R., and H. Smith. 1981. Applied Regression Analysis, 2nd ed. New York: Wiley. See pp. 237-241.

Neter, John, and William Wasserman. 1974. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Design. Homewood, IL: Irwin. See pp. 131-136, 506-508.

Neter, John, William Wasserman, and Michael H. Kutner. 1985. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Design, 2nd ed. Homewood, IL: Irwin. See pp. 134-141, 615-617.

Weisberg, Sanford. 1980. Applied Linear Regression. New York: Wiley. See pp. 122-131. In the 2nd edition (published in 1985) see pp. 133-146.

Wonnacott, Ronald J., and Thomas H. Wonnacott. 1970. Econometrics. New York: Wiley. See pp. 132-136.