SOME SIMPLIFICATIONS FOR THE EXPECTATION-MAXIMIZATION (EM) ALGORITHM

by

Daniel A. Griffith
Ashbel Smith Professor, School of Social Sciences
University of Texas @ Dallas
ABSTRACT. The EM algorithm is a generic tool that offers maximum likelihood solutions when data sets are incomplete with data values missing at random or completely at random. At least for its simplest form, the algorithm can be rewritten in terms of an ANCOVA regression specification. This formulation allows several analytical results to be derived that permit the EM algorithm solution to be expressed in terms of new observation predictions and their variances. Implementations can be made with a linear regression, with a nonlinear regression, and with a generalized linear model routine, allowing missing value imputations, even when they must satisfy constraints or involve dependent observations. Thirteen example data sets gleaned from the EM algorithm literature are reanalyzed, and supplemented with analyses of five additional data sets that are accessible via the internet. Imputation results have been verified with SAS PROC MI, coupled with log-normal approximations when necessary, especially for extensions to generalized linear model descriptions of non-normal data. An appendix presents some relevant, novel normality transformation and back-transformation results. Extensions to spatially correlated data link the EM algorithm solution with spatial autoregression and geostatistical kriging. Four theorems are proved, two corollaries are derived, and three conjectures are posited that broadly contextualize imputation findings in terms of the theory, methodology, and practice of statistical science.
KEY WORDS: EM algorithm, missing value, imputation, binomial regression,
Poisson regression, spatial autocorrelation
1. MOTIVATION
The Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), an iterative procedure for computing maximum likelihood estimates (MLEs) when data sets are incomplete, with data values missing at random (MAR) or missing completely at random (MCAR), is a useful device for helping to solve a wide range of model-based estimation problems. Flury and Zoppè (2000, p. 209) emphasize that "it can not be stressed enough that the E-step does not simply involve replacing missing data by their conditional expectations" (although this is true for many important applications of the algorithm). But frequently model-based estimation problems desire just this type of imputation output from the algorithm. Furthermore, in certain situations focusing on imputation dramatically simplifies the EM solution.
More recent descriptions of the EM algorithm may be found in Flury and Zoppè (2000),
Meng (1997), and McLachlan and Krishnan (1997), among others. The objective of this paper is to present regression solutions that render conditional expectations for missing values in a data set that are equivalent to EM algorithm results. Because the EM procedure requires imputation of the complete-data sufficient statistics, rather than just the individual missing observations, the equivalency discussed here initially derives from an assumption of normality, for which the means and covariances constitute the sufficient statistics. An assumption of normality links ordinary least squares (OLS) and MLE regression results, too; application of the Rao-Blackwell factorization theorem verifies that the means and covariances are sufficient statistics in this situation.
2. BACKGROUND
Yates (1933) shows for analysis of variance (ANOVA) that if each missing observation is replaced by a parameter to be estimated (i.e., the conditional expectation for a missing value), the resulting modified analysis becomes straightforward by treating the estimated missing value as a parameter (i.e., an imputation). Rewriting the ANOVA as an OLS regression problem involves introducing a binary indicator variable for each missing value—the value of 1 denoting the missing value observation in question, and 0 otherwise—with the estimated regression coefficients for these indicator variables being the negative of the missing value estimates. This approach is equivalent to subtracting each observation’s missing value, in turn, from each side of a regression equation. Generalizing this regression formulation to include covariates allows missing values to be estimated with an analysis of covariance (ANCOVA) regression specification, one in fact suggested by Bartlett (1937) and by Rubin (1972). Replacing the arbitrarily assigned value of 1 in each individual observation missing value indicator variable by the value -1 yields estimated regression parameters with the correct sign.
Consider a bivariate set of n observed values, each pair denoted by $(y_i, x_i)$, i = 1, 2, …, n. Suppose only the response variable, Y, contains incomplete data. First, the $n_m$ missing values need to be replaced by 0. Second, $n_m$ 0/-1 indicator variables, $-\mathbf{I}_m$ (m = 1, 2, …, $n_m$), need to be constructed; $\mathbf{I}_m$ contains n-1 0s and a single 1 corresponding to the m-th missing value observation. The minus sign for $-\mathbf{I}_m$ indicates that a -1 actually is entered into each of the $n_m$ indicator variables. Regressing Y on a complete data predictor variable, X—which furnishes the redundant information that is exploited to compute imputations—together with the set of $n_m$ indicator variables constitutes the ANCOVA.
Suppose $\mathbf{Y}_o$ denotes the $n_o$-by-1 ($n_o = n - n_m$) vector of observed response values, and $\mathbf{Y}_m$ denotes the $n_m$-by-1 vector of missing response values. Let $\mathbf{X}_o$ denote the vector of predictor values for the set of observed response values, and $\mathbf{X}_m$ denote the vector of predictor values for the set of missing response values. Further, let $\mathbf{1}$ denote an n-by-1 vector of ones that can be partitioned into $\mathbf{1}_o$, denoting the vector of ones for the set of observed response values, and $\mathbf{1}_m$, denoting the vector of ones for the set of missing response values. Then the ANCOVA specification of the regression model may be written in partitioned matrix form as
$$\begin{bmatrix} \mathbf{Y}_o \\ \mathbf{0}_m \end{bmatrix} = \begin{bmatrix} \mathbf{1}_o \\ \mathbf{1}_m \end{bmatrix}\alpha + \begin{bmatrix} \mathbf{X}_o \\ \mathbf{X}_m \end{bmatrix}\beta + \begin{bmatrix} \mathbf{0}_{om} \\ -\mathbf{I}_{mm} \end{bmatrix}\boldsymbol{\beta}_m + \begin{bmatrix} \boldsymbol{\varepsilon}_o \\ \mathbf{0}_m \end{bmatrix}, \qquad (1)$$

where $\mathbf{0}_j$ (j = o, m) is an $n_j$-by-1 vector of zeroes, $\mathbf{0}_{om}$ is an $n_o$-by-$n_m$ matrix of zeroes, $\alpha$ and $\beta$ respectively are the bivariate intercept and slope regression parameters, $\boldsymbol{\beta}_m$ is an $n_m$-by-1 vector of conditional expectation regression parameters, $\mathbf{I}_{mm}$ is an $n_m$-by-$n_m$ identity matrix, and $\boldsymbol{\varepsilon}_o$ is an $n_o$-by-1 vector of random error terms. The bivariate OLS regression coefficient estimates, a and b, of $\alpha$ and $\beta$, respectively, for this ANCOVA specification are given by
$$\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} n_o & \mathbf{1}_o^{\mathrm{T}}\mathbf{X}_o \\ \mathbf{X}_o^{\mathrm{T}}\mathbf{1}_o & \mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{1}_o^{\mathrm{T}}\mathbf{Y}_o \\ \mathbf{X}_o^{\mathrm{T}}\mathbf{Y}_o \end{bmatrix}, \qquad (2)$$

where T denotes matrix transpose; these are the regression results for the observed data only. In addition, the regression coefficients, $\mathbf{b}_m$, for the indicator variables are given by

$$\mathbf{b}_m = a\mathbf{1}_m + b\mathbf{X}_m = \hat{\mathbf{Y}}_m, \qquad (3)$$

which is the vector of point estimates for additional observations (i.e., the prediction of new observations) that should have X values within the interval defined by the extreme values contained in the vector $\mathbf{X}_o$. This is a standard OLS regression result, as is the prediction error that can be attached to it (see, for example, Montgomery and Peck, 1982, pp. 31-33). In addition, the values here are positive because the $-\mathbf{I}_m$ indicator variables contain negative ones.
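The ANCOVA device of equations (1)-(3) is easy to verify numerically. The following minimal Python sketch, on hypothetical simulated data (none of it from the examples analyzed below), fills missing responses with 0, appends one 0/-1 indicator column per missing value, and confirms that the indicator coefficients equal the new-observation predictions while (a, b) match the observed-data-only OLS fit:

```python
import numpy as np

# Hypothetical bivariate data with three response values treated as missing.
rng = np.random.default_rng(1)
n = 20
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
miss = np.array([4, 11, 17])            # indices of the missing y values
obs = np.setdiff1d(np.arange(n), miss)

y_work = y.copy()
y_work[miss] = 0.0                      # replace missing values with 0

I = np.zeros((n, miss.size))            # one 0/-1 indicator per missing value
I[miss, np.arange(miss.size)] = -1.0

Z = np.column_stack([np.ones(n), x, I]) # ANCOVA design: 1, X, -I_m columns
coef, *_ = np.linalg.lstsq(Z, y_work, rcond=None)
a, b, b_m = coef[0], coef[1], coef[2:]

ab_obs, *_ = np.linalg.lstsq(np.column_stack([np.ones(obs.size), x[obs]]),
                             y[obs], rcond=None)
print(np.allclose(b_m, a + b * x[miss]))  # True: equation (3)
print(np.allclose([a, b], ab_obs))        # True: equation (2)
```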
Dodge (1985, p. 159) cautions that the OLS equivalency highlighted here rests on the existence of estimable equations, which in some instances means that the ANCOVA solution is appropriate only when the number of missing values is not excessive. If enough observations are missing, the number of degrees of freedom can become zero or negative, the matrix

$$\begin{bmatrix} n_o & \mathbf{1}_o^{\mathrm{T}}\mathbf{X}_o \\ \mathbf{X}_o^{\mathrm{T}}\mathbf{1}_o & \mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o \end{bmatrix}$$

can become singular, and as such not all of the parametric functions would be estimable. But in selected instances standard regression routines can be tricked into still computing results.
2.1. An iterative example
Consider the example in McLachlan and Krishnan (1997, pp. 49-51) for which iterative sufficient statistics results are reported. Bivariate linear regression theory states that

$$b = \frac{\hat{\sigma}_{XY}}{\hat{\sigma}_X^2} \quad \text{and} \quad a = \hat{\mu}_Y - b\hat{\mu}_X.$$

Computing a and b from the tabulated iterative numerical values reported in McLachlan and Krishnan renders the corresponding results appearing in Table 1.
The ANCOVA form of the linear regression problem can be formulated as the following nonlinear regression problem, where $\tau$ is the iteration counter:

Step 1: initialize $a_{\tau=0}$ and $b_{\tau=0}$
Step 2: compute $\hat{\mathbf{Y}}_{m,\tau} = a_\tau\mathbf{1}_m + b_\tau\mathbf{X}_m$, and hence $\mathbf{Y}_\tau = \begin{bmatrix} \mathbf{Y}_o \\ \hat{\mathbf{Y}}_{m,\tau} \end{bmatrix}$
Step 3: regress $\mathbf{Y}_\tau$ on X to obtain iteration estimates $a_{\tau+1}$ and $b_{\tau+1}$
Step 4: repeat Steps 2 and 3 until $a_{\tau+1}$ and $b_{\tau+1}$, and hence $\hat{\mathbf{Y}}_{m,\tau}$, converge.

This iterative procedure can be implemented with a nonlinear regression routine (e.g., SAS PROC NLIN). Beginning with the same initial values (i.e., iteration 0) used by McLachlan and Krishnan for their problem renders the corresponding results appearing in Table 1.
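Under stated assumptions (a single covariate, hypothetical simulated data), Steps 1-4 can be sketched directly in Python; SAS PROC NLIN is what the paper actually uses, so this loop only illustrates the fixed-point logic:

```python
import numpy as np

# Hypothetical data; three y values are treated as missing.
rng = np.random.default_rng(1)
n = 20
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
miss = np.array([4, 11, 17])

a_t, b_t = 0.0, 1.0                      # Step 1: arbitrary initial values
X1 = np.column_stack([np.ones(n), x])
for _ in range(100):
    y_fill = y.copy()
    y_fill[miss] = a_t + b_t * x[miss]   # Step 2: current imputations
    (a_new, b_new), *_ = np.linalg.lstsq(X1, y_fill, rcond=None)  # Step 3
    done = abs(a_new - a_t) + abs(b_new - b_t) < 1e-12            # Step 4
    a_t, b_t = a_new, b_new
    if done:
        break
print(a_t, b_t)  # fixed point: the observed-data-only OLS coefficients
```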
Two points merit discussion here. First, if $b_{\tau=0} = b_o$ and $a_{\tau=0} = a_o$, respectively the slope and the intercept estimates obtained by regressing only the observed values of Y on their corresponding values of X, then a nonlinear regression routine instantly converges (i.e., for $\tau = 0$) with $a_{\tau=0}$ and $b_{\tau=0}$. Second, the nonlinear regression routine converges faster than the conventional EM algorithm. These findings suggest the following theorem:
THEOREM 1. When missing values occur only in a response variable, Y, then the iterative solution to the EM algorithm produces the regression coefficients calculated with only the complete data.
PF: Let $\mathbf{b}$ denote the vector of regression coefficients that is converged upon. Then if $\hat{\mathbf{Y}}_m = \mathbf{X}_m\mathbf{b}$,

$$\mathbf{b} = \left( \begin{bmatrix} \mathbf{X}_o \\ \mathbf{X}_m \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} \mathbf{X}_o \\ \mathbf{X}_m \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{X}_o \\ \mathbf{X}_m \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} \mathbf{Y}_o \\ \mathbf{X}_m\mathbf{b} \end{bmatrix} = (\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o + \mathbf{X}_m^{\mathrm{T}}\mathbf{X}_m)^{-1}(\mathbf{X}_o^{\mathrm{T}}\mathbf{Y}_o + \mathbf{X}_m^{\mathrm{T}}\mathbf{X}_m\mathbf{b}),$$

so that $(\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o)\mathbf{b} = \mathbf{X}_o^{\mathrm{T}}\mathbf{Y}_o$, and hence

$$\mathbf{b} = (\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o)^{-1}\mathbf{X}_o^{\mathrm{T}}\mathbf{Y}_o = \mathbf{b}_o.$$

$\mathbf{b}$ is equivalent to the vector of known data regression coefficients, which yields the point estimate prediction equation for a new observation.
This equivalency of regression coefficients is demonstrated in Figures 1a and 1b, for which 10,000 multiple imputations were computed using SAS PROC MI (see Horton and Lipsitz, 2001) for the set of examples appearing in Table 2. This finding furnishes a simplified version of the EM algorithm in this case. In addition, for the example presented by Little and Rubin (1987, p. 100), for which explicit MLE formulae exist and that is analyzed in McLachlan and Krishnan (1997, p. 49), $\hat{y}_{2,9} = 12.53711$ and $\hat{y}_{2,10} = 15.61523$, rendering

$$\hat{\mu}_2 = \left( \sum_{i=1}^{8} w_{i,2} + \hat{y}_{2,9} + \hat{y}_{2,10} \right) \Big/ 10 = 14.61523,$$

which is equivalent to the EM algorithm result. Coupled with the complete-data sufficient statistics

$$s_{11} = \sum_{i=1}^{8}\left(w_{i,1} - \tfrac{108}{8}\right)^2 \Big/ 8, \quad s_{22} = \sum_{i=1}^{8}\left(w_{i,2} - \tfrac{119}{8}\right)^2 \Big/ 8, \quad \text{and} \quad s_{12} = \sum_{i=1}^{8}\left(w_{i,1} - \tfrac{108}{8}\right)\left(w_{i,2} - \tfrac{119}{8}\right) \Big/ 8,$$

this yields

$$\hat{\sigma}_{22} = 28.85937 + (199.5/384)^2(402/10 - 384/8) = 26.75405,$$

and

$$\hat{\sigma}_{12} = \left[ \sum_{i=1}^{8}(w_{i,1} - 13)(w_{i,2} - 14.61523) + (9 - 13)(12.53711 - 14.61523) + (13 - 13)(15.61523 - 14.61523) \right] \Big/ 10 = 20.88516.$$

These calculations are those reported in Table 2, the values upon which the EM algorithm iteratively converges (see McLachlan and Krishnan, 1997, Table 2.1, p. 50).
2.2. Simplicity from equation (1)
Consider a selection of the example data sets contained in Montgomery and Peck (1982), Little and Rubin (1987), McLachlan and Krishnan (1997), and Schafer (1997). Comparative results appear in Table 2. Results obtained with the ANCOVA procedure outlined in this section are equivalent in all bivariate cases. In addition, as is illustrated in Table 2, associated calculations can be obtained from the ANCOVA results (see McLachlan and Krishnan, p. 49), and the basics for multiple imputation calculations can be established with the ANCOVA results (see Schafer, p. 195).
The regression result stated in Theorem 1 leads to the following consequence derived from equation (2):

THEOREM 2. When missing values occur only in a response variable, Y, then by replacing the missing values with zeroes and introducing a binary 0/-1 indicator variable covariate $-\mathbf{I}_m$ for each missing value m, such that $\mathbf{I}_m$ is 0 for all but missing value observation m and 1 for missing value observation m, the estimated regression coefficient $b_m$ is equivalent to the point estimate for a new observation, and as such furnishes EM algorithm imputations.
PF: Let $\mathbf{b}_m$ denote the vector of regression coefficients for the missing values, and partition the data matrices such that

$$\begin{bmatrix} \mathbf{b}_o \\ \mathbf{b}_m \end{bmatrix} = \left( \begin{bmatrix} \mathbf{X}_o & \mathbf{0}_{om} \\ \mathbf{X}_m & -\mathbf{I}_{mm} \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} \mathbf{X}_o & \mathbf{0}_{om} \\ \mathbf{X}_m & -\mathbf{I}_{mm} \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{X}_o & \mathbf{0}_{om} \\ \mathbf{X}_m & -\mathbf{I}_{mm} \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} \mathbf{Y}_o \\ \mathbf{0}_m \end{bmatrix}.$$

Solving the resulting partitioned normal equations yields

$$\mathbf{b}_o = (\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o)^{-1}\mathbf{X}_o^{\mathrm{T}}\mathbf{Y}_o, \quad \text{and} \quad \mathbf{b}_m = \mathbf{X}_m\mathbf{b}_o,$$

where $\mathbf{I}_{mm}$ is an $n_m$-by-$n_m$ identity matrix and $\mathbf{0}_{om}$ is an $n_o$-by-$n_m$ null matrix.
Therefore, $\mathbf{b}_m$ is the point estimate prediction equation for new observations (see Montgomery and Peck, 1982, p. 141). Besides simplicity, one advantage of this result is that the correct number of degrees of freedom is produced by the regression formulation, which actually is an ANCOVA specification. The imputation equivalency is demonstrated in Figure 1c, for which 10,000 multiple imputations were computed using SAS PROC MI for each data set in the collection of examples appearing in Table 2.
Part of the issue emphasized in arguments such as those by Flury and Zoppè (2000), and promoted by the multiple imputation perspective (see Schafer, 1997), is that an imputation is of less value when its variance is unknown. Fortunately, the regression result stated in Theorem 2 leads to the following consequence:
THEOREM 3. For imputations computed based upon Theorem 2, each standard error of the estimated regression coefficients $\mathbf{b}_m$ is equivalent to the conventional standard deviation used to construct a prediction interval for a new observation, and as such furnishes the corresponding EM algorithm imputation standard error.
PF: Let the subscript diag denote diagonal entries, and $\mathbf{s}_{\mathbf{b}_m}$ denote the $n_m$-by-1 vector of imputation standard errors. From standard OLS regression theory, the diagonal elements of the following partitioned matrix furnish the variances of the imputations calculated as regression coefficients:

$$\left( \begin{bmatrix} \mathbf{X}_o & \mathbf{0}_{om} \\ \mathbf{X}_m & -\mathbf{I}_{mm} \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} \mathbf{X}_o & \mathbf{0}_{om} \\ \mathbf{X}_m & -\mathbf{I}_{mm} \end{bmatrix} \right)^{-1} \hat{\sigma}_\varepsilon^2 = \begin{bmatrix} (\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o)^{-1} & (\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o)^{-1}\mathbf{X}_m^{\mathrm{T}} \\ \mathbf{X}_m(\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o)^{-1} & \mathbf{I}_{mm} + \mathbf{X}_m(\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o)^{-1}\mathbf{X}_m^{\mathrm{T}} \end{bmatrix} \hat{\sigma}_\varepsilon^2,$$

so that

$$\mathbf{s}_{\mathbf{b}_m} = \sqrt{ \left[ \mathbf{I}_{mm} + \mathbf{X}_m(\mathbf{X}_o^{\mathrm{T}}\mathbf{X}_o)^{-1}\mathbf{X}_m^{\mathrm{T}} \right]_{\mathrm{diag}} \hat{\sigma}_\varepsilon^2 }.$$

Therefore, $\mathbf{s}_{\mathbf{b}_m}$ is the vector of standard errors for the point estimate prediction equation for new observations (see Montgomery and Peck, 1982, p. 141). The imputation equivalency is demonstrated in Figure 1d, for which 10,000 multiple imputations were computed using SAS PROC MI for the set of examples appearing in Table 2.
Theorems 2 and 3 provide the analytical mean and variance for an EM algorithm imputation when data are missing only in Y. Multiple imputations can be obtained by sampling with a normal probability model whose mean and variance are given by these two theorems.
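A short sketch of that sampling recipe, again on hypothetical simulated data, computes the Theorem 2 means and Theorem 3 variances and then draws normal multiple imputations from them:

```python
import numpy as np

# Hypothetical data; the rows indexed by miss play the role of X_m.
rng = np.random.default_rng(1)
n = 20
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
miss = np.array([4, 11, 17])
obs = np.setdiff1d(np.arange(n), miss)

X_o = np.column_stack([np.ones(obs.size), x[obs]])
X_m = np.column_stack([np.ones(miss.size), x[miss]])
b_o, *_ = np.linalg.lstsq(X_o, y[obs], rcond=None)
resid = y[obs] - X_o @ b_o
sigma2 = resid @ resid / (obs.size - 2)   # error-variance estimate

mean_m = X_m @ b_o                                                # Theorem 2
var_m = sigma2 * (1 + np.diag(X_m @ np.linalg.inv(X_o.T @ X_o) @ X_m.T))  # Theorem 3

draws = rng.normal(mean_m, np.sqrt(var_m), size=(10000, miss.size))
print(draws.mean(axis=0))   # approaches mean_m
print(draws.std(axis=0))    # approaches sqrt(var_m)
```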
2.3. Implications
The preceding theorems demonstrate that the EM algorithm reduces to an OLS linear regression analysis, when a normal probability model can be attached to the error term, whose model specification is as follows:

$$\begin{bmatrix} \mathbf{Y}_o \\ \mathbf{0}_m \end{bmatrix} = \mathbf{1}\alpha + \mathbf{X}\boldsymbol{\beta} - \sum_{m=1}^{M} y_m\mathbf{I}_m + \begin{bmatrix} \boldsymbol{\varepsilon}_o \\ \mathbf{0}_m \end{bmatrix}, \qquad (4)$$

where the intercept term is separated from the attribute covariates as in equation (2). The bivariate linear regression equation comparing equation (4) and assembled EM imputation results reported in Table 2 is as follows:

$$\text{reported} = \underset{(t = 0.13)}{0.15609} + \underset{(t = 1.66)}{1.01245}\,[\text{equation (1) results}] + e,$$

which has an associated $R^2$ value of 0.998. In other words, the null hypotheses of $\alpha = 0$ and $\beta = 1$ are not rejected. Slight discrepancies detected here are attributable to noise contained in the nine multiple imputation results reported by Schafer (1997, p. 195).
3. MISSING VALUES IN BOTH X AND Y: THE MULTIVARIATE NORMAL CASE
One appealing feature of the ANCOVA specification is that it can be generalized for implementation with nonlinear least squares (NLS) procedures. Rather than a bivariate situation having missing values in only the response variable, consider a bivariate situation with missing values in both variables, and then a multiple variable situation with missing values in several of three or more variables.
Suppose a bivariate case has P values missing for variable Y, and Q values missing for variable X, but each observation has at least a $y_i$ or an $x_i$ measure. Furthermore, those observations having complete $(y_i, x_i)$ measures must furnish information about the relationship between X and Y. This is the situation for the example presented in Schafer (1997, p. 54).
3.1. A near-linear regression solution
Equation (4) can be modified by introducing additional indicator variables for the missing X values, yielding for a bivariate situation

$$\begin{bmatrix} \mathbf{Y}_{o,o} \\ \mathbf{Y}_{x,m} \\ \mathbf{0}_{y,m} \end{bmatrix} = \begin{bmatrix} \mathbf{1}_{o,o} \\ \mathbf{1}_{x,m} \\ \mathbf{1}_{y,m} \end{bmatrix}\alpha + \begin{bmatrix} \mathbf{X}_{o,o} \\ \mathbf{0}_{x,m} \\ \mathbf{X}_{y,m} \end{bmatrix}\beta + \beta\sum_{m=1}^{Q} x_m \begin{bmatrix} \mathbf{0}_{o,o} \\ \mathbf{I}_{x,m} \\ \mathbf{0}_{y,m} \end{bmatrix} - \sum_{m=1}^{P} y_m \begin{bmatrix} \mathbf{0}_{o,o} \\ \mathbf{0}_{x,m} \\ \mathbf{I}_{y,m} \end{bmatrix} + \boldsymbol{\varepsilon}, \qquad (5)$$

where the subscript "o,o" denotes data that are complete in both X and Y, "x,m" denotes data that are incomplete in X and complete in Y, "y,m" denotes data that are incomplete in Y and complete in X, and the vectors of missing values to be estimated are $\mathbf{x}_m$ and $\mathbf{y}_m$. Of note here is that although $\mathbf{y}_m$ has a negative sign (it has been subtracted from both sides of the equation), $\mathbf{x}_m$ does not (it is being added because it has been removed from vector $\mathbf{X}$). Furthermore, equation (5) does not suffer from the usual drawbacks of conventional dummy variable adjustment (see Allison, 2002, pp. 9-11), because equation (5) contains a set of Q dummy variables, one for each missing value $x_m$, rather than a single dummy variable with Q ones and (n - Q) zeroes.

Even though equation (5) basically requires an OLS solution, it still needs to be estimated with a NLS routine (e.g., SAS PROC NLIN) because of the interaction between the $x_m$s and $\beta$. Imputation calculations with this equation render the results reported in Table 2. But one serious drawback concerns degrees of freedom; in the example presented by Schafer (1997, p. 54), the number of degrees of freedom already is 0.
3.2. Concatenation to obtain a solution
An ANCOVA specification can be written for each variable, regressing Y on X in one instance
(denoted Y|X), and regressing X on Y in a second instance (denoted X|Y). Given equation (5), the two specifications may be written as follows:
$$\begin{bmatrix} \mathbf{Y}_{o,o} \\ \mathbf{Y}_{x,m} \\ \mathbf{0}_{y,m} \end{bmatrix} = \begin{bmatrix} \mathbf{1}_{o,o} \\ \mathbf{1}_{x,m} \\ \mathbf{1}_{y,m} \end{bmatrix}\alpha_{Y|X} + \begin{bmatrix} \mathbf{X}_{o,o} \\ \mathbf{0}_{x,m} \\ \mathbf{X}_{y,m} \end{bmatrix}\beta_{Y|X} + \beta_{Y|X}\sum_{m=1}^{Q} x_m \begin{bmatrix} \mathbf{0}_{o,o} \\ \mathbf{I}_{x,m} \\ \mathbf{0}_{y,m} \end{bmatrix} - \sum_{m=1}^{P} y_m \begin{bmatrix} \mathbf{0}_{o,o} \\ \mathbf{0}_{x,m} \\ \mathbf{I}_{y,m} \end{bmatrix} + \boldsymbol{\varepsilon}_{Y|X}, \ \text{and} \qquad (6)$$

$$\begin{bmatrix} \mathbf{X}_{o,o} \\ \mathbf{0}_{x,m} \\ \mathbf{X}_{y,m} \end{bmatrix} = \begin{bmatrix} \mathbf{1}_{o,o} \\ \mathbf{1}_{x,m} \\ \mathbf{1}_{y,m} \end{bmatrix}\alpha_{X|Y} + \begin{bmatrix} \mathbf{Y}_{o,o} \\ \mathbf{Y}_{x,m} \\ \mathbf{0}_{y,m} \end{bmatrix}\beta_{X|Y} + \beta_{X|Y}\sum_{m=1}^{P} y_m \begin{bmatrix} \mathbf{0}_{o,o} \\ \mathbf{0}_{x,m} \\ \mathbf{I}_{y,m} \end{bmatrix} - \sum_{m=1}^{Q} x_m \begin{bmatrix} \mathbf{0}_{o,o} \\ \mathbf{I}_{x,m} \\ \mathbf{0}_{y,m} \end{bmatrix} + \boldsymbol{\varepsilon}_{X|Y}, \qquad (7)$$

where missingness is indexed by whether it occurs in X or Y, $\mathbf{I}_{x,m}$ and $\mathbf{I}_{y,m}$ respectively denote Q-by-Q and P-by-P identity matrices, and the intercept and slope parameters and error terms are subscripted by Y|X if they are for a regression of Y on X, and by X|Y if they are for a regression of X on Y. Resulting imputations are governed by the relationship between $\mathbf{X}_o$ and $\mathbf{Y}_o$. This formulation can be extended to any number of variables without loss of generality.
Data organized according to the appearance of terms in equations (6) and (7) can be concatenated for a simultaneous estimation of the parameters and missing values. This concatenation requires the creation and inclusion of two additional 0/1 indicator variables, one for each of the two equations. Once again ANCOVA as regression guides the specification. The resulting supra-ANCOVA may be written as

$$\begin{bmatrix} \mathbf{Y}_{y,o} \\ \mathbf{0}_{y,m} \\ \mathbf{X}_{x,o} \\ \mathbf{0}_{x,m} \end{bmatrix} = \mathbf{eq}_1 \otimes \left( \mathbf{1}\alpha_{Y|X} + \begin{bmatrix} \mathbf{X}_{x,o} \\ \mathbf{0}_{x,m} \end{bmatrix}\beta_{Y|X} + \beta_{Y|X}\begin{bmatrix} \mathbf{0}_{x,o} \\ \mathbf{I}_{x,m} \end{bmatrix}\mathbf{x}_m - \begin{bmatrix} \mathbf{0}_{y,o} \\ \mathbf{I}_{y,m} \end{bmatrix}\mathbf{y}_m \right) + \mathbf{eq}_2 \otimes \left( \mathbf{1}\alpha_{X|Y} + \begin{bmatrix} \mathbf{Y}_{y,o} \\ \mathbf{0}_{y,m} \end{bmatrix}\beta_{X|Y} + \beta_{X|Y}\begin{bmatrix} \mathbf{0}_{y,o} \\ \mathbf{I}_{y,m} \end{bmatrix}\mathbf{y}_m - \begin{bmatrix} \mathbf{0}_{x,o} \\ \mathbf{I}_{x,m} \end{bmatrix}\mathbf{x}_m \right) + \begin{bmatrix} \boldsymbol{\varepsilon}_{y,o} \\ \mathbf{0}_{y,m} \\ \boldsymbol{\varepsilon}_{x,o} \\ \mathbf{0}_{x,m} \end{bmatrix}, \qquad (8)$$

where $\mathbf{eq}_k$ (k = 1, 2) denotes the two 2-by-1 binary 0/1 equation indicator variables, and $\otimes$ denotes the Kronecker product. Estimation of this supra-equation requires iterative calculations, and can be executed efficiently with a NLS routine. Of note is that the indicator variables representing missing values in each variable include 0/1 values when the variable appears on the right-hand side of an equation, and 0/-1 values when the same variable originally appears on the left-hand side, but has been subtracted from both sides, of an equation.

Inspection of Table 2 reveals that this formulation of the problem solution renders exactly the results obtained by Schafer (1997, p. 54). By use of concatenation, the NLS regression routine is tricked into thinking that there are twice as many degrees of freedom as actually exist. Nevertheless, the estimate is an exact result (e.g., the error sum of squares is 0) because only two of ten observations have complete data. Furthermore, the correct solution for the difficult situation of a saddle point also is rendered by this formulation (see McLachlan and Krishnan, 1997, p. 91), which can be used to identify maxima diverged to by a small perturbation in the starting correlation value, too. In this latter instance, the ANCOVA result deviates from the reported maxima result.
Next, consider a multiple variable case in which P values are missing for variable $X_1$, Q values are missing for variable $X_2$, and so on, but each observation has at least a measure on one of the variables. This is the situation for the example presented in Little and Rubin (1987, p. 118), which includes five variables, with three having missing values. An ANCOVA specification can be written for each variable having missing values, regressing each X, in turn, on the remaining variables. As before, these sets of regression equations can be concatenated for simultaneous estimation using a NLS routine. Now there is a set of indicator variables for each variable having missing values. In the Little and Rubin example, three sets of indicator variables need to be created, and three sets of data need to be concatenated. Results from this super-ANCOVA appear in Table 2, and agree with reported results.^1
3.3. Variance estimates
Estimates of variances for the multivariate normal case where data are missing in more than one variable are less straightforward. The concatenated nonlinear regression standard errors do not agree with those produced by multiple imputation using Markov chain Monte Carlo (MCMC) techniques. The Little and Rubin (1987, p. 118) example contains five variables, with the same four observations having missing values on the same three of these variables, and with one of these three variables also containing an additional three missing values; out of 65 possible data values, 50 are observed and 15 are missing. Both regression and multiple imputation results appear in Table 3. For this situation, the multiple imputation standard errors are a function of, in the following order (from a stepwise regression selection): the number of missing values for a given observation, the variance of the observed values for a given variable, and the imputed values. These three variates account for roughly 90% of the variance displayed by the multiple imputation standard errors.
Variance estimates were further explored by analyzing the SAS illustrative data set that is furnished with discussion of PROC MI (SAS, 1999, p. 133). This example contains three variables—$X_1$, $X_2$, and $X_3$—with two of the same observations having missing values for $X_1$ and $X_3$, three of the same observations having missing values for $X_2$ and $X_3$, and the remaining missing values occurring only in $X_3$; out of 93 possible values, 78 are observed and 15 are missing. Both regression and multiple imputation results appear in Table 4. For this situation, the multiple imputation standard errors are a function of, in the following order (from a stepwise regression selection): the imputed values, and the number of missing values for a given observation. These two variates account for nearly all of the variance displayed by the multiple imputation standard errors.

^1 There may well be a typographical error for the estimated mean of $X_4$, which is calculated here as 27.037 but reported as 27.047.
These exploratory analyses indicate that analytical variances for the multiple imputations may well be available, but possibly are a function of the percentage and pattern of missing values across a battery of variables.
4. EXTENSIONS TO SELECTED CONSTRAINED SITUATIONS
The EM algorithm is applied to compute many quantities besides missing values, including unknown variance components and latent variables. In these other contexts, an analyst may overlook important properties of the missing value estimates. One possible situation is the example presented by Little and Rubin (1987, p. 118), in which the estimated conditional mean for missing value $x_{1,11}$ is -0.5. In some contexts values cannot be negative; this is true for the Little and Rubin example, too, since each of its first four variables is measured as a percentage of weight (see Draper and Smith, 1966, p. 365). This situation differs from including a minimum value for imputation purposes when executing SAS PROC MI.
By performing the ANCOVA estimation using NLS, constraints can be easily attached to individual parameters. The Little and Rubin (1987, p. 43) trivariate example furnishes a relatively simple illustration of this point. These data display certain features that suggest the four missing data values in each variable should have a mean of 2.5 and sum to 10. Imposing these two constraints results in estimates of approximately 2.5 for each of the missing values for variable $Y_1$, the set of values {2.46, 2.49, 2.51, 2.54} for variable $Y_2$, and the set of values {2.47, 2.49, 2.51, 2.53} for variable $Y_3$. These sets of values suggest the complete data pairwise correlations of $r(Y_1, Y_2) = 0.51284$, $r(Y_1, Y_3) = 0.51012$, and $r(Y_2, Y_3) = -0.47677$, which appear far more satisfactory than the respective reported values of 1, 1, and -1 produced by an available-case analysis. Next, the previous solution presented for the multiple variables illustration has been modified by restricting the missing value estimate for measure $x_{1,11}$ to be non-negative. By doing so, the resulting estimate becomes 0; this value change is propagated through the covariance structure of the data, resulting in slight modifications to all missing value estimates for observation 11, and even slighter modifications to all missing value estimates for observation 10 (see Table 3).
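Bound constraints of exactly this kind are directly expressible in an NLS routine. A sketch under the same hypothetical setup as the preceding code example (not the actual Little and Rubin data) restricts the missing-x parameters to be non-negative:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
x_full = rng.normal(0, 1, 10)
y_full = 1.0 + 0.8 * x_full + rng.normal(0, 0.2, 10)
miss_y, miss_x = [7, 8], [2, 3]

def residuals(theta):
    a_yx, b_yx, a_xy, b_xy = theta[:4]
    xs, ys = x_full.copy(), y_full.copy()
    xs[miss_x] = theta[4:6]
    ys[miss_y] = theta[6:8]
    return np.concatenate([ys - (a_yx + b_yx * xs),
                           xs - (a_xy + b_xy * ys)])

lo = np.full(8, -np.inf)
lo[4:6] = 0.0                            # non-negativity for the missing x's
fit = least_squares(residuals, x0=np.full(8, 0.1),
                    bounds=(lo, np.full(8, np.inf)))
print(fit.x[4:6])                        # constrained imputations
```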
Of note is that placing a constraint of 0 as a minimum for the Little and Rubin example prevents SAS PROC MI from properly executing. The minimum values specification for this SAS procedure ensures that MCMC-generated values are greater than the specified minimum value; when a simulated value is less than a specified minimum, the simulated sample chain is discarded, and a new simulated sample value is generated. But in a situation like the Little and Rubin example, virtually all simulated values will be less than 0, preventing the MCMC technique from successfully generating values.
5. EXTENSIONS TO NON-NORMAL DATA
Generalized linear modeling techniques allow non-normal data to be analyzed. But multiple imputation has been developed using the normality assumption. Allison (2002, p. 39) reiterates that, fortunately, these imputation methods appear to work well when non-normality prevails.
Not surprisingly, variable transformations to normality should precede imputation calculations
(see Appendix A).
5.1. Poisson random variables
Consider annual rainfall measures (in hundredths of inches) for the set of 88 weather stations covering the islands of Puerto Rico in 1990, of which 58 have complete data and 30 have missing data (see http://www1.ncdc.noaa.gov/pub/data/coop-precip/others.txt). Potential covariates here include latitude, longitude, and elevation, all of which have complete data.
Rainfall has a natural lower bound of 0, tends to be highly skewed, and can be treated as a count (in hundredths of inches). The Box-Cox-type of variable transformation to approximate normality identified for the 1990 Puerto Rican data analyzed here is

$$Y^* = \mathrm{LN}(Y - 2498) = \mathrm{LN}(\text{rainfall} - 24.98\,\text{in}),$$

where LN denotes the natural logarithm. This variable transformation increases the probability of the accompanying Shapiro-Wilk test statistic, under the null hypothesis of a normal distribution, from < 0.0001 for Y to 0.2002 for Y*.
Box-Tidwell linearization transformations result in elevation being re-expressed as

$$X_1^* = \mathrm{LN}(X_1 - 4.99) = \mathrm{LN}(\text{elevation} - 4.99\,\text{ft}),$$

and latitude (in decimal degrees) being re-expressed as

$$X_2^* = \exp(-9.70\,X_2) = \exp(-9.70\,\text{latitude}),$$

where exp denotes the base e (i.e., the anti-logarithm of the natural logarithm). Longitude ($X_3$; in decimal degrees) does not require a transformation.
Using redundant information contained in $X_1^*$, $X_2^*$, and $X_3$ for imputation of Y* values yields the results portrayed in Figures 2a and 2b. A parallel analysis for 1997 data, for which 42 weather stations have missing data, appears in Figures 2c and 2d. These results are consistent with those posited in Theorems 1-3.
Applying a transformation in order to achieve a normal approximation ultimately requires calculating a back-transformation. For this example, the suitable estimates are given by (see Appendix A)

$$\hat{Y}_m = \exp\!\left(Y_m^* + \frac{0.10571}{2}\right) + 2498.$$

The variance of these computed back-transformation values can be calculated from the MCMC-generated multiple imputations. Furthermore, direct imputations can be calculated with a Poisson regression generalized linear model. Whereas the linear model involves subtracting each missing value from both sides of a regression equation, now, because the Poisson model is specified in terms of exponentiation, this particular generalized linear model involves dividing both sides of a Poisson regression equation by each missing value. In other words, the response vector Y has a 1 (rather than a 0) substituted for each missing value, with division yielding a negative sign for each indicator variable moved to the right-hand side of the equation.
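A sketch of this Poisson device with a generic GLM routine follows; the data are hypothetical, and statsmodels stands in for the SAS procedures used in the paper. Setting a missing count to 1 and giving it a -1 indicator forces the fitted mean at that observation to 1, so exp(b_m) reproduces the new-observation prediction:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 40
miss = np.array([10, 25])
x = rng.uniform(0, 2, n)
y = rng.poisson(np.exp(0.5 + 1.2 * x)).astype(float)
y[miss] = 1.0                            # 1, not 0, under the log link

I = np.zeros((n, miss.size))             # 0/-1 indicator columns
I[miss, np.arange(miss.size)] = -1.0
Z = sm.add_constant(np.column_stack([x, I]))

fit = sm.GLM(y, Z, family=sm.families.Poisson()).fit()
b_m = fit.params[-miss.size:]
print(np.exp(b_m))                                       # imputations
print(np.exp(fit.params[0] + fit.params[1] * x[miss]))   # same values
```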
Results for 1990 and 1997 Puerto Rico rainfall measurements by weather station are portrayed in Figure 3. These results suggest the following two conjectures:

CONJECTURE 1. When missing values occur only in a Poisson distributed response variable, Y, then by replacing the missing values with ones and introducing a binary 0/-1 indicator variable covariate $-\mathbf{I}_m$ for each missing value m, such that $\mathbf{I}_m$ is 0 for all but missing value observation m and 1 for missing value observation m, the estimated Poisson regression coefficient $b_m$ furnishes the point estimate $\exp(b_m)$, which is proportional to an EM algorithm imputation.

CONJECTURE 2. When computing imputations based upon Conjecture 1, the standard error (se) of an estimated imputation $y_m$ is proportional to $y_m/\exp(s_{b_m})$, where $s_{b_m}$ is the standard error for the negative binomial regression parameter estimate $b_m$ corresponding to its Poisson regression coefficient $b_m$ counterpart.

The proportionality stated in Conjecture 1 appears to be 1-to-1 when the vector of values $\mathbf{b}_m$ essentially is the same for a Poisson and a negative binomial regression (i.e., relatively little or no overdispersion is present). Conjecture 2 is consistent with the mean-variance relationship of a Poisson random variable. In either case, the negative binomial regression model can be used to account for overdispersion.
Consider the Baltic Sea peracarids diurnal activity data analyzed by Liu and Dey (2004), which contains 31 observed and 4 missing animal counts. The log-counts are strongly related to a quadratic function of the local time at which measurements were recorded. Employing Y* = LN(Y + 0.1) improves the Shapiro-Wilk statistic probability from < 0.0001 to 0.0196; in other words, conspicuous non-normality remains, a possible source of corruption in SAS PROC MI results. The four missing count imputations together with their standard errors are as follows:

observation   negative binomial model   log-normal approximation
 1            2.3 (se = 1.78)           13.0 (se = 39.78)
16            13.0 (se = 6.12)          48.8 (se = 129.74)
24            126.5 (se = 30.23)        290.1 (se = 439.01)
35            2.3 (se = 1.78)           14.2 (se = 55.92)

These latter results suggest that the relationship between generalized linear model ANCOVA solutions and EM normal approximation imputations requires better articulation. Of note is that the negative binomial model results appear more reasonable; impacts of error can explode during the back-transformation process for a normal approximation. In addition, variance computations also are proportional but quite different, in part because of the variance term included in the back-transformation conditional expectation formula for a normal random variable.
5.2. Binomial random variables
The variance term for a binomial random variable is even more problematic, given that the variance is a function of the denominator of the fraction yielding a percentage. Here imputations can be computed by setting p = 0.5 for missing values (e.g., initially setting an unknown numerator to 50% of its denominator), and then using right-hand side indicator variables for missing values.
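Under the same logic, a hedged sketch of the binomial device follows (hypothetical counts and denominators, with statsmodels again standing in for SAS): unknown numerators are set to half their denominators, and the inverse-logit of each indicator coefficient recovers the imputed proportion:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 30
miss = np.array([6, 21])
F = rng.integers(50, 500, n)             # binomial denominators
p_true = 1 / (1 + np.exp(-(-1.0 + 0.3 * np.log(F))))
k = rng.binomial(F, p_true).astype(float)
k[miss] = 0.5 * F[miss]                  # p = 0.5 for the missing values

I = np.zeros((n, miss.size))             # 0/-1 indicator columns
I[miss, np.arange(miss.size)] = -1.0
Z = sm.add_constant(np.column_stack([np.log(F + 0.001), I]))

resp = np.column_stack([k, F - k])       # successes and failures
fit = sm.GLM(resp, Z, family=sm.families.Binomial()).fit()
b_m = fit.params[-miss.size:]
print(np.exp(b_m) / (1 + np.exp(b_m)))   # imputed proportions
```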
Consider the percentage, p, of farm land harvested in Puerto Rico during 2002 (see http://www.nass.usda.gov/Statistics_by_State/Puerto_Rico/index.asp). Four municipalities have missing values, whereas the number of farms, F, contained in each municipality is recorded. The following log-odds ratio variable is approximately normally distributed:

$$\mathrm{LN}\!\left(\frac{p}{1 - p}\right);$$

for the harvest percentage variable studied here, the Shapiro-Wilk statistic probability under the null hypothesis of normality increases from 0.0129 to 0.0564 with this transformation. This log-odds ratio variate is linearly related to LN(F + 0.001) here. The resulting imputations are as follows:

observation   binomial model   log-normal approximation
29            0.23             0.24 (se = 0.111)
49            0.24             0.24 (se = 0.112)
70            0.15             0.17 (se = 0.093)
75            0.15             0.17 (se = 0.093)

The standard errors (se) reported here were calculated by generating 10,000 multiple imputations with SAS PROC MI. The binomial ANCOVA imputation furnishes nearly identical results to those obtained with a normal approximation based upon the log-odds ratio. Again, though, computation of the standard errors is not straightforward.
This example suggests the following conjecture:

CONJECTURE 3. When missing values occur only in a binomially distributed response variable, Y, then by replacing the missing values with probabilities of 0.5 and introducing a binary 0/-1 indicator variable covariate $-\mathbf{I}_m$ for each missing value m, such that $\mathbf{I}_m$ is 0 for all but missing value observation m and 1 for missing value observation m, the estimated binomial regression coefficient $b_m$ furnishes the point estimate $\exp(b_m)/[1 + \exp(b_m)]$, which is equivalent to an EM algorithm imputation.

This result supplements findings reported by Ibrahim and Lipsitz (1996).
Consider state-level surveys of safety belt use in 2003 furnished by the National Highway Traffic Safety Administration (see Traffic Safety Facts: 2004 Data, DOT HS 809 909, from NHTSA's NCSA) with usage percentages by state, the District of Columbia, and Puerto Rico; Maine and Wyoming have no survey results, whereas New Hampshire has no adult seat belt law. A covariate of interest here is the primary and secondary nature of seat belt laws: primary laws are standard and are directly enforceable; secondary laws are enforceable only when another infraction also has occurred. Imputations based upon the generalized linear model are as follows:

observation      binomial model   log-normal approximation: multiple imputations
Maine            73.94%           73.57 (se = 8.756)
New Hampshire    73.94%           73.62 (se = 8.803)
Wyoming          73.94%           73.58 (se = 8.700)

The standard errors (se) reported here were calculated by generating 10,000 multiple imputations with SAS PROC MI. The reported 2004 estimates for Maine and Wyoming respectively are 72.3% and 70.1%, suggesting that these imputations are reasonable; New Hampshire has no 2004 reported result, and a 2003 reported estimate of 49.6% that was obtained from a different source, which is consistent with it being in a category of its own. All three imputations here are the same because none of these three states has a primary seat belt law, whose presence/absence was used to construct a binary covariate. Again, the log-odds ratio normal approximation yields comparable imputation results.
6. EXTENSIONS TO SPATIAL DATA
Haining, Griffith and Bennett (1989) outline a spatial EM algorithm for estimating missing spatial data values. Consider the following n-by-n partitioned spatial covariance matrix, which captures spatial autocorrelation (see, e.g., Cliff and Ord, 1980) effects:

$$\begin{bmatrix} \boldsymbol{\Sigma}_{oo} & \boldsymbol{\Sigma}_{om} \\ \boldsymbol{\Sigma}_{mo} & \boldsymbol{\Sigma}_{mm} \end{bmatrix} = \begin{bmatrix} \mathbf{V}_{oo} & \mathbf{V}_{om} \\ \mathbf{V}_{mo} & \mathbf{V}_{mm} \end{bmatrix}^{-1} \sigma^2, \qquad (9)$$

where, as before, the subscript o denotes observed data, and the subscript m denotes missing data. For a multivariate normal probability model, the MLE for missing data is given by

$$\hat{\mathbf{Y}}_m = \mathbf{X}_m\boldsymbol{\beta} + \boldsymbol{\Sigma}_{mo}\boldsymbol{\Sigma}_{oo}^{-1}(\mathbf{Y}_o - \mathbf{X}_o\boldsymbol{\beta}), \qquad (10)$$

which is the kriging equation of geostatistics (see Christensen, 1991, p. 268; Griffith, 1993). Using the preceding notation, Haining, Griffith and Bennett show that for an autoregressive model specification, equation (10) becomes

$$\hat{\mathbf{Y}}_m = \mathbf{X}_m\boldsymbol{\beta} - \mathbf{V}_{mm}^{-1}\mathbf{V}_{mo}(\mathbf{Y}_o - \mathbf{X}_o\boldsymbol{\beta}), \qquad (11)$$

which reduces to the following equation for the conditional autoregression (CAR) model specification based upon a binary 0/1 geographic connectivity matrix $\mathbf{C}$—where $c_{ij} = 1$ if locational units i and j are neighbors, and $c_{ij} = 0$ otherwise—and spatial autocorrelation parameter $\rho$:

$$\hat{\mathbf{Y}}_m = \mathbf{X}_m\boldsymbol{\beta} + \rho(\mathbf{I} - \rho\mathbf{C}_{mm})^{-1}\mathbf{C}_{mo}(\mathbf{Y}_o - \mathbf{X}_o\boldsymbol{\beta}).$$
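The CAR formula is simple to apply once C is partitioned by observed and missing locations. The following sketch assumes a known trend coefficient vector and rho (hypothetical values on a tiny hypothetical 4-site chain), which in practice would be estimated:

```python
import numpy as np

def car_impute(C, X, y, miss, beta, rho):
    """CAR imputation: X_m beta + rho (I - rho C_mm)^(-1) C_mo (Y_o - X_o beta)."""
    obs = np.setdiff1d(np.arange(C.shape[0]), miss)
    C_mm = C[np.ix_(miss, miss)]
    C_mo = C[np.ix_(miss, obs)]
    resid_o = y[obs] - X[obs] @ beta
    adj = np.linalg.solve(np.eye(miss.size) - rho * C_mm, C_mo @ resid_o)
    return X[miss] @ beta + rho * adj

# A 4-site chain 1-2-3-4 with the third site missing.
C = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.column_stack([np.ones(4), np.arange(4.0)])
y = np.array([1.0, 1.8, np.nan, 3.9])
print(car_impute(C, X, y, miss=np.array([2]),
                 beta=np.array([1.0, 0.9]), rho=0.2))
```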
In other words, the spatial EM and geostatistical kriging solutions are identical, and are equivalent to predicting new observations, yielding the following theorem:

THEOREM 4. The maximum likelihood estimate for missing georeferenced values described by a spatial autoregressive model specification, given by equation (11), is equivalent to the best linear unbiased predictor kriging equation of geostatistics, given by equation (10).

PF: Substituting the partitioned matrix components of matrix $\mathbf{V}^{-1}$ in equation (9) into their partitioned matrix $\boldsymbol{\Sigma}$ counterparts appearing in equation (10) yields equation (11).
This theorem highlights that spatial autoregression specifications deal with an inverse covariance matrix, whereas semivariogram specifications deal directly with the corresponding covariance matrix itself.
6.1. Imputations produced by an autoregressive model specification

Theorems 2, 3 and 4 coupled with the simultaneous autoregressive (SAR) model specification based upon the row-standardized version of matrix $\mathbf{C}$, namely matrix $\mathbf{W}$ (i.e., $w_{ij} = c_{ij}/\sum_{j=1}^{n} c_{ij}$), yields the following spatial SAR EM algorithm solution:
COROLLARY 1. Employing an autoregressive model specification to account for spatial autocorrelation in georeferenced data renders the imputation equation

$$\begin{bmatrix} \mathbf{Y}_o \\ \mathbf{0}_m \end{bmatrix} = \rho\mathbf{W}\begin{bmatrix} \mathbf{Y}_o \\ \mathbf{0}_m \end{bmatrix} + (\mathbf{I} - \rho\mathbf{W})\mathbf{X}\boldsymbol{\beta} - \sum_{m=1}^{M} y_m(\mathbf{I}_m - \rho\mathbf{W}_{om}^*) + \boldsymbol{\varepsilon}, \qquad (12)$$

where missing values in the vector $\mathbf{Y}$ are replaced by 0s, $\mathbf{I}_m$ is the indicator variable vector for missing value m, containing n-1 0s and a single 1 (hence, with its minus sign, entering with 0s and -1s), $\mathbf{W}_{om}^*$ is the column of the geographic weights matrix $\mathbf{W}$ associated with the m-th missing value, and M is the number of missing values.

The particular solution equations are given by

$$\hat{\mathbf{Y}}_m = \mathbf{X}_m\boldsymbol{\beta} - \mathbf{V}_{mm}^{-1}\mathbf{V}_{mo}(\mathbf{Y}_o - \mathbf{X}_o\boldsymbol{\beta}), \quad \text{where } \mathbf{V} = (\mathbf{I} - \rho\mathbf{W})^{\mathrm{T}}(\mathbf{I} - \rho\mathbf{W}), \ \text{and}$$

$$s_{y_m}^2 = \hat{\sigma}^2\left\{ \mathbf{V}_{mm}^{-1} + \mathbf{A}\left[\mathbf{B} - \mathbf{A}^{\mathrm{T}}\mathbf{V}_{mm}^{-1}\mathbf{A}\right]^{-1}\mathbf{A}^{\mathrm{T}} \right\}_{\mathrm{diag}}, \quad \text{where } \mathbf{A} = -(\mathbf{V}_{mm}^{-1}\mathbf{V}_{mo}\mathbf{X}_o - \mathbf{X}_m) \ \text{and} \ \mathbf{B} = \mathbf{X}_n^{\mathrm{T}}\mathbf{X}_n.$$
These are the predicted values and the predicted variances for new georeferenced observations. For the Puerto Rican percentage of harvested agricultural land example presented in §5.2, the log-odds ratio displays weak positive spatial autocorrelation [its Moran Coefficient (MC) is 0.21 (z = 39.6)]. Estimation of the parameters of equation (12) requires a Jacobian modification (see Martin 1984), which reduces to a denominator of $n_o$ in this particular case (i.e., the missing values are dispersed, resulting in the standard Jacobian term being the sum of the log-eigenvalues divided by $n_o$ rather than n). The spatial autoregressive parameter estimate is $\hat{\rho} = 0.3426$, and the imputations are

observation   log-normal approximation: SAR
29            0.26
49            0.29
70            0.13

These imputation values compare favorably with those reported in the two subsequent sections, with the additional exploitation of redundant locational information resulting in a sizeable difference only in the estimate for observation 49 (see §5.2).
6.2. Imputations produced by an eigenfunction-based spatial filter model specification

Theorems 2 and 3 coupled with an eigenvector-based spatial filter model specification (see Griffith, 2000, 2002, 2004) yields the following spatial filter EM algorithm solution:

COROLLARY 2. Employing a spatial filter constructed from eigenvectors selected to account for spatial autocorrelation in georeferenced data renders the imputation equation

$$\begin{bmatrix} \mathbf{Y}_o \\ \mathbf{0}_m \end{bmatrix} = \mathbf{1}\alpha + \mathbf{X}\boldsymbol{\beta}_X - \sum_{m=1}^{M} y_m\mathbf{I}_m + \sum_{k=1}^{K} \mathbf{E}_k\beta_{E_k} + \boldsymbol{\varepsilon}, \qquad (13)$$

where $\boldsymbol{\beta}_X$ is the vector of regression coefficients for the set of X attribute variable covariates, the K eigenvectors, denoted by $\mathbf{E}_k$, are selected from the candidate set extracted from matrix

$$(\mathbf{I} - \mathbf{1}\mathbf{1}^{\mathrm{T}}/n)\,\mathbf{C}\,(\mathbf{I} - \mathbf{1}\mathbf{1}^{\mathrm{T}}/n),$$

an expression that appears in the numerator of the Moran Coefficient, and $\beta_{E_k}$ is the regression coefficient for the k-th selected eigenvector.

The particular solution equations are given by

$$\hat{\mathbf{Y}}_m = \mathbf{X}_m\mathbf{b}_X + \mathbf{E}_{m,K}\mathbf{b}_{E_K}, \ \text{and}$$

$$s_{y_m}^2 = \hat{\sigma}^2\left\{ \mathbf{I}_{mm} + \begin{bmatrix} \mathbf{X}_m & \mathbf{E}_{m,K} \end{bmatrix} \left( \begin{bmatrix} \mathbf{X}_o & \mathbf{E}_{o,K} \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} \mathbf{X}_o & \mathbf{E}_{o,K} \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{X}_m & \mathbf{E}_{m,K} \end{bmatrix}^{\mathrm{T}} \right\}_{\mathrm{diag}}.$$
These are the predicted values and the predicted variances for new georeferenced observations. The spatial filter can be constructed using stepwise regression procedures.
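A sketch of generating that candidate eigenvector set follows (a hypothetical 4-site chain again; real applications use the study area's connectivity matrix):

```python
import numpy as np

C = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = C.shape[0]
M = np.eye(n) - np.ones((n, n)) / n      # projector (I - 11'/n)
vals, vecs = np.linalg.eigh(M @ C @ M)   # symmetric, so eigh applies
order = np.argsort(vals)[::-1]           # descending eigenvalues
E = vecs[:, order]                       # candidate eigenvectors E_k
print(vals[order])  # large positive values flag strong positive
                    # spatial autocorrelation patterns
```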
The spatial filter constructed for the 2002 Puerto Rican percentage of harvested agricultural land example presented in §5.2 includes the following three eigenvectors for the log-odds ratio normal approximation: $\mathbf{E}_4$ (MC = 0.85), $\mathbf{E}_8$ (MC = 0.71), and $\mathbf{E}_{14}$ (MC = 0.43). Based upon a stepwise linear regression analysis, these three eigenvectors respectively account for roughly 5%, 4%, and 4% of the residual variance; the number of farms covariate (F) accounts for roughly 14%; the Shapiro-Wilk statistic for the residuals is 0.0185 before imputation, and 0.0303 after imputation. Because Vieques (municipio #75) is an island located southeast of the main island, it was not included in this particular spatial analysis (it has no physical geographic neighbors). The calculated spatial filter imputations are

observation   log-normal approximation: equation (13)   log-normal approximation: multiple imputations
29            0.23                                      0.24 (se = 0.099)
49            0.31                                      0.32 (se = 0.116)
70            0.14                                      0.15 (se = 0.076)

Again, equation (13) furnishes imputations that are equivalent to those obtained with SAS PROC MI. Although equation (12) may be labeled a spatial EM algorithm solution, being the direct autoregressive counterpart to equation (1), equation (13) also may be labeled a spatial EM algorithm solution because it employs the set of complete data eigenvectors associated with a geographic surface partitioning.
6.3. Extensions to non-normal data

In keeping with Conjectures 1 and 3, the spatial filter specification outlined in the preceding section supports extensions of the linear model to the generalized linear model for spatial data (see Griffith 2002, 2004). A stepwise binomial regression selects only eigenvector $\mathbf{E}_8$ to construct a spatial filter here. Now the calculated spatial filter imputations are

observation   binomial
29            0.23
49            0.27
70            0.15

As before, the generalized linear model imputations closely agree with their log-normal approximation counterparts.
7. DISCUSSION, IMPLICATIONS, AND CONCLUSIONS
Three theorems are presented that help to simplify the EM algorithm solution, at least for selected but common data cases, relating this solution based upon the linear model with a normal probability distribution for its error term to standard regression theory for predicting values of new observations. In addition, three conjectures are presented that speculatively extend these results to the generalized linear model, especially in terms of Poisson and binomial regression. Conversion of these conjectures to theorems, as well as precise determination of imputation variances for these latter cases, remains to be done.

A fourth theorem and two corollaries are presented that extend the more classical results to the situation of spatially autocorrelated, georeferenced data. In parallel with the nonspatial solution, the spatial autoregressive EM algorithm is found to be equivalent to kriging in geostatistics. Although autoregressive modeling furnishes a spatial EM algorithm solution for linear modeling, spatial filtering provides the necessary tools to extend this analysis to generalized linear modeling. For the 2002 Puerto Rico agricultural data example presented here, log-odds normal approximation imputations agree with binomial regression results, with weak positive spatial autocorrelation introducing slight to sizeable modifications to these imputations. Comparable results have been obtained for imputations of the single missing coal ash data value reported by Cressie (1991, p. 160).
Multiple imputation and the role of the linear model coupled with a normal probability model emphasize, once more, that the Box-Cox type of variable transformations are not yet obsolete. A simplified version of power transformations that transcends the conventional ladder perspective is outlined here in Appendix A. This transformation approach, coupled with a more sophisticated version of the spatial autoregressive EM algorithm solution with constraints, is presented in
Griffith (1999).
Besides determining analytical variance estimates for imputations calculated with a generalized linear model, one additional result that merits closer scrutiny is the failure to obtain maxima estimates that agree with those reported by McLachlan and Krishnan (1997, p. 91; see
Table 2). Another is the poor performance of findings reported here for the ill-behaved, overdispersed Baltic Sea example data.
REFERENCES
Allison, P. 2002. Missing Data . Thousand Oaks, CA: SAGE.
Bartlett, M. 1937. Some examples of statistical methods of research in agriculture and applied biology, Journal of the Royal Statistical Society , Supplement 4: 137-183.
Blom, G. 1958. Statistical Estimates and Transformed Beta Variables.
New York: Wiley.
Christensen, R. 1991. Linear Models for Multivariate, Time Series, and Spatial Data . Berlin:
Springer-Verlag.
Cliff, A., and J. Ord. 1980. Spatial Processes . London: Pion.
Cressie, N. 1991. Statistics for Spatial Data . New York: Wiley.
Dempster, A., Laird, N., and Rubin, D. 1977. Maximum likelihood from incomplete data via the
EM algorithm, Journal of the Royal Statistical Society . Series B, 39: 1-38.
Dodge, Y. 1985. Analysis of Experiments with Missing Data . New York: Wiley.
Draper, N., and H. Smith. 1966. Applied Regression Analysis . New York: Wiley.
Flury, B., and A. Zoppè. 2000. Exercises in EM, The American Statistician . 54: 207-209.
Griffith, D. 1993. Advanced spatial statistics for analyzing and visualizing geo-referenced data ,
International Journal of Geographical Information Systems . 7: 107-123.
Griffith, D. 1999. “A methodology for small area estimation, with special reference to a onenumber agricultural census and confidentiality: results for selected major crops and states,”
NASS Research Report RD-99-04. Washington, DC: Research Division, National Agricultural
Statistics Service, U.S. Department of Agriculture.
Griffith, D. 2000. A linear regression solution to the spatial autocorrelation problem, J. of
Geographical Systems , 2: 141-156.
Griffith, D. 2002. A spatial filtering specification for the auto-Poisson model, Statistics &
Probability Letters . 58: 245-251.
Griffith, D. 2004. A spatial filtering specification for the auto-logistic model, Environment &
Planning A . 36: 1791-1811.
Haining, R., D. Griffith, and R. Bennett. 1989. Maximum likelihood estimation with missing spatial data and with an application to remotely sensed data, Communications in Statistics . 18:
1875-1894.
Horton, N., and S. Lipsitz. 2001. Multiple imputation in practice: comparison of software packages for regression models with missing variables, The American Statistician. 55: 244-254.
Ibrahim, J., and S. Lipsitz. 1996. Parameter estimation from incomplete data in binomial regression when the missing data mechanism is nonignorable, Biometrics . 52: 1071-1078.
Little, R., and D. Rubin. 1987. Statistical Analysis with Missing Data . New York: Wiley.
Liu, J., and D. Dey. 2004. Multilevel overdispersed Poisson regression models with missing data, unpublished manuscript, Department of Statistics, University of Connecticut.
Manly, B. 1976. Exponential data transformations, Statistician . 25: 37-42.
Martin, R. 1984. Exact maximum likelihood for incomplete data from a correlated Gaussian process, Communications in Statistics . 13: 1275-1288.
McLachlan, G., and T. Krishnan. 1997. The EM Algorithm and Extensions. New York: Wiley.
Meng, X. 1997. The EM algorithm, in Encyclopedia of Statistical Sciences , Update Vol. 1, edited by S. Kotz, C. Read and D. Banks, pp. 218-227, New York: Wiley.
Montgomery, D., and E. Peck. 1982. Introduction to Linear Regression Analysis . New York:
Wiley.
Rubin, D. 1972. A non-iterative algorithm for least squares estimation of missing values in any analysis of variance with missing data, Applied Statistics . 21: 136-141.
SAS. 1999. "Chapter 9: The MI Procedure," SAS OnlineDoc™, Version 8. support.sas.com/rnd/app/papers/miv802.pdf (accessed 4/14/2006).
Schafer, J. 1997. Analysis of Incomplete Multivariate Data . New York: Chapman & Hall.
Sen, A., and M. Srivastava. 1990. Regression Analysis: Theory, Methods and Applications. New York: Springer-Verlag.
Yates, F. 1933. The analysis of replicated experiments when the field results are incomplete,
Empire Journal of Experimental Agriculture. 1: 129-142.
Table 1. Iterative results for a bivariate data example presented in McLachlan and Krishnan (1997, pp. 49-51).

            McLachlan and Krishnan               Nonlinear regression results
iteration   b         a         -2log(L)         a        b        ESS
 0          0.80535   4.40547   1019.64200       4.4055   0.8054   159.9
 1          0.52832   7.68467    210.93090       7.6847   0.5283   127.3
 2          0.52050   7.83272    193.33120       7.8327   0.5205   127.2
 3          0.51973   7.85517    190.55040       7.8552   0.5197   127.2
 4          0.51957   7.85997    190.01470       7.8600   0.5196   127.2
 5          0.51954   7.86104    189.90800       7.8610   0.5195   127.2
 6          0.51953   7.86125    189.88660       7.8613   0.5195   127.2
 7          0.51953   7.86133    189.88230
 8          0.51953   7.86133    189.88160
 9          0.51953   7.86132    189.88140
10          0.51953   7.86132    189.88140
11          0.51953   7.86132    189.88140
Table 2. Calculations from ANCOVA regression and the EM algorithm.

Data Source                   quantity                        Reported value    OLS/NLS estimate
McLachlan & Krishnan (1997)
  p. 49                       μ̂_2                             14.61523          14.61523
                              σ̂_12                            20.88516          208.85156/10
                              σ̂_22                            26.75405          230.875/8 + 0.51953²(402/10 − 384/8) = 26.75407
  p. 53                       ŷ(1, 1)                         429.6978          429.69767
                              ŷ(0, 1)                         324.0233          324.02326
  p. 54                       ŷ_23                            4.73              4.73030
  p. 91                       ŷ_51                            3.598             3.59697
                              saddle point: σ̂_11, σ̂_22, ρ     5/2, 5/2, 0       5/2, 5/2, 0
                              maxima: σ̂_11, σ̂_22, ρ           8/3, 8/3, ±0.5    2.87977, 2.87977, ±0.88817
Little & Rubin (1987)
  p. 31                                                       7.8549            7.85492
                                                              7.9206            7.92063
  p. 101                      u_1 estimate                    49.3333           49.33333
                              u_2 estimate                    48.1000           48.10000
                              σ̂²                              59.4260           59.42600
  p. 118                      μ̂(x_1)                          6.655             6.65518
                              μ̂(x_2)                          49.965            49.96524
                              μ̂(x_4)                          27.047            27.03739
Schafer (1997)
  p. 43                                                       1.80              18/10
  p. 54                       σ̂_11 = σ̂_22                     -1                -1
                              μ̂_1 = μ̂_2                       0                 0
  p. 195                      ŷ_{3,2} average (n=5)           226.2             228.0 (se = 32.86)
                              ŷ_{3,4} average (n=5)           146.8             146.2 (se = 38.37)
                              ŷ_{3,5} average (n=5)           190.8             192.5 (se = 34.11)
                              ŷ_{3,10} average (n=5)          250.2             271.7 (se = 36.20)
                              ŷ_{3,13} average (n=5)          234.2             241.3 (se = 35.18)
                              ŷ_{3,16} average (n=5)          269.2             269.9 (se = 34.53)
                              ŷ_{3,18} average (n=5)          192.4             201.9 (se = 32.91)
                              ŷ_{3,23} average (n=5)          215.6             207.4 (se = 33.09)
                              ŷ_{3,25} average (n=5)          250.0             255.7 (se = 33.39)
Montgomery & Peck (1982)
  pp. 127 & 142               ŷ_26                            19.22             19.22432
                              s_{ŷ_26}                        3.34628           √[10.6239(1 + 0.05346)] = 3.34542
Table 3. Multivariate normal imputations when more than one variable contains missing values: the Little and Rubin (1987, p. 118) example data set.

                           unconstrained                                   constrained
                           Equation (8) results    PROC MI results         Equation (8) results
variable   observation     ŷ          s_m          ŷ          s_m          ŷ          s_m
X_1        10              12.8907    196.1        12.8809    3.67861      12.8351    185.6
           11              -0.4686    190.8        -0.4561    3.60878      0          0
           12              10.1054    179.3        10.0911    3.42244      10.0957    177.2
           13              9.9899     185.9        9.9999     3.52741      9.9935     170.9
X_2        10              65.8389    538.2        65.9448    9.41728      65.9877    509.5
           11              48.1963    523.5        48.1680    9.18909      46.9425    111.8
           12              68.0844    510.2        68.0420    8.99468      68.0745    486.2
           13              62.4285    492.2        62.4336    8.66538      62.4541    468.9
X_4        7               37.8695    366.3        37.9041    7.80931      37.8695    346.9
           8               19.8902    356.3        19.8872    7.63828      19.8902    148.4
           9               14.4591    347.2        14.3398    7.37670      14.3657    330.7
           10              20.7966    0.4861       20.8345    0.83148      21.5841    0.4611
           11              8.1828     0.3400       8.2192     0.58860      8.1891     0.3225
           12              0.8449     0.4569       0.8725     0.79822      0.8449     0.4335
           13              15.4429    334.9        15.4612    7.15627      15.4269    318.9
Table 4. Multivariate normal imputations when more than one variable contains missing values: the SAS (1999, p. 133) example data set.

                           Equation (8) results     PROC MI results
variable   observation     ŷ           s_m          ŷ          s_m
X_1        20              41.3616     3.9332       42.4394    2.84722
           23              45.9610     5.8598       46.2275    3.00148
           25              52.1291     6.0461       52.2993    3.06625
X_2        28              9.4214      1.8602       9.7688     0.79959
           6               6.6074      2.1145       7.6800     0.88971
           10              10.4398     1.8310       10.5164    0.80302
X_3        13              170.6799    5.7283       170.357    10.5193
           18              169.6093    6.5199       169.194    10.6017
           4               172.2305    5.5869       171.993    10.2231
           20              172.5714    5.5439       172.661    10.0909
           28              164.7680    7.4900       163.350    11.7891
           6               172.0624    8.4021       171.934    10.7925
           10              171.3614    6.3984       171.347    10.4574
           23              172.7761    5.6584       172.385    10.3858
           3               168.7691    8.6998       168.562    11.2857
Figure 1 . Scatterplots of summary multiple imputation output versus corresponding equation (1) output. (a) top left: regression parameter estimates. (b) top right: regression parameter estimate standard errors. (c) bottom left: imputations. (d) bottom right: imputation standard errors.
Figure 2 . Scatterplots of summary multiple imputation output versus corresponding equation (1) output for rainfall in Puerto Rico. (a) top left: imputations for 1990. (b) top right: imputation standard errors for 1990. (c) bottom left: imputations for 1997. (d) bottom right: imputation standard errors for 1997.
Figure 3 . Scatterplots of summary multiple imputation output versus corresponding generalized linear model output for rainfall in Puerto Rico. (a) top left: imputations for 1990. (b) top right: imputation standard errors for 1990. (c) bottom left: imputations for 1997. (d) bottom right: imputation standard errors for 1997.
APPENDIX A
NORMALITY TRANSFORMATIONS
Sometimes data are forced to mimic the normal probability distribution by application of a Box-Cox-type of power transformation. Normal scores can be computed for the observed data using the following equation given by Blom (1958):

$$\mathrm{n}y_i = \Phi^{-1}\!\left[\frac{r_i - 3/8}{n_o + 1/4}\right], \qquad \text{(A1)}$$

where $\mathrm{n}y_i$ denotes the normal quantile score for observation value $y_i$, $r_i$ is the rank of the i-th value of variate Y, and $\Phi$ denotes the cumulative normal distribution function. The Box-Cox class of variance stabilizing transformations may be written as follows:
$$Y^* = (Y + \delta)^\gamma, \quad \gamma \ne 0 \ \text{and} \ (Y + \delta) > 0, \quad \text{and}$$

$$Y^* = \mathrm{LN}(Y + \delta), \quad \gamma = 0 \ \text{and} \ (Y + \delta) > 0,$$

where $\delta$ is a translation parameter (see Sen and Srivastava, 1990). Manly (1976) supplements this pair of equations with the following third option:

$$Y^* = \exp(\gamma Y).$$
Consequently, the transformation that maximizes conformity with a normal probability distribution may be selected from the following possibilities:

$$\mathrm{n}y_i = \alpha + \beta(Y_i + \delta)^\gamma, \quad \text{or} \quad \mathrm{n}y_i = \alpha + \beta\,\mathrm{LN}(Y_i + \delta), \quad \text{or} \quad \mathrm{n}y_i = \alpha + \beta\exp(\gamma Y_i),$$

depending upon which transformation achieves the smallest mean squared error. These alternatives need to be estimated with a NLS routine (e.g., SAS PROC NLIN), with suitable constraints attached, and appear not to need to include the transformation Jacobian for modest to large values of n. Because the response variable is based upon equation (A1), the selected parameter estimates for $\delta$ and $\gamma$ tend to maximize the Shapiro-Wilk normality test statistic.
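A sketch of that estimation strategy follows (hypothetical skewed data; the linear coefficients alpha and beta are profiled out by OLS inside the NLS objective, which is one convenient way to honor equation (A1)):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import least_squares

rng = np.random.default_rng(11)
y = rng.lognormal(2.0, 0.6, 200)              # hypothetical skewed variate
r = y.argsort().argsort() + 1                 # ranks r_i
ny = norm.ppf((r - 3/8) / (y.size + 1/4))     # Blom (1958) normal scores

def resid(theta):
    delta, gamma = theta
    z = np.power(np.clip(y + delta, 1e-9, None), gamma)
    Z = np.column_stack([np.ones(y.size), z])
    ab, *_ = np.linalg.lstsq(Z, ny, rcond=None)   # profile out alpha, beta
    return ny - Z @ ab

fit = least_squares(resid, x0=[0.5, 0.5],
                    bounds=([-y.min() + 1e-6, 0.01], [np.inf, 5.0]))
print(fit.x)                                  # selected (delta, gamma)
```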
When $\gamma > 1$, a power transformation stretches out the upper tail and shrinks the lower tail of the frequency distribution of Y. When $0 < \gamma < 1$, a power transformation stretches out the lower tail and shrinks the upper tail of the frequency distribution of Y. In either case, the frequency distribution at least becomes more symmetric, with the mean and the median becoming more similar. Meanwhile, inclusion of the translation parameter $\delta$ tends to better align one or both tails with the end points of the theoretical normal quantile plot straight line.
Using E(Y) to denote the expected value of Y, employing the calculus of expectations notation, a back-transformation is needed to calculate E(Y) from results obtained for Y*. In imputation work the expected value of the original variable Y is the quantity of interest, rather than its power-transformed counterpart used to more successfully execute a multiple imputation routine. Given $\gamma$, the back-transformation involves $1/\gamma$. Accordingly, for $0 < \gamma < 1$, $1 < 1/\gamma < \infty$, and for $1 < \gamma < \infty$, $0 < 1/\gamma < 1$. Of note is that parsimony considerations suggest that when $\gamma$ is very close to 0 (e.g., < 1/15), the logarithmic transformation should be used. Once a back-transformation has been calculated, then the set of back-transformed multiple imputations can be used to estimate the variance of an imputation.
Integer moments about the origin for Y are given by evaluation at t = 0 of successive derivatives with respect to t of the moment generating function for a normal distribution, namely

$$M_{y^*}(t) = \exp\!\left[\mu_{y^*}t + \frac{\sigma_{y^*}^2 t^2}{2}\right],$$

rendering expressions in terms of the mean, $\mu_{y^*}$, and variance, $\sigma_{y^*}^2$, of Y*. Because it is restricted to integer derivatives (e.g., 1st, 2nd), this is the basis of the standard transformation ladder. Both integer and noninteger moments about the origin for Y are given by

$$\mathrm{E}[(Y^*)^{1/\gamma}] = \frac{1}{\sigma_{y^*}\sqrt{2\pi}} \int_{-\infty}^{\infty} (Y^*)^{1/\gamma} \exp\!\left[-\frac{(Y^* - \mu_{y^*})^2}{2\sigma_{y^*}^2}\right] dY^*.$$
Interestingly, for the common case interval of $0 < \gamma < 1$,

$$\mathrm{E}(Y) = -\delta + \mathrm{E}[(Y^*)^{1/\gamma}] = -\delta + \mu_{y^*}^{1/\gamma} + \sum_{j=1}^{k} v_j\,\mu_{y^*}^{1/\gamma - 2j}\sigma_{y^*}^{2j}, \qquad 2j \le \frac{1}{\gamma},$$

where

$$v_j = \prod_{h=1}^{j} \frac{1}{2h}\left[\left(\frac{1}{\gamma} - 2h + \frac{3}{2}\right)^2 - \frac{1}{4}\right], \qquad \frac{1}{\gamma} > 1,$$

with $[1/\gamma]$ denoting the integer value of $1/\gamma$.
The case of $\gamma = 0$ has the well-known back-transformation appearing in §5.1, namely

$$\mathrm{E}(Y) = -\delta + \exp\!\left[\mu_{y^*} + \frac{\sigma_{y^*}^2}{2}\right].$$

This particular, standard result is the most useful one for generalized linear modeling cases, because both the log-mean and log-odds ratio involve a logarithmic transformation to achieve normality.
This particular, standard result is the most useful one for generalized linear modeling cases, because both the log-mean and log-odds ratio involve a logarithmic transformation to achieve normality.
These back-transformation findings have been corroborated with a set of simulation experiments. Unfortunately, to date only rather speculative results have been obtained for the cases of 1/γ > 1 and 1/γ < 0.