Using incomplete multivariate data to simultaneously estimate the means

by Susan Marie Hinkins

A thesis submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in Mathematics

Montana State University
© Copyright by Susan Marie Hinkins (1979)

Abstract: This paper presents two estimators, a Bayes and an empirical Bayes estimator, of the mean vector from a multivariate normal data set with missing observations. The variance-covariance matrix is assumed known; the missing observations are assumed to be missing at random; and the loss function used is squared error loss. These two estimators are compared to the maximum likelihood estimate. Both estimators resemble the ridge-regression estimate, which shrinks the maximum likelihood estimate towards an a priori mean vector. Both estimators are consistent for θ and asymptotically equivalent to the maximum likelihood estimate. Small sample properties of the Bayes estimator are found, including conditions under which it has smaller risk than the maximum likelihood estimate. While small sample properties of the empirical Bayes estimator are more difficult to find, numerical examples indicate that under some conditions the empirical Bayes estimator may also improve on the maximum likelihood estimator. This paper also presents two computer programs which provide other useful methods for estimation when there are data missing. The program RFACTOR creates Rubin's factorization table. The program MISSMLE calculates the maximum likelihood estimates of the mean vector and variance-covariance matrix from a multivariate normal data set with missing data; it uses Orchard and Woodbury's iterative procedure.

USING INCOMPLETE MULTIVARIATE DATA TO SIMULTANEOUSLY ESTIMATE THE MEANS

by SUSAN MARIE HINKINS

A thesis submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in Mathematics

Approved: Chairman, Examining Committee

MONTANA STATE UNIVERSITY
Bozeman, Montana
August, 1979

ACKNOWLEDGMENT

I wish to thank my thesis adviser Dr. Martin A. Hamilton for his advice and assistance throughout my graduate work, and for suggesting the topic of this thesis. Thanks are also due to Ms. Debra Enevoldsen for taking so much care in typing this manuscript, and to R. M. Gillette for drawing the flow charts. I also wish to acknowledge my parents for their encouragement and support throughout my education.

TABLE OF CONTENTS

CHAPTER
1. INTRODUCTION
2. BACKGROUND MATERIAL
   2.1 Maximum Likelihood Estimation
   2.2 Bayes and Empirical Bayes Estimation
   2.3 Estimation with Incomplete Data
   2.4 Maximum Likelihood Estimation with Incomplete Data
   2.5 Estimating Linear Combinations of the Means When There are Missing Data in a Bivariate Normal Sample
3. RUBIN'S FACTORIZATION TABLE
   3.1 Creating Rubin's Factorization Table
   3.2 The Computer Program
   3.3 Sample Problem
4. ORCHARD AND WOODBURY'S ALGORITHM FOR MULTIVARIATE NORMAL DATA
   4.1 The Iterative Algorithm
   4.2 The Computer Program
   4.3 Sample Problem
5. MAXIMUM LIKELIHOOD, BAYES, AND EMPIRICAL BAYES ESTIMATION OF θ WHEN THERE ARE MISSING DATA
   5.1 Assumptions
   5.2 More Notation
   5.3 Sufficient Statistics
   5.4 The M.L.E. of θ When Σ is Known
   5.5 Bayes Estimation of θ When Σ is Known
   5.6 The Risk of the Bayes Estimate θ(B) Compared to the M.L.E.
   5.7 Empirical Bayes Estimation of θ When Σ is Known
   5.8 Examples
6. SUMMARY
BIBLIOGRAPHY

1. INTRODUCTION

The statistical analysis of partially missing or incomplete data sets is an important problem which occurs in many subject areas. Responses to questionnaires often include unanswered questions or undecipherable responses. Mechanical failures can cause incomplete data. For example, several weather measurements (temperature, wind speed, wind direction, etc.) may be recorded automatically on an hourly basis, but a recording device may malfunction occasionally or run out of ink, causing missing data. Human error, including computer error, may cause data to be unobserved or lost.

A seemingly simple solution to analyzing incomplete data is to discard any observation which is not complete. It may be impractical, however, to discard information which may have been expensive or difficult to collect. In some situations, discarding incomplete observations may amount to discarding the entire experiment. Since incomplete data may occur for any type of analysis (maximum likelihood estimation, analysis of variance, factor analysis, linear models analysis, contingency table analysis, etc.), the problem of statistical analysis when the data are not complete touches every branch of statistical methodology.
One of the most basic statistical analyses is simultaneous estimation of the variate means from a multivariate data set. This is the primary concern of this paper. Chapter 2 gives a brief review of current literature dealing with estimation and hypothesis testing when there are missing data in a multivariate data set. Much of the work described is concerned with multivariate normal data.

Rubin's method for factoring the likelihood of the observed data, when there are data missing, is described in chapter 3. This technique is a useful tool, especially in large data sets. When possible, it factors the big problem into a few simpler estimation problems. The algorithm for creating the factorization is hard to use unless the data set is small and has a simple pattern of missingness. As part of this research project, I have written a FORTRAN computer program which creates Rubin's factorization table for any pattern of missingness. It displays the pattern of missingness, the number of observations with each pattern, and the factorization, if any, of the problem.

Maximum likelihood estimation is an important, often-used statistical technique. When there are data missing, calculation of the maximum likelihood estimate of the mean and variance-covariance matrix of a multivariate normal distribution is a nontrivial problem. Orchard and Woodbury (1970) developed an iterative procedure for finding these estimates. I have written a computer program which calculates the maximum likelihood estimates using Orchard and Woodbury's procedure. It is discussed in chapter 4.

The approach taken in this paper (chapter 5) is to consider Bayes and empirical Bayes estimates of the mean, θ, of a multivariate normal distribution from a random sample where there are data missing. These estimates are compared to the maximum likelihood estimate. Some assumptions are made about the process that causes missingness, but the only assumption made about the pattern of missingness is that every variable is observed at least once. The variance-covariance matrix is assumed to be known.

A multivariate normal prior distribution, with mean vector 0 and diagonal variance-covariance matrix, is assumed on θ, and the Bayes estimate under squared-error loss is found. The Bayes estimate is similar to a ridge-regression estimate, which is a shrinkage estimator, shrinking the maximum likelihood estimate towards 0. The Bayes estimate has smaller Bayes risk than the maximum likelihood estimate, under squared error loss, and for values of θ within a specified ellipsoid it has smaller risk than the maximum likelihood estimate. The Bayes estimate is biased, but under mild conditions it is consistent and asymptotically equivalent to the maximum likelihood estimate.

However, the Bayes estimate is not always practical since it depends on knowledge of the variances in the prior distribution. Using the unconditional distribution of the data, estimates of the variances of the prior distribution can be found. These estimates are substituted into the formula for the Bayes estimate of θ, and the resulting estimate is an empirical Bayes estimate of θ. The idea here is that this estimate, although based entirely on the observed data, will retain some of the nice properties of the Bayes estimate. The empirical Bayes estimate is in the form of a ridge-regression estimate. It is biased, but under the same mild conditions, it is consistent and asymptotically equivalent to the maximum likelihood estimate.
A few numerical examples were done and the empirical. Bayes estimate compared well to the maximum likelihood estimate. errors. The basis of comparison was the sum of squared 2. BACKGROUND MATERIAL 2.1 Maximum Likelihood Estimation Let 0 = (0^...0p)’ be a p x I vector of parameters with values in a specified parameter space ft, where ft is a subset of Euclidean p-space. The vector X = (x^.-.x^)' denotes the vector of observations and f(x|0) denotes the density of X at a value of 0. The likelihood function of 0 given the observations is defined to be a function of 0 »l(x|0), such a that l(x|0) is proportional to f (X|0). If the estimator 0 = 0 (X) maximizes £(x|0), then 0 is a maximum likelihood estimate (M.L.E.) of 0. It is often convenient to deal with the log likelihood, A L(X;0) = ln(l(x|0)), in which case 0 maximizes L(X;0). When L(X;0) is a differentiable function of 0 and sup L(X;0) is attained at ah interior ■■ ■ 0eft point of ft, then the M.L.E. is a solution of the maximum likelihood equations 3L(X;0)/30 = Q. There are densities where a M.L.E. does not exist or is not unique. Under regularity conditions, the M.L.E. exists, is consistent for 0, and is asymptotically efficient (Rao 1973). a . If 0 is the M.L.E. of 0 and h(0) is a function of 0,. then the. M.L.E. of h(0) is h(0) (Graybill 1976). , 2.2 Bayes and Empirical Bayes Estimation An alternative method of estimation is Bayes estimation, following from Bayes' theorem which was published in 1763. The essential compo­ nents are the parameter space ft, a prior distribution on the parameter, 6 observations X in Euclidean n-space, and a loss function L S (0,0) which measures the loss occurring when the parameter 0 is estimated by 0 = 0(X). It is assumed that there is a family of density functions {f (x|0),0 E fi} for an observation X. For any 0 = 0(X) in Euclidean p-space, the risk function is defined as R(0,0) = Ex |0 (LS(0,0)) =/LS(0,0(X))f(x|0)dX (Ferguson 1967). It will be assumed that the only estimates 0 being considered are those where R(0,0) is finite.. For the given probability distribution function P on the Bayes risk is defined as r(0) = E 0 (R(0,0)) =/R(0,0)dP(0) . . " (B) A Bayes estimate, if one exists, is a value 0 such that r(0(B)) = inf r(0). It can be shown that the Bayes estimate also minimizes a . ■ E 0 |x (LS(0,0(X))) (DeGroot 1970). . ■ ■ . ' If X has probability density function (p.d.f.) f (x|0) and 0 has p.d.f. f (0), then Bayes' theorem states.that. •f (0 |x) = cf (x|0)f(0), where c is not a function of 0. The p.d.f. f (0) is called the prior p.d.f. and f (0|X) is called the posterior p.d.f. Bayes estimation is particularly suited to iterative estimation; from previous experience, one increases and refines one's knowledge about 0. The prior distribution represents what is known about the parameter prior to sampling; it can indicate quite specific knowledge or 7 relative ignorance. The posterior distribution can also be written f(6 |X) = k £(x|6)f(0) where £(X 10) is the likelihood function and k is not a function of 0. The likelihood function can be thought to represent the information about 0 coming from the data. It is the function through which the data, X, modifies the prior knowledge about 0. If the data could not provide additional information about 0, i.e, if the prior totally dominated the likelihood, one would usually not bother to gather data and calculate an estimate of the parameter. 
Therefore one usually wants the prior to be dominated by the likelihood; that is, the prior distribution does not change very much over the region in which the likelihood is appreciable and the prior does not assume large values outside this region. For example, if x^.Xg,...,^^ is a random sample, i x ^ I0 normally distributed with mean 0 and variance o 2 2 (Mt(0 ,a )), and . 9 2 (a,v ), then the prior is dominated by the likelihood if v is large . 2 compared to o /N. We will be concerned specifically with Bayes estimation under squared error loss, L S (0,0) = (0 - 0 ) ’V(0 - 0) where V is a symmetric, positive definite matrix of constants. estimate is 0 ^ In this case the unique Bayes = E (6 |x).and the Bayes risk is r ( 0 ^ ) = tr(V Var(0 |x)) (DeGroot 1970), where tr(M) indicates the trace of the matrix M. 8 2.2.1 James-Steln and Ridge-Regression Estimators . . Consider the simple situation where X |-6 has a multivariate Ijibrmal . distribution with mean vector 9 and variance-covariance matrix I (X|6 ~ MVN(9,Ip)), where I is the p x p identity matrix, n = p, and the parameter, space U is Euclidean p-space. The M.L.E. of 0 is 0 = X; this is also the least, squares estimate, and for each i, i = 1,2,...,p, 9^ = x_^ is the best linear unbiased estimate (b.l.u.e.) of 0^, Suppose the loss function of interest is simple squared error loss, i.e. V = I^, then 0 is also the minimax.estimate; that is, 0 minimizes max(R(0,0)). 0 James and Stein.(1960) showed that the estimator 0 ^ = (I - (p - 2)/X'X)X, which is neither unbiased nor a linear A estimate of 0, has smaller mean squared error than 0 when p > 3; that is, ex |q (0 ^ - 6) ' (6 ^ - 0) < Ex |g(0 - 0) '(0 - 0). Therefore, when p > 3, the M.L.E. 0 is inadmissible under squared error loss, being dominated by the James-Stein (J-S) estimate; The J-S estimate can be constructed as an empirical.Bayes (E.B .) 1 estimate. Let the prior distribution on 0 be MVN((), (A)I^) where A is a positive constant* Under squared error loss, with V = I , the Bayes p estimate is (I - 1 / ( 1 + A))X. If A is not known, 1/(1 + A) must be ,estimated. When 1/(1 + A) is replaced by its estimate, one has an E.B. estimate. The M.L.E. based on f (X) of 1/(1 + A) is p/X’X; however if one uses the unbiased estimate (p - 2)/X,X for 1/(1 + A), the .resulting. E.B. estimate is identical to thie J-S estimate. This E.B. , 9 derivation clearly shows that the J-S estimate amounts to shrinking t h e . least squares estimate towards JJ). James and Stein also offered a simple extension of this estimate, namely the estimate 8* = (I - (p - 2)/X'X)+X where (u)+ = u if u > 6 = 0 if u < 0 . The idea is that (p - 2)/X'X is estimating 1/(1 + A) which is between 0 and I. The modification 0* fixes the estimate of (1/(1 + A ) ) at I if (p - 2)/X'X I.. Efron and Morris (1972) look at J-S type estimators when X^ ~ M V N I ^ ) . Hudson (1974) gives a good general discussion of E.B. estimators and extends the theory of J-S estimation. The point 0 towards which the J-S estimator shrinks the least squares estimator is not special. The J-S estimator can shrink towards any given 6^; the appropriate estimate is then 8 ^ ) = 0Q + (I - (p - 2)/(X - 8g)' (X - 80)) (X - 0q ) . Or suppose one: suspects that 0 has a prior distribution 0 ^ MVN(y^,(A)I^), where, both y and A are unknown scalars, and ^ is the p x I vector of I's, the appropriate J-S estimate is 0 (S) = + (I - (p - 3) / (X x p ' (X - x p M X - x p where x = Then p Z x^ , i=l So, without loss of generality, we will continue to. 
consider estimators which shrink the least squares estimate towards the zero vector. 10 Hoerl and Kennard (1970) proposed the ridge-regression estimator, (R-R), originally for application to the general linear model and least squares estimates. towards It is a shrinkage estimator which shrinks the M.L.E as does the J-S estimator. The R-R estimator can also be derived as a Bayes or E.B. estimator. Let X = (x- ...x ) be the M.L.E. of 6' = (6 ...0 ). An R-R estiI p l p . * (R) mator is one which can be expressed as = (I - h^(X).)x^ , i = 1,2,...,p. It shrinks the M.L.E. towards Q, but each component is not shrunk by the same proportion. A J-S type estimator is of the form " Zg) = (I - h (X))X^ ; each component is shrunk by the same proportion. 2.2.2 An Example Suppose. X^ = ( x ^ . ..x^p)., i = I .... N, is a random sample; X ' I6 ~ MVN(0,|) where | is the p x p diagonal matrix (cr..) , ff... > 0, 1 _ N . _ _ 13 JJ j = 1,2,...,p. Let X = 2 X '/N A (x ...x )'. Then. j . . i — -I *p' i=l x|G MVN(G,(1/N)|) and X is the M.L.E; of 0, For the loss function L S (0,0) = N(0 - 0)'I ^(0 - 0), the risk is R(0,X) = p. Suppose one assumes a prior 8 ~ M V N (0,(A)|) where A is a positive scalar constant. Then the Bayes estimate of 6 is 6 ^ where 0^B ^ = (I - 1/(1 + NA))X v^. (I - 1/(1 + NA))p < p. = ( 8 ^ \ . . 8 ^ )' -L p The Bayes risk of 0^B ^ is Replacing 1/(1 + NA) by an estimate one gets an E.B. estimate which is a J-S estimate. If the a are known, • JJ (p - 2)/(N)X'|~^X is an unbiased estimate, with respect to the 11 unconditional density f (X), and the J-S estimate is (I - (p - 2)/(N)X'| ^X)X. The risk of the J-S estimate is p - (p - 2 + (p - 2) (I - 1/(1 + NA)) - 2 0 fEx j0((p - 2)/(XtIf1X) T 1X)). If in fact 0 = 0 (and A = 0 ) , the risk (and Bayes risk) of the J-S estimate is 2 compared to the risk p of the M.L.E. Suppose, instead, that one assumes a prior 0 A is still a positive scalar constant. MVN(0,(A)I^) where For example (Hudson 1974), one - might be measuring the incidence of food poisoning in several towns. Data from a large town would give an estimate with low variability compared to data from a small town. However it could be that the variability in true incidence rates is the same for all size towns. In this case the Bayes estimate of 0^ is 0 ^ ^ = (I - which is an R-R estimate. yVn) The. Bayes risk is r(0v ') = + NA))x ^ P I. NA/(a., + NA) 1=1 11 A /T)\ which is strictly less than p, and r(0' ') decreases as A decreases. If the a., are known, then an E.B. estimate, which is. also an R-R esti- 11 / mate, is found by replacing I/ (cr^ + NA) by an estimate, which might be obtained using the unconditional distribution X ~ MVN({|),(1/N)if + (A)I^) 2.3 Estimation with Incomplete Data Missing or partially specified observations in multivariate data may occur in almost any type of experiment. The primary interest in this discussion is the problem of estimating a parameter 0 from a 12 multivariate distribution. Let the p x I random vectors X ^ be independently, identically distributed with p.d.f. f^(x|6). the element of X^ by .,X^ Denote Instead of observing X^,Xg,...,X^, however, only Y . = Y .(X ) can be observed. are incomplete. , In this sense, the data This,general problem encompasses missing data, par­ tially specified data, and other general situations. When the data are incomplete, inference must be based on the joint p.d.f. of the Y^'s, fyCYf,.. .,Yjj JI0) . 
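To make this notation concrete, the following minimal Python sketch (the function and variable names are illustrative, not from the thesis) builds the observed data Y_i = Y_i(X_i) for the missing-data special case discussed next: each Y_i is simply the subset of components of X_i that were actually recorded, kept together with the indices of those components. The 0-1 indicator matrix used here anticipates the incompleteness matrix M defined in section 2.4.2.

```python
import numpy as np

def observed_data(X, M):
    """Form the incomplete data Y_i = Y_i(X_i) for the missing-data case.
    X is the N x p matrix of complete (but not fully observed) values and
    M is the N x p 0-1 indicator with M[i, j] = 1 when x_ij was observed.
    Each Y_i is returned as (observed variable indices, observed values)."""
    X = np.asarray(X, dtype=float)
    M = np.asarray(M, dtype=int)
    Y = []
    for i in range(X.shape[0]):
        idx = np.flatnonzero(M[i])      # which variables were recorded
        Y.append((idx, X[i, idx]))
    return Y

# first unit fully observed; second unit observed on its first two variables only
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
M = np.array([[1, 1, 1],
              [1, 1, 0]])
for idx, values in observed_data(X, M):
    print(idx, values)
```

For the toy rows above, Y_1 is the full vector and Y_2 contains only the first two variables, which mirrors the example given in the next paragraph.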
The problem of missing data is the particular case of incomplete data where each Y^(X^) is a function taking X^ onto a specific subset of elements in X^. For example, if in the first obser­ vation all variables are observed and in the second observation only. the first two variables are observed, then Y^(X^) = X^ = ( x ^ x^g.,.x^) and Yg(Xg) = (Xg1 Xgg}. 2.3.1 Missing at Random An assumption often made in analysis of missing data is that the missing observations are missing at random (MAR). The data are.said to be MAR if the distribution of (the.observed pattern of missing data} given (the observed data and the (unobserved).values of the missing data}, is the same for all possible values of the missing data (Rubin 1976). That is, the observations that one tried to record are missing not because of their value, but because of some random process which causes, missingness in some simple probabilistic process. Censored data 13 is an example of missing data that is not MAR. Another common assumption made is that one can ignore the process that causes missing data. This means that the missingness pattern is fixed and not a random variable, that the observed values arise from the marginal density of the complete (though not fully observed) data. Let = (x\^...x^p) be the complete observation. Let ^ indicate the vector of values of X. that are observed and let X^™^ indicate the I I values that are missing. If X^ has a p.d.f. f^(X^|6), then one is assuming that X ^ ^ has a distribution given by the density function (m) J fxCx1 1e)dxj Rubin (1976) points out that ignoring the process that causes missing data is not always valid— even if there are no data missing.' However, he shows that for Bayesian analysis and likelihood inference, the analysis ignoring the missingness process gives the same results as the complete analysis, taking the.process into account, if two criteria are met; namely, (I) the data are MAR, and (2) the parameter of the missingness process is distinct from 9, the parameter of the data. The parameter, <j>, of the missingness process is said to be dis­ tinct from 0 if their joint parameter space factors into a <j>-space and a 0-space, and if, when prior distributions are given for <j> and.9, these distributions are independent. As an example, suppose weather data is being recorded by an instrument that has constant probability, <j>, of failing to record a result, for all possible samples. In this 14 example, <f> is distinct from 0, the parameter of the data. example .is censoring. Another Suppose a variable U is only recorded if U _> <j>, where <j> might, be the population mean. Then <j) and 0 are distinct only if <f> is known apriori. Rubin concludes that the statistician should explicitly consider the process that causes missing data far more often than is common practice. Although modeling the missingness process should be attempted it is usually difficult, if not impossible, to get the information necessary to perform such modeling. . 2.3.2 Rubin’s Factorization Table and Identifiability of 0 Suppose a random sample has been taken and there are missing values. If one can assume that the values are MAR and that the parameters are distinct, then Rubin (1974) describes an important algorithm which provides a method by which one can often factor the original estimation problem into smaller estimation problems. Suppose N observations are taken, each of which is a complete or partial realization of p variables. 
The factorization is summarized in a table which identifies, the parameters associated with each factoir and the observations to be used to estimate these parameters. The M.L.E.or- Bayes estimates of the parameters associated with a factor, which are found using only the indicated observations; are identical to the estimates calculated using all available data, (Anderson (1957) 15 suggested the idea of factoring the likelihood function; however he only considered M.L.E. for MVN data.) It may happen that because of the pattern of missingness, certain parameters are not identifiable. parameter values 8^ and That is, there may exist two distinct say, such that ^ ( Y ^ , ... ,YjjI0 ) = fY (Yi.... Y n Iq 2) except on sets of probability zero with respect to both p.d.f.’s. Especially with a large data set where each observation con­ sists of many variables, it may not be immediately obvious which parameters are identifiable. Incorrect and misleading results may be obtained by estimating parameters which are not identifiable. Rubin's factorization table helps determine which parameters, if any, are not identifiable. In general, if there are fewer cases than parameters associated with one of the factors, then one has an identifiability problem. Inspecting the data to see that all pairs of variables have cases in common is not enough to guarantee estiinability. 100 cases with three variables x;Q > x;L2,xi3’ 51-100 only x ^ Suppose that on the cases. is measured, on cases 4-50 x ^ on cases 1-3 x ^ , X ^ > arid x ^ are measured. For example, consider and x ^ are,measured, and Using standard notation for conditional normal distributions, the joint p.d.f. of all 100 observed vectors can be factored into 16 11 * ....... 100, 1 ' 12' f (x-1 >•• • >x ’X50,2’X13’X23’X33l9®1) = 100,11yI'0!^ f C x 1 2 , ' " , x S O , 2 Ix 1 1 . . . X 1 0 0 , 1 ;a2-l’P2-l’cr2*l^ 2 f ^x13’X23’X 33'Xl l ’*" * X100,1'X1 2 ' " '’X50,2;cl3-12’331-2 ’^32,1 ’CT3*12^. * ■ But the parameters in the last p.d.f. on the right-hand side of the equation are not identifiable. Rubin's factorization is an important tool in estimation when observations are missing. It provides a method for possibly simplifying the estimation of 0, summarizes the information available, and charac­ terizes the problems associated with a particular pattern of missingness While creating the factorization table by hand is conceptually easy, it is practically impossible for any but small data sets. Chapter 3 describes a computer program, REACTOR, which creates Rubin's fac­ torization table for a given set of data. 2.4 Maximum Likelihood Estimation with Incomplete Data Various algorithms for calculating the M.L.E. of the parameter of a multivariate distribution, when the sample contains incomplete obser­ vations, have been discussed in the literature. Let X in x be the complete data vector with sampling density f (x|0), 0 e fi. are incomplete. Let L(X;0) = In f (x|0). The observed data, Y = Y(X), The unobserved X is known only to lie in x O O > the 17 subset of x determined by Y = Y(X). densities for the observed data. data, g(Y I0) ^ Let g(Y|6) be the family of Then in the special case of missing ^ (X|0)dX, where it is assumed that the process producing missingness can be ignored. Let L(Y;0) = In g(Y;0). The goal then is to find Q* e Q which maximizes L(Y;0). Dempster, Laird, and Rubin (D, L, and R) (1976) have developed a general iterative algorithm for computing the M.L.E. when the obser­ vations can be viewed as incomplete data. 
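Before describing that algorithm, it may help to see the objective function in the simplest concrete case. The sketch below is a minimal illustration, not the thesis's program; it assumes multivariate normal data with the mean and covariance supplied by the caller and with missing entries coded as NaN. In the missing-data case, g(Y|θ) is a product over observations of the marginal normal density of each observation's observed components, so the integration over the unobserved components never has to be carried out explicitly. The iterative algorithms discussed next are ways of maximizing this function over θ.

```python
import numpy as np

def observed_data_loglik(Y, theta, Sigma):
    """L(Y; theta) = ln g(Y | theta) for a multivariate normal sample with
    missing entries coded as np.nan.  Each row contributes the log of the
    marginal normal density of its observed block:
        ln N(y_obs ; theta_obs, Sigma_obs,obs)."""
    Y = np.asarray(Y, dtype=float)
    theta = np.asarray(theta, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    total = 0.0
    for row in Y:
        o = ~np.isnan(row)              # observed components of this row
        if not o.any():
            continue                    # a completely missing row adds nothing
        d = row[o] - theta[o]
        S = Sigma[np.ix_(o, o)]
        k = int(o.sum())
        total += -0.5 * (k * np.log(2.0 * np.pi)
                         + np.log(np.linalg.det(S))
                         + d @ np.linalg.solve(S, d))
    return total

# toy bivariate example: the second variable is missing in two of the rows
Y = np.array([[0.3, 1.1], [1.2, np.nan], [-0.4, 0.2], [0.9, np.nan]])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
print(observed_data_loglik(Y, theta=np.array([0.5, 0.5]), Sigma=Sigma))
```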
The algorithm is called the EM algorithm because each iteration, from 0 steps: to 0 (r+1) , involves two the expectation step and the maximization step. to be E(L(X;0) |Y,0^). is: (r) E-step: of 0 e Define Q (0|0^) Then the general EM iteration, 0 ^ compute Q(010 ^ ) ; M-step: which maximizes Q(0|0^r^). choose -> g (r+l) $ to be the value The heuristic idea is that one would, ideally, like to find the value 0 which maximizes L(X;0). Since L(X;0) is not known, because X is riot known, maximize E(L(X;0) |Y,0'’ ') at each step. Let D10Q (0 I0A ) = 3Q(0|0a )/30, and p2OQ(0|.0A ) = 32Q(0 |0A ) / (30)2 . Then the usual procedure, is to choose 0 D ^ Q (6 |0^) This way, 8(^+1) 0<r>. ■ (r+1) to satisfy ■■ = 0 and D 2^Q(0^r:*"1^ |0^r^.) nonpositive definite. at least a local maximum of Q(0|0^r^), for fixed • Some pertinent questions to ask are: is there a unique global maximum of L(Y;0); if so, does the algorithm converge; if.it converges, 18 is it to a value 0* which maximizes L(Y;0); and how fast does it converge? case. D, L, and R answer some of these questions for the general Under assumptions of continuity and differentiability, they show that if the EM algorithm converges to 0* and if D ^ Q (0* 10*) is negative definite, then the limiting 0* is a local or global maximum of L(Y;0). If, in addition, each g(r+l) the unique max of Q(0|0^r^ ) , the EM procedure monotonically approaches this maximum; i.e., . L ( Y ; 0 ^ + ^ ) _> L(Y;0^r^). They also give a formula which describes the rate of convergence. They note that if the (Fisher) information loss due to incompleteness is small, the algorithm converges rapidly. The amount of information lost may vary across the elements of 0, so certain components of 0 may approach 0* more rapidly, using the EM algorithm, than others. The EM algorithm is very general; it could be applied to many distributions f (x|0); ■However it may be very complicated dr even impossible to apply for certain problems, D, L, and R work out several specific examples, both in terms of specifying f (X|0) ,and specifying the function Y = Y ( X ) . Orchard and Woodbury , (0 and W). (1970) also derive a special case . of the EM algorithm, as a consequence of their missing, information principle, for obtaining the M.L.E, of a parameter 0 when the p.d.f. of X^ is f (Xjs) but X^ is not completely observed. The variables in X^ can be partitioned X^ - .(Y^,Z^) where Y^ is, the vector of. observed 19 components and contains the variables that were unobserved. Beale and Little (1975) give a detailed derivation of 0 and W's missingness information principle, with a slightly different emphasis. In their discussion, it is easy to see that this is another example of an EM algorithm. The idea is to treat the missing components, Z^, as a random variable with some known distribution, f (z |y ;6^). find 0 which maximizes L(Y;0). The goal is to It may be easier, however, to find the value of 0 which maximizes the expected value of L(Y,Z;0) if Z is treated as a random variable with known distribution. So, maximize E z |.y.0(L(Y,Z;0A)) for any fixed 0^; note, that this is D, L, and R's Q ( 0 10.) . A Suppose 0 = 0 ID maximizes E , Z Iy £U (L(Y,Z;0 )); then define the A transformation 0 by 0 = 0(0.). The transformation $ will define the m A (r) (r+1) equations for each iteration 0 v -> 0 ^ , and repeated iteration is done until there is no appreciable change. 0 is a fixed point of 0; i.e., 0 = 0(0). They show that the M.L.E. 
Also, if the likelihood function is differentiable, any solution of the fixed point equations, 0 = 0(0), automatically satisfies the likelihood equations and is a relative max or a stationary point; therefore the procedure won't converge to a relative minimum. 20 2.4.1 Maximum Likelihood Estimation when f (x|6) is of the Regular Exponential Family Sundberg (1974, 1976) worked but the special case where f(x|6) the form of the regular exponential family; f (x|0) = b(X)e® has /a(6) , where 6 is a p x I vector of parameters and t(X) is a p x I vector of complete-data sufficient statistics. D, L, and R also looked at this case in detail and point out that Sundberg's algorithm is the EM algorithm. Sundberg shows that the likelihood equations can be written as E^Ct(X)|Y) = E^Ct(X)). (He attributes this result to unpublished notes of Martin-Lof (1966).) solving these equations. as 0^r+^ He defines an iterative algorithm for An iteration from 0 (r) to 0 (r+i) is defined = m t_"*"(E(t|Y,0^r^)) , where m^(0) = E0( t ( X ) ) a n d assuming m ^ exists. The EM algorithm when f (X|0) is of the exponential family is: E.-step: compute t ^ = E(t(X) |Y,0^r^); M-step; the solution of the equations Eg(t(X)) = t^r^ . determine as Therefore, Sundberg's algorithm is the same as the EM algorithm. Again, the same questions are pertinent. does L(Y;0) have a unique global maximum? In this special case, Perhaps not; since Y does not necessarily have an exponential distribution, there may be several roots of the likelihood equations and it may be that none of them give a global maximum. Simdberg (1974) gives an example and Rubin, in his discussion following Hartley and Hocking (1971), gives two examples where there, is not a unique M.L.E. 21 Does the. algorithm converge to a relative maximum of L(Y;0)? From D, L, and R, we have that if the algorithm converges to 0* and if D 20 Q(0*10*) is negative definite, then 0* is a local or global max. 0 Let 0 be the true parameter vector. Var q Sundberg shows that if (E 0 (t(X)|Y)) is positive definite, then for large N the algorithm converges, say to 0*. For the case of exponential families, D ^ Q ( 0 * I0*) = -Var (t I0*) , which is. negative definite. Therefore, for large N, if Var -(E .(t|y )) is positive definite, the algorithm con0 0U verges to a value 0* where 0* gives a relative maximum of L(Y;0). Also, since Q(010 ) is convex in 0, if all 0 ^ of fi, then each Q Cr*1"!) are in the interior the unique maximum of Q(010 ) and the convergence is monotone. Let J y (9) be the Fisher information matrix when observing Y(X); J (0) = E(3L(Y;0)/90)^ = Var(8L(Y;0)/30). y J (0) = Varg(Eg(t|Y)). Sundberg notes that Therefore, the necessary condition for con- y o vergence can be restated as the condition that J (0 ) is positive y definite. ' Let J (0) be the Fisher information matrix when observing x X; Jx (6) = Var0 Ct(X)). Then Jy(B) = J 35 (S) - E q (Varg (t|Y)) and the matrix Eg (Varg (t|Y)) could be considered the loss of information due to observing Y(X) instead of X. The matrix J "^(B)Eg (Varg (t|Y)) could be considered a measure of relative loss of Fisher information due to observing Y rather than X. ‘ 22 Sundberg defines the factor of convergence, of the algorithm, in the following way: asymptotically as r ^ 00, the error |6^r^ - 6*| decreases by this factor at each iteration. He shows that if J (6^) . y is positive definite, the factor of convergence for the algorithm 0 ^ -> 0* is the maximal eigenvalue of J ^ ( O ) E g (VarQ (t|Y)) , which will be < I. 
So again, this implies that the smaller the relative loss of information, due to observing Y(X) instead of X, the faster the algorithm will converge. Louis, Heghinian, and Albert (1976) consider a slightly less general problem. They are interested in the problem of finding a M.L.E. of the parameters where the data are a sample from a regular exponential family but the data are partially specified in that one knows for each X. that X. e R., a subset of the reals. i l i if the data are observed exactly, missing, then R. = the reals. For. example, = {X^}; if an observation is Their iterative method replaces missing or. partially specified data with estimates, which are found using.a current set of (estimated) parameter values. This pseudo-data, com­ plete now, is used to form a new closed form estimate of the parameters This procedure continues.until the.sequence of parameter values con­ verges; they show that, under regularity conditions, it does converge to a maximal point. The algorithm for calculating the estimate of the parameter in each iteration is the EM algorithm for the exponential family with partially specified data. 23 Blight (1970) found the M.L.E. of parameters of a regular exponential family when the observed data are censored in a specific way. Within a certain region one can observe the values exactly; for data falling outside this region, only grouped frequencies are known. 2.4.2 Notation and Definitions for Patterns of Missing Observations Let be a I x p random vector. A random sample X^,Xg,...,X^ is taken but there are missing data in that for each observation i, only a subset, Y^, of the variables (x^^,x^g,...,x^^) is observed. the N x p Let M be incompleteness matrix; the entry (ij) in M is I if x ^ observed and 0 if x . . is not observed. 1J is Replace all rows of M having the same 0-1 pattern with one row having that pattern, and call the resulting incompleteness matrix M*. Let k be the total number of distinct patterns of observed variables among the N observations. Then M* is a k x p matrix. Let M* = (m. .) . Missingness pattern i ■ th refers to the missingness pattern denoted by the i row of M*. A data set is said to have a monotone pattern of missingness if the matrix M* can be rearranged so that if all i = 1,2,...,Z. = I for In other words, if the random vectors can be ordered such that the variables in observed in Y^ = I, then m ^ are a subset of the variables for all i, the random sample is said to have a monotone pattern of missingness, 24 For example, suppose there are 3 variables which were to be recorded and the following observations were seen, where indicates a missing value. Y1 = (3,6,7) Y 2 = (*,*,8) Y 3 = (8,1,3) . Y4 = (2,0,-1) Y 5 = (*,4,2) Then /111 M=I / 001 111 \ HI AOll The rows of M* can be reordered so that M* = /111 Oil Vooi Therefore, this data has. a monotone pattern of missingness. Without loss of generality, the observations can be reordered such that the first n(l) observations are those with missingness 25 pattern I; the next n(2) observations are those with missingness pattern 2; etc. The last n(k) observations are those with the missingness pattern of the U t*1 row of M*. 2.4.3 M.L.E. of Parameters of a MVN Distribution Orchard and Woodbury (0 and W) (1970) work out the specific algorithm for estimating the mean vector and the variance-covariance matrix when the data are from a MVN distribution and there are missing components in the random sample. and Hasselblad (1970). 
This is also discussed by Woodbury, The MVN distribution belongs to the exponential class of distributions, and 0 and W's algorithm is a special case of Sundberg's algorithm and the EM algorithm. However, 0 and W have worked out the details for using the algorithm in this special case, so that the algorithm is computer programmable. The details for this algorithm and a computer program are given in chapter 4. Another method of finding the M.L.E. is the method of scoring. Let the Lt*1 score be S^(6) = dL(X;0)/d0^, where L(X;0) is the loglikelihood of X |0, and let J^(0) be the Fisher information matrix (E0 (SiSj)). The likelihood equations are (S^...Sp) 1V= 0, which may be difficult to solve. The method.of scoring is an iterative procedure based on estimating the score by the first.two terms of its Taylor’s expansion. The second term is actually estimated by its expected I value. iterate, The (I + I)-Iterate, 0 ^ + ^ , e(1), by 9(1+1) is then obtained from the previous 0(i) + Jx1 (6(:L))(S1 (6(i)). ..Sp (e(1))) The equations for the iterative method of scoring when the obser­ vations are from a MVN distribution and some components are missing are worked out by Hartley and Hocking (1971) and are restated by Little (1976). Hartley and Hocking’s development of this procedure stems from treating the missing observations as unknown parameters, rather than as random variables as 0 and W did. Little (1976) compares the method of scoring and 0 and W ’s iterative procedure for solving for the M.L.E. of the means, ju, arid, the regression coefficients, when there are missing data in a multi­ variate sample from a MVN distribution, where one variable is specified as dependent. Point estimation, confidence intervals, and hypothesis tests are discussed. The advantages of 0 and W ’s method over the method of scoring are: i) large matrix inversions are avoided; ii) it is easy to program; and ill) it provides fitted values for the missing variables., The major advantages of the method of scoring over 0 and.W’s algorithm are: i) as a. by-product, it produces an estimate of the standard errors of the estimate; and ii) it converges faster, at a quadratic rate compared to a linear rate for 0 and W (Little 1976), The most important advantage of the method of scoring is that it provides estimates of the standard errors. However, under certain circumstances, any algorithm can easily estimate the standard error 27 of the estimate. Hocking and Smith (1968) give equations for the estimated covariance matrix of the estimates when the data have a monotone missingness pattern with either 2 or 3 patterns of missingness; i.e., M* is either a 2 x p o r a 3 x p missingness, when estimating just are easily obtained (Little 1976). matrix. For any pattern of estimates of the standard error Also, Beale and Little (1975) suggest a procedure for estimating the standard error of which is easily obtained and performed well in a simulation study. Beale and Little, in the same article, compare six methods for finding the M.L.E. from a MVN sample which has missing components. The methods considered include ordinary least squares on complete observations only, iterated Buck (to be described later), and three weighted least squares procedures. The simulation was done with variable #1 always identified as the dependent variable, x . , and from 2-4 independent variables, X25X^,x^. Let x ^ , j = I , ... ,4, be the value of the J t*1 variable in the It*1 observation. 
The criteria used to judge the estimator^ was J 1fxU - »0 where the b .'s are the estimated regression coefficients using a particular method, and the x ^ ' s are the true values of all variables before deletion. 28 The methods that they found best were the iterated Buck's and a method of weighted least squares. The iterated Buck's procedure is a modification of Buck's method for finding M.L.E.'s of parameters from, a multivariate data set, not necessarily normal, which has missing data (Buck 1960). The iterated Buck procedure, also called the cor­ rected M.L.E. procedure, is equivalent to 0 and W s procedure except that the variance-covariance M.L.E.'s are. multiplied by N/(N - I) where The iterated Buck procedure and 0 and W s N is the sample size. algorithm gave almost identical results. The method of weighted least squares is as follows. Get an estimate of the covariance matrix of all variables, $, and the mean vector, - jj, 1 using the corrected M.L. procedure. ' I ■ ’ ^ Using | and for each observation i, missing values are estimated by the estimated conditional mean of the unobserved independent variables given the observed independent variables. Let a denote the.residual variance 2 of x^ when all independent variables are fitted and let a^ be the conditional variance of x .. given the observed variables in observation i. 2 Let s and s, be the corresponding estimates. 2 2 W i = s Zsi 0 . 11 2 Define if the dependent variable Xii is observed .otherwise . Note that if all independent variables and x._ are present iii obsef; \ vation i, then W 4 = I. 11 Then a weighted least squares analysis with 29 weights is carried out on all the data with present. There are many papers in the literature dealing specifically with linear models and least squares estimation when there are missing data. Afifi and Elashoff (1967, 1969, 1969a) and Hamilton.(1975) . provide surveys and bibliographies for this area. 2.5 Estimating Linear Combinations of the Means When There are ' Missing Data in a Bivariate Normal Sample Let X. be a I x 2 random vector which is distributed bivariate I normal with mean vector 011 °1 2 I. 0 = (B^Gg)' and variance-covariance matrix A random sample is taken and there are missing data. 012 022 denoted by the incompleteness matrix M*. Assuming that there is at least one complete, observation, there are two possible forms that the incompleteness matrix M* can.take, M* 2.5.1 Estimation of 6 .=.0^ - 0^ When M* = ^ ^ . This is a monotone pattern of missingness where there are h(l) vectors of complete observations, X'.= (x. 1 ,x 0) , and n( 2 ) observations ■. • . 1 11 12 . of X^ alone, X^ = (x^^), i = l,2,...,n(2). N = n(l) + n(2). Let 30 n(l) (I) Z X 'j — n (l) /n(l) j = = ? x i=l /n(2 ) — ■ (IN _ (I ) (xij - x -j )(xik - x -k ) , Consider the class ; 1 ,2 13 (2) X *1 ^jk " x i=l 3 ,k = 1 ,2 . of linear combinations which can be written as - -'(2) Z(t) = A(t)x<1^1^ + (I - A(t))x.1 " _ - x I x *2 CD where A(t) = (n(l) + n(2)t)/N and t is a function of the complete — (i) pairs and is uncorrelated with x ^ When | is known, the M.L.E. is Z C a ^ ) and so belongs to C^. M.L.E. is the minimum variance unbiased estimate of When .• • is not known, a simple estimate of 4 6 6 The . • — — (i) is x ^ - x ^ , where — , . ’— m x ^ is the mean over all N observations of variable I, and x,^ . is the meant over all observations of variable 2. . This estimate is also in C 1 since it.equals Z(O). M.L.E. is .ZCa^/ajj) The M.L.E. of .6 also belongs to this class; the (Anderson 1957) :and .(Lin 1971). 
Mehta and Gurland (M arid G) (1969) suggest Z( . 2 a { a + a^)) as an estimate of is appropriate in the. neighborhood a^ + ^ 2 2 ^ w^en CTll^ a coefficient. 22 = a22 ~ 6 which The M.L.E. of p is where p is the population correlation Lin (1971) did a simulation study comparing the simple 31 estimate Z(O), the M.L.E. ZCa 1 0 Za11), M and G's estimate IZ 1 1 Z C Z a ^ / ( a ^ + a^g)), and the estimate Z{a.^/a^. moderate values of n(l), 14 He looked at n(l) j< 101. ■ The estimates were evaluated by their efficiency E(Z(t^) - 2 6 ) /E(ZCtg) - 6 ) 2 which equals Var(Z(t^))/Var(ZCtg)) since all estimates in this class are unbiased. Lin found that when p is in the neighborhood of 0, the simple estimate Z(O) is most efficient. When p f 0, but 0 ^ Z(2a^g/(a^ + Bgg)) is most efficient. of 6 If a 2 2 = 1» M and G's p = 0, Z(O) is the M.L.E. = Ogg, Z(2a^g/(a^^ + agg)) is the M.L.E, of S. and if IP I = .1 and CT2 j / a 22 — Z ^a12^a22^ is inost efficient. When In all other circumstances, Lin's statistic, the M.L.E. Z(a^g/a^) is most efficient. These are not surprising results; the M.L.E. is efficient. Mehta and Swamy (1974) use a Bayesian approach; the emphasis is on evaluating the effect of using the extra observations in estimation. They place a non-informative prior on 0, assume $ is known, and show that the Bayes estimate is the M.L.E. of 6. They compare f ( 6 Icomplete data only) to f (6 |all observed data). They find that the means of the two distributions are not markedly different, but the extra observations significantly reduce the variance, 11 2.5.2 Estimation of 6 When M* 10 01 This is an example of a pattern of missingness which is not monotone. There are n(l) complete observations, = (x^^x^) > i = I,...,n(l), an additional n( 2 ) observations on x^ alone, , i = l,...,n(2), and an additional n(3) observations on Xg alone, X^ = (x^g), I = I,-- ,ri(3). - (I) x . = n(1) Z x /n(l) , j = 1=1 J N = n(l) + n(2) + n(3). 1 ,2 Let ; J n( 2 ) E x, /n(2 ) ; i=l ~ (3) n(3) x. 2 . =. E x f i=l ajk ' ^ 1- /n(3) ; - ri) - (I) ^3 K x ik - x-k ). - • Consider the class C 0 of estimators which can be expressed as ' 2 ■ . ■ ' . ■ ■ ■ : ■ ■ . ■ . WCr1U1V). ” A(T1U)X,^^^+(l-A(r1u))x^(2L8(r1v)x,2^L(l-B(r1v))x where 2 . A(r,u) = n(l)(n(l)+n(3)+n(2)u)/((n(l)4n(2)) (n(l)+n(3))-n(2)n(3) r^) 33 and B(r,v) = n(l) (n(l)-hi(2)+n(3)v)/((n(l)+n(2))(n(l)+n(3))-n(2)n(3)r2) . Note that the class of section 2.5.1 is the subclass of when n(3) = 0 . When I is known, the M.L.E. of 6 is w ^P>ai2^aii,ai2^a22^ w^iere P is the population correlation coefficient. When $ is not known, the M.L.E. does not have a closed form, but it can be calculated for a given set of data. A simple estimate is the mean of all observations on x. minus the mean of all observations on Xg,.which is equal to W(0,0,0). Lih and Stivers (1974) suggest using a modified M.L.E,, using the M.L.E. of $ obtained using only the complete data. 6 That is, = W(a^g//a^ag^,a^g/a^^,a^g/agg). It.is unbiased, asymptotically normally distributed, and asymptotically efficient, as n(l) -* <», n(j)/n(l) -* Cj, where 0 < C j < « > , j = 2,3. Mehta and Swamy (1974) also looked at this problem using a Bayesian approach. However, finding f (6 [observed data) involves a numerical technique which is time consuming and expensive. They do one specific example where they compare f(&|complete data) to f(6 |all observed data) and find that both the mean and variance are affected. 
They also compare f (6 |observed data) to f(6 || = |, observed data), where $ is replaced by its M 1 L 1 E i, |, in 34 two specific examples. They found that f (6 || = | , observed data) does not provide a good approximation to f (6 |observed data) for small samples. However if the degrees of freedom of the distribution of f (6 (I = I , data) are reduced by the number of parameters estimated. by the data, it provides a better approximation. Hamden, Pirie, and Khuri (1976) look at the special case where 0 I = 0 . They consider a more specific class of estimators than the L class . They give an unbiased estimate of the common mean which minimizes the variance of the estimate. 2.5.3 Hypothesis Tests About Linear Combinations of the Means Recent literature has been more concerned with testing the hypothesis H^ : c '0 = 0, than in estimating c'0. If | is known, the problem is not complicated; c ’0/Var(c'0) can be calculated where 0 is the M.L.E. of 0, and Var(c'0) will be a function of |. has a known t-distribution. This statistic If | is unknown, but the sample size is A A A A A large, the test statistic c'0/Var(c'0) can be used, where Var(c*0) is found by substituting the M.L.E. for the elements of | into the func­ tion Var(c'0). be used. The asymptotic distribution of this statistic can then For small sample problems, one can often find an exact test, i.e., a test whose exact distribution is known, by discarding some data. The emphasis in the current literature has been on finding tests, using all available data, for which one can find exactly, or 35 approximately, the small sample distribution. The situation is compli­ cated in that the most powerful test, for a given size a, appears to vary depending on what can be assumed about p or about CT2j / a 2 2 * Little (1976a) suggests a class of statistics based on linear combinations of sample means to consider for testing : c' 6 = 0 . His goal is to find statistics which have known or well-approximated distributions and lose little in efficiency, when compared.to the A •statistic using the M.L.E., c' 6 . (Efficiency is measured in terms of f\, A Var(cT0)/Var(c?0).) In a simulation study, he compares several statistics and their approximate distributions for testing Hg : Gg = and H n : U 6 0 = 0. Morrison and Bhoj (1973) consider the power of the likelihood ratio test (l.r.t.) of H_ : c 1 6 = 0 vs H : c'8 u a f 0 when there are MVN(0,$) data which has missing observations such that the incom/ pleteness matrix can be written M* = I 1 1 11 . ...I \ IQ " o )" 1 1 They compare the l.r.t. using all available data to the test using only complete pairs. When , • 2 is known, the l.r.t. is distributed as a noncentral % and 4 always has higher power than the test using only the complete obser-r vations. When $ is unknown, they consider two specific examples and find that again, the l.r.t. using all observed data is better, has higher power, than the test based only on the complete data. attribute the generalized l.r.t. for hypotheses on 0 They , when data are 36 from a MVN(0,|) distribution, $ unknown but nonsingular, to R. P. Bhargava. 2.5.4 Hypothesis Tests about 6 When M * = ^ Lin (1973) discusses tests of Hr. : U ) < <5_ vs H 6 — 0 a : 6 > Sr.. 0 He considers four special cases defined by the extent of knowledge one has about .|, in terms of p and d = cr^^/a 2 2 ‘ ^ach case he gives an exact test; if this exact test involves discarding data, he also suggests tests based on all available data and gives their approximate distribution. 
Mehta and Guriand (M and G) (1969a, 1973) propose a statistic T for testing H^ : <S = 0. ZfZa^g/Ca^i + a2 2 ^ It is based on their estimate anc* *s aPProPr:i-at:e if d = I. They give constants k for applying the test T > k for a size a = .05 test. When I is known, Morrison (1973) finds the l.r.t. statistic for testing H q : 6 = 0. It has a t-distfibution under H q . When $ is not known, he suggests replacing $ with | found from complete pairs. He gets an approximate t-distribution for the associated test statistic. Naik (1975) proposes a test statistic for.testing H q : H a : 6 6 = 0 vs ^ 0 which is based on the simple statistic Z(O) and is designed such that the size of the test does not exceed the pre-assigned level a when p _< 0 and in fact cannot exceed a as long as a^/d^ > 2pn(l)/N. In a simulation study, he found that his test is more powerful than 37 the paired t-test when p O and for small positive values of p. also gives a test statistic for : 6 = 0 vs : 6 < 0 He , and gives some comparison to Lin's test statistic and M and G's test. In a simulation study, Lin and Stivers (1975) look at the powers . and levels of significance for the testing procedures of Liri, M and G, Morrison, and the paired t-test on complete data only, for testing Hq . : 6 = 0. They find that when p >; .9, the paired t-test is always most powerful. They give criteria, in terms of sample size and p, for establishing the preferred test to use when p < .9, Draper and Guttman (1977) use a Bayesian technique, assuming squared error loss and a noninformative prior on testing of H q : 6 = 0 ; the M.L.E. of 6 . They find 6 0 and ^ , for hypothesis = E (6 |incomplete data) which is also They offer three statistics, using with their approximate t-distributiohs for inference. 6 as the numerator, In three exam­ ples, they compare the approximate distributions to the true distri­ butions and find, that two of the approximations are excellent even for small samples. smhll. In all examples the number of incomplete pairs was The best approximation was the t-distribution where they matched the mean and variance of f (6 |incomplete data) and removed two degrees of freedom to allow for this. They compare their test statistic to others in the literature— Naik, Lin, Morrison,.M and G. For example, M and G's test statistic is a special case of Draper and Guttman's. As 38 well as suggesting a usable test. Draper and Guttmah provide a brief summary of the work done in this area. 2.5.5 Hypothesis Test about 6 When M* = ( 10 oiy The notation is the same as that used in section 2.5.2. When n(l), n(2), and n(3) are large, an exact test can be found using the a M.L.E. of 6 , 5, and its asymptotic distribution. The problem then is to find tests which can be used when asymptotic results are not appropriate. Using only the n(l) complete pairs, the paired t-test for testing Hq : 6 = 0 can be used; it is an exact test. Lin and Stivers (1974) suggest four statistics for testing Hq 6 = 0 which use all available data. One is based on the estimate \Ll^22’a12^al l ’vL2^a22^ and the ot^ ers a^e based on W ( O iO 1 O) , the difference in sample means. for each statistic. They give approximate t-distributions Ekbohm (1976) does a simulation study comparing these statistics and two others. . Some of the tests are based on the heteroscedastic case and some utilize a homoscedasticity assumption. . Ekbphm finds that when the population correlation is medium or large, . 
or one does not know anything about its value, the tests based on W ^al 2 ^ a11^22’a12^al l ’a12^2?^ are to, be recommended, He gives specific recommendations which depend on the relation between n(l), n( 2 ) , arid n(3). : 39 Bhoj (1978) proposes two test statistics which are intuitively appealing for testing on complete pairs. degrees of freedom. : 6=0. Under When a^ Let t^ be the paired t-test based , t^ has a t-distribution with n(l) - I = Ogg' Bhoj suggests a test T which is . the weighted sum, Xt^ + (I - D t ^ , 'of two independent random variables . — (2) — (3) t^ and tg, where t^ is a function of x ^ ~ x »2 and the standard pooled estimate of the variance when there are unequal sample sizes. The distribution of t^ under degrees of freedom. is a t-distribution with N - n(l) - 2 The distribution of T can be adequately approxi­ mated by a t-distribution. I When f Ogg' BhOj proposes a test statistic T which is a weighted sum of t^ and t^, where t^ is Scheffe's statistic for testing equality of means with uncorrelated data and has a t-distribution with n( 2 ) - I degrees of freedom under H^. Bhdj compares his test statistics to the simple paired t-test which used complete data only; the expected squared lengths of 95% confidence intervals was used to make the comparison. statistics T and T His test can give considerable gain and he .gives recom­ mendations for the choice of X, which depends on the size of p and the value of h(2) and n(3) compared to n(l). .. 3. RUBIN’S FACTORIZATION TABLE Rubin (1974) describes a technique which is of use when estimating the parameters of a multivariate data set which contains blocks of missing observations. The likelihood of the observed data is factored into a product of likelihoods. The result is summarized in a fac­ torization table which identifies the parameters which can be estimated using standard complete-data techniques and the parameters which must be estimated using missing-data techniques. Let Z be an N x p data matrix representing the potential reali­ zation of p variables on N experimental units. The rows of Z are assumed to be independently and identically distributed, and the vector of parameters of the density of Z . *1 is Let M be the N x p incompleteness matrix of 0 's and I's; M was defined in 2.4.2. Jt 0 The column of M, or Z, represents the observations on the J t "*1 variable. Definitions: 1) The k ^ 1 column is said to be more observed than the J t *1 column if, th whenever an entry in j k th column is also I. column is I, the corresponding entry in the For example, / I \ is more observed than / I \ I \ / I I I I 0 0 I o i/ W 2) Two columns are said to be never ,jointly observed if, whenever an entry in one column is I, the corresponding entry in the other column . 41 is 0 . For example, / I ^ a r e never jointly observed.. / I 0 0 \ 0 \ I 3.1 Creating Rubin's Factorization Table Rubin (1974) describes the steps for creating the factorization as follows. Step I. Replace all rows of M having the same 0-1 pattern with one row having that pattern, noting which rows of Z are represented by that pattern. Similarly, replace all columns of M having the same 0 - 1 pattern with one col­ umn having that pattern, noting which columns of Z are repre­ sented Jjy that pattern. Call the resulting 0-1 incompleteness matrix M.. See Table I for the example of M that we will use to illustrate this method. I. Example of Incompleteness Matrix M for a 550 x 8 Data Matrix Columns represented — --- : ------------ -------- :— 3 4,5 6,7 1 ,2 I I. 
Step 2. Reorder and partition the columns of M̃ into (M̃1 M̃2) such that each column in M̃2 is either (a) more observed than every column in M̃1, or (b) never jointly observed with any column in M̃1. If this cannot be done, the pattern of incompleteness in M̃ (and M) is "irreducible," and no further progress can be made. Assuming this step can be performed, proceed to Step 3. In Table 2 see the result of Step 2 for the matrix M̃ of Table 1. Here, every column in M̃2 is more observed than all columns in M̃1.

[Table 2. The First Partitioning of the M̃ of Table 1. The columns are reordered and split into the two partitions M̃1 and M̃2; the 0-1 entries are not legible in this copy.]

Step 3. Apply the procedure in Step 2 to both partitions created in Step 2. If both M̃1 and M̃2 are irreducible, stop; otherwise proceed, trying to repartition each partition created. Continue until no partition of M̃ can be further partitioned. See Table 3 for the final partitioning for our example. This partitioning was achieved by examining the M̃1 partition of Table 2 and noting that Columns (6,7) are more observed than Columns (4,5), and Columns (1,2) are never jointly observed with Columns (4,5). The M̃2 partition of Table 2 and all partitions in Table 3 are irreducible.

[Table 3. The Final Partitioning of the M̃ of Table 1. The column groups appear in the order (4,5), (1,2), (6,7), 3, 8 and are divided into three final partitions; the 0-1 entries are not legible in this copy.]

Step 4. Summarize the final partitions in a "factorization table" as illustrated in Table 4 for our example. Labeling the final partitions M̃1, M̃2, ... from left to right, list for each M̃i:

1. the "conditioned" variables: the variables (columns of Z) represented by the ith partition, say Zi.

2. the "marginal" variables: the variables, represented in partitions to the right of the ith partition, that are more observed than variables in the ith partition, say Zi*.

3. the "missing" variables: the variables, represented in the partitions to the right of the ith partition, that are never jointly observed with the variables in the ith partition, say Zi**.

4. whether it is a "complete-data" partition (one column in M̃i) or an "incomplete-data" partition (more than one column in M̃i).

5. the rows of Zi which are at least partially observed (i.e., the rows of Z represented by rows of M̃i that are not all zero).

[Table 4. The Factorization Table for Table 3. Partition 1 is complete, with conditioned variables (4,5); partitions 2 and 3 are incomplete, with conditioned variables (1,2,6,7) and (3,8) respectively; the remaining entries (marginal variables, missing variables, and rows of Z) are not fully legible in this copy.]

Each partition in M̃, and thus each row of the factorization table, corresponds to a factor of the likelihood of the observed data. The factor is the conditional joint distribution of the conditioned variables for that partition, Zi, given the marginal variables for that partition, Zi*. Hence, the final factorization of the likelihood of the observed data is

Π_i f(Zi | Zi*)     (3.1)

where Z1* is empty and f(Z1 | Z1*) may be written as f(Z1). In equation (3.1), Zi denotes the collection of scalar random variables in the ith partition.
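Steps 2 and 3 amount to a search over subsets of the columns of M̃. A minimal brute-force sketch of the Step 2 split is given below (Python/NumPy, purely for illustration; REACTOR itself is a FORTRAN program and the names here are invented for this sketch). The search is cheap because M̃ rarely has more than a handful of distinct columns.

import numpy as np
from itertools import combinations

def _more_observed(a, b):
    return np.all(a[b == 1] == 1)

def _never_jointly(a, b):
    return np.all(a * b == 0)

def step2_split(M):
    """Try to split the columns of the reduced 0-1 matrix M into (M1, M2) so
    that each column of M2 is either more observed than every column of M1 or
    never jointly observed with any column of M1.  Returns the two column
    index lists (left, right), or None if M is irreducible."""
    p = M.shape[1]
    for size in range(1, p):
        for right in combinations(range(p), size):
            left = [j for j in range(p) if j not in right]
            if all(all(_more_observed(M[:, r], M[:, l]) for l in left) or
                   all(_never_jointly(M[:, r], M[:, l]) for l in left)
                   for r in right):
                return left, list(right)
    return None     # irreducible

Applying the same split to each piece until nothing splits further is exactly Step 3.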
For the general normal model, the M.L.E. or Bayes estimates of the parameters of the ith factor, using only the indicated rows and columns in the factorization table, are identical to the estimates using all rows and columns of Z. If a partition is "complete," it involves a completely observed data matrix and the usual computational methods can be used.

The table can also indicate parameters which are not identifiable. The parameters of conditional association between the conditioned variables and the missing variables, given the marginal variables in each partition, are not identifiable. In the example, these are the parameters of association between (4 and 5) and (1 and 2) given variables (3, 6, 7, and 8). Whenever two variables are never jointly observed, the parameters of conditional association between them are not identifiable. The table also shows the number of observations available for estimating the parameters; if there are too few observations, the parameters may be not identifiable.

3.2 The Computer Program

The FORTRAN language computer program REACTOR calculates and outputs the reduced incompleteness matrix M̃ and Rubin's factorization table. In the flow chart, a partition C1 is said to be "to the right of" partition C2 if each column in C1 is either (a) more observed than every column in C2, or (b) never jointly observed with any column in C2.

[Flow chart for REACTOR. The program reads the N x p missingness matrix M; reduces it to an n x k matrix M̃ by collapsing duplicate rows and columns, keeping track of which observations and variables each row and column of M̃ represents; outputs M̃; repeatedly searches the column combinations of each current partition for a subset that can be placed "to the right of" the remaining columns, reporting M̃ as irreducible if no such subset exists; and finally, for each partition from left to right, classifies the other columns as conditioned, marginal, or missing variables, labels the partition complete or incomplete, and outputs the corresponding line of the factorization table.]
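The "to the right of" relation used throughout the flow chart is just the partition-level version of the two column relations of chapter 3. A minimal sketch (Python/NumPy, purely for illustration; REACTOR itself is FORTRAN and the names here are invented):

import numpy as np

def to_the_right_of(M, C1, C2):
    """True if partition C1 is 'to the right of' partition C2: each column in
    C1 is either more observed than every column in C2 or never jointly
    observed with any column in C2.  C1 and C2 are lists of column indices
    of the reduced incompleteness matrix M."""
    def more_observed(a, b):
        return np.all(a[b == 1] == 1)
    def never_jointly(a, b):
        return np.all(a * b == 0)
    return all(
        all(more_observed(M[:, c1], M[:, c2]) for c2 in C2) or
        all(never_jointly(M[:, c1], M[:, c2]) for c2 in C2)
        for c1 in C1
    )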
3.3 Sample Problem

This problem is a slight modification of the example in Rubin (1974). Suppose the original data consisted of the following eight variables and ten cases (an asterisk indicates a missing value). [The printed 10 x 8 data listing is not legible in this copy; its pattern of observed and missing values is given by the input deck below.]

The input deck, on file, is as follows (ten column fields indicated by vertical lines):

00111111
11100111
11100001
11100001
00000001
00111111
11100111
00100000
00100000
00111111

On the following page is a listing of the output generated by REACTOR.

[REACTOR output for the sample problem. The reduced incompleteness matrix has 5 rows and 5 columns (the original data matrix has 10 rows and 8 columns); the five new cases represent 3, 2, 2, 1, and 2 equivalent original rows. The factorization table lists three partitions, one complete and two incomplete, together with the conditioned, marginal, and missing variables, the number of parameters, and the number of rows for each partition. A note explains that the variable numbers in the table refer to the reduced matrix, that file 4 gives the correspondence between new and original variable and case numbers, and that the numbers in parentheses are the total number of original variables associated with each cell.]
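Step 1 of section 3.1 can be checked directly against this input deck. The sketch below (Python/NumPy, purely for illustration; REACTOR itself is a FORTRAN program and the names here are invented) collapses duplicate rows and columns of the 10 x 8 pattern and reproduces the 5 x 5 reduced matrix reported in the output.

import numpy as np

# The ten 0-1 patterns of the input deck above (one character per variable).
deck = ["00111111", "11100111", "11100001", "11100001", "00000001",
        "00111111", "11100111", "00100000", "00100000", "00111111"]
M = np.array([[int(c) for c in row] for row in deck])

def collapse(patterns):
    """Group identical 0-1 patterns, keeping first-appearance order.
    Returns (representative indices, 1-based members of each group)."""
    reps, members, seen = [], [], {}
    for idx, patt in enumerate(map(tuple, patterns)):
        if patt not in seen:
            seen[patt] = len(reps)
            reps.append(idx)
            members.append([])
        members[seen[patt]].append(idx + 1)
    return reps, members

# Step 1: collapse equal rows of M, then equal columns of M.
row_reps, row_members = collapse(M)       # rows = cases
col_reps, col_members = collapse(M.T)     # columns = variables
M_tilde = M[np.ix_(row_reps, col_reps)]

print(M.shape, "->", M_tilde.shape)   # (10, 8) -> (5, 5)
print(row_members)                    # [[1, 6, 10], [2, 7], [3, 4], [5], [8, 9]]
print(col_members)                    # [[1, 2], [3], [4, 5], [6, 7], [8]]

The row groups match the "number of equivalent rows" column of the output: 3, 2, 2, 1, and 2 original cases per reduced row.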
4. ORCHARD AND WOODBURY'S ALGORITHM FOR MULTIVARIATE NORMAL DATA

4.1 The Iterative Algorithm

Let Xi = (x_{i1}, ..., x_{ip})', i = 1, ..., N, be a random sample of N vectors from the p-variate normal distribution which has true mean vector θ' = (θ1, ..., θp) and true variance-covariance matrix Φ = (σij), i, j = 1, ..., p. Let X = (X1, ..., XN)' be the N x p input data matrix and suppose some elements of X are missing. The N rows of X are associated with the cases and the p columns of X are associated with the variables. If no elements of X were missing, then the sample mean vector and the sample covariance matrix would be the maximum likelihood estimates of θ and Φ, respectively. The goal of the computations is to find the M.L.E. of θ and Φ when there are some data values missing.

Hamilton (1975) presents some guidelines for the appropriateness of maximum likelihood estimation. The maximum likelihood estimates are best if the normality condition and one of the following conditions hold:

(a) the sample size is greater than 300,
(b) the sample size is greater than 75 but less than 300, and the intercorrelations are classified as "medium" or "high,"
(c) the sample size is less than 75, but the intercorrelations are classified as "high."

The Orchard and Woodbury (1970) iterative procedure to compute the maximum likelihood estimates of θ and Φ when X has missing elements has four basic steps.

1. Choose initial guesses, θ̂^(0) and V^(0), for the mean vector and covariance matrix. θ̂_j^(0) is the mean of the x_{ij}'s over all cases where variable j is not missing; θ̂^(0) = (θ̂_1^(0), ..., θ̂_p^(0))'. Next, θ̂_j^(0) is substituted for any missing value of variable j, j = 1, ..., p, thereby artificially completing X to form X^(0). Then

V^(0) = (1/N) Σ_{i=1}^{N} (X_i^(0) - θ̂^(0))(X_i^(0) - θ̂^(0))'.

2. Let θ̂^(m) and V^(m) be the estimates of θ and Φ at the mth iteration, m = 0, 1, 2, .... Using θ̂^(m) and V^(m), the missing elements in X are estimated. The cases are completed one at a time. Suppose that case i has exactly p2 variables missing, and let p1 = p - p2 be the number of variables not missing. Drop the superscript from θ̂ and V for ease in writing formulas. Rearrange the order of the variables in case i and in θ̂ and V so that Xi = (Xi1 | Xi2), θ̂ = (θ̂1 | θ̂2), and

V = [ V11  V12 ]
    [ V21  V22 ],

where the last p2 rows and columns correspond to the missing variables. Then, letting (V11⁻¹V12)_{ℓj} be the (ℓ,j)th element of V11⁻¹V12, the estimated values of the missing variables for case i are

x̂_{ij} = θ̂_j + Σ_ℓ (V11⁻¹V12)_{ℓj} (x_{iℓ} - θ̂_ℓ),   j = p1 + 1, ..., p1 + p2;

equivalently, in matrix form, X̂i2 = θ̂2 + (Xi1 - θ̂1)V11⁻¹V12. At this point in the calculations, the p x p matrix Vi*, which is zero except for the p2 x p2 block V22.1 = V22 - V21V11⁻¹V12 in the rows and columns of the missing variables, is calculated for use in Step 3. After these calculations, the elements of case i and the rows and columns of Vi* are rearranged to correspond to the original order of the variables. This procedure, which is essentially a regression substitution method, is repeated for all cases. The resulting data matrix is X^(m+1).

3. Using X^(m+1), revised estimates θ̂^(m+1) and V^(m+1) are calculated:

θ̂^(m+1) = (1/N) Σ_{i=1}^{N} X_i^(m+1)

and

V^(m+1) = (1/N) [ Σ_{i=1}^{N} Vi* + Σ_{i=1}^{N} (X_i^(m+1) - θ̂^(m+1))(X_i^(m+1) - θ̂^(m+1))' ].

4. If θ̂^(m+1) and V^(m+1) are essentially identical to θ̂^(m) and V^(m), respectively, the iterative procedure is terminated and θ̂^(m+1) and V^(m+1) are printed as the maximum likelihood estimates of θ and Φ.

4.2 The Computer Program

The FORTRAN language computer program MISSMLE calculates the M.L.E. of the mean vector and the variance-covariance matrix using the iterative procedure of Orchard and Woodbury. It also calculates estimates of the regression coefficients, if applicable. MISSMLE does not check that the parameters are identifiable for the given pattern of missing data. For meaningful results, the user must verify this himself; Rubin's factorization table (chapter 3) is useful in determining identifiability.

4.2.1 Flow Chart for MISSMLE

[Flow chart for MISSMLE. The program reads the data matrix; computes the means θ̂_j, j = 1, ..., p, from the present data and substitutes them for the missing values; computes the initial covariance matrix; then iterates: for each case with missing data it partitions V, calculates new estimates of the missing data points by regression substitution, and accumulates the elements of V22.1 = V22 - V21V11⁻¹V12 into V* in the appropriate places; recomputes the means and the covariance matrix; and repeats until the convergence test of section 4.2.2 is satisfied or 100 iterations have been done. If a dependent variable is specified, the regression calculations of section 4.2.3 are then performed.]

4.2.2 The Test Used for Convergence

The user inputs a stopping value STP, or the program defaults to STP = .001. At each iteration a new mean vector and a new variance-covariance matrix are calculated. At the ℓth step,

max | (T_ℓ - T_{ℓ-1}) / T_{ℓ-1} |

is calculated, where T ranges over all parameters in the mean vector and variance-covariance matrix, and T_ℓ is that parameter as estimated at the ℓth step. If this maximum percent change is less than or equal to STP, the iterations stop. A maximum of 100 iterations will be done; this was the minimum number suggested by Beale and Little (1975). If 100 iterations are done with no convergence, this result is noted in the output.
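To make the four steps and the stopping rule concrete, here is a minimal sketch in Python/NumPy (purely for illustration: MISSMLE itself is a FORTRAN program, and the function name and the use of NaN to mark missing values are choices made for this sketch, not part of the thesis).

import numpy as np

def orchard_woodbury(X, stp=0.001, max_iter=100):
    """Iterative procedure of section 4.1 with the stopping rule of 4.2.2.
    X is an N x p array with np.nan marking missing values."""
    X = np.asarray(X, dtype=float)
    N, p = X.shape
    obs = ~np.isnan(X)

    # Step 1: means of the present data, mean substitution, initial covariance.
    theta = np.array([X[obs[:, j], j].mean() for j in range(p)])
    Xc = np.where(obs, X, theta)
    V = (Xc - theta).T @ (Xc - theta) / N

    for _ in range(max_iter):
        theta_old, V_old = theta.copy(), V.copy()
        Xc, Vstar = X.copy(), np.zeros((p, p))

        # Step 2: complete each case by regression substitution and accumulate
        # V22.1 = V22 - V21 V11^{-1} V12 into V* for the missing variables.
        for i in range(N):
            miss, pres = ~obs[i], obs[i]
            if not miss.any():
                continue
            V11 = V[np.ix_(pres, pres)]
            V12 = V[np.ix_(pres, miss)]
            V22 = V[np.ix_(miss, miss)]
            coef = np.linalg.solve(V11, V12)                  # V11^{-1} V12
            Xc[i, miss] = theta[miss] + (X[i, pres] - theta[pres]) @ coef
            Vstar[np.ix_(miss, miss)] += V22 - V12.T @ coef

        # Step 3: revised estimates of the mean vector and covariance matrix.
        theta = Xc.mean(axis=0)
        V = (Vstar + (Xc - theta).T @ (Xc - theta)) / N

        # Step 4 / section 4.2.2: stop when the maximum percent change of any
        # parameter is at most STP (a parameter exactly equal to zero at the
        # previous step would need special handling here).
        old = np.concatenate([theta_old, V_old.ravel()])
        new = np.concatenate([theta, V.ravel()])
        if np.max(np.abs((new - old) / old)) <= stp:
            break

    return theta, V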
4.2.3 Regression Calculations

Standard regression terminology and notation are used in this section. If a dependent variable is specified, the regression coefficients β̂, the SSE, and R² are calculated. Let

V = [ σ00  γ'  ]
    [ γ    V22 ]

be the final variance-covariance matrix, which has been rearranged so that σ00 is the variance component for the specified dependent variable. Then the calculations done are:

R² = γ'V22⁻¹γ / σ00   (= RSQD),

SSE = σ00 - γ'V22⁻¹γ,

β̂ = V22⁻¹γ   (= estimated regression coefficients).

4.3 Sample Problem

The data for this example were taken from Woodbury and Hasselblad (1970). No labels were input, so those used are the default values. Variable 2 was specified as the dependent variable; all intermediate output is requested. The only value signifying missing data is 0.0. The cards on the following page are the input deck used in this example.

ARTIFICIAL EXAMPLE OF A TRIVARIATE NORMAL
1 20 3 2 1 0.0
(3F.0)
.422,0.,0.
-1.306,0.,0.
-.125,0.,0.
-.983,0.,0.
.453,0.,0.
.274,-.065,0.
1.000,.169,0.
.510,.477,0.
-.767,-.310,0.
1.075,.304,0.
-.656,-2.142,-.927
-.754,1.234,-.03
-1.111,-.297,.248
-.846,-.942,-.527
1.031,-.309,-1.264
-1.466,-1.465,-.988
.230,1.064,-1.161
.355,-.029,-.174
-.539,-1.181,-1.702
1.034,.325,.268

On the following pages is a partial listing of the output that was generated. Step 11 was the last iteration. Only output from that iteration is included, but this type of information was output for each iteration.

[MISSMLE output for the sample problem. The listing shows the sample size (20), the numbers of present and missing values for each variable, the initial means and the initial variance-correlation matrix obtained after mean substitution, the artificially completed data at step 11 (the final iteration) together with the location of the maximum percent change, the final mean vector and variance-correlation matrix, and the estimated regression coefficients, SSE, and RSQD for the dependent variable X(2).]
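The three quantities of section 4.2.3 come directly out of the partitioned covariance matrix, so they can be computed in a few lines once the iteration has converged. A minimal sketch (Python/NumPy, purely illustrative; the function name is invented here):

import numpy as np

def regression_from_covariance(V, dep):
    """Regression quantities of section 4.2.3 from the final covariance
    matrix V; dep is the index of the dependent variable."""
    V = np.asarray(V, dtype=float)
    idx = [j for j in range(V.shape[0]) if j != dep]
    sigma00 = V[dep, dep]               # variance of the dependent variable
    gamma = V[idx, dep]                 # its covariances with the predictors
    V22 = V[np.ix_(idx, idx)]           # covariance matrix of the predictors
    beta = np.linalg.solve(V22, gamma)  # beta_hat = V22^{-1} gamma
    explained = gamma @ beta            # gamma' V22^{-1} gamma
    return beta, sigma00 - explained, explained / sigma00   # beta, SSE, RSQD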
5. MAXIMUM LIKELIHOOD, BAYES, AND EMPIRICAL BAYES ESTIMATION OF θ WHEN THERE ARE MISSING DATA

5.1 Assumptions

Let Xi be the 1 x p vector such that Xi, given θ, has a MVN(θ, Φ) distribution, where θ is a p x 1 vector (θ1, ..., θp)' and Φ is the p x p positive definite variance-covariance matrix. The elements of Φ are σij. A random sample of N observations of Xi is taken and there are missing observations. The incompleteness matrix M* describes the pattern of missingness, indicating which variables are observed and which are missing. Every variable is assumed to be observed at least once. It is assumed that the data are MAR and that the parameter of the missingness process is distinct from θ. This implies that the process that causes the missing data can be ignored.

Recall that M* = (m_{ij}) is the k x p incompleteness matrix; k is the number of distinct patterns of observed variables. Missingness pattern i refers to the pattern of missingness described by the ith row of M*. Let n(i) denote the number of observations Xj which have missingness pattern i; 1 ≤ n(i) ≤ N and Σ_{i=1}^{k} n(i) = N. It is assumed that M*, n(1), ..., n(k) are such that θ is identifiable.

5.2 More Notation

Let si be the number of variables observed in pattern i; si = Σ_{ℓ=1}^{p} m_{iℓ}. Let Ci, i = 1, ..., k, be the set of subscripts j such that variable j is observed in the ith pattern; Ci = {j : m_{ij} = 1}.

Without loss of generality we can reorder the observations such that the first n(1) are those with missingness pattern 1. Let these observations be labeled X_1^(1), ..., X_{n(1)}^(1), and recall that each X_ℓ^(1) is a 1 x s1 vector with elements x_{ℓj}, j ∈ C1. The next n(2) observations are those with missingness pattern 2 and are labeled X_1^(2), ..., X_{n(2)}^(2), where each X_i^(2) is a 1 x s2 vector, and so on. The program REACTOR identifies the observations which have missingness pattern i, i = 1, ..., k.

Let Pi, i = 1, ..., k, be the si x p matrix of 1's and 0's such that Pi times the p x 1 vector of all variables is equal to the si x 1 vector of variables that were actually observed. Pi is obtained from (m_{i1}, ..., m_{ip}) in the following manner. If m_{ij} = 0, skip it. If m_{ij} = 1, add another row to Pi, a row in which the jth position is 1 and all others are 0. For example, if (m_{11}, ..., m_{16}) = (1 1 0 1 0 1), then P1 is the 4 x 6 matrix

P1 = [ 1 0 0 0 0 0 ]
     [ 0 1 0 0 0 0 ]
     [ 0 0 0 1 0 0 ]
     [ 0 0 0 0 0 1 ].

Note that PiPi' = I_{s(i)}, the si x si identity matrix. Let Φi be the si x si matrix PiΦPi'. Let B be the (Σ_{i=1}^{k} si) x p matrix formed by stacking P1, ..., Pk; that is, B' = (P1' P2' ... Pk').
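Constructing Pi from a row of M* is mechanical; a small sketch (Python/NumPy, purely illustrative) for the example just given:

import numpy as np

def selection_matrix(pattern):
    """Build the s_i x p matrix P_i of section 5.2 from one row
    (m_i1, ..., m_ip) of the incompleteness matrix: one row per observed
    variable, with a 1 in that variable's position."""
    pattern = np.asarray(pattern)
    p = pattern.size
    return np.vstack([np.eye(p, dtype=int)[j] for j in range(p) if pattern[j] == 1])

P1 = selection_matrix([1, 1, 0, 1, 0, 1])                  # the example in the text
assert P1.shape == (4, 6)
assert np.array_equal(P1 @ P1.T, np.eye(4, dtype=int))     # P_i P_i' = I_{s(i)}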
It will be convenient to have a special notation for block diagonal matrices. Let the direct sum Σ⁺_{i=1}^{r} Ti, where each Ti is a qi x qi matrix, denote the (Σ qi) x (Σ qi) matrix with the matrices Ti down the diagonal and all other entries 0. For example,

Σ⁺_{i=1}^{3} Ti = [ T1  0   0  ]
                  [ 0   T2  0  ]
                  [ 0   0   T3 ].

Let I_{s(i)} denote the si x si identity matrix.

The matrix B'(Σ⁺_{i=1}^{k} n(i)I_{s(i)})B equals the p x p diagonal matrix Σ⁺_{i=1}^{p} (ti), where ti equals the total number of observations of variable i. By assumption, each ti is strictly greater than 0. Let tij equal the total number of observations in which variable i and variable j are both observed; tii = ti. Then B'(Σ⁺_{i=1}^{k} n(i)Φi)B is the p x p matrix whose (i,j)th element is tij σij.

Let Si be the si x 1 vector which is the sum of the n(i) observation vectors having pattern i; Si = Σ_{j=1}^{n(i)} X_j^(i)'. Let S be the (Σ_{i=1}^{k} si) x 1 vector S = (S1' ... Sk')'. Then

S | θ ~ MVN( (Σ⁺_{i=1}^{k} n(i)I_{s(i)}) Bθ , Σ⁺_{i=1}^{k} (n(i)Φi) ).

The associated vector of means is S̄ = (Σ⁺_{i=1}^{k} (1/n(i))I_{s(i)}) S;

S̄ | θ ~ MVN( Bθ , Σ⁺_{i=1}^{k} ((1/n(i))Φi) ).

The sum over all observed values of variable j is denoted by x_{.j}, j = 1, ..., p; that is, x_{.j} = Σ_i Σ_{ℓ=1}^{n(i)} x_{ℓj}^(i), summing over all i such that j ∈ Ci. Then B'S = (x_{.1}, ..., x_{.p})'. Define

X̄ = (Σ⁺_{i=1}^{p} 1/ti) B'S = (x_{.1}/t1, ..., x_{.p}/tp)' = (x̄_{.1}, ..., x̄_{.p})'.

5.3 Sufficient Statistics

For all variables xj which are observed in missingness pattern i, define x_{.j}^(i) = Σ_{ℓ=1}^{n(i)} x_{ℓj}^(i), the sum of the n(i) values of xj present in the observations with missingness pattern i. Let 𝒮 be the set of all the x_{.j}^(i), over all j ∈ Ci and over all patterns i; 𝒮 = {x_{.j}^(i) : j ∈ Ci, i = 1, ..., k}. There are Σ_{i=1}^{k} si elements in 𝒮. The set 𝒮 is sufficient, but not necessarily minimal sufficient, for θ (Little 1976a). Since S is the vector of the elements of 𝒮, S is a sufficient statistic for θ.

5.4 The M.L.E. of θ When Φ is Known

In the notation of this section the M.L.E. of θ is

θ̂ = (B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹)B)⁻¹ B'(Σ⁺_{i=1}^{k} Φi⁻¹) S,

or equivalently,

θ̂ = (B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹)B)⁻¹ B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹) S̄     (5.1)

(Hartley and Hocking 1971). The estimate is unbiased and is distributed, given θ, as MVN(θ, (B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹)B)⁻¹). The M.L.E. has the form of a weighted least squares estimate.

Proposition 5.4.1. If n(i)/N goes in probability to ri as N → ∞, where 0 < ri < 1, i = 1, ..., k, and Si and Sj are independent, i, j = 1, ..., k, i ≠ j, then √N(θ̂ - θ) converges in distribution to Z as N goes to ∞, where Z is distributed MVN(0, (B'(Σ⁺_{i=1}^{k} riΦi⁻¹)B)⁻¹).

Proof. Recall that S̄ = (S̄1' ... S̄k')', where S̄i is the vector of sample means using only the data with missingness pattern i. By the multivariate central limit theorem, √n(i)(S̄i - Piθ) converges in distribution to Zi ~ MVN(0, PiΦPi') as n(i) → ∞, for i = 1, ..., k. Therefore, using Slutsky's theorem (Rao 1973) and the assumption that n(i)/N → ri in probability, √N(S̄i - Piθ) converges in distribution to Zi* ~ MVN(0, (1/ri)Φi), for i = 1, ..., k. Since S̄i and S̄j are assumed to be independent for all i ≠ j, this implies that √N(S̄ - Bθ) converges in distribution to Z* ~ MVN(0, Σ⁺_{i=1}^{k} (1/ri)Φi). Therefore

√N(θ̂ - θ) = (B'(Σ⁺_{i=1}^{k} (n(i)/N)Φi⁻¹)B)⁻¹ B'(Σ⁺_{i=1}^{k} (n(i)/N)Φi⁻¹) √N(S̄ - Bθ)

converges in distribution to

(B'(Σ⁺_{i=1}^{k} riΦi⁻¹)B)⁻¹ B'(Σ⁺_{i=1}^{k} riΦi⁻¹) Z*,

where Z* ~ MVN(0, Σ⁺_{i=1}^{k} (1/ri)Φi). Therefore √N(θ̂ - θ) converges in distribution to Z ~ MVN(0, (B'(Σ⁺_{i=1}^{k} riΦi⁻¹)B)⁻¹), and the proof of Proposition 5.4.1 is complete.
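Because the direct sums are block diagonal, the matrices in equation (5.1) reduce to sums over the k missingness patterns, which makes the estimate easy to compute. A minimal sketch (Python/NumPy, purely illustrative; the function and argument names are invented for this sketch):

import numpy as np

def mle_theta(M_star, n, Phi, S_bar_blocks):
    """Equation (5.1): the weighted-least-squares form of the M.L.E. of theta
    when Phi is known.  M_star is the k x p incompleteness matrix, n[i] is
    n(i), Phi is the p x p covariance matrix, and S_bar_blocks[i] is the
    s_i-vector of sample means for pattern i."""
    Phi = np.asarray(Phi, dtype=float)
    p = Phi.shape[0]
    G = np.zeros((p, p))     # accumulates B'(sum+ n(i) Phi_i^{-1}) B
    b = np.zeros(p)          # accumulates B'(sum+ n(i) Phi_i^{-1}) S_bar
    for m_i, n_i, s_bar_i in zip(M_star, n, S_bar_blocks):
        obs = np.flatnonzero(m_i)                          # C_i
        Phi_i_inv = np.linalg.inv(Phi[np.ix_(obs, obs)])   # Phi_i^{-1}
        G[np.ix_(obs, obs)] += n_i * Phi_i_inv             # P_i'(n(i)Phi_i^{-1})P_i
        b[obs] += n_i * (Phi_i_inv @ np.asarray(s_bar_i, dtype=float))
    return np.linalg.solve(G, b)                           # theta_hat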
Note that in Proposition 5.4.1 the only assumption made on the distribution of the Xi's was that Si and Sj are independent. In particular, under the conditional distribution Xi | θ ~ MVN(θ, Φ), the vector S̄i is independent of S̄j, i ≠ j, and therefore √N(θ̂ - θ) converges in distribution to Z, where Z ~ MVN(0, (B'(Σ⁺_{i=1}^{k} riΦi⁻¹)B)⁻¹).

5.5 Bayes Estimation of θ When Φ is Known

It is assumed that the prior distribution on θ is MVN(0, Σ⁺_{i=1}^{p} Ai), where Σ⁺_{i=1}^{p} Ai is a p x p diagonal matrix with the positive scalars Ai down the diagonal. The loss function is L(θ, θ̂) = (θ̂ - θ)'V(θ̂ - θ), where V is any positive definite matrix of constants.

Proposition 5.5.1. The Bayes estimate of θ is

θ̂^(B) = (Σ⁺_{i=1}^{p} (1/Ai) + B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹)B)⁻¹ B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹) S̄,

which can also be written as

θ̂^(B) = (Σ⁺_{i=1}^{p} (1/Ai) + B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹)B)⁻¹ B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹)B θ̂.

The conditional distribution of θ̂^(B) is

θ̂^(B) | θ ~ MVN( (Σ⁺(1/Ai) + B'(Σ⁺n(i)Φi⁻¹)B)⁻¹ B'(Σ⁺n(i)Φi⁻¹)B θ,
                 (Σ⁺(1/Ai) + B'(Σ⁺n(i)Φi⁻¹)B)⁻¹ B'(Σ⁺n(i)Φi⁻¹)B (Σ⁺(1/Ai) + B'(Σ⁺n(i)Φi⁻¹)B)⁻¹ ).

Proof. Let T = (Σ⁺_{i=1}^{p} (1/Ai) + B'(Σ⁺_{i=1}^{k} n(i)Φi⁻¹)B)⁻¹. Recall that θ̂ | θ ~ MVN(θ, (B'(Σ⁺n(i)Φi⁻¹)B)⁻¹) and θ ~ MVN(0, Σ⁺_{i=1}^{p} Ai). Therefore θ | θ̂ ~ MVN(T B'(Σ⁺n(i)Φi⁻¹)B θ̂, T) (DeGroot 1970), and under squared error loss the Bayes estimate, θ̂^(B), is the mean of the posterior distribution. Because θ̂^(B) is a linear transformation of θ̂, and θ̂ has a MVN distribution, the MVN distribution of θ̂^(B) is easily described. This completes the proof of Proposition 5.5.1.

The following lemma is useful in matrix manipulation.

Lemma 5.5.2 (Woodbury's theorem). If T is a p x p matrix, U is q x p, H is q x q, and W is q x p, then

(T + U'HW)⁻¹ = T⁻¹ - T⁻¹U'H(H + HWT⁻¹U'H)⁻¹HWT⁻¹,

provided the inverses exist.

Proof. The result is easily checked by multiplication.
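The two forms of the Bayes estimate in Proposition 5.5.1 are easy to check numerically. A minimal sketch (Python/NumPy, purely illustrative; the names are invented for this sketch) computes both and confirms that they agree:

import numpy as np

def bayes_theta(M_star, n, Phi, A, S_bar_blocks):
    """Bayes estimate of Proposition 5.5.1 under the MVN(0, sum+ A_i) prior,
    computed in both of its displayed forms."""
    Phi = np.asarray(Phi, dtype=float)
    p = Phi.shape[0]
    G = np.zeros((p, p))    # B'(sum+ n(i) Phi_i^{-1}) B
    b = np.zeros(p)         # B'(sum+ n(i) Phi_i^{-1}) S_bar
    for m_i, n_i, s_bar_i in zip(M_star, n, S_bar_blocks):
        obs = np.flatnonzero(m_i)
        Phi_i_inv = np.linalg.inv(Phi[np.ix_(obs, obs)])
        G[np.ix_(obs, obs)] += n_i * Phi_i_inv
        b[obs] += n_i * (Phi_i_inv @ np.asarray(s_bar_i, dtype=float))
    T_inv = np.diag(1.0 / np.asarray(A, dtype=float)) + G   # sum+(1/A_i) + B'(.)B
    theta_mle = np.linalg.solve(G, b)                        # equation (5.1)
    form1 = np.linalg.solve(T_inv, b)                        # first displayed form
    form2 = np.linalg.solve(T_inv, G @ theta_mle)            # second displayed form
    assert np.allclose(form1, form2)
    return form1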
Consider the improvement in the risk when using a 'a A ZnN than 0; define IM = R ( 0 ,0) - R(0,0v y) . 0 ^ rather x im­ a The Bayes estimate 0 proves on 0 when IM > 0 . Another measure of improvement is the improvement in Bayes risk; define I M ^ = r(B) - r(0^B ^). we know that I M ^ By the definition of a Bayes estimate, > G. A. Recall from equation (5.2) that 5(B) can then be written as (I P - W)0. 75 Lemma 5.6.1. The trace of a matrix is indicated by tr( ). P- 1 I M = tr(W'VW(Z A )) \- tr(W'VW8 I Proof. 8 Then h. -I -I ') + tr(VW(B'(Z n(i)|7 )B) ) I . By definition, IM 4.Eg|e (6 - = E 0 J6 6 ) 'V( 8 - 8 ) - E* | 0.(8 - 8 + W 6 ) 'V( 8 - 8 + W8) (2.(6 - 0)'W8 - e'w'we) ZE0 i0(8 - 8) 'VW(0 - 6) - E ^ 0(SfW 1W e ) Recalling that 8 |0 rO MVN(0 ,(Bf(i:+n.(i)| .^)B) I 1 then -IM = 2 t r ( W ( B ,(E+n(i)^“ 1 )B)"1) I k. - tr(WfVW(Bf(E^n(i)|~l)B)~l) - S fW fWS' " I k = tr(VW(Bf(E+n(i)i~^)B)"l) I + tr(W,W ( B ,(E+n(i)|" 1 )B)’'1 (W,"i - I)) - tr(WfW S S f) ■ ■ _i k+ ,_i P+ Since W f = I + B f(E n(i)|, )B(E A ), I 1 I ' (5.3) I 76 IM = .tr(W (B'(E+H(I)IT 1 )B)""1) + tr(WfW ( I +A 1)) - tr(WfW I 1 . I 1 6 6 f) and this completes the proof. Lemma 5; 6 .2. ' ■k IM(B) = t r ( W ( B f.(i:+ (n(i)|71 )B)™1) Proof. Using the definition of IM' (5.4) . and equation (5.3), I M (B) = E 0 (IM) p, k , ■, tr(WfW ( I A )) - tr(WfWE(ee')) + tr(W(B'(I n(i)|~ )B) ) I I tr(VW(Bf(I+n(i)$:l)B)~l) . . . I This completes the proof. Note that (B' (2+rt(i)i 1 'L)B) is symmetric and positive definite, i I k -I P, k Px Therefore.(B1(I n(i)$^ )B)(Z [Z A^)(B'(I n(i)$^ )B) is positive definite, I n 1 I ' P. since I A 1 is. . . 1 . 1 k _V This implies that W ( B f(I n(i)$. )B) ., which can be . I 1 r' ■■ . 77 K. K ' p JC written as (B' (E+H ( I ) K 1)B + B'(Z+n(i)$Tl)B(Z+A.)B'(Z+n(i)$~l)B)~l, is . . . I I 1 I 1 I 1 a positive definite matrix. Proposition 5.6.3. Proof. Therefore we have the following result. > 0 . Recall from equation (5.4) that IM(B) = tr(VW(B' (E+H ( I ) ^ 1 )B)"1) . I Since V is assumed to be a positive definite matrix and the trace of the product of two positive definite matrices is strictly positive, then IM (B) >0 and this completes the proof. Therefore, in terms of Bayes risk, the Bayes estimate always improves on the M.L.E. Lemma 5.6.4. definite,.and if If P ■P 2 P_L E (0 /A.) < I, then (E A ) - 0 0 ’ is positive i=l I . 2 Pi- E,(0./A.) = I, then (E.A.) - 0 0 1 is positive .1 = 1 . V 1 " ' ' I semi-definite. Proof, By theorem 8.5.2 in Graybill (1969), the characteristic " P+ P 2 P equation for (E A ) - 00’ is (I - E 8 ,/(A - X)) H (A I i=l . i=l - X) = 0. . 78 Suppose A* < 0 arid Z (0 /A ) . 1=1 1 1 I. Then ' P Z 0;/(A. - A*) < Z 6 /A. _< I, which implies that I = I 1 1 i=l 1 1 (I - Z 8 ./(A - A*)) > 0. i=l 1 that (I - Z 1=1 6 Since A / (A, - A*)) H (A i=l teristic. root of (Z A ) I 1 - A* > 0 for each i, this implies I 8 8 '. - A*) > 0 and A* cannot be a charac- Therefore, there are no negative roots .. of the characteristic equation when • P o P P 2 Z (.6 ./A.) ±=i o . P If . Z Q /k < I, then (I - Z 8^/(A i=l 1 1 i=l 1 0 I. 1 - 0))n(A. - 0) > 0 I and 1 is not a root of the characteristic equation. P 2 So for Z 6 ./A. <: I, all characteristic roots are positive. • i-i 1 1 . Since a symmetric matrix having characteristic.roots which are all positive is positive definite (Graybill 1969) , (Z A.) I i P ■O 2 ef/A, 1 = 1 1 .1 I, t hen.(I - Z6./A.)nA. I 1 1 I 1 t 8 8 .1 is positive defiriite. ■ . : 0 and A = 0 is a chafac- 79 teristic root of (2 A.) - 80'. I 1 . . 
Therefore, if 20,/A; _< I, the characteristic roots of (2 A.) - 00 I 1 1 I 1 are non-negative and at least one root is equal to zero; by theorem P+ 12.2.2 in Graybill (1969) , (2 A.) - 00' is positive semi-definite. I 1 This completes the proof of lemma 5.6.4. Proposition 5.6.5. Proof. If 2 0./A, < I, then IM > 0, i=l 1 1 . By equations (5.3) and (5.4), we have IM = tr(W'W(2 A.)) - tr(W'VW00 ') + IM^ Yb ) By proposition 5.6.3, IMv ■ > 0 . Therefore, IM > tr(W'VW(2 A )) - tr(W'VW8 I . = tr(W,W ( ( 2 +A.) I 1 8 8 6 ') ')) . P 9 ■. p +; By lemma 5.6.4, 20,/A £ I implies that (2 A.) - 00' is non-negative 'i 1 1 •• ■ ■ ■ :i 1 definite. ••' P+ Since W ’VW is positive definite, tr(W'VW((2 A.) - 00')) £ 0 ' I ■; 80 and IM > 0. This completes the proof. In particular, for the two most common loss functions L(0,6) = (8 - 9)'(9 - 0) and L(0,0) = (0 - 0)'| ^(0 - 0), the risk of the Bayes estimate is strictly less than the risk of the M.L.E. if lies within the ellipsoid P 2 Z y /A = I. 0 The region in which IM is 1=1 greater than may be much more general than this ellipsoid, especially 0 k if tr(VW(B' (Z n(i)|/)B) x) is large. I 5.7 Empirical Bayes Estimation of 0 When i is Known Consider the following simple example. Let X^, i = I , ... ,N, be a 2 2 random sample of p x I vectors where Xl |0 -v MVN(0,a I ) and o is 1 P '■ _ N known. The M.L.E. of 0 is X = Z X.'/N. If 0 ^ MVN(0,Al ) , where i=l 1 P A is a positive constant, then the Bayes estimate under squared error 2 loss is (I - cr / (a 2 — + NA))X. 2 An unbiased estimate, of I/(a, + NA), under the unconditional distribution of X, is (p - 2)/NX'X. Therefore an empirical Bayes estimate, which turns out to be a James-Steiri estimate, is (I - a (p i 2)/NX’X)X. . Suppose however that there are. missing observations. this.case, the M.L.E. of £ is (Z I/t^)B'S.= X ■ I Then for where x .. = Zx ./t., the mean over the t . observations of variable j . .j £ 1J J I The 81 Bayes estimate is g(B) P I 9 9 ( S i - a /(a I (I - a 2 /(a 2 + tiA))x.i . _ + t .A))X or ^ This is now in the form of a ridge- '2 2 regression estimate; the term (I - a / (a. + t^A)) is different for each i. 2 It is more difficult to get. unbiased estimates of I /(a 2 i = I,...,py than it was to get an.unbiased estimate of I/(a + t A), i + NA);.. Alternative types of. estimators could be considered and maximum likelihood estimation of A might seem the logical place to start. However, even in this simple case, the M.L.E. of A may not exist, and if it exists there is not a simple closed form expression for it. An unbiased estimate of A is A ___ yP (X'X - a E(l/t.))/p; this is I also a method of moments estimator. 5.7.1 An Empirical Bayes Estimator for the General Case Proposition 5.7;I . where Q ; k • = (En(I)I1)+ I Proof. LetU= The unconditional distribution of S is MVN(£,Q) k P k (E n(i)Is(i))B(i: A ^ B V C E n(i)I?:(i)), I I I (El/A,) + B'(E n(i)|.)B. ■' I 1 I . 1 Recall that k s|e ^ .MVN(: (E+n(i)Ig.^±))B6, (E4ln (I)I1) ) and 8 ~ MVN(Jg1CE+A1)). Then 82 the p.d.f. of S,f(S;8 ) is J- c &xp(. - Id '(Z-rC i Z n a ) )|T1)d - A e X z r I Z A J e ) d e I (Z n(i)I where D = S . and f(s) 8 . 2. I 1 x .. )B 8 , and c is a constant with respect.to S s' Therefore, using theorem 10.5.1 in Graybill (1969), J (IZn(I))I1 ............... .. = c exp( - As' ( z = C1 e x p ( - Asv (E )s) .. k, exp( - ASVC (L+ (IZn(I))I. ) I , J 1- I . I 1 k , 1 k (Z+ITi)BU-iB l(L+ITi) ) S ) I I where C 1 is a,constant with respect to S. 1 .......... 
- A ( e ’ ue - 2 e ' B ' ( z |~ ) s ) ) d e A-is (IZn(I))I1 ) s ) e x p ( A s ’ (sk+x-iNT. I. ) b u -i„.t V ( Zk+ +I J )S ) I = C J f e...... x p( I By straightforward algebra, . • . k' ■ k ' % ' it can be checked that ( (Z+ (lZn(i))|T1) - (L+IT1)BU-1B '(L+IT1) )Q = I I I 1 I .1 and Q( (L+ (IZnCi))T1) - '(L+T 1)BU-1B' (L+IT1) ) = I, where I is the I I I 1 k , k 'L s . x L s . identity matrix; I 1 I 1 S ~ MVN(^,Q). o -1C Therefore, f (S) = C 1 e x p( - A SH'0 S) and 1 This completes the proof of proposition 5.7.1.. 83 k Therefore .0 ^ MVN(0, (B'(Z+H ( D i J 1 )B)^ ~ I 1 X= P+ (Z'l/t^)B’S = (x#^...x p + (Z+A.)) and I 1 )' is distributed P+ K p . p MVN(0, (Z+ l/t.)B»(Z n(i)D)B(Z 1/t.) + ( Z % ) ) I I I 1 I . Using the unconditional distribution of X, there is not a general closed form solution for the M.L.E. of the A^'s. In fact, since it is assumed that each A^ > 0, a M.L.E. does not necessarily exist. —2 unbiased estimate of A^ is A^ = x ^ - a^^/t^, i = I , ...,p. estimates are method of moments (M.O.M.) estimates. An These Using these A 's I and equation 5.2, an empirical Bayes estimate of 0 is Q(EB) = ( I - (I k. -I :\-l + (Z+ X^i - O 1 1 Zti)B '(Z-hH ( I ) ^ 1 )B) '•L ) 0. , I I The asymptotic, properties of 0 (EB) are given i n ■the following proposition. . Proposition 5.7.2. Suppose n(i)/N -> r ., where 0 < r . < I, ' ' ' ' N-Xjo 1 1 i =. I , ... ,k, and S . LI ,S ., i,j = l,...,k, i f j. 1 J Then L . ^ ( 0 (EB) - 0) -XZ A, MVN(0, (B '(Z+ r1 |1 1 )B)"1) N-Xd . where 0 V y is defined in equation (5.5) . : . (5.5) 84 Proof. I j Recall that t^ = En(j) where the summation is over all j , k, such that i e C., i .e ., over all j such that variable i . . J ■ . was observed in missingness pattern 3 . where the summation is again over all. 3 Let c. = Zr., i = I , ...,p, 1 , J = I,...,k, such that i e C. 3 J P Then c. > 0, £ = I , ... ,p, and t./N--*c.; Since each r. > 0 and c. > 0 1 1 N-x» 1 . 1 1 __ N_ P _1 '__1_' P n a(DN>i ' _J. P n ' Cf BH- , JN ' P _1 ^i. ^ i ^ i. '" " ... . From the proof of proposition 5.4.1, we have . /N(S - B9) ->Z N-x” ^ MVN(0,(2 .(l/r ) L ) ) I 1 . Since X = (2+ l/t.)B*(Z+n(i))S, this implies I I " V n (X - ‘ L ' p ,p 1■ ) . > z . ~ MVN(0,(2+ l/c.)B '(Z+ r,$ .)B(Z+ 1/C .) . . N-x» 1 ' I ‘ 1 I 1 1 I . 1 6 , P - pV , Then, by Slutsky’s, law, X '■>0 and X X' -x 09 ... N^oo ' . N-x” ‘ ' .. If D is a p x p matrix, let the notation dia.g{D} indicate the p % p diagonal matrix with the same diagonal elements as D. ^( q CEB) ^ Qj can be written as Then 85 /N ( 8 - 0) - VS"( I + dmgtii X' - ^(Z+ l/ti)}B,(2+fa(i)i71)B )-;L0 = P I 1 I 1 VS(6 — 0 ) — • ' ; ___ . P k . + XdtaglX X1 } - dcag{|}(z:+ l/t,))B,(Z+ (n(i)/N)i7J")B . ' V 1 I -1 .-' V^/N( (1/N)I ) " 1 ; Using the results of proposition 5.4.1 and repeated applications of Slutsky's law, this implies *^(0™ L k - 0) -y Z - 0(0 + (dtag{00'} - dtag{|}0)B,a:+ r J:3‘1 )b )'"?-0 = z N- * 00 I k _ _ where Z t MVN(0,(B'(Z r.|. )B) . .This completes the proof. I Therefore, under the conditional distribution X^|0, L ■ k ' ^j(0(EB) - O) -)-Z n, MVN(0, (B 1 ( Z V . ; I 1 : f 1 )B)"1) 1 and this .E.B. estimator is consistent and it is asymptotically equivalent to’ .,this M.L.E. 0. One apparent difficulty with this E.B. estimator, is that Ay may be negative, while the true parameter A. is assumed to be strictly positive.. I . ■ ■ * ^ ■ • . . . ■ ■ When ^ is a diagonal matrix and A^ > 0, A^ is also the M.L.E. of A 1 .. 86 5.8 Examples . . . 5.8.1 The Design Consider an experiment carried out in a randomized block design. 
The model for the observed variables, where b^, i = 1,...,N, is the i the ., is y „ = a + b^ + t . + e ^ ., block effect, t^ , j = I ,..,,p, is treatment effect, and t^ represents the control. The vectbrs e^ = (e^y..-e^p)'> I = 1,...,N, are independently and identically distributed with the (p + I)-variate normal distribution with mean and variance-covariance matrix Q. 0 Since Q is not. necessarily a diagonal matrix, the treatment effects may be correlated. For a recent reference for this nonstandard block design model, see Mudholkar and Subbaiah (1976). 0 I Suppose the parameters of interest are = t. - tn , i = l,...,p, the differences between treatment and i U control. Let Let Xi 0 = (0 ,.. . 0 I p )'. = y^^ - y i0, and X ± = ( x^. ..x^), i = I,.... ,N, and 1 = l,...,p. Then the model for x ' ij is x. .= t. - t_ + e . . ij 3 0 ij e,_ and 10 X i I0 ~ MVN(0,|) where $ = (-* ] I pxl P JQt-J ! I )'. p Xp Suppose one or more treatment variables are missing for sonie observations. If Q. is known, and therefore | is known, then the problem is to estimate 0 from a random sample Xi , i = 1 ,.,.,N, where 87 Xj:|0 ~ MVN(0,|) and some data are missing. In this example, shrinking the maximum likelihood estimate of the treatment effects towards the control is a conservative and reasonable approach. So for p ^ 3, the ,-*1 estimate 0 V . , which shrinks the M.L.E. towards O 9 is a reasonable estimate to consider. Mudholkar and Subbaiah (1976) give an example of a randomized block design where the parameter of interest is between treatment and control. 0 , the differences They give the sample mean and variance from a clinical trial where p = 4. Using their sample estimates as true parameters, consider an example where /1.5616 \ 1.3529 I 1.5457 \ 1.0222 / . I . _ 9 I /5.1791 3.5200 4.1789 X 4.7738 _I 2.9220 3.1260, 4.5538 3.8497 3.9780 I and 4 I 5.5509 Fpf the following examples, N random. 4 x I vectors were generated with this MVN,(.0j|) distribution. . '' ■ ■ ■ " (R) 0 ■ and zero. 0 . ' ' A (EB) 0 ' ;' ■ ■ . . The Bayes and empirical Bayeh estimates, • ' .. . . . ' , should give the best results when The value of 0 ' 0 is zero or near used for these examples is not exactly equal to , but the values of the. variances are sufficiently large so that close to zero relative to the variances., 0 is Therefore these examples may or may not show the empirical Bayes estimate to advantage. v. 88 5.8.2 Numerical Examples Where | is Known . In each of the examples 1-3, described below, the M,L,E, calculated using equation (5.1). 6 is The empirical Bayes estimate, 0 is calculated using equation (5.5). The estimate 0 (EB) (EB) uses —2 • A.-. = x % - .Cf4 4 Ztil; when t4 is large, the term O 4 4 Zt4 will, have little r ii ii i -2 influence. . The empirical Bayes estimate using = x,^, i = l,...,p, 'L 'A(2 ) was also calculated; it is denoted by 0 V '. The values for two loss functions are calculated for each estimate: I) 2 ) (0 - 0 ^'^ ^ '(8 - 0 ), where 0 is the estimate - $) (0 0 , 0 ^EB\ '.(0 of - 0 ) and . 4 2 In each of the following examples the number E0./A was calculated I 1 1 —2 , where A^ = x ^ - o^^/t^. This was of interest because the Bayes esti- 4 % mate will improve bn the M.L.E. when E0./A. < I. I 1 1 _ 4 . Therefore E0.4 /A. . 1 i i ... might give some, indication of the suitability of using the E.B. esti­ mate; however, these examples don't show any obvious relation between . this number and the performance of Ov 'Example I . N = 100. '. 
There are three patterns of missingness, with missinghess pattern n(i) 80 'mix M* 1110 ,0110/ and 10 10 89 2 " = 2.91. The following is a summary of the results. V VV *4- - 0 - '$ - Il <?CD 0 .341 .231 .234 0 .190 .106 ,108 0 .257 .160 .162 0 .323 .206 .209 .323 .133 ,136 .034 .0 2 2 .0 2 2 6I . (% - 0 ) ' (% - - 0 )'F^ 0 (0 ) - 0 ) 'In this.example the E .B . estimate performed better than the M.L.E. a (TTg} ' *^ (2 ) There was little difference between 0 ■ and 0. . , which was to be: expected since t. > 80 for each i. Example 2. N = 50. In this example the missingness pattern was generated by assuming that the probability of the variables being missing w a s : • Variable # Pr (missing) 1 . 2. 2 .4 3 4 .2 .90 The resulting incompleteness matrix is 1111 1011 1101 0011 1110 1010 1001 0111 0101. 1100 0100 0010 0110 1000 M* 4 2 / E0i/Ai =2 .60. ' ' .' ' The following is a summary of the results. ■■ .? = . > 1 - 6V : ®2 ." : 6 2 6 3 v ■ ®4 - '6A .: ; » , e (EB) 0 .564 .366 .,374 ' .411 .256 .. .263 .429 ,249 ,256 .419 . ,2 1 0 .219 . ■( 8 - 0 ) ’ (0 .. $ - 0 - 0 ) ).'$- 1 ($ - .306 . .847 0 ) .1 1 2 V . .079 .322 .080 Again the E.B.■estimators performed significantly better than the M.L.E., The estimate 0 ^ did hot do quite as. well as 0 ^ ^ . ■ ; ...... s . 91 Example 3 . N = 16. The incompleteness matrix is n(i) M* = 4 2 Z 8 ,/A = 3.55. /1111\ . I 0110 I and V0 0 1 1 / 8 4 4 . Following is a summary of the results. - 1 . K, 8 °1 " 9I &2 - @2 ^4 - 94 - - 8 ) ' ( 8 ' (ef> e 8 ) 8 ) - 6 ) , .S - I e <EB) .032 -.466 -.413 .2 1 2 — .160 -.1 2 1 - .1 0 2 *3 - 93 (0 = ■ -.545 . -.499 . .074 -.425 -.373 . .062 .721 .574 .270 .262 .233 ' In this example, neither E.B. estimator performed well at all compared to the M.L.E. The M.L.E.’s were either very close to the true value or underestimated the true value, except for /so- shrinking the M.L.E. towards 0 did not improve the estimation. sample size was.smaller here than in examples I and - 4 % . of 20 /A 2 The. and the value : was larger. , . . : : ' However one cannot draw any general conclusions from these three examples. , A simulation study is necessary to make such conclusions, 92 and such a study is not within the scope of this paper. These examples do indicate that the empirical Bayes procedure warrants further investigation. 5.8.3 An Example Where | is Not Known The assumption that $ is known is a restriction bn the usefulness of the E.B. method. Future study will need, to extend the empirical Bayes method to problems of estimating 0 when $ is not known. Until such work is completed, it seems reasonable to use the naive approach of replacing $ of equation (5.5) by some consistent estimate of $. Consider the data and missingriess pattern of example I. now that $ is not known. The M.L.E. 6 Assume can be calculated by 0 and W ’s , iterative technique using the computer program MISSMLE, To calculate the empirical Bayes procedures, replace the elements of $ in equation (5.5) by their M.L.E.'s from the 80 complete data cases. As the.table below shows, the. E.B. procedures still perform well relative to maximum likelihood. 93 rV 6 *1 - 0I * 2 .- 0 2 S . . 04 . • (e- 8 )' ( 8 -r (6 - 8 )'$"1 (8 8 ) - 6 ) •= 8 8 - e‘ <EB) 8 = 5 <2) .340 .251 .253 .190 ,126 .128 .257 .177 ,178 .325 .237 ,239 .323 ,167 .169 .034 .023 .023 . ' 6 . SUMMARY This paper is concerned with the problem of simultaneous estimation of the means using incomplete multivariate data. 
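Before turning to Example 3, it may help to see how the estimates being compared here are formed. The sketch below (Python/NumPy, purely illustrative; the function name and the use of NaN for missing values are choices made for this sketch) computes θ̂ from equation (5.1) and θ̂^(EB) from equation (5.5), using the method-of-moments estimates Âi = x̄²_{.i} - σii/ti of section 5.7.

import numpy as np

def mle_and_eb(X, Phi):
    """M.L.E. of section 5.4 and empirical Bayes estimate of equation (5.5)
    from an N x p sample with np.nan marking missing values, for known Phi."""
    X = np.asarray(X, dtype=float)
    Phi = np.asarray(Phi, dtype=float)
    N, p = X.shape
    obs = ~np.isnan(X)

    # Group the cases by missingness pattern and accumulate the pieces of (5.1).
    G = np.zeros((p, p))          # B'(sum+ n(i) Phi_i^{-1}) B
    b = np.zeros(p)               # B'(sum+ n(i) Phi_i^{-1}) S_bar
    patterns = {}
    for i in range(N):
        patterns.setdefault(tuple(obs[i]), []).append(i)
    for patt, rows in patterns.items():
        idx = np.flatnonzero(patt)
        Phi_i_inv = np.linalg.inv(Phi[np.ix_(idx, idx)])
        s_bar = X[np.ix_(rows, idx)].mean(axis=0)
        G[np.ix_(idx, idx)] += len(rows) * Phi_i_inv
        b[idx] += len(rows) * (Phi_i_inv @ s_bar)
    theta_mle = np.linalg.solve(G, b)                  # equation (5.1)

    # Method-of-moments estimates A_i = xbar_i^2 - sigma_ii / t_i (section 5.7).
    t = obs.sum(axis=0)                                # t_i
    xbar = np.array([X[obs[:, j], j].mean() for j in range(p)])
    A_hat = xbar**2 - np.diag(Phi) / t

    # Equation (5.5): theta^(EB) = (I - (I + diag(A_hat) G)^{-1}) theta_hat.
    W = np.linalg.inv(np.eye(p) + np.diag(A_hat) @ G)
    theta_eb = (np.eye(p) - W) @ theta_mle
    return theta_mle, theta_eb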
The major results of this paper are two new estimators for the mean— a Bayes estimator and an empirical Bayes estimator. These are compared to the maximum likelihood estimate of the mean. Large sample and small sample prop­ erties of the Bayes estimator are found, and large sample properties of the empirical Bayes estimator are found. Although small sample properties of the empirical Bayes estimator, are more difficult to find, numerical examples indicate that under some conditions this estimator may improve on the maximum likelihood estimator. Also, this paper presents two computer programs which provide additional tools for estimation when there are data missing. The computer program REACTOR creates Rubin's factorization table (1974); without this program, his algorithm is only feasible for small data sets. This factorization provides a useful summary of the data and the missingness pattern. REACTOR may break the problem into several simpler estimation problems and, more importantly, it can show that some parameters are not identifiable. For these reasons, the data should be analyzed by REACTOR before estimation is performed. The. second computer program is MISSMLE which calculates the. A. maximum likelihood estimate of the mean vector (6 ) arid variance- . , covariance matrix (4 ) from a multivariate normal data set with missing 95 observations. (1970). It uses Orchard and Woodbury's iterative procedure Maximum likelihood is a widely used method in statistics and is asymptotically optimal in the setting of this paper. • • ■ . ■ . . . . . . . • ' • , ' . . For large . ■ data sets, MISSMLE should be used. The major results presented in this paper are a Bayes estimator, and an empirical.Bayes estimator of the mean vector, 8 , from a multivariate.normal data set with missing observations. The ,empirical Bayes estimator resembles the popular ridge-regression estimator and shrinks the maximum likelihood estimator towards an a priori mean vector. From what is known about the complete data case, it was anticipated that the empirical Bayes estimator could be an improvement over maximum likelihood, in the small sample situation, for simul­ taneous estimation of the means under squared error loss. As a preliminary step in deriving the empirical Bayes estimator, the Bayes estimate, 0.^ , under squared error loss is found; A. multivariate normal prior with mean (j> and variance-covariance matrix p ... .. , ' I A., is assumed. .The Bayes estimate has the following properties; i 1 . . I. ■ - ■ ■ ■ ■ / • ■ ; v • • ■ : . ■ / . \ . : : ’ It has the same form as a ridge-regression estimator. 2. It is biased. 3. Under mild assumptions, it. is consistent for 0 and it Is asymptotically equivalent to • 0 . , > 96 4. Under squared error loss, the Bayes risk of Q v ^ is strictly ' less than the Bayes risk of 5. If the true 8 . P 2 6 satisfies Z 6 ./A. < I, then the risk of 6 ^ ' is ■ The converse is not true* 1 strictly less than the risk of 8 . However, the Bayes estimator is a function of the A.'s, which are not usually known. Replacing the A 's by estimates converts the Bayes estimate into an empirical Bayes estimate, a (EB) 0 . The estimates of the A ^ ’s are found from the unconditional distribution of the sample. In this paper, a method of moments estimate of the A ^ ’s was used. The resulting empirical Bayes estimate has the following properties: 1. It has. the same form as a ridge-regression estimator. 2. It is biased. 3. 
Under mild conditions, it is consistent for 0 and it is ,- a asymptotically equivalent to The estimate examples 6 v 8 ^ 6 . is easy to compute, and in several numerical improved on 8 . It is assumed that the process causing the missingness can be ignored, and that every variable is observed at least once. Neither of these assumptions are particularly restrictive or unreasonable. The data are assumed to come from a (p)^variate normal distribution, MVN( 8 , $ )• pxl pxp While.this is a common assumption in statistical work, 97 it may be restrictive in practice. Some work, has been done on testing for multivariate normality and the researcher should verify that the data come from a multivariate normal distribution. Robustness of the empirical Bayes estimator, however, Would be. a useful area for future work. The most restrictive assumption made is that 4 is known. This is a restriction which needs to be removed in future Work, as will be discussed later. There are a number of potentially profitable ways that the results of this paper can be extended and generalized. The most important result needed is a theorem dealing with the improvement in the risk of ^ /TTTJY Sv ^ ' compared to 0. useful. - ' An unbiased estimate of this improvement would be Also, other estimates of the A^'s could be considered. An empirical Bayes estimator with the elements of 4 also replaced by estimates could be the next step.. This would remove the restriction that i is assumed known. • if such an estimate can be developed which . v : ' / v : , does as well as, or better than, the maximum likelihood estimate^ it ■ would provide a simple., noniterative, procedure for. simultaneous esti­ mation of the means, .for any pattern of missingness,. The final example of chapter 5 shows how one can handle, for now, the case of in an ad hoc way. 4 unknown ■ Estimation of the means is of limited use without the-associated inference procedures— confidence limits and hypothesis tests, At this 98 point, only the asymptotic distribution of procedures. 0 ^EB^ is available for such For extending the results, of this paper to statistical inference, further work, including a simulation study, is needed. BIBLIOGRAPHY Afifi, A.A., and Elashoff , R.M. (1966), "Missing Observations in ■Multivariate Statistics, I. Review of the Literature," Journal of the American Statistical Association, 61, 595-603. ____ (1967), Missing Observations in Multivariate Statistics, II. .Point Estimation in Simple Linear Regression," Journal of the American Statistical Association, 62, 10-29. j . (1969), "Missing Observations in Multivariate Statistics, III. Large Sample Analysis of Simple Linear Regression," Journal] of the American Statistical Association, 64, 337-358. 1 _____ (1969a), "Missing Observations in Multivariate Statistics, IV. A Note on Simple Linear Regression," Journal of the American Statistical Association, 64, 359-365. Anderson, T.W. (1957), "Maximum Likelihood Estimates for a .Multivariate Normal Distribution When Some Observations are Missing," Journal of the American Statistical Association, 52, 200-203. Beale, E.M.L., and Little, R;J.A. (1975) , "Missing Values in Multivariate Analysis," Journal,of the Royal Statistical Society , Series B , 37, 129-145. Bhoj, D.S. (1978), "Testing Equality of Means of Correlated Variates with Missing Observations on Both Responses , 11 Biometrika, 65,. 225-228. Blight,. B.J.N. (1970), "Estimation from a Censored Samplfe for the Exponential Family," Biometrics, 57, 389-395. Buck, S.F. 
(1960), "A Method of Estimation of Mis.sing Values in . Multivariate Data Suitable for Use with an Electronic Computer," Journal of the Royal Statistical Society, Series B , 22, 302-306. DeGroot, Morris H . (1970), Optimal Statistical Decisions, New York: McGraw-Hill, Inc.', Dempster, A.P., Laird, N . M ; a n d Rubin, D.B. (1976), "Maximum Likfelitiobd from Incomplete Data Via the EM Algorithm," Research Reports S-38\NS-320, Department of Statistics, Harvard University. 100 Draper, N.R., and Guttman, I. (1977), "Inference from Incomplete Data on the Difference of Means of Correlated Variables: A Bayesian Approach," Technical Report #496; Department of Statistics, University of Wisconsin, Madison. Edgett, G.L. (1956), "Multiple Regression with Missing Observations. Among the Independent Variables," Journal of the American Statistical Association, 51, 122r-131. Efron, B., and Morris, C. (1972), "Empirical Bayes on Vector Obser­ vations: An Extension of Stein's Method," Biometfika, 59, .335-347.: . .Ekbohm, G. (1976) , "On Comparing Means in the.Paired Case with Incomplete Data on Both Responses," Biometrika, 63, 299-304. Ferguson, Thomas S. (1967), Mathematical Statistics, New York: Academic Press. Graybill, Franklin A. (1969) , Introduction to Matrices with Applications in Statistics, Belmont, California: Wadsworth Publishing Company, Inc.. . . • _____ (1976), Theory and Application of the Linear Model, Belmont, California: Wadsworth Publishing Company, Inc. . : Hamdan, M.A., Pirie, W.R., and Khuri,. A.I. (1976), "Unbiased Estimation of the. Common Mean Based bn Incomplete Bivariate Normal Samples," Biometrische Zeitschrift, 18, 245-249. Hamilton,. Martin A. (1975), "Regression Analysis When There Are .Missing Observations— Survey and. Bibliography," Statistical Center Technical Report #1-3-75, Montana State University. Hartley, H.Q., and Hocking, ft.R., (1971) ; "The Analysis of incomplete Data," Biometrics. 27, 783-823. Hinkins, Susan M. (197.6), "MISSMLE - A Computer Program for Calculating Maximum Likelihood Estimates When Some.Data are Missing," . Statistical Center Technical Report #1-6-76, Montana State . University. ' _______ (1976a), "REACTOR - A Computer Program to Create Rubin's Table for Factoring a Multivariate Estimation Problem When Some Data are Missing," Statistical Center Technical Report #1-12-76, Montana State University. Hocking, R.R., and Smith, W.B. (1968), "Estimation of the Parameters in the Multivariate Normal Distribution with Missing Observations," Journal of the American Statistical Association, 63, 159-173. ■• _ (1972), "Optimum incomplete Multinormal Samples," Technometrics; 1 4 , 299 -3 0 7 .' ; Hberl, A.E., and Kennard, R.W. (1970), "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, 12, 55-67. Hudson, H.M. (1974), "Empirical Bayes Estimation," Technical Report #58, Department of Statistics, Stanford University, Stanford, .. California. James, W., and Stein, D. (I960), "Estimation with Quadratic Loss," Proceedings of the Fourth Berkeley Symposium, I, 361-379, Lin, P.E. (1971), "Estimation Procedures for Difference of Means with Missing Data," Journal of the American Statistical Association, 6 6 , 634-636. . ____ (1973), "Procedures for. Testing the Difference of Means with Incomplete Data," Journal of the American Statistical Association, . 68, 699-703. Lin, P.E., and Stivers, L.E. (1974), "On Differences of Means with "Incomplete Data," Biometrika, 61, 325-334. . 
' (1975)., "Testing for Equality of Means with Incomplete Data on One Variable: .A. Monte-Carlo S t u d y Journal of the American Statistical Assdciationj 70, 190-193. Little, R.J.A.; (1976),."Receht Developments in Inference About Means . and Regression Coefficients for a, Multivariate Sample with Missing :Values/' Technical Report #26, department of Statistics, University of Chicago. 102 ______ ■ (1976a), "Inference About .Means from Incomplete Multivariate Data," Blometrika-, 63, 593-604. Lord, F.M. (1955),."Estimation of Parameters from Incomplete Data," Journal of the American Statistical Association, 50, 870-876. Louis, T.A., Heghinian, S ., arid Albert, A. (1976), "Maximum Likelihood Estimation Using Pseudo-Data Iteration," Boston University Research Report #2-76, Mathematics Department and Cancer Research Center, . Boston University. Mehta, J.S., and Gurland, J. (1969), "Some Properties and an Application of a Statistic Arising in Testing Correlation," Annals of Mathematical Statistics, 40, 1736-1745< _____ (1969a), "Testing Equality of Means in the Presence of Correlation," Bibmetrika, 56, 119-126. _____ . (19.73), "A Test of Equality of Means in the Presence of Corre­ lation and Missing Data," Biometrika, 60, 211-213. Mehta, J .S ., and Swamy, P.A.V.B. (1974), "Bayesian Analysis of a Bivariate Normal Distribution When Some Data are Missing," Contributions to Economic Analysis, 8 6 ., 289-309. Milliken, G.A., and McDonald, L.L. (1976), "Linear Models and Their Analysis in the Case of Missing or Incomplete Data: A Unifying Approach^" Biometrische Zeitschrift, 18, 381-396. Morrison, D.F. (1971), "Expectations and Variances of Maximum Likelihood Estimates of the Multivariate Normal Distribution Parameters with. Missing Data:," Journal of the American Statistical . Association, 6 6 , 602-604. ._______ _ (1973), "A Test of Equality of Means of Correlated Variates with Missing Data on One Response," Biometfika, 60, 101-105i. Morrison, D.F., and Bhoj, D.S. (1973), "Power of the Likelihood Ratio Test on the Mean Vector of the Multivariate. Normal Distribution, with Missing. Observations," Biometrika, 60, 365-368, Mudholkar, G.S., and Subbaiah, P. (1976), "Unequal Precision Multiple Comparisons for Randomized Block Designs under Nonstandard Condi­ tions Journal of _the American Statistical Association, 71, 429-434. : ■ ■ 103 Naik 5 U.D. (1975), "On Testing Equality of Means of Correlated Variables with Incomplete Data," Biometrika 5 62, 615-622, Nicholson, G.E. (1957), "Estimation of Parameters from Incomplete Multivariate Sampled," Journal of the American Statistical Association, 52, 523-526. Orchard, T., and Woodbury, M;A. (1970), "A Missing Information Prin­ ciple: Theory and Applications," Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, ■ I, 697-715. Rao, C. Radhakrishna (1973), Linear Statistical Inference and Its Applications, New York: John Wiley and Sons, Inc. Rubin, D. (1971), Contribution to the discussion of "The Analysis of Incomplete Data" by Hartley and Hocking, Biometrics, 27, 808-813. :_____ (1974), "Characterizing the Estimation of Parameters in Incomplete-Data Problems," Journal of the American Statistical Association, 69, 467-474. _____ _____ (1976), "Inference and Missing Data," Biometrika, 63, 581-592. (1977), "Comparing Regressions When Some Predictor Values Are Missing," Technometrics, 18, 201-205. Sundberg, R. 
(1974), "Maximum Likelihood Theory for Incomplete Data from an Exponential Family," Scandinavian Journal of Statistics, 1, 49-58.

_____ (1976), "An Iterative Method for Solution of the Statistical Equations for Incomplete Data from Exponential Families," Communications in Statistics, B5, 55-64.

Wilks, S.S. (1932), "Moments and Distributions of Estimates of Population Parameters from Fragmentary Samples," Annals of Mathematical Statistics, 3, 163-203.

Woodbury, M.A. (1971), Contribution to the discussion of "The Analysis of Incomplete Data" by Hartley and Hocking, Biometrics, 27, 808-813.

Woodbury, M.A., and Hasselblad, V. (1970), "Maximum Likelihood Estimates of the Variance-Covariance Matrix from the Multivariate Normal," SHARE National Meeting, Denver, Colorado.