Using incomplete multivariate data to simultaneously estimate the means

advertisement
Using incomplete multivariate data to simultaneously estimate the means
by Susan Marie Hinkins
A thesis submitted in partial fulfillment of the requirements for the degree of DOCTOR OF
PHILOSOPHY in Mathematics
Montana State University
© Copyright by Susan Marie Hinkins (1979)
Abstract:
This paper presents two estimators, a Bayes and an empirical Bayes estimator, of the mean vector from
a multivariate normal data set with missing observations. The variance-covariance matrix is assumed
known; the missing observations are assumed to be missing at random; and the loss function used is
squared error loss.
These two estimators are compared to the maximum likelihood estimate Both estimators resemble the
ridge-regression estimate, which shrinks the maximum likelihood estimate towards an a priori mean
vector. Both estimators are consistent for 0 and asymptotically equivalent to the maximum likelihood
estimate. Small sample properties of the Bayes estimator are found, including conditions under which it
has smaller risk than the maximum likelihood estimate. While small sample properties of the empirical
Bayes estimator are more difficult to find, numerical examples indicate that under some conditions the
empirical Bayes estimator may also improve on the maximum likelihood estimator.
This paper also presents two computer programs which provide other useful methods for estimation
when there are data missing. The program RFACTOR creates Rubin's factorization table. The program
MISSMLE calculates the maximum likelihood estimates of the mean vector and variance-covariance
matrix from a multivariate normal data set with missing data; it uses Orchard and Woodbury's iterative
procedure. USING INCOMPLETE MULTIVARIATE DATA
TO SIMULTANEOUSLY ESTIMATE THE MEANS
by
SUSAN MARIE HINKINS
A thesis submitted in partial fulfillment
of the requirements for the degree
of
DOCTOR OF PHILOSOPHY
in
Mathematics
Approved:
Chairman, Examining Coimnittee
MONTANA. STATE UNIVERSITY
Bozeman, Montana
August, 1979
ill
ACKNOWLEDGMENT
.
.
.
I wish to thank my thesis adviser Dr. Martin A. Hamilton for his
advice and assistance throughout my graduate work, and for suggesting
the topic of this thesis.
Thanks are also due to Ms. Debra Enevoldsen
for taking so much care in typing this manuscript, and to R, M. Gillette
for.drawing the flow charts.
I also wish to acknowledge my parents for their encouragement and
support throughout my education.
iv
TABLE OF CONTENTS
CHAPTER
PAGE
1.
INTRODUCTION ............... .
2.
BACKGROUND MATERIAL
2.1
2.2
2.3
2.4
2.5
3.
................................
.........................
.
5.
5
Maximum Likelihood Estimation . . . . . . . . . . . . . . .
Bayes and Empirical Bayes Estimation . . . . . . . . . .
.
Estimation with Incomplete Data ..........................
Maximum Likelihood Estimation with Incomplete Data . . . .
Estimating Linear Combinations of the Means When There
are Missing Data in a Bivariate Normal Sample . . . . .
5
5
11
16
29
RUBIN'S FACTORIZATION T A B L E ................................
3.1 Creating Rubin's Factorization Table ........ . . .
3.2 The Computer P r o g r a m ...........................
3.3 Sample P r o b l e m ...............
4.
I
40
* . . .41
45
52
ORCHARD AND WOODBURY'S ALGORITHM FOR MULTIVARIATE NORMAL
DATA ............. .......... '........................ .. . . .
54
4.1 The Iterative Algorithm
4.2 The Computer.Program
4.3 Sample Problem .
54
57
60
........ . . . .
............ ............ .. .
MAXIMUM LIKELIHOOD, BAYES, AND EMPIRICAL BAYES ESTIMATION
OF 6 WHEN THERE ARE MISSING DATA . . . . . . . . .. . . . . . .
65
5.1
5.2
5.3
5.4
5.5
65
66
69
69
71
Assumptions
.............
More Notation
. . . . . . . . . . . . . . . .
..........
Sufficient Statistics .
.. . , . . . . . .
. . .
The M.L.E. of 6 When $ is Known .. . . V . . . . .
. . . .
Bayes Estimation of 0. When | is Known . . . . . . . . . .
5.6 The Risk of the Bayes Estimate 6 ^
Compared to the
M.L.E. 0 . . . . . ... .'. .' . ... .. . . . . . . . -.
74
5.7 Empirical Bayes. Estimation of £ When | is Known . . .. . .80
5.8 Examples . . . . ■■ . . . .■. . . . . .■ . . .-. . ^
. . 86
6.
SUMMARY
. ...
. ..................
. ... . . . ... .1.. . .
94
BIBLIOGRAPHY ..........................................................99
V
ABSTRACT
This paper presents two estimators, a Bayes and an empirical Bayes
estimator, of the mean vector from a multivariate normal data set with
missing observations.
The variance-covariance matrix is assumed known;
the missing observations are assumed to be missing at fandom; and the
loss function used is squared error loss.
These two estimators are compared to. the maximum likelihood estir
mate. Both estimators resemble the ridge-regression estimate, which
shrinks the maximum likelihood estimate towards an a priori mean, vector
Both estimators are consistent for,8 and asymptotically equivalent to
the maximum likelihood estimate. Small sample properties of the Bayes
estimator are found, including conditions under which it has smaller
risk than the maximum likelihood estimate. While small sample prop­
erties of the empirical Bayes estimator are more difficult to find,
numerical examples indicate that under some conditions the empirical
Bayes estimator may also improve on the maximum likelihood estimator,
This paper also presents two computer programs which provide other
useful methods for estimation when there are data missing. The program
REACTOR creates Rubin's factorization table. The program MISSMLE cal­
culates the maximum likelihood estimates of the mean vector and
variance-covariance matrix from a multivariate normal data set with
missing data; it uses Orchard and Woodbury's iterative procedure.
I.
INTRODUCTION
The statistical analysis of partially missing or incomplete data
sets is an important problem which occurs in many subject areas.
Responses to questionnaires often include unanswered questions or
undecipherable responses.
data.
Mechanical failures can cause incomplete
For example, several weather measurements (temperature, wind
speed, wind direction, etc.) may be recorded automatically on an hourly
basis, but a recording device may malfunction occasionally or run out
of ink, causing missing data.
Human error, including computer error,
may cause data to be unobserved or lost.
A seemingly simple solution to analyzing incomplete data is to
discard any observation which is not complete.
It may be impractical,
however, to discard information which may have been expensive or diffi­
cult to collect.
In some situations, discarding incomplete observations
may amount to discarding the entire.experiment.
Since incomplete data
may occur for any type of analysis— maximum likelihood estimation,
analysis of variance, factor analysis, linear models analysis, contin­
gency table analysis, etc., the problem of statistical analysis when the
data are not complete touches every branch of statistical methodology.
One of the most basic statistical analyses is simultaneous esti-r
mation of the variate means from a multivariate data set.
primary concern of this paper.
This is the
Chapter 2 gives a brief review of
current literature dealing with estimation and hypothesis testing
2
when there are missing data in a multivariate data set.
Much of the
work described is concerned with multivariate normal data.
Rubin's method for factoring the likelihood of the observed data,
when there are data missing, is described in chapter 3.
is a useful tool, especially in large data sets.
This technique
When possible, it
factors the big problem into a few simpler estimation problems.
The
algorithm for creating the factorization is hard to use unless the data
Set is small and has a simple pattern of missingness.
As part of this
research project, I have written a FORTRAN computer program which
creates Rubin's factorization table for any pattern of missingness.
It displays the pattern of missingness, the number of observations with
each pattern, and the factorization, if any, of the problem.
Maximum likelihood estimation is an important, often-r-used statis­
tical technique.
When there are data missing, calculation of the
maximum likelihood estimate of the mean and variance-covariance matrix
of a multivariate normal distribution is a nontrivial problem.
Orchard
and Woodbury (.1970) developed an iterative procedure for finding these
estimates.
I have written a computer program which calculates the
maximum likelihood estimates using Orchard and Woodbury's procedure.
It is discussed in chapter 4.
The approach taken in this paper (chapter 5) is to consider Bayes
and empirical Bayes estimates of the mean, 0, of a multivariate normal
distribution from a random sample where there are data missing.
These
3
estimates are compared to the maximum likelihood estimate.
Some
assumptions are made about the process that causes missingness, but
the only assumption made about the pattern of missingness is that every
variable is observed at least once.
The variance-covariance matrix is
assumed to be known.
A multivariate normal prior distribution, with mean vector Q and
diagonal variance-covariance matrix, is assumed on 0, and the Bayes
estimate under squared-error loss is found.
The Bayes estimate is
similar to a ridge-regression estimate, which is a shrinkage estimator,
shrinking the maximum likelihood estimate towards 0.
The Bayes esti­
mate has smaller Bayes risk than the maximum likelihood estimate, under
squared error loss, and for values of 0 within a specified ellipsoid
it has smaller risk than the maximum likelihood estimate.
The Bayes
estimate is biased, but under mild conditions it is consistent and
asymptotically equivalent to the maximum likelihood estimate.
However the Bayes estimate is not always practical since it
depends on knowledge of the variances in the prior distribution.
Using
the unconditional distriuution of the data, estimates of the variances
of the prior distribution can be found.
These estimates are substituted
into the formula for the Bayes estimate of 0 and the resulting estimate
is an empirical Bayes estimate of 0.
The idea here is that, this esti­
mate, although based entirely on the observed data, will retain some of
the nice properties of the Bayes estimate.
The empirical Bayes estimate
4
is in the form of a ridge-regression estimate.
It is biased, but under
the same mild conditions, it is consistent and asymptotically equiva­
lent to the maximum likelihood estimate.
A few numerical examples were
done and the empirical. Bayes estimate compared well to the maximum
likelihood estimate.
errors.
The basis of comparison was the sum of squared
2.
BACKGROUND MATERIAL
2.1 Maximum Likelihood Estimation
Let 0 = (0^...0p)’ be a p x I vector of parameters with values in
a specified parameter space ft, where ft is a subset of Euclidean p-space.
The vector X = (x^.-.x^)' denotes the vector of observations and f(x|0)
denotes the density of X at a value of 0.
The likelihood function of 0
given the observations is defined to be a function of 0 »l(x|0), such
a
that l(x|0) is proportional to f (X|0).
If the estimator 0 = 0 (X)
maximizes £(x|0), then 0 is a maximum likelihood estimate (M.L.E.) of 0.
It is often convenient to deal with the log likelihood,
A
L(X;0) = ln(l(x|0)), in which case 0 maximizes L(X;0).
When L(X;0) is a
differentiable function of 0 and sup L(X;0) is attained at ah interior
■■
■
0eft
point of ft, then the M.L.E. is a solution of the maximum likelihood
equations 3L(X;0)/30 = Q.
There are densities where a M.L.E. does not exist or is not unique.
Under regularity conditions, the M.L.E. exists, is consistent for 0, and
is asymptotically efficient (Rao 1973).
a .
If 0 is the M.L.E. of 0 and h(0) is a function of 0,. then the.
M.L.E. of h(0) is h(0) (Graybill 1976).
,
2.2 Bayes and Empirical Bayes Estimation
An alternative method of estimation is Bayes estimation, following
from Bayes' theorem which was published in 1763.
The essential compo­
nents are the parameter space ft, a prior distribution on the parameter,
6
observations X in Euclidean n-space, and a loss function L S (0,0) which
measures the loss occurring when the parameter 0 is estimated by
0 = 0(X).
It is assumed that there is a family of density functions
{f (x|0),0 E fi} for an observation X.
For any 0 = 0(X) in Euclidean
p-space, the risk function is defined as
R(0,0) = Ex |0 (LS(0,0)) =/LS(0,0(X))f(x|0)dX
(Ferguson 1967).
It will be assumed that the only estimates 0 being
considered are those where R(0,0) is finite.. For the given probability
distribution function P on
the Bayes risk is defined as
r(0) = E 0 (R(0,0)) =/R(0,0)dP(0)
. .
" (B)
A Bayes estimate, if one exists, is a value 0
such that
r(0(B)) = inf r(0).
It can be shown that the Bayes estimate also minimizes
a
.
■
E 0 |x (LS(0,0(X))) (DeGroot 1970).
.
■
■
.
'
If X has probability density function
(p.d.f.) f (x|0) and 0 has p.d.f. f (0), then Bayes' theorem states.that.
•f (0 |x) = cf (x|0)f(0), where c is not a function of 0.
The p.d.f. f (0)
is called the prior p.d.f. and f (0|X) is called the posterior p.d.f.
Bayes estimation is particularly suited to iterative estimation; from
previous experience, one increases and refines one's knowledge about 0.
The prior distribution represents what is known about the parameter
prior to sampling; it can indicate quite specific knowledge or
7
relative ignorance.
The posterior distribution can also be written
f(6 |X)
= k
£(x|6)f(0)
where £(X 10) is the likelihood function and k is not a function of 0.
The likelihood function can be thought to represent the information
about 0 coming from the data.
It is the function through which the
data, X, modifies the prior knowledge about 0.
If the data could not
provide additional information about 0, i.e, if the prior totally
dominated the likelihood, one would usually not bother to gather data
and calculate an estimate of the parameter.
Therefore one usually
wants the prior to be dominated by the likelihood; that is, the prior
distribution does not change very much over the region in which the
likelihood is appreciable and the prior does not assume large values
outside this region.
For example, if x^.Xg,...,^^ is a random sample,
i
x ^ I0 normally distributed with mean 0 and variance o
2
2
(Mt(0 ,a )), and
. 9
2
(a,v ), then the prior is dominated by the likelihood if v is large
.
2
compared to o /N.
We will be concerned specifically with Bayes estimation under
squared error loss, L S (0,0) = (0 - 0 ) ’V(0 - 0) where V is a symmetric,
positive definite matrix of constants.
estimate is 0 ^
In this case the unique Bayes
= E (6 |x).and the Bayes risk is r ( 0 ^ )
= tr(V Var(0 |x))
(DeGroot 1970), where tr(M) indicates the trace of the matrix M.
8
2.2.1
James-Steln and Ridge-Regression Estimators
. .
Consider the simple situation where X |-6 has a multivariate Ijibrmal .
distribution with mean vector 9 and variance-covariance matrix
I
(X|6 ~ MVN(9,Ip)), where I
is the p x p identity matrix, n = p, and
the parameter, space U is Euclidean p-space.
The M.L.E. of 0 is 0 = X;
this is also the least, squares estimate, and for each i, i = 1,2,...,p,
9^ = x_^ is the best linear unbiased estimate (b.l.u.e.) of 0^,
Suppose
the loss function of interest is simple squared error loss, i.e. V = I^,
then 0 is also the minimax.estimate; that is, 0 minimizes max(R(0,0)).
0
James and Stein.(1960) showed that the estimator
0 ^
= (I - (p - 2)/X'X)X, which is neither unbiased nor a linear
A
estimate of 0, has smaller mean squared error than 0 when p > 3; that
is,
ex
|q (0 ^
- 6) ' (6 ^
- 0) < Ex |g(0 - 0) '(0 - 0).
Therefore, when
p > 3, the M.L.E. 0 is inadmissible under squared error loss, being
dominated by the James-Stein (J-S) estimate;
The J-S estimate can be constructed as an empirical.Bayes (E.B .) 1
estimate.
Let the prior distribution on 0 be MVN((), (A)I^) where A is
a positive constant*
Under squared error loss, with V = I
, the Bayes
p
estimate is (I - 1 / ( 1 + A))X. If A is not known, 1/(1 + A) must be
,estimated.
When 1/(1 + A) is replaced by its estimate, one has an
E.B. estimate.
The M.L.E. based on f (X) of 1/(1 + A) is p/X’X; however
if one uses the unbiased estimate (p - 2)/X,X for 1/(1 + A), the
.resulting. E.B. estimate is identical to thie J-S estimate.
This E.B.
,
9
derivation clearly shows that the J-S estimate amounts to shrinking t h e .
least squares estimate towards JJ).
James and Stein also offered a simple extension of this estimate,
namely the estimate 8* = (I - (p - 2)/X'X)+X where
(u)+ = u if u > 6
= 0 if u < 0 .
The idea is that (p - 2)/X'X is estimating 1/(1 + A) which is between 0
and I.
The modification 0* fixes the estimate of (1/(1 + A ) ) at I if
(p - 2)/X'X
I.. Efron and Morris (1972) look at J-S type estimators
when X^ ~ M V N I ^ ) .
Hudson (1974) gives a good general discussion of
E.B. estimators and extends the theory of J-S estimation.
The point 0 towards which the J-S estimator shrinks the least
squares estimator is not special.
The J-S estimator can shrink towards
any given 6^; the appropriate estimate is then
8 ^ ) = 0Q + (I - (p - 2)/(X - 8g)' (X - 80)) (X - 0q ) .
Or suppose one:
suspects that 0 has a prior distribution 0 ^ MVN(y^,(A)I^), where, both
y and A are unknown scalars, and ^ is the p x I vector of I's,
the appropriate J-S estimate is
0 (S) =
+ (I - (p - 3) / (X
x p ' (X - x p M X - x p where x =
Then
p
Z x^ ,
i=l
So, without loss of generality, we will continue to. consider estimators
which shrink the least squares estimate towards the zero vector.
10
Hoerl and Kennard (1970) proposed the ridge-regression estimator,
(R-R), originally for application to the general linear model and least
squares estimates.
towards
It is a shrinkage estimator which shrinks the M.L.E
as does the J-S estimator.
The R-R estimator can also be
derived as a Bayes or E.B. estimator.
Let X = (x- ...x ) be the M.L.E. of 6' = (6 ...0 ). An R-R estiI p
l
p
.
* (R)
mator is one which can be expressed as
= (I - h^(X).)x^ ,
i = 1,2,...,p.
It shrinks the M.L.E. towards Q, but each component is
not shrunk by the same proportion.
A J-S type estimator is of the form
" Zg)
= (I - h (X))X^ ; each component is shrunk by the same proportion.
2.2.2 An Example
Suppose. X^ = ( x ^ . ..x^p)., i = I .... N, is a random sample;
X ' I6 ~ MVN(0,|) where | is the p x p diagonal matrix (cr..) , ff... > 0,
1
_
N
. _
_
13
JJ
j = 1,2,...,p. Let X =
2 X '/N A (x ...x )'. Then.
j
. . i
—
-I
*p'
i=l
x|G
MVN(G,(1/N)|) and X is the M.L.E; of 0,
For the loss function
L S (0,0) = N(0 - 0)'I ^(0 - 0), the risk is R(0,X) = p.
Suppose one assumes a prior 8 ~ M V N (0,(A)|) where A is a positive
scalar constant.
Then the Bayes estimate of 6 is 6 ^
where 0^B ^ = (I - 1/(1 + NA))X v^.
(I - 1/(1 + NA))p < p.
= ( 8 ^ \ . . 8 ^ )'
-L
p
The Bayes risk of 0^B ^ is
Replacing 1/(1 + NA) by an estimate one gets an
E.B. estimate which is a J-S estimate.
If the a
are known,
•
JJ
(p - 2)/(N)X'|~^X is an unbiased estimate, with respect to the
11
unconditional density f (X), and the J-S estimate is
(I - (p - 2)/(N)X'| ^X)X.
The risk of the J-S estimate is
p - (p - 2 + (p - 2) (I - 1/(1 + NA)) - 2 0 fEx j0((p - 2)/(XtIf1X) T 1X)).
If in fact 0 = 0
(and A = 0 ) ,
the risk (and Bayes risk) of the J-S
estimate is 2 compared to the risk p of the M.L.E.
Suppose, instead, that one assumes a prior 0
A is still a positive scalar constant.
MVN(0,(A)I^) where
For example (Hudson 1974), one -
might be measuring the incidence of food poisoning in several towns.
Data from a large town would give an estimate with low variability
compared to data from a small town.
However it could be that the
variability in true incidence rates is the same for all size towns.
In this case the Bayes estimate of 0^ is 0 ^ ^ = (I -
which is an R-R estimate.
yVn)
The. Bayes risk is r(0v ') =
+ NA))x ^
P
I. NA/(a., + NA)
1=1
11
A /T)\
which is strictly less than p, and r(0' ') decreases as A decreases.
If the a., are known, then an E.B. estimate, which is. also an R-R esti-
11
/
mate, is found by replacing I/ (cr^
+ NA) by an estimate, which might be
obtained using the unconditional distribution X ~ MVN({|),(1/N)if + (A)I^)
2.3 Estimation with Incomplete Data
Missing or partially specified observations in multivariate data
may occur in almost any type of experiment.
The primary interest in
this discussion is the problem of estimating a parameter 0 from a
12
multivariate distribution.
Let the p x I random vectors X
^
be independently, identically distributed with p.d.f. f^(x|6).
the
element of X^ by
.,X^
Denote
Instead of observing X^,Xg,...,X^,
however, only Y . = Y .(X ) can be observed.
are incomplete.
,
In this sense, the data
This,general problem encompasses missing data, par­
tially specified data, and other general situations.
When the data are
incomplete, inference must be based on the joint p.d.f. of the Y^'s,
fyCYf,.. .,Yjj
JI0) .
The problem of missing data is the particular case
of incomplete data where each Y^(X^) is a function taking X^ onto a
specific subset of elements in X^.
For example, if in the first obser­
vation all variables are observed and in the second observation only.
the first two variables are observed, then Y^(X^) = X^ = ( x ^ x^g.,.x^)
and Yg(Xg) = (Xg1 Xgg}.
2.3.1 Missing at Random
An assumption often made in analysis of missing data is that the
missing observations are missing at random (MAR).
The data are.said to
be MAR if the distribution of (the.observed pattern of missing data}
given (the observed data and the (unobserved).values of the missing
data}, is the same for all possible values of the missing data (Rubin
1976).
That is, the observations that one tried to record are missing
not because of their value, but because of some random process which
causes, missingness in some simple probabilistic process.
Censored data
13
is an example of missing data that is not MAR.
Another common assumption made is that one can ignore the process
that causes missing data.
This means that the missingness pattern is
fixed and not a random variable, that the observed values arise from
the marginal density of the complete (though not fully observed) data.
Let
= (x\^...x^p) be the complete observation.
Let
^ indicate the
vector of values of X. that are observed and let X^™^ indicate the
I
I
values that are missing.
If X^ has a p.d.f. f^(X^|6), then one is
assuming that X ^ ^ has a distribution given by the density function
(m)
J fxCx1 1e)dxj
Rubin (1976) points out that ignoring the process that causes
missing data is not always valid— even if there are no data missing.'
However, he shows that for Bayesian analysis and likelihood inference,
the analysis ignoring the missingness process gives the same results
as the complete analysis, taking the.process into account, if two
criteria are met; namely, (I) the data are MAR, and (2) the parameter
of the missingness process is distinct from 9, the parameter of the
data.
The parameter, <j>, of the missingness process is said to be dis­
tinct from 0 if their joint parameter space factors into a <j>-space and
a 0-space, and if, when prior distributions are given for <j> and.9,
these distributions are independent.
As an example, suppose weather
data is being recorded by an instrument that has constant probability,
<j>, of failing to record a result, for all possible samples.
In this
14
example, <f> is distinct from 0, the parameter of the data.
example .is censoring.
Another
Suppose a variable U is only recorded if U _> <j>,
where <j> might, be the population mean.
Then <j) and 0 are distinct only
if <f> is known apriori.
Rubin concludes that the statistician should explicitly consider
the process that causes missing data far more often than is common
practice.
Although modeling the missingness process should be attempted
it is usually difficult, if not impossible, to get the information
necessary to perform such modeling.
.
2.3.2 Rubin’s Factorization Table and Identifiability of 0
Suppose a random sample has been taken and there are missing
values.
If one can assume that the values are MAR and that the
parameters are distinct, then Rubin (1974) describes an important
algorithm which provides a method by which one can often factor the
original estimation problem into smaller estimation problems.
Suppose
N observations are taken, each of which is a complete or partial
realization of p variables.
The factorization is summarized in a table
which identifies, the parameters associated with each factoir and the
observations to be used to estimate these parameters.
The M.L.E.or-
Bayes estimates of the parameters associated with a factor, which are
found using only the indicated observations; are identical to the
estimates calculated using all available data,
(Anderson (1957)
15
suggested the idea of factoring the likelihood function; however he
only considered M.L.E. for MVN data.)
It may happen that because of the pattern of missingness, certain
parameters are not identifiable.
parameter values 8^ and
That is, there may exist two distinct
say, such that ^ ( Y ^ , ... ,YjjI0 ) =
fY (Yi.... Y n Iq 2) except on sets of probability zero with respect to both
p.d.f.’s.
Especially with a large data set where each observation con­
sists of many variables, it may not be immediately obvious which
parameters are identifiable.
Incorrect and misleading results may be
obtained by estimating parameters which are not identifiable.
Rubin's
factorization table helps determine which parameters, if any, are not
identifiable.
In general, if there are fewer cases than parameters associated
with one of the factors, then one has an identifiability problem.
Inspecting the data to see that all pairs of variables have cases in
common is not enough to guarantee estiinability.
100 cases with three variables x;Q > x;L2,xi3’
51-100 only x ^
Suppose that on the cases.
is measured, on cases 4-50 x ^
on cases 1-3 x ^ , X ^ > arid x ^ are measured.
For example, consider
and x ^
are,measured, and
Using standard notation
for conditional normal distributions, the joint p.d.f. of all 100
observed vectors can be factored into
16
11 * ....... 100, 1 ' 12'
f (x-1 >•• • >x
’X50,2’X13’X23’X33l9®1) =
100,11yI'0!^
f C x 1 2 , ' " , x S O , 2 Ix 1 1 . . . X 1 0 0 , 1
;a2-l’P2-l’cr2*l^
2
f ^x13’X23’X 33'Xl l ’*" * X100,1'X1 2 ' " '’X50,2;cl3-12’331-2 ’^32,1 ’CT3*12^. * ■
But the parameters in the last p.d.f. on the right-hand side of the
equation are not identifiable.
Rubin's factorization is an important tool in estimation when
observations are missing.
It provides a method for possibly simplifying
the estimation of 0, summarizes the information available, and charac­
terizes the problems associated with a particular pattern of missingness
While creating the factorization table by hand is conceptually easy, it
is practically impossible for any but small data sets.
Chapter 3
describes a computer program, REACTOR, which creates Rubin's fac­
torization table for a given set of data.
2.4 Maximum Likelihood Estimation with Incomplete Data
Various algorithms for calculating the M.L.E. of the parameter of
a multivariate distribution, when the sample contains incomplete obser­
vations, have been discussed in the literature.
Let X in x be the complete data vector with sampling density
f (x|0), 0 e fi.
are incomplete.
Let L(X;0) = In f (x|0).
The observed data, Y = Y(X),
The unobserved X is known only to lie in x O O > the
17
subset of x determined by Y = Y(X).
densities for the observed data.
data, g(Y I0) ^
Let g(Y|6) be the family of
Then in the special case of missing
^ (X|0)dX, where it is assumed that the process
producing missingness can be ignored.
Let L(Y;0) = In g(Y;0).
The
goal then is to find Q* e Q which maximizes L(Y;0).
Dempster, Laird, and Rubin (D, L, and R) (1976) have developed a
general iterative algorithm for computing the M.L.E. when the obser­
vations can be viewed as incomplete data.
The algorithm is called the
EM algorithm because each iteration, from 0
steps:
to 0
(r+1)
, involves two
the expectation step and the maximization step.
to be E(L(X;0) |Y,0^).
is:
(r)
E-step:
of 0 e
Define Q (0|0^)
Then the general EM iteration, 0 ^
compute Q(010 ^ ) ;
M-step:
which maximizes Q(0|0^r^).
choose
-> g (r+l) $
to be the value
The heuristic idea is that one
would, ideally, like to find the value 0 which maximizes L(X;0).
Since
L(X;0) is not known, because X is riot known, maximize E(L(X;0) |Y,0'’ ')
at each step.
Let D10Q (0 I0A ) = 3Q(0|0a )/30, and p2OQ(0|.0A ) = 32Q(0 |0A ) / (30)2 .
Then the usual procedure, is to choose 0
D ^ Q (6
|0^)
This way, 8(^+1)
0<r>.
■
(r+1)
to satisfy
■■
= 0 and D 2^Q(0^r:*"1^ |0^r^.) nonpositive definite.
at least a local maximum of Q(0|0^r^), for fixed
•
Some pertinent questions to ask are:
is there a unique global
maximum of L(Y;0); if so, does the algorithm converge; if.it converges,
18
is it to a value 0* which maximizes L(Y;0); and how fast does it
converge?
case.
D, L, and R answer some of these questions for the general
Under assumptions of continuity and differentiability, they
show that if the EM algorithm converges to 0* and if D ^ Q (0* 10*) is
negative definite, then the limiting 0* is a local or global maximum
of L(Y;0).
If, in addition, each g(r+l)
the unique max of Q(0|0^r^ ) ,
the EM procedure monotonically approaches this maximum; i.e., .
L ( Y ; 0 ^ + ^ ) _> L(Y;0^r^).
They also give a formula which describes
the rate of convergence.
They note that if the (Fisher) information
loss due to incompleteness is small, the algorithm converges rapidly.
The amount of information lost may vary across the elements of 0, so
certain components of 0 may approach 0* more rapidly, using the EM
algorithm, than others.
The EM algorithm is very general; it could be applied to many
distributions f (x|0); ■However it may be very complicated dr even
impossible to apply for certain problems,
D, L, and R work out several
specific examples, both in terms of specifying f (X|0) ,and specifying
the function Y = Y ( X ) .
Orchard and Woodbury
,
(0
and W).
(1970)
also derive a special case
. of the EM algorithm, as a consequence of their missing, information
principle, for obtaining the M.L.E, of a parameter 0 when the p.d.f. of
X^ is f (Xjs) but X^ is not completely observed.
The variables in X^
can be partitioned X^ - .(Y^,Z^) where Y^ is, the vector of. observed
19
components and
contains the variables that were unobserved.
Beale
and Little (1975) give a detailed derivation of 0 and W's missingness
information principle, with a slightly different emphasis.
In their
discussion, it is easy to see that this is another example of an EM
algorithm.
The idea is to treat the missing components, Z^, as a random
variable with some known distribution, f (z |y ;6^).
find 0 which maximizes L(Y;0).
The goal is to
It may be easier, however, to find
the value of 0 which maximizes the expected value of L(Y,Z;0) if Z
is treated as a random variable with known distribution.
So, maximize
E z |.y.0(L(Y,Z;0A)) for any fixed 0^; note, that this is D, L, and R's
Q ( 0 10.) .
A
Suppose 0 = 0
ID
maximizes E ,
Z Iy £U
(L(Y,Z;0 )); then define the
A
transformation 0 by 0 = 0(0.). The transformation $ will define the
m
A
(r)
(r+1)
equations for each iteration 0 v
-> 0 ^
, and repeated iteration is
done until there is no appreciable change.
0 is a fixed point of 0; i.e., 0 = 0(0).
They show that the M.L.E.
Also, if the likelihood
function is differentiable, any solution of the fixed point equations,
0 = 0(0), automatically satisfies the likelihood equations and is a
relative max or a stationary point; therefore the procedure won't
converge to a relative minimum.
20
2.4.1
Maximum Likelihood Estimation when f (x|6) is of the Regular
Exponential Family
Sundberg
(1974, 1976)
worked but the special case where
f(x|6)
the form of the regular exponential family; f (x|0) = b(X)e®
has
/a(6) ,
where 6 is a p x I vector of parameters and t(X) is a p x I vector of
complete-data sufficient statistics.
D, L, and R also looked at this
case in detail and point out that Sundberg's algorithm is the EM
algorithm.
Sundberg shows that the likelihood equations can be written
as E^Ct(X)|Y) = E^Ct(X)).
(He attributes this result to unpublished
notes of Martin-Lof (1966).)
solving these equations.
as 0^r+^
He defines an iterative algorithm for
An iteration from 0
(r)
to 0
(r+i)
is defined
= m t_"*"(E(t|Y,0^r^)) , where m^(0) = E0( t ( X ) ) a n d assuming m ^
exists.
The EM algorithm when f (X|0) is of the exponential family is:
E.-step:
compute t ^
= E(t(X) |Y,0^r^); M-step;
the solution of the equations Eg(t(X)) = t^r^ .
determine
as
Therefore, Sundberg's
algorithm is the same as the EM algorithm.
Again, the same questions are pertinent.
does L(Y;0) have a unique global maximum?
In this special case,
Perhaps not; since Y does
not necessarily have an exponential distribution, there may be several
roots of the likelihood equations and it may be that none of them give
a global maximum.
Simdberg (1974) gives an example and Rubin, in his
discussion following Hartley and Hocking (1971), gives two examples
where there, is not a unique M.L.E.
21
Does the. algorithm converge to a relative maximum of L(Y;0)?
From
D, L, and R, we have that if the algorithm converges to 0* and if
D
20
Q(0*10*) is negative definite, then 0* is a local or global max.
0
Let 0 be the true parameter vector.
Var
q
Sundberg shows that if
(E 0 (t(X)|Y)) is positive definite, then for large N the algorithm
converges, say to 0*.
For the case of exponential families,
D ^ Q ( 0 * I0*) = -Var (t I0*) , which is. negative definite.
Therefore, for
large N, if Var -(E .(t|y )) is positive definite, the algorithm con0
0U
verges to a value 0* where 0* gives a relative maximum of L(Y;0).
Also, since Q(010 )
is convex in 0, if all 0 ^
of fi, then each Q Cr*1"!)
are in the interior
the unique maximum of Q(010 )
and the
convergence is monotone.
Let J y (9) be the Fisher information matrix when observing Y(X);
J (0) = E(3L(Y;0)/90)^ = Var(8L(Y;0)/30).
y
J (0) = Varg(Eg(t|Y)).
Sundberg notes that
Therefore, the necessary condition for con-
y
o
vergence can be restated as the condition that J (0 ) is positive
y
definite.
'
Let J (0) be the Fisher information matrix when observing
x
X; Jx (6) = Var0 Ct(X)).
Then Jy(B) = J 35 (S) - E q (Varg (t|Y)) and
the matrix Eg (Varg (t|Y)) could be considered the loss of information
due to observing Y(X) instead of X.
The matrix J "^(B)Eg (Varg (t|Y))
could be considered a measure of relative loss of Fisher information
due to observing Y rather than X.
‘
22
Sundberg defines the factor of convergence, of the algorithm, in
the following way:
asymptotically as r ^ 00, the error |6^r^ - 6*|
decreases by this factor at each iteration.
He shows that if J (6^)
.
y
is positive definite, the factor of convergence for the algorithm
0 ^
-> 0* is the maximal eigenvalue of J ^ ( O ) E g (VarQ (t|Y)) , which
will be < I.
So again, this implies that the smaller the relative
loss of information, due to observing Y(X) instead of X, the faster
the algorithm will converge.
Louis, Heghinian, and Albert (1976) consider a slightly less
general problem.
They are interested in the problem of finding a
M.L.E. of the parameters where the data are a sample from a regular
exponential family but the data are partially specified in that one
knows for each X. that X. e R., a subset of the reals.
i
l
i
if the data are observed exactly,
missing, then R. = the reals.
For. example,
= {X^}; if an observation is
Their iterative method replaces missing
or. partially specified data with estimates, which are found using.a
current set of (estimated) parameter values.
This pseudo-data, com­
plete now, is used to form a new closed form estimate of the parameters
This procedure continues.until the.sequence of parameter values con­
verges; they show that, under regularity conditions, it does converge
to a maximal point.
The algorithm for calculating the estimate of the
parameter in each iteration is the EM algorithm for the exponential
family with partially specified data.
23
Blight (1970) found the M.L.E. of parameters of a regular
exponential family when the observed data are censored in a specific
way.
Within a certain region one can observe the values exactly; for
data falling outside this region, only grouped frequencies are known.
2.4.2
Notation and Definitions for Patterns of Missing Observations
Let
be a I x p random vector.
A random sample X^,Xg,...,X^ is
taken but there are missing data in that for each observation i, only a
subset, Y^, of the variables (x^^,x^g,...,x^^) is observed.
the N x p
Let M be
incompleteness matrix; the entry (ij) in M is I if x ^
observed and 0 if x . . is not observed.
1J
is
Replace all rows of M having
the same 0-1 pattern with one row having that pattern, and call the
resulting incompleteness matrix M*.
Let k be the total number of
distinct patterns of observed variables among the N observations.
Then M* is a k x p matrix.
Let M* = (m. .) .
Missingness pattern i
■
th
refers to the missingness pattern denoted by the i
row of M*.
A data set is said to have a monotone pattern of missingness if
the matrix M* can be rearranged so that if
all i = 1,2,...,Z.
= I for
In other words, if the random vectors can be
ordered such that the variables in
observed in Y^
= I, then m ^
are a subset of the variables
for all i, the random sample is said to have a
monotone pattern of missingness,
24
For example, suppose there are 3 variables which were to be
recorded and the following observations were seen, where
indicates
a missing value.
Y1 = (3,6,7)
Y 2 = (*,*,8)
Y 3 = (8,1,3)
.
Y4 = (2,0,-1)
Y 5 = (*,4,2)
Then
/111
M=I
/ 001
111
\ HI
AOll
The rows of M* can be reordered so that
M* =
/111
Oil
Vooi
Therefore, this data has. a monotone pattern of missingness.
Without loss of generality, the observations can be reordered
such that the first n(l) observations are those with missingness
25
pattern I; the next n(2) observations are those with missingness
pattern 2; etc.
The last n(k) observations are those with the
missingness pattern of the U t*1 row of M*.
2.4.3 M.L.E. of Parameters of a MVN Distribution
Orchard and Woodbury (0 and W) (1970) work out the specific
algorithm for estimating the mean vector and the variance-covariance
matrix when the data are from a MVN distribution and there are missing
components in the random sample.
and Hasselblad (1970).
This is also discussed by Woodbury,
The MVN distribution belongs to the exponential
class of distributions, and 0 and W's algorithm is a special case of
Sundberg's algorithm and the EM algorithm.
However, 0 and W have
worked out the details for using the algorithm in this special case,
so that the algorithm is computer programmable.
The details for this
algorithm and a computer program are given in chapter 4.
Another method of finding the M.L.E. is the method of scoring.
Let the Lt*1 score be S^(6) = dL(X;0)/d0^, where L(X;0) is the loglikelihood of X |0, and let J^(0) be the Fisher information matrix
(E0 (SiSj)).
The likelihood equations are (S^...Sp) 1V= 0, which may
be difficult to solve.
The method.of scoring is an iterative procedure
based on estimating the score by the first.two terms of its Taylor’s
expansion.
The second term is actually estimated by its expected
I
value.
iterate,
The (I + I)-Iterate, 0 ^ + ^ ,
e(1), by 9(1+1)
is then obtained from the previous
0(i) + Jx1 (6(:L))(S1 (6(i)). ..Sp (e(1)))
The equations for the iterative method of scoring when the obser­
vations are from a MVN distribution and some components are missing are
worked out by Hartley and Hocking (1971) and are restated by Little
(1976).
Hartley and Hocking’s development of this procedure stems from
treating the missing observations as unknown parameters, rather than as
random variables as 0 and W did.
Little (1976) compares the method of scoring and 0 and W ’s
iterative procedure for solving for the M.L.E. of the means, ju, arid, the
regression coefficients,
when there are missing data in a multi­
variate sample from a MVN distribution, where one variable is specified
as dependent.
Point estimation, confidence intervals, and hypothesis
tests are discussed.
The advantages of 0 and W ’s method over the
method of scoring are:
i) large matrix inversions are avoided;
ii) it is easy to program; and ill) it provides fitted values for the
missing variables., The major advantages of the method of scoring over
0 and.W’s algorithm are:
i) as a. by-product, it produces an estimate
of the standard errors of the estimate; and ii) it converges faster,
at a quadratic rate compared to a linear rate for 0 and W (Little 1976),
The most important advantage of the method of scoring is that it
provides estimates of the standard errors.
However, under certain
circumstances, any algorithm can easily estimate the standard error
27
of the estimate.
Hocking and Smith (1968) give equations for the
estimated covariance matrix of the estimates when the data have a
monotone missingness pattern with either 2 or 3 patterns of missingness;
i.e., M* is either a 2 x p o r a 3 x p
missingness, when estimating just
are easily obtained (Little 1976).
matrix.
For any pattern of
estimates of the standard error
Also, Beale and Little (1975)
suggest a procedure for estimating the standard error of
which is
easily obtained and performed well in a simulation study.
Beale and Little, in the same article, compare six methods for
finding the M.L.E. from a MVN sample which has missing components.
The methods considered include ordinary least squares on complete
observations only, iterated Buck (to be described later), and three
weighted least squares procedures.
The simulation was done with
variable #1 always identified as the dependent variable, x . , and from
2-4 independent variables, X25X^,x^.
Let x ^ , j = I , ... ,4, be the
value of the J t*1 variable in the It*1 observation.
The criteria used
to judge the estimator^ was
J 1fxU - »0
where the b .'s are the estimated regression coefficients using a
particular method, and the x ^ ' s are the true values of all variables
before deletion.
28
The methods that they found best were the iterated Buck's and a
method of weighted least squares.
The iterated Buck's procedure is a
modification of Buck's method for finding M.L.E.'s of parameters from,
a multivariate data set, not necessarily normal, which has missing
data (Buck 1960).
The iterated Buck procedure, also called the cor­
rected M.L.E. procedure, is equivalent to 0 and W s procedure except
that the variance-covariance M.L.E.'s are. multiplied by N/(N - I) where
The iterated Buck procedure and 0 and W s
N is the sample size.
algorithm gave almost identical results.
The method of weighted least squares is as follows.
Get an
estimate of the covariance matrix of all variables, $, and the mean
vector,
-
jj,
1
using the corrected M.L. procedure.
' I ■ ’ ^
Using | and
for
each observation i, missing values are estimated by the estimated
conditional mean of the unobserved independent variables given the
observed independent variables.
Let a
denote the.residual variance
2
of x^ when all independent variables are fitted and let a^ be the
conditional variance of x .. given the observed variables in observation i.
2
Let s
and s, be the corresponding estimates.
2 2
W i = s Zsi
0
.
11
2
Define
if the dependent variable Xii is observed
.otherwise .
Note that if all independent variables and x._ are present iii obsef;
\
vation i, then W 4 = I.
11
Then a weighted least squares analysis with
29
weights
is carried out on all the data with
present.
There are many papers in the literature dealing specifically
with linear models and least squares estimation when there are missing
data.
Afifi and Elashoff (1967, 1969, 1969a) and Hamilton.(1975) .
provide surveys and bibliographies for this area.
2.5 Estimating Linear Combinations of the Means When There are
'
Missing Data in a Bivariate Normal Sample
Let X. be a I x 2 random vector which is distributed bivariate
I
normal with mean vector
011
°1 2
I.
0
= (B^Gg)' and variance-covariance matrix
A random sample is taken and there are missing data.
012 022
denoted by the incompleteness matrix M*.
Assuming that there is at
least one complete, observation, there are two possible forms that the
incompleteness matrix M* can.take, M*
2.5.1 Estimation of
6
.=.0^ - 0^ When M* = ^ ^
. This is a monotone pattern of missingness where there are h(l)
vectors of complete observations, X'.= (x. 1 ,x 0) , and n( 2 ) observations
■.
•
.
1
11
12
.
of X^ alone, X^ = (x^^), i = l,2,...,n(2). N = n(l) + n(2). Let
30
n(l)
(I)
Z
X 'j
—
n (l)
/n(l)
j =
=
? x
i=l
/n(2 )
— ■ (IN
_
(I )
(xij - x -j
)(xik - x -k
) ,
Consider the class
;
1 ,2
13
(2)
X *1
^jk "
x
i=l
3
,k =
1 ,2
.
of linear combinations which can be written as
- -'(2)
Z(t) = A(t)x<1^1^ + (I - A(t))x.1
" _
- x
I
x *2
CD
where A(t) = (n(l) + n(2)t)/N and t is a function of the complete
—
(i)
pairs and is uncorrelated with x ^
When | is known, the M.L.E. is Z C a ^ ) and so belongs to C^.
M.L.E. is the minimum variance unbiased estimate of
When
.• •
is not known, a simple estimate of
4
6
6
The
.
• —
—
(i)
is x ^ - x ^
, where
—
,
. ’—
m
x ^ is the mean over all N observations of variable I, and x,^ . is the
meant over all observations of variable 2. . This estimate is also in C 1
since it.equals Z(O).
M.L.E. is .ZCa^/ajj)
The M.L.E. of
.6
also belongs to this class; the
(Anderson 1957) :and .(Lin 1971).
Mehta and Gurland
(M arid G) (1969) suggest Z( . 2 a { a + a^)) as an estimate of
is appropriate in the. neighborhood a^
+ ^ 2 2 ^ w^en CTll^ a
coefficient.
22
=
a22 ~
6
which
The M.L.E. of p is
where p is the population correlation
Lin (1971) did a simulation study comparing the simple
31
estimate Z(O), the M.L.E. ZCa 1 0 Za11), M and G's estimate
IZ 1 1
Z C Z a ^ / ( a ^ + a^g)), and the estimate Z{a.^/a^.
moderate values of n(l), 14
He looked at
n(l) j< 101. ■ The estimates were
evaluated by their efficiency E(Z(t^) -
2
6
) /E(ZCtg) -
6
)
2
which equals
Var(Z(t^))/Var(ZCtg)) since all estimates in this class are unbiased.
Lin found that when p is in the neighborhood of 0, the simple estimate
Z(O) is most efficient.
When
p
f 0, but 0 ^
Z(2a^g/(a^ + Bgg)) is most efficient.
of
6
If
a 2 2 = 1» M and G's
p
= 0, Z(O) is the M.L.E.
= Ogg, Z(2a^g/(a^^ + agg)) is the M.L.E, of S.
and if
IP I = .1 and CT2 j
/ a 22
—
Z ^a12^a22^ is inost efficient.
When
In all other
circumstances, Lin's statistic, the M.L.E. Z(a^g/a^) is most
efficient.
These are not surprising results; the M.L.E. is efficient.
Mehta and Swamy (1974) use a Bayesian approach; the emphasis is
on evaluating the effect of using the extra observations in estimation.
They place a non-informative prior on 0, assume $ is known, and show
that the Bayes estimate is the M.L.E. of 6.
They compare
f ( 6 Icomplete data only) to f (6 |all observed data).
They find that the
means of the two distributions are not markedly different, but the
extra observations significantly reduce the variance,
11
2.5.2 Estimation of
6
When M*
10
01
This is an example of a pattern of missingness which is not
monotone.
There are n(l) complete observations,
= (x^^x^) >
i = I,...,n(l), an additional n( 2 ) observations on x^ alone,
, i = l,...,n(2), and an additional n(3) observations on
Xg alone, X^ = (x^g), I = I,-- ,ri(3).
-
(I)
x .
=
n(1)
Z
x
/n(l) , j =
1=1
J
N = n(l) + n(2) + n(3).
1 ,2
Let
;
J
n( 2 )
E x, /n(2 ) ;
i=l
~
(3)
n(3)
x. 2 . =. E x
f
i=l
ajk ' ^ 1-
/n(3) ;
- ri)
- (I)
^3
K x ik - x-k ). -
•
Consider the class C 0 of estimators which can be expressed as
' 2
■
.
■ ' . ■ ■ ■ : ■
■
.
■
.
WCr1U1V). ” A(T1U)X,^^^+(l-A(r1u))x^(2L8(r1v)x,2^L(l-B(r1v))x
where
2 .
A(r,u) = n(l)(n(l)+n(3)+n(2)u)/((n(l)4n(2)) (n(l)+n(3))-n(2)n(3) r^)
33
and
B(r,v) = n(l) (n(l)-hi(2)+n(3)v)/((n(l)+n(2))(n(l)+n(3))-n(2)n(3)r2) .
Note that the class
of section 2.5.1 is the subclass of
when
n(3) = 0 .
When I is known, the M.L.E. of
6
is w ^P>ai2^aii,ai2^a22^ w^iere P
is the population correlation coefficient.
When $ is not known, the M.L.E. does not have a closed form, but
it can be calculated for a given set of data.
A simple estimate is the
mean of all observations on x. minus the mean of all observations on
Xg,.which is equal to W(0,0,0).
Lih and Stivers (1974) suggest using a modified M.L.E,, using the
M.L.E. of $ obtained using only the complete data.
6
That is,
= W(a^g//a^ag^,a^g/a^^,a^g/agg). It.is unbiased, asymptotically
normally distributed, and asymptotically efficient, as n(l) -* <»,
n(j)/n(l) -* Cj, where 0 < C j < « > , j
= 2,3.
Mehta and Swamy (1974) also looked at this problem using a
Bayesian approach.
However, finding f (6 [observed data) involves a
numerical technique which is time consuming and expensive.
They do
one specific example where they compare f(&|complete data) to
f(6 |all observed data) and find that both the mean and variance are
affected.
They also compare f (6 |observed data) to
f(6 || = |, observed data), where $ is replaced by its M 1 L 1 E i, |, in
34
two specific examples.
They found that f (6 || = | , observed data)
does not provide a good approximation to f (6 |observed data) for small
samples.
However if the degrees of freedom of the distribution of
f (6 (I = I , data) are reduced by the number of parameters estimated.
by the data, it provides a better approximation.
Hamden, Pirie, and Khuri (1976) look at the special case where
0
I
= 0 .
They consider a more specific class of estimators than the
L
class
.
They give an unbiased estimate of the common mean which
minimizes the variance of the estimate.
2.5.3 Hypothesis Tests About Linear Combinations of the Means
Recent literature has been more concerned with testing the
hypothesis H^ : c '0 = 0, than in estimating c'0.
If | is known, the
problem is not complicated; c ’0/Var(c'0) can be calculated where 0 is
the M.L.E. of 0, and Var(c'0) will be a function of |.
has a known t-distribution.
This statistic
If | is unknown, but the sample size is
A
A
A
A
A
large, the test statistic c'0/Var(c'0) can be used, where Var(c*0) is
found by substituting the M.L.E. for the elements of | into the func­
tion Var(c'0).
be used.
The asymptotic distribution of this statistic can then
For small sample problems, one can often find an exact test,
i.e., a test whose exact distribution is known, by discarding some
data.
The emphasis in the current literature has been on finding tests,
using all available data, for which one can find exactly, or
35
approximately, the small sample distribution.
The situation is compli­
cated in that the most powerful test, for a given size a, appears to
vary depending on what can be assumed about p or about CT2j / a 2
2
*
Little (1976a) suggests a class of statistics based on linear
combinations of sample means to consider for testing
: c'
6
=
0
.
His goal is to find statistics which have known or well-approximated
distributions and lose little in efficiency, when compared.to the
A
•statistic using the M.L.E., c' 6 .
(Efficiency is measured in terms of
f\,
A
Var(cT0)/Var(c?0).)
In a simulation study, he compares several
statistics and their approximate distributions for testing Hg : Gg =
and H n :
U
6
0
= 0.
Morrison and Bhoj (1973) consider the power of the likelihood
ratio test (l.r.t.) of H_ : c 1 6 = 0 vs H
: c'8
u
a
f 0 when there are
MVN(0,$) data which has missing observations such that the incom/
pleteness matrix can be written M* = I
1 1
11
.
...I \
IQ " o )"
1 1
They compare the
l.r.t. using all available data to the test using only complete pairs.
When
, •
2
is known, the l.r.t. is distributed as a noncentral % and
4
always has higher power than the test using only the complete obser-r
vations.
When $ is unknown, they consider two specific examples and
find that again, the l.r.t. using all observed data is better, has
higher power, than the test based only on the complete data.
attribute the generalized l.r.t. for hypotheses on
0
They
, when data are
36
from a MVN(0,|) distribution, $ unknown but nonsingular, to
R. P. Bhargava.
2.5.4
Hypothesis Tests about
6
When M * = ^
Lin (1973) discusses tests of Hr. :
U
)
< <5_ vs H
6
—
0
a
:
6
> Sr..
0
He
considers four special cases defined by the extent of knowledge one
has about .|, in terms of p and d = cr^^/a 2
2
‘
^ach case he gives an
exact test; if this exact test involves discarding data, he also
suggests tests based on all available data and gives their approximate
distribution.
Mehta and Guriand (M and G) (1969a, 1973) propose a statistic T
for testing H^ : <S = 0.
ZfZa^g/Ca^i + a2 2 ^
It is based on their estimate
anc* *s aPProPr:i-at:e if d = I.
They give constants
k for applying the test T > k for a size a = .05 test.
When I is known, Morrison (1973) finds the l.r.t. statistic for
testing H q :
6
= 0.
It has a t-distfibution under H q .
When $ is not
known, he suggests replacing $ with | found from complete pairs.
He
gets an approximate t-distribution for the associated test statistic.
Naik (1975) proposes a test statistic for.testing H q :
H
a
:
6
6
= 0 vs
^ 0 which is based on the simple statistic Z(O) and is designed
such that the size of the test does not exceed the pre-assigned level
a when p _< 0 and in fact cannot exceed a as long as a^/d^ > 2pn(l)/N.
In a simulation study, he found that his test is more powerful than
37
the paired t-test when p
O and for small positive values of p.
also gives a test statistic for
:
6
=
0
vs
:
6
<
0
He
, and gives
some comparison to Lin's test statistic and M and G's test.
In a simulation study, Lin and Stivers (1975) look at the powers .
and levels of significance for the testing procedures of Liri, M and G,
Morrison, and the paired t-test on complete data only, for testing
Hq . :
6
= 0.
They find that when p >; .9, the paired t-test is always
most powerful.
They give criteria, in terms of sample size and p, for
establishing the preferred test to use when p < .9,
Draper and Guttman (1977) use a Bayesian technique, assuming
squared error loss and a noninformative prior on
testing of H q : 6 = 0 ;
the M.L.E. of
6
.
They find
6
0
and ^ , for hypothesis
= E (6 |incomplete data) which is also
They offer three statistics, using
with their approximate t-distributiohs for inference.
6
as the numerator,
In three exam­
ples, they compare the approximate distributions to the true distri­
butions and find, that two of the approximations are excellent even for
small samples.
smhll.
In all examples the number of incomplete pairs was
The best approximation was the t-distribution where they
matched the mean and variance of f (6 |incomplete data) and removed two
degrees of freedom to allow for this.
They compare their test statistic
to others in the literature— Naik, Lin, Morrison,.M and G.
For example,
M and G's test statistic is a special case of Draper and Guttman's.
As
38
well as suggesting a usable test. Draper and Guttmah provide a brief
summary of the work done in this area.
2.5.5 Hypothesis Test about
6
When M* = ( 10
oiy
The notation is the same as that used in section 2.5.2.
When
n(l), n(2), and n(3) are large, an exact test can be found using the
a
M.L.E. of
6
, 5, and its asymptotic distribution.
The problem then is
to find tests which can be used when asymptotic results are not
appropriate.
Using only the n(l) complete pairs, the paired t-test for testing
Hq :
6
=
0
can be used; it is an exact test.
Lin and Stivers (1974) suggest four statistics for testing
Hq
6
= 0 which use all available data.
One is based on the estimate
\Ll^22’a12^al l ’vL2^a22^ and the ot^ ers a^e based on W ( O iO 1 O) ,
the difference in sample means.
for each statistic.
They give approximate t-distributions
Ekbohm (1976) does a simulation study comparing
these statistics and two others.
.
Some of the tests are based on the
heteroscedastic case and some utilize a homoscedasticity assumption.
.
Ekbphm finds that when the population correlation is medium or large, .
or one does not know anything about its value, the tests based on
W ^al 2 ^ a11^22’a12^al l ’a12^2?^ are to, be recommended,
He gives specific
recommendations which depend on the relation between n(l), n( 2 ) , arid
n(3).
:
39
Bhoj (1978) proposes two test statistics which are intuitively appealing for testing H0: δ = 0. Let t1 be the paired t-test based on complete pairs. Under H0, t1 has a t-distribution with n(1) - 1 degrees of freedom. When σ11 = σ22, Bhoj suggests a test T which is the weighted sum, λt1 + (1 - λ)t2, of two independent random variables t1 and t2, where t2 is a function of x̄1^(2) - x̄2^(3) and the standard pooled estimate of the variance when there are unequal sample sizes. The distribution of t2 under H0 is a t-distribution with N - n(1) - 2 degrees of freedom. The distribution of T can be adequately approximated by a t-distribution.

When σ11 ≠ σ22, Bhoj proposes a test statistic T' which is a weighted sum of t1 and t3, where t3 is Scheffe's statistic for testing equality of means with uncorrelated data and has a t-distribution with n(2) - 1 degrees of freedom under H0.

Bhoj compares his test statistics to the simple paired t-test which uses complete data only; the expected squared lengths of 95% confidence intervals were used to make the comparison. His test statistics T and T' can give considerable gain, and he gives recommendations for the choice of λ, which depends on the size of ρ and the values of n(2) and n(3) compared to n(1).

3. RUBIN'S FACTORIZATION TABLE
Rubin (1974) describes a technique which is of use when estimating the parameters of a multivariate data set which contains blocks of missing observations. The likelihood of the observed data is factored into a product of likelihoods. The result is summarized in a factorization table which identifies the parameters which can be estimated using standard complete-data techniques and the parameters which must be estimated using missing-data techniques.

Let Z be an N x p data matrix representing the potential realization of p variables on N experimental units. The rows of Z are assumed to be independently and identically distributed, with a density indexed by a vector of parameters. Let M be the N x p incompleteness matrix of 0's and 1's; M was defined in 2.4.2. The jth column of M, or Z, represents the observations on the jth variable.
Definitions:

1) The kth column is said to be more observed than the jth column if, whenever an entry in the jth column is 1, the corresponding entry in the kth column is also 1. For example, (1, 1, 1, 1)' is more observed than (1, 1, 0, 0)'.

2) Two columns are said to be never jointly observed if, whenever an entry in one column is 1, the corresponding entry in the other column is 0. For example, (1, 1, 0, 0)' and (0, 0, 1, 1)' are never jointly observed.
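These two relations are simple to check mechanically. The following is a minimal Python sketch of the two definitions (it is not part of the thesis programs, which are written in FORTRAN; the function names are illustrative only):

    import numpy as np

    def more_observed(col_k, col_j):
        # col_k is more observed than col_j if every 1 in col_j has a 1 in col_k
        return bool(np.all(col_k[col_j == 1] == 1))

    def never_jointly_observed(col_a, col_b):
        # no row in which both columns equal 1
        return not np.any((col_a == 1) & (col_b == 1))

    a = np.array([1, 1, 1, 1])
    b = np.array([1, 1, 0, 0])
    c = np.array([0, 0, 1, 1])
    print(more_observed(a, b))            # True
    print(never_jointly_observed(b, c))   # True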
3.1 Creating Rubin's Factorization Table
Rubin (1974) describes the steps for creating the factorization as follows.

Step 1. Replace all rows of M having the same 0-1 pattern with one row having that pattern, noting which rows of Z are represented by that pattern. Similarly, replace all columns of M having the same 0-1 pattern with one column having that pattern, noting which columns of Z are represented by that pattern. Call the resulting 0-1 incompleteness matrix M̃. See Table 1 for the example of M̃ that we will use to illustrate this method.
Table 1. Example of Incompleteness Matrix M̃ for a 550 x 8 Data Matrix. The reduced matrix M̃ has five rows, representing rows 1-100, 101-275, 276-300, 301-450, and 451-550 of Z, and five columns, representing columns (1,2), 3, (4,5), (6,7), and 8 of Z.
Step 2. Reorder and partition the columns of M̃ into (M̃1, M̃2) such that each column in M̃2 is either (a) more observed than every column in M̃1, or (b) never jointly observed with any column in M̃1. If this cannot be done, the pattern of incompleteness in M̃ (and M) is "irreducible," and no further progress can be made. Assuming this step can be performed, proceed to Step 3. In Table 2 see the result of Step 2 for the matrix M̃ of Table 1. Here, every column in M̃2 is more observed than all columns in M̃1.
Table 2. The First Partitioning of the M̃ of Table 1. The columns of M̃ are reordered as ((1,2), (4,5), (6,7) | 3, 8): M̃1 consists of the columns representing variables (1,2), (4,5), and (6,7), and M̃2 consists of the columns representing variables 3 and 8. The rows again represent rows 1-100, 101-275, 276-300, 301-450, and 451-550 of Z.
Step 3. Apply the procedure in Step 2 to both partitions created in Step 2. If both M̃1 and M̃2 are irreducible, stop; otherwise proceed, trying to repartition each partition created. Continue until no partition of M̃ can be further partitioned. See Table 3 for the final partitioning for our example. This partitioning was achieved by examining the M̃1 partition of Table 2 and noting that Columns (6,7) are more observed than Columns (4,5), and Columns (1,2) are never jointly observed with Columns (4,5). The M̃2 partition of Table 2 and all partitions in Table 3 are irreducible.
Table 3. The Final Partitioning of the M̃ of Table 1. The columns are ordered as ((4,5) | (1,2), (6,7) | 3, 8), so that the three final partitions, from left to right, consist of the columns representing variables (4,5); variables (1,2) and (6,7); and variables 3 and 8. The rows represent rows 1-100, 101-275, 276-300, 301-450, and 451-550 of Z.
Step 4. Summarize the final partitions in a "factorization table" as illustrated in Table 4 for our example. Labeling the final partitions from left to right, list for each partition i:

1. the "conditioned" variables--the variables (columns of Z) represented by the ith partition: say, Zi.

2. the "marginal" variables--the variables, represented in partitions to the right of the ith partition, that are more observed than variables in the ith partition: say, Zi*.

3. the "missing" variables--the variables, represented in the partitions to the right of the ith partition, that are never jointly observed with the variables in the ith partition.

4. whether it is a "complete-data" partition--one column in M̃i--or an "incomplete-data" partition--more than one column in M̃i.

5. the rows of Zi which are at least partially observed (i.e., the rows of Z represented by rows of M̃i that are not all zero).
Table 4. The Factorization Table for Table 3

    Partition   Complete or   Conditioned   Marginal    Missing     Rows of Z
                incomplete    variables     variables   variables
    1           complete      4,5           3,6,7,8     1,2         1-100
    2           incomplete    1,2,6,7       3,8         -           1-300
    3           incomplete    3,8           -           -           1-550
Each partition in M̃, and thus each row of the factorization table, corresponds to a factor of the likelihood of the observed data. The ith factor is the conditional joint distribution of the conditioned variables for that partition, Zi, given the marginal variables for that partition, Zi*. Hence, the final factorization of the likelihood of the observed data is

    Π_i f(Zi^o | Zi*^o)                                    (3.1)

where the product runs over the final partitions; Zi* is empty for the last partition, so its factor may be written as f(Zi^o). In equation (3.1), Zi^o is the collection of observed scalar random variables in the ith partition.
For the general normal model, the M.L.E. or Bayes estimates of the parameters of the ith factor, using only the indicated rows and columns in the factorization table, are identical to the estimates using all rows and columns of Z. If a partition is "complete", it involves a completely observed data matrix and the usual computational methods can be used.

The table can also indicate parameters which are not identifiable. The parameters of conditional association between the conditioned variables and missing variables given the marginal variables in each partition are not identifiable. In the example, these are the parameters of association between (4 and 5) and (1 and 2) given variables (3, 6, 7, and 8). Whenever two variables are never jointly observed, the parameters of conditional association between them are not identifiable. The table also shows the number of observations available for estimating the parameters; if there are too few observations, the parameters may not be identifiable.
3.2 The Computer Program
The FORTRAN language computer program REACTOR calculates and outputs the reduced incompleteness matrix M̃ and Rubin's factorization table. In the flow chart, a partition C1 is said to be "to the right of" partition C2 if each column in C1 is either (a) more observed than every column in C2, or (b) never jointly observed with any column in C2.
46
Star
Input the N x p
missingness matrix M.
which has dimensions n x k, say. Keep track
of which variables are associated with each
rXt
column of M and which observations are
associated with each row of 5.
Output M
I or
M is irreducible.
Output message to
this effect.
Calculate MXO = the number of columns which
cannot be "to the right of" any other column.
5 is irreducible.
Output message to
this effect.
W
Stop
47
There are NC
possible combinations
of the NK integers in C taken I at a time.
Let C , ,C„,...,C„„ be these NO combinations.
Set C* = all integers in C
which are not in C T.
Is C*
Is
I > MXO?
to the right
Is C
to the right
_ of C*?
J + 1|4
48
For all J in C
set IV(J)
For all J in C
set IV(J) = IL
set IV(J)
NXT + I
NXT = 0?
NEXT(NXT)
II = NEXT(NXT)
IL = IL + I
IL =
NXT - I
Set C = set of all J, J = I, I
such that IV(J) = II.
C defines the next partition
which the program will try to
further reduce.
A
49
Each column J 1 J = I,...,k, is associated,
now, with a number IV(J).
If IV(I) = IV(J), then I and J are in the
same partition.
If IV(I) < IV(J), then J is in a partition
to the right of the partition which
contains I.
Find IN and MAX such that IN <_ IV(J) £ MAX
for all J = I .... k, and there exists
and J 2 such that IV(J^) = IN and
IV(J0) = MAX.
M is irreducible.
Output message to
this effect.
E
50
Set NVAR equal to a
column number such that
IV(NVAR) = IL.
IV(J) < IL?
Column J is not
associated with
this partition.
The variables associated
with Column J are
conditional variables
in this partition.______
The variables
associated with
column J are
missing variables
in this partition
•''column J
more observed
vthan column,
^XNVAR?
The variables associated
with column J are marginal
variables in this partition.
51
If the number of columns associated with
conditioned variables is > I, this
partition is incomplete. Otherwise, the
partition is complete.
Output the line of the
factorization table for
this partition.
X Is
\
IL = MAX?
Set IL equal to the next largest number
such that there exists J, J = I .... k,
where IV(J) = IL.
E
52
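As an illustration of the partitioning logic just described, the following Python sketch performs Steps 2 and 3 by exhaustive search over column subsets of the reduced matrix M̃. It is a simplified stand-in for REACTOR, written under the assumption that M̃ is small; the names and the numpy representation are illustrative, not the program's actual FORTRAN code.

    import numpy as np
    from itertools import combinations

    def more_observed(a, b):
        return bool(np.all(a[b == 1] == 1))

    def never_jointly_observed(a, b):
        return not np.any((a == 1) & (b == 1))

    def to_the_right_of(M, right, left):
        # each column in `right` is either more observed than every column in
        # `left`, or never jointly observed with any column in `left`
        return all(
            all(more_observed(M[:, r], M[:, l]) for l in left)
            or all(never_jointly_observed(M[:, r], M[:, l]) for l in left)
            for r in right)

    def partition(M, cols=None):
        # recursively split the columns of the reduced matrix M~ (Steps 2 and 3)
        cols = tuple(range(M.shape[1])) if cols is None else cols
        for size in range(1, len(cols)):
            for left in combinations(cols, size):
                right = tuple(c for c in cols if c not in left)
                if to_the_right_of(M, right, left):
                    return partition(M, left) + partition(M, right)
        return [cols]        # this set of columns is irreducible

Calling partition(M_tilde) returns the final partitions from left to right; building the factorization table then amounts to classifying, for each partition, the columns to its right as marginal or missing variables.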
3.3 Sample Problem
This problem is a slight modification of the example in Rubin
(1974).
Suppose the original data consisted of the following eight
variables and ten cases (an asterisk indicates a missing value):
Variable
Case
I
I
* .
2
.2
3
4
5
6
7
8
9
10
1 .2
.5
*
*
.7
*
*
*
3
2
A
4
3
2
5
6
1 .6
I
1 .1
7
4
3
1 .1
3.0
A
A
1.5
.6
1 .0
A
A
1 .6
A
A
A
A
A
A
A
A
A
A
A
A
A
4.0
.
1.5
3.0
.9
1 .1
A
1 .0
A
A
A
A
A
I.
.5
A
A
.5
.9
2
I.
A
.5
2 .0
A
. A
2 .0
A
A
■ I.
8
.0 1
.15
1 .1
.8
.5
.5
.05
A
A
.5
The input deck, on file, is as follows (ten column fields
indicated by vertical lines):
00111111
11100111
11100001
11100001
00000001
00111111
11100111
. 00100000
00100000
00111111
The output generated by REACTOR may be summarized as follows. The reduced incompleteness matrix M̃ has 5 rows and 5 columns (the original matrix has 10 rows and 8 columns); the five new cases represent 3, 2, 2, 1, and 2 equivalent original rows, respectively. The factorization table (Rubin, JASA, 69:467) contains three partitions: partition 1 is complete and partitions 2 and 3 are incomplete. For each partition the table lists the conditioned, marginal, and missing variables (with the total number of original variables associated with each cell given in parentheses), the number of parameters, and the number of rows (3, 7, and 10, respectively). A note in the output explains that the variable numbers in the table are not necessarily the original variable numbers; the correspondence between new and original variable numbers, and between cases and partitions, is written to file #4.
4. ORCHARD AND WOODBURY'S ALGORITHM FOR MULTIVARIATE NORMAL DATA

4.1 The Iterative Algorithm

Let Xi = (x_{i1},...,x_{ip})', i = 1,...,N, be a random sample of N vectors from the p-variate normal distribution which has true mean vector θ' = (θ_1,...,θ_p) and true variance-covariance matrix Φ, with elements σ_{jk}, j, k = 1,...,p. Let X (N x p) = (X_1 ... X_N)' be the input data matrix and suppose some elements of X are missing. The N rows of X are associated with the cases and the p columns of X are associated with the variables.

If no elements of X were missing, then the sample mean vector and the sample covariance matrix would be the maximum likelihood estimates of θ and Φ, respectively. The goal of the computations is to find the M.L.E. of θ and Φ when there are some data values missing.
Hamilton (1975) presents some guidelines for the appropriateness of maximum likelihood estimation. The maximum likelihood estimates are best if the normality condition and one of the following conditions hold:

(a) the sample size is greater than 300,
(b) the sample size is greater than 75 but less than 300, and the intercorrelations are classified as "medium" or "high,"
(c) the sample size is less than 75, but the intercorrelations are classified as "high."
The Orchard and Woodbury (1970) iterative procedure to compute the maximum likelihood estimates of θ and Φ when X has missing elements has four basic steps.

1. Choose initial guesses, θ^(0) and V^(0), for the mean vector and covariance matrix: θ̂_j^(0) is the mean of the x_{ij}'s over all cases where variable j is not missing, and θ^(0) = (θ̂_1^(0),...,θ̂_p^(0))'. Next, θ̂_j^(0) is substituted for any missing value of variable j, j = 1,...,p, thereby artificially completing X to form X^(0). Then

    V^(0) = (1/N) Σ_{i=1}^{N} (X_i^(0)' - θ^(0)) (X_i^(0)' - θ^(0))' .

2. Let θ^(m) and V^(m) be the estimates of θ and Φ at the mth iteration, m = 0,1,2,.... Using θ^(m) and V^(m), the missing elements in X are estimated. The cases are completed one at a time. Suppose that case i has exactly p_2 variables missing. Let p_1 = p - p_2 be the number of variables not missing. Drop the superscript from θ and V for ease in writing formulas. Rearrange the order of variables in case i, in θ, and in V so that

    X_i = (X_{i1} | X_{i2}) ,   θ = (θ_1' | θ_2')' ,   V = [ V_11  V_12
                                                             V_21  V_22 ] ,

where the last p_2 rows and columns correspond to the missing variables. Then, letting (V_11^{-1} V_12)_{ℓj} be the (ℓ,j)th element of V_11^{-1} V_12,

    x̂_{ij} = θ_j + Σ_{ℓ=1}^{p_1} (V_11^{-1} V_12)_{ℓj} (x_{iℓ} - θ_ℓ) ,   j = p_1 + 1,...,p_1 + p_2 ,

are the estimated values of the missing variables for case i. At this time in the calculations, the p x p matrix

    V_i* = [ 0   0
             0   V_22 - V_21 V_11^{-1} V_12 ]

is calculated for use in step 3. After these calculations, the columns of X_i and the rows and columns of V_i* are rearranged to correspond to the original order of variables. This procedure, which is essentially a regression substitution method, is repeated for all cases. The resulting data matrix is X^(m+1).
3. Using X^(m+1), revised estimates θ^(m+1) and V^(m+1) are calculated:

    θ^(m+1) = (1/N) Σ_{i=1}^{N} X_i^(m+1)'

and

    V^(m+1) = (1/N) [ Σ_{i=1}^{N} V_i* + Σ_{i=1}^{N} (X_i^(m+1)' - θ^(m+1)) (X_i^(m+1)' - θ^(m+1))' ] .

4. If θ^(m+1) and V^(m+1) are essentially identical to θ^(m) and V^(m), respectively, the iterative procedure is terminated and θ^(m+1) and V^(m+1) are printed as the maximum likelihood estimates of θ and Φ.
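The four steps translate directly into code. The following Python sketch is a compact illustration of the iteration, not the MISSMLE program itself; it assumes missing values are coded as np.nan and uses a simple maximum relative-change stopping rule in place of the exact test of section 4.2.2.

    import numpy as np

    def orchard_woodbury(X, tol=0.001, max_iter=100):
        """Iterative M.L.E. of the mean and covariance of a p-variate normal
        sample X (N x p) with np.nan marking the missing values."""
        X = np.asarray(X, dtype=float)
        N, p = X.shape
        miss = np.isnan(X)
        # Step 1: initial guesses from the present data, mean substitution
        theta = np.nanmean(X, axis=0)
        Xc = np.where(miss, theta, X)
        V = (Xc - theta).T @ (Xc - theta) / N
        for _ in range(max_iter):
            Vstar = np.zeros((p, p))
            Xc = np.where(miss, 0.0, X)
            for i in range(N):
                m = miss[i]                       # p2 missing variables, case i
                if m.any():
                    o = ~m                        # p1 observed variables
                    V11, V12 = V[np.ix_(o, o)], V[np.ix_(o, m)]
                    reg = np.linalg.solve(V11, V12)          # V11^{-1} V12
                    # Step 2: regression substitution for the missing values
                    Xc[i, m] = theta[m] + (X[i, o] - theta[o]) @ reg
                    # accumulate V22 - V21 V11^{-1} V12 for step 3
                    Vstar[np.ix_(m, m)] += V[np.ix_(m, m)] - V12.T @ reg
            # Step 3: revised estimates
            new_theta = Xc.mean(axis=0)
            new_V = (Vstar + (Xc - new_theta).T @ (Xc - new_theta)) / N
            # Step 4: stop when the maximum relative change is small
            old = np.concatenate([theta, V.ravel()])
            new = np.concatenate([new_theta, new_V.ravel()])
            theta, V = new_theta, new_V
            if np.max(np.abs(new - old) / np.maximum(np.abs(old), 1e-12)) <= tol:
                break
        return theta, V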
4.2 The Computer Program

The FORTRAN language computer program MISSMLE calculates the M.L.E. of the mean vector and the variance-covariance matrix using the iterative procedure of Orchard and Woodbury. It also calculates estimates of the regression coefficients, if applicable. MISSMLE does not check that the parameters are identifiable for the given pattern of missing data. For meaningful results, the user must verify this himself; Rubin's factorization table (chapter 3) is useful in determining identifiability.
4.2.1 Flow Chart for MISSMLE

The program reads the data matrix, computes the initial means θ̂_j, j = 1,...,p, from the present data, substitutes these means for the missing values, and computes the initial covariance matrix. Each iteration then loops over the N cases: for a case with missing values the matrices are partitioned, new estimates of the missing data points are calculated by the regression substitution formula, and the elements of V_22 - V_21 V_11^{-1} V_12 are accumulated into the matrix V*; cases with all data present are left unchanged. After all N cases are processed, new means θ̂_j = Σ_i x̂_{ij}/N, j = 1,...,p, and a new covariance matrix V = (1/N)[V* + Σ_i (X̂_i - θ̂)(X̂_i - θ̂)'] are calculated. The iterations continue until the convergence test is met or 100 iterations have been done; if a dependent variable is specified, the regression calculations of section 4.2.3 are then performed.
4.2.2 The Test Used for Convergence

The user inputs a stopping value STP or the program defaults to STP = .001. At each iteration a new mean vector and a new variance-covariance matrix are calculated. At the ℓth step,

    max_T | T_ℓ - T_{ℓ-1} | / | T_{ℓ-1} |

is calculated, where T ranges over all parameters in the mean vector and variance-covariance matrix, and T_ℓ is that parameter as estimated at the ℓth step. If this maximum percent change is less than or equal to STP, the iterations stop.

A maximum of 100 iterations will be done. This was the minimum number suggested by Beale and Little (1975). If 100 iterations are done with no convergence, this result is noted in the output.
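A direct transcription of this stopping rule, assuming the current estimates are held in numpy arrays (the names are illustrative only):

    import numpy as np

    def max_percent_change(old_theta, old_V, new_theta, new_V):
        # maximum of |T_l - T_{l-1}| / |T_{l-1}| over all parameters T
        old = np.concatenate([np.ravel(old_theta), np.ravel(old_V)])
        new = np.concatenate([np.ravel(new_theta), np.ravel(new_V)])
        return np.max(np.abs(new - old) / np.abs(old))

    # iterations stop when max_percent_change(...) <= STP (default .001),
    # or after 100 iterations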
4.2.3 Regression Calculations

Standard regression terminology and notation are used in this section. If a dependent variable is specified, the regression coefficients β̂, the SSE, and R² are calculated.

Let

    V = [ σ_00   γ'
          γ      V_22 ]

be the final variance-covariance matrix, which has been rearranged so that σ_00 is the variance component for the specified dependent variable. Then the calculations done are:

    R² = γ' V_22^{-1} γ / σ_00        (= RSQD) ,
    SSE = σ_00 - γ' V_22^{-1} γ ,
    β̂ = γ' V_22^{-1}                  (= Estimated Regression Coefficients) .
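For illustration, these three quantities can be computed from any fitted covariance matrix as follows (a numpy sketch with hypothetical names, not the MISSMLE code itself):

    import numpy as np

    def regression_from_covariance(V, dep):
        """R^2, SSE and regression coefficients for the dependent variable
        `dep`, computed from a covariance matrix V as in section 4.2.3."""
        p = V.shape[0]
        others = [j for j in range(p) if j != dep]
        s00 = V[dep, dep]                   # variance of the dependent variable
        gamma = V[dep, others]              # covariances with the other variables
        V22 = V[np.ix_(others, others)]
        beta = np.linalg.solve(V22, gamma)  # V22^{-1} gamma (= coefficients)
        rsqd = gamma @ beta / s00
        sse = s00 - gamma @ beta
        return beta, rsqd, sse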
4.3 Sample Problem

The data for this example were taken from Woodbury and Hasselblad (1970). No labels were input, so those used are the default values. Variable 2 was specified as the dependent variable; all intermediate output is requested. The only value signifying missing data is 0.0. The cards below are the input deck used in this example.
ARTIFICIAL EXAMPLE OF A TRIVARIATE NORMAL
1
20
3
2
1 0.0
(3F.0)
.422,0.,0.
-1.306,0.,0.
-.125,0.,0.
-.983,0.,0.
.453,0.,0.
.274,-.065,0.
1.,.169,0.
.510,.477,0.
-.767,-.310,0.
1.075,.304,0.
-.656,-2.142,-.927
-.754,1.234,-.03
-1.111,-.297,.248
-.846,-.942,-.527
1.031,-.309,-1.264
-1.466,-1.465,-.988
.230,1.064,-1.161
.355,-.029,-.174
-.539,-1.181,-1.702
1.034,.325,.268
A partial listing of the output that was generated is summarized below. Step 11 was the last iteration; only output from that iteration is described, but this type of information was output for each iteration.

The output begins with the problem description: sample size 20, 3 initial variables, dependent variable 2, stopping value .00100000, and the numbers of present and missing values for each variable (X(1): 20 present, 0 missing; X(2): 15 present, 5 missing; X(3): 10 present, 10 missing). The initial data listing shows each missing value replaced by the corresponding initial mean; the initial mean vector is (-.108450, -.211133, -.625700)' and the initial variance-correlation matrix is

    .680716
    .432145   .587085
    .014380   .389216   .210256 .

At each iteration the program prints the completed data matrix, the new mean vector, the new variance-correlation matrix, and the maximum percent change together with the parameter at which it occurred; at step 11, the final iteration, the maximum percent change occurred at VC(3,1). The final output lists the estimated regression coefficients for the dependent variable X(2), the completed data matrix, and the values of SSE and RSQD.
5. MAXIMUM LIKELIHOOD, BAYES, AND EMPIRICAL BAYES ESTIMATION OF θ WHEN THERE ARE MISSING DATA
5.1 Assumptions

Let Xi be the 1 x p vector such that Xi, given θ, has a MVN(θ, Φ) distribution, where θ is the p x 1 vector (θ_1,...,θ_p)' and Φ is the p x p positive definite variance-covariance matrix. The elements of Φ are σ_{ij}.

A random sample of N observations of Xi is taken and there are missing observations. The incompleteness matrix M* describes the pattern of missingness, indicating which variables are observed and which are missing. Every variable is assumed to be observed at least once. It is assumed that the data are MAR and the parameter of the missingness process is distinct from θ. This implies that the process that causes the missing data can be ignored.

Recall that M* = (m_{ij}) is the k x p incompleteness matrix; k is the number of distinct patterns of observed variables. Missingness pattern i refers to the pattern of missingness described by the ith row of M*. Let n(i) denote the number of observations Xj which have missingness pattern i; 1 ≤ n(i) ≤ N and Σ_{i=1}^{k} n(i) = N. It is assumed that M*, n(1),...,n(k) are such that θ is identifiable.
5.2 More Notation

Let s_i be the number of variables observed in pattern i; s_i = Σ_{ℓ=1}^{p} m_{iℓ}. Let C_i, i = 1,...,k, be the set of subscripts j such that variable j is observed in the ith pattern; C_i = {j : m_{ij} = 1}.

Without loss of generality we can reorder the observations such that the first n(1) are those with missingness pattern 1. Let these observations be labeled X_1^(1),...,X_{n(1)}^(1), and recall that each X_ℓ^(1) is a 1 x s_1 vector with elements x_{ℓj}, j ∈ C_1. The next n(2) observations are those with missingness pattern 2 and are labeled X_1^(2),...,X_{n(2)}^(2), where each X_ℓ^(2) is a 1 x s_2 vector, etc. The program REACTOR identifies the observations which have missingness pattern i, i = 1,...,k.

Let P_i, i = 1,...,k, be the s_i x p matrix of 1's and 0's such that P_i times the p x 1 vector of all variables is equal to the s_i x 1 vector of variables that were actually observed. P_i is obtained from (m_{i1},...,m_{ip}) in the following manner. If m_{ij} = 0, skip it. If m_{ij} = 1, add another row to P_i, a row where the jth position is 1 and all others are 0. For example, if (m_{i1},...,m_{i6}) = (1 1 0 1 0 1), then P_i is the 4 x 6 matrix

    1 0 0 0 0 0
    0 1 0 0 0 0
    0 0 0 1 0 0
    0 0 0 0 0 1 .

Note that P_i P_i' = I_{s(i)}, the s_i x s_i identity matrix. Let Φ_i be the s_i x s_i matrix P_i Φ P_i'. Let B be the (Σ_{i=1}^{k} s_i) x p matrix (P_1' P_2' ... P_k')'.

It will be convenient to have a special notation for block diagonal matrices. Let the direct sum Σ⁺_{i=1}^{r} T_i, where each T_i is a q_i x q_i matrix, denote the (Σ q_i) x (Σ q_i) matrix with the matrices T_i down the diagonal and all other entries 0. For example,

    Σ⁺_{i=1}^{3} T_i = [ T_1   0    0
                         0     T_2  0
                         0     0    T_3 ] .

Let I_{s(i)} denote the s_i x s_i identity matrix. The matrix B'(Σ⁺_{i=1}^{k} n(i) I_{s(i)})B equals the p x p diagonal matrix Σ⁺_{i=1}^{p} (t_i), where t_i equals the total number of observations of variable i. By assumption, each t_i is strictly greater than 0. Let t_{ij} equal the total number of observations in which variable i and variable j are both observed; t_{ii} = t_i. Then B'(Σ⁺_{i=1}^{k} n(i) Φ_i)B is the p x p matrix whose ijth element is t_{ij} σ_{ij}.

Let S_i be the s_i x 1 vector which is the sum of the observation vectors having pattern i; S_i = Σ_{j=1}^{n(i)} X_j^(i)'. Let S be the (Σ_{i=1}^{k} s_i) x 1 vector (S_1' ... S_k')'. Then

    S | θ ~ MVN( (Σ⁺ n(i) I_{s(i)}) B θ , Σ⁺ (n(i) Φ_i) ) .

The associated vector of means is S̄ = (Σ⁺ (1/n(i)) I_{s(i)}) S;

    S̄ | θ ~ MVN( B θ , Σ⁺ ((1/n(i)) Φ_i) ) .

The sum over all observed values of variable j is denoted by x_{.j}, j = 1,...,p; that is, x_{.j} = Σ_ℓ Σ_{i=1}^{n(ℓ)} x_{ij}^{(ℓ)}, summing over all ℓ such that j ∈ C_ℓ. Then B'S = (x_{.1} ... x_{.p})'. Define

    x̄ = (Σ⁺_{i=1}^{p} 1/t_i) B'S = (x_{.1}/t_1 ... x_{.p}/t_p)' = (x̄_{.1} ... x̄_{.p})' .
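The matrices P_i and B and the counts t_i are easy to construct from M*. The following Python sketch (illustrative only; it is not one of the thesis programs) builds them for a given incompleteness matrix, the pattern counts n(1),...,n(k), and a known Φ.

    import numpy as np

    def selection_matrix(m_row):
        """P_i: picks the observed variables of pattern i from a p-vector."""
        p = len(m_row)
        rows = [np.eye(p)[j] for j in range(p) if m_row[j] == 1]
        return np.array(rows)                  # s_i x p

    def pattern_quantities(M_star, n, Phi):
        """B, the Phi_i's, and the counts t_j of section 5.2.
        M_star: k x p numpy array, n: sequence n(1),...,n(k), Phi: p x p."""
        k, p = M_star.shape
        P = [selection_matrix(row) for row in M_star]
        B = np.vstack(P)                       # (sum s_i) x p
        Phi_i = [Pi @ Phi @ Pi.T for Pi in P]
        t = np.array([sum(n[i] for i in range(k) if M_star[i, j] == 1)
                      for j in range(p)])
        return P, B, Phi_i, t

    # example: the pattern (1,1,0,1,0,1) gives the 4 x 6 selection matrix above
    print(selection_matrix(np.array([1, 1, 0, 1, 0, 1])))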
5.3 Sufficient Statistics

For all variables x_j which are observed in missingness pattern i, define x_{.j}^{(i)} = Σ_{ℓ=1}^{n(i)} x_{ℓj}^{(i)}, the sum of the n(i) values of x_j present in the observations with missingness pattern i. Let 𝒮 be the set of all x_{.j}^{(i)}, over all j ∈ C_i and over all patterns i; 𝒮 = {x_{.j}^{(i)} : j ∈ C_i, i = 1,...,k}. There are Σ_{i=1}^{k} s_i elements in 𝒮.

The set 𝒮 is sufficient, but not necessarily minimal sufficient, for θ (Little 1976a). Since S is the vector of the elements of 𝒮, S is a sufficient statistic for θ.
5.4 The M.L.E. of θ When Φ is Known

In the notation of this section the M.L.E. of θ is

    θ̂ = (B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1})B)^{-1} B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1}) S̄ ,        (5.1)

or equivalently,

    θ̂ = (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1} B'(Σ⁺ Φ_i^{-1}) S

(Hartley and Hocking 1971). The estimate is unbiased and is distributed, given θ, as MVN(θ, (B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1})B)^{-1}). The M.L.E. has the form of a weighted least squares estimate.
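As a sketch, equation (5.1) can be evaluated directly once the blocks of the direct sums are assembled; the following assumes numpy and scipy and uses illustrative names (it is not a program from the thesis).

    import numpy as np
    from scipy.linalg import block_diag

    def mle_theta(M_star, n, Phi, Sbar_list):
        """Weighted least squares form (5.1) of the M.L.E. of theta.
        Sbar_list[i] is the s_i-vector of sample means for pattern i."""
        P = [np.eye(M_star.shape[1])[row == 1] for row in M_star]
        B = np.vstack(P)
        W = block_diag(*[n[i] * np.linalg.inv(Pi @ Phi @ Pi.T)
                         for i, Pi in enumerate(P)])   # sum^+ n(i) Phi_i^{-1}
        Sbar = np.concatenate(Sbar_list)
        return np.linalg.solve(B.T @ W @ B, B.T @ W @ Sbar)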
Proposition 5.4.1. If n(i)/N goes in probability to r_i as N → ∞ (n(i)/N →^P r_i), where 0 < r_i < 1, i = 1,...,k, and S̄_i and S̄_j are independent (S̄_i ⊥ S̄_j), i, j = 1,...,k, i ≠ j, then √N(θ̂ - θ) converges in distribution to Z as N goes to ∞ (√N(θ̂ - θ) →^L Z), where Z is distributed MVN(0, (B'(Σ⁺_{i=1}^{k} r_i Φ_i^{-1})B)^{-1}).
Proof. Recall that S̄ = (S̄_1' ... S̄_k')', where S̄_i is the vector of sample means using only the data with missingness pattern i. By the multivariate central limit theorem, √n(i)(S̄_i - P_i θ) →^L Z_i ~ MVN(0, P_i Φ P_i') as n(i) → ∞, for i = 1,...,k. Therefore, using Slutsky's theorem (Rao 1973) and the assumption n(i)/N →^P r_i, this implies

    √N(S̄_i - P_i θ) →^L Z_i* ~ MVN(0, (1/r_i) Φ_i) ,   i = 1,...,k .

Since S̄_i and S̄_j are assumed to be independent for all i ≠ j, this implies that

    √N(S̄ - B θ) →^L Z* ~ MVN(0, Σ⁺_{i=1}^{k} (1/r_i) Φ_i) .

Therefore

    √N(θ̂ - θ) = (B'(Σ⁺ (n(i)/N) Φ_i^{-1})B)^{-1} B'(Σ⁺ (n(i)/N) Φ_i^{-1}) √N(S̄ - B θ)
              →^L (B'(Σ⁺ r_i Φ_i^{-1})B)^{-1} B'(Σ⁺ r_i Φ_i^{-1}) Z* ,

where Z* ~ MVN(0, Σ⁺ (1/r_i) Φ_i). Therefore √N(θ̂ - θ) →^L Z, where Z ~ MVN(0, (B'(Σ⁺_{i=1}^{k} r_i Φ_i^{-1})B)^{-1}). The proof of Proposition 5.4.1 is complete.

Note that in Proposition 5.4.1 the only assumption made on the distribution of the Xi's was that S̄_i ⊥ S̄_j. In particular, under the conditional distribution Xi | θ ~ MVN(θ, Φ), the vector S̄_i is independent of S̄_j, i ≠ j, and therefore √N(θ̂ - θ) →^L Z, where Z ~ MVN(0, (B'(Σ⁺_{i=1}^{k} r_i Φ_i^{-1})B)^{-1}).
5.5 Bayes Estimation of θ When Φ is Known

It is assumed that the prior distribution on θ is MVN(0, Σ⁺_{i=1}^{p} A_i), where Σ⁺ A_i is a p x p diagonal matrix with the positive scalars A_i down the diagonal. The loss function is L(θ̂, θ) = (θ̂ - θ)'V(θ̂ - θ), where V is any positive definite matrix of constants.

Proposition 5.5.1. The Bayes estimate of θ is

    θ̂^(B) = (Σ⁺_{i=1}^{p} (1/A_i) + B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1})B)^{-1} B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1}) S̄ ,

which can also be written as

    θ̂^(B) = (Σ⁺ (1/A_i) + B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1} B'(Σ⁺ n(i) Φ_i^{-1})B θ̂ .

The conditional distribution of θ̂^(B) is

    θ̂^(B) | θ ~ MVN( T B'(Σ⁺ n(i) Φ_i^{-1})B θ , T B'(Σ⁺ n(i) Φ_i^{-1})B T ) ,

where T = (Σ⁺ (1/A_i) + B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}.

Proof. Recall that θ̂ | θ ~ MVN(θ, (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}) and θ ~ MVN(0, Σ⁺_{i=1}^{p} A_i). Therefore

    θ | θ̂ ~ MVN( T B'(Σ⁺ n(i) Φ_i^{-1})B θ̂ , T )

(DeGroot 1970), and under squared error loss the Bayes estimate, θ̂^(B), is the mean of the posterior distribution. Because θ̂^(B) is a linear transformation of θ̂ and θ̂ has a MVN distribution, the MVN distribution of θ̂^(B) is easily described. This completes the proof of Proposition 5.5.1.
The following lemma is useful in matrix manipulation.

Lemma 5.5.2 (Woodbury's theorem). If T is a p x p matrix, U is q x p, H is q x q, and W is q x p, then

    (T + U'HW)^{-1} = T^{-1} - T^{-1} U'H (H + HWT^{-1}U'H)^{-1} HWT^{-1} ,

provided the inverses exist.

Proof. The result is easily checked by multiplication.

Using this lemma, θ̂^(B) can be rewritten as

    θ̂^(B) = ( I_p - ( I_p + (Σ⁺_{i=1}^{p} A_i) B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1})B )^{-1} ) θ̂ ,        (5.2)

which makes it resemble an R-R estimate. For example, if Φ is a diagonal matrix, σ_{ij} = 0 when i ≠ j, then θ̂_i^(B) = (1 - σ_{ii}/(σ_{ii} + t_i A_i)) θ̂_i, where t_i is the total number of observations in which x_i is observed, and θ̂^(B) is an R-R estimate.
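A minimal sketch of equation (5.2), assuming G = B'(Σ⁺ n(i) Φ_i^{-1})B and the prior variances A_1,...,A_p have already been computed (the names are illustrative only):

    import numpy as np

    def bayes_estimate(theta_hat, A, G):
        """Equation (5.2): theta^(B) = (I - (I + diag(A) G)^{-1}) theta_hat."""
        p = len(theta_hat)
        W = np.linalg.inv(np.eye(p) + np.diag(A) @ G)
        return (np.eye(p) - W) @ theta_hat

    # when Phi is diagonal, the i-th component shrinks by the ridge factor
    #   1 - sigma_ii / (sigma_ii + t_i A_i)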
Proposition 5.5.3. If n(i)/N →^P r_i, where 0 < r_i < 1, i = 1,...,k, and S̄_i ⊥ S̄_j, i ≠ j, then √N(θ̂^(B) - θ) →^L Z ~ MVN(0, (B'(Σ⁺_{i=1}^{k} r_i Φ_i^{-1})B)^{-1}).

Proof. Using equation (5.2) for θ̂^(B), Slutsky's theorem, and the results of Proposition 5.4.1,

    √N(θ̂^(B) - θ) = √N(θ̂ - θ) - √N ( I_p + (Σ⁺ A_i) B'(Σ⁺ n(i) Φ_i^{-1})B )^{-1} θ̂
                  = √N(θ̂ - θ) - (1/√N) ( (1/N) I_p + (Σ⁺ A_i) B'(Σ⁺ (n(i)/N) Φ_i^{-1})B )^{-1} θ̂
                  →^L Z - 0·( 0 + (Σ⁺ A_i) B'(Σ⁺ r_i Φ_i^{-1})B )^{-1} θ = Z ,

where Z ~ MVN(0, (B'(Σ⁺_{i=1}^{k} r_i Φ_i^{-1})B)^{-1}). This completes the proof.

Therefore, under the conditional distribution Xi | θ ~ MVN(θ, Φ), √N(θ̂^(B) - θ) →^L Z ~ MVN(0, (B'(Σ⁺ r_i Φ_i^{-1})B)^{-1}), and the Bayes estimate is asymptotically equivalent to the M.L.E. θ̂.
5.6 The Risk of the Bayes Estimate θ̂^(B) Compared to the M.L.E. θ̂

The loss function used in this section is still L(θ̂, θ) = (θ̂ - θ)'V(θ̂ - θ), where V is a positive definite matrix of constants. Consider the improvement in the risk when using θ̂^(B) rather than θ̂; define IM = R(θ, θ̂) - R(θ, θ̂^(B)). The Bayes estimate θ̂^(B) improves on θ̂ when IM > 0.

Another measure of improvement is the improvement in Bayes risk; define IM^(B) = r(θ̂) - r(θ̂^(B)). By the definition of a Bayes estimate, we know that IM^(B) ≥ 0.

Recall from equation (5.2) that θ̂^(B) can be written as (I_p - W)θ̂, where W = (I_p + (Σ⁺ A_i) B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}.
Lemma 5.6.1. The trace of a matrix is indicated by tr( ). Then

    IM = tr(W'VW (Σ⁺_{i=1}^{p} A_i)) - tr(W'VW θθ') + tr(VW (B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1})B)^{-1}) .        (5.3)

Proof. By definition,

    IM = E_{θ̂|θ}[(θ̂ - θ)'V(θ̂ - θ)] - E_{θ̂|θ}[(θ̂ - θ - Wθ̂)'V(θ̂ - θ - Wθ̂)]
       = E_{θ̂|θ}[2(θ̂ - θ)'VWθ̂ - θ̂'W'VWθ̂]
       = 2 E_{θ̂|θ}[(θ̂ - θ)'VW(θ̂ - θ)] - E_{θ̂|θ}[θ̂'W'VWθ̂] .

Recalling that θ̂ | θ ~ MVN(θ, (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}), then

    IM = 2 tr(VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}) - tr(W'VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}) - θ'W'VWθ
       = tr(VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}) + tr(W'VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1} (W'^{-1} - I)) - tr(W'VW θθ') .

Since W'^{-1} = I + B'(Σ⁺ n(i) Φ_i^{-1})B (Σ⁺ A_i),

    IM = tr(VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}) + tr(W'VW (Σ⁺ A_i)) - tr(W'VW θθ') ,

and this completes the proof.
Lemma 5.6.2.

    IM^(B) = tr(VW (B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1})B)^{-1}) .        (5.4)

Proof. Using the definition of IM^(B) and equation (5.3),

    IM^(B) = E_θ(IM)
           = tr(W'VW (Σ⁺ A_i)) - tr(W'VW E(θθ')) + tr(VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1})
           = tr(VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}) .

This completes the proof.
Note that (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1} is symmetric and positive definite. Therefore (B'(Σ⁺ n(i) Φ_i^{-1})B)(Σ⁺ A_i)(B'(Σ⁺ n(i) Φ_i^{-1})B) is positive definite, since Σ⁺ A_i is. This implies that W (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}, which can be written as

    ( B'(Σ⁺ n(i) Φ_i^{-1})B + B'(Σ⁺ n(i) Φ_i^{-1})B (Σ⁺ A_i) B'(Σ⁺ n(i) Φ_i^{-1})B )^{-1} ,

is a positive definite matrix. Therefore we have the following result.

Proposition 5.6.3. IM^(B) > 0.

Proof. Recall from equation (5.4) that IM^(B) = tr(VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}). Since V is assumed to be a positive definite matrix and the trace of the product of two positive definite matrices is strictly positive, then IM^(B) > 0 and this completes the proof.
Therefore, in terms of Bayes risk, the Bayes estimate always
improves on the M.L.E.
Lemma 5.6.4. If Σ_{i=1}^{p} (θ_i²/A_i) < 1, then (Σ⁺ A_i) - θθ' is positive definite, and if Σ_{i=1}^{p} (θ_i²/A_i) = 1, then (Σ⁺ A_i) - θθ' is positive semi-definite.

Proof. By theorem 8.5.2 in Graybill (1969), the characteristic equation for (Σ⁺ A_i) - θθ' is

    (1 - Σ_{i=1}^{p} θ_i²/(A_i - λ)) Π_{i=1}^{p} (A_i - λ) = 0 .

Suppose λ* < 0 and Σ_{i=1}^{p} (θ_i²/A_i) ≤ 1. Then Σ_{i=1}^{p} θ_i²/(A_i - λ*) < Σ_{i=1}^{p} θ_i²/A_i ≤ 1, which implies that (1 - Σ θ_i²/(A_i - λ*)) > 0. Since A_i - λ* > 0 for each i, this implies that (1 - Σ θ_i²/(A_i - λ*)) Π (A_i - λ*) > 0, and λ* cannot be a characteristic root of (Σ⁺ A_i) - θθ'. Therefore, there are no negative roots of the characteristic equation when Σ (θ_i²/A_i) ≤ 1.

If Σ θ_i²/A_i < 1, then (1 - Σ θ_i²/(A_i - 0)) Π (A_i - 0) > 0 and 0 is not a root of the characteristic equation. So for Σ θ_i²/A_i < 1, all characteristic roots are positive. Since a symmetric matrix having characteristic roots which are all positive is positive definite (Graybill 1969), (Σ⁺ A_i) - θθ' is positive definite.

If Σ θ_i²/A_i = 1, then (1 - Σ θ_i²/A_i) Π A_i = 0 and λ = 0 is a characteristic root of (Σ⁺ A_i) - θθ'. Therefore, if Σ θ_i²/A_i = 1, the characteristic roots of (Σ⁺ A_i) - θθ' are non-negative and at least one root is equal to zero; by theorem 12.2.2 in Graybill (1969), (Σ⁺ A_i) - θθ' is positive semi-definite. This completes the proof of lemma 5.6.4.
Proposition 5.6.5. If Σ_{i=1}^{p} θ_i²/A_i ≤ 1, then IM > 0.

Proof. By equations (5.3) and (5.4), we have

    IM = tr(W'VW (Σ⁺ A_i)) - tr(W'VW θθ') + IM^(B) .

By proposition 5.6.3, IM^(B) > 0. Therefore,

    IM > tr(W'VW (Σ⁺ A_i)) - tr(W'VW θθ') = tr(W'VW ((Σ⁺ A_i) - θθ')) .

By lemma 5.6.4, Σ θ_i²/A_i ≤ 1 implies that (Σ⁺ A_i) - θθ' is non-negative definite. Since W'VW is positive definite, tr(W'VW ((Σ⁺ A_i) - θθ')) ≥ 0 and IM > 0. This completes the proof.
In particular, for the two most common loss functions, L(θ̂, θ) = (θ̂ - θ)'(θ̂ - θ) and L(θ̂, θ) = (θ̂ - θ)'Φ^{-1}(θ̂ - θ), the risk of the Bayes estimate is strictly less than the risk of the M.L.E. if θ lies within the ellipsoid Σ_{i=1}^{p} y_i²/A_i = 1. The region in which IM is greater than 0 may be much more general than this ellipsoid, especially if tr(VW (B'(Σ⁺ n(i) Φ_i^{-1})B)^{-1}) is large.
5.7 Empirical Bayes Estimation of θ When Φ is Known

Consider the following simple example. Let Xi, i = 1,...,N, be a random sample of p x 1 vectors where Xi | θ ~ MVN(θ, σ² I_p) and σ² is known. The M.L.E. of θ is X̄ = Σ_{i=1}^{N} Xi/N. If θ ~ MVN(0, A I_p), where A is a positive constant, then the Bayes estimate under squared error loss is (1 - σ²/(σ² + NA))X̄. An unbiased estimate of 1/(σ² + NA), under the unconditional distribution of X̄, is (p - 2)/(N X̄'X̄). Therefore an empirical Bayes estimate, which turns out to be a James-Stein estimate, is (1 - σ²(p - 2)/(N X̄'X̄))X̄.

Suppose however that there are missing observations. Then for this case, the M.L.E. of θ is (Σ⁺ 1/t_i)B'S = x̄, where x̄_{.j} = Σ_ℓ x_{ℓj}/t_j, the mean over the t_j observations of variable j. The Bayes estimate is

    θ̂^(B) = ( Σ⁺_{i=1}^{p} (1 - σ²/(σ² + t_i A)) ) x̄ ,   i.e.,   θ̂_i^(B) = (1 - σ²/(σ² + t_i A)) x̄_{.i} .

This is now in the form of a ridge-regression estimate; the term (1 - σ²/(σ² + t_i A)) is different for each i. It is more difficult to get unbiased estimates of 1/(σ² + t_i A), i = 1,...,p, than it was to get an unbiased estimate of 1/(σ² + NA). Alternative types of estimators could be considered, and maximum likelihood estimation of A might seem the logical place to start. However, even in this simple case, the M.L.E. of A may not exist, and if it exists there is not a simple closed form expression for it. An unbiased estimate of A is Â = (x̄'x̄ - σ² Σ_{i=1}^{p} (1/t_i))/p; this is also a method of moments estimator.
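For this σ² I example the whole construction fits in a few lines. The sketch below (illustrative only, not a thesis program) forms the method of moments estimate Â and applies the coordinatewise shrinkage factors 1 - σ²/(σ² + t_i Â):

    import numpy as np

    def eb_estimate_diagonal(xbar, t, sigma2):
        """Empirical Bayes estimate for the sigma^2 I example of section 5.7:
        xbar[j] is the mean of the t[j] observed values of variable j."""
        p = len(xbar)
        # method of moments (unbiased) estimate of the common prior variance A
        A_hat = (xbar @ xbar - sigma2 * np.sum(1.0 / t)) / p
        # A_hat may be negative in small samples (see section 5.7.1)
        shrink = 1.0 - sigma2 / (sigma2 + t * A_hat)   # one factor per variable
        return shrink * xbar

With no missing data (t_i = N for every variable) the shrinkage factor is common to all coordinates, as in the James-Stein construction.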
5.7.1 An Empirical Bayes Estimator for the General Case

Proposition 5.7.1. The unconditional distribution of S is MVN(0, Q), where

    Q = (Σ⁺_{i=1}^{k} n(i) Φ_i) + (Σ⁺_{i=1}^{k} n(i) I_{s(i)}) B (Σ⁺_{i=1}^{p} A_i) B' (Σ⁺_{i=1}^{k} n(i) I_{s(i)}) .

Proof. Let U = (Σ⁺ (1/A_i)) + B'(Σ⁺ n(i) Φ_i^{-1})B. Recall that S | θ ~ MVN((Σ⁺ n(i) I_{s(i)})B θ, Σ⁺ n(i) Φ_i) and θ ~ MVN(0, Σ⁺ A_i). Then the p.d.f. of S is

    f(S) = c ∫ exp( -½ D'(Σ⁺ (1/n(i)) Φ_i^{-1})D - ½ θ'(Σ⁺ (1/A_i))θ ) dθ ,

where D = S - (Σ⁺ n(i) I_{s(i)})B θ and c is a constant with respect to S and θ. Therefore, using theorem 10.5.1 in Graybill (1969) to integrate out θ,

    f(S) = c_1 exp( -½ S'( (Σ⁺ (1/n(i)) Φ_i^{-1}) - (Σ⁺ Φ_i^{-1}) B U^{-1} B' (Σ⁺ Φ_i^{-1}) ) S ) ,

where c_1 is a constant with respect to S. By straightforward algebra it can be checked that

    ( (Σ⁺ (1/n(i)) Φ_i^{-1}) - (Σ⁺ Φ_i^{-1}) B U^{-1} B' (Σ⁺ Φ_i^{-1}) ) Q = I
    and Q ( (Σ⁺ (1/n(i)) Φ_i^{-1}) - (Σ⁺ Φ_i^{-1}) B U^{-1} B' (Σ⁺ Φ_i^{-1}) ) = I ,

where I is the (Σ s_i) x (Σ s_i) identity matrix. Therefore f(S) = c_1 exp(-½ S'Q^{-1}S) and S ~ MVN(0, Q). This completes the proof of Proposition 5.7.1.
Therefore θ̂ ~ MVN(0, (B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1})B)^{-1} + (Σ⁺_{i=1}^{p} A_i)), and x̄ = (Σ⁺ 1/t_i)B'S = (x̄_{.1} ... x̄_{.p})' is distributed

    MVN( 0 , (Σ⁺ 1/t_i) B'(Σ⁺ n(i) Φ_i)B (Σ⁺ 1/t_i) + (Σ⁺ A_i) ) .

Using the unconditional distribution of x̄, there is not a general closed form solution for the M.L.E. of the A_i's. In fact, since it is assumed that each A_i > 0, a M.L.E. does not necessarily exist. An unbiased estimate of A_i is Â_i = x̄_{.i}² - σ_{ii}/t_i, i = 1,...,p. These estimates are method of moments (M.O.M.) estimates. Using these Â_i's and equation (5.2), an empirical Bayes estimate of θ is

    θ̂^(EB) = ( I_p - ( I_p + (Σ⁺_{i=1}^{p} (x̄_{.i}² - σ_{ii}/t_i)) B'(Σ⁺_{i=1}^{k} n(i) Φ_i^{-1})B )^{-1} ) θ̂ .        (5.5)

The asymptotic properties of θ̂^(EB) are given in the following proposition.

Proposition 5.7.2. Suppose n(i)/N →^P r_i as N → ∞, where 0 < r_i < 1, i = 1,...,k, and S̄_i ⊥ S̄_j, i, j = 1,...,k, i ≠ j. Then

    √N(θ̂^(EB) - θ) →^L Z ~ MVN(0, (B'(Σ⁺ r_i Φ_i^{-1})B)^{-1}) ,

where θ̂^(EB) is defined in equation (5.5).
Proof. Recall that t_i = Σ_j n(j), where the summation is over all j, j = 1,...,k, such that i ∈ C_j, i.e., over all j such that variable i was observed in missingness pattern j. Let c_i = Σ_j r_j, i = 1,...,p, where the summation is again over all j = 1,...,k such that i ∈ C_j. Then c_i > 0, i = 1,...,p, and t_i/N →^P c_i; since each r_j > 0 and c_i > 0, it follows that N(Σ⁺ 1/t_i) →^P Σ⁺ 1/c_i and (1/N)(Σ⁺ t_i) →^P Σ⁺ c_i.

From the proof of proposition 5.4.1, we have √N(S̄ - B θ) →^L Z̃ ~ MVN(0, Σ⁺ (1/r_i) Φ_i). Since x̄ = (Σ⁺ 1/t_i)B'(Σ⁺ n(i) I_{s(i)})S̄, this implies

    √N(x̄ - θ) →^L Z_1 ~ MVN( 0 , (Σ⁺ 1/c_i) B'(Σ⁺ r_i Φ_i)B (Σ⁺ 1/c_i) ) .

Then, by Slutsky's law, x̄ →^P θ and x̄ x̄' →^P θθ'.

If D is a p x p matrix, let the notation diag{D} indicate the p x p diagonal matrix with the same diagonal elements as D. Then √N(θ̂^(EB) - θ) can be written as

    √N(θ̂ - θ) - √N ( I_p + (diag{x̄ x̄'} - diag{Φ}(Σ⁺ 1/t_i)) B'(Σ⁺ n(i) Φ_i^{-1})B )^{-1} θ̂
    = √N(θ̂ - θ) - (1/√N) ( (1/N) I_p + (diag{x̄ x̄'} - diag{Φ}(Σ⁺ 1/t_i)) B'(Σ⁺ (n(i)/N) Φ_i^{-1})B )^{-1} θ̂ .

Using the results of proposition 5.4.1 and repeated applications of Slutsky's law, this implies

    √N(θ̂^(EB) - θ) →^L Z - 0·( (diag{θθ'} - diag{Φ}·0) B'(Σ⁺ r_i Φ_i^{-1})B )^{-1} θ = Z ,

where Z ~ MVN(0, (B'(Σ⁺ r_i Φ_i^{-1})B)^{-1}). This completes the proof.

Therefore, under the conditional distribution Xi | θ,

    √N(θ̂^(EB) - θ) →^L Z ~ MVN(0, (B'(Σ⁺_{i=1}^{k} r_i Φ_i^{-1})B)^{-1}) ,

and this E.B. estimator is consistent and asymptotically equivalent to the M.L.E. θ̂.
One apparent difficulty with this E.B. estimator is that Â_i may be negative, while the true parameter A_i is assumed to be strictly positive. When Φ is a diagonal matrix and Â_i > 0, Â_i is also the M.L.E. of A_i.
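Putting the pieces of sections 5.4-5.7 together, the empirical Bayes estimate (5.5) can be computed noniteratively, as in the following Python sketch (an illustration under the stated assumptions, not a thesis program); it takes M*, the pattern counts n(i), the known Φ, and the pattern means S̄_i.

    import numpy as np
    from scipy.linalg import block_diag

    def eb_estimate(M_star, n, Phi, Sbar_list):
        """Empirical Bayes estimate (5.5): shrink the M.L.E. using the method
        of moments estimates A_i = xbar_i^2 - sigma_ii / t_i."""
        k, p = M_star.shape
        P = [np.eye(p)[row == 1] for row in M_star]
        B = np.vstack(P)
        Wt = block_diag(*[n[i] * np.linalg.inv(Pi @ Phi @ Pi.T)
                          for i, Pi in enumerate(P)])
        G = B.T @ Wt @ B                          # B'(sum^+ n(i) Phi_i^{-1})B
        theta_hat = np.linalg.solve(G, B.T @ Wt @ np.concatenate(Sbar_list))
        t = np.array([sum(n[i] for i in range(k) if M_star[i, j] == 1)
                      for j in range(p)])
        xbar = B.T @ np.concatenate([n[i] * Sbar_list[i] for i in range(k)]) / t
        A_hat = xbar ** 2 - np.diag(Phi) / t      # M.O.M.; may be negative
        shrink = np.linalg.inv(np.eye(p) + np.diag(A_hat) @ G)
        return (np.eye(p) - shrink) @ theta_hat   # equation (5.5)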
5.8 Examples

5.8.1 The Design

Consider an experiment carried out in a randomized block design. The model for the observed variables, y_{ij}, is y_{ij} = a + b_i + t_j + e_{ij}, where b_i, i = 1,...,N, is the ith block effect, t_j, j = 1,...,p, is the jth treatment effect, and t_0 represents the control. The vectors e_i = (e_{i0},...,e_{ip})', i = 1,...,N, are independently and identically distributed with the (p + 1)-variate normal distribution with mean 0 and variance-covariance matrix Ω. Since Ω is not necessarily a diagonal matrix, the treatment effects may be correlated. For a recent reference for this nonstandard block design model, see Mudholkar and Subbaiah (1976).

Suppose the parameters of interest are θ_i = t_i - t_0, i = 1,...,p, the differences between treatment and control. Let θ = (θ_1,...,θ_p)'. Let x_{ij} = y_{ij} - y_{i0}, and X_i = (x_{i1},...,x_{ip}), i = 1,...,N. Then the model for x_{ij} is x_{ij} = t_j - t_0 + e_{ij} - e_{i0}, and X_i | θ ~ MVN(θ, Φ), where Φ = (-1_{p x 1} | I_{p x p}) Ω (-1_{p x 1} | I_{p x p})'.

Suppose one or more treatment variables are missing for some observations. If Ω is known, and therefore Φ is known, then the problem is to estimate θ from a random sample X_i, i = 1,...,N, where X_i | θ ~ MVN(θ, Φ) and some data are missing. In this example, shrinking the maximum likelihood estimate of the treatment effects towards the control is a conservative and reasonable approach. So for p ≥ 3, the estimate θ̂^(EB), which shrinks the M.L.E. towards 0, is a reasonable estimate to consider.

Mudholkar and Subbaiah (1976) give an example of a randomized block design where the parameter of interest is θ, the differences between treatment and control. They give the sample mean and variance from a clinical trial where p = 4. Using their sample estimates as true parameters, consider an example where θ = (1.5616, 1.3529, 1.5457, 1.0222)' and Φ is taken to be their reported 4 x 4 sample variance-covariance matrix, whose distinct entries are 5.1791, 2.9220, 3.1260, 3.8497, 3.5200, 4.5538, 3.9780, 4.1789, 4.7738, and 5.5509.

For the following examples, N random 4 x 1 vectors were generated with this MVN(θ, Φ) distribution. The Bayes and empirical Bayes estimates, θ̂^(B) and θ̂^(EB), should give the best results when θ is zero or near zero. The value of θ used for these examples is not exactly equal to 0, but the values of the variances are sufficiently large so that θ is close to zero relative to the variances. Therefore these examples may or may not show the empirical Bayes estimate to advantage.
5.8.2 Numerical Examples Where Φ is Known

In each of the examples 1-3, described below, the M.L.E. θ̂ is calculated using equation (5.1). The empirical Bayes estimate, θ̂^(EB), is calculated using equation (5.5). The estimate θ̂^(EB) uses Â_i = x̄_{.i}² - σ_{ii}/t_i; when t_i is large, the term σ_{ii}/t_i will have little influence. The empirical Bayes estimate using Â_i = x̄_{.i}², i = 1,...,p, was also calculated; it is denoted by θ̂^(2). The values of two loss functions are calculated for each estimate:

1) (θ̃ - θ)'(θ̃ - θ), where θ̃ is the estimate θ̂, θ̂^(EB), or θ̂^(2), and

2) (θ̃ - θ)'Φ^{-1}(θ̃ - θ).

In each of the following examples the number Σ_{i=1}^{4} θ_i²/Â_i was calculated, where Â_i = x̄_{.i}² - σ_{ii}/t_i. This was of interest because the Bayes estimate will improve on the M.L.E. when Σ θ_i²/A_i ≤ 1. Therefore Σ θ_i²/Â_i might give some indication of the suitability of using the E.B. estimate; however, these examples do not show any obvious relation between this number and the performance of θ̂^(EB).
Example 1. N = 100. There are three patterns of missingness, with

    M* = ( 1 1 1 1        n(i) = 80
           1 1 1 0               10
           0 1 1 0 )             10

and Σ_{i=1}^{4} θ_i²/Â_i = 2.91. The following is a summary of the results.

                           θ̃ = θ̂    θ̃ = θ̂^(EB)    θ̃ = θ̂^(2)
    θ̃_1 - θ_1              .341       .231           .234
    θ̃_2 - θ_2              .190       .106           .108
    θ̃_3 - θ_3              .257       .160           .162
    θ̃_4 - θ_4              .323       .206           .209
    (θ̃ - θ)'(θ̃ - θ)        .323       .133           .136
    (θ̃ - θ)'Φ^{-1}(θ̃ - θ)  .034       .022           .022

In this example the E.B. estimate performed better than the M.L.E. There was little difference between θ̂^(EB) and θ̂^(2), which was to be expected since t_i ≥ 80 for each i.
Example 2. N = 50. In this example the missingness pattern was generated by assuming that the probability of each variable being missing was:

    Variable #      1     2     3     4
    Pr(missing)    .2    .4    .2    .90

The resulting incompleteness matrix M* has the fourteen rows 1111, 1011, 1101, 0011, 1110, 1010, 1001, 0111, 0101, 1100, 0100, 0010, 0110, and 1000, and Σ_{i=1}^{4} θ_i²/Â_i = 2.60. The following is a summary of the results.

                           θ̃ = θ̂    θ̃ = θ̂^(EB)    θ̃ = θ̂^(2)
    θ̃_1 - θ_1              .564       .366           .374
    θ̃_2 - θ_2              .411       .256           .263
    θ̃_3 - θ_3              .429       .249           .256
    θ̃_4 - θ_4              .419       .210           .219
    (θ̃ - θ)'(θ̃ - θ)        .847       .306           .322
    (θ̃ - θ)'Φ^{-1}(θ̃ - θ)  .112       .079           .080

Again the E.B. estimators performed significantly better than the M.L.E. The estimate θ̂^(2) did not do quite as well as θ̂^(EB).
Example 3. N = 16. The incompleteness matrix is

    M* = ( 1 1 1 1        n(i) = 8
           0 1 1 0               4
           0 0 1 1 )             4

and Σ_{i=1}^{4} θ_i²/Â_i = 3.55. Following is a summary of the results.

                           θ̃ = θ̂    θ̃ = θ̂^(EB)    θ̃ = θ̂^(2)
    θ̃_1 - θ_1              .032      -.466          -.413
    θ̃_2 - θ_2              .212      -.160          -.121
    θ̃_3 - θ_3             -.102      -.545          -.499
    θ̃_4 - θ_4              .074      -.425          -.373
    (θ̃ - θ)'(θ̃ - θ)        .062       .721           .574
    (θ̃ - θ)'Φ^{-1}(θ̃ - θ)  .270       .262           .233

In this example, neither E.B. estimator performed well at all compared to the M.L.E. The M.L.E.'s were either very close to the true value or underestimated the true value, except for θ̂_2, so shrinking the M.L.E. towards 0 did not improve the estimation. The sample size was smaller here than in examples 1 and 2, and the value of Σ θ_i²/Â_i was larger.

However, one cannot draw any general conclusions from these three examples. A simulation study is necessary to make such conclusions, and such a study is not within the scope of this paper. These examples do indicate that the empirical Bayes procedure warrants further investigation.
5.8.3 An Example Where Φ is Not Known

The assumption that Φ is known is a restriction on the usefulness of the E.B. method. Future study will need to extend the empirical Bayes method to problems of estimating θ when Φ is not known. Until such work is completed, it seems reasonable to use the naive approach of replacing Φ in equation (5.5) by some consistent estimate of Φ.

Consider the data and missingness pattern of example 1. Assume now that Φ is not known. The M.L.E. θ̂ can be calculated by Orchard and Woodbury's iterative technique using the computer program MISSMLE. To calculate the empirical Bayes procedures, replace the elements of Φ in equation (5.5) by their M.L.E.'s from the 80 complete data cases. As the table below shows, the E.B. procedures still perform well relative to maximum likelihood.
                           θ̃ = θ̂    θ̃ = θ̂^(EB)    θ̃ = θ̂^(2)
    θ̃_1 - θ_1              .340       .251           .253
    θ̃_2 - θ_2              .190       .126           .128
    θ̃_3 - θ_3              .257       .177           .178
    θ̃_4 - θ_4              .325       .237           .239
    (θ̃ - θ)'(θ̃ - θ)        .323       .167           .169
    (θ̃ - θ)'Φ^{-1}(θ̃ - θ)  .034       .023           .023
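The plug-in step used here is simply to estimate Φ from the complete cases and substitute it into (5.5); a minimal sketch, with np.nan marking the missing values (illustrative only):

    import numpy as np

    def phi_hat_complete_cases(X):
        """M.L.E. of Phi from the complete cases only, the ad hoc plug-in
        used in section 5.8.3."""
        complete = X[~np.isnan(X).any(axis=1)]
        theta = complete.mean(axis=0)
        return (complete - theta).T @ (complete - theta) / complete.shape[0]

    # the resulting Phi_hat replaces Phi in equation (5.5)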
6. SUMMARY
This paper is concerned with the problem of simultaneous estimation
of the means using incomplete multivariate data.
The major results of
this paper are two new estimators for the mean— a Bayes estimator and
an empirical Bayes estimator.
These are compared to the maximum
likelihood estimate of the mean.
Large sample and small sample properties of the Bayes estimator are found, and large sample properties of the empirical Bayes estimator are found. Although small sample properties of the empirical Bayes estimator are more difficult to find,
numerical examples indicate that under some conditions this estimator
may improve on the maximum likelihood estimator.
Also, this paper
presents two computer programs which provide additional tools for
estimation when there are data missing.
The computer program REACTOR creates Rubin's (1974) factorization table; without this program, his algorithm is only feasible for small data sets.
This factorization provides a useful summary of the data
and the missingness pattern.
REACTOR may break the problem into
several simpler estimation problems and, more importantly, it can show
that some parameters are not identifiable.
For these reasons, the data
should be analyzed by REACTOR before estimation is performed.
The second computer program is MISSMLE, which calculates the maximum likelihood estimate of the mean vector (θ̂) and variance-covariance matrix (Φ̂) from a multivariate normal data set with missing observations. It uses Orchard and Woodbury's (1970) iterative procedure. Maximum likelihood is a widely used method in statistics and is asymptotically optimal in the setting of this paper. For large data sets, MISSMLE should be used.
The major results presented in this paper are a Bayes estimator and an empirical Bayes estimator of the mean vector, θ, from a multivariate normal data set with missing observations. The empirical Bayes estimator resembles the popular ridge-regression estimator and shrinks the maximum likelihood estimator towards an a priori mean vector. From what is known about the complete data case, it was anticipated that the empirical Bayes estimator could be an improvement over maximum likelihood, in the small sample situation, for simultaneous estimation of the means under squared error loss.

As a preliminary step in deriving the empirical Bayes estimator, the Bayes estimate, θ̂^(B), under squared error loss is found; a multivariate normal prior with mean φ and variance-covariance matrix Σ⁺ A_i is assumed. The Bayes estimate has the following properties:

1. It has the same form as a ridge-regression estimator.

2. It is biased.

3. Under mild assumptions, it is consistent for θ and it is asymptotically equivalent to θ̂.

4. Under squared error loss, the Bayes risk of θ̂^(B) is strictly less than the Bayes risk of θ̂.

5. If the true θ satisfies Σ_{i=1}^{p} θ_i²/A_i ≤ 1, then the risk of θ̂^(B) is strictly less than the risk of θ̂. The converse is not true.
However, the Bayes estimator is a function of the A_i's, which are not usually known. Replacing the A_i's by estimates converts the Bayes estimate into an empirical Bayes estimate, θ̂^(EB). The estimates of the A_i's are found from the unconditional distribution of the sample. In this paper, a method of moments estimate of the A_i's was used. The resulting empirical Bayes estimate has the following properties:

1. It has the same form as a ridge-regression estimator.

2. It is biased.

3. Under mild conditions, it is consistent for θ and it is asymptotically equivalent to θ̂.

The estimate θ̂^(EB) is easy to compute, and in several numerical examples θ̂^(EB) improved on θ̂.
It is assumed that the process causing the missingness can be ignored, and that every variable is observed at least once. Neither of these assumptions is particularly restrictive or unreasonable. The data are assumed to come from a p-variate normal distribution, MVN(θ, Φ). While this is a common assumption in statistical work, it may be restrictive in practice. Some work has been done on testing for multivariate normality, and the researcher should verify that the data come from a multivariate normal distribution. Robustness of the empirical Bayes estimator, however, would be a useful area for future work.

The most restrictive assumption made is that Φ is known. This is a restriction which needs to be removed in future work, as will be discussed later.
There are a number of potentially profitable ways that the results of this paper can be extended and generalized. The most important result needed is a theorem dealing with the improvement in the risk of θ̂^(EB) compared to θ̂. An unbiased estimate of this improvement would be useful. Also, other estimates of the A_i's could be considered.

An empirical Bayes estimator with the elements of Φ also replaced by estimates could be the next step. This would remove the restriction that Φ is assumed known. If such an estimate can be developed which does as well as, or better than, the maximum likelihood estimate, it would provide a simple, noniterative procedure for simultaneous estimation of the means, for any pattern of missingness. The final example of chapter 5 shows how one can handle, for now, the case of Φ unknown in an ad hoc way.

Estimation of the means is of limited use without the associated inference procedures--confidence limits and hypothesis tests. At this point, only the asymptotic distribution of θ̂^(EB) is available for such procedures. For extending the results of this paper to statistical inference, further work, including a simulation study, is needed.
BIBLIOGRAPHY
Afifi, A.A., and Elashoff , R.M. (1966), "Missing Observations in
■Multivariate Statistics, I. Review of the Literature,"
Journal of the American Statistical Association, 61, 595-603.
____ (1967), Missing Observations in Multivariate Statistics, II.
.Point Estimation in Simple Linear Regression," Journal of the
American Statistical Association, 62, 10-29.
j
.
(1969), "Missing Observations in Multivariate Statistics, III.
Large Sample Analysis of Simple Linear Regression," Journal] of
the American Statistical Association, 64, 337-358.
1
_____ (1969a), "Missing Observations in Multivariate Statistics, IV.
A Note on Simple Linear Regression," Journal of the American
Statistical Association, 64, 359-365.
Anderson, T.W. (1957), "Maximum Likelihood Estimates for a .Multivariate
Normal Distribution When Some Observations are Missing," Journal
of the American Statistical Association, 52, 200-203.
Beale, E.M.L., and Little, R;J.A. (1975) , "Missing Values in
Multivariate Analysis," Journal,of the Royal Statistical Society ,
Series B , 37, 129-145.
Bhoj, D.S. (1978), "Testing Equality of Means of Correlated Variates
with Missing Observations on Both Responses , 11 Biometrika, 65,.
225-228.
Blight,. B.J.N. (1970), "Estimation from a Censored Samplfe for the
Exponential Family," Biometrics, 57, 389-395.
Buck, S.F. (1960), "A Method of Estimation of Mis.sing Values in
. Multivariate Data Suitable for Use with an Electronic Computer,"
Journal of the Royal Statistical Society, Series B , 22, 302-306.
DeGroot, Morris H . (1970), Optimal Statistical Decisions, New York:
McGraw-Hill, Inc.',
Dempster, A.P., Laird, N . M ; a n d Rubin, D.B. (1976), "Maximum
Likfelitiobd from Incomplete Data Via the EM Algorithm," Research
Reports S-38\NS-320, Department of Statistics, Harvard University.
100
Draper, N.R., and Guttman, I. (1977), "Inference from Incomplete Data
on the Difference of Means of Correlated Variables: A Bayesian
Approach," Technical Report #496; Department of Statistics,
University of Wisconsin, Madison.
Edgett, G.L. (1956), "Multiple Regression with Missing Observations.
Among the Independent Variables," Journal of the American
Statistical Association, 51, 122r-131.
Efron, B., and Morris, C. (1972), "Empirical Bayes on Vector Obser­
vations: An Extension of Stein's Method," Biometfika, 59,
.335-347.:
.
.Ekbohm, G. (1976) , "On Comparing Means in the.Paired Case with
Incomplete Data on Both Responses," Biometrika, 63, 299-304.
Ferguson, Thomas S. (1967), Mathematical Statistics, New York:
Academic Press.
Graybill, Franklin A. (1969) , Introduction to Matrices with Applications
in Statistics, Belmont, California: Wadsworth Publishing Company,
Inc..
.
.
•
_____ (1976), Theory and Application of the Linear Model, Belmont,
California: Wadsworth Publishing Company, Inc. .
: Hamdan, M.A., Pirie, W.R., and Khuri,. A.I. (1976), "Unbiased Estimation
of the. Common Mean Based bn Incomplete Bivariate Normal Samples,"
Biometrische Zeitschrift, 18, 245-249.
Hamilton,. Martin A. (1975), "Regression Analysis When There Are .Missing
Observations— Survey and. Bibliography," Statistical Center
Technical Report #1-3-75, Montana State University.
Hartley, H.Q., and Hocking, ft.R., (1971) ; "The Analysis of incomplete
Data," Biometrics. 27, 783-823.
Hinkins, Susan M. (1976), "MISSMLE - A Computer Program for Calculating Maximum Likelihood Estimates When Some Data are Missing," Statistical Center Technical Report #1-6-76, Montana State University.
_____ (1976a), "RFACTOR - A Computer Program to Create Rubin's Table for Factoring a Multivariate Estimation Problem When Some Data are Missing," Statistical Center Technical Report #1-12-76, Montana State University.
Hocking, R.R., and Smith, W.B. (1968), "Estimation of the Parameters
in the Multivariate Normal Distribution with Missing Observations,"
Journal of the American Statistical Association, 63, 159-173.
_____ (1972), "Optimum Incomplete Multinormal Samples," Technometrics, 14, 299-307.
Hoerl, A.E., and Kennard, R.W. (1970), "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, 12, 55-67.
Hudson, H.M. (1974), "Empirical Bayes Estimation," Technical Report #58, Department of Statistics, Stanford University, Stanford, California.
James, W., and Stein, C. (1961), "Estimation with Quadratic Loss," Proceedings of the Fourth Berkeley Symposium, 1, 361-379.
Lin, P.E. (1971), "Estimation Procedures for Difference of Means with Missing Data," Journal of the American Statistical Association, 66, 634-636.
_____ (1973), "Procedures for Testing the Difference of Means with Incomplete Data," Journal of the American Statistical Association, 68, 699-703.
Lin, P.E., and Stivers, L.E. (1974), "On Differences of Means with Incomplete Data," Biometrika, 61, 325-334.
_____ (1975), "Testing for Equality of Means with Incomplete Data on One Variable: A Monte Carlo Study," Journal of the American Statistical Association, 70, 190-193.
Little, R.J.A. (1976), "Recent Developments in Inference About Means and Regression Coefficients for a Multivariate Sample with Missing Values," Technical Report #26, Department of Statistics, University of Chicago.
_____ (1976a), "Inference About Means from Incomplete Multivariate Data," Biometrika, 63, 593-604.
Lord, F.M. (1955), "Estimation of Parameters from Incomplete Data," Journal of the American Statistical Association, 50, 870-876.
Louis, T.A., Heghinian, S., and Albert, A. (1976), "Maximum Likelihood Estimation Using Pseudo-Data Iteration," Boston University Research Report #2-76, Mathematics Department and Cancer Research Center, Boston University.
Mehta, J.S., and Gurland, J. (1969), "Some Properties and an Application of a Statistic Arising in Testing Correlation," Annals of Mathematical Statistics, 40, 1736-1745.
_____ (1969a), "Testing Equality of Means in the Presence of Correlation," Biometrika, 56, 119-126.
_____ (1973), "A Test of Equality of Means in the Presence of Correlation and Missing Data," Biometrika, 60, 211-213.
Mehta, J.S., and Swamy, P.A.V.B. (1974), "Bayesian Analysis of a Bivariate Normal Distribution When Some Data are Missing," Contributions to Economic Analysis, 86, 289-309.
Milliken, G.A., and McDonald, L.L. (1976), "Linear Models and Their Analysis in the Case of Missing or Incomplete Data: A Unifying Approach," Biometrische Zeitschrift, 18, 381-396.
Morrison, D.F. (1971), "Expectations and Variances of Maximum Likelihood Estimates of the Multivariate Normal Distribution Parameters with Missing Data," Journal of the American Statistical Association, 66, 602-604.
_____ (1973), "A Test of Equality of Means of Correlated Variates with Missing Data on One Response," Biometrika, 60, 101-105.
Morrison, D.F., and Bhoj, D.S. (1973), "Power of the Likelihood Ratio Test on the Mean Vector of the Multivariate Normal Distribution with Missing Observations," Biometrika, 60, 365-368.
Mudholkar, G.S., and Subbaiah, P. (1976), "Unequal Precision Multiple Comparisons for Randomized Block Designs under Nonstandard Conditions," Journal of the American Statistical Association, 71, 429-434.
Naik, U.D. (1975), "On Testing Equality of Means of Correlated Variables with Incomplete Data," Biometrika, 62, 615-622.
Nicholson, G.E. (1957), "Estimation of Parameters from Incomplete Multivariate Samples," Journal of the American Statistical Association, 52, 523-526.
Orchard, T., and Woodbury, M.A. (1970), "A Missing Information Principle: Theory and Applications," Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715.
Rao, C. Radhakrishna (1973), Linear Statistical Inference and Its
Applications, New York: John Wiley and Sons, Inc.
Rubin, D. (1971), Contribution to the discussion of "The Analysis of
Incomplete Data" by Hartley and Hocking, Biometrics, 27, 808-813.
_____ (1974), "Characterizing the Estimation of Parameters in Incomplete-Data Problems," Journal of the American Statistical Association, 69, 467-474.
_____ (1976), "Inference and Missing Data," Biometrika, 63, 581-592.
_____ (1977), "Comparing Regressions When Some Predictor Values Are Missing," Technometrics, 18, 201-205.
Sundberg, R. (1974), "Maximum Likelihood Theory for Incomplete Data from an Exponential Family," Scandinavian Journal of Statistics, 1, 49-58.
_____ (1976), "An Iterative Method for Solution of the Statistical Equations for Incomplete Data from Exponential Families," Communications in Statistics, B5, 55-64.
Wilks, S.S. (1932), "Moments and Distributions of Estimates of Population Parameters from Fragmentary Samples," Annals of Mathematical Statistics, 3, 163-203.
Woodbury, M.A. (1971), Contribution to the discussion of "The Analysis of Incomplete Data" by Hartley and Hocking, Biometrics, 27, 808-813.
Woodbury, M.A., and Hasselblad, V. (1970), "Maximum Likelihood Estimates of the Variance-Covariance Matrix from the Multivariate Normal," SHARE National Meeting, Denver, Colorado.