Empirical Bayes Methods for Combining Likelihoods

Bradley EFRON
Suppose that several independent experiments are observed, each one yielding a likelihood Lk(θk) for a real-valued parameter of interest θk. For example, θk might be the log-odds ratio for a 2 × 2 table relating to the kth population in a series of medical experiments. This article concerns the following empirical Bayes question: How can we combine all of the likelihoods Lk to get an interval estimate for any one of the θk's, say θ1? The results are presented in the form of a realistic computational scheme that allows model building and model checking in the spirit of a regression analysis. No special mathematical forms are required for the priors or the likelihoods. This scheme is designed to take advantage of recent methods that produce approximate numerical likelihoods Lk(θk) even in very complicated situations, with all nuisance parameters eliminated. The empirical Bayes likelihood theory is extended to situations where the θk's have a regression structure as well as an empirical Bayes relationship. Most of the discussion is presented in terms of a hierarchical Bayes model and concerns how such a model can be implemented without requiring large amounts of Bayesian input. Frequentist approaches, such as bias correction and robustness, play a central role in the methodology.

KEY WORDS: ABC method; Confidence expectation; Generalized linear mixed models; Hierarchical Bayes; Meta-analysis for likelihoods; Relevance; Special exponential families.
1. INTRODUCTION
A typical statistical analysis blends data from independent experimental units into a single combined inference for a parameter of interest θ. Empirical Bayes, or hierarchical or meta-analytic, analyses involve a second level of data acquisition. Several independent experiments are observed, each involving many units, but each perhaps having a different value of the parameter θ. Inferences about one or all of the θ's are then made on the basis of the full two-level compound data set. This article concerns the construction of empirical Bayes interval estimates for the individual parameters θk, when the observed data consist of a separate likelihood for each case.
Table 1, the ulcer data, exemplifies this situation. Forty-one randomized trials of a new surgical treatment for stomach ulcers were conducted between 1980 and 1989 (Sacks, Chalmers, Blum, Berrier, and Pagano 1990). The kth experiment's data are recorded as

    (ak, bk, ck, dk)    (k = 1, 2, ..., 41),    (1)

where ak and bk are the number of occurrences and nonoccurrences for the Treatment (the new surgery) and ck and dk are the occurrences and nonoccurrences for Control (an older surgery). Occurrence here refers to an adverse event, recurrent bleeding.
The estimated log odds ratio for experiment k,

    θ̂k = log[ (ak/bk) / (ck/dk) ],    (2)

measures the excess occurrence of Treatment over Control. For example, θ̂13 = .60, suggesting a greater rate of occurrence for the Treatment in the 13th experimental population. But θ̂8 = −4.17 indicates a lesser rate of occurrence in the 8th population. An approximate standard deviation for θ̂k, also appearing in Table 1, is

    SDk = [ 1/(ak + .5) + 1/(bk + .5) + 1/(ck + .5) + 1/(dk + .5) ]^(1/2);    (3)

SD13 = .61, for example.

The statistic θ̂k is an estimate of the true log-odds ratio θk in the kth experimental population,

    θk = log( [ Pk{Occurrence|Treatment} / Pk{Nonoccurrence|Treatment} ] / [ Pk{Occurrence|Control} / Pk{Nonoccurrence|Control} ] ),    (4)

where Pk indicates probabilities for population k. The data (ak, bk, ck, dk), considered as a 2 × 2 table with fixed margins, give a conditional likelihood for θk,

    Lk(θk) = C(ak + bk, ak) C(ck + dk, ck) e^(θk ak) / Sk(θk)    (5)

(Lehmann 1959, sec. 4.6). Here C(n, m) denotes the binomial coefficient, and Sk(θk) is the sum of the numerator in (5) over the allowable choices of ak subject to the 2 × 2 table's marginal constraints.

Figure 1 shows 10 of the 41 likelihoods (5), each normalized so as to integrate to 1 over the range of the graph. It seems clear that the θk values are not all the same. For instance, L8(θ8) and L13(θ13) barely overlap. On the other hand, the θk values are not wildly discrepant, most of the 41 likelihood functions being concentrated in the range θk ∈ (−6, 3).
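As a numerical check on (2), (3), and (5), the quantities for a single 2 × 2 table can be computed directly. The short sketch below (plain Python; not part of the original analysis) reproduces θ̂13 = .60 and SD13 = .61 and evaluates the conditional likelihood (5) over the allowable range of ak.

```python
import math

def log_odds_and_sd(a, b, c, d):
    """Point estimate (2) and approximate standard deviation (3) for one 2x2 table."""
    theta_hat = math.log((a / b) / (c / d))
    sd = math.sqrt(sum(1.0 / (n + 0.5) for n in (a, b, c, d)))
    return theta_hat, sd

def cond_likelihood(a, b, c, d, theta):
    """Conditional likelihood (5): both margins of the 2x2 table held fixed."""
    r1, r2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - r2), min(r1, col1)     # allowable values of a
    s = sum(math.comb(r1, x) * math.comb(r2, col1 - x) * math.exp(theta * x)
            for x in range(lo, hi + 1))           # S_k(theta), the normalizer
    return math.comb(r1, a) * math.comb(r2, c) * math.exp(theta * a) / s

# Experiment 13 of Table 1: (a, b, c, d) = (9, 12, 7, 17)
theta13, sd13 = log_odds_and_sd(9, 12, 7, 17)
```

At θ = 0 the conditional likelihood reduces to the ordinary central hypergeometric probability of the observed table, so summing it over the allowable ak values (with the margins held fixed) returns 1.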
Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, CA 94305. Research was supported by National Science Foundation Grant DMS92-04864 and National Institutes of Health Grant 5 R01 CA59039-20.
This article concerns making interval estimates for any one of the parameters, say θ8, on the basis of all the data in Table 1. We could of course use only the data from experiment 8 to form a classical confidence interval for θ8. This amounts to assuming that the other 40 experiments have no relevance to the 8th population. At the other extreme, we could assume that all of the θk values were equal and form
© 1996 American Statistical Association
Journal of the American Statistical Association
June 1996, Vol. 91, No. 434, Theory and Methods
a confidence interval for the common log-odds ratio θ from the totals (a+, b+, c+, d+) = (170, 736, 352, 556). The empirical Bayes methods discussed here compromise between these extremes. Our goal is to make a correct Bayesian assessment of the information concerning θ8 in the other 40 experiments without having to specify a full Bayes prior distribution. Morris (1983) gives an excellent description of the empirical Bayes viewpoint.

Figure 1. Ten of the 41 Individual Likelihoods Lk(θk), (5); k = 1, 3, 5, 6, 7, 8, 10, 12, 13, and 40; Likelihoods Normalized to Satisfy ∫ Lk(θk) dθk = 1 Over the Range of the Graph. Stars indicate L40(θ40), which is more negatively located than the other 40 likelihoods; L41(θ41), not shown, is perfectly flat. [Figure: ten likelihood curves plotted against the log odds ratio θ.]

Our empirical Bayes confidence intervals will be constructed directly from the likelihoods Lk(θk), k = 1, 2, ..., K, with no other reference to the original data that gave these likelihoods. This can be a big advantage in situations where the individual experiments are more complicated than those in Table 1. Suppose, for instance, that we had observed d-dimensional normal vectors xk ~ Nd(μk, Σk) independently for k = 1, 2, ..., K and that the parameter of interest was θk = second-largest eigenvalue of Σk. Current research has provided several good ways to construct approximate likelihoods Lk(θk) for θk alone, with all nuisance parameters eliminated (see Barndorff-Nielsen 1986, Cox and Reid 1987, and Efron 1993). The set of approximate likelihoods Lk(θk), k = 1, 2, ..., K, could serve as the input to our empirical Bayes analysis. In this way empirical Bayes methods can be brought to bear on very complicated situations.

The emphasis here is on a general approach to empirical Bayes confidence intervals that does not require mathematically tractable forms for the likelihoods or for the family of possible prior densities. Our results, which are primarily methodological, are presented in the form of a realistic computational algorithm that allows model building and model checking, in the spirit of a regression analysis. As a price for this degree of generality, all of our methods are approximate. They can be thought of as computer-based generalizations of the analytic results of Carlin and Gelfand (1990, 1991), Laird and Louis (1987), Morris (1983), and other works referred to by those authors. Section 6 extends the empirical Bayes likelihood methods to problems also having a regression structure. This extension gives a very general version of mixed models applying to generalized linear model (GLM) analyses.

The framework for our discussion is the hierarchical Bayes model described in Section 2. Philosophically, this
Table 1. Ulcer Data

Experiment    a    b    c    d     θ̂     SD
  *1          7    8   11    2   -1.84   .86
   2          8   11    8    8    -.32   .66
  *3          5   29    4   35     .41   .68
   4          7   29    4   27     .49   .65
  *5          3    9    0   12     Inf  1.57
  *6          4    3    4    0    -Inf  1.65
  *7          4   13   13   11   -1.35   .68
  *8          1   15   13    3   -4.17  1.04
   9          3   11    7   15    -.54   .76
 *10          2   36   12   20   -2.38   .75
  11          6    6    8    0    -Inf  1.56
 *12          2    5    7    2   -2.17  1.06
 *13          9   12    7   17     .60   .61
  14          7   14    5   20     .69   .66
  15          3   22   11   21   -1.35   .68
  16          4    7    6    4    -.97   .86
  17          2    8    8    2   -2.77  1.02
  18          1   30    4   23   -1.65   .98
  19          4   24   15   16   -1.73   .62
  20          7   36   16   27   -1.11   .51
  21          6   34   13    8   -2.22   .61
  22          4   14    5   34     .66   .71
  23         14   54   13   61     .20   .42
  24          6   15    8   13    -.43   .64
  25          0    6    6    0    -Inf  2.08
  26          1    9    5   10   -1.50  1.02
  27          5   12    5   10    -.18   .73
  28          0   10   12    2    -Inf  1.60
  29          0   22    8   16    -Inf  1.49
  30          2   16   10   11   -1.98   .80
  31          1   14    7    6   -2.79  1.01
  32          8   16   15   12    -.92   .57
  33          6    6    7    2   -1.25   .92
  34          0   20    5   18    -Inf  1.51
  35          4   13    2   14     .77   .87
  36         10   30   12    8   -1.50   .57
  37          3   13    2   14     .48   .91
  38          4   30    5   14    -.99   .71
  39          7   31   15   22   -1.11   .52
 *40          0   34   34    0    -Inf  2.01
  41          0    9    0   16     NA   2.04

NOTE: 41 independent experiments comparing Treatment, a new surgery for stomach ulcer, with Control, an older surgery; (a, b) = (occurrences, nonoccurrences) on Treatment; (c, d) = (occurrences, nonoccurrences) on Control; occurrence refers to an adverse event, recurrent bleeding. θ̂ is the estimated log odds ratio (2); SD = estimated standard deviation for θ̂ (3). Experiments 40 and 41 are not included in the empirical Bayes analysis of Section 3. Stars indicate likelihoods shown in Figure 1. Data from Sacks et al. (1990). Carl Morris, in Part 8 of his Commentary following this article, points out discrepancies between Table 1 and the original data for cities 4, 24, 32, and 33.
Figure 2. The Hierarchical Bayes Model. Hyperparameter η sampled from hyperprior density h(·); prior density gη(·) yields unobservable parameters θ0, θ1, θ2, θ3, ..., θK; each θk determines independent sampling density fθk(·) for observed data xk; inferences about θ0 are made on the basis of the direct data x0 and the "other" data x = (x1, x2, x3, ..., xK). It is convenient to assume that x is observed before x0 (dashed lines), leading to the full Bayesian analysis (13). [Diagram: h(·) → η → gη(·) → θ0, θ1, ..., θK → fθ0(·), fθ1(·), ..., fθK(·) → x0 and x; the dashed path through x gives the induced prior via h(η|x).]
is a very attractive framework for the combination of independent likelihoods. In practice, however, it is difficult to
apply hierarchical Bayes methods because of the substantial
Bayesian inputs that they require.
This article can be thought of as an attempt to make hierarchical Bayes analyses more flexible and practical, and more palatable to frequentists. Section 3 considers a data-based approach to the choice of prior distributions, less rigid than the usual appeal to conjugate families. Section 5, on bias correction, concerns frequentist adjustments to the maximum likelihood estimation of hierarchical parameters. These adjustments correct a simple frequentist approach to better agree with what one would obtain from a full hierarchical Bayes analysis. Section 4 concerns robustness: How can we protect a genuinely unusual case, perhaps the one drug out of many tested that really works, from being overmodified by Bayesian shrinkage toward the mean? Section 7 summarizes the various suggestions and results.
2. HIERARCHICAL AND EMPIRICAL BAYES MODELS

Empirical Bayes confidence intervals can be thought of as approximations to a full hierarchical Bayes analysis of a compound data set. That is the point of view taken by Carlin and Gelfand (1990, 1991), Laird and Louis (1987), and this article. This section briefly reviews the connection between hierarchical and empirical Bayes analyses.

Figure 2 diagrams the hierarchical Bayes model as used here. At the most remote level from the data, a hyperprior density h(·) yields a hyperparameter η that determines the form of a Bayes prior density gη(·). For example, gη(·) could correspond to a normal distribution N(η1, η2), with η = (η1, η2) sampled from the vague hyperprior density

    h(η1, η2) dη1 dη2 = dη1 dη2/η2,    η2 > 0.

The prior density gη(·) yields some unobservable real-valued parameters θk,

    θ0, θ1, θ2, ..., θK ~iid gη(·),    (6)

with iid meaning independent and identically distributed. (In this section the cases are indexed from 0 to K instead of from 1 to K, with θ0 representing a parameter of particular interest.) Each θk determines a sampling density fθk(·) that then yields some observable data xk,

    xk ~ fθk(·)  independently for  k = 0, 1, 2, ..., K.    (7)

The xk can be vector valued, and the densities fθk(·) can differ in form for the different cases. In fact our methods use only the likelihoods

    Lk(θk) = c fθk(xk),    (8)

with xk fixed as observed and c indicating an arbitrary positive constant. In an earlier work (Efron 1993), I showed how likelihoods Lk(θk) can be accurately approximated even in cases where nuisance parameters affect the distributions of the xk.

Our task here is to form interval estimates for any one of the θk, say θ0, on the basis of all of the observed data. The Bayes posterior interval for θ0 can be conveniently described by assuming that the "other" data,

    x = (x1, x2, ..., xK),    (9)

are observed before the direct data x0 for θ0. This route is illustrated by the dashed lines in Figure 2.

The marginal sampling density dη(x) is obtained by integrating out the unobserved vector of "other" parameters
θ = (θ1, θ2, ..., θK),

    dη(x) = ∫ gη(θ) fθ(x) dθ,    (10)

where gη(θ) = Πk gη(θk) and fθ(x) = Πk fθk(xk). Bayes's rule gives

    h(η|x) = c h(η) dη(x)    (11)

as the conditional density of η given x, with c a positive constant as before. This provides an induced prior density for θ0:

    gx(θ0) = ∫ h(η|x) gη(θ0) dη.    (12)

A final application of Bayes's rule gives the posterior density of θ0 based on all of the data:

    px(θ0|x0) = c gx(θ0) fθ0(x0) = c gx(θ0) L0(θ0).    (13)

The generic positive constant c may differ in the two parts of (13). Hierarchical Bayes interval estimates for θ0 are computed from px(θ0|x0). Notice that (13) generalizes the usual form of Bayes's rule that would apply if η (and so gη) were known:

    pη(θ0|x0) = c gη(θ0) L0(θ0).    (14)

The generalization consists of averaging gη(θ0) over h(η|x).

The hierarchical Bayes formula (13) neatly separates the information for θ0: the likelihood L0(θ0) is the direct information about θ0 contained in x0; the induced prior density gx(θ0) is the information about θ0 in the other data x. In practice, though, (13) can be difficult to use because of the substantial Bayesian inputs required in Figure 2. Empirical Bayes methods can be thought of as practical approximations to a full hierarchical Bayes analysis.

The maximum likelihood estimate (MLE) of η based on the other data x is

    η̂ = argmaxη {dη(x)}.    (15)

A familiar empirical Bayes tactic is to replace the induced prior gx(θ0) with the MLE prior gη̂(θ0) in (13), giving the MLE posterior density:

    pη̂(θ0|x0) = c gη̂(θ0) L0(θ0)    (16)

(see Morris 1983). Percentiles of the distribution corresponding to pη̂(θ0|x0) provide a simple form of empirical Bayes confidence intervals for θ0. The hyperprior density h(·) is no longer necessary. This approach and improvements on it are considered in the next three sections. We follow similar ideas of Carlin and Gelfand (1990) and Laird and Louis (1987), particularly in Section 5, which concerns correcting the biases in (16).

As a simple example, suppose that both gη(·) and fθ(·) in Figure 2 are normal:

    θk ~ N(M, A)  and  xk|θk ~ N(θk, 1)  for k = 0, 1, 2, ..., K.    (17)

Now η = (M, A). Marginally, the components xk of x are independently normal,

    dM,A(xk) ~ N(M, A + 1),    (18)

so that the MLE η̂ = (M̂, Â) has

    M̂ = x̄ = Σ(k=1..K) xk/K  and  Â = Σ(k=1..K) (xk − x̄)²/K − 1,    (19)

and L0(θ0) equals c e^(−(θ0 − x0)²/2). We assume that Â in (19) is nonnegative.

Formula (16) gives a MLE posterior density for θ0 that is normal,

    pη̂(θ0|x0) ~ N( x̄ + (1 − B̂)(x0 − x̄), 1 − B̂ ),    B̂ = 1/(Â + 1),    (20)

so an empirical Bayes one-sided level α interval for θ0 goes from −∞ to x̄ + (1 − B̂)(x0 − x̄) + z^(α)(1 − B̂)^(1/2), with z^(.95) = 1.645, etc.

Formula (20) suggests the empirical Bayes point estimator

    θ̂0 = x̄ + (1 − B̂)(x0 − x̄)    (21)

for θ0. This is a little different and not as good as the James-Stein estimator

    θ̃0 = x̄ + (1 − B̃)(x0 − x̄),    B̃ = (K − 2)/Σ(k=1..K) (xk − x̄)²;    (22)

see Efron and Morris (1973, lem. 2). The diminished accuracy comes from the fact that (20) does not use x0 in the estimation of η = (M, A).

We can imagine estimating (M, A) as in (19), except using a data set x0, x0, ..., x0, x1, x2, ..., xK, in which x0 is repeated n0 times. For large K, straightforward calculations show that the choice n0 = 5, rather than n0 = 0 as in (19), makes the empirical Bayes point estimate (21) nearly equal the James-Stein estimate (22). Efron and Morris (1973, sec. 2) derived n0 = 3 as the optimal choice for a slightly different estimation problem.

The point here is that x0 can reasonably be included in the estimation of η̂ in formula (16). This is done in the examples that follow, though it would also be feasible to carry out the calculations with x0 excluded. In these examples x indicates all of the data, with the cases numbered 1, 2, ..., K as in Section 1.
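The normal-normal recipe (17)-(22) is easy to simulate. The sketch below (plain Python; not from the paper, and the true hyperparameters M = −1, A = 2 are invented for the demonstration) computes the MLE-based shrinkage estimate (19)-(21), the James-Stein version (22), and the one-sided .95 limit from (20).

```python
import math
import random

# Hypothetical simulation of model (17); M_true and A_true are invented values.
random.seed(0)
K = 200
M_true, A_true = -1.0, 2.0
thetas = [random.gauss(M_true, math.sqrt(A_true)) for _ in range(K)]
xs = [random.gauss(th, 1.0) for th in thetas]         # x_k | theta_k ~ N(theta_k, 1)

xbar = sum(xs) / K                                    # M-hat, (19)
S = sum((x - xbar) ** 2 for x in xs)
A_hat = S / K - 1.0                                   # A-hat, (19); assumed nonnegative
B_hat = 1.0 / (A_hat + 1.0)                           # shrinkage factor in (20), equal to K/S
B_js = (K - 2) / S                                    # James-Stein shrinkage, (22)

x0 = xs[0]
eb_point = xbar + (1 - B_hat) * (x0 - xbar)           # empirical Bayes point estimate (21)
js_point = xbar + (1 - B_js) * (x0 - xbar)            # James-Stein estimate (22)
upper95 = eb_point + 1.645 * math.sqrt(1 - B_hat)     # one-sided .95 limit from (20)
```

For K this large the two shrinkage factors differ only by 2/S, so the two point estimates nearly coincide; repeating the exercise with x0 copied n0 = 5 extra times into (19) reproduces the James-Stein match described above.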
3. EXPONENTIAL FAMILIES OF PRIOR DENSITIES

The family of possible prior densities gη(·) in Figure 2 is an essential ingredient of our empirical Bayes analysis. This section discusses a wide class of priors based on exponential family theory. Exponential family priors are flexible enough to allow a model building and model checking approach to empirical Bayes problems, more in the spirit of a regression analysis than of a traditional conjugate prior Bayes solution. The price for this generality is solutions that are numerical and approximate rather than formulaic and exact.

An exponential family F of prior densities gη(θ) for a real-valued parameter θ can be written as

    F = { gη(θ) = e^(η′t(θ) − ψ(η)) g0(θ),  η ∈ A }.    (23)

Here η is a p-dimensional parameter vector (the natural or canonical vector) and t(θ) is the p-dimensional sufficient vector, depending on θ. For example, we might take p = 3 and t(θ) = (θ, θ², θ³)′. The nonnegative function g0(θ) is the carrier density. The moment-generating function ψ(η) is chosen to make gη(θ) integrate to 1 over Θ, the range of θ:

    ∫Θ gη(θ) dθ = ∫Θ e^(η′t(θ) − ψ(η)) g0(θ) dθ = 1.    (24)

The convex set A is the subset of η vectors for which (24) is possible. All of these are standard definitions for an absolutely continuous exponential family on the real line, following Lehmann (1983, sec. 1.4), though (23) may look peculiar because θ is the random variable rather than the parameter.

It is easy to find the marginal MLE η̂, (15), in the exponential family F. For any function h(θ), define Iη(h) to be the integral

    Iη(h) = ∫Θ h(θ) e^(η′t(θ)) g0(θ) dθ.    (25)

If h(θ) is a vector or a matrix, then definition (25) applies componentwise. Notice that the integrals in (25) are one-dimensional no matter what the dimensions of η, t, or h. This leads to easy numerical evaluation of Iη(h). Existence of the integral (25) is assumed in what follows, as are other formal properties such as differentiation under the integral sign. Brown (1984, chap. 2) showed why these properties usually hold in exponential families.

The marginal density of xk, the kth component of x in Figure 2, can be expressed in terms of the likelihood Lk(θk) = c fθk(xk):

    dη(xk) = ∫Θ gη(θk) fθk(xk) dθk = ∫Θ g0(θk) e^(η′t(θk) − ψ(η)) Lk(θk) dθk / c.    (26)

Because

    e^(ψ(η)) = ∫Θ g0(θ) e^(η′t(θ)) dθ = Iη(1)    (27)

according to (24), formula (26) gives

    dη(xk) = c Iη(Lk)/Iη(1).    (28)

The constant c depends on xk but not on the choices of η, t(θ), or g0(θ) in (23).

The gradient (∂/∂η)Iη(h) (i.e., the vector with components (∂/∂ηj)Iη(h) for j = 1, 2, ..., p) is

    (∂/∂η)Iη(h) = ∫Θ t(θ) h(θ) e^(η′t(θ)) g0(θ) dθ = Iη(th).    (29)

Letting mη(xk) be the marginal log-likelihood for xk,

    mη(xk) = log(dη(xk)) = log Iη(Lk) − log Iη(1) + log c,    (30)

we obtain the marginal score function (∂/∂η)mη(xk) from (29):

    (∂/∂η)mη(xk) = Iη(tLk)/Iη(Lk) − Iη(t)/Iη(1).    (31a)

The total score (∂/∂η)mη(x) is the sum of (31a) over the marginally independent components xk,

    (∂/∂η)mη(x) = Σ(k=1..K) (∂/∂η)mη(xk).    (31b)

It is easy to compute the marginal MLE η̂, (15), from the estimating equation:

    Σ(k=1..K) [ Iη(tLk)/Iη(Lk) − Iη(t)/Iη(1) ] = 0.    (32)

In the examples that follow, (32) was solved by Newton-Raphson iteration. The second derivative matrix is obtained from (29) and (31):

    −(∂²/∂η²)mη(x) = Σ(k=1..K) { [ Iη(t²)/Iη(1) − (Iη(t)/Iη(1))² ] − [ Iη(t²Lk)/Iη(Lk) − (Iη(tLk)/Iη(Lk))² ] }.    (33)

Here (∂²/∂η²)mη(x) is the p × p matrix with entries (∂²/∂ηi∂ηj)mη(x), and for any p-vector v, the notation v² indicates the p × p matrix (vivj). A more intuitive explanation of (32) and (33), relating to the EM algorithm, appears in Remark B.

Newton-Raphson iterates η̂(0), η̂(1), ... converging to the solution of (32) are obtained from the usual updating equation,

    η̂(j+1) = η̂(j) + [ −(∂²/∂η²)mη(x) ]^(−1) (∂/∂η)mη(x) |η=η̂(j).    (34)

The observed information matrix −(∂²/∂η²)mη(x)|η̂ gives an approximate covariance matrix for the MLE η̂,

    côv(η̂) = [ −(∂²/∂η²)mη(x)|η̂ ]^(−1).    (35)
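Once θ is discretized, the whole fitting loop (25)-(35) reduces to one-dimensional grid sums. The sketch below (NumPy; not from the paper) fits a quadratic-model prior to a handful of invented normal likelihoods, using the conditional-moment forms of the score and information given later in Remark B, (39)-(40), inside the Newton-Raphson update (34).

```python
import numpy as np

# Hypothetical illustration: the likelihood centers below are invented.
theta = np.linspace(-8.0, 5.0, 600)
dt = theta[1] - theta[0]
centers = np.array([-2.0, -1.2, 0.5, -1.6, -0.4, -2.6, 0.1, -3.0])
L = np.exp(-0.5 * (theta[None, :] - centers[:, None]) ** 2)
L /= L.sum(axis=1, keepdims=True) * dt        # each L_k integrates to 1, as in Figure 1

g0 = L.mean(axis=0)                           # carrier density, the average likelihood (36)
t = np.vstack([theta, theta ** 2])            # sufficient vector t(theta), quadratic model

def grid_prior(eta):
    """g_eta on the grid, normalized so that (24) holds numerically."""
    w = np.exp(eta @ t) * g0
    return w / (w.sum() * dt)

eta = np.zeros(2)
for _ in range(100):
    prior = grid_prior(eta)
    Ep = (t * prior).sum(axis=1) * dt                  # E_eta{t(theta)}
    Vp = (t * prior) @ t.T * dt - np.outer(Ep, Ep)     # Cov_eta{t(theta)}
    score = np.zeros(2)
    info = np.zeros((2, 2))
    for Lk in L:
        post = prior * Lk
        post /= post.sum() * dt                        # density of theta_k given x_k
        Ec = (t * post).sum(axis=1) * dt               # E_eta{t | x_k}
        Vc = (t * post) @ t.T * dt - np.outer(Ec, Ec)  # Cov_eta{t | x_k}
        score += Ec - Ep                               # score function (39)
        info += Vp - Vc                                # observed information (40)
    step = np.linalg.solve(info, score)                # Newton-Raphson update (34)
    eta = eta + step
    if np.abs(step).max() < 1e-10:
        break

prior = grid_prior(eta)
mean = float((theta * prior).sum() * dt)
sd = float(np.sqrt(((theta - mean) ** 2 * prior).sum() * dt))
```

The fitted mean and standard deviation here play the role of the expectation and standard deviation reported in (38); jackknife standard errors would be obtained by deleting one Lk at a time and rerunning the same loop.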
Figure 3. Fitting a Prior Density gη̂(·) to the Ulcer Data Likelihoods. (a) Carrier g0 equals average of likelihoods (dotted line); quad (solid line) is gη̂ from quadratic model t(θ) = (θ, θ²)′; cubic (dashed line) is gη̂ from cubic model t(θ) = (θ, θ², θ³)′. (b) log(gη̂) for quadratic (solid line) and cubic (dashed line) models. Cases k = 40, 41 in Table 1 were not included in this analysis.
Notice that η̂ and côv(η̂) are obtained from the likelihoods L1, L2, ..., LK, with no further reference to the original data x1, x2, ..., xK. Jackknife estimates of covariance are used as a check on formula (35) in what follows, and in fact give larger estimated variance in our examples.

This program was applied to the ulcer data of Table 1, with cases k = 40, 41 excluded for reasons discussed later, so K = 39. The carrier g0(θ) in (23) was taken to be the average likelihood,

    g0(θ) = Σ(k=1..K) Lk(θ)/K,    (36)

with ∫ Lk(θ) dθ = 1 as in Figure 1. The dotted curve in Figure 3a shows g0. Other choices of g0, discussed later, made little difference to the estimation of gη̂.

The results of two different choices of the exponential family F, (23), are shown in Figure 3: the quadratic model t(θ) = (θ, θ²)′ and the cubic model t(θ) = (θ, θ², θ³)′. (We do not need an intercept term in t(θ), for example (1, θ, θ²)′, because constants are absorbed into ψ(η).) The cubic MLE gη̂ is centrally a little narrower than gη̂ for the quadratic model, but it is bimodal, with a second mode at the left end of the θ scale. The difference looks more dramatic in Figure 3b, which plots the log densities.

The quadratic model forces gη̂(θ) to be nearly symmetric, whereas the cubic model allows for asymmetry. Does the cubic model provide a significantly better fit to the ulcer data? We can use Wilks's likelihood ratio criterion to answer this question. Let Iη̂(c)(h) and Iη̂(q)(h) indicate (25) for the cubic and quadratic MLE models. According to (28), Wilks's statistic is

    W = 2 log[ dη̂(c)(x)/dη̂(q)(x) ] = 2 Σ(k=1..K) log( [Iη̂(c)(Lk)/Iη̂(c)(1)] / [Iη̂(q)(Lk)/Iη̂(q)(1)] ).    (37)

We can compare W to a χ² distribution to test for a significantly better fit. In this case the computed value W = .64 indicates no improvement in going from the quadratic to the cubic model. Other possible improvements on the quadratic model fared no better.

All these calculations excluded cases k = 40 and 41 in Table 1. L41(θ41) is perfectly flat, which drops it out of the formulas whether or not we intend to do so. Figure 1 shows that L40(θ40) is located in the far negative reaches of the θ axis. When case k = 40 was included in the calculations, W significantly favored the cubic model over the quadratic. But looking at the individual summands on the right side of (37), case k = 40 by itself accounted for almost all of W's significance. It was on these grounds that case 40 was removed from the analysis as an overly influential outlier. In general, it is a good idea to consider the components of W rather than just W alone.

The MLE quadratic prior density gη̂(θ) has expectation −1.22 and standard deviation 1.19. How accurate are these values? Jackknife standard errors were computed by removing each case k = 1, 2, ..., 39 in turn and recalculating gη̂ (which amounts to thinking of the cases xk as being randomly sampled from some superpopulation):

    Expectation = −1.22 ± .26,    Standard deviation = 1.19 ± .31.    (38)

The ± numbers are one jackknife standard error. These nonparametric jackknife standard errors were larger than those obtained parametrically from (35). In doing the jackknife calculations, it was possible to remove the deleted case from the definition of g0(θ) in (36) and also remove it from the MLE equation (32), thereby accounting for the data-based choice of carrier. But doing so made very little difference here.

The choice of the average likelihood (36) for the carrier is convenient but not at all necessary. Smoother versions of g0(θ), obtained by Gaussian smoothing of (36), were tried on the ulcer data. These gave much different carriers g0(θ) but almost identical estimates of the prior density gη̂(θ); see Remark A. Both here and in the toxoplasmosis example of Section 6, the choice of g0 seemed to be the least sensitive aspect of the prior fitting process. A flat density over
the range of θ gave almost the same results, but of course this would not always be true. Recent work by the author extends formula (35) to include a data-based choice of the carrier, but such formulas can be avoided by the use of the jackknife as in (38).

In the normal-normal situation of (17), the average likelihood g0(θ) approaches a N(M, A + 2) distribution as K → ∞, considerably more broadly supported than gη(θ) ~ N(M, A), the prior density that we are trying to estimate. This wider support will usually be the case, making (36) at least a sufficiently broad choice for the carrier.

Remark A. Returning to Figure 2, suppose that θ1, θ2, ... (but not θ0) are observed without error, so Lk(θk) is a delta function at θk. In this case the average likelihood (36) is equivalent to the empirical distribution on {θ1, θ2, ..., θK}. Then it is easy to show that the estimated posterior density pη̂(θ0|x0), (16), is the discrete distribution putting probability L0(θk)/Σj L0(θj) on θk. This is a reasonable estimator, but we might prefer a smoother answer. Smoothness could be achieved by convolving (36) with a normal kernel, as mentioned previously.

Remark B. Formulas (32) and (33) can be rewritten in terms of conditional expectations and covariances:

    (∂/∂η)mη(x) = Σ(k=1..K) [ Eη{t(θk)|xk} − Eη{t(θk)} ]    (39)

and

    (∂²/∂η²)mη(x) = −Σ(k=1..K) [ Covη{t(θk)} − Covη{t(θk)|xk} ].    (40)

These expressions are related to the EM algorithm and the missing-data principle, the missing data in this case being the unobserved parameters θ1, θ2, ..., θK (see Little and Rubin 1987, sec. 7.2, and Tanner 1991).

4. RELEVANCE

A crucial issue in empirical Bayes applications is the relevance of the "other" data x in Figure 2 to the particular parameter of interest θ0. Empirical Bayes methods usually pull extreme results toward the group mean. If θ0 represents something desirable like a cure rate, then the relevance of the other data may suddenly seem debatable. This section concerns a scheme for assessing the relevance of the other data to θ0, allowing θ0 to opt out of the empirical Bayes scheme under certain circumstances. This idea is related to the limited translation estimator of Efron and Morris (1971, 1972).

It is easy to modify the hierarchical Bayes model to allow for exceptional cases. We assume that a random choice "C" has been made between two different prior densities for θ0: C = A with probability hA, "A" meaning that Figure 2 applies as shown; but C = B with probability hB = 1 − hA, "B" meaning that θ0 is drawn from a different density than gη, say gB. Typically, gB would be a vague, uninformative prior density, not containing much information about θ0. The choice C is assumed to be independent of all random mechanisms in Figure 2. This amounts to allocating a priori probability hB that the information in x is irrelevant to the estimation of θ0. The statistician must choose hB, perhaps using a conventional value like .05, but this is the only non-data-based element required in the analysis.

Having observed x, there are two possible prior densities for θ0:

    θ0 ~ gA with probability hA,  or  θ0 ~ gB with probability hB = 1 − hA,    (41)

where we have renamed gx(θ0), (12), as gA(θ0) for convenient notation. These give x0 the marginal densities dA(x0) = ∫Θ gA(θ0)fθ0(x0) dθ0 or dB(x0) = ∫Θ gB(θ0)fθ0(x0) dθ0, with the overall marginal density being

    d(x0) = hA dA(x0) + hB dB(x0).    (42)

After x0 is observed, the a posteriori probabilities for the two choices in (41) are

    hA(x0) = hA dA(x0)/d(x0)  and  hB(x0) = hB dB(x0)/d(x0).    (43)

The overall posterior density for θ0 is

    p(θ0|x0) = hA(x0) pA(θ0|x0) + hB(x0) pB(θ0|x0),    (44)

where

    pC(θ0|x0) = gC(θ0) fθ0(x0)/dC(x0)    (45)

for C equal to A or B.

There are two extreme cases of (44). If hA(x0) = 1, then we are certain that x is relevant to the estimation of θ0. In this case the a posteriori density for θ0 is pA(θ0|x0) = px(θ0|x0), (13). If hA(x0) = 0, then x is completely irrelevant, and the a posteriori density is pB(θ0|x0). The example that follows takes pB(θ0|x0) to be the confidence density, meaning that by definition the a posteriori intervals based on pB(θ0|x0) coincide with confidence intervals for θ0 based only on x0; see Remark D.

Bayes rule gives

    hA(x0)/hB(x0) = (hA/hB) R(x0),  where  R(x0) = dA(x0)/dB(x0) = ∫Θ gA(θ0)L0(θ0) dθ0 / ∫Θ gB(θ0)L0(θ0) dθ0.    (46)

At this point it looks like we could use (46) to evaluate hA(x0) and hB(x0), and then take (44) as our overall a posteriori density for θ0, an appealing compromise between
Figure 4 shows the percentiles of the overall posterior
distribution, density (44), as a function of hA, the prior
probability that 08 should be included in the empirical
Bayes scheme. The central .90 interval with hA = .95
is (-5.26, -1.787), compared to (-5.10, -1.76) for hA
= 1.00, the full empirical Bayes inference. The likelihood
ratio R(x8) = dA(x8)/dB(x8) was estimated as .47 from
(47) and (50), so hA = .95 results in a posteriori probability hA(xo) = .90. As a point of comparison, the same
computation gives R(x) = .077 and hA(xo) = .59 for xo
representing experiment 40, the excluded case in Table 1.
C')
I
0.0
I
0.2
I
0.4
I
0.6
hA
0.8
1.0
Figure 4. Interval Estimates for 08, Ulcer Data, as a Function of hA,
the Prior Probability that 08 is Drawn From the Same Density as the
Other Parameters Ok. Curves indicate percentiles of the overall posterior
density (55). Likelihood ratio R, (57), estimated to equal .47.
fully including or fully excluding 00 from the empirical
Bayes inference scheme. But there is a practical difficulty.
Uninformative priors such as 9B(0O) are usually defined
only up to an unspecified multiplicative constant. This produces a similar ambiguity in R(xo), say
R(xo) = cr(xo),
(47)
where r(xo) is a known function but c is an unknown positive constant. What follows is a data-based scheme for evaluating c.
Notice that

∫ hB(x0) d(x0) dx0 = hB,  (48)

according to (43). This just says that the marginal expectation of hB(x0) is hB. Solving (46), (47) gives

hB(x0) = 1 / [1 + (hA/hB) c r(x0)].  (49)

We can use (48) and (49) to define a method-of-moments estimate ĉ for c:

(1/K) Σ_{k=1}^{K} 1 / [1 + (hA/hB) ĉ r(xk)] = hB.  (50)

Having determined ĉ, we can now use (46) and (47) to evaluate the compromise a posteriori density (44) for θ0.

This scheme was applied to the estimation of θ8 for the ulcer data. The density gA in (41) was a bias-corrected version of the quadratic MLE ĝη̂ shown in Figure 3 (gA = g† of Fig. 6 in the next section). The uninformative prior gB was taken to be

gB(θ8) = c e^{.1686 θ8}  (51)

for reasons discussed in Remarks C and D.

Remark C. If we choose the uninformative prior gB correctly, then the percentiles of the posterior density pB(θ0 | x0) = c gB(θ0) L0(θ0) will nearly equal the confidence limits for θ0 based on x0. The prior density (51) is correct in this sense for the likelihood L8(θ8) obtained by conditioning the eighth 2 × 2 ulcer data table on its marginals. This means that the percentiles at hA = 0 in Figure 4 are nearly the standard confidence limits for θ8 based on (a8, b8, c8, d8).

Remark D. Formula (51) is a Welch-Peers prior density, one that gives a posteriori intervals agreeing with ordinary confidence intervals to a second order of approximation. Of course, simpler choices, such as gB(θ0) = constant, might be used instead. Earlier work (Efron 1993, sec. 6) gives a straightforward numerical algorithm for calculating Welch-Peers densities in general exponential family situations. Formula (51) was actually obtained from the doubly adjusted ABC likelihood (6.26) of that paper.

5. BIAS CORRECTION

The MLE prior density ĝη̂(θ0) tends to be lighter-tailed than the induced prior gx(θ0), (12). This bias makes interval estimates based on the MLE posterior density narrower than they should be according to the corresponding hierarchical Bayes analysis. Bias correction has become a major theme of the recent empirical Bayes literature (see, for example, Carlin and Gelfand 1990, 1991; Laird and Louis 1987; and Morris 1983).

This section introduces a new bias correction technique specifically designed for hierarchical Bayes situations. The technique uses confidence interval calculations to approximate the induced prior gx(θ0) that we would have obtained starting from an appropriate uninformative hyperprior density h(η). The actual form of h(η) is never required here, which is a great practical advantage when dealing with the general exponential families of Section 3.

Suppose that having observed x from the multiparametric family of densities dη(x), we wish to make inferences about γ(η), a real-valued parameter of interest. In what follows, γ(η) is gη(θ0) evaluated at a fixed value of θ0. Let γ̂[α] be the endpoint of an approximate one-sided level α confidence interval for γ having second-order accuracy,
Prη{γ ≤ γ̂[α]} = α + O(1/K).  (52)
546
Journal of the American Statistical Association, June 1996
Second-order accuracy says that the accuracy of the approximate interval is improving an order of magnitude faster than the usual O(K^{-1/2}) rate in the sample size K. The ABC intervals (DiCiccio and Efron 1992) are convenient second-order-accurate intervals for the calculations that follow, though other schemes could be used. In the ABC construction, σ̂ is the usual estimate of standard error for γ̂ and (a, b, cq) are the three ABC constants described by DiCiccio and Efron (1992), Efron (1993), and Efron and Tibshirani (1993). The numerical calculations needed for (a, b, cq) are about twice those required to compute γ̂.

The confidence expectation of γ given x is defined as

γ† = ∫₀¹ γ̂[α] dα.  (53)

This is the expectation of the confidence density, the density obtained by considering the confidence limits to be quantiles; for example, that γ exists in (γ̂[.90], γ̂[.91]) with probability .01. Efron (1993) shows that the confidence density derived from the ABC limits matches to second order the Bayes a posteriori distribution of γ(η) given x, starting from an appropriate uninformative prior density h(η), the Welch-Peers density.

The confidence expectation γ† can be thought of as an adjusted version of the MLE γ̂ = γ(η̂) that better matches the a posteriori expectation of γ given x, starting from an uninformative distribution for η. We will apply this to γ = gη(θ0) for a fixed value of θ0, producing the confidence expectation g†(θ0) according to definition (53). This is the confidence expectation method of bias correction:

g†(θ0) = ∫₀¹ ĝ(θ0)[α] dα,  (54)

where ĝ(θ0)[α] is the level-α ABC endpoint for gη(θ0). The appeal of g†(θ0) is that it is a direct approximation to the induced prior gx(θ0), (12), starting from a vague hyperprior: gx(θ0) is the a posteriori expectation of gη(θ0) given x, and the Welch-Peers density is the appropriate uninformative choice of hyperprior.

The calculations here were done using abcnon, an S program available from statlib@lib.stat.cmu.edu by typing the one-line message "send bootstrap funs from S". This is a nonparametric version of the ABC method that treats the K likelihoods Lk(θk) as iid observations from a space of random functions.

Here is a simple example where the confidence expectation method can be directly compared to a genuine hierarchical Bayes analysis starting from an uninformative hyperprior. We consider the normal situation (17) and (18) with K = 20. The data vector x is taken to be a stylized normal sample,

x = c0 (Φ⁻¹(.5/20), Φ⁻¹(1.5/20), ..., Φ⁻¹(19.5/20)),  (55)

with the constant c0 chosen to make Σ_{k=1}^{20} xk²/20 = 2. Then (M̂, Â), (19), equals (0, 1), so that the MLE prior ĝη̂(θ0) is the standard normal density (1/√(2π)) exp(−.5 θ0²). This is shown as the starred curve on the left half of Figure 5.

Figure 5. Estimates of Prior Density for the Normal Situation (17), (18); K = 20, Stylized Sample x, (55). All estimates are symmetric about θ = 0. The solid line is the induced Bayes prior gx(θ0) starting from the vague hyperprior dM dA/(A + 1); stars show the MLE ĝη̂(θ0); small dots indicate the confidence expectation g†(θ0), nicely matching the Bayes curve; large dashes indicate the density obtained from Carlin and Gelfand's (1990) calibration method, called cdf correction here. All of the curves are normalized to have area 1.

The Bayes curve in Figure 5 is the induced prior gx(θ0) obtained from the uninformative hyperprior dM dA/(A + 1), A > 0. It has heavier tails than the MLE ĝη̂(θ0). The confidence expectation density g†(θ0), obtained from (54) using abcnon, nicely matches the Bayes curve.
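Definition (53) is a one-dimensional average of confidence endpoints and is easy to sketch numerically. The normal-theory endpoint function below, γ̂[α] = γ̂ + σ̂ z_α with illustrative values γ̂ = 1.7 and σ̂ = .4, is a stand-in for the ABC endpoints.

```python
from statistics import NormalDist

def confidence_expectation(endpoint, n_grid=2000):
    """gamma-dagger = integral over alpha in (0, 1) of the level-alpha
    confidence endpoint, as in (53), approximated by the midpoint rule."""
    return sum(endpoint((i + 0.5) / n_grid) for i in range(n_grid)) / n_grid

# stand-in endpoint function: exact normal-theory limits
gamma_hat, sigma_hat = 1.7, 0.4
z = NormalDist().inv_cdf
gdagger = confidence_expectation(lambda a: gamma_hat + sigma_hat * z(a))
```

For these symmetric limits the averaging simply recovers γ̂ (here 1.7); asymmetry in the endpoints, as with ABC limits, is what shifts γ† away from the MLE.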
Another bias correction technique, called cdf correction here, appears in Figure 5. This is a bootstrap method proposed by Carlin and Gelfand (1990), following suggestions by Cox (1975) and Efron (1987). Cdf correction is a calibration technique introduced to achieve what Morris (1983) calls the empirical Bayes confidence property: for example, that a 95% empirical Bayes interval for θ0 will contain θ0 95% of the time when averaging over the random selection of θ0 and (x0, x) in Figure 2. This is a somewhat different criterion than bias correcting ĝη̂(θ0) to match gx(θ0), and in fact cdf correction does not match the Bayes curve very well in Figure 5.
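Morris's empirical Bayes confidence property can be illustrated by direct simulation. The normal-normal setup below (θ0 ~ N(0, 1), x0 | θ0 ~ N(θ0, 1), exact posterior N(x0/2, 1/2)) is a hypothetical stand-in for an actual empirical Bayes interval; the point is the averaging over the random selection of θ0 and x0.

```python
import random
from statistics import NormalDist

random.seed(7)
z975 = NormalDist().inv_cdf(0.975)

def coverage(n_sim=20000):
    """Fraction of simulations in which the 95% interval for theta_0
    contains theta_0, averaging over the random draws of theta_0 and x_0
    (Morris's empirical Bayes confidence property)."""
    hits = 0
    for _ in range(n_sim):
        theta = random.gauss(0.0, 1.0)       # theta_0 drawn from the prior
        x = random.gauss(theta, 1.0)         # x_0 given theta_0
        center, sd = x / 2.0, 0.5 ** 0.5     # posterior is N(x/2, 1/2)
        if center - z975 * sd <= theta <= center + z975 * sd:
            hits += 1
    return hits / n_sim

cov = coverage()
```

Here the marginal coverage is 95% by construction; a calibration method such as cdf correction adjusts an empirical Bayes interval until a simulation of this kind comes out at the nominal level.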
Details of these calculations were provided in earlier
work (Efron 1994, sec. 4), along with another more favorable example of cdf correction. Laird and Louis' (1987)
Bayesian bootstrap technique was also briefly discussed in
that article, where it performed the poorest of the three
methods. On the basis of admittedly limited experience, the
author slightly prefers the confidence expectation method to
cdf correction, and strongly prefers either of these to the
Bayesian bootstrap.
The confidence expectation method of bias correction is applied to the ulcer data in Figure 6, again using abcnon. Figure 6a shows that g†(θ) has a substantially heavier left tail than the uncorrected MLE prior ĝη̂(θ), but there is not much difference in the right tail. Figure 6b compares the MLE a posteriori density p̂η̂, (16), for populations 8 and 13 with the corresponding posterior p† obtained by replacing ĝη̂(θ) with g†(θ). Bias correction has a large effect on the inference for population 8, but not for population 13. The cdf correction method was also applied to the ulcer data, producing a bias-corrected prior density with tails moderately heavier than ĝη̂(θ). Only minor changes resulted in the posterior densities for populations 8 and 13.

Figure 6. Application of Bias Correction to the Ulcer Data. (a) The solid line is the quadratic MLE ĝη̂ from Figure 3; the dotted line is the confidence expectation density g†, which has a noticeably heavier left tail. Dashed lines indicate likelihoods for populations 8 and 13. (b) Solid lines show the MLE a posteriori densities for populations 8 and 13; dotted lines show the a posteriori densities with ĝη̂ replaced by g†. The bias correction makes a substantial difference for population 8, but not for population 13.
Remark E. We could bias correct the posterior density p̂η̂(θ0 | x0) = c ĝη̂(θ0) L0(θ0) rather than the prior density ĝη̂(θ0). This is the route taken in all of the references. An advantage of the tactic here, bias correcting ĝη̂(θ0), is that it need be done only once rather than separately for each θk. In the example of this article, it was numerically more stable than bias correcting p̂η̂(θ0 | x0) in cases where L0(θ0) was far removed from ĝη̂(θ0).
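The posterior construction in Remark E is just pointwise multiplication and renormalization on a θ0 grid. In this sketch the normal prior and likelihood shapes are illustrative stand-ins for ĝη̂(θ0) and L0(θ0).

```python
import math

def posterior_on_grid(grid, prior, like):
    """p(theta0 | x0) = c * g(theta0) * L0(theta0), normalized so the
    trapezoid-rule integral over the grid equals 1."""
    vals = [prior(t) * like(t) for t in grid]
    h = grid[1] - grid[0]
    area = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return [v / area for v in vals]

phi = lambda t, m, s: math.exp(-0.5 * ((t - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
grid = [-6 + 12 * i / 600 for i in range(601)]
prior = lambda t: phi(t, 0.0, 1.0)     # stand-in for the (bias-corrected) prior
like = lambda t: phi(t, 1.0, 0.8)      # stand-in for the likelihood L0(theta0)
post = posterior_on_grid(grid, prior, like)
```

For these two normal shapes the exact posterior mean is (1/.64)/(1 + 1/.64) ≈ .61, which the grid computation reproduces.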
6. REGRESSION AND EMPIRICAL BAYES

The individual parameters θk in Figure 2 can be linked to each other by a regression equation, as well as having an empirical Bayes relationship. It is easy to incorporate regression structure into the empirical Bayes methodology of the preceding sections. Doing so results in a very general version of the classical mixed model. The random-effects portion of the mixed model is used to explain the overdispersion in the binomial regression example that follows. There is a large literature concerning normal theory random effects in generalized linear models (see Breslow and Clayton 1993).

We write θk as the sum of fixed and random effects,

θk = vk + δk,  (56)

where the fixed effects vk depend on known q-dimensional covariate vectors ck and an unknown q-dimensional parameter vector β,

vk = ck'β.  (57)

The random effects δk are assumed to be iid draws from an exponential family like (23), with density gη(δ) depending on an unknown parameter vector η,

gη(δ) = e^{η't(δ) − ψ(η)} g0(δ).  (58)

As in (25), we define Iη(h) = ∫ h(δ) e^{η't(δ)} g0(δ) dδ for any function h(δ), the integral being over the range of δ.

We observe the likelihood functions

Lxk(θk),  (59)

as in (8), and we wish to estimate (η, β). Define

Lk(δk) ≡ Lxk(vk + δk),  (60)

thought of as a function of δk, with vk and xk held fixed. It is easy to verify that the marginal density of xk is

dη,β(xk) = c Iη(Lk)/Iη(1),  (61)

as in (28). Then (31a,b) follow, giving the maximum likelihood equation with respect to η:

0 = ∂ log dη,β(x)/∂η = Σk [Iη(t Lk)/Iη(Lk) − Iη(t)/Iη(1)].  (62)

Letting

L̇k ≡ ∂Lxk(vk + δk)/∂vk  and  L̈k ≡ ∂²Lxk(vk + δk)/∂vk²,  (63)

thought of as functions of vk, with δk and xk held fixed, the maximum likelihood equation with respect to β is

0 = ∂ log dη,β(x)/∂β = Σk ck Iη(L̇k)/Iη(Lk).  (64)

Together, (62) and (64) determine the maximum likelihood estimates (η̂, β̂), with their solutions being found by Newton-Raphson iteration.
Table 2. The Toxoplasmosis Data

 k      r    s    n     k      r    s    n     k      r    s    n     k      r    s    n
 1  1,620    5   18    10  1,780    8   13    19  1,918   23   43    28  2,063   46   82
 2  1,650   15   30    11  1,796   41   77    20  1,920    3    6    29  2,077    7   19
 3  1,650    0    1    12  1,800    3    5    21  1,920    0    1    30  2,100    9   13
 4  1,735    2    4    13  1,800    8   10    22  1,936    3   10    31  2,200    4   22
 5  1,750    2    2    14  1,830    0    1    23  1,973    3   10    32  2,240    4    9
 6  1,750    2    8    15  1,834   53   75    24  1,976    1    6    33  2,250    8   11
 7  1,756    2   12    16  1,871    7   16    25  2,000    1    5    34  2,292   23   37
 8  1,770    6   11    17  1,890   24   51    26  2,000    0    1
 9  1,770   33   54    18  1,900    3   10    27  2,050    7   24

Total: s = 356, n = 697.

NOTE: Number of subjects testing positive (s), number tested (n), and annual rainfall in mm (r), 34 cities in El Salvador.
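Table 2 in machine-readable form, as transcribed from the table above; the printed totals serve as a transcription check, and the observed proportions pk = sk/nk are the raw inputs to the regression analysis.

```python
# (rainfall r in mm, positive s, tested n) for cities k = 1..34, from Table 2
toxo = [
    (1620, 5, 18), (1650, 15, 30), (1650, 0, 1), (1735, 2, 4), (1750, 2, 2),
    (1750, 2, 8), (1756, 2, 12), (1770, 6, 11), (1770, 33, 54),
    (1780, 8, 13), (1796, 41, 77), (1800, 3, 5), (1800, 8, 10), (1830, 0, 1),
    (1834, 53, 75), (1871, 7, 16), (1890, 24, 51), (1900, 3, 10),
    (1918, 23, 43), (1920, 3, 6), (1920, 0, 1), (1936, 3, 10), (1973, 3, 10),
    (1976, 1, 6), (2000, 1, 5), (2000, 0, 1), (2050, 7, 24),
    (2063, 46, 82), (2077, 7, 19), (2100, 9, 13), (2200, 4, 22), (2240, 4, 9),
    (2250, 8, 11), (2292, 23, 37),
]
p = [s / n for (r, s, n) in toxo]        # observed proportions p_k = s_k / n_k
totals = (sum(s for _, s, _ in toxo), sum(n for _, _, n in toxo))
```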
The p × p second derivative matrix −(∂²/∂η²) log dη,β(x) is given by (33). The q × q matrix −(∂²/∂β²) log dη,β(x) is

−(∂²/∂β²) log dη,β(x) = Σk ck ck' [ (Iη(L̇k)/Iη(Lk))² − Iη(L̈k)/Iη(Lk) ],  (65)

and the p × q off-diagonal second derivative matrix is

−(∂²/∂η ∂β) log dη,β(x) = Σk [ Iη(t Lk) Iη(L̇k)/Iη(Lk)² − Iη(t L̇k)/Iη(Lk) ] ck'.  (66)

Table 2 shows the data from a disease prevalence study in 34 cities in El Salvador (Efron 1986). In city k, sk out of nk subjects tested positive for toxoplasmosis antibodies, k = 1, 2, ..., 34. We assume a binomial sampling model,

sk ~ Bi(nk, πk),  (67)

where θk is the logit of the true positive rate πk in city k. A cubic logistic regression on rainfall rk was suggested by previous analyses:

θk = β0 + β1 rk + β2 rk² + β3 rk³.  (68)

The solid curve in the left panel of Figure 7 is the MLE cubic logistic regression, transformed back to the probability scale π = 1/(1 + e^{−θ}). The cubic model is quite significant, but it fails to account for nearly half of the city-to-city variability: the deviances from the cubic fit are about twice as big as they should be under model (67) and (68). This overdispersion can be explained, or at least described, in terms of the random effects in model (56)-(58). Roughly speaking, the variance of gη̂(δ) accounts for about half of the total deviance of the data points from the regression curve.

The mixed model (56) was applied to the toxoplasmosis data, with

vk = β0 + β1 rk + β2 rk² + β3 rk³,  (69)

using the quadratic model t(δ) = (δ, δ²)' in (58). As a first step, the vk in (69) were fit to the observed proportions pk = sk/nk by ordinary logistic regression, giving an initial estimate β(1). Notice that this amounts to assuming that the δk all equal zero; that is, gη(δ) puts all of its probability on δ = 0. The initial fitted values vk(1) determine likelihoods Lk(δk) = Lxk(vk(1) + δk). The carrier density g0(δ) used in (58) was Σk Lk/34, with each Lk normalized so that ∫ Lk(δk) dδk = 1.

Equation (62) was then solved to produce an initial guess η(1) for η. The solid curve in the right panel of Figure 7 is gη(1). Solving (64), with η = η(1), gives a second estimate β(2) for β, and so on, iterating toward the MLE (η̂, β̂).
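The building blocks of (59)-(64) reduce to one-dimensional integrals of the form Iη(h), which can be sketched on a grid. Here the quadratic model t(δ) = (δ, δ²) is used, the standard normal carrier g0 and the normal-shaped single-case likelihood are assumptions for illustration, and score_eta evaluates one case's contribution to the likelihood equation (62).

```python
import math

GRID = [-8 + 16 * i / 1600 for i in range(1601)]
H = GRID[1] - GRID[0]
g0 = lambda d: math.exp(-0.5 * d * d) / math.sqrt(2 * math.pi)  # assumed carrier

def I_eta(h, eta):
    """I_eta(h) = integral of h(delta) * exp(eta . t(delta)) * g0(delta),
    with t(delta) = (delta, delta^2), by Riemann sum on the grid."""
    e1, e2 = eta
    return H * sum(h(d) * math.exp(e1 * d + e2 * d * d) * g0(d) for d in GRID)

def score_eta(L, eta):
    """One case's contribution to the likelihood equation (62):
    I(t L)/I(L) - I(t)/I(1), a length-2 vector."""
    IL = I_eta(L, eta)
    I1 = I_eta(lambda d: 1.0, eta)
    return (I_eta(lambda d: d * L(d), eta) / IL - I_eta(lambda d: d, eta) / I1,
            I_eta(lambda d: d * d * L(d), eta) / IL
            - I_eta(lambda d: d * d, eta) / I1)

Lk = lambda d: math.exp(-0.5 * (d - 0.7) ** 2)   # toy likelihood for one case
s = score_eta(Lk, (0.0, 0.0))
```

At η = (0, 0) the two score components are posterior-minus-prior moments under the carrier: E[δ | data] − E[δ] = .35 and E[δ² | data] − E[δ²] = .6225 − 1 for this toy likelihood.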
Figure 7. Model (56) Applied to the Toxoplasmosis Data. (a) Rainfall versus proportion positive pk = sk/nk. The solid line is the MLE cubic logistic regression model, (69); the dashed line is the second iterative estimate. (b) The quadratic model estimates for gη; the solid line is the first estimate, and the dashed curve is the second estimate, nearly the same.
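The back-and-forth fitting just described, solving (62) for η with β fixed and then (64) for β with η fixed, is a block coordinate ascent. The sketch below is schematic only: gradient steps stand in for the Newton-Raphson solves, and the toy concave objective is not the actual (62)/(64) system.

```python
def alternate_maximize(grad_eta, grad_beta, eta, beta, lr=0.2, n_outer=200):
    """Block coordinate ascent: improve the first block with the second
    held fixed (as when solving (62) for eta), then the second block with
    the first held fixed (as when solving (64) for beta)."""
    for _ in range(n_outer):
        for _ in range(25):               # inner updates for block 1
            eta += lr * grad_eta(eta, beta)
        for _ in range(25):               # inner updates for block 2
            beta += lr * grad_beta(eta, beta)
    return eta, beta

# toy strictly concave log-likelihood with maximizer (1, -2)
grad_eta = lambda a, b: -2.0 * (a - 1.0) - 0.5 * (b + 2.0)
grad_beta = lambda a, b: -4.0 * (b + 2.0) - 0.5 * (a - 1.0)
eta_hat, beta_hat = alternate_maximize(grad_eta, grad_beta, 0.0, 0.0)
```

As in the toxoplasmosis fit, the iteration settles quickly when the two blocks interact only weakly.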
In fact, there was never much change from (η(1), β(1)). The dashed lines in Figure 7 show the second estimates, 1/(1 + e^{−vk(2)}) and gη(2), the latter nearly equaling gη(1). Using t(δ) = (δ, δ², δ³)' instead of t(δ) = (δ, δ²)' did not significantly improve the fit to the data. The bias corrections of Section 5 were small enough to be ignored in this case. Smoother choices of the carrier density g0 had little effect. All in all, the density gη(1) shown in Figure 7 is a reasonable estimate of the random component in the mixed model (56). This density has expectation and standard deviation

−.13 ± .10 and .43 ± .17,  (70)

as in (38). These conclusions about the δ distribution can be thought of as an overdispersion analysis of the toxoplasmosis data. The methods used here apply quite generally, though in this case the results are nearly the same as those obtained for the normal theory model of Anderson and Aitkin (1985) or of Breslow and Clayton (1993).

Suppose that we wish to estimate the toxoplasmosis rate in any one city; for instance, city 7. This is similar to the empirical Bayes estimation of θ8 in Figure 6, except that now an empirical Bayes confidence interval for θ7 = v7 + δ7 needs to include the variability in v̂7. The cubic logistic regression estimate β(1) gave

v̂7 = .209 ± .097.  (71)

The value of .097 was almost double the standard error that model (69) assumes. The dashed curve ĝ in Figure 8 is the convolution

ĝ = gη(1) ⊕ N(.209, .097²).  (72)

It has standard deviation .53, compared to .43 for gη(1). The ad hoc estimator (72) is reasonable here, but can be improved by taking into account nonnormality in the likelihood for v7.

The solid curve in Figure 8 is an empirical Bayes posterior density for θ7, obtained by multiplying the binomial likelihood based on (s7, n7) = (2, 12) by ĝ. The resulting .90 interval is

θ7 ∈ (−1.54, −.13), or π7 ∈ (.18, .47),  (73)

which, for comparison, does not contain the usual binomial point estimate π̂7 = 2/12 = .17.

Figure 8. Bayes Analysis of the Toxoplasmosis Rate in City 7. Stars indicate the binomial likelihood based on (s7, n7) = (2, 12); the dashed line is the prior density ĝ, (72), incorporating the random effect plus variability in the estimation of the fixed effect v7; the solid line is the a posteriori density for θ7. Dots indicate gη(1), the prior density not including variability in v̂7.
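The construction of (72)-(73) can be sketched on a grid: convolve the random-effect density with the normal variability of v̂7, multiply by the binomial likelihood for (s7, n7) = (2, 12), and read off central posterior quantiles. The N(−.13, .43²) stand-in for gη(1) is an assumption, using only the moments reported in (70), so the resulting interval is not expected to reproduce (73) exactly.

```python
import math

phi = lambda t, m, s: math.exp(-0.5 * ((t - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
grid = [-8.0 + 16.0 * i / 800 for i in range(801)]
h = grid[1] - grid[0]

# assumed random-effect density: normal with the moments reported in (70)
g_delta = [phi(t, -0.13, 0.43) for t in grid]

# prior for theta7 = v7 + delta7: convolution with N(.209, .097^2), as in (72)
prior = [h * sum(gd * phi(t - u, 0.209, 0.097) for gd, u in zip(g_delta, grid))
         for t in grid]

def binlik(t):
    """Binomial likelihood for (s7, n7) = (2, 12) on the logit scale."""
    p = 1.0 / (1.0 + math.exp(-t))
    return p ** 2 * (1.0 - p) ** 10

# posterior proportional to prior times likelihood; central .90 interval
post = [pr * binlik(t) for pr, t in zip(prior, grid)]
total = h * sum(post)
cdf, lo, hi = 0.0, None, None
for t, v in zip(grid, post):
    cdf += h * v / total
    if lo is None and cdf >= 0.05:
        lo = t
    if hi is None and cdf >= 0.95:
        hi = t
```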
7. SUMMARY

This article proposes an empirical Bayes methodology for the practical solution of hierarchical Bayes problems. The emphasis is on computer-based analyses that do not require specially chosen mathematical forms for their success. There are four main aspects to our approach:
1. The use of likelihoods Lk(θk), rather than the xk themselves, as inputs to the empirical Bayes algorithm. This allows the component data sets xk to be complicated in their own right, each perhaps involving many nuisance parameters in addition to the parameter of interest θk. In the ulcer data example of Table 1, each xk involves two or three nuisance parameters, having to do with the marginal 2 × 2 probabilities. The conditional likelihood (5) removes these. Earlier work (Efron 1993) shows how to get approximate likelihoods like (5), with all nuisance parameters removed, in quite general situations.

2. The use of specially designed exponential families of prior densities, (23), rather than conjugate or normal priors. Modern computer technology, combined with the fact that the component likelihoods Lk(θk) are each one-dimensional, makes this numerically practical. Besides being more flexible, the specially designed families allow model building and model checking in the spirit of a regression analysis, as in Figure 3.

3. Allowing an individual component problem to opt out of the empirical Bayes scheme if the relevance of the "other" data appears sufficiently doubtful. This avoids overshrinkage of particularly interesting results, an unpleasant feature of most empirical/hierarchical Bayes analyses (see Efron 1981, sec. 8). The relevance calculations of Section 4 robustify the empirical Bayes estimation process in a way that allows genuinely unusual cases to stand by themselves. Another way to do this would be by insisting on heavier tails in the prior family (23), but the scheme here is easier to implement and to interpret.

4. A general nonparametric approach to bias correcting the MLE prior density ĝη̂. The confidence expectation method of Section 5 fits in well with the goal of emulating a full hierarchical Bayes analysis for Figure 2, under the usual assumption that the hyperprior is uninformative.
Earlier work (Efron 1994) also presents a jackknife implementation of Carlin and Gelfand's cdf correction method
that is computationally competitive with the confidence expectation approach. Neither method is yet on a solid theoretical basis, but the specific cases considered here are encouraging.
This methodology is capable of providing empirical Bayes analyses for much more complicated situations than the ulcer or toxoplasmosis examples. For instance, θk might be the proportion of HIV-positive 30-year-old white females in city k, while xk is an extensive epidemiological survey of that city. A reasonable goal of this work is to allow efficient and practical meta-analyses of arbitrarily complicated parallel studies.
[Received November 1993. Revised August 1995.]
REFERENCES
Anderson, D., and Aitkin, M. (1985), "Variance Component Models With Binary Response: Interviewer Variability," Journal of the Royal Statistical Society, Ser. B, 47, 203-210.
Barndorff-Nielsen, O. (1986), "Inference on Full or Partial Parameters Based on the Standardized Log Likelihood Ratio," Biometrika, 73, 307-322.
Breslow, N., and Clayton, D. (1993), "Approximate Inference in Generalized Linear Mixed Models," Journal of the American Statistical Association, 88, 9-25.
Brown, L. (1984), Fundamentals of Statistical Exponential Families,
Monograph 9, Institute of Mathematical Statistics Lecture Notes, Hayward, CA: Institute of Mathematical Statistics.
Carlin, B., and Gelfand, A. (1990), "Approachesfor Empirical Bayes Confidence Intervals," Journal of the American Statistical Association, 85,
105-114.
(1991), "A Sample Reuse Method for Accurate Parametric Empirical Bayes Confidence Intervals," Journal of the Royal Statistical Society, Ser. B, 53, 189-200.
Cox, D. R. (1975), "Prediction Intervals and Empirical Bayes Confidence
Intervals,"in Perspectives in Probability and Statistics, ed. J. Gani, New
York: Academic Press, pp. 47-55.
Cox, D. R., and Reid, N. (1987), "ParameterOrthogonality and Approximate Conditional Inference" (with discussion), Journal of the Royal
Statistical Society, Ser. B, 49, 1-39.
DiCiccio, T., and Efron, B. (1992), "More Accurate Confidence Intervals
in Exponential Families," Biometrika, 79, 231-245.
Efron, B. (1982), "Maximum Likelihood and Decision Theory," The Annals of Statistics, 10, 340-356.
(1982), "The Jackknife, the Bootstrap, and Other Resampling
Plans," SIAM CBMS-NSF Monograph 38.
(1986), "Double-Exponential Families and Their Use in Generalized Linear Regression," Journal of the American Statistical Association,
81, 709-721.
(1987), Comment on "EmpiricalBayes Confidence Intervals Based
on Bootstrap Samples" by N. Laird and T. Louis, Journal of the American Statistical Association, 82, 754.
(1993), "Bayes and Likelihood Calculations From Confidence Intervals," Biometrika, 80, 3-26.
(1994), "Empirical Bayes Methods for Combining Likelihoods,"
Technical Report 166, Stanford University.
Efron, B., and Morris, C. (1971), "Limiting the Risk of Bayes and Empirical Bayes Estimators-Part I: The Bayes Case," Journal of the American Statistical Association, 66, 807-815.
(1972), "Limiting the Risk of Bayes and Empirical Bayes
Estimators-Part II: The Empirical Bayes Case," Journal of the American Statistical Association, 67, 130-139.
(1973), "Stein's Estimation Rule and Its Competitors-an Empirical Bayes Approach," Journal of the American Statistical Association, 68, 117-130.
Efron, B., and Tibshirani, R. (1993), An Introduction to the Bootstrap, New
York: Chapman and Hall.
Laird, N., and Louis, T. (1987), "Empirical Bayes Confidence Intervals
Based on Bootstrap Samples" (with discussion), Journal of the American
Statistical Association, 82, 739-750.
Lehmann, E. (1959), TestingStatistical Hypotheses, New York:John Wiley.
(1983), Theory of Point Estimation, New York: John Wiley.
Morris, C. (1983), "Parametric Empirical Bayes Inference: Theory and Applications," Journal of the American Statistical Association, 78, 47-59.
Sacks, H. S., Chalmers, T. C., Blum, A. L., Berrier, J., and Pagano, D. (1990), "Endoscopic Hemostasis, an Effective Therapy for Bleeding Peptic Ulcers," Journal of the American Medical Association, 264, 494-499.
Tanner, M. (1991), Tools for Statistical Inference-Observed Data and
Data Augmentation, New York: Springer-Verlag.