Estimating a Weighted Average of Stratum

advertisement
Estimating a Weighted Average of Stratum-Specific Parameters
Babette A. Brumback1∗ , Larry H. Winner2 , George Casella2 , Malay Ghosh2 ,
Allyson Hall3 , Jianyi Zhang4 , Lorna Chorba4 , and Paul Duncan3
1
3
Department of Epidemiology and Biostatistics
Department of Health Services Research, Management, and Policy
4
Florida Center for Medicaid and the Uninsured
College of Public Health and Health Professions
2
Department of Statistics
College of Liberal Arts and Sciences
University of Florida
Gainesville, FL 32611, USA
∗
corresponding author
e-mail: brumback@phhp.ufl.edu
phone: 352-273-5366
fax: 352-273-5365
January 30, 2008
1
SUMMARY
This article investigates estimators of a weighted average of stratum-specific univariate
parameters and compares them in terms of a design-based estimate of mean squared error.
The research is motivated by a stratified survey sample of Florida Medicaid beneficiaries,
in which the parameters are population stratum means and the weights are known and
determined by the population sampling frame. Assuming heterogeneous parameters, it is
common to estimate the weighted average with the weighted sum of sample stratum means;
under homogeneity, one ignores the known weights in favor of precision weighting. Adaptive estimators arise from random effects models for the parameters. We propose adaptive
estimators motivated from these random effects models, but we compare their design-based
performance. We further propose selecting the tuning parameter to minimize a design-based
estimate of mean squared error. This differs from the model-based approach of selecting
the tuning parameter to accurately represent the heterogeneity of stratum means. Our
design-based approach effectively downweights strata with small weights in the assessment
of homogeneity, which can lead to a smaller mean squared error. We compare the standard
random effects model with identically distributed parameters to a novel alternative, which
models the variances of the parameters as inversely proportional to the known weights. An
advantage of the alternative model is that it parameterizes the mean as the weighted, as
opposed to the unweighted, average of the parameters. We show that for two strata, the
two models produce equivalent estimators, but for more than two strata, they do not. We
also present theoretical and computational details for estimators based on a general class
of random effects models. The methods are applied to estimate average satisfaction with
health plan and care among Florida beneficiaries just prior to Medicaid reform.
KEY WORDS: random effects model; weighted average; stratified sample; adaptive estimation; heterogeneity; shrinkage
2
1. Introduction
This article investigates estimators of a weighted average of stratum-specific univariate
parameters and compares them in terms of a design-based estimate of mean squared error.
Specifically, we consider estimation of
θs = Σni=1 wi θi ,
where n is the number of strata, θi , i = 1, . . . , n are the stratum-specific parameters and wi ,
i = 1, . . . , n are the weights, which are positive and sum to one. Although many applications
of statistics require estimating a weighted average, this article is motivated by a stratified
survey sample of Medicaid beneficiaries in two Florida counties to measure satisfaction with
medical care. The website ahca.myflorida.com/Medicaid provides information about the
Medicaid program in Florida. Surveyed participants each belong to one of five Medicaid
health care plans in Duval County (encompassing Jacksonville) or of fourteen in Broward
County (Fort Lauderdale). In order to maximize power for plan-to-plan comparisons, equal
sample sizes of 315 participants were targeted per plan; however, aggregate summaries are
also of interest. In this article, we focus on estimating aggregate summaries within county
for managed and less managed health care plans considered separately. Two survey items
are analyzed: A3 rates overall health plan satisfaction on a scale from 0-10, and A6 rates
overall health care satisfaction on the same scale. Differential nonresponse within plan by
age, gender, and race/ethnicity is adjusted for using logistic regression and inverse weighting.
Plan-level summaries of the data produced by SAS Proc Surveymeans are presented in Table
1. The weights in Table 1 are known and represent the relative sizes of the health care
plans within each county. Managed plans correspond to managed=1, less managed plans
to managed=0. To compare estimators of a weighted average, we target (T1) the average
A3 rating among less managed health plans in Duval County, (T2) the average A6 rating
among less managed health plans in Broward county, and (T3) the average A6 rating among
managed health plans in Broward county. The plan weights are renormalized so that the
sum of the weights associated with a given target equals one.
(Table 1 About Here)
3
Probably the most common estimator of θs is the weighted sum of the stratum-specific
sample means Yi , i = 1, . . . , n:
Ea = Σni=1 wi Yi ,
which is the best linear unbiased estimator (BLUE) of θs assuming the model
Yi = θi + ei ,
(1)
where the θi are fixed effects, and the ei are independent, approximately normal, mean zero
error terms with variances σi2 . However, if the stratum-specific population means are nearly
homogeneous (θi = θ0 , i = 1, . . . , n), then an estimator with lower mean-squared error
weights the stratum-specific sample means proportionally to their inverse-variances 1/σi2 ,
i = 1, . . . , n:
Eb =
Σni=1 σY2i
i
Σni=1 σ12
,
i
where in practice, we will substitute the sample standard errors si for σi , i = 1, . . . , n. This
estimator is the BLUE of θs assuming the model
Yi = θ0 + ei ,
where θ0 is a fixed effect, and the ei are as before. In practice, homogeneity is usually not
known a priori, and thus one strategy is to first test the null hypothesis of homogeneity using
a one-way analysis of variance, and then to select either Ea or Eb depending on whether the
test rejects or not.
For each one of the Medicaid targets, we used SAS Proc Surveyreg to test homogeneity
of the relevant plans. This procedure incorporates the nonresponse weights into the test, but
not the stratum weights. For T1, T2, and T3, the p-values were < 0.0001, 0.078, and 0.046,
respectively. Following the strategy we have just outlined, then, we would choose Ea as the
estimator for T1 and T3, but Eb as the estimator for T2. As we shall see, these decisions
are each incorrect, because they fail to incorporate the stratum weights; plans with smaller
weights should not count as much.
Adaptive estimators arising from random effects models represent a smooth alternative
to the test-then-estimate approach, and it is also possible to let the stratum weights guide
4
the procedure. In this article, we compare the standard DerSimonian and Laird [1] random
effects model with a novel alternative, which models the variances of the θi as inversely
proportional to the wi . An advantage of the alternative model is that it parameterizes the
mean as the weighted, as opposed to the unweighted, average of the parameters. We will
compare the performance of competing estimators using a design-based estimate of mean
squared error. That is, we will evaluate the competing estimators of θs under a sampling
model that conditions on the θi but not on the ei , i = 1, . . . , n. Thus, bias, variance,
and mean-squared error will be computed under a model that resamples the ei but leaves
the θi fixed. As mean-squared error typically depends on the values of θi , we will use the
design-based estimator of it that substitutes Yi for θi . The tuning parameter for an adaptive
estimator will be chosen to minimize this estimated MSE. Our method of estimating the MSE
will steer the adaptive estimators slightly more towards Ea than Eb , but this is preferable to
the other way around.
Whereas in this article, our inferences are confined to θs , the weighted mean of θi for our
sample of n strata, one might instead be interested in the weighted mean of θi for a larger
population of N strata (in which case, the strata are usually called clusters), that is, in
p
θp = ΣN
i=1 wi θi ,
where the wip are the weights normalized to the population rather than to the sample of n
clusters. In this case, the sampling distribution of competing estimators would be based on
resampling both the θi and the ei . The standard reason for using random effects models is
to account for resampling the θi . Our adaptive estimators of θs are also design-consistent
for θp when the sampled clusters are a simple random sample of the population of clusters.
In the earlier literature, DuMouchel and Duncan [2] considered estimation of θs in a more
general context; reduced to our problem, their results answer the question of when to use
Ea or Eb . They developed a test based on the difference statistic Ea − Eb , an approach
that goes beyond the testing of homogeneity employed above by incorporating the stratum
weights wi into the decision. Our method of selecting the tuning parameter to minimize a
design-based estimate of MSE induces a smoothed version of the DuMouchel and Duncan
test-then-estimate approach. Approaches that select the tuning parameter to accurately
5
reflect stratum heterogeneity induce a smoothed version of the ordinary test-then-estimate
approach employed above. Pfeffermann and Nathan [3] focused on estimating θp in a more
general context; reduced to our problem, their method corresponds to using the DerSimonian
and Laird [1] model for the θi and estimating θp with Σni=1 wi θ̂i , where the θ̂i are best linear
unbiased predictors (BLUPs, see [4]).
Following Fay and Herriot [5], who adapted the James-Stein estimator [6] and applied
it to simultaneously estimate income in several areas with populations less than 1,000, the
literature on small-area estimation is rife with random effects models for the θi ; see, for
example, the review article by Ghosh and Rao [7] and the references therein. However, the
focus in that literature is on simultaneous estimation of the θi rather than on their weighted
sum.
The survey sampling literature has also used random effects models to estimate θs ; Little
[8] reviews this literature and explains the difference between design-based and model-based
inference. An estimator is design consistent if it converges in probability to the parameter
it estimates when the probability calculations are based on the selection mechanism. A
similar comment applies to design unbiasedness. Design-based inference involves (a) using
a design-unbiased or approximately design-unbiased (e.g. design-consistent) estimator, and
(b) choosing a variance estimator that is design-unbiased or approximately design-unbiased.
On the other hand, model-based inference assumes an explicit model for unsampled data,
and uses this model to predict the nonsampled values and hence the posterior distribution of
the estimand. For examples of model-based inference on θs , see [9-12]. The perspective of the
present article is best viewed as design-based. Our adaptive estimators, though motivated by
random effects models, are consistent for θs and hence approximately design-unbiased, and
our approach to variance estimation enlists a design-unbiased estimator. Furthermore, we
select the variance parameter for the random effects models to minimize a design-unbiased
estimate of MSE. Model-based approaches would select this parameter to accurately reflect
the heterogeneity of strata. Our design-based approach has the advantage of effectively
downweighting strata associated with small weights in the assessment of homogeneity. Thus,
although a given subset of plans could differ substantially from the rest, if the combined
6
weight associated with it were small, our method would aggregate the plans as though they
were homogeneous. This can lead to a smaller MSE. Another contribution of the present
article is its investigation of a novel random effects model, which models the variance of θi
as inversely proportional to the wi . Finally, the focus of the present article is more general
than that of survey sampling; the weights need not be design weights, but rather could be
specified for other reasons.
Another related literature is that of multilevel modeling of complex survey data, the title
of a recent paper by Rabe-Hesketh and Skrondal [13]. Following Pfeffermann et al. [14],
this literature concentrates on estimating θp but in a more general context; when reduced
to our problem, the basic idea is to model the distribution of the θi in the entire population
using a multilevel model that incorporates θp as a parameter. The next step is to relate
the data, generated under informative sampling, to the census model using either weighting
likelihoods (i.e. pseudolikelihoods, as in [13]) or weighted iterative generalized least squares
algorithms (as in [14]). In the present paper, we apply random effects models for the sample
(rather than the census) of θi , but rather than including θs , or more generally, θp , as a direct
parameter in the models, we estimate it indirectly as Σni=1 wi θ̂i , where the θ̂i are BLUPs.
However, as we show in section 2, our novel model, which models the variances of the θi as
inversely proportional to the wi , effectively incorporates θs as a direct parameter.
The last literature we will consider is that of generalized estimating equations (GEE
[15]), where the focus is on estimating the “population average” of clustered data. In the
present paper, we reduce each of the clusters to a single statistic Yi , which associates with a
weight wi or wip , depending on whether θs or θp is the target of inference. In a recent paper,
Williamson, Datta, and Satten [16] address the problem of GEE estimation with so-called
informative cluster size; when reduced to our setting with the θi specified as cluster means
rather than as general parameters (such as odds ratios, for example), what they propose
is to estimate θp for the special case of all wip =
1
,
N
by using θ̂as with wi =
1
n
and the Yi
encoding the sample means of the clusters. However, their methodology diverges from ours
for general parameters, such as odds ratios; thus our approach represents an alternative
worth developing further.
7
The present article is organized as follows. In section 2, we present the two random effects
models and discuss the advantage of the novel alternative model, and section 3 compares
estimators based on the two models. Section 4 contains the application to the Medicaid
data. In section 5, we present theory and methods of estimation and performance evaluation
for a general class of models, and section 6 concludes with a discussion.
2. Two Competing Random Effects Models
Our approach to inference is design-based, but the estimators we will develop are motivated as BLUPs for two competing random effects models. We do not take the point of
view that the random effects models generate the data, and in subsequent sections we will
present design-based estimates of bias, variance, and MSE.
Both random effects models let Yi = θi + ei as in (1), with independent, approximately
normal, mean zero ei with variances σi2 , but the θi are modeled as random, i.e.
θi = µ + δi ,
with µ a fixed effect and δi , i = 1, . . . , n, independent normal mean zero random effects with
variances Ti .
Model 1
Model 1 sets
Ti = τ 2 , i = 1, . . . , n.
This leads to the BLUPs
µ̂ =
i
Σni=1 τ 2Y+σ
2
i
1
Σni=1 τ 2 +σ
2
(2)
,
i
and
δ̂i =
τ2
(Yi − µ̂),
τ 2 + σi2
where the δ̂i satisfy the constraint Σni=1 δ̂i = 0. Thus we have the following estimator of θs ,
obtained as Σni=1 wi (µ̂ + δ̂i ):
E1 = Σni=1 ηi Yi , where
8
ηi
γi
!
Σn γj σ 2 wj
= γi τ wi + j=1n j
Σj=1 γj
1
= 2
.
τ + σi2
2
"
and
(3)
Note that Σni=1 ηi = 1, and that E1 approaches Ea or Eb as τ 2 → ∞ or τ 2 → 0. Model 1
corresponds to the basic one-way mixed effects ANOVA model used by DerSimonian and
Laird [1] for meta-analysis, in which the wi are typically equal to 1/n to reflect the equally
weighted mean of study-specific effects. With that choice, E1 = µ̂. But in general, the
estimator is also a function of the δ̂i .
Model 2
Model 2 sets
Ti = τ 2 /wi , i = 1, . . . , n.
(4)
This leads to
E2 = µ̂ = Σni=1 ηi Yi , with
1
ηi =
τ2
+σi2
wi
Σni=1 τ 2 1
,
(5)
+σi2
wi
and
δ̂i =
!
1
wi
+ 2
2
σi
τ
"−1 !
"
Yi − µ̂
.
σi2
Note that Σni=1 ηi = 1 and that Σni=1 wi δ̂i = 0. Model 2 was briefly discussed by Brumback
and Brumback [17] in the present context but with estimated weights, and by Ghosh and
Maiti [18] in the context of multilevel modeling under noninformative sampling of clusters.
Like E1 , this estimator approaches Ea or Eb as τ 2 → ∞ or τ 2 → 0.
The Advantage of Model 2 versus Model 1
Model 2 parameterizes µ as the weighted average, θs , of the θi , rather than as the unweighted average, θ̄ = (1/n)Σi θi . We saw that E2 equals µ̂ in (5). The distribution specified
for the δi implies that the θi corresponding to larger wi will be closer to µ; this information
together with the constraint on the δ̂i induces us to interpret µ as θs and δi as θi − θs .
However, in Model 1, µ represents θ̄, and δi represents θi − θ̄. Thus, Model 2 seems more
9
appropriate for our goal of estimating θs . However, one might use the data to negotiate
between Model 1 versus Model 2 and hence E1 versus E2 ; as we show in Appendix 1, if at
least one wi is distinct from the rest, a “supermodel” that encompasses Model 1 and Model 2
is identifiable. This is counterintuitive for the case of n = 2, because there is no information
left, after estimating µ and τ 2 , for discerning between var(δi ) = τ 2 or τ 2 /wi.
3. Differences in the Estimators
First we show that for n = 2, E1 = E2 when the τ 2 of Model 2 equals the τ 2 of Model
1 multiplied by 2w1 (1 − w1 ); in particular, when the τ 2 for each estimator is selected to
minimize MSE, the two estimators will coincide. Second, we show that the estimators are
distinct for n > 2. Thus, it can happen that E2 will lead to a smaller MSE than E1 for any
selection of τ 2 in E1 , and vice versa. In short, the models lead to the same estimators when
n = 2, but to generally different estimators when n > 2.
n=2
A natural question is whether we can use our adaptive estimators to approximately
minimize mean-squared error (MSE) over all estimators of θs of the form Σni=1 ηi Yi, where
the ηi are positive weights that sum to one. This question has a simple answer when n = 2,
but not for n > 2. We consider the MSE of ηY1 + (1 − η)Y2 for known η, computed based
on the fixed effects model (1):
#
$
#
$
#
$
η 2 (θ1 − θ2 )2 + (σ12 + σ22 ) − 2η w1 (θ1 − θ2 )2 + σ22 + w12 (θ1 − θ2 )2 + σ22 .
This function of η is minimized when
η = η∗ =
where s2θ =
1
Σn (θ
n−1 i=1 i
2w1 s2θ + σ22
w1 (θ1 − θ2 )2 + σ22
=
,
(θ1 − θ2 )2 + (σ12 + σ22 )
2s2θ + (σ12 + σ22 )
− θ̄)2 , which equals (θ1 − θ2 )2 /2 when n = 2.
We observe that the weight θ̂1s associates with Y1 is
η# =
2w1 τ 2 + σ22
,
2τ 2 + σ12 + σ22
10
(6)
which equals η ∗ when τ 2 = s2θ . According to Model 1, conditionally on µ, var(θi ) = τ 2 ;
thus, when n = 2, the risk-minimizing choice of τ 2 is a logical choice based on Model 1 and
coincides with substituting s2θ for var(θi ).
Next we consider Model 2. We observe that the weight θ̂2s associates with Y1 is
η ## =
w1 τ 2 + w1 (1 − w1 )σ22
,
τ 2 + w1 (1 − w1 )(σ12 + σ22 )
which equals η ∗ when τ 2 = 2s2θ w1 (1 − w1 ). According to Model 2, conditionally on µ,
1 n
1
τ2
Σi=1 var(θi ) = Σni=1 .
n
n
wi
(7)
Approximating n1 Σni=1 var(θi ) by s2θ and noting that 12 Σ2i=1 w1i = 1/(2w1 (1−w1 )) leads us again
to τ 2 = 2s2θ w1 (1 − w1 ). Thus, when n = 2, the risk-minimizing choice of τ 2 is also a logical
choice based on Model 2.
In practice, s2θ must be estimated from the data, and the resulting “risk-minimizing”
estimator loses its optimality; in section 5 we show that E1 and E2 are admissible for any
choices of τ 2 . The preceding derivations have shown that, for n = 2, E1 = E2 when the
choice of τ 2 for Model 2 equals the choice of τ 2 for Model 1 multiplied by 2w1 (1 − w1 ).
n>2
When there are more than two strata, E1 and E2 are quite distinct. To demonstrate this,
we present a counterexample to the claim that one can always choose τ12 for Model 1 and τ22
for Model 2 such that the resulting estimators are identical.
Counterexample Dataset
We let (w1 , w2 , w3 ) = (1/2, 1/3, 1/6), all σi2 = 1, and (Y1 , Y2 , Y3 ) = (2.35, 0, 2.35).
Figure 1 graphs the resulting (η1 , η2 , η3 ) as a function of τ 2 for each of the two estimators,
and shows us the reason we can not find τ12 and τ22 to make the estimators identical: the
graph for η2 associated with E2 lies strictly above that for η2 associated with E1 ; this results
in E2 < E1 for all choices of τ12 and τ22 – see Figure 2.
(Figures 1 and 2 About Here)
Although the graphs are convincing, some algebra shows conclusively what the graphs
lead us to believe. For the counterexample dataset, we find that for Model 1, µ̂ = 13 Σ3i=1 Yi =
11
Ȳ for all τ 2 ∈ (0, ∞); δ̂i =
τ2
(Yi
τ 2 +1
− Ȳ ); and δ̂1 = δ̂3 = −δ̂2 /2. Therefore, E1 = Ȳ for all τ 2 .
For Model 2, on the other hand, we show in Appendix 2 that the η2 associated with Y2 = 0
is greater than 1/3 for all τ 2 , and thus E2 < Ȳ for all τ 2 .
4. Application to the Medicaid Data
Table 2 compares Ea , Eb , E1 , and E2 for the three Medicaid targets, T 1 − T 3. Section 5
will explain for general random effects models our proposal for selecting τ 2 and computing
root mean squared errors, but Table 2 focuses only on Models 1 and 2. In Table 2, the
column of estimates was generated using our design-based approach for selecting the τ 2 of
Model 1 or 2. Specifically we used the optimize function in R to minimize the estimated
MSE of the respective estimator as a function of τ 2 . The estimators Ea , Eb , E1 , and E2 are
each weighted averages of the Yi , having the form Σni=1 ηi Yi , where ηi can be a function of τ 2 .
For each estimator, we approximate the MSE using Σni=1 (ηi − wi )Yi as our estimate of bias
and Σni=1 ηi2 s2i as our estimate of variance. These estimators of bias and variance are design-
consistent, and hence correspond to a design-based approach to inference. Table 2 presents
the square root of these MSEs. Approximate 95% confidence intervals were computed as
the estimator plus or minus 1.96 times the square root of the estimated variance. These
confidence intervals will typically be biased when based on Eb and slightly biased when
based on E1 or E2 ; hence, the coverage properties will not be accurate for the interval based
on Eb unless the θi are homogeneous, but the coverage properties for the intervals based on
E1 and E2 will be approximately correct when the ni are large.
Design-based estimates of variance typically assume that the σi2 are known, but the
estimates above also assume that τ 2 is known. Thus we also present simulated MSEs by
generating 10,000 datasets each with Yi∗ , i = 1, . . . , n simulated as independent N(Yi , si )
random variables. For each dataset, we estimate Ea∗ , Eb∗ , E1∗ , and E2∗ . The simulated MSE
is then reported as the mean of the 10,000 squared errors, where one error is the estimate
minus Σni=1 wi Yi . Table 2 reports the square root of these simulated MSEs.
(Table 2 About Here)
The adaptive nature of E1 and E2 is evident in the comparison of the RMSEs. For targets
12
T1 and T3, the estimators adapt towards homogeneity and Eb , whereas for target T2, they
adapt towards heterogeneity and Ea . Of particular interest is that the adaptation in all three
cases is in a direction opposite to that suggested by the naive test-then-estimate approach
applied in the introduction. Recall that the P −values for testing the null of homogeneity
were < 0.0001, 0.078, and 0.046 for T1, T2, and T3. These results show clearly that it
can be better to incorporate the weights into the assessment of homogeneity. Standard
tests of homogeneity do not do this; moreover, model-based methods of selecting τ 2 do not
incorporate the weights into the choice either. We see that despite P −values being below
0.05 for T1 and T3, estimator E1 based on Model 1 has a simulated RMSE that is 90% that
of Ea for T1 and 91% that of Ea for T3. However, a P −value higher than 0.05 does not
necessarily mean that E1 or E2 will outperform Ea , as we see with T2.
It is particularly striking that the P −value for testing homogeneity was < 0.0001 for T1,
and yet the RMSE for E1 is 90% of that for Ea . Plan 1 in Duval county is very different from
the other four less managed plans regarding item A3, but its weight is only 2.1% of the total.
Thus, homogeneity was overturned, when for the purposes of estimating θs , it should not
have been. This also highlights the advantage of selecting τ 2 to minimize estimated mean
squared error of the estimator of θs , rather than to accurately reflect heterogeneity of the
plans. Model-based estimators of θs select τ 2 in the latter way.
To compare our method of selecting τ 2 with model-based approaches, we also implemented the iterative method described in Little[12] to estimate target T1. This method
effectively uses Model 1 but selects τ 2 as the solution to the fixed point equation τ 2 =
1
Σn ν (Y
n−1 i=1 i i
− K)2 , where νi = τ 2 /(τ 2 + σi2 ) and K = (Σi νi Yi)/(Σi νi ). When the σi2 are
small and all of the θi are homogeneous except for a single outlier associated with a small
weight, this method will estimate a larger τ 2 than our method. Thus, we would expect it
to have a larger simulated RMSE than that of our E1 for target T1, and indeed it does. Its
simulated RMSE is 0.077, close to that of Ea . This demonstrates that because the modelbased approach for selecting τ 2 does not incorporate the wi , it can therefore perform worse
than our approach.
As expected for E1 and E2 , the simulated RMSEs, which incorporate variability due to
13
selecting τ 2 based on the data, are generally larger than the approximate RMSEs, which
do not. It is interesting to see that E1 outperforms E2 for summary T1, but that the
two estimators perform similarly for T3. Nevertheless, the novel model could potentially
be better for some applications. It is difficult to conceive of subject-matter considerations
which might lead one to prefer one model over the other. It seems to us that the best way
to choose is to try both models, and to select the one with a smaller simulated RMSE.
5. Theory, Estimation, and Performance Evaluation
Theory
Models 1 and 2 are special cases of the general random effects model
Y |θ, µ, δ ∼ N(θ, Σ)
θ|µ, δ = ZW µ + δ
δ ∼ N(hW , TW )
µ ∼ U p (−∞, ∞),
(8)
with Y = (Y1 , . . . , Yn )T ; W = (w1 , . . . , wn )T ; θ = (θ1 , . . . , θn )T ; Σ a symmetric positive
definite known matrix; ZW an n × p matrix, possibly dependent upon W ; the vector hW and
the symmetric positive definite matrix TW functions of W and of parameters; U p (−∞, ∞)
the improper uniform distribution for a vector of p independent parameters; and δ and µ
independent of one another given W . For the remainder of the paper we will suppress the
subscripts on ZW , hW , and TW . In this framework, we have modeled the fixed effects as
random with uninformative priors.
Under mild restrictions on Z, T , and Σ (Appendix 3), the conditional distribution of
(µT , δ T )T given Y is multivariate normal with mean (µ̂T , δ̂ T )T solving
Z T Σ−1 Zµ + Z T Σ−1 δ = Z T Σ−1 Y
Σ−1 Zµ + (Σ−1 + T −1 )δ = Σ−1 Y + T −1 h.
(9)
Multiplying the second of these equations by Z T and comparing to the first, we find that δ̂
satisfies the p-dimensional linear constraint
Z T T −1 (δ̂ − h) = 0.
14
(10)
This follows naturally from our setup, which uses the n+p parameters in µ and δ to represent
the n-dimensional parameter θ. We observed in Section 2 that this constraint influences our
interpretation for µ, analogous to how constraints in models with nonrandom µ and δ would
do.
The equations at (9) can be re-expressed to solve for δ̂ independently of µ̂ as follows:
#
$
#
$
I − (Σ−1 + T −1 )−1 Σ−1 PZΣ δ̂ = (Σ−1 + T −1 )−1 Σ−1 (I − PZΣ )Y − ΣT −1 h ,
Z T T −1 δ̂ = Z T T −1 h
(11)
with PZΣ = Z(Z T Σ−1 Z)−1 Z T Σ−1 . In Appendix 4 we show that, again under mild restrictions
on Z, T , and Σ, the solution to these n + p equations in n unknowns exists uniquely as
δ̂ = (A + T −1 )−1 (AY + T −1 h),
with A = (I − PZΣ )T Σ−1 (I − PZΣ ) = Σ−1 (I − PZΣ ).
(12)
It follows from (9) that µ̂ equals
µ̂ = (Z T Σ−1 Z)−1 (Z T Σ−1 )(Y − δ̂).
(13)
The posterior distribution of θs given Y has mean
E = W T (Z µ̂ + δ̂)
#
$
= W T (I − PZΣ )δ̂ + PZΣ Y ,
(14)
which can also be expressed as
#
$
E = W T (I − B)Y + BPZΣ+T (Y − h) + Bh ,
(15)
where B = Σ(Σ + T )−1 (Appendix 5).
The posterior mean E is an admissible estimator of θs under squared-error loss for any
particular specification of the general random effects model (8) ([19] chapter 5). We turn E
into an adaptive estimator by letting the data guide our selection of the parameters implicit
in h and T . For example, if we specify h = 0 and T = τ 2 T0 with T0 a known matrix
15
(possibly dependent upon W), so that τ 2 tunes the eigenvalues of T towards 0 or ∞. When
the eigenvalues of T tend towards ∞, the first equation at (11) tends towards
(I − PZΣ )δ̂ = (I − PZΣ )Y,
which implies that Z µ̂ + δ̂ → Y (see equation (14)), so that
E → W T Y = Ea ;
(16)
whereas, when the eigenvalues tend towards 0, δ̂ → h (see the first equation at (11), in which
(Σ−1 + T −1 ) ≈ T −1 ), so that
#
$
E → W T PZΣ Y + (I − PZΣ )h ,
(17)
which equals Eb when (a) Z = 1p (the vector of p ones), (b) Σ equals the diagonal matrix with
entries σ12 , . . . , σn2 (denoted henceforth as Σd ), and (c) h = 0. If we choose τ 2 to minimize the
estimated MSE of E, we will adapt towards (16) or (17) depending on which limit optimizes
the bias-variance trade-off relative to the other. We observe that Ea is minimax and that
the right-hand side of (17) is admissible under squared-error loss ([19] chapter 5.)
We consider one more particular specification of the model. Model 3 sets T0 equal to In ,
sets h equal to a parametric function of W , e.g. (W − W̄ )β, where W̄ ≡ ((1/n)Σni=1 wi ) 1n
and β is a scalar parameter, and keeps Z and Σ the same as in Models 1 and 2. This leads
to
µ̂ =
i
Σni=1 σY2i −h
+τ 2
i
1
Σni=1 σ2 +τ
2
,
i
where hi is the ith element of h, and
δ̂i =
!
1
1
+ 2
2
σi
τ
"−1 !
"
Yi − µ̂ hi
+ 2 .
σi2
τ
Thus,
E3 =
ηi
γi
Σni=1
!
!
ηi Yi +
wi σi2 γi (hi
Σnj=1 γj σj2 wj
= γ i τ wi +
Σnj=1 γj
1
= 2
.
τ + σi2
2
16
"
Σnj=1 γj hj
− n
) , where
Σj=1 γj
"
, and
(18)
We observe that this estimator is not a simple weighted average of the Yi except when hi ≡ 0.
In practice, one would often choose to let Zµ subsume h; however, it is worth recalling that
E3 is admissible under squared-error loss no matter what value we choose for h. We note
that E3 → Ea as τ 2 → ∞, and that it approaches

Eb + Σni=1 wi hi −
Σni=1 σh2i
i
Σni=1 σ12
i

,
as τ 2 → 0. Due to space constraints, we leave the development of this approach for future
work.
Estimation and Performance Evaluation
We develop and present techniques for estimation and performance evaluation assuming
h = 0 and T = τ 2 T0 in the random effects model at (8). We choose τ 2 to minimize estimated
MSE of the estimator based on the fixed effects sampling model Y |θ, µ, δ ∼ N(θ, Σ). Because
the MSE generally depends upon θ, we will consistently estimate it by substituting Y for
θ. Our data-based choice of τ 2 combined with our estimate of θ lead to a design-consistent
estimator of θs .
The estimators for θ and θs are
θ̂ = HY and
E = W T HY, where
)
H = PZΣ + (I − PZΣ ) Σ−1 (I − PZΣ ) +
1 −1
T
τ2 0
*−1
Σ−1 (I − PZΣ ).
The actual bias and variance of E are given by
biasa = W T (H − I)θ and
vara = W T HΣH T W.
(19)
ˆ = W T (H − I)Y
bias
(20)
We estimate the bias as
ˆ 2 + vara . Because mse
and the MSE as mse
ˆ = bias
ˆ depends upon τ 2 in a complicated way, we
use the optimize() function within Version 2.3.0 of the statistical programming language R
ˆ
[20] to minimize mse
ˆ as a function of τ 2 ; in turn, we use the minimizing τ̂ 2 to evaluate mse.
17
A difficulty with our estimated MSEs is that they are derived assuming that τ 2 is known,
when in fact we are selecting it based on the data. We therefore rely on the bootstrap [21]
for more accurate answers. Specifically, we propose to generate 10,000 datasets with Y ∗
simulated according to the fixed effects sampling model Y ∗ ∼ N(Y, Σ). For each dataset, we
estimate Ea∗ ,Eb∗ , E1∗ , E2∗ , or another adaptive estimator. The simulated MSE is then reported
as the mean of the 10,000 squared errors, where one error is the estimate minus Σni=1 wi Yi .
Table 2 compares the square root of the simulated MSEs for the four estimators that are
the focus of this paper. This can be viewed as a simulation study of the performance of the
four estimators under three different conditions, corresponding to targets T1, T2, and T3;
see section 4 for a summary.
6. Discussion
We have developed and compared adaptive estimators of a weighted average of stratumspecific parameters based on two random effects models, the DerSimonian and Laird [1]
model (Model 1) for the parameters and the novel model briefly touched on by Brumback
and Brumback [17] and Ghosh and Maiti [18] (Model 2). Although our proposed estimators
are motivated by random effects models, our approach to inference is design-based, as the
adaptive estimators are design-consistent and our estimated MSEs are as well. Moreover,
our approach to selecting the tuning parameter is also design-based, because, rather than
selecting it to accurately represent heterogeneity of the strata, as would a model-based
approach, we select it to minimize a design-based estimate of MSE.
In our investigation, we have uncovered the importance that the implicit constraints
satisfied by BLUPs have for interpreting the parameters µ and δi of these models. We have
also discovered that Model 1 and Model 2 lead to distinctly different estimators of θs when
the number of strata exceeds two. In our application to the Medicaid data, we found that the
estimator based on Model 1 (E1 ) had a slightly smaller RMSE than the one based on Model 2
(E2 ). However, other datasets could lead to opposite results. Both estimators outperformed
the weighted average of stratum means (Ea ) in estimating targets T1 and T3, despite that
the test of homogeneity rejected the null in both instances. It is particularly striking that
18
the P −value for testing homogeneity was < 0.0001 for T1, and yet the RMSE for E1 was
90% of that for Ea . This clearly demonstrates that it is suboptimal to test homogeneity
without considering the relative weights. Plan 1 in Duval county was very different from
the other four less managed plans regarding item A3, but its weight was only 2.1% of the
total. Thus, homogeneity was overturned, when for the purposes of estimating θs , it should
not have been. This also highlights the advantage of selecting τ 2 to minimize estimated
mean squared error of the estimator of θs , rather than to accurately reflect heterogeneity
of the plans. Model-based estimators of θs select τ 2 in the latter way. When we applied a
model-based estimator for T1 which paralleled E1 except for the method of selecting τ 2 , we
found that its RMSE was noticeably larger (97% of the RMSE of Ea rather than 90%.)
In this paper, we have considered only the case of known wi . In several related applications, the wi are unknown and need to be estimated from the sample. Examples include
adjustment for nonresponse bias, missing data, or confounding. Further research could study
the effects of applying our estimators with estimated wi . A related topic is that of estimating
summary odds ratios under heterogeneity. Greenland [22] discusses relative advantages of
the Miettinen [23] and Mantel-Haenszel [24] summary odds ratio estimators when one cannot
assume homogeneity; these summary measures can each be expressed as a weighted sum of
stratum-specific odds ratios, with data dependent weights.
We have also focused mainly on two adaptive estimators, E1 and E2 , both of which set the
design matrix ZW of (8) equal to the vector of ones. One could generalize ZW to depend, for
example, on the stratum weights, as we did with estimator (18); future work could develop
this approach.
We have only briefly discussed estimation of θp , but our estimators of θs are also consistent
for this quantity, when the (θi , wi ) represent a simple random sample of a larger group of
strata. The method of selecting τ 2 would need to be reconsidered, but surely it would still be
preferable to incorporate the observed wi into the choice. It would be interesting to compare
the resulting procedure to the early one of Pfeffermann and Nathan [3], as well as to the
latter work of Pfeffermann et al. [14] and others currently active in multilevel modeling of
complex survey data.
19
Acknowledgements
George Casella was supported by NSF-DEB-0540745. Malay Ghosh was partially supported by NSF Grant SES-0631426. We thank the Florida Agency for Health Care Administration for providing the Medicaid data, and two anonymous reviewers for helpful comments
that led to an improved version of this manuscript.
Appendix
Appendix 1.
Here we show that if at least one wi is distinct from the rest, and if we condition on µ, a
“supermodel” that encompasses Model 1 and Model 2 is identifiable.
Specifically, under either Model 1 or 2 we consider the density of Y1 , . . . , Yn |µ, τ 2 , and we
introduce a new parameter ψ that equals 1 or 2 for Model 1 or 2. This leads to a supermodel,
indexed by parameters (ψ, µ, τ 2 ), such that for ψ equal to either 1 or 2, the distribution Pψ,µ,τ 2
of Y1 , . . . , Yn is that of n independent normal random variables with mean µ; for ψ = 1 the
variances are σi2 + τ 2 , but for ψ = 2, the variances are σi2 + τ 2 /wi . To show that the supermodel is
identifiable, we suppose it is not and derive a contradiction. We suppose that there exists (ψ1 , µ1 , τ12 )
unequal to (ψ2 , µ2 , τ22 ) such that Pψ1 ,µ1 ,τ 2 = Pψ2 ,µ2 ,τ 2 . If ψ1 = ψ2 , then (µ1 , τ12 ) must equal (µ2 , τ22 )
1
2
because the mean and variance parameters of the corresponding multivariate normal distribution
are identifiable. It thus suffices to consider the case of ψ1 = 1 and ψ2 = 2. Nonidentifiability
implies that µ1 = µ2 , because the marginal normal mean of Y1 when ψ = 1 must be the same
as that when ψ = 2. We furthermore learn that σi2 + τ12 must equal σi2 + τ22 /wi for i = 1, . . . , n,
because the marginal normal standard deviations of the Yi must also be the same for ψ = 1 as for
ψ = 2. Consequently,
τ12 = τ22 /wi for all i.
Because T is positive definite, τ12 and τ22 must be nonzero. Thus, unless all wi are equal, we have
a contradiction.
Appendix 2.
For Model 2 and the counterexample dataset,
η2 =
1
2τ 2 +1
+
1
3τ 2 +1
1
3τ 2 +1
20
+
.
1
6τ 2 +1
We need to show that η2 > 1/3 for all τ 2 ∈ (0, ∞). Multiplying numerator and demoninator by
3τ 2 + 1, it suffices to show that
3τ 2 + 1 3τ 2 + 1
τ2
3τ 2
+
=
1
+
+
1
−
< 2.
2τ 2 + 1 6τ 2 + 1
2τ 2 + 1
6τ 2 + 1
Assuming τ 2 ∈ (0, ∞), it is equivalent to show that
1
3
< 2
,
+1
6τ + 1
2τ 2
which is true because 6τ 2 + 1 < 6τ 2 + 3.
Appendix 3.
Here we verify (9). Let β = (µT , δT )T , X = [Z|In ], In be the n × n identity matrix, C be the
block diagonal matrix with first block the p × p matrix of zeroes and second block equal to T −1 ,
and m be the vector (0Tp , hT )T , with 0p as the vector of p zeroes. Suppose β has prior given by (8),
and that Y given β is MVN with mean Xβ and variance Σ. Then minus two times the log of the
posterior distribution of β given Y equals (up to a constant)
(Y − Xβ)T Σ−1 (Y − Xβ) + (β − m)T C(β − m),
(21)
which in turn equals (up to a constant)
β T (X T Σ−1 X + C)β − 2β T (X T Σ−1 Y + Cm).
Assuming that (X T Σ−1 X + C) is invertible (this implies mild restrictions on Z, T , and Σ), we can
complete the square with
(X T Σ−1 Y + Cm)T (X T Σ−1 X + C)−1 (X T Σ−1 Y + Cm),
so that minus two times the log of the posterior equals (up to a constant)
(β − b)T V −1 (β − b),
with
b = (X T Σ−1 X + C)−1 (X T Σ−1 Y + Cm) and
V
= (X T Σ−1 X + C)−1 .
Thus the posterior of β = (µT , δT )T is multivariate normal with mean b solving (9).
21
Appendix 4.
Here we show that the posterior mean of δ given Y is the solution to (12). By the properties
of the multivariate normal distribution, it must also equal the δ̂ of (9); therefore, we know that it
solves (11).
Minus two times the log of the joint distribution of Y, δ, µ is (up to a constant)
(Y − Zµ − δ)T Σ−1 (Y − Zµ − δ) + (δ − h)T T −1 (δ − h).
Recall that PZΣ = Z(Z T Σ−1 Z)−1 Z T Σ−1 , and note that PZΣ Z = Z. Write
(Y − Zµ − δ)T Σ−1 (Y − Zµ − δ) =
#
$T
[PZΣ (Y − δ) − PZΣ Zµ] + [(I − PZΣ )(Y − δ)]
#
$
Σ−1 [PZΣ (Y − δ) − PZΣ Zµ] + [(I − PZΣ )(Y − δ)] .
Expand the quadratic form according to the terms in square brackets, and note that the cross term
is zero since (I − PZΣ )T Σ−1 PZΣ = 0. So this part of the log is
(PZΣ Zµ − PZΣ (Y − δ))T Σ−1 (PZΣ Zµ − PZΣ (Y − δ)) + ((I − PZΣ )(Y − δ))T Σ−1 ((I − PZΣ )(Y − δ)).
Only the first term has µ, and if we examine it and use the fact that PZΣ Z = Z again, we have
(PZΣ Zµ − PZΣ (Y − δ))T Σ−1 (PZΣ Zµ − PZΣ (Y − δ)) = (µ − µ̂δ )T (Z T Σ−1 Z)(µ − µ̂δ ),
where µ̂δ = (Z T Σ−1 Z)−1 Z T Σ−1 (Y − δ). Thus we have that the conditional distribution of µ|Y, δ is
µ|Y, δ ∼ N (µ̂δ , (Z T Σ−1 Z)−1 ).
In this form, we can now integrate out µ from the joint distribution to get the marginal distribution
of Y, δ. This has minus two times the log equal to (up to a constant)
((I − PZΣ )(Y − δ))T Σ−1 ((I − PZΣ )(Y − δ)) + (δ − h)T T −1 (δ − h).
Define A = (I − PZΣ )T Σ−1 (I − PZΣ ). Minus two times the log is then (up to a constant)
(δ − Y )T A(δ − Y ) + (δ − h)T T −1 (δ − h).
Factor this as
(δ − δ̂)T (A + T −1 )(δ − δ̂) + Y T AY + hT T −1 h − δ̂T (A + T −1 )δ̂,
22
where
δ̂ = (A + T −1 )−1 (AY + T −1 h),
as in (12). Thus, as we wished to show, the conditional distribution of δ given Y is
δ|Y ∼ N (δ̂, (A + T −1 )−1 ).
Appendix 5.
From the identity
(Y − θ)T Σ−1 (Y − θ) + (θ − Zµ − h)T T −1 (θ − Zµ − h)
#
$
= θ T (Σ−1 + T −1 )θ − 2θ T Σ−1 Y + T −1 (Zµ + h) + Y T Σ−1 Y + (Zµ + h)T T −1 (Zµ + h),
it follows that
E(θ|Y, µ) = T (Σ + T )−1 Y + Σ(Σ + T )−1 (Zµ + h).
Noting that Y |µ ∼ N (Zµ + h, Σ + T ), from the identity
#
$
(Y −Zµ−h)T (Σ+T )−1 (Y −Zµ−h) = µT Z T (Σ + T )−1 Z µ−2µT Z T (Σ+T )−1 (Y −h)+(Y −h)T (Σ+T )−1 (Y −h),
it follows that
#
E(µ|Y ) = Z T (Σ + T )−1 Z
Hence
$−1
Z T (Σ + T )−1 (Y − h).
E(θ|Y ) = (I − B)Y + BPZΣ+T (Y − h) + Bh,
where B = Σ(Σ + T )−1 . The equation at (15) follows.
References
1. DerSimonian R, Laird N. Meta-Analysis in Clinical Trials. Controlled Clinical Trials
1986; 7: 177-188.
2. DuMouchel WH, Duncan GJ. Using Sample Survey Weights in Multiple Regression
Analyses of Stratified Samples. Journal of the American Statistical Association 1983;
78: 535-543.
3. Pfeffermann D, Nathan G. Regression Analysis of Data from a Cluster Sample. Journal
of the American Statistical Association 1981; 76: 681-689.
4. Robinson GK. That BLUP is a good thing: The estimation of random effects (with
discussion.) Statistical Science 1991; 6: 15-32.
23
5. Fay RE, Herriot RA. Estimates of Income for Small Places: An Application of JamesStein Procedures to Census Data. Journal of the American Statistical Association
1979; 74: 269-277.
6. James W, Stein C. Estimation with Quadratic Loss. In Proceedings of the Fourth
Berkeley Symposium of Mathematical Statistics and Probability, Vol. 1 University of
California Press: Berkeley, 1961; 361-379.
7. Ghosh M, Rao JNK. Small Area Estimation: An Appraisal (with Discussion). Statistical Science 1994; 9: 55-93.
8. Little RJ. To Model or Not To Model? Competing Modes of Inference for Finite
Population Sampling. Journal of the American Statistical Association 2004; 99: 546556.
9. Gelman A. Struggles with Survey Weighting and Regression Modeling (with discussion). Statistical Science 2007; in press.
10. Elliott MR, Little RJA. Model-based Alternatives to Trimming Survey Weights. Journal of Official Statistics 2000; 16: 191-209.
11. Lazzeroni LC, Little RJA. Random-effects Models for Smoothing Poststratification
Weights. Journal of Official Statistics 1998; 14: 61-78.
12. Little RJA. Inference with Survey Weights. Journal of Official Statistics 1991; 7:
405-424.
13. Rabe-Hesketh S, Skrondal A. Multilevel Modelling of Complex Survey Data. Journal
of the Royal Statistical Society, Series A 2006; 169: 805-827.
14. Pfeffermann D, Skinner CJ, Holmes DJ, Goldstein H, Rasbash J. Weighting for Unequal
Selection Probabilities in Multilevel Models. Journal of the Royal Statistical Society,
Series B 1998; 60: 23-40.
15. Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models.
Biometrika, 1986; 73: 13-22.
16. Williamson JM, Datta S, Satten GA. Marginal Analyses of Clustered Data When
Cluster Size is Informative. Biometrics 2003; 59: 36-42.
17. Brumback BA, Brumback LC. Comment on ”Semiparametric Estimation of a Treatment Effect in a Pretest-Posttest Study with Missing Data,” by Davidian M, Tsiatis
A, and Leon S, Statistical Science 2005; 20: 284-289.
18. Ghosh M, Maiti T. Discussion on the Papers by Firth and Bennett and Pfeffermann
et al. Journal of the Royal Statistical Society, Series B 1998; 60: 41-56.
19. Lehmann EL and Casella G. Theory of Point Estimation. Springer-Verlag: New York,
1998.
24
20. The R Development Core Team. R 2.3.0 - A Language and Environment 2006.
http://www.r-project.org/. Accessed on August 22, 2007.
21. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman and Hall: London,
1993.
22. Greenland S. Interpretation and Estimation of Summary Ratios Under Heterogeneity.
Statistics in Medicine 1982; 1: 217-227. DOI: 10.1002/sim.2547
23. Miettinen OS. Standardization of Risk Ratios. American Journal of Epidemiology
1972; 96: 383-388.
24. Mantel N, Haenszel WH. Statistical Aspects of the Analysis of Data from Retrospective
Studies of Disease. Journal of the National Cancer Institute 1959; 22: 719-748.
25
Download