Estimating a Weighted Average of Stratum-Specific Parameters

Babette A. Brumback1*, Larry H. Winner2, George Casella2, Malay Ghosh2, Allyson Hall3, Jianyi Zhang4, Lorna Chorba4, and Paul Duncan3

1 Department of Epidemiology and Biostatistics, 3 Department of Health Services Research, Management, and Policy, and 4 Florida Center for Medicaid and the Uninsured, College of Public Health and Health Professions
2 Department of Statistics, College of Liberal Arts and Sciences
University of Florida, Gainesville, FL 32611, USA

* Corresponding author. E-mail: brumback@phhp.ufl.edu; phone: 352-273-5366; fax: 352-273-5365

January 30, 2008

SUMMARY

This article investigates estimators of a weighted average of stratum-specific univariate parameters and compares them in terms of a design-based estimate of mean squared error. The research is motivated by a stratified survey sample of Florida Medicaid beneficiaries, in which the parameters are population stratum means and the weights are known and determined by the population sampling frame. Assuming heterogeneous parameters, it is common to estimate the weighted average with the weighted sum of sample stratum means; under homogeneity, one ignores the known weights in favor of precision weighting. Adaptive estimators arise from random effects models for the parameters. We propose adaptive estimators motivated by these random effects models, but we compare their design-based performance. We further propose selecting the tuning parameter to minimize a design-based estimate of mean squared error. This differs from the model-based approach of selecting the tuning parameter to accurately represent the heterogeneity of stratum means. Our design-based approach effectively downweights strata with small weights in the assessment of homogeneity, which can lead to a smaller mean squared error.
We compare the standard random effects model with identically distributed parameters to a novel alternative, which models the variances of the parameters as inversely proportional to the known weights. An advantage of the alternative model is that it parameterizes the mean as the weighted, as opposed to the unweighted, average of the parameters. We show that for two strata, the two models produce equivalent estimators, but for more than two strata, they do not. We also present theoretical and computational details for estimators based on a general class of random effects models. The methods are applied to estimate average satisfaction with health plan and care among Florida beneficiaries just prior to Medicaid reform.

KEY WORDS: random effects model; weighted average; stratified sample; adaptive estimation; heterogeneity; shrinkage

1. Introduction

This article investigates estimators of a weighted average of stratum-specific univariate parameters and compares them in terms of a design-based estimate of mean squared error. Specifically, we consider estimation of $\theta^s = \sum_{i=1}^n w_i\theta_i$, where $n$ is the number of strata, $\theta_i$, $i = 1, \dots, n$, are the stratum-specific parameters, and $w_i$, $i = 1, \dots, n$, are the weights, which are positive and sum to one.

Although many applications of statistics require estimating a weighted average, this article is motivated by a stratified survey sample of Medicaid beneficiaries in two Florida counties to measure satisfaction with medical care. The website ahca.myflorida.com/Medicaid provides information about the Medicaid program in Florida. Surveyed participants each belong to one of five Medicaid health care plans in Duval County (encompassing Jacksonville) or to one of fourteen in Broward County (Fort Lauderdale). In order to maximize power for plan-to-plan comparisons, equal sample sizes of 315 participants were targeted per plan; however, aggregate summaries are also of interest.
In this article, we focus on estimating aggregate summaries within county for managed and less managed health care plans considered separately. Two survey items are analyzed: A3 rates overall health plan satisfaction on a scale from 0 to 10, and A6 rates overall health care satisfaction on the same scale. Differential nonresponse within plan by age, gender, and race/ethnicity is adjusted for using logistic regression and inverse weighting. Plan-level summaries of the data produced by SAS Proc Surveymeans are presented in Table 1. The weights in Table 1 are known and represent the relative sizes of the health care plans within each county. Managed plans correspond to managed=1, less managed plans to managed=0. To compare estimators of a weighted average, we target (T1) the average A3 rating among less managed health plans in Duval County, (T2) the average A6 rating among less managed health plans in Broward County, and (T3) the average A6 rating among managed health plans in Broward County. The plan weights are renormalized so that the sum of the weights associated with a given target equals one.

(Table 1 About Here)

Probably the most common estimator of $\theta^s$ is the weighted sum of the stratum-specific sample means $Y_i$, $i = 1, \dots, n$:

$$E_a = \sum_{i=1}^n w_i Y_i,$$

which is the best linear unbiased estimator (BLUE) of $\theta^s$ assuming the model

$$Y_i = \theta_i + e_i, \tag{1}$$

where the $\theta_i$ are fixed effects, and the $e_i$ are independent, approximately normal, mean zero error terms with variances $\sigma_i^2$. However, if the stratum-specific population means are nearly homogeneous ($\theta_i = \theta_0$, $i = 1, \dots, n$), then an estimator with lower mean squared error weights the stratum-specific sample means proportionally to their inverse variances $1/\sigma_i^2$, $i = 1, \dots, n$:

$$E_b = \frac{\sum_{i=1}^n Y_i/\sigma_i^2}{\sum_{i=1}^n 1/\sigma_i^2},$$

where in practice, we will substitute the sample standard errors $s_i$ for $\sigma_i$, $i = 1, \dots, n$.
This estimator is the BLUE of $\theta^s$ assuming the model $Y_i = \theta_0 + e_i$, where $\theta_0$ is a fixed effect, and the $e_i$ are as before.

In practice, homogeneity is usually not known a priori, and thus one strategy is to first test the null hypothesis of homogeneity using a one-way analysis of variance, and then to select either $E_a$ or $E_b$ depending on whether the test rejects or not. For each one of the Medicaid targets, we used SAS Proc Surveyreg to test homogeneity of the relevant plans. This procedure incorporates the nonresponse weights into the test, but not the stratum weights. For T1, T2, and T3, the p-values were < 0.0001, 0.078, and 0.046, respectively. Following the strategy we have just outlined, then, we would choose $E_a$ as the estimator for T1 and T3, but $E_b$ as the estimator for T2. As we shall see, these decisions are each incorrect, because they fail to incorporate the stratum weights; plans with smaller weights should not count as much.

Adaptive estimators arising from random effects models represent a smooth alternative to the test-then-estimate approach, and it is also possible to let the stratum weights guide the procedure. In this article, we compare the standard DerSimonian and Laird [1] random effects model with a novel alternative, which models the variances of the $\theta_i$ as inversely proportional to the $w_i$. An advantage of the alternative model is that it parameterizes the mean as the weighted, as opposed to the unweighted, average of the parameters.

We will compare the performance of competing estimators using a design-based estimate of mean squared error (MSE). That is, we will evaluate the competing estimators of $\theta^s$ under a sampling model that conditions on the $\theta_i$ but not on the $e_i$, $i = 1, \dots, n$. Thus, bias, variance, and MSE will be computed under a model that resamples the $e_i$ but leaves the $\theta_i$ fixed. As MSE typically depends on the values of the $\theta_i$, we will use the design-based estimator of it that substitutes $Y_i$ for $\theta_i$.
The tuning parameter for an adaptive estimator will be chosen to minimize this estimated MSE. Our method of estimating the MSE will steer the adaptive estimators slightly more towards $E_a$ than $E_b$, but this is preferable to the other way around.

Whereas in this article, our inferences are confined to $\theta^s$, the weighted mean of the $\theta_i$ for our sample of $n$ strata, one might instead be interested in the weighted mean of the $\theta_i$ for a larger population of $N$ strata (in which case, the strata are usually called clusters), that is, in

$$\theta^p = \sum_{i=1}^N w_i^p\theta_i,$$

where the $w_i^p$ are the weights normalized to the population rather than to the sample of $n$ clusters. In this case, the sampling distribution of competing estimators would be based on resampling both the $\theta_i$ and the $e_i$. The standard reason for using random effects models is to account for resampling the $\theta_i$. Our adaptive estimators of $\theta^s$ are also design-consistent for $\theta^p$ when the sampled clusters are a simple random sample of the population of clusters.

In the earlier literature, DuMouchel and Duncan [2] considered estimation of $\theta^s$ in a more general context; reduced to our problem, their results answer the question of when to use $E_a$ or $E_b$. They developed a test based on the difference statistic $E_a - E_b$, an approach that goes beyond the testing of homogeneity employed above by incorporating the stratum weights $w_i$ into the decision. Our method of selecting the tuning parameter to minimize a design-based estimate of MSE induces a smoothed version of the DuMouchel and Duncan test-then-estimate approach. Approaches that select the tuning parameter to accurately reflect stratum heterogeneity induce a smoothed version of the ordinary test-then-estimate approach employed above.
Pfeffermann and Nathan [3] focused on estimating $\theta^p$ in a more general context; reduced to our problem, their method corresponds to using the DerSimonian and Laird [1] model for the $\theta_i$ and estimating $\theta^p$ with $\sum_{i=1}^n w_i\hat\theta_i$, where the $\hat\theta_i$ are best linear unbiased predictors (BLUPs, see [4]). Following Fay and Herriot [5], who adapted the James-Stein estimator [6] and applied it to simultaneously estimate income in several areas with populations less than 1,000, the literature on small-area estimation is rife with random effects models for the $\theta_i$; see, for example, the review article by Ghosh and Rao [7] and the references therein. However, the focus in that literature is on simultaneous estimation of the $\theta_i$ rather than on their weighted sum.

The survey sampling literature has also used random effects models to estimate $\theta^s$; Little [8] reviews this literature and explains the difference between design-based and model-based inference. An estimator is design-consistent if it converges in probability to the parameter it estimates when the probability calculations are based on the selection mechanism. A similar comment applies to design unbiasedness. Design-based inference involves (a) using a design-unbiased or approximately design-unbiased (e.g. design-consistent) estimator, and (b) choosing a variance estimator that is design-unbiased or approximately design-unbiased. On the other hand, model-based inference assumes an explicit model for unsampled data, and uses this model to predict the nonsampled values and hence the posterior distribution of the estimand. For examples of model-based inference on $\theta^s$, see [9-12].

The perspective of the present article is best viewed as design-based. Our adaptive estimators, though motivated by random effects models, are consistent for $\theta^s$ and hence approximately design-unbiased, and our approach to variance estimation enlists a design-unbiased estimator.
Furthermore, we select the variance parameter for the random effects models to minimize a design-unbiased estimate of MSE. Model-based approaches would select this parameter to accurately reflect the heterogeneity of strata. Our design-based approach has the advantage of effectively downweighting strata associated with small weights in the assessment of homogeneity. Thus, although a given subset of plans could differ substantially from the rest, if the combined weight associated with it were small, our method would aggregate the plans as though they were homogeneous. This can lead to a smaller MSE. Another contribution of the present article is its investigation of a novel random effects model, which models the variance of $\theta_i$ as inversely proportional to $w_i$. Finally, the focus of the present article is more general than that of survey sampling; the weights need not be design weights, but rather could be specified for other reasons.

Another related literature is that of multilevel modeling of complex survey data, the title of a recent paper by Rabe-Hesketh and Skrondal [13]. Following Pfeffermann et al. [14], this literature concentrates on estimating $\theta^p$ but in a more general context; when reduced to our problem, the basic idea is to model the distribution of the $\theta_i$ in the entire population using a multilevel model that incorporates $\theta^p$ as a parameter. The next step is to relate the data, generated under informative sampling, to the census model using either weighted likelihoods (i.e. pseudolikelihoods, as in [13]) or weighted iterative generalized least squares algorithms (as in [14]). In the present paper, we apply random effects models for the sample (rather than the census) of $\theta_i$, but rather than including $\theta^s$, or more generally, $\theta^p$, as a direct parameter in the models, we estimate it indirectly as $\sum_{i=1}^n w_i\hat\theta_i$, where the $\hat\theta_i$ are BLUPs.
However, as we show in section 2, our novel model, which models the variances of the $\theta_i$ as inversely proportional to the $w_i$, effectively incorporates $\theta^s$ as a direct parameter.

The last literature we will consider is that of generalized estimating equations (GEE [15]), where the focus is on estimating the "population average" of clustered data. In the present paper, we reduce each of the clusters to a single statistic $Y_i$, which is associated with a weight $w_i$ or $w_i^p$, depending on whether $\theta^s$ or $\theta^p$ is the target of inference. In a recent paper, Williamson, Datta, and Satten [16] address the problem of GEE estimation with so-called informative cluster size; when reduced to our setting with the $\theta_i$ specified as cluster means rather than as general parameters (such as odds ratios, for example), what they propose is to estimate $\theta^p$ for the special case of all $w_i^p = 1/N$ by using $E_a$ with $w_i = 1/n$ and the $Y_i$ encoding the sample means of the clusters. However, their methodology diverges from ours for general parameters, such as odds ratios; thus our approach represents an alternative worth developing further.

The present article is organized as follows. In section 2, we present the two random effects models and discuss the advantage of the novel alternative model, and section 3 compares estimators based on the two models. Section 4 contains the application to the Medicaid data. In section 5, we present theory and methods of estimation and performance evaluation for a general class of models, and section 6 concludes with a discussion.

2. Two Competing Random Effects Models

Our approach to inference is design-based, but the estimators we will develop are motivated as BLUPs for two competing random effects models. We do not take the point of view that the random effects models generate the data, and in subsequent sections we will present design-based estimates of bias, variance, and MSE.
Both random effects models let $Y_i = \theta_i + e_i$ as in (1), with independent, approximately normal, mean zero $e_i$ with variances $\sigma_i^2$, but the $\theta_i$ are modeled as random, i.e. $\theta_i = \mu + \delta_i$, with $\mu$ a fixed effect and $\delta_i$, $i = 1, \dots, n$, independent normal mean zero random effects with variances $T_i$.

Model 1

Model 1 sets $T_i = \tau^2$, $i = 1, \dots, n$. This leads to the BLUPs

$$\hat\mu = \frac{\sum_{i=1}^n Y_i/(\tau^2+\sigma_i^2)}{\sum_{i=1}^n 1/(\tau^2+\sigma_i^2)} \tag{2}$$

and

$$\hat\delta_i = \frac{\tau^2}{\tau^2+\sigma_i^2}\,(Y_i - \hat\mu),$$

where the $\hat\delta_i$ satisfy the constraint $\sum_{i=1}^n \hat\delta_i = 0$. Thus we have the following estimator of $\theta^s$, obtained as $\sum_{i=1}^n w_i(\hat\mu + \hat\delta_i)$:

$$E_1 = \sum_{i=1}^n \eta_i Y_i,$$

where

$$\eta_i = \gamma_i\left(\tau^2 w_i + \frac{\sum_{j=1}^n \gamma_j\sigma_j^2 w_j}{\sum_{j=1}^n \gamma_j}\right) \quad\text{and}\quad \gamma_i = \frac{1}{\tau^2+\sigma_i^2}. \tag{3}$$

Note that $\sum_{i=1}^n \eta_i = 1$, and that $E_1$ approaches $E_a$ or $E_b$ as $\tau^2 \to \infty$ or $\tau^2 \to 0$. Model 1 corresponds to the basic one-way mixed effects ANOVA model used by DerSimonian and Laird [1] for meta-analysis, in which the $w_i$ are typically equal to $1/n$ to reflect the equally weighted mean of study-specific effects. With that choice, $E_1 = \hat\mu$. But in general, the estimator is also a function of the $\hat\delta_i$.

Model 2

Model 2 sets

$$T_i = \tau^2/w_i, \quad i = 1, \dots, n. \tag{4}$$

This leads to $E_2 = \hat\mu = \sum_{i=1}^n \eta_i Y_i$, with

$$\eta_i = \frac{1/(\tau^2/w_i+\sigma_i^2)}{\sum_{j=1}^n 1/(\tau^2/w_j+\sigma_j^2)}, \tag{5}$$

and

$$\hat\delta_i = \left(\frac{1}{\sigma_i^2}+\frac{w_i}{\tau^2}\right)^{-1}\left(\frac{Y_i-\hat\mu}{\sigma_i^2}\right).$$

Note that $\sum_{i=1}^n \eta_i = 1$ and that $\sum_{i=1}^n w_i\hat\delta_i = 0$. Model 2 was briefly discussed by Brumback and Brumback [17] in the present context but with estimated weights, and by Ghosh and Maiti [18] in the context of multilevel modeling under noninformative sampling of clusters. Like $E_1$, this estimator approaches $E_a$ or $E_b$ as $\tau^2 \to \infty$ or $\tau^2 \to 0$.

The Advantage of Model 2 versus Model 1

Model 2 parameterizes $\mu$ as the weighted average, $\theta^s$, of the $\theta_i$, rather than as the unweighted average, $\bar\theta = (1/n)\sum_i \theta_i$. We saw that $E_2$ equals $\hat\mu$ in (5).
The distribution specified for the $\delta_i$ implies that the $\theta_i$ corresponding to larger $w_i$ will be closer to $\mu$; this information together with the constraint on the $\hat\delta_i$ induces us to interpret $\mu$ as $\theta^s$ and $\delta_i$ as $\theta_i - \theta^s$. However, in Model 1, $\mu$ represents $\bar\theta$, and $\delta_i$ represents $\theta_i - \bar\theta$. Thus, Model 2 seems more appropriate for our goal of estimating $\theta^s$. However, one might use the data to negotiate between Model 1 and Model 2, and hence between $E_1$ and $E_2$; as we show in Appendix 1, if at least one $w_i$ is distinct from the rest, a "supermodel" that encompasses Model 1 and Model 2 is identifiable. This is counterintuitive for the case of $n = 2$, because there is no information left, after estimating $\mu$ and $\tau^2$, for discerning between $\mathrm{var}(\delta_i) = \tau^2$ and $\mathrm{var}(\delta_i) = \tau^2/w_i$.

3. Differences in the Estimators

First we show that for $n = 2$, $E_1 = E_2$ when the $\tau^2$ of Model 2 equals the $\tau^2$ of Model 1 multiplied by $2w_1(1-w_1)$; in particular, when the $\tau^2$ for each estimator is selected to minimize MSE, the two estimators will coincide. Second, we show that the estimators are distinct for $n > 2$. Thus, it can happen that $E_2$ will lead to a smaller MSE than $E_1$ for any selection of $\tau^2$ in $E_1$, and vice versa. In short, the models lead to the same estimators when $n = 2$, but to generally different estimators when $n > 2$.

n = 2

A natural question is whether we can use our adaptive estimators to approximately minimize mean squared error (MSE) over all estimators of $\theta^s$ of the form $\sum_{i=1}^n \eta_i Y_i$, where the $\eta_i$ are positive weights that sum to one. This question has a simple answer when $n = 2$, but not for $n > 2$. We consider the MSE of $\eta Y_1 + (1-\eta)Y_2$ for known $\eta$, computed based on the fixed effects model (1):

$$\eta^2\left[(\theta_1-\theta_2)^2 + (\sigma_1^2+\sigma_2^2)\right] - 2\eta\left[w_1(\theta_1-\theta_2)^2 + \sigma_2^2\right] + \left[w_1^2(\theta_1-\theta_2)^2 + \sigma_2^2\right].$$

This function of $\eta$ is minimized when

$$\eta = \eta^* = \frac{w_1(\theta_1-\theta_2)^2 + \sigma_2^2}{(\theta_1-\theta_2)^2 + (\sigma_1^2+\sigma_2^2)} = \frac{2w_1 s_\theta^2 + \sigma_2^2}{2s_\theta^2 + (\sigma_1^2+\sigma_2^2)},$$

where $s_\theta^2 = \frac{1}{n-1}\sum_{i=1}^n (\theta_i - \bar\theta)^2$, which equals $(\theta_1-\theta_2)^2/2$ when $n = 2$.
We observe that the weight $E_1$ associates with $Y_1$ is

$$\eta' = \frac{2w_1\tau^2 + \sigma_2^2}{2\tau^2 + \sigma_1^2 + \sigma_2^2}, \tag{6}$$

which equals $\eta^*$ when $\tau^2 = s_\theta^2$. According to Model 1, conditionally on $\mu$, $\mathrm{var}(\theta_i) = \tau^2$; thus, when $n = 2$, the risk-minimizing choice of $\tau^2$ is a logical choice based on Model 1 and coincides with substituting $s_\theta^2$ for $\mathrm{var}(\theta_i)$.

Next we consider Model 2. We observe that the weight $E_2$ associates with $Y_1$ is

$$\eta'' = \frac{w_1\tau^2 + w_1(1-w_1)\sigma_2^2}{\tau^2 + w_1(1-w_1)(\sigma_1^2+\sigma_2^2)},$$

which equals $\eta^*$ when $\tau^2 = 2s_\theta^2 w_1(1-w_1)$. According to Model 2, conditionally on $\mu$,

$$\frac{1}{n}\sum_{i=1}^n \mathrm{var}(\theta_i) = \frac{1}{n}\sum_{i=1}^n \frac{\tau^2}{w_i}. \tag{7}$$

Approximating $\frac{1}{n}\sum_{i=1}^n \mathrm{var}(\theta_i)$ by $s_\theta^2$ and noting that $\frac{1}{2}\sum_{i=1}^2 \frac{1}{w_i} = 1/(2w_1(1-w_1))$ leads us again to $\tau^2 = 2s_\theta^2 w_1(1-w_1)$. Thus, when $n = 2$, the risk-minimizing choice of $\tau^2$ is also a logical choice based on Model 2.

In practice, $s_\theta^2$ must be estimated from the data, and the resulting "risk-minimizing" estimator loses its optimality; in section 5 we show that $E_1$ and $E_2$ are admissible for any choices of $\tau^2$. The preceding derivations have shown that, for $n = 2$, $E_1 = E_2$ when the choice of $\tau^2$ for Model 2 equals the choice of $\tau^2$ for Model 1 multiplied by $2w_1(1-w_1)$.

n > 2

When there are more than two strata, $E_1$ and $E_2$ are quite distinct. To demonstrate this, we present a counterexample to the claim that one can always choose $\tau_1^2$ for Model 1 and $\tau_2^2$ for Model 2 such that the resulting estimators are identical.

Counterexample Dataset

We let $(w_1, w_2, w_3) = (1/2, 1/3, 1/6)$, all $\sigma_i^2 = 1$, and $(Y_1, Y_2, Y_3) = (2.35, 0, 2.35)$. Figure 1 graphs the resulting $(\eta_1, \eta_2, \eta_3)$ as a function of $\tau^2$ for each of the two estimators, and shows us the reason we cannot find $\tau_1^2$ and $\tau_2^2$ to make the estimators identical: the graph for $\eta_2$ associated with $E_2$ lies strictly above that for $\eta_2$ associated with $E_1$; this results in $E_2 < E_1$ for all choices of $\tau_1^2$ and $\tau_2^2$ (see Figure 2).
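Both claims of this section can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the two-stratum weights and variances are made-up values, and the three-stratum inputs are the counterexample dataset given above. It evaluates the weights in (3) and (5) directly.

```python
def weights_model1(w, sigma2, tau2):
    """Weights eta_i of E1, following equation (3)."""
    gamma = [1.0 / (tau2 + v) for v in sigma2]
    pooled = sum(g * v * wj for g, v, wj in zip(gamma, sigma2, w)) / sum(gamma)
    return [g * (tau2 * wi + pooled) for g, wi in zip(gamma, w)]


def weights_model2(w, sigma2, tau2):
    """Weights eta_i of E2, following equation (5)."""
    raw = [1.0 / (tau2 / wi + v) for wi, v in zip(w, sigma2)]
    total = sum(raw)
    return [r / total for r in raw]


def estimate(eta, y):
    """Weighted sum of the stratum means."""
    return sum(e * yi for e, yi in zip(eta, y))


# n = 2, hypothetical weights and variances: E1 at tau2 and E2 at
# 2 * w1 * (1 - w1) * tau2 should use the same weights.
w_two, var_two = [0.3, 0.7], [0.5, 2.0]
max_gap = 0.0
for tau2 in (0.01, 0.4, 3.0, 50.0):
    eta_a = weights_model1(w_two, var_two, tau2)
    eta_b = weights_model2(w_two, var_two, 2 * w_two[0] * (1 - w_two[0]) * tau2)
    max_gap = max(max_gap, max(abs(a - b) for a, b in zip(eta_a, eta_b)))
print("largest n = 2 weight difference:", max_gap)

# n = 3, counterexample dataset from the text: E1 stays at the plain
# mean of the Y_i while E2 stays below it.
w_three, var_three, y_three = [1 / 2, 1 / 3, 1 / 6], [1.0, 1.0, 1.0], [2.35, 0.0, 2.35]
ybar = sum(y_three) / len(y_three)
rows = []
for tau2 in (0.001, 0.1, 1.0, 10.0, 1000.0):
    e1 = estimate(weights_model1(w_three, var_three, tau2), y_three)
    e2 = estimate(weights_model2(w_three, var_three, tau2), y_three)
    rows.append((tau2, e1, e2))
    print(tau2, round(e1, 4), round(e2, 4), round(ybar, 4))
```

A finite grid of $\tau^2$ values is of course not a proof; it only illustrates the algebraic argument that follows.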
(Figures 1 and 2 About Here)

Although the graphs are convincing, some algebra shows conclusively what the graphs lead us to believe. For the counterexample dataset, we find that for Model 1, $\hat\mu = \frac{1}{3}\sum_{i=1}^3 Y_i = \bar Y$ for all $\tau^2 \in (0, \infty)$; $\hat\delta_i = \frac{\tau^2}{\tau^2+1}(Y_i - \bar Y)$; and $\hat\delta_1 = \hat\delta_3 = -\hat\delta_2/2$. Therefore, $E_1 = \bar Y$ for all $\tau^2$. For Model 2, on the other hand, we show in Appendix 2 that the $\eta_2$ associated with $Y_2 = 0$ is greater than $1/3$ for all $\tau^2$, and thus $E_2 < \bar Y$ for all $\tau^2$.

4. Application to the Medicaid Data

Table 2 compares $E_a$, $E_b$, $E_1$, and $E_2$ for the three Medicaid targets, T1 through T3. Section 5 will explain for general random effects models our proposal for selecting $\tau^2$ and computing root mean squared errors, but Table 2 focuses only on Models 1 and 2. In Table 2, the column of estimates was generated using our design-based approach for selecting the $\tau^2$ of Model 1 or 2. Specifically, we used the optimize function in R to minimize the estimated MSE of the respective estimator as a function of $\tau^2$. The estimators $E_a$, $E_b$, $E_1$, and $E_2$ are each weighted averages of the $Y_i$, having the form $\sum_{i=1}^n \eta_i Y_i$, where $\eta_i$ can be a function of $\tau^2$. For each estimator, we approximate the MSE using $\sum_{i=1}^n (\eta_i - w_i)Y_i$ as our estimate of bias and $\sum_{i=1}^n \eta_i^2 s_i^2$ as our estimate of variance. These estimators of bias and variance are design-consistent, and hence correspond to a design-based approach to inference. Table 2 presents the square root of these MSEs.

Approximate 95% confidence intervals were computed as the estimator plus or minus 1.96 times the square root of the estimated variance. These confidence intervals will typically be biased when based on $E_b$ and slightly biased when based on $E_1$ or $E_2$; hence, the coverage properties will not be accurate for the interval based on $E_b$ unless the $\theta_i$ are homogeneous, but the coverage properties for the intervals based on $E_1$ and $E_2$ will be approximately correct when the $n_i$ are large.
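The selection of $\tau^2$ described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Table 2: the function names are hypothetical, the inputs are made-up numbers rather than the Medicaid data, and a log-spaced grid search stands in for the one-dimensional optimizer (R's optimize) that the text mentions.

```python
import math


def weights_model1(w, sigma2, tau2):
    """Weights eta_i of E1, following equation (3)."""
    gamma = [1.0 / (tau2 + v) for v in sigma2]
    pooled = sum(g * v * wj for g, v, wj in zip(gamma, sigma2, w)) / sum(gamma)
    return [g * (tau2 * wi + pooled) for g, wi in zip(gamma, w)]


def weights_model2(w, sigma2, tau2):
    """Weights eta_i of E2, following equation (5)."""
    raw = [1.0 / (tau2 / wi + v) for wi, v in zip(w, sigma2)]
    total = sum(raw)
    return [r / total for r in raw]


def estimated_mse(eta, w, y, s):
    """Return (estimated MSE, estimated variance) for the weights eta."""
    bias = sum((e - wi) * yi for e, wi, yi in zip(eta, w, y))
    var = sum(e * e * si * si for e, si in zip(eta, s))
    return bias * bias + var, var


# Assumption: a log-spaced grid from 1e-6 to 1e6 replaces a 1-D optimizer.
GRID = [10.0 ** (k / 20.0) for k in range(-120, 121)]


def fit(weight_fn, w, y, s, grid=GRID):
    """Pick tau^2 minimizing the estimated MSE, then summarize the fit."""
    sigma2 = [si * si for si in s]
    tau2 = min(grid, key=lambda t: estimated_mse(weight_fn(w, sigma2, t), w, y, s)[0])
    eta = weight_fn(w, sigma2, tau2)
    est = sum(e * yi for e, yi in zip(eta, y))
    mse, var = estimated_mse(eta, w, y, s)
    half = 1.96 * math.sqrt(var)  # half-width of the approximate 95% interval
    return {"tau2": tau2, "estimate": est, "rmse": math.sqrt(mse),
            "ci": (est - half, est + half)}


# Hypothetical inputs: known weights, stratum means, and standard errors.
w = [0.02, 0.28, 0.30, 0.40]
y = [6.0, 8.1, 8.0, 8.2]
s = [0.15, 0.12, 0.11, 0.13]
for name, fn in (("E1", weights_model1), ("E2", weights_model2)):
    print(name, fit(fn, w, y, s))
```

The grid endpoints approximate $E_b$ (very small $\tau^2$) and $E_a$ (very large $\tau^2$), so the selected estimator has estimated MSE no larger than either endpoint on the grid.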
Design-based estimates of variance typically assume that the $\sigma_i^2$ are known, but the estimates above also assume that $\tau^2$ is known. Thus we also present simulated MSEs by generating 10,000 datasets, each with $Y_i^*$, $i = 1, \dots, n$, simulated as independent normal random variables with mean $Y_i$ and standard deviation $s_i$. For each dataset, we estimate $E_a^*$, $E_b^*$, $E_1^*$, and $E_2^*$. The simulated MSE is then reported as the mean of the 10,000 squared errors, where each error is the estimate minus $\sum_{i=1}^n w_i Y_i$. Table 2 reports the square root of these simulated MSEs.

(Table 2 About Here)

The adaptive nature of $E_1$ and $E_2$ is evident in the comparison of the RMSEs. For targets T1 and T3, the estimators adapt towards homogeneity and $E_b$, whereas for target T2, they adapt towards heterogeneity and $E_a$. Of particular interest is that the adaptation in all three cases is in a direction opposite to that suggested by the naive test-then-estimate approach applied in the introduction. Recall that the P-values for testing the null of homogeneity were < 0.0001, 0.078, and 0.046 for T1, T2, and T3. These results show clearly that it can be better to incorporate the weights into the assessment of homogeneity. Standard tests of homogeneity do not do this; moreover, model-based methods of selecting $\tau^2$ do not incorporate the weights into the choice either. We see that despite P-values being below 0.05 for T1 and T3, estimator $E_1$ based on Model 1 has a simulated RMSE that is 90% that of $E_a$ for T1 and 91% that of $E_a$ for T3. However, a P-value higher than 0.05 does not necessarily mean that $E_1$ or $E_2$ will outperform $E_a$, as we see with T2.

It is particularly striking that the P-value for testing homogeneity was < 0.0001 for T1, and yet the RMSE for $E_1$ is 90% of that for $E_a$. Plan 1 in Duval County is very different from the other four less managed plans regarding item A3, but its weight is only 2.1% of the total. Thus, homogeneity was overturned, when for the purposes of estimating $\theta^s$, it should not have been.
This also highlights the advantage of selecting $\tau^2$ to minimize estimated mean squared error of the estimator of $\theta^s$, rather than to accurately reflect heterogeneity of the plans. Model-based estimators of $\theta^s$ select $\tau^2$ in the latter way. To compare our method of selecting $\tau^2$ with model-based approaches, we also implemented the iterative method described in Little [12] to estimate target T1. This method effectively uses Model 1 but selects $\tau^2$ as the solution to the fixed point equation

$$\tau^2 = \frac{1}{n-1}\sum_{i=1}^n \nu_i (Y_i - K)^2,$$

where $\nu_i = \tau^2/(\tau^2+\sigma_i^2)$ and $K = (\sum_i \nu_i Y_i)/(\sum_i \nu_i)$. When the $\sigma_i^2$ are small and all of the $\theta_i$ are homogeneous except for a single outlier associated with a small weight, this method will estimate a larger $\tau^2$ than our method. Thus, we would expect it to have a larger simulated RMSE than that of our $E_1$ for target T1, and indeed it does. Its simulated RMSE is 0.077, close to that of $E_a$. This demonstrates that because the model-based approach for selecting $\tau^2$ does not incorporate the $w_i$, it can perform worse than our approach.

As expected for $E_1$ and $E_2$, the simulated RMSEs, which incorporate variability due to selecting $\tau^2$ based on the data, are generally larger than the approximate RMSEs, which do not. It is interesting to see that $E_1$ outperforms $E_2$ for summary T1, but that the two estimators perform similarly for T3. Nevertheless, the novel model could potentially be better for some applications. It is difficult to conceive of subject-matter considerations which might lead one to prefer one model over the other. It seems to us that the best way to choose is to try both models, and to select the one with a smaller simulated RMSE.

5. Theory, Estimation, and Performance Evaluation

Theory

Models 1 and 2 are special cases of the general random effects model

$$\begin{aligned} Y \mid \theta, \mu, \delta &\sim N(\theta, \Sigma) \\ \theta \mid \mu, \delta &= Z_W\mu + \delta \\ \delta &\sim N(h_W, T_W) \\ \mu &\sim U^p(-\infty, \infty), \end{aligned} \tag{8}$$

with $Y = (Y_1, \dots, Y_n)^T$; $W = (w_1, \dots, w_n)^T$; $\theta = (\theta_1, \dots, \theta_n)^T$; $\Sigma$ a symmetric positive definite known matrix; $Z_W$ an $n \times p$ matrix, possibly dependent upon $W$; the vector $h_W$ and the symmetric positive definite matrix $T_W$ functions of $W$ and of parameters; $U^p(-\infty, \infty)$ the improper uniform distribution for a vector of $p$ independent parameters; and $\delta$ and $\mu$ independent of one another given $W$. For the remainder of the paper we will suppress the subscripts on $Z_W$, $h_W$, and $T_W$. In this framework, we have modeled the fixed effects as random with uninformative priors.

Under mild restrictions on $Z$, $T$, and $\Sigma$ (Appendix 3), the conditional distribution of $(\mu^T, \delta^T)^T$ given $Y$ is multivariate normal with mean $(\hat\mu^T, \hat\delta^T)^T$ solving

$$\begin{aligned} Z^T\Sigma^{-1}Z\mu + Z^T\Sigma^{-1}\delta &= Z^T\Sigma^{-1}Y \\ \Sigma^{-1}Z\mu + (\Sigma^{-1} + T^{-1})\delta &= \Sigma^{-1}Y + T^{-1}h. \end{aligned} \tag{9}$$

Multiplying the second of these equations by $Z^T$ and comparing to the first, we find that $\hat\delta$ satisfies the $p$-dimensional linear constraint

$$Z^T T^{-1}(\hat\delta - h) = 0. \tag{10}$$

This follows naturally from our setup, which uses the $n + p$ parameters in $\mu$ and $\delta$ to represent the $n$-dimensional parameter $\theta$. We observed in Section 2 that this constraint influences our interpretation for $\mu$, analogous to how constraints in models with nonrandom $\mu$ and $\delta$ would do.

The equations at (9) can be re-expressed to solve for $\hat\delta$ independently of $\hat\mu$ as follows:

$$\begin{aligned} \left[I - (\Sigma^{-1} + T^{-1})^{-1}\Sigma^{-1}P_{Z\Sigma}\right]\hat\delta &= (\Sigma^{-1} + T^{-1})^{-1}\Sigma^{-1}\left[(I - P_{Z\Sigma})Y + \Sigma T^{-1}h\right], \\ Z^T T^{-1}\hat\delta &= Z^T T^{-1}h, \end{aligned} \tag{11}$$

with $P_{Z\Sigma} = Z(Z^T\Sigma^{-1}Z)^{-1}Z^T\Sigma^{-1}$. In Appendix 4 we show that, again under mild restrictions on $Z$, $T$, and $\Sigma$, the solution to these $n + p$ equations in $n$ unknowns exists uniquely as

$$\hat\delta = (A + T^{-1})^{-1}(AY + T^{-1}h), \quad\text{with}\quad A = (I - P_{Z\Sigma})^T\Sigma^{-1}(I - P_{Z\Sigma}) = \Sigma^{-1}(I - P_{Z\Sigma}). \tag{12}$$

It follows from (9) that $\hat\mu$ equals

$$\hat\mu = (Z^T\Sigma^{-1}Z)^{-1}(Z^T\Sigma^{-1})(Y - \hat\delta). \tag{13}$$

The posterior distribution of $\theta^s$ given $Y$ has mean

$$E = W^T(Z\hat\mu + \hat\delta) = W^T\left[(I - P_{Z\Sigma})\hat\delta + P_{Z\Sigma}Y\right], \tag{14}$$

which can also be expressed as

$$E = W^T\left[(I - B)Y + BP_{Z(\Sigma+T)}(Y - h) + Bh\right], \tag{15}$$

where $B = \Sigma(\Sigma + T)^{-1}$ and $P_{Z(\Sigma+T)}$ denotes $P_{Z\Sigma}$ with $\Sigma$ replaced by $\Sigma + T$ (Appendix 5).
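To connect constraint (10) with the constraints noted in Section 2, the following worked step specializes it to Models 1 and 2, taking $Z = 1_n$ (the vector of $n$ ones) and $h = 0$ as in those models. The intermediate algebra is filled in here for illustration and is not taken from the appendices.

```latex
\begin{align*}
\text{Model 1: } T = \tau^2 I_n
  &\;\Rightarrow\; Z^T T^{-1}\hat\delta
     = \tau^{-2}\sum_{i=1}^n \hat\delta_i = 0, \\
\text{Model 2: } T = \tau^2\,\mathrm{diag}(1/w_1,\dots,1/w_n)
  &\;\Rightarrow\; Z^T T^{-1}\hat\delta
     = \tau^{-2}\sum_{i=1}^n w_i\hat\delta_i = 0, \\
\text{so under Model 2: } E = W^T(Z\hat\mu + \hat\delta)
  &= \hat\mu\sum_{i=1}^n w_i + \sum_{i=1}^n w_i\hat\delta_i = \hat\mu .
\end{align*}
```

The last line uses only that the weights sum to one, which is why $E_2$ reduces to $\hat\mu$ while $E_1$ in general does not.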
The posterior mean $E$ is an admissible estimator of $\theta^s$ under squared-error loss for any particular specification of the general random effects model (8) ([19], chapter 5). We turn $E$ into an adaptive estimator by letting the data guide our selection of the parameters implicit in $h$ and $T$. For example, we can specify $h = 0$ and $T = \tau^2 T_0$ with $T_0$ a known matrix (possibly dependent upon $W$), so that $\tau^2$ tunes the eigenvalues of $T$ towards 0 or $\infty$. When the eigenvalues of $T$ tend towards $\infty$, the first equation at (11) tends towards

$$(I - P_{Z\Sigma})\hat\delta = (I - P_{Z\Sigma})Y,$$

which implies that $Z\hat\mu + \hat\delta \to Y$ (see equation (14)), so that

$$E \to W^T Y = E_a; \tag{16}$$

whereas, when the eigenvalues tend towards 0, $\hat\delta \to h$ (see the first equation at (11), in which $(\Sigma^{-1} + T^{-1}) \approx T^{-1}$), so that

$$E \to W^T\left[P_{Z\Sigma}Y + (I - P_{Z\Sigma})h\right], \tag{17}$$

which equals $E_b$ when (a) $Z = 1_n$ (the vector of $n$ ones, so that $p = 1$), (b) $\Sigma$ equals the diagonal matrix with entries $\sigma_1^2, \dots, \sigma_n^2$ (denoted henceforth as $\Sigma_d$), and (c) $h = 0$. If we choose $\tau^2$ to minimize the estimated MSE of $E$, we will adapt towards (16) or (17) depending on which limit optimizes the bias-variance trade-off relative to the other. We observe that $E_a$ is minimax and that the right-hand side of (17) is admissible under squared-error loss ([19], chapter 5).

We consider one more particular specification of the model. Model 3 sets $T_0$ equal to $I_n$, sets $h$ equal to a parametric function of $W$, e.g. $(W - \bar W)\beta$, where $\bar W \equiv ((1/n)\sum_{i=1}^n w_i)1_n$ and $\beta$ is a scalar parameter, and keeps $Z$ and $\Sigma$ the same as in Models 1 and 2. This leads to

$$\hat\mu = \frac{\sum_{i=1}^n (Y_i - h_i)/(\sigma_i^2+\tau^2)}{\sum_{i=1}^n 1/(\sigma_i^2+\tau^2)},$$

where $h_i$ is the $i$th element of $h$, and

$$\hat\delta_i = \left(\frac{1}{\sigma_i^2}+\frac{1}{\tau^2}\right)^{-1}\left(\frac{Y_i-\hat\mu}{\sigma_i^2}+\frac{h_i}{\tau^2}\right).$$

Thus,

$$E_3 = \sum_{i=1}^n\left[\eta_i Y_i + w_i\sigma_i^2\gamma_i\left(h_i - \frac{\sum_{j=1}^n \gamma_j h_j}{\sum_{j=1}^n \gamma_j}\right)\right], \tag{18}$$

where

$$\eta_i = \gamma_i\left(\tau^2 w_i + \frac{\sum_{j=1}^n \gamma_j\sigma_j^2 w_j}{\sum_{j=1}^n \gamma_j}\right) \quad\text{and}\quad \gamma_i = \frac{1}{\tau^2+\sigma_i^2}.$$

We observe that this estimator is not a simple weighted average of the $Y_i$ except when $h_i \equiv 0$.
In practice, one would often choose to let $Z\mu$ subsume $h$; however, it is worth recalling that $E_3$ is admissible under squared-error loss no matter what value we choose for $h$. We note that $E_3 \to E_a$ as $\tau^2 \to \infty$, and that it approaches

$$E_b + \sum_{i=1}^n w_i h_i - \frac{\sum_{i=1}^n h_i/\sigma_i^2}{\sum_{i=1}^n 1/\sigma_i^2}$$

as $\tau^2 \to 0$. Due to space constraints, we leave the development of this approach for future work.

Estimation and Performance Evaluation

We develop and present techniques for estimation and performance evaluation assuming $h = 0$ and $T = \tau^2 T_0$ in the random effects model at (8). We choose $\tau^2$ to minimize estimated MSE of the estimator based on the fixed effects sampling model $Y \mid \theta, \mu, \delta \sim N(\theta, \Sigma)$. Because the MSE generally depends upon $\theta$, we will consistently estimate it by substituting $Y$ for $\theta$. Our data-based choice of $\tau^2$ combined with our estimate of $\theta$ leads to a design-consistent estimator of $\theta^s$. The estimators for $\theta$ and $\theta^s$ are $\hat\theta = HY$ and $E = W^T HY$, where

$$H = P_{Z\Sigma} + (I - P_{Z\Sigma})\left[\Sigma^{-1}(I - P_{Z\Sigma}) + \frac{1}{\tau^2}T_0^{-1}\right]^{-1}\Sigma^{-1}(I - P_{Z\Sigma}).$$

The actual bias and variance of $E$ are given by

$$\mathrm{bias}_a = W^T(H - I)\theta \quad\text{and}\quad \mathrm{var}_a = W^T H\Sigma H^T W. \tag{19}$$

We estimate the bias as

$$\widehat{\mathrm{bias}} = W^T(H - I)Y \tag{20}$$

and the MSE as $\widehat{\mathrm{mse}} = \widehat{\mathrm{bias}}^2 + \mathrm{var}_a$. Because $\widehat{\mathrm{mse}}$ depends upon $\tau^2$ in a complicated way, we use the optimize() function within Version 2.3.0 of the statistical programming language R [20] to minimize $\widehat{\mathrm{mse}}$ as a function of $\tau^2$; in turn, we use the minimizing $\hat\tau^2$ to evaluate $\widehat{\mathrm{mse}}$.

A difficulty with our estimated MSEs is that they are derived assuming that $\tau^2$ is known, when in fact we are selecting it based on the data. We therefore rely on the bootstrap [21] for more accurate answers. Specifically, we propose to generate 10,000 datasets with $Y^*$ simulated according to the fixed effects sampling model $Y^* \sim N(Y, \Sigma)$. For each dataset, we estimate $E_a^*$, $E_b^*$, $E_1^*$, $E_2^*$, or another adaptive estimator.
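The bootstrap evaluation can be sketched as follows for $E_a$ and $E_1$. This is an illustration under stated assumptions rather than the implementation behind Table 2: $\Sigma$ is taken to be diagonal with entries $s_i^2$, the inputs are made-up numbers, the function names are hypothetical, a log-spaced grid search stands in for a one-dimensional optimizer, and the default of 2,000 replicates (instead of 10,000) only keeps the example quick. The squared errors are measured against $\sum_{i=1}^n w_i Y_i$.

```python
import math
import random


def weights_ea(w, sigma2, tau2=None):
    """E_a ignores tau^2 and uses the known weights."""
    return list(w)


def weights_model1(w, sigma2, tau2):
    """Weights eta_i of E1, following equation (3)."""
    gamma = [1.0 / (tau2 + v) for v in sigma2]
    pooled = sum(g * v * wj for g, v, wj in zip(gamma, sigma2, w)) / sum(gamma)
    return [g * (tau2 * wi + pooled) for g, wi in zip(gamma, w)]


def select_tau2(weight_fn, w, y, sigma2, grid):
    """Grid value of tau^2 minimizing squared estimated bias plus variance."""
    def mse(t):
        eta = weight_fn(w, sigma2, t)
        bias = sum((e - wi) * yi for e, wi, yi in zip(eta, w, y))
        var = sum(e * e * v for e, v in zip(eta, sigma2))
        return bias * bias + var
    return min(grid, key=mse)


def simulated_rmse(weight_fn, w, y, s, adaptive=True, reps=2000, seed=1):
    """Parametric bootstrap RMSE, re-selecting tau^2 in every replicate."""
    rng = random.Random(seed)
    sigma2 = [si * si for si in s]
    grid = [10.0 ** (k / 10.0) for k in range(-60, 61)]
    target = sum(wi * yi for wi, yi in zip(w, y))
    total = 0.0
    for _ in range(reps):
        y_star = [rng.gauss(yi, si) for yi, si in zip(y, s)]
        tau2 = select_tau2(weight_fn, w, y_star, sigma2, grid) if adaptive else None
        eta = weight_fn(w, sigma2, tau2)
        err = sum(e * ys for e, ys in zip(eta, y_star)) - target
        total += err * err
    return math.sqrt(total / reps)


# Hypothetical inputs: known weights, stratum means, and standard errors.
w = [0.02, 0.28, 0.30, 0.40]
y = [6.0, 8.1, 8.0, 8.2]
s = [0.15, 0.12, 0.11, 0.13]
rmse_a = simulated_rmse(weights_ea, w, y, s, adaptive=False)
rmse_1 = simulated_rmse(weights_model1, w, y, s)
print("simulated RMSE, Ea:", rmse_a)
print("simulated RMSE, E1:", rmse_1)
```

For $E_a$ the simulated RMSE should be close to $\sqrt{\sum_i w_i^2 s_i^2}$, which gives a simple sanity check on the simulation.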
The simulated MSE is then reported as the mean of the 10,000 squared errors, where each error is the estimate minus $\sum_{i=1}^n w_i Y_i$. Table 2 compares the square root of the simulated MSEs for the four estimators that are the focus of this paper. This can be viewed as a simulation study of the performance of the four estimators under three different conditions, corresponding to targets T1, T2, and T3; see section 4 for a summary.

6. Discussion

We have developed and compared adaptive estimators of a weighted average of stratum-specific parameters based on two random effects models: the DerSimonian and Laird [1] model for the parameters (Model 1) and the novel model briefly touched on by Brumback and Brumback [17] and Ghosh and Maiti [18] (Model 2). Although our proposed estimators are motivated by random effects models, our approach to inference is design-based, as both the adaptive estimators and our estimated MSEs are design-consistent. Moreover, our approach to selecting the tuning parameter is also design-based: rather than selecting it to accurately represent heterogeneity of the strata, as a model-based approach would, we select it to minimize a design-based estimate of MSE. In our investigation, we have uncovered the importance that the implicit constraints satisfied by BLUPs have for interpreting the parameters $\mu$ and $\delta_i$ of these models. We have also discovered that Model 1 and Model 2 lead to distinctly different estimators of $\theta_s$ when the number of strata exceeds two. In our application to the Medicaid data, we found that the estimator based on Model 1 ($E_1$) had a slightly smaller RMSE than the one based on Model 2 ($E_2$). However, other datasets could lead to the opposite result. Both estimators outperformed the weighted average of stratum means ($E_a$) in estimating targets T1 and T3, even though the test of homogeneity rejected the null in both instances.
It is particularly striking that the P-value for testing homogeneity was < 0.0001 for T1, and yet the RMSE for $E_1$ was 90% of that for $E_a$. This clearly demonstrates that it is suboptimal to test homogeneity without considering the relative weights. Plan 1 in Duval County was very different from the other four less managed plans regarding item A3, but its weight was only 2.1% of the total. Thus, homogeneity was rejected when, for the purposes of estimating $\theta_s$, it should not have been. This also highlights the advantage of selecting $\tau^2$ to minimize the estimated mean squared error of the estimator of $\theta_s$, rather than to accurately reflect heterogeneity of the plans. Model-based estimators of $\theta_s$ select $\tau^2$ in the latter way. When we applied a model-based estimator for T1 that paralleled $E_1$ except for the method of selecting $\tau^2$, we found that its RMSE was noticeably larger (97% of the RMSE of $E_a$ rather than 90%).

In this paper, we have considered only the case of known $w_i$. In several related applications, the $w_i$ are unknown and need to be estimated from the sample. Examples include adjustment for nonresponse bias, missing data, or confounding. Further research could study the effects of applying our estimators with estimated $w_i$. A related topic is that of estimating summary odds ratios under heterogeneity. Greenland [22] discusses relative advantages of the Miettinen [23] and Mantel-Haenszel [24] summary odds ratio estimators when one cannot assume homogeneity; these summary measures can each be expressed as a weighted sum of stratum-specific odds ratios, with data-dependent weights.

We have also focused mainly on two adaptive estimators, $E_1$ and $E_2$, both of which set the design matrix $Z_W$ of (8) equal to the vector of ones. One could generalize $Z_W$ to depend, for example, on the stratum weights, as we did with estimator (18); future work could develop this approach.
We have only briefly discussed estimation of $\theta_p$, but our estimators of $\theta_s$ are also consistent for this quantity when the $(\theta_i, w_i)$ represent a simple random sample of a larger group of strata. The method of selecting $\tau^2$ would need to be reconsidered, but surely it would still be preferable to incorporate the observed $w_i$ into the choice. It would be interesting to compare the resulting procedure to the early one of Pfeffermann and Nathan [3], as well as to the later work of Pfeffermann et al. [14] and others currently active in multilevel modeling of complex survey data.

Acknowledgements

George Casella was supported by NSF-DEB-0540745. Malay Ghosh was partially supported by NSF Grant SES-0631426. We thank the Florida Agency for Health Care Administration for providing the Medicaid data, and two anonymous reviewers for helpful comments that led to an improved version of this manuscript.

Appendix

Appendix 1.
Here we show that if at least one $w_i$ is distinct from the rest, and if we condition on $\mu$, a "supermodel" that encompasses Model 1 and Model 2 is identifiable. Specifically, under either Model 1 or 2 we consider the density of $Y_1, \ldots, Y_n \mid \mu, \tau^2$, and we introduce a new parameter $\psi$ that equals 1 or 2 for Model 1 or 2. This leads to a supermodel, indexed by parameters $(\psi, \mu, \tau^2)$, such that for $\psi$ equal to either 1 or 2, the distribution $P_{\psi,\mu,\tau^2}$ of $Y_1, \ldots, Y_n$ is that of $n$ independent normal random variables with mean $\mu$; for $\psi = 1$ the variances are $\sigma_i^2 + \tau^2$, but for $\psi = 2$ the variances are $\sigma_i^2 + \tau^2 / w_i$. To show that the supermodel is identifiable, we suppose it is not and derive a contradiction. We suppose that there exists $(\psi_1, \mu_1, \tau_1^2)$ unequal to $(\psi_2, \mu_2, \tau_2^2)$ such that $P_{\psi_1,\mu_1,\tau_1^2} = P_{\psi_2,\mu_2,\tau_2^2}$. If $\psi_1 = \psi_2$, then $(\mu_1, \tau_1^2)$ must equal $(\mu_2, \tau_2^2)$ because the mean and variance parameters of the corresponding multivariate normal distribution are identifiable. It thus suffices to consider the case of $\psi_1 = 1$ and $\psi_2 = 2$.
Nonidentifiability implies that $\mu_1 = \mu_2$, because the marginal normal mean of $Y_1$ when $\psi = 1$ must be the same as that when $\psi = 2$. We furthermore learn that $\sigma_i^2 + \tau_1^2$ must equal $\sigma_i^2 + \tau_2^2 / w_i$ for $i = 1, \ldots, n$, because the marginal normal standard deviations of the $Y_i$ must also be the same for $\psi = 1$ as for $\psi = 2$. Consequently, $\tau_1^2 = \tau_2^2 / w_i$ for all $i$. Because $T$ is positive definite, $\tau_1^2$ and $\tau_2^2$ must be nonzero, so that $w_i = \tau_2^2 / \tau_1^2$ for all $i$. Thus, unless all $w_i$ are equal, we have a contradiction.

Appendix 2.
For Model 2 and the counterexample dataset,

$$\eta_2 = \frac{\frac{1}{3\tau^2 + 1}}{\frac{1}{2\tau^2 + 1} + \frac{1}{3\tau^2 + 1} + \frac{1}{6\tau^2 + 1}}.$$

We need to show that $\eta_2 > 1/3$ for all $\tau^2 \in (0, \infty)$. Multiplying numerator and denominator by $3\tau^2 + 1$, it suffices to show that

$$\frac{3\tau^2 + 1}{2\tau^2 + 1} + \frac{3\tau^2 + 1}{6\tau^2 + 1} = 1 + \frac{\tau^2}{2\tau^2 + 1} + 1 - \frac{3\tau^2}{6\tau^2 + 1} < 2.$$

Assuming $\tau^2 \in (0, \infty)$, it is equivalent to show that

$$\frac{1}{2\tau^2 + 1} < \frac{3}{6\tau^2 + 1},$$

which is true because $6\tau^2 + 1 < 6\tau^2 + 3$.

Appendix 3.
Here we verify (9). Let $\beta = (\mu^T, \delta^T)^T$, $X = [Z \mid I_n]$, $I_n$ be the $n \times n$ identity matrix, $C$ be the block diagonal matrix with first block the $p \times p$ matrix of zeroes and second block equal to $T^{-1}$, and $m$ be the vector $(0_p^T, h^T)^T$, with $0_p$ as the vector of $p$ zeroes. Suppose $\beta$ has prior given by (8), and that $Y$ given $\beta$ is MVN with mean $X\beta$ and variance $\Sigma$. Then minus two times the log of the posterior distribution of $\beta$ given $Y$ equals (up to a constant)

$$(Y - X\beta)^T \Sigma^{-1} (Y - X\beta) + (\beta - m)^T C (\beta - m), \tag{21}$$

which in turn equals (up to a constant)

$$\beta^T (X^T \Sigma^{-1} X + C) \beta - 2 \beta^T (X^T \Sigma^{-1} Y + C m).$$

Assuming that $(X^T \Sigma^{-1} X + C)$ is invertible (this implies mild restrictions on $Z$, $T$, and $\Sigma$), we can complete the square with

$$(X^T \Sigma^{-1} Y + C m)^T (X^T \Sigma^{-1} X + C)^{-1} (X^T \Sigma^{-1} Y + C m),$$

so that minus two times the log of the posterior equals (up to a constant) $(\beta - b)^T V^{-1} (\beta - b)$, with $b = (X^T \Sigma^{-1} X + C)^{-1} (X^T \Sigma^{-1} Y + C m)$ and $V = (X^T \Sigma^{-1} X + C)^{-1}$. Thus the posterior of $\beta = (\mu^T, \delta^T)^T$ is multivariate normal with mean $b$ solving (9).

Appendix 4.
Here we show that the posterior mean of $\delta$ given $Y$ is the solution to (12). By the properties of the multivariate normal distribution, it must also equal the $\hat{\delta}$ of (9); therefore, we know that it solves (11). Minus two times the log of the joint distribution of $Y, \delta, \mu$ is (up to a constant)

$$(Y - Z\mu - \delta)^T \Sigma^{-1} (Y - Z\mu - \delta) + (\delta - h)^T T^{-1} (\delta - h).$$

Recall that $P_{Z\Sigma} = Z (Z^T \Sigma^{-1} Z)^{-1} Z^T \Sigma^{-1}$, and note that $P_{Z\Sigma} Z = Z$. Write

$$(Y - Z\mu - \delta)^T \Sigma^{-1} (Y - Z\mu - \delta) = \left\{ [P_{Z\Sigma}(Y - \delta) - P_{Z\Sigma} Z \mu] + [(I - P_{Z\Sigma})(Y - \delta)] \right\}^T \Sigma^{-1} \left\{ [P_{Z\Sigma}(Y - \delta) - P_{Z\Sigma} Z \mu] + [(I - P_{Z\Sigma})(Y - \delta)] \right\}.$$

Expand the quadratic form according to the terms in square brackets, and note that the cross term is zero since $(I - P_{Z\Sigma})^T \Sigma^{-1} P_{Z\Sigma} = 0$. So this part of the log is

$$(P_{Z\Sigma} Z \mu - P_{Z\Sigma}(Y - \delta))^T \Sigma^{-1} (P_{Z\Sigma} Z \mu - P_{Z\Sigma}(Y - \delta)) + ((I - P_{Z\Sigma})(Y - \delta))^T \Sigma^{-1} ((I - P_{Z\Sigma})(Y - \delta)).$$

Only the first term involves $\mu$, and if we examine it and use the fact that $P_{Z\Sigma} Z = Z$ again, we have

$$(P_{Z\Sigma} Z \mu - P_{Z\Sigma}(Y - \delta))^T \Sigma^{-1} (P_{Z\Sigma} Z \mu - P_{Z\Sigma}(Y - \delta)) = (\mu - \hat{\mu}_\delta)^T (Z^T \Sigma^{-1} Z) (\mu - \hat{\mu}_\delta),$$

where $\hat{\mu}_\delta = (Z^T \Sigma^{-1} Z)^{-1} Z^T \Sigma^{-1} (Y - \delta)$. Thus we have that the conditional distribution of $\mu \mid Y, \delta$ is

$$\mu \mid Y, \delta \sim N\left( \hat{\mu}_\delta, (Z^T \Sigma^{-1} Z)^{-1} \right).$$

In this form, we can now integrate out $\mu$ from the joint distribution to get the marginal distribution of $Y, \delta$. This has minus two times the log equal to (up to a constant)

$$((I - P_{Z\Sigma})(Y - \delta))^T \Sigma^{-1} ((I - P_{Z\Sigma})(Y - \delta)) + (\delta - h)^T T^{-1} (\delta - h).$$

Define $A = (I - P_{Z\Sigma})^T \Sigma^{-1} (I - P_{Z\Sigma})$. Minus two times the log is then (up to a constant)

$$(\delta - Y)^T A (\delta - Y) + (\delta - h)^T T^{-1} (\delta - h).$$

Factor this as

$$(\delta - \hat{\delta})^T (A + T^{-1}) (\delta - \hat{\delta}) + Y^T A Y + h^T T^{-1} h - \hat{\delta}^T (A + T^{-1}) \hat{\delta},$$

where $\hat{\delta} = (A + T^{-1})^{-1} (A Y + T^{-1} h)$, as in (12). Thus, as we wished to show, the conditional distribution of $\delta$ given $Y$ is $\delta \mid Y \sim N(\hat{\delta}, (A + T^{-1})^{-1})$.

Appendix 5.
From the identity

$$(Y - \theta)^T \Sigma^{-1} (Y - \theta) + (\theta - Z\mu - h)^T T^{-1} (\theta - Z\mu - h) = \theta^T (\Sigma^{-1} + T^{-1}) \theta - 2 \theta^T \left[ \Sigma^{-1} Y + T^{-1} (Z\mu + h) \right] + Y^T \Sigma^{-1} Y + (Z\mu + h)^T T^{-1} (Z\mu + h),$$

it follows that

$$E(\theta \mid Y, \mu) = T (\Sigma + T)^{-1} Y + \Sigma (\Sigma + T)^{-1} (Z\mu + h).$$

Noting that $Y \mid \mu \sim N(Z\mu + h, \Sigma + T)$, from the identity

$$(Y - Z\mu - h)^T (\Sigma + T)^{-1} (Y - Z\mu - h) = \mu^T \left[ Z^T (\Sigma + T)^{-1} Z \right] \mu - 2 \mu^T Z^T (\Sigma + T)^{-1} (Y - h) + (Y - h)^T (\Sigma + T)^{-1} (Y - h),$$

it follows that

$$E(\mu \mid Y) = \left[ Z^T (\Sigma + T)^{-1} Z \right]^{-1} Z^T (\Sigma + T)^{-1} (Y - h).$$

Hence

$$E(\theta \mid Y) = (I - B) Y + B P_{Z(\Sigma + T)} (Y - h) + B h,$$

where $B = \Sigma (\Sigma + T)^{-1}$ and $P_{Z(\Sigma + T)}$ denotes $P_{Z\Sigma}$ with $\Sigma$ replaced by $\Sigma + T$. The equation at (15) follows.

References

1. DerSimonian R, Laird N. Meta-Analysis in Clinical Trials. Controlled Clinical Trials 1986; 7: 177-188.
2. DuMouchel WH, Duncan GJ. Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples. Journal of the American Statistical Association 1983; 78: 535-543.
3. Pfeffermann D, Nathan G. Regression Analysis of Data from a Cluster Sample. Journal of the American Statistical Association 1981; 76: 681-689.
4. Robinson GK. That BLUP is a good thing: The estimation of random effects (with discussion). Statistical Science 1991; 6: 15-32.
5. Fay RE, Herriot RA. Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data. Journal of the American Statistical Association 1979; 74: 269-277.
6. James W, Stein C. Estimation with Quadratic Loss. In Proceedings of the Fourth Berkeley Symposium of Mathematical Statistics and Probability, Vol. 1. University of California Press: Berkeley, 1961; 361-379.
7. Ghosh M, Rao JNK. Small Area Estimation: An Appraisal (with discussion). Statistical Science 1994; 9: 55-93.
8. Little RJ. To Model or Not To Model? Competing Modes of Inference for Finite Population Sampling. Journal of the American Statistical Association 2004; 99: 546-556.
9. Gelman A. Struggles with Survey Weighting and Regression Modeling (with discussion). Statistical Science 2007; in press.
10. Elliott MR, Little RJA.
Model-based Alternatives to Trimming Survey Weights. Journal of Official Statistics 2000; 16: 191-209.
11. Lazzeroni LC, Little RJA. Random-effects Models for Smoothing Poststratification Weights. Journal of Official Statistics 1998; 14: 61-78.
12. Little RJA. Inference with Survey Weights. Journal of Official Statistics 1991; 7: 405-424.
13. Rabe-Hesketh S, Skrondal A. Multilevel Modelling of Complex Survey Data. Journal of the Royal Statistical Society, Series A 2006; 169: 805-827.
14. Pfeffermann D, Skinner CJ, Holmes DJ, Goldstein H, Rasbash J. Weighting for Unequal Selection Probabilities in Multilevel Models. Journal of the Royal Statistical Society, Series B 1998; 60: 23-40.
15. Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika 1986; 73: 13-22.
16. Williamson JM, Datta S, Satten GA. Marginal Analyses of Clustered Data When Cluster Size is Informative. Biometrics 2003; 59: 36-42.
17. Brumback BA, Brumback LC. Comment on "Semiparametric Estimation of a Treatment Effect in a Pretest-Posttest Study with Missing Data," by Davidian M, Tsiatis A, and Leon S. Statistical Science 2005; 20: 284-289.
18. Ghosh M, Maiti T. Discussion on the Papers by Firth and Bennett and Pfeffermann et al. Journal of the Royal Statistical Society, Series B 1998; 60: 41-56.
19. Lehmann EL, Casella G. Theory of Point Estimation. Springer-Verlag: New York, 1998.
20. The R Development Core Team. R 2.3.0 - A Language and Environment 2006. http://www.r-project.org/. Accessed on August 22, 2007.
21. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman and Hall: London, 1993.
22. Greenland S. Interpretation and Estimation of Summary Ratios Under Heterogeneity. Statistics in Medicine 1982; 1: 217-227. DOI: 10.1002/sim.2547
23. Miettinen OS. Standardization of Risk Ratios. American Journal of Epidemiology 1972; 96: 383-388.
24. Mantel N, Haenszel WH. Statistical Aspects of the Analysis of Data from Retrospective Studies of Disease.
Journal of the National Cancer Institute 1959; 22: 719-748.