Appendix 1. EM Algorithm for Estimating Stratum-Level Bayesian
Hyperparameters
The EM algorithm is used to estimate the unknown population parameters 𝛽𝑠,𝑝 and Σ𝑠,𝑝 from the
following setup,
𝛽̂𝑐𝑠,𝑝 ~ 𝑀𝑉𝑁(𝛽𝑐𝑠,𝑝 , 𝑉̂𝑐𝑠,𝑝 )
𝛽𝑐𝑠,𝑝 ~ 𝑀𝑉𝑁(𝛽𝑠,𝑝 𝑍𝑐𝑠 , Σ𝑠,𝑝 )
where 𝑝 = (1,2, … , 𝑃) is used to index the set of parameters associated with the 𝑝th synthetic variable of
interest and the 𝑝th regression model from which the direct estimates 𝛽̂𝑐𝑠 and 𝑉̂𝑐𝑠 were obtained in Step 1.
The E step consists of solving the following expectations,
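\[
\beta^{*}_{cs,p} = E(\beta_{cs,p}) = \left[\left(\hat{V}^{-1}_{cs,p} + \Sigma^{-1}_{s,p}\right)^{-1}\left(\hat{V}^{-1}_{cs,p}\,\hat{\beta}_{cs,p} + \Sigma^{-1}_{s,p}\,\beta_{s,p} Z_{cs}\right)\right]
\]
\[
\left[\beta_{cs,p}\left(\beta_{cs,p}\right)^{T}\right]^{*} = E\left[\beta_{cs,p}\,\beta^{T}_{cs,p}\right] = \left(\hat{V}^{-1}_{cs,p} + \Sigma^{-1}_{s,p}\right)^{-1} + \beta^{*}_{cs,p}\left(\beta^{*}_{cs,p}\right)^{T}
\]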
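As an illustrative special case (not part of the original derivation), suppose there is a single coefficient, so that every quantity is a scalar. The shorthand below is introduced only for this illustration: $v$ stands for the sampling variance $\hat{V}_{cs,p}$, $\sigma^{2}$ for $\Sigma_{s,p}$, and $m$ for the stratum-level prediction $\beta_{s,p} Z_{cs}$. The first expectation then reduces to a precision-weighted average of the direct estimate and the prediction:

```latex
\begin{align*}
\beta^{*}_{cs,p}
  &= \left(\frac{1}{v} + \frac{1}{\sigma^{2}}\right)^{-1}
     \left(\frac{\hat{\beta}_{cs,p}}{v} + \frac{m}{\sigma^{2}}\right) \\
  &= \frac{\sigma^{2}}{\sigma^{2} + v}\,\hat{\beta}_{cs,p}
   + \frac{v}{\sigma^{2} + v}\,m ,
\qquad
\left(\frac{1}{v} + \frac{1}{\sigma^{2}}\right)^{-1} = \frac{v\,\sigma^{2}}{\sigma^{2} + v} .
\end{align*}
```

In this form, a county whose direct estimate is noisy (large $v$) is pulled more strongly toward $m$, while a precisely estimated county stays close to its direct estimate.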
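Once these expectations are computed, they are incorporated into the maximization (M) step, which updates
the unknown hyperparameters 𝛽𝑠,𝑝 and Σ𝑠,𝑝 using the following equations,
\[
\hat{\beta}_{s,p} = \left[\sum_{c=1}^{C_s}\left(\beta^{*}_{cs,p}\,Z_{cs}\right)\right]\left[\sum_{c=1}^{C_s}\left(Z_{cs} Z^{T}_{cs}\right)\right]^{-1}, \text{ and}
\]
\[
\hat{\Sigma}_{s,p} = \left[\sum_{s=1}^{S}\left[\sum_{c=1}^{C_s}\left(\beta^{*}_{cs,p} - \hat{\beta}_{s,p} Z_{cs}\right)\left(\beta^{*}_{cs,p} - \hat{\beta}_{s,p} Z_{cs}\right)^{T}\right]\Big/ C_s\right]
\]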
After convergence the maximum likelihood estimates are incorporated into the posterior distribution of
𝛽𝑐𝑠,𝑝 shown in equation [7].
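The iteration between the E and M steps can be sketched for the simplest case of one coefficient and one scalar covariate per county. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `em_scalar`, the starting values, and the convergence rule are hypothetical, and the variance update is assumed to use the second-moment expectation from the E step (that is, it adds the posterior variance to the squared residual).

```python
def em_scalar(beta_hat, v_hat, z, beta=0.0, sigma2=1.0, tol=1e-10, max_iter=1000):
    """Scalar-case EM for the two-level model
        beta_hat[c] ~ N(b[c], v_hat[c]),   b[c] ~ N(beta * z[c], sigma2).
    Returns the estimates of beta and sigma2 and the posterior means of b[c]."""
    n = len(beta_hat)
    post_mean = list(beta_hat)
    for _ in range(max_iter):
        # E step: posterior variance and mean of each county-level coefficient.
        post_var = [1.0 / (1.0 / v + 1.0 / sigma2) for v in v_hat]
        post_mean = [pv * (bh / v + beta * zc / sigma2)
                     for pv, bh, v, zc in zip(post_var, beta_hat, v_hat, z)]
        # M step for beta: regress the posterior means on z (no intercept).
        new_beta = (sum(m * zc for m, zc in zip(post_mean, z))
                    / sum(zc * zc for zc in z))
        # M step for sigma2: expected squared residual, which equals the
        # posterior variance plus the squared residual of the posterior mean
        # (assumption about how the second moment enters the update).
        new_sigma2 = sum(pv + (m - new_beta * zc) ** 2
                         for pv, m, zc in zip(post_var, post_mean, z)) / n
        converged = (abs(new_beta - beta) < tol
                     and abs(new_sigma2 - sigma2) < tol)
        beta, sigma2 = new_beta, new_sigma2
        if converged:
            break
    return beta, sigma2, post_mean


# Hypothetical input: four direct estimates with equal sampling variance
# and an intercept-only covariate (z = 1 for every county).
beta_est, sigma2_est, shrunk = em_scalar(
    beta_hat=[1.0, 2.0, 3.0, 4.0], v_hat=[0.1] * 4, z=[1.0] * 4)
```

With equal sampling variances and an intercept-only covariate, the estimate of the mean converges to the average of the direct estimates (2.5 here), and the variance estimate converges to their mean squared deviation minus the sampling variance (1.25 − 0.1 = 1.15).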
Appendix 2. EM Algorithm for Estimating Overall Bayesian Hyperparameters
The EM algorithm is used to estimate the unknown population parameters 𝛽𝑝 and Ω𝑝 from the
following setup,
𝛽̂𝑠,𝑝 ~ 𝑀𝑉𝑁(𝛽𝑠,𝑝 , 𝑉̂𝑠,𝑝 )
𝛽𝑠,𝑝 ~ 𝑀𝑉𝑁(𝛽𝑝 𝑍𝑠 , Ω𝑝 )
where 𝑝 = (1,2, … , 𝑃) is used to index the set of parameters associated with the 𝑝th synthetic variable of
interest and the 𝑝th regression model from which the hyperparameter estimates 𝛽̂𝑠 and 𝑉̂𝑠 were obtained
via the EM algorithm.
The E step consists of solving the following expectations,
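\[
\beta^{*}_{s,p} = E(\beta_{s,p}) = \left[\left(\hat{V}^{-1}_{s,p} + \Omega^{-1}_{p}\right)^{-1}\left(\hat{V}^{-1}_{s,p}\,\hat{\beta}_{s,p} + \Omega^{-1}_{p}\,\beta_{p} Z_{s}\right)\right]
\]
\[
\left[\beta_{s,p}\left(\beta_{s,p}\right)^{T}\right]^{*} = E\left[\beta_{s,p}\,\beta^{T}_{s,p}\right] = \left(\hat{V}^{-1}_{s,p} + \Omega^{-1}_{p}\right)^{-1} + \beta^{*}_{s,p}\left(\beta^{*}_{s,p}\right)^{T}
\]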
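The two E-step expectations can be sketched in matrix form for a single stratum. This is only an illustration under stated assumptions: matrices are stored as nested lists, the helper names `inverse`, `mat_vec`, and `e_step` are hypothetical, and the argument `prior_mean` stands for the product 𝛽𝑝 𝑍𝑠, which the caller is assumed to have computed already.

```python
def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    # Augment a floating-point copy of the matrix with the identity.
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mat_vec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def e_step(beta_hat, v_hat, omega, prior_mean):
    """Posterior mean and second-moment matrix of one stratum's coefficients.

    beta_hat   : direct estimate (list of length k)
    v_hat      : its sampling covariance (k x k)
    omega      : current between-stratum covariance (k x k)
    prior_mean : current prediction beta_p Z_s (list of length k)
    """
    k = len(beta_hat)
    v_inv, omega_inv = inverse(v_hat), inverse(omega)
    # Posterior covariance: inverse of the summed precision matrices.
    post_cov = inverse([[v_inv[i][j] + omega_inv[i][j] for j in range(k)]
                        for i in range(k)])
    # Precision-weighted combination of the direct estimate and the prediction.
    rhs = [a + b for a, b in zip(mat_vec(v_inv, beta_hat),
                                 mat_vec(omega_inv, prior_mean))]
    post_mean = mat_vec(post_cov, rhs)
    # Second moment: posterior covariance plus outer product of the mean.
    second_moment = [[post_cov[i][j] + post_mean[i] * post_mean[j]
                      for j in range(k)] for i in range(k)]
    return post_mean, second_moment
```

For example, with `v_hat` and `omega` both equal to the diagonal matrix with entries 1 and 4, `beta_hat = [2.0, 4.0]`, and `prior_mean = [0.0, 0.0]`, each coefficient is shrunk halfway toward zero, giving a posterior mean of `[1.0, 2.0]`.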
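Once these expectations are computed, they are incorporated into the maximization (M) step, which updates
the unknown hyperparameters 𝛽𝑝 and Ω𝑝 using the following equations,
\[
\hat{\beta}_{p} = \beta^{*}_{s,p}\,Z_{s}\left(Z_{s} Z^{T}_{s}\right)^{-1}, \text{ and}
\]
\[
\hat{\Omega}_{p} = \left[\sum_{s=1}^{S}\left[\sum_{c=1}^{C_s}\left(\beta^{*}_{s,p} - \hat{\beta}_{p} Z_{s}\right)\left(\beta^{*}_{s,p} - \hat{\beta}_{p} Z_{s}\right)^{T}\right]\Big/ S\right]
\]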
After convergence the maximum likelihood estimates are incorporated into the posterior distribution of
𝛽𝑠,𝑝 shown in equation [8].
Appendix 3. Simulation Study
This section evaluates the repeated sampling properties of the small area inferences using a simulation
study. In this simulation, the 2003-2005 NHIS data are treated as a population from which
subsamples are drawn. From each PSU, 500 random subsamples are drawn with replacement, each
accounting for approximately 30% of the total sample in that PSU. Each NHIS subsample is
used as the basis for constructing a synthetic population from which 100 synthetic samples are drawn, so
that a total of 50,000 (500 × 100) synthetic data sets are generated.
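The resampling design described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the representation of a PSU as a list of record identifiers, the seed, and the reading of "with replacement" as applying to records within a PSU are all assumptions. The step that builds a synthetic population from each subsample is not shown.

```python
import random

N_SUBSAMPLES = 500         # NHIS subsamples drawn per PSU
N_SYNTHETIC = 100          # synthetic samples drawn per subsample
SUBSAMPLE_FRACTION = 0.30  # approximate share of each PSU's sample


def draw_subsample(psu_records, rng, fraction=SUBSAMPLE_FRACTION):
    """Draw one with-replacement subsample of roughly `fraction` of a PSU."""
    k = max(1, round(fraction * len(psu_records)))
    return rng.choices(psu_records, k=k)


def total_synthetic_data_sets(n_subsamples=N_SUBSAMPLES, n_synthetic=N_SYNTHETIC):
    """Number of synthetic data sets implied by the design."""
    return n_subsamples * n_synthetic


rng = random.Random(2003)        # arbitrary fixed seed for reproducibility
psu = list(range(100))           # hypothetical PSU with 100 respondent IDs
subsample = draw_subsample(psu, rng)
```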
Two types of inferences can be obtained from the synthetic data: conditional and unconditional.
Conditional synthetic inferences are obtained from synthetic samples that are based on a single observed
sample drawn from the observed population. This is the situation most commonly encountered in practice
where a survey is carried out on a single population-based sample and the synthetic data is generated
conditional on that sample. Unconditional inferences are obtained from synthetic samples that are based
on multiple, or repeated, population-based samples. Unconditional inferences are not feasible in practice
but can be achieved through simulation.
To obtain conditional inferences, 500 sets of 10 synthetic samples are randomly selected (with
replacement) from the 100 synthetic samples generated conditional on each of the 500 NHIS
subsamples. For each set of 10 synthetic samples, a synthetic estimate and its associated confidence interval
are obtained for each variable in each PSU using the combining rules in equations [1] and [2] of Section 2.2.
To obtain unconditional inferences, 100 sets of 10 synthetic samples are randomly selected with
replacement across each of the 100 NHIS subsamples, and estimates are again obtained using the relevant
combining rules.
We use two measures to evaluate the validity of the synthetic data estimates. The first is
confidence interval coverage (CIC). For conditional inferences, CIC is defined as the proportion of times
that the synthetic data confidence interval $[L_{\hat{q}_M,syn}, U_{\hat{q}_M,syn}]$ contains the actual estimate $\hat{y}_{act}$:
\[
Q_{CIC} = I\left(\hat{y}_{act} \in [L_{\hat{q}_M,syn}, U_{\hat{q}_M,syn}]\right)
\]
where $I(\cdot)$ is an indicator function: $Q_{CIC} = 1$ if $L_{\hat{q}_M,syn} \le \hat{y}_{act} \le U_{\hat{q}_M,syn}$ and $Q_{CIC} = 0$ otherwise.
For unconditional inferences, the only difference is that CIC is calculated as the proportion of times
that the synthetic data confidence interval contains the “true” population value $Y_{pop}$, i.e.,
$L_{\hat{q}_M,syn} \le Y_{pop} \le U_{\hat{q}_M,syn}$.
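The coverage measure amounts to averaging the indicator over repeated synthetic confidence intervals. A minimal sketch, assuming the intervals are already available as (lower, upper) pairs; the function names `covers` and `coverage_rate` are hypothetical:

```python
def covers(lower, upper, target):
    """Coverage indicator: 1 if target lies in [lower, upper], else 0."""
    return 1 if lower <= target <= upper else 0


def coverage_rate(intervals, target):
    """Proportion of (lower, upper) intervals that contain target.

    For conditional inferences, target is the actual estimate; for
    unconditional inferences, it is the population value."""
    return sum(covers(lo, hi, target) for lo, hi in intervals) / len(intervals)


# Hypothetical intervals: three of the four contain the target value 1.0.
example_cic = coverage_rate([(0.0, 1.0), (0.5, 2.0), (1.5, 3.0), (0.9, 1.1)], 1.0)
```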
The second evaluative measure is referred to as the confidence interval overlap (CIO; Karr et al.,
2006). CIO is defined as the average relative overlap between the synthetic and actual data confidence
intervals. For every estimate the average overlap is calculated by,
\[
Q_{CIO} = \frac{1}{2}\left(\frac{U_{over} - L_{over}}{U_{act} - L_{act}} + \frac{U_{over} - L_{over}}{U_{syn} - L_{syn}}\right),
\]
where $U_{act}$ and $L_{act}$ denote the upper and lower bounds of the confidence interval for the actual
estimate $\hat{y}_{act}$, $U_{syn}$ and $L_{syn}$ denote the upper and lower bounds of the confidence interval for the
synthetic data estimate $\hat{q}_M$, and $U_{over}$ and $L_{over}$ denote the upper and lower bounds of the
overlap between the actual and synthetic data confidence intervals. $Q_{CIO}$ can take on any value between 0
and 1. A value of 0 means that there is no overlap between the two intervals, and a value of 1 means that the
two intervals coincide, so that the synthetic data interval exactly covers the actual data interval. Calculating
the confidence interval overlap is only possible for conditional, not unconditional, inferences. This measure
yields a more accurate assessment of data utility in the sense that it accounts for the significance level of the
estimate. That is, estimates with low significance might still have a high confidence interval overlap, and
therefore a high data utility, even if their point estimates differ considerably from each other.
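The overlap measure can be computed directly from the bounds of the two intervals. A minimal sketch, assuming both intervals have positive width; the function name `ci_overlap` is hypothetical, and returning 0 for disjoint intervals is an assumption chosen to match the stated range of 0 to 1:

```python
def ci_overlap(l_act, u_act, l_syn, u_syn):
    """Average relative overlap between the actual-data and synthetic-data
    confidence intervals."""
    l_over = max(l_act, l_syn)   # lower bound of the overlapping region
    u_over = min(u_act, u_syn)   # upper bound of the overlapping region
    if u_over <= l_over:
        return 0.0               # intervals do not overlap
    width = u_over - l_over
    return 0.5 * (width / (u_act - l_act) + width / (u_syn - l_syn))


# Hypothetical intervals: identical, disjoint, and half-overlapping.
identical = ci_overlap(1.0, 3.0, 1.0, 3.0)
disjoint = ci_overlap(0.0, 1.0, 2.0, 3.0)
partial = ci_overlap(0.0, 2.0, 1.0, 3.0)
```

In the third case the overlapping region is [1, 2], which is half the width of each interval, so the measure equals 0.5.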
A3.1 Validity of Univariate Estimates
Table A1 shows CIC and CIO values for estimated means obtained from sampled PSUs. The conditional
CIC and CIO values are high, ranging from 0.91 to 0.99 and from 0.92 to 0.99, respectively. Furthermore, all of the
unconditional CIC values correspond closely to the true CIC values obtained from the actual data. The same
pattern holds for estimates obtained from nonsampled counties (Table A2): for both conditional and
unconditional inferences, all CIC and CIO values equal 0.99, reflecting the large amount of variation in the
synthetic data estimates, which results in wide confidence intervals. Overall, the simulation results suggest that
the synthetic data method yields reasonably valid univariate inferences for sampled and nonsampled small areas.
Table A1. Simulation-Based Confidence Interval Results for Estimated Means Obtained from
Synthetic Data Sets Across all Sampled PSUs

                             Conditional Inference    Unconditional Inference
Variable                     CIC      CIO             CIC      CIC (Actual)
BMI                          0.99     0.99            0.99     0.97
Age                          0.91     0.92            0.99     0.98
Smoker                       0.99     0.98            0.99     0.98
Moderate activity            0.99     0.99            0.99     0.98
Male                         0.99     0.98            0.99     0.98
Hypertension                 0.99     0.97            0.99     0.97
Fair/poor health status      0.99     0.92            0.99     0.97

Abbreviations: CIC – Confidence Interval Coverage; CIO – Confidence Interval Overlap
Table A2. Simulation-Based Confidence Interval Results for Estimated Means Obtained from
Synthetic Data Sets Across all Nonsampled PSUs

                             Conditional Inference    Unconditional Inference
Variable                     CIC      CIO             CIC      CIC (Actual)
BMI                          0.99     0.99            0.99     0.99
Age                          0.99     0.99            0.99     0.99
Smoker                       0.99     0.99            0.99     0.99
Moderate activity            0.99     0.99            0.99     0.99
Male                         0.99     0.99            0.99     0.99
Hypertension                 0.99     0.99            0.99     0.99
Fair/poor health status      0.99     0.99            0.99     0.99

Abbreviations: CIC – Confidence Interval Coverage; CIO – Confidence Interval Overlap
A3.2 Validity of Multivariate Estimates
Simulation results for multivariate estimands are shown in Tables A3 and A4 for sampled and nonsampled
areas, respectively. These tables show average CIC and CIO values for the coefficients of a regression with
dependent variable log(BMI), estimated within each PSU (or county). For the sampled PSUs, the
conditional CIC and CIO values are high and range from 0.98 to 0.99 and from 0.94 to 0.99, respectively, indicating
good confidence interval coverage and overlap for these multivariate estimands. The unconditional CIC
values equal 0.99, which meets or exceeds the true CIC values obtained from the actual data. For
the nonsampled counties, confidence interval coverage and overlap are similarly high for all coefficient
estimates, ranging from 0.98 to 0.99. As was the case for the univariate estimands, the analytic validity of
the multivariate synthetic data estimates appears to be generally high from a repeated sampling
perspective.
Table A3. Simulation-Based Confidence Interval Results for Linear Regression Coefficients
Obtained from Synthetic Data Sets Across all Sampled PSUs

                             Conditional Inference    Unconditional Inference
Regression of BMI(log) on    CIC      CIO             CIC      CIC (Actual)
Intercept                    0.99     0.98            0.99     0.97
Age                          0.99     0.98            0.99     0.97
Smoker                       0.99     0.98            0.99     0.98
Moderate activity            0.99     0.98            0.99     0.97
Male                         0.99     0.98            0.99     0.98
Hypertension                 0.99     0.99            0.99     0.98
Fair/poor health             0.99     0.94            0.99     0.96

Abbreviations: CIC – Confidence Interval Coverage; CIO – Confidence Interval Overlap
Table A4. Simulation-Based Confidence Interval Results for Linear Regression Coefficients
Obtained from Synthetic Data Sets Across all Nonsampled PSUs

                             Conditional Inference    Unconditional Inference
Regression of BMI(log) on    CIC      CIO             CIC      CIC (Actual)
Intercept                    0.99     0.99            0.99     0.99
Age                          0.99     0.99            0.99     0.99
Smoker                       0.99     0.99            0.99     0.99
Moderate activity            0.99     0.99            0.99     0.99
Male                         0.99     0.99            0.99     0.99
Hypertension                 0.99     0.99            0.99     0.99
Fair/poor health             0.99     0.98            0.99     0.99

Abbreviations: CIC – Confidence Interval Coverage; CIO – Confidence Interval Overlap