Appendix 1. EM Algorithm for Estimating Stratum-Level Bayesian Hyperparameters

The EM algorithm is used to estimate the unknown population parameters $\beta_{s,p}$ and $\Sigma_{s,p}$ from the following setup,

\[
\hat{\beta}_{cs,p} \sim MVN\left(\beta_{cs,p}, \hat{V}_{cs,p}\right)
\]
\[
\beta_{cs,p} \sim MVN\left(\beta_{s,p} Z_{cs}, \Sigma_{s,p}\right)
\]

where $p = (1, 2, \ldots, P)$ is used to index the set of parameters associated with the $p$th synthetic variable of interest and the $p$th regression model from which the direct estimates $\hat{\beta}_{cs}$ and $\hat{V}_{cs}$ were obtained in Step 1.

The E step consists of solving the following expectations,

\[
\beta^{*}_{cs,p} = E(\beta_{cs,p}) = \left[\left(\hat{V}^{-1}_{cs,p} + \Sigma^{-1}_{s,p}\right)^{-1} \left(\hat{V}^{-1}_{cs,p}\hat{\beta}_{cs,p} + \Sigma^{-1}_{s,p}\beta_{s,p} Z_{cs}\right)\right]
\]
\[
\left[\beta_{cs,p}\left(\beta_{cs,p}\right)^{T}\right]^{*} = E\left[\beta_{cs,p}\beta^{T}_{cs,p}\right] = \left(\hat{V}^{-1}_{cs,p} + \Sigma^{-1}_{s,p}\right)^{-1} + \beta^{*}_{cs,p}\left(\beta^{*}_{cs,p}\right)^{T}
\]

Once these expectations are computed they are then incorporated into the maximization (M-step) of the unknown hyperparameters $\beta_{s,p}$ and $\Sigma_{s,p}$ using the following equations,

\[
\hat{\beta}_{s,p} = \left[\sum_{c=1}^{C_s} \left(\beta^{*}_{cs,p} Z_{cs}\right)\right]\left[\sum_{c=1}^{C_s} \left(Z_{cs} Z^{T}_{cs}\right)\right]^{-1}, \text{ and}
\]
\[
\hat{\Sigma}_{s,p} = \left[\sum_{s=1}^{S} \left[\sum_{c=1}^{C_s} \left(\beta^{*}_{cs,p} - \hat{\beta}_{s,p} Z_{cs}\right)\left(\beta^{*}_{cs,p} - \hat{\beta}_{s,p} Z_{cs}\right)^{T}\right] / C_s\right]
\]

After convergence the maximum likelihood estimates are incorporated into the posterior distribution of $\beta_{cs,p}$ shown in equation [7].

Appendix 2. EM Algorithm for Estimating Overall Bayesian Hyperparameters

The EM algorithm is used to estimate the unknown population parameters $\beta_{p}$ and $\Omega_{p}$ from the following setup,

\[
\hat{\beta}_{s,p} \sim MVN\left(\beta_{s,p}, \hat{V}_{s,p}\right)
\]
\[
\beta_{s,p} \sim MVN\left(\beta_{p} Z_{s}, \Omega_{p}\right)
\]

where $p = (1, 2, \ldots, P)$ is used to index the set of parameters associated with the $p$th synthetic variable of interest and the $p$th regression model from which the hyperparameter estimates $\hat{\beta}_{s}$ and $\hat{V}_{s}$ were obtained via the EM algorithm.
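As an illustration, the county-level iteration of Appendix 1 can be sketched in NumPy for a single stratum and a single parameter set; the Appendix 2 iteration has the same form one level up, with the stratum-level quantities in place of the county-level ones. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `em_stratum`, the array layout, the starting values, and the stopping rule are all hypothetical. Two further assumptions are made: each $Z_{cs}$ is treated as a column vector, so the cross-product in the update for $\beta_{s,p}$ is computed as $\beta^{*}_{cs,p} Z^{T}_{cs}$, and the second E-step expectation is taken to enter the covariance update by adding the posterior covariance to the residual outer product.

```python
import numpy as np

def em_stratum(beta_hat, V_hat, Z, n_iter=200, tol=1e-8):
    """EM sketch for one stratum (hypothetical helper, not from the source).

    beta_hat : (C, k) direct estimates, one row per county
    V_hat    : (C, k, k) estimated covariance of each direct estimate
    Z        : (C, q) county-level covariates, one row per county
    Returns the hyperparameter estimates B (k, q) and Sigma (k, k),
    plus the posterior means beta_star (C, k) from the last E-step.
    """
    C, k = beta_hat.shape
    ZtZ_inv = np.linalg.inv(Z.T @ Z)           # [sum_c Z_c Z_c^T]^{-1}
    # Assumed starting values: least-squares fit and an identity covariance.
    B = beta_hat.T @ Z @ ZtZ_inv
    Sigma = np.eye(k)
    V_inv = np.linalg.inv(V_hat)               # batched inverse, (C, k, k)

    for _ in range(n_iter):
        Sigma_inv = np.linalg.inv(Sigma)
        # E-step: posterior covariance P_c and posterior mean beta_star_c.
        P = np.linalg.inv(V_inv + Sigma_inv)   # broadcasts over counties
        prior_mean = Z @ B.T                   # row c holds B Z_c
        rhs = (np.einsum("cij,cj->ci", V_inv, beta_hat)
               + prior_mean @ Sigma_inv.T)
        beta_star = np.einsum("cij,cj->ci", P, rhs)
        # M-step: regression of the posterior means on Z, then covariance.
        B_new = beta_star.T @ Z @ ZtZ_inv
        resid = beta_star - Z @ B_new.T
        # Assumption: the expected outer product adds the posterior covariance.
        Sigma_new = (resid.T @ resid + P.sum(axis=0)) / C
        change = max(np.abs(B_new - B).max(), np.abs(Sigma_new - Sigma).max())
        B, Sigma = B_new, Sigma_new
        if change < tol:
            break
    return B, Sigma, beta_star
```

A call such as `B, Sigma, beta_star = em_stratum(beta_hat, V_hat, Z)` returns the converged hyperparameter estimates together with the county-level posterior means, which shrink each direct estimate toward its regression prediction in proportion to its sampling variance.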
The E step consists of solving the following expectations,

\[
\beta^{*}_{s,p} = E(\beta_{s,p}) = \left[\left(\hat{V}^{-1}_{s,p} + \Omega^{-1}_{p}\right)^{-1} \left(\hat{V}^{-1}_{s,p}\hat{\beta}_{s,p} + \Omega^{-1}_{p}\beta_{p} Z_{s}\right)\right]
\]
\[
\left[\beta_{s,p}\left(\beta_{s,p}\right)^{T}\right]^{*} = E\left[\beta_{s,p}\beta^{T}_{s,p}\right] = \left(\hat{V}^{-1}_{s,p} + \Omega^{-1}_{p}\right)^{-1} + \beta^{*}_{s,p}\left(\beta^{*}_{s,p}\right)^{T}
\]

Once these expectations are computed they are then incorporated into the maximization (M-step) of the unknown hyperparameters $\beta_{p}$ and $\Omega_{p}$ using the following equations,

\[
\hat{\beta}_{p} = \beta^{*}_{s,p} Z_{s} \left(Z_{s} Z^{T}_{s}\right)^{-1}, \text{ and}
\]
\[
\hat{\Omega}_{p} = \left[\sum_{s=1}^{S} \left[\sum_{c=1}^{C_s} \left(\beta^{*}_{s,p} - \hat{\beta}_{p} Z_{s}\right)\left(\beta^{*}_{s,p} - \hat{\beta}_{p} Z_{s}\right)^{T}\right] / S\right]
\]

After convergence the maximum likelihood estimates are incorporated into the posterior distribution of $\beta_{s,p}$ shown in equation [8].

Appendix 3. Simulation Study

This section evaluates the repeated sampling properties of the small area inferences based on a simulation application. In this simulation, the 2003-2005 NHIS data is treated as a population from which subsamples are drawn. 500 random subsamples are drawn from each PSU with replacement. Each subsample accounts for approximately 30% of the total sample in each PSU. Each NHIS subsample is used as the basis for constructing a synthetic population from which 100 synthetic samples are drawn. A total of 50,000 synthetic data sets are generated.

Two types of inferences can be obtained from the synthetic data: conditional and unconditional. Conditional synthetic inferences are obtained from synthetic samples that are based on a single observed sample drawn from the observed population. This is the situation most commonly encountered in practice where a survey is carried out on a single population-based sample and the synthetic data is generated conditional on that sample. Unconditional inferences are obtained from synthetic samples that are based on multiple, or repeated, population-based samples. Unconditional inferences are not feasible in practice but can be achieved through simulation.
To obtain conditional inferences, 500 sets of 10 synthetic samples are randomly selected (with replacement) from each of the 100 synthetic samples generated conditional on each of the 500 NHIS subsamples. For each set of 10 synthetic samples, a synthetic estimate and its associated confidence interval are obtained for each variable in each PSU using the combining rule equations [1] and [2] in Section 2.2. To obtain unconditional inferences, 100 sets of 10 synthetic samples are randomly selected with replacement across each of the 100 NHIS subsamples and estimates are obtained again using the relevant combining rules.

We use two measures to evaluate the validity of the synthetic data estimates. The first is confidence interval coverage (CIC). For conditional inferences, CIC is defined as the proportion of times that the synthetic data confidence interval $[L_{\hat{q}_M,syn}, U_{\hat{q}_M,syn}]$ contains the actual estimate $\hat{y}_{act}$:

\[
Q_{CIC} = I\left(\hat{y}_{act} \in [L_{\hat{q}_M,syn}, U_{\hat{q}_M,syn}]\right)
\]

where $I(\cdot)$ is an indicator function: $Q_{CIC} = 1$ if $L_{\hat{q}_M,syn} \le \hat{y}_{act} \le U_{\hat{q}_M,syn}$ and $Q_{CIC} = 0$ otherwise. For unconditional inferences, the only difference is that CIC is calculated as the proportion of times that the synthetic data confidence interval contains the "true" population value $Y_{pop}$, i.e., $L_{\hat{q}_M,syn} \le Y_{pop} \le U_{\hat{q}_M,syn}$.

The second evaluative measure is the confidence interval overlap (CIO; Karr et al., 2006). CIO is defined as the average relative overlap between the synthetic and actual data confidence intervals.
For every estimate the average overlap is calculated by,

\[
Q_{CIO} = \frac{1}{2}\left(\frac{U_{over} - L_{over}}{U_{act} - L_{act}} + \frac{U_{over} - L_{over}}{U_{syn} - L_{syn}}\right),
\]

where $U_{act}$ and $L_{act}$ denote the upper and the lower bound of the confidence interval for the actual estimate $\hat{y}_{act}$, $U_{syn}$ and $L_{syn}$ denote the upper and the lower bound of the confidence interval for the synthetic data estimate $\hat{q}_M$, and $U_{over}$ and $L_{over}$ denote the upper and lower bound of the overlap between the actual and synthetic data confidence intervals. $Q_{CIO}$ can take on any value between 0 and 1. A value of 0 means that there is no overlap between the two intervals, and a value of 1 means that the two intervals coincide exactly, since both ratios in the formula equal 1 only when the overlap spans each interval in full. Calculating the confidence interval overlap is only possible for conditional, not unconditional, inferences. This measure yields a more accurate assessment of data utility in the sense that it accounts for the significance level of the estimate. That is, estimates with low significance might still have a high confidence interval overlap and therefore a high data utility even if their point estimates differ considerably from each other.

A3.1 Validity of Univariate Estimates

Table A1 shows CIC and CIO values for estimated means obtained from sampled PSUs. The conditional CIC and CIO values are high, ranging from 0.91-0.99 and 0.92-0.99, respectively. Furthermore, all of the unconditional CIC values correspond closely to the true CIC values. The same pattern holds true for estimates obtained from nonsampled counties. For conditional and unconditional inferences (Table A2), all CIC and CIO values equal 0.99, reflecting the large amount of variation in the synthetic data estimates, which results in wide confidence intervals. Overall, the simulation results suggest that the synthetic data method yields reasonably valid univariate inferences for sampled and nonsampled small areas.

Table A1. Simulation-Based Confidence Interval Results for Estimated Means Obtained from Synthetic Data Sets Across All Sampled PSUs

                            Conditional Inference    Unconditional Inference
                            CIC       CIO            CIC       CIC (Actual)
BMI                         0.99      0.99           0.99      0.97
Age                         0.91      0.92           0.99      0.98
Smoker                      0.99      0.98           0.99      0.98
Moderate activity           0.99      0.99           0.99      0.98
Male                        0.99      0.98           0.99      0.98
Hypertension                0.99      0.97           0.99      0.97
Fair/poor health status     0.99      0.92           0.99      0.97

Abbreviations: CIC, confidence interval coverage; CIO, confidence interval overlap.

Table A2. Simulation-Based Confidence Interval Results for Estimated Means Obtained from Synthetic Data Sets Across All Nonsampled PSUs

                            Conditional Inference    Unconditional Inference
                            CIC       CIO            CIC       CIC (Actual)
BMI                         0.99      0.99           0.99      0.99
Age                         0.99      0.99           0.99      0.99
Smoker                      0.99      0.99           0.99      0.99
Moderate activity           0.99      0.99           0.99      0.99
Male                        0.99      0.99           0.99      0.99
Hypertension                0.99      0.99           0.99      0.99
Fair/poor health status     0.99      0.99           0.99      0.99

Abbreviations: CIC, confidence interval coverage; CIO, confidence interval overlap.

A3.2 Validity of Multivariate Estimates

Simulation results for multivariate estimands are shown in Tables A3 and A4 for sampled and nonsampled areas, respectively. These tables show average CIC and CIO values for regression coefficients for the dependent variable log(BMI) estimated within each PSU (or county). For the sampled PSUs, the conditional CIC and CIO values are high and range from 0.98-0.99 and 0.94-0.99, respectively, indicating good confidence interval coverage and overlap for these multivariate estimands. The unconditional CIC values equal 0.99, which either meets or exceeds the true CIC values obtained from the actual data. For the nonsampled counties, the confidence interval coverage and overlap are similarly high for all coefficient estimates, ranging from 0.98-0.99. As was the case for the univariate estimands, the analytic validity of the multivariate synthetic data estimands seems to be generally high from a repeated sampling perspective.

Table A3. Simulation-Based Confidence Interval Results for Linear Regression Coefficients Obtained from Synthetic Data Sets Across All Sampled PSUs

                            Conditional Inference    Unconditional Inference
                            CIC       CIO            CIC       CIC (Actual)
Regression of BMI(log) on
  Intercept                 0.99      0.98           0.99      0.97
  Age                       0.99      0.98           0.99      0.97
  Smoker                    0.99      0.98           0.99      0.98
  Moderate activity         0.99      0.98           0.99      0.97
  Male                      0.99      0.98           0.99      0.98
  Hypertension              0.99      0.99           0.99      0.98
  Fair/poor health          0.99      0.94           0.99      0.96

Abbreviations: CIC, confidence interval coverage; CIO, confidence interval overlap.

Table A4. Simulation-Based Confidence Interval Results for Linear Regression Coefficients Obtained from Synthetic Data Sets Across All Nonsampled PSUs

                            Conditional Inference    Unconditional Inference
                            CIC       CIO            CIC       CIC (Actual)
Regression of BMI(log) on
  Intercept                 0.99      0.99           0.99      0.99
  Age                       0.99      0.99           0.99      0.99
  Smoker                    0.99      0.99           0.99      0.99
  Moderate activity         0.99      0.99           0.99      0.99
  Male                      0.99      0.99           0.99      0.99
  Hypertension              0.99      0.99           0.99      0.99
  Fair/poor health          0.99      0.98           0.99      0.99

Abbreviations: CIC, confidence interval coverage; CIO, confidence interval overlap.
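The two evaluation measures used in this appendix are simple to compute once the interval bounds are available. The following Python sketch is an illustration only: the function names `cic_indicator`, `cic`, and `ci_overlap` are hypothetical, and the handling of disjoint intervals (returning 0) is an assumption taken from the statement that a CIO of 0 means no overlap.

```python
def cic_indicator(target, lower, upper):
    """Return 1 if the synthetic interval [lower, upper] contains the target.

    The target is the actual estimate for conditional inferences and the
    population value for unconditional inferences.
    """
    return 1 if lower <= target <= upper else 0

def cic(target, intervals):
    """Proportion of synthetic intervals (lower, upper) covering the target."""
    hits = [cic_indicator(target, lo, hi) for lo, hi in intervals]
    return sum(hits) / len(hits)

def ci_overlap(l_act, u_act, l_syn, u_syn):
    """Average relative overlap of the actual and synthetic intervals."""
    l_over = max(l_act, l_syn)   # lower bound of the overlap region
    u_over = min(u_act, u_syn)   # upper bound of the overlap region
    if u_over <= l_over:
        # Assumption: disjoint intervals are scored as zero overlap.
        return 0.0
    width = u_over - l_over
    return 0.5 * (width / (u_act - l_act) + width / (u_syn - l_syn))

# Illustrative values only (not taken from the simulation study).
print(cic(1.0, [(0.5, 1.5), (1.2, 2.0), (0.0, 1.0), (0.9, 1.1)]))  # 0.75
print(ci_overlap(0.0, 2.0, 1.0, 3.0))                              # 0.5
```

In the second call the overlap region is [1, 2], which covers half of each interval, so the average relative overlap is 0.5.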
