Variance Estimation in Complex Surveys
Drew Hardin, Kinfemichael Gedif

So far
– Variance of the estimated mean and total under SRS, stratified, and cluster (single-stage and multistage) designs
– Variance of an estimated ratio of two means under SRS (we used the linearization method)

What about other cases?
– Variance for estimators that are not linear combinations of means and totals (e.g., ratios)
– Variance for other statistics estimated from complex surveys: medians, quantiles, functions of the EMF, etc.
– Other approaches are necessary

Outline
– Variance estimation methods
  – Linearization
  – Random group methods
  – Balanced repeated replication (BRR)
  – Resampling techniques: jackknife, bootstrap
– Adapting to complex surveys
– ‘Hot’ research areas
– References

Linearization (Taylor Series Methods)
– We have seen this before (ratio estimator and other courses).
– Suppose our statistic is nonlinear. It can often be approximated using Taylor’s theorem.
– We know how to calculate variances of linear functions of means and totals.

Linearization (Taylor Series Methods)
Linearize $h(c_1, c_2, \ldots, c_k)$ around the population totals:

$$h(\hat t_1, \ldots, \hat t_k) \approx h(t_1, \ldots, t_k) + \sum_{j=1}^{k} (\hat t_j - t_j) \left. \frac{\partial h(c_1, \ldots, c_k)}{\partial c_j} \right|_{t_1, \ldots, t_k}$$

Calculate the variance:

$$V\left[h(\hat t_1, \ldots, \hat t_k)\right] \approx \left(\frac{\partial h}{\partial c_1}\right)^2 V(\hat t_1) + \cdots + \left(\frac{\partial h}{\partial c_k}\right)^2 V(\hat t_k) + 2 \sum_{i<j} \frac{\partial h}{\partial c_i} \frac{\partial h}{\partial c_j} \operatorname{Cov}(\hat t_i, \hat t_j)$$

with all partial derivatives evaluated at $(t_1, \ldots, t_k)$.

Linearization (Taylor Series Methods)
– Pro:
  – Can be applied in general sampling designs
  – Theory is well developed
  – Software is available
– Con:
  – Finding the partial derivatives may be difficult
  – A different derivation is needed for each statistic
  – The function of interest may not be expressible as a smooth function of population totals or means
  – The accuracy of the linearization approximation is uncertain

Random Group Methods
– Based on the concept of replicating the survey design
– It is not usually possible to simply go out and replicate the survey
– However, the survey can often be divided into R groups so that each group forms a miniature version of the survey

Random Group Methods
[Figure: Strata 1 to 5, each containing sampled units numbered 1 to 8; each random group is treated as a miniature sample.]

Unbiased estimator (average of samples), where $\hat\theta_r$ is the estimate from random group $r$ and $\tilde\theta = \frac{1}{R}\sum_{r=1}^{R} \hat\theta_r$:

$$\hat V_1(\tilde\theta) = \frac{1}{R} \cdot \frac{1}{R-1} \sum_{r=1}^{R} (\hat\theta_r - \tilde\theta)^2$$

Slightly biased estimator (all data), where $\hat\theta$ is computed from the full sample:

$$\hat V_2(\hat\theta) = \frac{1}{R} \cdot \frac{1}{R-1} \sum_{r=1}^{R} (\hat\theta_r - \hat\theta)^2$$

Random Group Methods
– Pro:
  – Easy to calculate
  – General method (can also be used for non-smooth functions)
– Con:
  – Assumes independent groups (a problem when N is small)
  – Small number of groups (particularly if one stratum is sampled only a few times)
  – The survey design must be replicated in each random group (the strata and clusters must remain the same)

Resampling and Replication Methods
– Balanced Repeated Replication (BRR): special case when $n_h = 2$
– Jackknife (Quenouille (1949); Tukey (1958))
– Bootstrap (Efron (1979); Shao and Tu (1995))
These methods
– Extend the idea of the random group method
– Allow replicate groups to overlap
– Are all-purpose methods
– Asymptotic properties?

Balanced Repeated Replication
– Suppose we had sampled 2 units per stratum.
– With $H$ strata, there are $2^H$ ways to pick 1 unit from each stratum.
– Each combination could be treated as a sample.
– Pick R of these samples.

Balanced Repeated Replication
Which samples should we include?
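– Within each stratum, assign one sampled unit the value 1 and the other the value –1
– Select samples that are orthogonal to one another to create balance
– You can use the design matrix of a fractional factorial
– For replicate $r$, specify a vector $a_r$ of 1 and –1 values, one entry per stratum

Estimator:

$$\hat V_{BRR}(\hat\theta) = \frac{1}{R} \sum_{r=1}^{R} \left(\hat\theta(a_r) - \hat\theta\right)^2$$

Balanced Repeated Replication
– Pro:
  – Relatively few computations
  – Asymptotically equivalent to linearization methods for smooth functions of population totals and quantiles
  – Can be extended to use weights
– Con:
  – Requires 2 PSUs per stratum
  – Can be extended with more complex schemes

The Jackknife: SRS with replacement
Quenouille (1949); Tukey (1958); Shao and Tu (1995)
Let $\hat\theta_{(i)}$ be the estimator of $\theta$ after omitting the $i$th observation.

Jackknife estimate:

$$\hat\theta_J = \frac{1}{n} \sum_{i=1}^{n} \tilde\theta_i, \quad \text{where } \tilde\theta_i = n\hat\theta - (n-1)\hat\theta_{(i)}$$

Jackknife estimator of $V(\hat\theta)$:

$$V_J(\hat\theta) = \frac{n-1}{n} \sum_{i=1}^{n} \left(\hat\theta_{(i)} - \hat\theta_{(\cdot)}\right)^2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left(\tilde\theta_i - \hat\theta_J\right)^2, \quad \text{where } \hat\theta_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat\theta_{(i)}$$

For stratified SRS without replacement, see Jones (1974).

The Jackknife: stratified multistage design
– In stratum $h$, delete one PSU at a time
– Let $\hat\theta^{(hi)}$ be the estimator of the same form as $\hat\theta$ when PSU $i$ of stratum $h$ is omitted

Jackknife estimate:

$$\hat\theta^{(hi)} = g(\bar y^{(hi)}), \quad \text{where } \bar y^{(hi)} = \sum_{h' \neq h} W_{h'} \bar y_{h'} + W_h \frac{n_h \bar y_h - y_{hi}}{n_h - 1}$$

Or, using pseudovalues $\tilde\theta^{(hi)} = n_h \hat\theta - (n_h - 1)\hat\theta^{(hi)}$, with $n = \sum_h n_h$:

$$\tilde\theta_J^{(I)} = \frac{1}{n} \sum_{h=1}^{L} \sum_{i=1}^{n_h} \tilde\theta^{(hi)}, \qquad \tilde\theta_J^{(II)} = \frac{1}{L} \sum_{h=1}^{L} \frac{1}{n_h} \sum_{i=1}^{n_h} \tilde\theta^{(hi)}$$

The Jackknife: stratified multistage design
Different formulae for $V(\hat\theta)$:

$$V_L(\hat\theta) = \sum_{h=1}^{L} \frac{n_h - 1}{n_h} \sum_{i=1}^{n_h} \left(\hat\theta^{(hi)} - \hat\theta^{\text{method}}\right)^2$$

where $\hat\theta^{\text{method}}$ can be $\hat\theta^{(h)}$, $\hat\theta$, $\sum_{h=1}^{L} \sum_{i=1}^{n_h} \hat\theta^{(hi)} / n$, or $\sum_{h=1}^{L} \hat\theta^{(h)} / L$, and $\hat\theta^{(h)} = \sum_{i=1}^{n_h} \hat\theta^{(hi)} / n_h$ is the average of the delete-one estimates in stratum $h$.

Using the pseudovalues (the factor changes because pseudovalue deviations are $(n_h - 1)$ times the deviations of $\hat\theta^{(hi)}$):

$$V_L(\hat\theta) = \sum_{h=1}^{L} \frac{1}{n_h (n_h - 1)} \sum_{i=1}^{n_h} \left(\tilde\theta^{(hi)} - \tilde\theta_J^{(j)}\right)^2, \quad j = I, II$$

The Jackknife: asymptotics
Krewski and Rao (1981)
– Based on the concept of a sequence of finite populations $\{\Pi_L\}$ with $L$ strata in $\Pi_L$
– Under conditions C1 to C6 given in the paper, as $L \to \infty$:
  i) $n^{1/2}(\hat\theta - \theta) \to_d N(0, \sigma^2)$
  ii) $n V_{\text{method}}(\hat\theta) \to_p \sigma^2$
  iii) $T_{\text{method}} = (\hat\theta - \theta) / \sqrt{V_{\text{method}}(\hat\theta)} \to_d N(0, 1)$
– Here "method" denotes the variance estimator used (linearization, BRR, jackknife)

The Bootstrap: naïve bootstrap
Efron (1979); Rao and Wu (1988); Shao and Tu (1995)
– Resample $\{y^*_{hi}\}_{i=1}^{n_h}$ with replacement in stratum $h$
– Estimate, for $b = 1, 2, \ldots, B$:

$$\bar y_h^{*(b)} = n_h^{-1} \sum_{i=1}^{n_h} y_{hi}^{*(b)}, \qquad \bar y^{*(b)} = \sum_h W_h \bar y_h^{*(b)}, \qquad \hat\theta^{*(b)} = g(\bar y^{*(b)})$$

– Variance:

$$\hat V_{NBS}(\hat\theta^*) = E_*\left(\hat\theta^* - E_*(\hat\theta^*)\right)^2$$

– Or approximate it by Monte Carlo, with $\hat\theta^{*(\cdot)} = B^{-1} \sum_{b=1}^{B} \hat\theta^{*(b)}$:

$$\hat V^*_{NBS}(\hat\theta^*) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat\theta^{*(b)} - \hat\theta^{*(\cdot)}\right)^2$$

– The estimator is not a consistent estimator of the variance of a general nonlinear statistic

The Bootstrap: naïve bootstrap
For the linear case $\hat\theta^* = \sum_h W_h \bar y_h^* = \bar y^*$:

$$\operatorname{Var}_*(\bar y^*) = \sum_h \frac{W_h^2}{n_h} \cdot \frac{n_h - 1}{n_h} s_h^2$$

Comparing with

$$\operatorname{var}(\bar y) = \sum_h \frac{W_h^2}{n_h} s_h^2,$$

the ratio $\operatorname{Var}_*(\bar y^*) / \operatorname{var}(\bar y)$ does not converge to 1 for bounded $n_h$.

The Bootstrap: modified bootstrap
– Resample $\{y^*_{hi}\}_{i=1}^{m_h}$, $m_h \geq 1$, with replacement in stratum $h$
– Calculate:

$$\tilde y_{hi} = \bar y_h + \frac{m_h^{1/2}}{(n_h - 1)^{1/2}} (y^*_{hi} - \bar y_h), \qquad \tilde y_h = \frac{1}{m_h} \sum_{i=1}^{m_h} \tilde y_{hi}, \qquad \tilde y = \sum_{h=1}^{L} W_h \tilde y_h, \qquad \tilde\theta = g(\tilde y)$$

– Variance:

$$\hat V_{MBS}(\tilde\theta) = E_*\left(\tilde\theta - E_*(\tilde\theta)\right)^2$$

– Can be approximated with Monte Carlo
– For the linear case, it reduces to the customary unbiased variance estimator
– $m_h < n_h$

More on the bootstrap
– The method can be extended to stratified SRS without replacement by simply changing $\tilde y_{hi}$ to

$$\tilde y_{hi} = \bar y_h + \frac{m_h^{1/2}}{(n_h - 1)^{1/2}} (1 - f_h)^{1/2} (y^*_{hi} - \bar y_h)$$

– For $m_h = n_h - 1$, this method reduces to the naïve bootstrap
– For $n_h = 2$, $m_h = 1$, the method reduces to the random half-sample replication method
– For $n_h > 3$, on the choice of $m_h$, see Rao and Wu (1988)

Simulation
Rao and Wu (1988)
– Jackknife and linearization intervals gave substantial bias for nonlinear statistics in one-sided intervals
– The bootstrap performs best for one-sided intervals (especially when $m_h = n_h - 1$)
– For two-sided intervals, the three methods have similar coverage probabilities
– The jackknife and linearization methods are more stable than the bootstrap
– B = 200 is sufficient

‘Hot’ topics
– Jackknife with non-smooth functions (Rao and Sitter 1996)
– Two-phase variance estimation (Graubard and Korn 2002; Rubin-Bleuer and Schiopu-Kratina 2005)
– Estimating function (EF) bootstrap method (Rao and Tausi 2004)

Software
OSIRIS – BRR, Jackknife
SAS – Linearization
Stata – Linearization
SUDAAN – Linearization, Bootstrap, Jackknife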
WesVar – BRR, Jackknife, Bootstrap

References
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
Graubard, B. J., and Korn, E. L. (2002). Inference for superpopulation parameters using sample surveys. Statistical Science, 17, 73-96.
Krewski, D., and Rao, J. N. K. (1981). Inference from stratified samples: Properties of linearization, jackknife, and balanced replication methods. Annals of Statistics, 9, 1010-1019.
Quenouille, M. H. (1949). Problems in plane sampling. Annals of Mathematical Statistics, 20, 355-375.
Rao, J. N. K., and Wu, C. F. J. (1988). Resampling inferences with complex survey data. JASA, 83, 231-241.
Rao, J. N. K., and Tausi, M. (2004). Estimating function variance estimation under stratified multistage sampling. Communications in Statistics, 33, 2087-2095.
Rao, J. N. K., and Sitter, R. R. (1996). Discussion of Shao’s paper. Statistics, 27, 246-247.
Rubin-Bleuer, S., and Schiopu-Kratina, I. (2005). On the two-phase framework for joint model and design based framework. Annals of Statistics (to appear).
Shao, J., and Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer-Verlag.
Tukey, J. W. (1958). Bias and confidence in not-quite large samples. Annals of Mathematical Statistics, 29, 614.

Not cited in the presentation
Wolter, K. M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
Shao, J. (1996). Resampling methods in sample surveys. Invited paper, Statistics, 27, 203-237, with discussion, 237-254.
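
Illustrative sketch
The delete-one-PSU jackknife and the modified (rescaled) bootstrap from the slides above can be tried out with a short Python example. This is a minimal sketch under stated assumptions, not the presenters' code: it assumes a one-stage stratified sample drawn with replacement, treats every observation as its own PSU, takes the statistic to be g applied to the weighted mean with known stratum weights W_h, centres the jackknife at the full-sample estimate, and uses m_h = n_h - 1 in the bootstrap. All function names are hypothetical.

```python
import random
from statistics import mean, variance


def stratified_mean(strata, weights):
    """Weighted mean: sum over strata of W_h * ybar_h."""
    return sum(w * mean(y) for y, w in zip(strata, weights))


def closed_form_variance(strata, weights):
    """Customary estimator for the linear case: sum of W_h^2 * s_h^2 / n_h."""
    return sum(w ** 2 * variance(y) / len(y) for y, w in zip(strata, weights))


def jackknife_variance(strata, weights, g=lambda ybar: ybar):
    """Delete-one-PSU jackknife, centred at the full-sample estimate."""
    theta_hat = g(stratified_mean(strata, weights))
    total = 0.0
    for h, y_h in enumerate(strata):
        n_h = len(y_h)
        sq_dev = 0.0
        for i in range(n_h):
            # Drop unit i of stratum h; all other strata stay intact.
            reduced = [y[:i] + y[i + 1:] if k == h else y
                       for k, y in enumerate(strata)]
            theta_hi = g(stratified_mean(reduced, weights))
            sq_dev += (theta_hi - theta_hat) ** 2
        total += (n_h - 1) / n_h * sq_dev
    return total


def rescaled_bootstrap_variance(strata, weights, g=lambda ybar: ybar,
                                B=2000, seed=0):
    """Modified bootstrap with m_h = n_h - 1, approximated by Monte Carlo."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(B):
        y_tilde = 0.0
        for y_h, w in zip(strata, weights):
            n_h = len(y_h)
            m_h = n_h - 1  # assumed resample size; requires n_h >= 2
            ybar_h = mean(y_h)
            scale = (m_h / (n_h - 1)) ** 0.5  # equals 1 when m_h = n_h - 1
            resample = [rng.choice(y_h) for _ in range(m_h)]
            y_tilde += w * mean(ybar_h + scale * (y - ybar_h) for y in resample)
        replicates.append(g(y_tilde))
    # Sample variance of the B replicates (divisor B - 1).
    return variance(replicates)
```

For the linear case g(ybar) = ybar, the jackknife value agrees with the closed-form estimator up to rounding, and the bootstrap value approaches it as B grows. Passing a nonlinear g, for example `lambda y: y ** 2`, gives the corresponding replication-based estimates.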