Outline for Class meeting 25 (Chapter 9, Lohr, 4/19/04)
Variance Estimation

I. Often, analytic assessment of the variance of estimators made from probability designs can be done using the simple relationship:

      $V\left(\sum_{i=1}^{k} a_i \hat{t}_i\right) = \sum_{i=1}^{k} a_i^2\, V(\hat{t}_i) + 2 \sum_{i=1}^{k} \sum_{j=i+1}^{k} a_i a_j\, \mathrm{Cov}(\hat{t}_i, \hat{t}_j)$

   A similar relationship holds for estimates of means. (See Lohr, p. 290.) If the sample design is a simple random sample of size n, then $\mathrm{Cov}(\hat{t}_i, \hat{t}_j) = N^2 (1-f)\, S_{ij}/n$, where $S_{ij} = \sum_{k=1}^{N} (y_{ik} - \bar{y}_i)(y_{jk} - \bar{y}_j)/(N-1)$. This covariance can be estimated the same way variance estimators are (by substituting the sample version of the quantity for the population one). There are two kinds of situations in which this relationship is helpful.

   A. When the parameter of interest really is a linear combination of totals or means. Example: Visitor's survey.

   B. When the parameter of interest isn't really a linear combination of totals or means, but can be approximated by one. This is the Taylor linearization approach discussed previously, leading to

         $V[h(\hat{t})] \approx [h'(t)]^2\, V(\hat{t})$,

      or, in the multivariate case,

         $V[h(\hat{t}_1, \ldots, \hat{t}_k)] \approx \left[\frac{\partial h}{\partial t_1}(t_1,\ldots,t_k)\right]^2 V(\hat{t}_1) + \cdots + \left[\frac{\partial h}{\partial t_k}(t_1,\ldots,t_k)\right]^2 V(\hat{t}_k) + \sum_{i \neq j} \frac{\partial h}{\partial t_i}\frac{\partial h}{\partial t_j}\, \mathrm{Cov}(\hat{t}_i, \hat{t}_j)$.

II. When surveys have very complicated designs, or when non-linear estimators are required, it becomes either hard or impossible to calculate variances using analytical methods such as those we have discussed. There are alternatives. An excellent reference for all these methods is a book by Kirk Wolter, Introduction to Variance Estimation (1985).

   A. Random Group Method

      1. Denote your estimator by $\hat\theta$. The idea here is that if you conducted your survey in R identical replicates, computed your estimator from each one ($\hat\theta_r$), and formed the average of the estimators (called $\tilde\theta$), then its variance could be estimated unbiasedly by

            $\hat{V}_1(\tilde\theta) = \frac{1}{R}\,\frac{\sum_{r=1}^{R} (\hat\theta_r - \tilde\theta)^2}{R-1}$.

         Another estimator (biased upward slightly) is

            $\hat{V}_2 = \frac{1}{R}\,\frac{\sum_{r=1}^{R} (\hat\theta_r - \hat\theta)^2}{R-1}$,

         where $\hat\theta$ is the estimator computed from the full sample.

      2. In practice, your sample design is not carried out in identical replicates. But you pretend it was(!), and just divide up your sample into R identical pieces.
         a. This is easy for an SRS.
         b. In a cluster sample, you must keep all the units in a PSU together in the same piece to preserve the correlation structure.
         c. In a stratified multistage design, each random group contains a sample of PSUs from each stratum.

      3. Note that if k PSUs are sampled from the smallest stratum, you cannot have more than k random groups.

   B. The jackknife

      1. Background (review for most of you, I think?) The jackknife is an all-purpose tool (hence the name) for calculating variance when the analytics are intractable. It was developed for "regular" (non-finite-population) parameter estimation; it can also be used in finite population sampling, although less is known about its performance there. The jackknife was popularized by John Tukey and was originally used for bias reduction.

         a. Denote your estimator by $\hat\theta$. (E.g., it could be an estimator of the ratio $\hat{t}_y/\hat{t}_x$.) Let $\hat\theta_{(j)}$ be the estimator of the same form as $\hat\theta$, but not using observation j. Then

               $\hat{V}_{JK}(\hat\theta) = \frac{n-1}{n} \sum_{j=1}^{n} (\hat\theta_{(j)} - \hat\theta)^2$.

         b. Why would this work? An example for which you know the answer will illustrate (not prove) that this is sensible. Suppose the estimator is of a total from an SRS. That is, let

               $\hat\theta = \hat{t} = \frac{N}{n}\sum_{i=1}^{n} y_i$.

            Then $\hat\theta_{(j)} = \frac{N}{n-1}\left(\frac{n\hat{t}}{N} - y_j\right) = \frac{n\hat{t} - N y_j}{n-1}$. Thus $(\hat\theta_{(j)} - \hat\theta)^2 = \left(\frac{\hat{t} - N y_j}{n-1}\right)^2$, and

               $\hat{V}_{JK}(\hat{t}) = \frac{n-1}{n} \sum_{j=1}^{n} \left(\frac{\hat{t} - N y_j}{n-1}\right)^2 = \frac{N^2}{n(n-1)} \sum_{j=1}^{n} (y_j - \bar{y})^2 = \frac{N^2 s^2}{n}$.

            Note that this is the correct variance, except for the fpc.
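            To make 1.a-b concrete, here is a minimal Python sketch (not part of the original outline) of the delete-one-observation jackknife applied to a ratio estimator $\hat{t}_y/\hat{t}_x$ from an SRS; the fpc is ignored and the variable names and toy data are illustrative only.

               import numpy as np

               def jackknife_variance_ratio(y, x):
                   """Delete-one jackknife variance for the ratio estimator t_y/t_x from an SRS.

                   Implements V_JK = (n-1)/n * sum_j (theta_(j) - theta_hat)^2,
                   ignoring the finite population correction.
                   """
                   y = np.asarray(y, dtype=float)
                   x = np.asarray(x, dtype=float)
                   n = len(y)
                   theta_hat = y.sum() / x.sum()  # full-sample ratio estimate

                   # Recompute the ratio with observation j removed, for each j.
                   theta_del = np.array([(y.sum() - y[j]) / (x.sum() - x[j]) for j in range(n)])

                   return (n - 1) / n * np.sum((theta_del - theta_hat) ** 2)

               # Illustration with a pretend SRS of size 10
               rng = np.random.default_rng(1)
               x = rng.uniform(10, 20, size=10)
               y = 2.0 * x + rng.normal(0, 1, size=10)
               print("ratio:", y.sum() / x.sum(),
                     "jackknife SE:", jackknife_variance_ratio(y, x) ** 0.5)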
      2. Jackknife in complex designs

         a. As with the other resampling-type variance estimators we have discussed, one leaves out a PSU at a time when calculating $\hat\theta_{(j)}$. When stratification is present, the variance estimation is done separately in each stratum. (A small computational sketch follows the software summary at the end of these notes.)
         b. The jackknife approach does not work well for estimators that are not smooth in the data, like estimates of the median or other quantiles.
         c. A classic reference for the jackknife and the bootstrap (not for finite populations, but in general) is a monograph, The Jackknife, the Bootstrap and Other Resampling Plans (1982), by Bradley Efron.

III. Software (http://www.fas.harvard.edu/~stats/survey-soft/METHODS.html)

   Summary of variance estimation methods in survey software:

   SAS: Taylor expansion.

   Stata: Taylor-series linearization is used in the survey analysis commands. There are also commands for jackknife and bootstrap variance estimation, although these are not specifically oriented to survey data.

   SUDAAN: The Taylor series linearization method (GEE for regression models) is used, combined with variance estimation formulas specific to the sample design. The user does not need to develop special replicate weights, since the sample design can be specified directly to the program. Jackknife and Balanced Repeated Replication (BRR) variance estimation are also supported.

   WesVar: Balanced repeated replication (including the Fay method), jackknife (several variants), and other replication methods specified by users through the development of replicate weights (e.g., bootstrap).
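   Sketch referenced in II.B.2.a: a minimal Python implementation (not from the original notes) of the stratified delete-one-PSU jackknife for a weighted total. Dropping PSU j in stratum h reweights the remaining PSUs in that stratum by n_h/(n_h - 1), and the squared deviations are combined with factors (n_h - 1)/n_h. Function and variable names and the toy data are illustrative only, and at least two sampled PSUs per stratum are assumed.

      import numpy as np

      def delete_one_psu_jackknife(strata, psus, w, y):
          """Delete-one-PSU jackknife variance for the weighted total sum(w * y).

          For each stratum h with n_h sampled PSUs: drop one PSU at a time,
          multiply the weights of the remaining PSUs in that stratum by n_h/(n_h - 1),
          recompute the estimate, and accumulate
              V_JK = sum_h (n_h - 1)/n_h * sum_j (theta_(hj) - theta_hat)^2.
          """
          strata, psus = np.asarray(strata), np.asarray(psus)
          w, y = np.asarray(w, dtype=float), np.asarray(y, dtype=float)
          theta_hat = np.sum(w * y)  # full-sample weighted total
          v = 0.0
          for h in np.unique(strata):
              in_h = strata == h
              psu_ids = np.unique(psus[in_h])
              n_h = len(psu_ids)  # assumed >= 2
              for j in psu_ids:
                  w_rep = w.copy()
                  w_rep[in_h & (psus == j)] = 0.0               # drop PSU j
                  w_rep[in_h & (psus != j)] *= n_h / (n_h - 1)  # reweight rest of stratum h
                  theta_hj = np.sum(w_rep * y)
                  v += (n_h - 1) / n_h * (theta_hj - theta_hat) ** 2
          return v

      # Tiny illustration: 2 strata, 3 PSUs each, element-level data
      strata = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
      psus   = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
      w      = [10] * 6 + [20] * 6
      y      = [3, 4, 5, 6, 4, 5, 7, 8, 6, 7, 9, 8]
      print("total:", np.dot(w, y),
            "jackknife variance:", delete_one_psu_jackknife(strata, psus, w, y))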