2. Design of the Dutch business survey

Estimation of changes in repeated surveys and their significance

Paul Knottnerus and Arnout van Delden

1. Introduction

In many surveys, a changing population is sampled repeatedly so that the level and the change in the level of a characteristic between two occasions can be estimated. Each estimate should take into account the actual size of the population and should therefore allow for births and deaths. For example, in many countries a labour force survey is held in which the population is sampled monthly to estimate the number of unemployed persons and the rate of unemployment. Another example is a monthly business survey to estimate the level of the monthly turnover and the change in that level relative to a month earlier and a year earlier; see Konschnik et al. (1985).

Variance estimation is needed to judge whether observed changes are statistically significant. Variance estimation is also needed in the design stage of the survey, to determine the optimal sample size and allocation or to determine the optimal estimator.

In those repeated surveys, changes are often estimated using a stratification of the population. Businesses are extremely heterogeneous in terms of size and type of economic activity. Therefore, business surveys are usually designed as a stratified simple random sample selected without replacement (SRS); see Smiths et al. (2003). In surveys of households or persons the sample is usually not stratified because households are less heterogeneous in size. Some social surveys, such as labour force surveys, do however use poststratification to reduce the variance and bias of the estimator.

In deriving the formulas for the variance of an estimated change in a dynamic population with strata, one has to pay attention to various complicating factors. First of all, the estimated changes in level are the result of two components of change; see Holt and Skinner (1989).
The first component is the change in the population mean of those units that remain in the same stratum on both occasions. The second component is the change in stratum composition between two occasions resulting from births and deaths in the population and from population units that migrate from one stratum to the other. Due to the migration of population units, the estimated mean of stratum $h$ at occasion $t$ may be correlated with the mean of stratum $\ell$ at occasion $t+1$. Another complicating factor is that the population is sampled repeatedly, resulting in overlapping samples between two occasions. Different rotating panel designs may be used in business surveys.

Various authors have derived formulas for design-based variance estimators for the estimation of changes. Kish (1965) derived an expression for the variance of estimated changes based on overlapping samples, assuming a stable (no births and deaths) and large population. Tam (1984) removed the assumption of a large population. Lowerre (1979) and Laniel (1987) deal with the variance estimation of change in dynamic populations, but they do not take stratification into account. Hiridoglou et al. (1995) deal with dynamic populations and stratification, but not with changing strata. Nordberg (2000) and Berger (2004) derived formulas for the most complicated situation: a dynamic population with units that move between strata. Nordberg (2000) derives his formulas using inclusion indicators, which requires some algebra. In the present paper, we derive the expressions for SRS sampling in a more straightforward manner. Berger (2004) derives his formulas based on Poisson sampling conditional on the sample size per stratum. However, he does not condition on the actual number of sample units that move between strata. In the present paper, we condition on both the sample size and the number of units that move between strata.
In order to clarify the variance estimation procedures, we apply them to Dutch business surveys that provide estimates of the monthly turnover for the major Standard Industrial Classification (SIC) codes. We focus on the estimates of the 12-month growth rates of the monthly turnover, and we compare the variances of two estimators.

The outline of the paper is as follows. Section 2 gives a brief description of the Dutch business survey of monthly turnover, describing the sampling plan and the estimators for the 12-month growth rates. The formula for the variance estimation of the estimators is derived in section 3. Section 4 illustrates the variance estimation procedures by comparing the variance of two estimators of the monthly turnover of Dutch supermarkets in the period 2003–2004. Section 5 concludes with some remarks. Some useful formulas are given in Appendix A.

2. Design of the Dutch business survey

Every month, Statistics Netherlands estimates the monthly turnover for some of the major SIC codes. The publication includes the 12-month growth rates of the monthly turnover, i.e. the change in the monthly level of turnover from 12 months ago. Throughout this paper we will refer to this growth rate as the yearly growth rate. In section 2.1 we introduce the sampling design. In section 2.2 we describe two different estimators of the yearly growth rates.

2.1. The sampling design

All statistical units are listed in the General Business Register (GBR) that is maintained by Statistics Netherlands. We refer to the statistical units as establishments. The register is updated monthly for births and deaths, while once a year, on December 31, the size category and the type of economic activity (SIC code) are updated. Every first day of the month, an SRS sample is selected from the GBR to estimate the turnover of the current month. In fact, a rotating sample is used. The sample is stratified by size and by type of economic activity.
The probability of selection depends on size and economic activity. The probability of selection decreases with the size of the establishment, with the largest establishments being included in the sample with probability 1. For some SICs not only survey data but also data from administrative sources are available. Those data from administrative sources are all considered as a separate stratum, with variance 0. The sample is updated in two ways: each month the sample is updated to correct for births and deaths in the population, and in January 10 per cent of the sample units are replaced.

Monthly update. The sample size $n_h^t$ per stratum $h$ for each month $t$ is a fixed proportion $f_h$ of the population size $N_h^t$ per stratum $h$:

$$ n_h^t = f_h N_h^t . $$

A reduction in the sample size (deaths) or a net change in $N_h^t$ leads to an adjustment of the sample size. When necessary, additional units are sampled from the new population units (births), but when the number of new population units is too small, new sample units are sampled randomly from already existing establishments. Likewise, when $N_h^t$ is decreasing, units are removed randomly from the sample.

Yearly update. The sample is updated yearly in two steps. In the first step, all population and sample units of December are classified according to the new stratification (new size and SIC code of January). The resulting sample from a new stratum consists of elements with different inclusion probabilities, since the samples from the original strata may have different sampling fractions. To correct for different inclusion probabilities, new units are selected randomly for those units that move to a stratum with greater inclusion probabilities, and units are removed randomly in the opposite situation. These yearly corrections are called the shunting procedure. Next, the sample size is updated for births and deaths in the population of January.
The resulting sample can be considered as an SRS sample from the population of January. In the second step, 10 per cent of the sample units are replaced by random selection within each substratum (see section 3.2).

2.2. Two estimators for the yearly change

Define

$O^t$: level of turnover in month $t$ of all units in the population;
$g^{t,s}$: relative change in the level of turnover between months $t$ and $s$, thus $g^{t,s} = O^t / O^s - 1$, $t > s$.

For the corresponding estimates it holds by definition that

$$ \hat g^{t,t-1} = \frac{\hat O^t}{\hat O^{t-1}} - 1, \tag{2.1} $$

where a "hat" denotes an estimate. The yearly change in the level of the monthly turnover is currently estimated on the basis of a chain of 12 monthly changes in turnover:

$$ \hat G_{publ}^{t,t-12} = 1 + \hat g_{publ}^{t,t-12} = \prod_{j=0}^{11} \left(1 + \hat g^{t-j,t-j-1}\right) = \frac{\hat O^t}{\hat O^{t-1}} \times \frac{\hat O^{t-1}}{\hat O^{t-2}} \times \dots \times \frac{\hat O^{feb}}{\hat O_{new}^{jan}} \times \frac{\hat O_{old}^{jan}}{\hat O^{dec}} \times \dots \times \frac{\hat O^{t-11}}{\hat O^{t-12}} = \frac{\hat O^t}{\hat O^{t-12}} \, \frac{\hat O_{old}^{jan}}{\hat O_{new}^{jan}} \qquad (t \neq jan). \tag{2.2} $$

In January the level of turnover is estimated twice. The first estimate is made before the yearly sample update, is denoted by $\hat O_{old}^{jan}$, and is used to estimate the monthly change of the turnover in January compared to that in December. The second estimate is made after the yearly sample update, is denoted by $\hat O_{new}^{jan}$, and is used to estimate the monthly change of the turnover in February compared to January.

An alternative estimator for the yearly change in the level of the monthly turnover is the ratio of the estimated turnover of month $t$ to that of month $t-12$. That is,

$$ \hat G_{dir}^{t,t-12} = \frac{\hat O^t}{\hat O^{t-12}} = 1 + \hat g_{dir}^{t,t-12}. \tag{2.3} $$

In section 4 we will compare the variance of the two estimators.

3. Variance of the yearly growth rate of a monthly turnover

3.1. Introduction

For the published growth rate over 12 months ($\hat g_{publ}^{t,t-12}$) of the monthly turnover we have according to (2.2)

$$ 1 + \hat g_{publ}^{t,t-12} = \frac{\hat O_{publ}^{t}}{\hat O_{publ}^{t-12}} = \prod_{j=0}^{11} \left(1 + \hat g^{t-j,t-j-1}\right) = \frac{\hat O^t}{\hat O^{t-12}} \, \frac{\hat O_{old}^{jan}}{\hat O_{new}^{jan}}. \tag{3.1} $$

In other words, the January effect due to the double observation in that month is equal to

$$ \frac{\hat O^t / \hat O^{t-12}}{\hat O_{publ}^{t} / \hat O_{publ}^{t-12}} = \frac{\hat O_{new}^{jan}}{\hat O_{old}^{jan}}. $$

In the remainder we denote $O_{old}^{jan}$ and $O_{new}^{jan}$ briefly by $O^{janO}$ and $O^{janN}$. Using (3.1), we get for the variance of the cumulated growth rate over 12 months

$$ \begin{aligned} \operatorname{var}(\hat g_{publ}^{t,t-12}) &= \operatorname{var}(1 + \hat g_{publ}^{t,t-12}) = \operatorname{var}\!\left(\frac{\hat O_{publ}^{t}}{\hat O_{publ}^{t-12}}\right) = \operatorname{var}\!\left(\frac{\hat O^t}{\hat O^{t-12}} \, \frac{\hat O^{janO}}{\hat O^{janN}}\right) \\ &\approx \operatorname{var}\!\left(\frac{\hat O^t}{\hat O^{t-12}}\right) + \left(\frac{O^t}{O^{t-12}}\right)^{2} \operatorname{var}\!\left(\frac{\hat O^{janO}}{\hat O^{janN}}\right) + 2\, \frac{O^t}{O^{t-12}} \operatorname{cov}\!\left(\frac{\hat O^t}{\hat O^{t-12}}, \frac{\hat O^{janO}}{\hat O^{janN}}\right). \end{aligned} \tag{3.2} $$

The last approximation is based on a Taylor-series expansion with respect to $\hat O^t / \hat O^{t-12}$ and $\hat O^{janO} / \hat O^{janN}$. Step by step we now explain how $\operatorname{var}(\hat g_{publ}^{t,t-12})$ can be estimated. The first term at the right-hand side of (3.2) can be approximated by

$$ \begin{aligned} \operatorname{var}\!\left(\frac{\hat O^t}{\hat O^{t-12}}\right) &\approx \frac{1}{(O^{t-12})^{2}} \operatorname{var}\!\left(\hat O^t - G^{t,t-12} \hat O^{t-12}\right) \qquad \left(G^{t,t-12} = \frac{O^t}{O^{t-12}} = 1 + g^{t,t-12}\right) \\ &= \frac{1}{(O^{t-12})^{2}} \left\{ \operatorname{var}(\hat O^t) + (G^{t,t-12})^{2} \operatorname{var}(\hat O^{t-12}) - 2 G^{t,t-12} \operatorname{cov}(\hat O^{t-12}, \hat O^t) \right\}. \end{aligned} \tag{3.3} $$

The major problem is the estimation of $\operatorname{cov}(\hat O^{t-12}, \hat O^t)$. The next section deals with this problem. The second variance term in (3.2) can be written in a similar form. The last term in (3.2) can be approximated by

$$ \begin{aligned} \operatorname{cov}\!\left(\frac{\hat O^t}{\hat O^{t-12}}, \frac{\hat O^{janO}}{\hat O^{janN}}\right) &\approx \frac{1}{O^{t-12} O^{jan}} \operatorname{cov}\!\left(\hat O^t - G^{t,t-12} \hat O^{t-12},\; \hat O^{janO} - \hat O^{janN}\right) \\ &= \frac{1}{O^{t-12} O^{jan}} \left\{ \operatorname{cov}(\hat O^t, \hat O^{janO}) - \operatorname{cov}(\hat O^t, \hat O^{janN}) - G^{t,t-12} \operatorname{cov}(\hat O^{t-12}, \hat O^{janO}) + G^{t,t-12} \operatorname{cov}(\hat O^{t-12}, \hat O^{janN}) \right\}. \end{aligned} \tag{3.4} $$

These four covariances are discussed in section 3.4.

3.2. A static population with dynamic strata

This subsection deals with the question of how to estimate $\operatorname{cov}(\hat O^{t-12}, \hat O^t)$. First we look at the case with a standard refreshment of the panel in January and no births and deaths.
In addition, stratum corrections may have taken place. That is, establishments have been moved to the correct stratum according to their actual number of employees. We can write the covariance term $\operatorname{cov}(\hat O^{t-12}, \hat O^t)$ from (3.3) as

$$ \operatorname{cov}(\hat O^{t-12}, \hat O^t) = \operatorname{cov}\!\left( \sum_{h=1}^{H} N_h^{t-12} \bar o_h^{t-12},\; \sum_{\ell=1}^{H} N_\ell^{t} \bar o_\ell^{t} \right) = \sum_{h=1}^{H} \sum_{\ell=1}^{H} N_h^{t-12} N_\ell^{t} \operatorname{cov}(\bar o_h^{t-12}, \bar o_\ell^{t}), \tag{3.5} $$

where

$\bar o_h^t$: the sample mean of the returns in stratum $h$ in month $t$;
$N_h^t$: size of stratum $h$ in month $t$.

Note that the stratification of the units in month $t-12$ may differ from the one in month $t$. To take this into account, define

$N_{h\ell}^{t-12,t}$: number of units in the population that in month $t-12$ belonged to stratum $h$ and in month $t$ to stratum $\ell$. Denote the corresponding substratum by $U_{h\ell}^{t-12,t}$;
$n_{h\ell 2}^{t-12,t}$: number of units in the sample that in month $t-12$ belonged to stratum $h$ and in month $t$ to stratum $\ell$. The corresponding sample is called $s_{h\ell 2}^{t-12,t}$;
$\bar o_{h\ell 2}^{s}$: the sample mean of the returns in $s_{h\ell 2}^{t-12,t}$ in month $s$ ($s = t-12, t$).

Furthermore, there are the complicating factors of the refreshment and the consequences of the shunting procedure in January. For the impact on the covariances it is important to know that there are two observations in January. The first observation is based on the previous stratification. The second observation occurs after the stratum corrections, the shunting procedure and the refreshment of the panel have been carried out. As discussed in section 2, the shunting procedure can be necessary after the stratum corrections because, for instance, existing establishments move into stratum $\ell$ from another stratum with a different inclusion probability. When the arrivals in stratum $\ell$ come from another stratum $h$ with a smaller inclusion probability, additional drawings from substratum $U_{h\ell}^{t-12,t}$ are required.
When the inclusion probabilities of the arrivals are larger, some arrivals should be removed randomly in order to get equal inclusion probabilities for all units in the new stratum. Next, define

$n_{h\ell 3}^{t-12,t}$: number of units in the sample from $U_{h\ell}^{t-12,t}$ that are removed from the sample in January due to the shunting and refreshing procedures;
$\bar o_{h\ell 3}^{t-12}$: the sample mean of returns in month $t-12$ of those establishments from $U_{h\ell}^{t-12,t}$ which were in the sample up to (and including) December 31 and removed from the sample after January 1 due to the shunting procedure or refreshment procedure;
$n_{h\ell 1}^{t-12,t}$: number of units from $U_{h\ell}^{t-12,t}$ that are drawn in the sample in January after shunting and refreshing the sample;
$\bar o_{h\ell 1}^{t}$: the sample mean of the returns in month $t$ of those establishments in $U_{h\ell}^{t-12,t}$ that are selected in January in the sample due to shunting or refreshing.

The refreshment in January means that in every stratum 10 per cent of the units are replaced by other units from the corresponding substrata $U_{h\ell}^{t-12,t}$. Furthermore, define

$$ n_{h\ell 23}^{t-12} = n_{h\ell 2}^{t-12,t} + n_{h\ell 3}^{t-12,t} \quad \text{and} \quad \bar o_{h\ell 23}^{t-12} = \left( n_{h\ell 2}^{t-12,t} \bar o_{h\ell 2}^{t-12} + n_{h\ell 3}^{t-12,t} \bar o_{h\ell 3}^{t-12} \right) / n_{h\ell 23}^{t-12}, $$

$$ n_{h\ell 12}^{t} = n_{h\ell 1}^{t-12,t} + n_{h\ell 2}^{t-12,t} \quad \text{and} \quad \bar o_{h\ell 12}^{t} = \left( n_{h\ell 1}^{t-12,t} \bar o_{h\ell 1}^{t} + n_{h\ell 2}^{t-12,t} \bar o_{h\ell 2}^{t} \right) / n_{h\ell 12}^{t}. $$

Because we can write $\bar o_h^{t-12}$ and $\bar o_\ell^{t}$ as

$$ \bar o_h^{t-12} = \sum_{g=1}^{H} \frac{n_{hg23}^{t-12}}{n_h^{t-12}} \, \bar o_{hg23}^{t-12}, \qquad \bar o_\ell^{t} = \sum_{k=1}^{H} \frac{n_{k\ell 12}^{t}}{n_\ell^{t}} \, \bar o_{k\ell 12}^{t}, \tag{3.6} $$

we get for the covariances in (3.5)

$$ \begin{aligned} \operatorname{cov}(\bar o_h^{t-12}, \bar o_\ell^{t}) &= \operatorname{cov}\!\left( \sum_{g=1}^{H} \frac{n_{hg23}^{t-12}}{n_h^{t-12}} \, \bar o_{hg23}^{t-12},\; \sum_{k=1}^{H} \frac{n_{k\ell 12}^{t}}{n_\ell^{t}} \, \bar o_{k\ell 12}^{t} \right) \\ &= \frac{1}{n_h^{t-12} n_\ell^{t}} \operatorname{cov}\!\left( n_{h\ell 23}^{t-12} \bar o_{h\ell 23}^{t-12},\; n_{h\ell 12}^{t} \bar o_{h\ell 12}^{t} \right). \end{aligned} \tag{3.7} $$

In the second line we used that $\operatorname{cov}(n_{hg23}^{t-12} \bar o_{hg23}^{t-12}, \bar o_{k\ell 12}^{t}) = 0$ for $k \neq h$ and that $\operatorname{cov}(n_{hg23}^{t-12} \bar o_{hg23}^{t-12}, \bar o_{h\ell 12}^{t}) = 0$ for $g \neq \ell$, because, conditional on $n_{hg23}^{t-12}$, the means $\bar o_{hg23}^{t-12}$ and $\bar o_{h\ell 12}^{t}$ have zero covariance.
Note that nk12 is fixed because, by construction, nkt 12  nt 12 t 12, t N k N t 12 (k  1, ..., H ). t 12 Furthermore, nht  12 , N ht 12,t , N ht 12 ). 23 follows a hypergeometric distribution with parameters ( nh Making use of the formula for conditional covariances, we can derive an expression for the covariance in (3.7) t 12 t t t 12 t 12 t t t 12 t 12, t cov(nht  12 23oh 23 , nh12oh12 )  E{cov(nh 23oh 23 , nh12oh12 nh 23 , nh 2 )} t 12 t 12 t 12, t t t t 12 t 12, t  cov{E (nht  12 23oh 23 nh 23 , nh 2 ), E ( nh12oh12 nh 23 , nh 2 )}. (3.8) For the first term at the right-hand side we have t 12 t t t 12 t 12,t E{cov(n ht 12 23 o h 23 , n h12 o h12 n h 23 , n h 2 )}  t t 12 t t 12 t 12,t  E{n ht 12 23 n h12 cov(o h 23 , o h12 n h 23 , n h 2 )} t  E{n ht 12 23 n h12 ( ,t n ht 12 / n ht 12 2 23 n ht 12 ,t  S ht  12,t {E (n ht 12 2 )  1 N ht  12,t n ht 12 E (n ht 12 23 ) N ht  12,t ) S ht  12,t } }, (3.9) where we used in the third line (A.1) in Appendix A. The second term at the right-hand side of (3.8) equals t 12 t 12 t 12, t t t t 12 t 12, t cov{E (nht  12 23oh 23 nh 23 , nh 2 ), E ( nh12oh12 nh 23 , nh 2 )}  t 12 t t t 12 t t 12 t  cov(nht  12 23Oh , nh12Oh )  Oh Oh cov(nh 23 , nh12 )  0 . 7 The last covariance is zero because nht 12 is fixed. Therefore, the unconditional covariance is equal to ,t (3.9). However, similar to poststratification we consider nht 12 and nht 12 2 23 as fixed. For the conditional covariance we get according (3.7) and (3.9) the simple expression ,t nht  12 nht  12 nht 12 23 2 (  ) S ht 12, t . t 12, t nht 12nt nht  12 N 23 h cov(oht 12 , ot )  (3.10) For a justification of the use of a conditional (co)variance, see Holt and Smith (1979) and Knottnerus ,t t 12 (2003, pp. 133-6). For expressions for E (nht  12 2 ) and E (nh 23 ), see Knottnerus and Van Delden (2005). 
Expression (3.10) can be estimated by

$$ \widehat{\operatorname{cov}}(\bar o_h^{t-12}, \bar o_\ell^{t}) = \frac{n_{h\ell 23}^{t-12}}{n_h^{t-12} n_\ell^{t}} \left( \frac{n_{h\ell 2}^{t-12,t}}{n_{h\ell 23}^{t-12}} - \frac{n_{h\ell 12}^{t}}{N_{h\ell}^{t-12,t}} \right) s_{h\ell 2}^{t-12,t}, \qquad s_{h\ell 2}^{t-12,t} = \frac{1}{n_{h\ell 2}^{t-12,t} - 1} \sum_{i=1}^{n_{h\ell 2}^{t-12,t}} \left( o_{h\ell 2i}^{t-12} - \bar o_{h\ell 2}^{t-12} \right) \left( o_{h\ell 2i}^{t} - \bar o_{h\ell 2}^{t} \right). \tag{3.11} $$

Note that (3.11) is also unbiased for estimating the unconditional covariance (3.9) because

$$ E\!\left( s_{h\ell 2}^{t-12,t} \,\middle|\, n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right) = S_{h\ell}^{t-12,t}. $$

A refinement of (3.11) is based on more observations, so that a possible substantial underestimation of $\operatorname{var}(\hat g^{t,t-12})$ can be avoided. It works as follows. Calculate the correlation coefficient

$$ \hat\rho_{h\ell 2}^{t-12,t} = \frac{s_{h\ell 2}^{t-12,t}}{s_{h\ell 2}^{t-12} \, s_{h\ell 2}^{t}}, \qquad \left( s_{h\ell 2}^{t-s} \right)^{2} = \frac{1}{n_{h\ell 2}^{t-12,t} - 1} \sum_{i=1}^{n_{h\ell 2}^{t-12,t}} \left( o_{h\ell 2i}^{t-s} - \bar o_{h\ell 2}^{t-s} \right)^{2} \quad (s = 0, 12). $$

An alternative estimator for $S_{h\ell}^{t-12,t}$ now becomes

$$ \hat S_{h\ell}^{t-12,t} = s_{h\ell 123}^{t-12,t} = \hat\rho_{h\ell 2}^{t-12,t} \, s_{h\ell 23}^{t-12} \, s_{h\ell 12}^{t}, \tag{3.12} $$

with

$$ \left( s_{h\ell 23}^{t-12} \right)^{2} = \frac{1}{n_{h\ell 23}^{t-12} - 1} \sum_{i=1}^{n_{h\ell 23}^{t-12}} \left( o_{h\ell 23i}^{t-12} - \bar o_{h\ell 23}^{t-12} \right)^{2}, \qquad \left( s_{h\ell 12}^{t} \right)^{2} = \frac{1}{n_{h\ell 12}^{t} - 1} \sum_{i=1}^{n_{h\ell 12}^{t}} \left( o_{h\ell 12i}^{t} - \bar o_{h\ell 12}^{t} \right)^{2}. $$

3.3. Dynamic populations (births and deaths)

In order to allow for deaths and births we extend (3.6) in a natural way as follows:

$$ \bar o_h^{t-12} = \sum_{g=1}^{H} \frac{n_{hg23}^{t-12}}{n_h^{t-12}} \, \bar o_{hg23}^{t-12} + \frac{n_{h,out}^{t-12,t}}{n_h^{t-12}} \, \bar o_{h,out}^{t-12}, \qquad \bar o_\ell^{t} = \sum_{k=1}^{H} \frac{n_{k\ell 12}^{t}}{n_\ell^{t}} \, \bar o_{k\ell 12}^{t} + \frac{n_{\ell,in}^{t-12,t}}{n_\ell^{t}} \, \bar o_{\ell,in}^{t}, \tag{3.13} $$

respectively, where

$$ n_{\ell,in}^{t-12,t} = \frac{n_\ell^{t}}{N_\ell^{t}} \, N_{\ell,in}^{t-12,t}, \qquad n_{h,out}^{t-12,t} \sim \text{Hypergeometric}\!\left( n_h^{t-12}, N_{h,out}^{t-12,t}, N_h^{t-12} \right). $$

Because samples from the substrata for births (say $U_{\ell,in}^{t-12,t}$) and deaths (say $U_{h,out}^{t-12,t}$) can be seen as independent of other samples, they are irrelevant for a further analysis of the covariance. Similar to (3.7) we can write $\operatorname{cov}(\bar o_h^{t-12}, \bar o_\ell^{t})$ as

$$ \operatorname{cov}(\bar o_h^{t-12}, \bar o_\ell^{t}) = \frac{1}{n_h^{t-12} n_\ell^{t}} \operatorname{cov}\!\left( n_{h\ell 23}^{t-12} \bar o_{h\ell 23}^{t-12},\; n_{h\ell 12}^{t} \bar o_{h\ell 12}^{t} \right). \tag{3.14} $$

The rest of the procedure is the same as described in section 3.2; cf. (3.11).

3.4. The remaining covariances

In this section we just mention the expression for the estimator of the second covariance term at the right-hand side of (3.4). Taking births and deaths into account, we can estimate this term similarly to (3.7) or (3.14) by

$$ \widehat{\operatorname{cov}}(\hat O^{t}, \hat O^{janN}) = \sum_{h=1}^{H} N_h^{janN} N_h^{t} \, \widehat{\operatorname{cov}}(\bar o_h^{janN}, \bar o_h^{t}), $$

$$ \widehat{\operatorname{cov}}(\bar o_h^{janN}, \bar o_h^{t}) = \frac{1}{n_h^{janN} n_h^{t}} \, \widehat{\operatorname{cov}}\!\left\{ n_{hh23}^{janN,t} \bar o_{hh23}^{janN},\; n_{hh12}^{janN,t} \bar o_{hh12}^{t} \right\}, $$

$$ n_{hh23}^{janN,t} = n_{hh2}^{janN,t} + n_{hh3}^{janN,t}, \qquad n_{hh12}^{janN,t} = n_{hh1}^{janN,t} + n_{hh2}^{janN,t}, $$

$$ \widehat{\operatorname{cov}}\!\left\{ n_{hh23}^{janN,t} \bar o_{hh23}^{janN},\; n_{hh12}^{janN,t} \bar o_{hh12}^{t} \right\} = n_{hh23}^{janN,t} \left( \frac{n_{hh2}^{janN,t}}{n_{hh23}^{janN,t}} - \frac{n_{hh12}^{janN,t}}{N_{hh}^{janN,t}} \right) s_{hh123}^{janN,t}. $$

The quantity $s_{hh123}^{janN,t}$ in the last line is defined similarly to (3.12) by

$$ s_{hh123}^{janN,t} = \hat\rho_{hh2}^{janN,t} \, s_{hh23}^{jan} \, s_{hh12}^{t} \qquad \left( \hat\rho_{hh2}^{janN,t} = \frac{s_{hh2}^{janN,t}}{s_{hh2}^{jan} \, s_{hh2}^{t}} \right). $$

The other three covariances in (3.4) can be estimated in a similar way.

4. Variance of the change in turnover of Dutch supermarkets

4.1. Introduction

The calculations for the variances and confidence intervals are based on turnover data of Dutch supermarkets for 4-week periods in 2003 and 2004. The population consists of about 3500 statistical units. The turnover data are partly based on administrative files and partly on a stratified sample. The administrative files contain about 950 units. All data from units within the administrative files are put in a separate stratum, with weight 1. A gross sample of about 900 units is drawn from the full list of population units within the GBR, thus also from units that are already present in the administrative files. Consequently, 500 of the 900 sampled units are already present in the administrative files, but those units do not receive a questionnaire. Thus, the net sample contains about 400 units. The sample is stratified by size. All establishments with 50 or more employees are included with probability 1.
The other establishments were sampled with decreasing inclusion probability, from 1:2 (20–49 employees) to 1:40 in the smallest size class (1 employee).

4.2. Results

Table 1 shows that the estimated 95%-margins for the yearly growth rates of the currently used estimator, $\hat g_{publ}^{t,t-12}$ from (2.2), varied between 0,6 and 0,9 (%-point). For example, in the first period (weeks 9–12), the 95%-confidence interval for the yearly growth rate was -1,2 to 0,6 per cent. Surprisingly, the 95%-margins for the alternative estimator, $\hat g_{dir}^{t,t-12}$ from (2.3), which is a function of two instead of four estimated totals, were slightly larger. The estimated 95%-margins of $\hat g_{dir}^{t,t-12}$ varied between 0,7 and 1,0 (%-point).

Table 2 shows why the 95%-margins of the currently used estimator $\hat g_{publ}^{t,t-12}$ are smaller than those of the alternative estimator $\hat g_{dir}^{t,t-12}$. The covariance term $\operatorname{cov}(\hat O^{t} / \hat O^{t-12}, \hat O^{janO} / \hat O^{janN})$ is negative and its absolute value is more than half of the value of $\operatorname{var}(\hat O^{janO} / \hat O^{janN})$, resulting in $\operatorname{var}(\hat g_{publ}^{t,t-12}) < \operatorname{var}(\hat g_{dir}^{t,t-12})$. This covariance is negative because both $\hat O^{t-12}$ and $\hat O^{janO}$ are based on the units before the yearly sample update, whereas $\hat O^{t}$ and $\hat O^{janN}$ are based on the units after the yearly sample update. Thus the ratio $\hat O^{janO} / \hat O^{janN}$ included in the currently used estimator $\hat g_{publ}^{t,t-12}$ reduced the variance compared to the alternative estimator $\hat g_{dir}^{t,t-12}$. Because the estimated bias of both $\hat g_{publ}^{t,t-12}$ and $\hat g_{dir}^{t,t-12}$ is much smaller than their estimated variance (not shown), the accuracy of $\hat g_{publ}^{t,t-12}$, measured by its mean squared error, is also better than that of $\hat g_{dir}^{t,t-12}$.

Table 1. Estimated growth rates and 95%-margins for two estimators of the yearly growth rates (2004 relative to 2003) of the turnover in 4-week periods of Dutch supermarkets.

Period           Growth rate (%)        95%-margin (%-point)
(week numbers)   g_publ     g_dir       g_publ     g_dir
09–12            -0,3       -0,4        0,9        1,0
13–16            -3,7       -3,8        0,9        0,9
17–20             1,6        1,5        0,9        0,9
21–24            -2,2       -2,3        0,8        0,9
25–28             0,5        0,4        0,6        0,7
29–32            -1,7       -1,8        0,6        0,7
33–36            -2,2       -2,3        0,7        0,7
37–40             0,0       -0,1        0,6        0,7
41–44            -2,3       -2,4        0,8        0,9

Here g_publ and g_dir denote $\hat g_{publ}^{t,t-12}$ and $\hat g_{dir}^{t,t-12}$.

Table 2. Estimated variance components.

Period           var(X)      var(Y)      cov(X,Y)    var(Z)      cor(O^{t-12}, O^t)
(week numbers)   (x 10^-6)   (x 10^-6)   (x 10^-6)   (x 10^-6)
09–12            24,2        3,5         -2,6        22,5        0,18
13–16            22,2        3,5         -3,2        19,3        0,17
17–20            21,9        3,5         -3,2        18,9        0,19
21–24            19,4        3,5         -2,8        17,2        0,20
25–28            12,7        3,5         -3,2         9,8        0,37
29–32            12,3        3,5         -3,0         9,7        0,35
33–36            13,6        3,5         -2,4        12,3        0,32
37–40            14,2        3,5         -3,4        10,8        0,42
41–44            19,1        3,5         -3,8        14,9        0,29

$\hat X = \hat O^{t} / \hat O^{t-12}$; $\hat Y = \hat O^{janO} / \hat O^{janN}$; $\hat Z = \hat X \times \hat Y$; the last column gives the correlation between the estimates $\hat O^{t-12}$ and $\hat O^{t}$.

5. Concluding remarks

In the present paper we derived the formula for the variance of the yearly growth rate of a monthly turnover in a dynamic population with changing strata in a straightforward manner. The covariance terms in the formulas are estimated from the correlation, which in turn is estimated from the matched sample. Berger (2004) remarks that the correlation may be overestimated, resulting in a negative bias of the variance of the growth rates. To avoid this bias, Berger (2004) uses conditional Poisson sampling. In the example of the Dutch supermarkets the bias of the estimated variances is expected to be very small. When we repeatedly drew small samples of 3 units from a stratum of 34 population units, we found that the estimated correlation among the sampling units corresponding to period $t$ versus $t-12$ was underestimated by about 10 per cent. In practice, however, only some of the strata have such small sample sizes; other strata are sampled completely and thus without bias. This means that the covariance of the estimated totals $\widehat{\operatorname{cov}}(\hat O^{t}, \hat O^{t-12})$ will be underestimated by less than 10 per cent. The correlation between $\hat O^{t}$ and $\hat O^{t-12}$ was about 0,2.
Taking a 10 per cent bias as a worst-case scenario, this means that the variance of $\hat G_{dir}^{t,t-12}$ was overestimated by no more than 1,9 per cent and the corresponding confidence interval by no more than 1,0 per cent.

The variance formulas proposed here can be extended in various ways. First of all, it can be shown that they can be extended to the situation where the generalised regression estimator or the ratio estimator is used instead of the Horvitz-Thompson estimator. This is useful because in business surveys one often exploits auxiliary data, such as VAT data; see Hiridoglou et al. (1995) and Smiths et al. (2003). Secondly, the variance formulas can be extended to take into account other (nonlinear) functions of totals. One example is the estimation of a chain of 12 growth rates. This is useful for situations where in each month two estimates of the actual level are made, for instance when monthly growth rates are based on the method of matched pairs; see Smiths et al. (2003). Another type of function of totals is the yearly change of the sum of 12 months of the present year compared to the corresponding sum of the previous year. Thirdly, the formulas can easily be adjusted to allow for situations of poststratification. That would be useful for social statistics when estimating a change as the difference between two levels. Finally, we assumed here that the stratum sizes are known. In some situations, however, the population totals of the strata are estimated; see e.g. Lowerre (1979). Future research is needed to find out how the formulas can be extended to allow for estimated stratum sizes.

The example of the Dutch supermarkets shows one of the practical applications of the variance formulas: determining which of two possible estimators has the smaller variance. The results
The results , t 12 show that the variance of the more complicated estimator, gˆ tpubl from (2.2), which corrects for the t , t 12 sample refreshment in January, was slightly smaller than that of the alternative estimator gˆ dir from (2.3). In branches (other SIC codes) where the correlation between the turnover of subsequent months , t 12 t , t 12 is small the opposite may be true: a smaller variance of gˆ dir than of gˆ tpubl . In our example the difference between the two estimators was small. That difference might be more substantial when a larger proportion of the sample size is refreshed in January. Appendix A. Some useful formulas Let s123 denote a mother sample consisting of three mutually disjoint SRS subsamples s1 , s2 and s3 . Let the variable x be observed in s12 and the variable y in s23 . The corresponding sample means are denoted by x12 and y 23 , respectively. Furthermore, it is assumed that s1 and s3 have the same size ( n1  n3 ) . Define  by   n2 / n12 ( n2 / n23 ). Then the covariance between x12 and y 23 is equal to 12 cov( x12 , y23 )  ( S xy  Proof 1 N 1  n12 N (X j  1  1 ) S xy  (  ) S xy N n23 N  X p )(Y j  Yp ). j 1 cov( x12 , y23 )  cov{(1   ) x1  x2 , y2  (1   ) y3}  (1   ) cov( x1 , y23 )  2 cov( x2 , y2 )   (1   ) cov( x2 , y3 )  (1   ) (  2 n2  S xy N  2 ( S xy 1 1  ) S xy   (1   ) n2 N N 1  1  1 ) S xy  (  ) S xy  (  ) S xy . N n12 N n23 N In the third line we used that cov( x1, y23 )  cov( x2 , y3 )  S xy / N . This follows from the conditional covariance formula cov( x2 , y3 )  E{cov( x2 , y3 s2 )}  cov{E ( x2 s2 ), E ( y3 s2 )})  0  cov{x2 ,  Y p  f 2 y2 1  f2 } ( f 2  n2 / N ) S xy S xy f2 f cov( x2 , y2 )   2 (1  f 2 )  . 1  f2 1  f2 n2 N For an alternative proof based on the sampling autocorrelation coefficient, see Knottnerus (2003, p. 375). 
Likewise, for $\lambda = n_2 / n_{12}$ and $\mu = n_2 / n_{23}$ ($\lambda \neq \mu$) we get

$$ \operatorname{cov}(\bar x_{12}, \bar y_{23}) = \left( \frac{\lambda \mu}{n_2} - \frac{1}{N} \right) S_{xy} = \left( \frac{\mu}{n_{12}} - \frac{1}{N} \right) S_{xy} = \left( \frac{\lambda}{n_{23}} - \frac{1}{N} \right) S_{xy}. \tag{A.1} $$

References

Berger, Y.G. (2004). Variance estimation for measures of change in probability sampling, The Canadian Journal of Statistics, 32, 451–467.

Hiridoglou, M.A., Särndal, C.E. and Binder, D.A. (1995). Weighting and estimation in business surveys, pp. 477–502 in: B.G. Cox (ed.), Business Survey Methods, John Wiley & Sons, New York.

Holt, D. and Skinner, C.J. (1989). Components of change in repeated surveys, International Statistical Review, 57, 1–18.

Holt, D. and Smith, T.M.F. (1979). Poststratification, Journal of the Royal Statistical Society, A, 142, 33–46.

Kish, L. (1965). Survey sampling, John Wiley & Sons, New York.

Knottnerus, P. (2003). Sample Survey Theory: Some Pythagorean Perspectives, Springer-Verlag, New York.

Knottnerus, P. and Delden, A. van (2005). Confidence margins and run-away effects of the yearly growth rates of monthly sales in the supermarkets (in Dutch), Internal CBS Report, Voorburg.

Konschnik, C.A., Monsour, N.J. and Detlefsen, R.E. (1985). Constructing and maintaining frames and samples for business surveys, Proceedings of the Survey Research Methods Section, American Statistical Association, 113–122.

Laniel, N. (1987). Variances for a rotating sample from a changing population, Proceedings of the Survey Research Methods Section, American Statistical Association, 496–500.

Lowerre, J.M. (1979). Sampling for change, Proceedings of the Survey Research Methods Section, American Statistical Association, 343–347.

Nordberg, L. (2000). On variance estimation for measures of change when samples are coordinated by the use of permanent random numbers, Journal of Official Statistics, 16, 363–378.

Smiths, P., Pont, M. and Jones, T. (2003). Developments in business survey methodology in the Office for National Statistics, 1994–2000, The Statistician, 52, 257–295.

Tam, S.M. (1984). On covariances from overlapping samples, The American Statistician, 38, 288–289.
