Estimation of changes in repeated surveys and their
significance
Paul Knottnerus and Arnout van Delden
1. Introduction
In many surveys, a changing population is sampled repeatedly so that the level and the change in the
level of a characteristic between two occasions can be estimated. Each estimation should take into
account the actual size of the population, thus it should allow for births and deaths. For example, in
many countries a labour force survey is held in which the population is sampled monthly to estimate
the number of unemployed persons and the rate of unemployment. Another example is a monthly
business survey to estimate the level of the monthly turnover and the change in that level from a
month ago and a year ago; see Konschnik et al. (1985). Variance estimation is needed to judge
whether observed changes are statistically significant. Variance estimation is also needed in the design
stage of the survey, to determine the optimal sample size and allocation or to determine the optimal
estimator.
In those repeated surveys, changes are often estimated using a stratification of the population.
Businesses are extremely heterogeneous in terms of size and type of economic activity. Therefore,
business surveys are usually designed as a stratified simple random sample selected without
replacement (SRS); see Smith et al. (2003). In surveys for households or persons the sample is
usually not stratified because households are less heterogeneous in size. Some social surveys, such as
labour force surveys, however use poststratification to reduce the variance and bias of the estimator.
In deriving the formulas for the variance of an estimated change in a dynamic population with strata,
one has to pay attention to various complicating factors. First of all, the estimated changes in level are
the result of two components of change; see Holt and Skinner (1989). The first component is the
change in the population mean of those units that remain in the same stratum on both occasions. The
second component is the change in stratum composition between two occasions resulting from births
and deaths in the population and from population units that migrate from one stratum to the other. Due
to the migration of population units, the estimated mean of stratum h at occasion t may be correlated
with the mean of stratum $\ell$ at occasion t + 1. Another complicating factor is that the population is
sampled repeatedly, resulting in overlapping samples between two occasions. Different rotating panel
designs may be used in business surveys.
Various authors have derived formulas for design-based variance estimators for the estimation of
changes. Kish (1965) derived an expression for the variance of estimated changes based on
overlapping samples, assuming a stable (no birth and death) and large population. Tam (1984)
removed the assumption of a large population. Lowerre (1979) and Laniel (1987) deal with the
variance estimation of change in dynamic populations, but they do not take stratification into account.
Hidiroglou et al. (1995) deal with dynamic populations and stratification, but not with changing strata.
Nordberg (2000) and Berger (2004) derived formulas for the most complicated situation: a dynamic
population with units that move between strata. Nordberg (2000) derives his formulas using inclusion
indicators, which requires some algebra. In the present paper, we derive the expressions for SRS
sampling in a more straightforward manner. Berger (2004) derives his formulas based on Poisson
sampling conditional on the sample size per stratum. However, he does not condition on the actual
number of sample units that move between strata. In the present paper, we condition on both the
sample size and the number of units that move between strata.
In order to clarify the variance estimation procedures, we apply them to Dutch business surveys that
provide estimates for the monthly turnover for the major Standard Industrial Classification codes. We
focus on the estimates for the 12-month growth rates of the monthly turnover, and we will compare the
variance of two estimators.
The outline of the paper is as follows. Section 2 gives a brief description of the Dutch business survey
of monthly turnover, describing the sampling plan and the estimators for the 12-month growth rates.
The formula for the variance estimation of the estimators is derived in section 3. Section 4 illustrates
the variance estimation procedures by comparing the variance of two estimators of the monthly
turnover of Dutch supermarkets in the period 2003–2004. Section 5 concludes with some remarks.
Some useful formulas are given in Appendix A.
2. Design of the Dutch business survey
Every month, Statistics Netherlands estimates the monthly turnover for some of the major SIC codes.
The publication includes the 12-month growth rates of the monthly turnover, i.e. the change in the
monthly level of turnover from 12 months ago. Throughout this paper we will refer to this growth rate
as the yearly growth rate. In section 2.1 we introduce the sampling design. In section 2.2 we describe
two different estimators of the yearly growth rates.
2.1. The sampling design
All statistical units are listed in the General Business Register that is maintained by Statistics
Netherlands. We refer to the statistical units as establishments. The register is updated monthly for
births and deaths, while once a year, on December 31, the size category and the type of economic
activity (SIC code) is updated. Every first day of the month, an SRS sample is selected from the GBR
to estimate the turnover of the current month. In fact, a rotating sample is used. The sample is
stratified by size and by type of economic activity. The probability of selection depends on size and
economic activity. The probability of selection decreases with the size of establishment, with the
largest establishments being included in the sample with probability 1. For some SICs not only survey
data but also data from administrative sources are available. Those data from administrative sources are
all considered as a separate stratum, with variance 0. The sample is updated in two ways: each month
the sample is updated to correct for births and deaths in the population and in January 10 per cent of
the sample units are replaced.
Monthly update. The sample size $n_h^t$ per stratum h for each month t is a fixed proportion $f_h$ of the
population size $N_h^t$ per stratum h:
$$ n_h^t = f_h N_h^t . $$
A reduction in the sample size (deaths) or a net change in $N_h^t$ leads to an adjustment of the sample
size. When necessary, additional units are sampled from the new population units (births), but when
the number of new population units is too small, new sample units are sampled randomly from already
existing establishments. Likewise, when $N_h^t$ is decreasing, units are removed randomly from the
sample.
Yearly update. The sample is updated yearly in two steps. In the first step, all population and sample
units of December are classified according to the new stratification (new size and SIC-code of
January). The resulting sample from a new stratum after correction consists of elements with different
inclusion probabilities since the samples from the original strata may have different sampling
fractions. To correct for different inclusion probabilities, new units are selected randomly for those
units that move to a stratum with greater inclusion probabilities and units are removed randomly in the
opposite situation. These yearly corrections are called the shunting procedure. Next, the size is updated
for births and deaths in the population of January. The resulting sample can be considered as an SRS
sample from the population of January. In the second step, 10 per cent of the sample units are replaced
by random selection within each substratum (see section 3.2).
2.2. Two estimators for the yearly change
Define
$O^t$: level of turnover in month t of all units in the population;
$g^{t,s}$: relative change in the level of turnover between months t and s, thus
$$ g^{t,s} = \frac{O^t}{O^s} - 1, \qquad t > s. $$
For the corresponding estimates it holds by definition that
$$ \hat{g}^{t,t-1} = \frac{\hat{O}^t}{\hat{O}^{t-1}} - 1, \tag{2.1} $$
where a “hat” denotes an estimate. The yearly change in the level of the monthly turnover is currently
estimated on the basis of a chain of 12 monthly changes in turnover
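$$
\begin{aligned}
\hat{G}_{publ}^{t,t-12} &= 1 + \hat{g}_{publ}^{t,t-12} = \prod_{j=0}^{11} \left(1 + \hat{g}^{t-j,t-j-1}\right) \\
&= \frac{\hat{O}^{t}}{\hat{O}^{t-1}} \times \frac{\hat{O}^{t-1}}{\hat{O}^{t-2}} \times \dots \times \frac{\hat{O}^{feb}}{\hat{O}^{jan}_{new}} \times \frac{\hat{O}^{jan}_{old}}{\hat{O}^{dec}} \times \dots \times \frac{\hat{O}^{t-11}}{\hat{O}^{t-12}}
= \frac{\hat{O}^{t}}{\hat{O}^{t-12}} \, \frac{\hat{O}^{jan}_{old}}{\hat{O}^{jan}_{new}}, \qquad (t \neq jan).
\end{aligned}
\tag{2.2}
$$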
In January the level of turnover is estimated twice. The first estimate is made before the yearly sample
update, denoted by $\hat{O}^{jan}_{old}$, and is used to estimate the monthly change of the turnover in January
compared to that in December. The second estimate is made after the yearly sample update, denoted by
$\hat{O}^{jan}_{new}$, and is used to estimate the monthly change of the turnover in February compared to January.
An alternative estimator for the yearly change in the level of the monthly turnover is the ratio of
the monthly turnover of month t to that of month t-12. That is,
$$ \hat{G}_{dir}^{t,t-12} = \frac{\hat{O}^{t}}{\hat{O}^{t-12}} = 1 + \hat{g}_{dir}^{t,t-12}. \tag{2.3} $$
In section 4 we will compare the variance of the two estimators.
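The relation between the chained estimator (2.2) and the direct estimator (2.3) can be checked numerically. The sketch below is only an illustration: the function names, the list layout and the turnover levels are hypothetical and not taken from the survey. It multiplies the 12 monthly factors, using the old January estimate in the January/December factor and the new one in the February/January factor, and compares the result with the direct ratio times the January ratio.

```python
from math import isclose, prod


def chained_yearly_factor(levels, jan_pos, jan_old, jan_new):
    """Product of the 12 monthly factors, following the structure of (2.2).

    levels  -- 13 estimated levels, oldest first: month t-12, ..., month t
    jan_pos -- index in `levels` of the January inside the window (1..11);
               levels[jan_pos] itself is never read (hypothetical layout)
    jan_old -- January level estimated before the yearly sample update
    jan_new -- January level estimated after the yearly sample update
    """
    if len(levels) != 13 or not 1 <= jan_pos <= 11:
        raise ValueError("need 13 levels and a January strictly inside the window")
    factors = []
    for i in range(1, 13):
        # January/December uses the old estimate, February/January the new one.
        numerator = jan_old if i == jan_pos else levels[i]
        denominator = jan_new if i - 1 == jan_pos else levels[i - 1]
        factors.append(numerator / denominator)
    return prod(factors)


def direct_yearly_factor(levels):
    """Direct ratio of month t to month t-12, as in (2.3)."""
    return levels[-1] / levels[0]


# Made-up levels; position 4 is January and is replaced by the two estimates.
levels = [100.0, 96.0, 104.0, 110.0, None, 98.0, 101.0,
          105.0, 103.0, 107.0, 99.0, 102.0, 108.0]
jan_old, jan_new = 95.0, 97.0

chained = chained_yearly_factor(levels, 4, jan_old, jan_new)
direct = direct_yearly_factor(levels)
print(chained, direct * jan_old / jan_new)
```

All intermediate levels cancel, so the two printed numbers agree up to rounding; the two estimators differ only through the ratio of the two January estimates.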
3. Variance of the yearly growth rate of a monthly turnover
3.1. Introduction
For the published growth rate over 12 months ($\hat{g}_{publ}^{t,t-12}$) of the monthly turnover we have according to
(2.2)
$$ 1 + \hat{g}_{publ}^{t,t-12} = \frac{\hat{O}^{t}_{publ}}{\hat{O}^{t-12}_{publ}} = \prod_{j=0}^{11} \left(1 + \hat{g}^{t-j,t-j-1}\right) = \frac{\hat{O}^{t}}{\hat{O}^{t-12}} \, \frac{\hat{O}^{jan}_{old}}{\hat{O}^{jan}_{new}}. \tag{3.1} $$
In other words, the January effect due to the double observation in that month is equal to
$$ \frac{\hat{O}^{t} / \hat{O}^{t-12}}{\hat{O}^{t}_{publ} / \hat{O}^{t-12}_{publ}} = \frac{\hat{O}^{jan}_{new}}{\hat{O}^{jan}_{old}}. $$
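In the remainder we denote $O^{jan}_{old}$ and $O^{jan}_{new}$ briefly by $O^{janO}$ and $O^{janN}$. Using (3.1), we get for the
variance of the cumulated growth rate over 12 months
$$
\begin{aligned}
\operatorname{var}(\hat{g}_{publ}^{t,t-12}) &= \operatorname{var}(1 + \hat{g}_{publ}^{t,t-12}) = \operatorname{var}\!\left(\frac{\hat{O}^{t}_{publ}}{\hat{O}^{t-12}_{publ}}\right) = \operatorname{var}\!\left(\frac{\hat{O}^{t}}{\hat{O}^{t-12}} \, \frac{\hat{O}^{janO}}{\hat{O}^{janN}}\right) \\
&\approx \operatorname{var}\!\left(\frac{\hat{O}^{t}}{\hat{O}^{t-12}}\right) + \left(\frac{O^{t}}{O^{t-12}}\right)^{2} \operatorname{var}\!\left(\frac{\hat{O}^{janO}}{\hat{O}^{janN}}\right) + 2\,\frac{O^{t}}{O^{t-12}} \operatorname{cov}\!\left(\frac{\hat{O}^{t}}{\hat{O}^{t-12}}, \frac{\hat{O}^{janO}}{\hat{O}^{janN}}\right).
\end{aligned}
\tag{3.2}
$$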
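The last approximation is based on a first-order Taylor-series expansion with respect to $\hat{O}^{t}/\hat{O}^{t-12}$ and
$\hat{O}^{janO}/\hat{O}^{janN}$; the coefficients simplify because both January estimators refer to the same population
total, so that $O^{janO}/O^{janN} = 1$. Step by step we now explain how $\operatorname{var}(\hat{g}_{publ}^{t,t-12})$ can be estimated. The first
term at the right-hand side of (3.2) can be approximated by
$$
\begin{aligned}
\operatorname{var}\!\left(\frac{\hat{O}^{t}}{\hat{O}^{t-12}}\right) &\approx \frac{1}{(O^{t-12})^{2}} \operatorname{var}\!\left(\hat{O}^{t} - G^{t,t-12}\hat{O}^{t-12}\right) \qquad \left(G^{t,t-12} = \frac{O^{t}}{O^{t-12}} = 1 + g^{t,t-12}\right) \\
&= \frac{1}{(O^{t-12})^{2}} \left\{ \operatorname{var}(\hat{O}^{t}) + (G^{t,t-12})^{2} \operatorname{var}(\hat{O}^{t-12}) - 2 G^{t,t-12} \operatorname{cov}(\hat{O}^{t-12}, \hat{O}^{t}) \right\}.
\end{aligned}
\tag{3.3}
$$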
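As a minimal sketch of how (3.3) is evaluated once the variances and the covariance of the two estimated totals are available, the function below plugs them into the linearised formula. The function name and the input values are illustrative assumptions; in practice the population quantities are replaced by their estimates.

```python
from math import isclose


def ratio_variance(o_t, o_t12, var_t, var_t12, cov_t12_t):
    """Linearised variance of O^t / O^(t-12), following (3.3).

    o_t, o_t12 -- levels in month t and month t-12 (estimates in practice)
    var_t      -- variance of the estimated level in month t
    var_t12    -- variance of the estimated level in month t-12
    cov_t12_t  -- covariance between the two estimated levels
    """
    g = o_t / o_t12  # G^{t,t-12} = 1 + g^{t,t-12}
    return (var_t + g ** 2 * var_t12 - 2.0 * g * cov_t12_t) / o_t12 ** 2


# Made-up inputs: a positive covariance between the two totals lowers the variance.
independent = ratio_variance(110.0, 100.0, 4.0, 4.0, 0.0)
overlapping = ratio_variance(110.0, 100.0, 4.0, 4.0, 2.0)
print(independent, overlapping)
```

The comparison of the two calls shows why the covariance term matters: the larger the positive covariance induced by the overlapping samples, the smaller the variance of the estimated growth rate.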
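The major problem is the estimation of $\operatorname{cov}(\hat{O}^{t-12}, \hat{O}^{t})$. The next section deals with this problem. The
second variance term in (3.2) can be written in a similar form. The last term in (3.2) can be
approximated by
$$
\begin{aligned}
\operatorname{cov}\!\left(\frac{\hat{O}^{t}}{\hat{O}^{t-12}}, \frac{\hat{O}^{janO}}{\hat{O}^{janN}}\right) &\approx \frac{1}{O^{t-12} O^{jan}} \operatorname{cov}\!\left(\hat{O}^{t} - G^{t,t-12}\hat{O}^{t-12},\ \hat{O}^{janO} - \hat{O}^{janN}\right) \\
&= \frac{1}{O^{t-12} O^{jan}} \left\{ \operatorname{cov}(\hat{O}^{t}, \hat{O}^{janO}) - \operatorname{cov}(\hat{O}^{t}, \hat{O}^{janN}) \right. \\
&\qquad \left. -\, G^{t,t-12} \operatorname{cov}(\hat{O}^{t-12}, \hat{O}^{janO}) + G^{t,t-12} \operatorname{cov}(\hat{O}^{t-12}, \hat{O}^{janN}) \right\}.
\end{aligned}
\tag{3.4}
$$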
These four covariances are discussed in section 3.4.
3.2. A static population with dynamic strata
This subsection deals with the question of how to estimate $\operatorname{cov}(\hat{O}^{t-12}, \hat{O}^{t})$. First we look at the case
with a standard refreshment of the panel in January and no births and deaths. In addition, stratum
corrections may have taken place. That is, establishments have been moved to the correct stratum
according to their actual number of employees. We can write the covariance term $\operatorname{cov}(\hat{O}^{t-12}, \hat{O}^{t})$ from
(3.3) as
$$
\operatorname{cov}(\hat{O}^{t-12}, \hat{O}^{t}) = \operatorname{cov}\!\left( \sum_{h=1}^{H} N_h^{t-12} \bar{o}_h^{t-12},\ \sum_{\ell=1}^{H} N_\ell^{t} \bar{o}_\ell^{t} \right) = \sum_{h=1}^{H} \sum_{\ell=1}^{H} N_h^{t-12} N_\ell^{t} \operatorname{cov}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t}),
\tag{3.5}
$$
where
$\bar{o}_h^{t}$: the sample mean of the returns in stratum h in month t;
$N_h^{t}$: size of stratum h in month t.
Note that the stratification of the units in month $t-12$ may differ from the one in month t. To take this
into account, define
$N_{h\ell}^{t-12,t}$: number of units in the population that in month $t-12$ belonged to stratum h and in month
t to stratum $\ell$. Denote the corresponding substratum by $U_{h\ell}^{t-12,t}$;
$n_{h\ell 2}^{t-12,t}$: number of units in the sample that in month $t-12$ belonged to stratum h and in month t
to stratum $\ell$. The corresponding sample is called $s_{h\ell 2}^{t-12,t}$;
$\bar{o}_{h\ell 2}^{s}$: the sample mean of the returns in $s_{h\ell 2}^{t-12,t}$ in month s ($s = t-12,\ t$).
Furthermore, there are the complicating factors of the refreshment and the consequences of the
shunting procedure in January. For the impact on the covariances it is important to know that there are
two observations in January. The first observation is based on the previous stratification.
Subsequently, the second observation occurs after the stratum corrections, the shunting procedure and
the refreshment of the panel have been carried out. As discussed in section 2, the shunting procedure
can be necessary after the stratum corrections because, for instance, existing establishments move into
stratum $\ell$ from another stratum with a different inclusion probability. When the arrivals in stratum $\ell$
come from another stratum h with a smaller inclusion probability, additional drawings from substratum
$U_{h\ell}^{t-12,t}$ are required. When the inclusion probabilities of the arrivals are larger, some arrivals should be
removed randomly in order to get equal inclusion probabilities for all units in the new stratum. Next,
define
$n_{h\ell 3}^{t-12,t}$: number of units in the sample from $U_{h\ell}^{t-12,t}$ that are removed from the sample in January
due to the shunting and refreshing procedures;
$\bar{o}_{h\ell 3}^{t-12}$: the sample mean of returns in month $t-12$ of those establishments from $U_{h\ell}^{t-12,t}$ which
were in the sample up to (and including) December 31 and removed from the sample
after January 1 due to the shunting procedure or refreshment procedure;
$n_{h\ell 1}^{t-12,t}$: number of units from $U_{h\ell}^{t-12,t}$ that are drawn in the sample in January after shunting and
refreshing the sample;
$\bar{o}_{h\ell 1}^{t}$: the sample mean of the returns in month t of those establishments in $U_{h\ell}^{t-12,t}$ that are
selected in January in the sample due to shunting or refreshing.
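The refreshment in January means that in every stratum 10 per cent of the units are replaced by other
units from the corresponding strata $U_{h\ell}^{t-12,t}$. Furthermore, define
$$ n_{h\ell 23}^{t-12} = n_{h\ell 2}^{t-12,t} + n_{h\ell 3}^{t-12,t} \quad \text{and} \quad \bar{o}_{h\ell 23}^{t-12} = \left( n_{h\ell 2}^{t-12,t}\, \bar{o}_{h\ell 2}^{t-12} + n_{h\ell 3}^{t-12,t}\, \bar{o}_{h\ell 3}^{t-12} \right) / n_{h\ell 23}^{t-12}, $$
$$ n_{h\ell 12}^{t} = n_{h\ell 1}^{t-12,t} + n_{h\ell 2}^{t-12,t} \quad \text{and} \quad \bar{o}_{h\ell 12}^{t} = \left( n_{h\ell 1}^{t-12,t}\, \bar{o}_{h\ell 1}^{t} + n_{h\ell 2}^{t-12,t}\, \bar{o}_{h\ell 2}^{t} \right) / n_{h\ell 12}^{t}. $$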
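Because we can write $\bar{o}_h^{t-12}$ and $\bar{o}_\ell^{t}$ as
$$
\bar{o}_h^{t-12} = \sum_{g=1}^{H} \frac{n_{hg23}^{t-12}}{n_h^{t-12}}\, \bar{o}_{hg23}^{t-12}, \qquad
\bar{o}_\ell^{t} = \sum_{k=1}^{H} \frac{n_{k\ell 12}^{t}}{n_\ell^{t}}\, \bar{o}_{k\ell 12}^{t},
\tag{3.6}
$$
we get for the covariances in (3.5)
$$
\begin{aligned}
\operatorname{cov}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t}) &= \operatorname{cov}\!\left( \sum_{g=1}^{H} \frac{n_{hg23}^{t-12}}{n_h^{t-12}}\, \bar{o}_{hg23}^{t-12},\ \sum_{k=1}^{H} \frac{n_{k\ell 12}^{t}}{n_\ell^{t}}\, \bar{o}_{k\ell 12}^{t} \right) \\
&= \frac{1}{n_h^{t-12} n_\ell^{t}} \operatorname{cov}\!\left( n_{h\ell 23}^{t-12}\, \bar{o}_{h\ell 23}^{t-12},\ n_{h\ell 12}^{t}\, \bar{o}_{h\ell 12}^{t} \right).
\end{aligned}
\tag{3.7}
$$
In the second line we used that $\operatorname{cov}(n_{hg23}^{t-12} \bar{o}_{hg23}^{t-12}, \bar{o}_{k\ell 12}^{t}) = 0$ for $k \neq h$ and that
$\operatorname{cov}(n_{hg23}^{t-12} \bar{o}_{hg23}^{t-12}, \bar{o}_{h\ell 12}^{t}) = 0$ for $g \neq \ell$ because, conditional on $n_{hg23}^{t-12}$, the means $\bar{o}_{hg23}^{t-12}$ and $\bar{o}_{h\ell 12}^{t}$
have zero covariance. Note that $n_{k\ell 12}^{t}$ is fixed because, by construction,
$$ n_{k\ell 12}^{t} = \frac{n_\ell^{t-12}}{N_\ell^{t-12}}\, N_{k\ell}^{t-12,t} \qquad (k = 1, \ldots, H). $$
Furthermore, $n_{h\ell 23}^{t-12}$ follows a hypergeometric distribution with parameters ($n_h^{t-12}$, $N_{h\ell}^{t-12,t}$, $N_h^{t-12}$).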
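Making use of the formula for conditional covariances, we can derive an expression for the covariance
in (3.7)
$$
\begin{aligned}
\operatorname{cov}\!\left( n_{h\ell 23}^{t-12} \bar{o}_{h\ell 23}^{t-12},\ n_{h\ell 12}^{t} \bar{o}_{h\ell 12}^{t} \right)
&= E\left\{ \operatorname{cov}\!\left( n_{h\ell 23}^{t-12} \bar{o}_{h\ell 23}^{t-12},\ n_{h\ell 12}^{t} \bar{o}_{h\ell 12}^{t} \mid n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right) \right\} \\
&\quad + \operatorname{cov}\!\left\{ E\!\left( n_{h\ell 23}^{t-12} \bar{o}_{h\ell 23}^{t-12} \mid n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right),\ E\!\left( n_{h\ell 12}^{t} \bar{o}_{h\ell 12}^{t} \mid n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right) \right\}.
\end{aligned}
\tag{3.8}
$$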
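For the first term at the right-hand side we have
$$
\begin{aligned}
& E\left\{ \operatorname{cov}\!\left( n_{h\ell 23}^{t-12} \bar{o}_{h\ell 23}^{t-12},\ n_{h\ell 12}^{t} \bar{o}_{h\ell 12}^{t} \mid n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right) \right\} \\
&\quad = E\left\{ n_{h\ell 23}^{t-12}\, n_{h\ell 12}^{t} \operatorname{cov}\!\left( \bar{o}_{h\ell 23}^{t-12}, \bar{o}_{h\ell 12}^{t} \mid n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right) \right\} \\
&\quad = E\left\{ n_{h\ell 23}^{t-12}\, n_{h\ell 12}^{t} \left( \frac{n_{h\ell 2}^{t-12,t} / n_{h\ell 23}^{t-12}}{n_{h\ell 12}^{t}} - \frac{1}{N_{h\ell}^{t-12,t}} \right) S_{h\ell}^{t-12,t} \right\} \\
&\quad = S_{h\ell}^{t-12,t} \left\{ E(n_{h\ell 2}^{t-12,t}) - \frac{n_{h\ell 12}^{t}\, E(n_{h\ell 23}^{t-12})}{N_{h\ell}^{t-12,t}} \right\},
\end{aligned}
\tag{3.9}
$$
where we used in the third line (A.1) in Appendix A. The second term at the right-hand side of (3.8)
equals
$$
\begin{aligned}
& \operatorname{cov}\!\left\{ E\!\left( n_{h\ell 23}^{t-12} \bar{o}_{h\ell 23}^{t-12} \mid n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right),\ E\!\left( n_{h\ell 12}^{t} \bar{o}_{h\ell 12}^{t} \mid n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right) \right\} \\
&\quad = \operatorname{cov}\!\left( n_{h\ell 23}^{t-12} \bar{O}_{h\ell}^{t-12},\ n_{h\ell 12}^{t} \bar{O}_{h\ell}^{t} \right) = \bar{O}_{h\ell}^{t-12} \bar{O}_{h\ell}^{t} \operatorname{cov}\!\left( n_{h\ell 23}^{t-12}, n_{h\ell 12}^{t} \right) = 0 .
\end{aligned}
$$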
The last covariance is zero because $n_{h\ell 12}^{t}$ is fixed. Therefore, the unconditional covariance is equal to
(3.9). However, similar to poststratification we consider $n_{h\ell 2}^{t-12,t}$ and $n_{h\ell 23}^{t-12}$ as fixed. For the conditional
covariance we get according to (3.7) and (3.9) the simple expression
$$
\operatorname{cov}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t}) = \frac{n_{h\ell 23}^{t-12}}{n_h^{t-12} n_\ell^{t}} \left( \frac{n_{h\ell 2}^{t-12,t}}{n_{h\ell 23}^{t-12}} - \frac{n_{h\ell 12}^{t}}{N_{h\ell}^{t-12,t}} \right) S_{h\ell}^{t-12,t}.
\tag{3.10}
$$
For a justification of the use of a conditional (co)variance, see Holt and Smith (1979) and Knottnerus
(2003, pp. 133-6). For expressions for $E(n_{h\ell 2}^{t-12,t})$ and $E(n_{h\ell 23}^{t-12})$, see Knottnerus and Van Delden (2005).
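Expression (3.10) can be estimated by
$$
\widehat{\operatorname{cov}}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t}) = \frac{n_{h\ell 23}^{t-12}}{n_h^{t-12} n_\ell^{t}} \left( \frac{n_{h\ell 2}^{t-12,t}}{n_{h\ell 23}^{t-12}} - \frac{n_{h\ell 12}^{t}}{N_{h\ell}^{t-12,t}} \right) s_{h\ell 2}^{t-12,t},
$$
$$
s_{h\ell 2}^{t-12,t} = \frac{1}{n_{h\ell 2}^{t-12,t} - 1} \sum_{i=1}^{n_{h\ell 2}^{t-12,t}} \left( o_{h\ell 2i}^{t-12} - \bar{o}_{h\ell 2}^{t-12} \right) \left( o_{h\ell 2i}^{t} - \bar{o}_{h\ell 2}^{t} \right).
\tag{3.11}
$$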
Note that (3.11) is also unbiased for estimating the unconditional covariance (3.9) because
$$ E\!\left( s_{h\ell 2}^{t-12,t} \mid n_{h\ell 23}^{t-12}, n_{h\ell 2}^{t-12,t} \right) = S_{h\ell}^{t-12,t}. $$
A refinement of (3.11) is based on more observations so that a possible substantial underestimation of
$\operatorname{var}(\hat{g}^{t,t-12})$ can be avoided. It works as follows. Calculate the correlation coefficient
$$
\hat{\rho}_{h\ell 2}^{t-12,t} = \frac{s_{h\ell 2}^{t-12,t}}{s_{h\ell 2}^{t-12}\, s_{h\ell 2}^{t}}, \qquad
s_{h\ell 2}^{t-s} = \sqrt{ \frac{1}{n_{h\ell 2}^{t-12,t} - 1} \sum_{i=1}^{n_{h\ell 2}^{t-12,t}} \left( o_{h\ell 2i}^{t-s} - \bar{o}_{h\ell 2}^{t-s} \right)^{2} } \qquad (s = 0,\ 12).
$$
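An alternative estimator for $S_{h\ell}^{t-12,t}$ now becomes
$$
\hat{S}_{h\ell}^{t-12,t} = s_{h\ell 123}^{t-12,t} = \hat{\rho}_{h\ell 2}^{t-12,t}\, s_{h\ell 23}^{t-12}\, s_{h\ell 12}^{t},
\tag{3.12}
$$
$$
s_{h\ell 23}^{t-12} = \sqrt{ \frac{1}{n_{h\ell 23}^{t-12} - 1} \sum_{i=1}^{n_{h\ell 23}^{t-12}} \left( o_{h\ell 23i}^{t-12} - \bar{o}_{h\ell 23}^{t-12} \right)^{2} }, \qquad
s_{h\ell 12}^{t} = \sqrt{ \frac{1}{n_{h\ell 12}^{t} - 1} \sum_{i=1}^{n_{h\ell 12}^{t}} \left( o_{h\ell 12i}^{t} - \bar{o}_{h\ell 12}^{t} \right)^{2} }.
$$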
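The refinement (3.12) can be sketched as follows: the correlation is taken from the matched units only, while the two standard deviations use all units observed in the respective month. The function names and the example data are illustrative assumptions, not part of the survey system.

```python
from math import isclose, sqrt


def sample_covariance(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long samples with at least 2 units")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_sd(x):
    """Sample standard deviation with divisor n - 1."""
    return sqrt(sample_covariance(x, x))


def refined_covariance(prev_matched, curr_matched, prev_all, curr_all):
    """Estimate of the population covariance in the spirit of (3.12).

    prev_matched, curr_matched -- returns of the matched units (both months)
    prev_all -- all returns observed in month t-12 (matched plus removed units)
    curr_all -- all returns observed in month t (matched plus new units)
    """
    rho = sample_covariance(prev_matched, curr_matched) / (
        sample_sd(prev_matched) * sample_sd(curr_matched))
    return rho * sample_sd(prev_all) * sample_sd(curr_all)


# Made-up data: three matched units plus one removed and one new unit.
refined = refined_covariance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0],
                             [1.0, 2.0, 3.0, 6.0], [2.0, 4.0, 6.0, 0.0])
print(refined)
```

In this toy example the matched units are perfectly correlated, so the result equals the product of the two standard deviations computed from the larger samples.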
3.3. Dynamic populations (births and deaths)
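In order to allow for deaths and births we extend (3.6) in a natural way as follows
$$
\bar{o}_h^{t-12} = \sum_{g=1}^{H} \frac{n_{hg23}^{t-12}}{n_h^{t-12}}\, \bar{o}_{hg23}^{t-12} + \frac{n_{h,out}^{t-12,t}}{n_h^{t-12}}\, \bar{o}_{h,out}^{t-12}, \qquad
\bar{o}_\ell^{t} = \sum_{k=1}^{H} \frac{n_{k\ell 12}^{t}}{n_\ell^{t}}\, \bar{o}_{k\ell 12}^{t} + \frac{n_{\ell,in}^{t-12,t}}{n_\ell^{t}}\, \bar{o}_{\ell,in}^{t},
\tag{3.13}
$$
respectively, where
$$ n_{\ell,in}^{t-12,t} = \frac{n_\ell^{t-12}}{N_\ell^{t-12}}\, N_{\ell,in}^{t-12,t}, \qquad
n_{h,out}^{t-12,t} \sim \text{Hypergeometric}\left( n_h^{t-12},\ N_{h,out}^{t-12,t},\ N_h^{t-12} \right). $$
Because samples from the substrata for births (say $U_{\ell,in}^{t-12,t}$) and deaths (say $U_{h,out}^{t-12,t}$) can be seen as
independent of other samples, they are irrelevant for a further analysis of the covariance. Similar to
(3.7) we can write $\operatorname{cov}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t})$ as
$$
\operatorname{cov}(\bar{o}_h^{t-12}, \bar{o}_\ell^{t}) = \frac{1}{n_h^{t-12} n_\ell^{t}} \operatorname{cov}\!\left( n_{h\ell 23}^{t-12}\, \bar{o}_{h\ell 23}^{t-12},\ n_{h\ell 12}^{t}\, \bar{o}_{h\ell 12}^{t} \right).
\tag{3.14}
$$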
The rest of the procedure is the same as described in section 3.2; cf. (3.11).
3.4. The remaining covariances
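In this section we just mention the expression for the estimator of the second covariance term at the
right-hand side of (3.4). Taking births and deaths into account, we can estimate this term similarly to
(3.7) or (3.14) by
$$ \widehat{\operatorname{cov}}(\hat{O}^{t}, \hat{O}^{janN}) = \sum_{h=1}^{H} N_h^{janN} N_h^{t}\, \widehat{\operatorname{cov}}(\bar{o}_h^{janN}, \bar{o}_h^{t}), $$
$$ \widehat{\operatorname{cov}}(\bar{o}_h^{janN}, \bar{o}_h^{t}) = \frac{1}{n_h^{janN} n_h^{t}}\, \widehat{\operatorname{cov}}\!\left\{ n_{hh23}^{janN,t}\, \bar{o}_{hh23}^{janN},\ n_{hh12}^{janN,t}\, \bar{o}_{hh12}^{t} \right\}, $$
$$ n_{hh23}^{janN,t} = n_{hh2}^{janN,t} + n_{hh3}^{janN,t}, \qquad n_{hh12}^{janN,t} = n_{hh1}^{janN,t} + n_{hh2}^{janN,t}, $$
$$ \widehat{\operatorname{cov}}\!\left\{ n_{hh23}^{janN,t}\, \bar{o}_{hh23}^{janN},\ n_{hh12}^{janN,t}\, \bar{o}_{hh12}^{t} \right\} = n_{hh23}^{janN,t} \left( \frac{n_{hh2}^{janN,t}}{n_{hh23}^{janN,t}} - \frac{n_{hh12}^{janN,t}}{N_{hh}^{janN,t}} \right) s_{hh123}^{janN,t}. $$
The quantity $s_{hh123}^{janN,t}$ in the last line is defined similarly to (3.12) by
$$ s_{hh123}^{janN,t} = \hat{\rho}_{hh2}^{janN,t}\, s_{hh23}^{jan}\, s_{hh12}^{t} \qquad \left( \hat{\rho}_{hh2}^{janN,t} = \frac{s_{hh2}^{janN,t}}{s_{hh2}^{jan}\, s_{hh2}^{t}} \right). $$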
The other three covariances in (3.4) can be estimated in a similar way.
4. Variance of the change in turnover of Dutch supermarkets
4.1. Introduction
The calculations for the variances and confidence intervals are based on turnover data of Dutch
supermarkets of 4-week periods in 2003 and 2004. The population consists of about 3500 statistical
units. The turnover data are partly based on administrative files and partly on a stratified sample. The
administrative files contain about 950 units. All data from units within the administrative files are put
in a separate stratum, with weight 1. A gross sample of about 900 units is drawn from the full list of
population units within the GBR, thus also from units that are already present in the administrative
files. Consequently, 500 of the 900 sampled units are already present in the administrative files, but
those units do not receive a questionnaire. Thus, the net sample contains about 400 units. The sample
is stratified by size. All establishments with 50 or more employees are included with probability 1. The
other establishments were sampled with decreasing inclusion probability from 1:2 (20–49 employees),
to 1: 40 in the smallest size (1 employee).
4.2. Results
Table 1 shows that the estimated 95%-margins for the yearly growth rates of the currently used
estimator, $\hat{g}_{publ}^{t,t-12}$ from (2.2), varied between 0,6 and 0,9 (%-point). For example, in the first period
(weeks 9–12), the 95%-confidence interval for the yearly growth rate was -1,2 to 0,6 per cent.
Surprisingly, the 95%-margins for the alternative estimator, $\hat{g}_{dir}^{t,t-12}$ from (2.3), which is a function of
two instead of four estimated totals, were slightly larger. The estimated 95%-margins of $\hat{g}_{dir}^{t,t-12}$ varied
between 0,7 and 1,0 (%-point).
Table 2 shows why the 95%-margins of the currently used estimator $\hat{g}_{publ}^{t,t-12}$ are smaller than those of the
alternative estimator $\hat{g}_{dir}^{t,t-12}$. The covariance term $\operatorname{cov}(\hat{O}^{t}/\hat{O}^{t-12}, \hat{O}^{janO}/\hat{O}^{janN})$ is negative and its
absolute value is more than half of the value of $\operatorname{var}(\hat{O}^{janO}/\hat{O}^{janN})$, resulting in
$\operatorname{var}(\hat{g}_{publ}^{t,t-12}) < \operatorname{var}(\hat{g}_{dir}^{t,t-12})$. This covariance is negative because both $\hat{O}^{t-12}$ and $\hat{O}^{janO}$ are based on the
units before the yearly sample update, whereas $\hat{O}^{t}$ and $\hat{O}^{janN}$ are based on the units after the yearly
sample update. Thus the ratio $\hat{O}^{janO}/\hat{O}^{janN}$ included in the currently used estimator $\hat{g}_{publ}^{t,t-12}$ reduced the
variance compared to the alternative estimator $\hat{g}_{dir}^{t,t-12}$. Because the estimated bias of both $\hat{g}_{publ}^{t,t-12}$ and
$\hat{g}_{dir}^{t,t-12}$ is much smaller than its estimated variance (not shown), the accuracy (in terms of mean squared
error) of the estimator $\hat{g}_{publ}^{t,t-12}$ is also higher than that of $\hat{g}_{dir}^{t,t-12}$.
Table 1. Estimated growth rates and 95%-margins for two estimators of the yearly growth rates (2004
compared to 2003) of the turnover of 4-week periods of Dutch supermarkets.

Period            Growth rate (%)                                   95%-margin (%-point)
(week numbers)    $\hat{g}_{publ}^{t,t-12}$    $\hat{g}_{dir}^{t,t-12}$    $\hat{g}_{publ}^{t,t-12}$    $\hat{g}_{dir}^{t,t-12}$
09–12             -0,3          -0,4          0,9           1,0
13–16             -3,7          -3,8          0,9           0,9
17–20              1,6           1,5          0,9           0,9
21–24             -2,2          -2,3          0,8           0,9
25–28              0,5           0,4          0,6           0,7
29–32             -1,7          -1,8          0,6           0,7
33–36             -2,2          -2,3          0,7           0,7
37–40              0,0          -0,1          0,6           0,7
41–44             -2,3          -2,4          0,8           0,9
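The confidence intervals quoted in the text follow directly from the growth rates and margins in Table 1. The sketch below recomputes them for the published estimator; the variable names are hypothetical, decimal commas are written as decimal points, and treating a change as significant when its margin excludes zero is our reading of the significance criterion, not a rule stated in the table.

```python
# Values transcribed from Table 1 (published estimator), in per cent.
periods = ["09-12", "13-16", "17-20", "21-24", "25-28",
           "29-32", "33-36", "37-40", "41-44"]
growth_publ = [-0.3, -3.7, 1.6, -2.2, 0.5, -1.7, -2.2, 0.0, -2.3]
margin_publ = [0.9, 0.9, 0.9, 0.8, 0.6, 0.6, 0.7, 0.6, 0.8]


def interval(growth, margin):
    """95%-confidence interval as growth rate plus or minus the margin."""
    return (round(growth - margin, 1), round(growth + margin, 1))


def excludes_zero(growth, margin):
    """Assumed criterion: the interval does not contain zero."""
    return abs(growth) > margin


intervals = [interval(g, m) for g, m in zip(growth_publ, margin_publ)]
significant = [p for p, g, m in zip(periods, growth_publ, margin_publ)
               if excludes_zero(g, m)]

for period, (low, high) in zip(periods, intervals):
    print(period, low, high)
print("margin excludes zero in:", significant)
```

For weeks 9 to 12 this reproduces the interval from -1.2 to 0.6 per cent mentioned in section 4.2.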
Table 2. Estimated variance components.

Period            $\operatorname{var}(\hat{X})$    $\operatorname{var}(\hat{Y})$    $\operatorname{cov}(\hat{X},\hat{Y})$    $\operatorname{var}(\hat{Z})$    $\operatorname{cor}(\hat{O}^{t-12},\hat{O}^{t})$
(week numbers)    ($10^{-6}$)    ($10^{-6}$)    ($10^{-6}$)    ($10^{-6}$)
09–12             24,2        3,5         -2,6        22,5        0,18
13–16             22,2        3,5         -3,2        19,3        0,17
17–20             21,9        3,5         -3,2        18,9        0,19
21–24             19,4        3,5         -2,8        17,2        0,20
25–28             12,7        3,5         -3,2         9,8        0,37
29–32             12,3        3,5         -3,0         9,7        0,35
33–36             13,6        3,5         -2,4        12,3        0,32
37–40             14,2        3,5         -3,4        10,8        0,42
41–44             19,1        3,5         -3,8        14,9        0,29

$\hat{X} = \hat{O}^{t}/\hat{O}^{t-12}$; $\hat{Y} = \hat{O}^{janO}/\hat{O}^{janN}$; $\hat{Z} = \hat{X} \times \hat{Y}$
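The columns of Table 2 can be cross-checked against each other and against Table 1. The sketch below is a consistency check under two assumptions of ours: the growth factor $O^{t}/O^{t-12}$ in (3.2) is set to 1 (the growth rates are only a few per cent), and the 95%-margin is taken as 1.96 times the standard error. Values are transcribed from the tables with decimal points instead of commas.

```python
from math import sqrt

# Table 2, all variances and covariances in units of 1e-6.
var_x = [24.2, 22.2, 21.9, 19.4, 12.7, 12.3, 13.6, 14.2, 19.1]
var_y = [3.5] * 9
cov_xy = [-2.6, -3.2, -3.2, -2.8, -3.2, -3.0, -2.4, -3.4, -3.8]
var_z = [22.5, 19.3, 18.9, 17.2, 9.8, 9.7, 12.3, 10.8, 14.9]

# Approximation (3.2) with the growth factor set to 1 (our simplification).
var_z_approx = [vx + vy + 2.0 * c for vx, vy, c in zip(var_x, var_y, cov_xy)]
max_gap = max(abs(a - z) for a, z in zip(var_z_approx, var_z))


def margin_pct(variance_e6, z=1.96):
    """95%-margin in %-points, assuming a normal quantile of 1.96."""
    return round(100.0 * z * sqrt(variance_e6 * 1e-6), 1)


margins_publ = [margin_pct(v) for v in var_z]  # published estimator
margins_dir = [margin_pct(v) for v in var_x]   # direct estimator

print(max_gap)
print(margins_publ)
print(margins_dir)
```

Under these assumptions the recomputed variance of the product stays within rounding distance of the tabulated values, and the recomputed margins coincide with those listed in Table 1.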
5. Concluding remarks
In the present paper we derived the formula for the variance of the yearly growth rate of a monthly
turnover in a dynamic population with changing strata in a straightforward manner. The covariance
terms in the formulas are estimated from the correlation, which in turn is estimated from the matched
sample. Berger (2004) remarks that the correlation may be overestimated, resulting in a negative bias
of the variance of the growth rates. To avoid this bias, Berger (2004) uses conditional Poisson
sampling. In the example of the Dutch Supermarkets the bias of the estimated variances is expected to
be very small. When we repeatedly drew small samples of 3 units from a stratum of 34 population
units, we found that the estimated correlation among the sampling units corresponding to period t
versus t–12 was underestimated by about 10 per cent. In practice, however, only some of the strata
have such small sample sizes; other strata are sampled completely and thus without bias. This means
that the covariance of the estimated totals $\widehat{\operatorname{cov}}(\hat{O}^{t}, \hat{O}^{t-12})$ will be underestimated by less than 10 per
cent. The correlation between $\hat{O}^{t}$ and $\hat{O}^{t-12}$ was about 0,2. Taking a 10 per cent bias as a worst-case
scenario, this means that the variance of $\hat{G}_{dir}^{t,t-12}$ was overestimated by no more than 1,9 per cent and
the corresponding confidence interval by no more than 1,0 per cent.
The variance formulas proposed here can be extended in various ways. First of all it can be shown that
they can be extended to the situation where the generalised regression estimator or the ratio estimator
is used instead of the Horvitz-Thompson estimator. This is useful because in business surveys one
often exploits auxiliary data, such as VAT data; see Hidiroglou et al. (1995) and Smith et al. (2003).
Secondly, the variance formulas can be extended to take into account other (nonlinear) functions of
totals. One example is the estimation of a chain of 12 growth rates. This is useful for situations where
in each month two estimates of the actual level are made, for instance when monthly growth rates are
based on the method of matched pairs; see Smith et al. (2003). Another function of totals of interest is
the yearly change of the sum of 12 months of the present year compared to the
corresponding sum of the previous year. Thirdly, the formulas can easily be adjusted to allow for
situations of post-stratification. That would be useful for social statistics when estimating a change as
the difference between two levels. Finally, here we assumed that the stratum sizes are known. In some
situations however, the population totals of the strata are estimated; see e.g. Lowerre (1979). Future
research is needed to find out how the formulas can be extended to allow for estimated stratum sizes.
The example of the Dutch Supermarkets shows one of the practical applications of the variance
formulas: determining which of two possible estimators has the smallest variance. The results
show that the variance of the more complicated estimator, $\hat{g}_{publ}^{t,t-12}$ from (2.2), which corrects for the
sample refreshment in January, was slightly smaller than that of the alternative estimator $\hat{g}_{dir}^{t,t-12}$ from
(2.3). In branches (other SIC codes) where the correlation between the turnover of subsequent months
is small the opposite may be true: a smaller variance of $\hat{g}_{dir}^{t,t-12}$ than of $\hat{g}_{publ}^{t,t-12}$. In our example the
difference between the two estimators was small. That difference might be more substantial when a
larger proportion of the sample size is refreshed in January.
Appendix A. Some useful formulas
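Let $s_{123}$ denote a mother sample consisting of three mutually disjoint SRS subsamples $s_1$, $s_2$ and $s_3$.
Let the variable x be observed in $s_{12}$ and the variable y in $s_{23}$. The corresponding sample means are
denoted by $\bar{x}_{12}$ and $\bar{y}_{23}$, respectively. Furthermore, it is assumed that $s_1$ and $s_3$ have the same size
($n_1 = n_3$). Define $\lambda$ by $\lambda = n_2/n_{12}$ ($= n_2/n_{23}$). Then the covariance between $\bar{x}_{12}$ and $\bar{y}_{23}$ is equal to
$$ \operatorname{cov}(\bar{x}_{12}, \bar{y}_{23}) = \left( \frac{\lambda}{n_{12}} - \frac{1}{N} \right) S_{xy} = \left( \frac{\lambda}{n_{23}} - \frac{1}{N} \right) S_{xy}, \qquad
S_{xy} = \frac{1}{N-1} \sum_{j=1}^{N} (X_j - \bar{X}_p)(Y_j - \bar{Y}_p). $$
Proof
$$
\begin{aligned}
\operatorname{cov}(\bar{x}_{12}, \bar{y}_{23}) &= \operatorname{cov}\{ (1-\lambda)\bar{x}_1 + \lambda \bar{x}_2,\ \lambda \bar{y}_2 + (1-\lambda)\bar{y}_3 \} \\
&= (1-\lambda) \operatorname{cov}(\bar{x}_1, \bar{y}_{23}) + \lambda^{2} \operatorname{cov}(\bar{x}_2, \bar{y}_2) + \lambda(1-\lambda) \operatorname{cov}(\bar{x}_2, \bar{y}_3) \\
&= -(1-\lambda) \frac{S_{xy}}{N} + \lambda^{2} \left( \frac{1}{n_2} - \frac{1}{N} \right) S_{xy} - \lambda(1-\lambda) \frac{S_{xy}}{N} \\
&= \left( \frac{\lambda^{2}}{n_2} - \frac{1}{N} \right) S_{xy} = \left( \frac{\lambda}{n_{12}} - \frac{1}{N} \right) S_{xy} = \left( \frac{\lambda}{n_{23}} - \frac{1}{N} \right) S_{xy}.
\end{aligned}
$$
In the third line we used that $\operatorname{cov}(\bar{x}_1, \bar{y}_{23}) = \operatorname{cov}(\bar{x}_2, \bar{y}_3) = -S_{xy}/N$. This follows from the conditional
covariance formula
$$
\begin{aligned}
\operatorname{cov}(\bar{x}_2, \bar{y}_3) &= E\{\operatorname{cov}(\bar{x}_2, \bar{y}_3 \mid s_2)\} + \operatorname{cov}\{E(\bar{x}_2 \mid s_2), E(\bar{y}_3 \mid s_2)\} \\
&= 0 + \operatorname{cov}\left\{ \bar{x}_2,\ \frac{\bar{Y}_p - f_2 \bar{y}_2}{1 - f_2} \right\} \qquad (f_2 = n_2/N) \\
&= -\frac{f_2}{1-f_2} \operatorname{cov}(\bar{x}_2, \bar{y}_2) = -\frac{f_2}{1-f_2}\,(1-f_2)\,\frac{S_{xy}}{n_2} = -\frac{S_{xy}}{N}.
\end{aligned}
$$
For an alternative proof based on the sampling autocorrelation coefficient, see Knottnerus (2003, p.
375). Likewise, for $\lambda = n_2/n_{12}$ and $\mu = n_2/n_{23}$ ($\lambda \neq \mu$) we get
$$
\operatorname{cov}(\bar{x}_{12}, \bar{y}_{23}) = \left( \frac{\lambda\mu}{n_2} - \frac{1}{N} \right) S_{xy} = \left( \frac{\mu}{n_{12}} - \frac{1}{N} \right) S_{xy} = \left( \frac{\lambda}{n_{23}} - \frac{1}{N} \right) S_{xy}.
\tag{A.1}
$$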
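Formula (A.1) can be verified exactly on a small population by enumerating every way of drawing the three disjoint subsamples. The sketch below does this with a made-up population of six units; the function names and the data are illustrative assumptions. A random ordering of the population is split into $s_1$, $s_2$ and $s_3$, which corresponds to SRS without replacement of the mother sample followed by a random split.

```python
from itertools import permutations
from math import isclose


def exact_overlap_covariance(x, y, n1, n2, n3):
    """Exact cov(mean of x over s12, mean of y over s23) by full enumeration."""
    n12, n23 = n1 + n2, n2 + n3
    total_x = total_y = total_xy = 0.0
    count = 0
    for order in permutations(range(len(x))):
        s12 = order[:n12]              # s1 followed by s2
        s23 = order[n1:n1 + n2 + n3]   # s2 followed by s3
        mean_x = sum(x[i] for i in s12) / n12
        mean_y = sum(y[i] for i in s23) / n23
        total_x += mean_x
        total_y += mean_y
        total_xy += mean_x * mean_y
        count += 1
    return total_xy / count - (total_x / count) * (total_y / count)


def formula_a1(x, y, n1, n2, n3):
    """Right-hand side of (A.1): (lambda * mu / n2 - 1 / N) * S_xy."""
    size = len(x)
    mean_x = sum(x) / size
    mean_y = sum(y) / size
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (size - 1)
    lam = n2 / (n1 + n2)
    mu = n2 / (n2 + n3)
    return (lam * mu / n2 - 1.0 / size) * s_xy


# Made-up population of N = 6 units with unequal sizes of s1 and s3.
x_values = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
y_values = [2.0, 3.0, 1.0, 9.0, 4.0, 8.0]
enumerated = exact_overlap_covariance(x_values, y_values, 1, 2, 2)
closed_form = formula_a1(x_values, y_values, 1, 2, 2)
print(enumerated, closed_form)
```

The two printed values agree up to floating-point rounding; choosing $n_1 = n_3$ instead reproduces the special case stated at the start of the appendix.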
References
Berger, Y.G. (2004). Variance estimation for measures of change in probability sampling, The
Canadian Journal of Statistics, 32, 451–467.
Hidiroglou, M.A., Särndal, C.E. and Binder, D.A. (1995). Weighting and estimation in business
surveys, pp. 477–502 in: B.G. Cox (ed.), Business Survey Methods, John Wiley & Sons, New
York.
Holt, D., and Skinner, C.J. (1989). Components of change in repeated surveys, International
Statistical Review, 57, 1–18.
Holt, D., and Smith, T.M.F. (1979). Poststratification, Journal of the Royal Statistical Society, A, 142,
33–46.
Knottnerus, P. (2003). Sample Survey Theory: Some Pythagorean Perspectives, Springer-Verlag, New
York.
Knottnerus, P. and Delden, A. van (2005). Confidence margins and run-away effects of the yearly
growth rates of monthly sales in the supermarkets (in Dutch), Internal CBS Report, Voorburg.
Konschnik, C.A., Monsour, N.J., and Detlefsen, R.E. (1985). Constructing and maintaining frames and
samples for business surveys, Proceedings of the Survey Research Methods Section, American
Statistical Association, 113–122.
Kish, L. (1965). Survey sampling, John Wiley & Sons, New York.
Laniel, N. (1987). Variances for a rotating sample from a changing population, Proceedings of the
Survey Research Methods Section, American Statistical Association, 496–500.
Lowerre, J.M. (1979). Sampling for change, Proceedings of the Survey Research Methods Section,
American Statistical Association, 343–347.
Nordberg, L. (2000). On variance estimation for measures of change when samples are coordinated by
the use of permanent random numbers, Journal of Official Statistics, 16, 363–378.
Smith, P., Pont, M., and Jones, T. (2003). Developments in business survey methodology in the
Office for National Statistics, 1994–2000, The Statistician, 52, 257–295.
Tam, S.M. (1984). On covariances from overlapping samples, The American Statistician, 38, 288–289.