SBS Workshop: *Structural Business Statistics on

Title: Guidelines (including options) on how the BR interacts with the SDWH - WORKING DOCUMENT
WP: 2
Deliverable: 2.2.1
Version: 1.2
Date: 28-2-2012
Author: Pieter Vlag
NSI: Netherlands
ESS-NET ON MICRO DATA LINKING AND DATA WAREHOUSING
IN PRODUCTION OF BUSINESS STATISTICS
Contents
1. Summary
2. Introduction
2.1 Definition of a statistical DataWareHouse (according to the FPA)
2.2 The statistical DataWarehouse: architecture and layers
2.3 Linking different input data: the population frame
2.4 Relationship between the population frame and the Business Register
3. Statistical units, backbones and default target population
3.1 Statistical units and default target population
3.2 Backbone “population frame”
3.3 Backbones “turnover” and “employment”
3.4 Determination of the default target population
3.5 Updating the population
4. The largest enterprises
5. Linking the data-sources to the statistical unit
5.1 Position in the statistical process of a DataWareHouse
5.2 Variation in input units
5.3 Variation in output units
5.4 The statistical unit and the process of a statistical-DWH
5.5 The statistical unit base
6. Correction information in the population frame and feedback to SBR
6.1 The position of the Business Register in a statistical DataWareHouse
6.2 Determination of the default target population in the statistical-DWH
6.3 Panel surveys and correcting the population frame
6.4 Timing of feedback to the unit base and the backbones
6.5 Timing of feedback to the SBR
7. Conclusions
1. Summary
An important characteristic of a statistical datawarehouse is that its input data come from
different data sources, like surveys, administrative data, accounting data and census data.
However, depending on the data source, these input data may refer to different enterprise
units, like the establishment, the legal (enterprise) unit or the enterprise group. Moreover,
data sources may cover different populations: some input data cover all enterprises with a
certain activity, but others cover only big enterprises or other subpopulations.
Hence, to use several input data sources for one statistical estimate, which is the aim of a
statistical DataWareHouse, it is crucial to ensure that these data are linked to the same
enterprise units and are compared with the same target population.
Most statistical institutes use the Statistical Business Register (SBR) to derive a) the statistical
enterprise units and b) the target population, i.e. the group of enterprises to which the
statistical estimates refer. We propose that the statistical datawarehouse uses only the
statistical enterprise as unit. The default target population is defined as all enterprises which
have been active during the reference period. The use of only one unit is proposed for the sake
of governance and clarity. The default target population is proposed because it
corresponds with the output requirements of the European regulations. Desired estimates about
subpopulations (like large or small enterprises only) can be derived from within this default.
It should be noted that the enterprise units (plus some characteristics of these units) and
the corresponding default target population are input sources for the statistical
DataWareHouse, and not the Business Register as a whole.
Errors might be detected in statistical units and target populations when linking other input
data to this information. A frequently expected error concerns the NACE-code (= activity). If
influential, these errors need to be corrected in the statistical DataWareHouse. In practice,
data-linking will be done source by source and not simultaneously for all sources, because
some sources become available much earlier than others. Hence, to
keep the process under control, one has to decide at which step in the statistical process errors
in statistical units and the target population have to be corrected in the DataWareHouse.
Another question is at which stage of the process this corrected information has to be fed
back to the SBR. In this document it will be argued that this information could be corrected at
the end of the processing phase and that feedback to the SBR needs to take place once a year,
unless the detected errors are really influential.
2. Introduction
2.1 Definition of a statistical DataWareHouse (according to the FPA)
The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare
recommendations about better use of data that already exist in the statistical system. Its
ultimate aim is:
‘To create fully integrated data sets for enterprise and trade statistics at micro level:
a 'data warehouse' approach to statistics.’
The broad definition of a data warehouse to be used in this ESSnet is therefore:
‘A common conceptual model for managing all available data of interest, enabling the NSI to
(re)use this data to create new data/new outputs, to produce the necessary information and
perform reporting and analysis, regardless of the data’s source.’
Within this ESSnet one workpackage (WP 2) covers all essential methodological elements for
designing, building and implementing the statistical data warehouse.
2.2 The statistical DataWarehouse: architecture and layers
Another workpackage (WP 3) covers all essential architectural and technical elements for
designing, building and implementing the statistical data warehouse. Basically this
workpackage has applied the GSBPM sub-processes to the statistical-DWH concept in order
to provide a Business Architecture of the statistical-DWH. Moreover, it proposes a modular
workflow for the SDWH in order to manage the information flow between data sources and
SDWH central administration. To do this, it uses four functional layers:
- data source layer,
- integration layer,
- interpretation and data analysis layer,
- data presentation layer.
Figure 1 shows the GSBPM model. Figure 2 shows the relationship between the phases of the
statistical process as defined by the GSBPM and the functional layers as proposed by the
workpackage 3 team. Statistical (enterprise) units and the target population play an important
role in the integration layer of the statistical-DWH, which corresponds with the processing
phase of the GSBPM. This is because in this phase different input sources are linked to each
other. Later, they are weighted to the target population in order to obtain statistical estimates.
In the next chapters of this document, we discuss more precisely at which steps the
population frame and statistical units play a crucial role.
Figure 1 A schematic sketch of the GSBPM (Generic Statistical Business Process Model). Note that
the GSBPM divides the statistical process into 9 phases, which are divided into subprocesses.
Figure 2 Relationships between the layers of a statistical DataWareHouse and the statistical processes
according to the GSBPM (Generic Statistical Business Process Model).
2.3 Linking different input data: the population frame
One aim of a statistical DataWareHouse is to create a set of fully integrated data about
enterprises. These data may come from different data sources. These data sources are
collected in the collection phase of the “Business Architecture” (fig. 2).
In many countries the different data sources cover different populations. For example, the
Value Added Tax (VAT)-data and corporate tax data do not include the smallest enterprises,
but are quasi-complete for the part of the enterprise population they cover. Some survey
samples include the smallest enterprises but they provide only information about a few
enterprises within a population. Hence, linking these sources is not only a matter of linking
two sources but also a matter of relating all input data to a reference, the so-called target
population. Another factor to take into account is that different sources may have
different units. For example, surveys in the Netherlands are based on statistical units (which
generally correspond with legal units), while VAT-units are based on enterprise groups.
Hence, it has to be agreed to which unit VAT-data and survey-data are linked. Other examples
in other countries can also be given.
Summarising, when linking several input data in a statistical DataWareHouse, one has to
agree on
- the default target population, i.e. the reference to which all input data are linked;
- the enterprise unit to which all input data are matched.
These questions will be addressed in this deliverable. The technical aspects about linking of
several data-sources are described in deliverable 2.3 of the ESSnet on DataWareHousing
(DWH).
2.4 Relationship between the population frame and the Business Register
Member States of the European Union maintain business registers for statistical purposes as a
tool for the preparation and coordination of surveys, as a source of information for the
statistical analysis of the business population and its demography, for the use of
administrative data, and for the identification and construction of statistical units. The
Regulation (EC) No 177/2008 of the European Parliament and the Council (EC) sets out a
common framework for the harmonisation of the national business registers for statistical
purposes and Article 7 of the Regulation asks for the publication of a business register
recommendation manual. The manual aims to explain the reasoning behind the provisions of
the Regulation. It aims to provide the extra information required for the correct and consistent
interpretation of the Regulation in all countries. This second edition of the manual is derived
from the first one published in 2003 and replaces it. The manual has been updated in close
cooperation with the Member States.
The regulation and manual imply that the business register contains at least
- a statistical unit,
- a name and address of the statistical unit,
- an activity code (NACE),
- a starting and a stopping date of enterprises.
The implication for the statistical DataWareHouse is that the needed population
characteristics, i.e. unit and default target population (see paragraph 2.3), can be derived from
the SBR. Hence, the SBR is indirectly a crucial input for the statistical DataWareHouse. The
complete statistical Business Register itself is not necessarily an input source for the statistical
DataWarehouse; the population frame derived from it is.
3. Statistical units, backbones and default target population
3.1 Statistical units and default target population
Taking into account the (expected) recommendations of the ESSnet on Consistency and the
European regulations on business registers, it is proposed that the statistical enterprise unit is
the standard unit to which all data sources are linked in the statistical DataWareHouse.
Taking into account the SBS-regulations, we propose that the default target population is
defined as: all enterprises with a certain activity being active during the reference period. The
activity is derived from the NACE-code. In other words, for annual statistics this means that
the default target population consists of all enterprises active during the year, including the
starters and stoppers (and the new/stopping units due to merging and splitting companies).
Note that this document uses the term default target population. This is the target population
of several obligatory statistics, like SBS, and the largest possible population for a reference
period, as it includes all enterprises. We propose that the default target population is used as
the standard to check, clean and weight the input data in the processing phase of the
DataWareHouse (see figs. 2/3). On the other hand, one aim of the statistical DataWareHouse
is to produce flexible output. Therefore, the statistical DataWareHouse should be able to
produce estimates about
- all enterprises being active during the reference period (= standard)
- and subpopulations (of this standard).
Examples of subpopulations are: large or small enterprises only, or all enterprises active at a
certain date. Formally, these subpopulations also have a target population when estimating
aggregates. This leads to confusion about the term target population in a statistical
DataWareHouse. To prevent this confusion,
- the term default target population is used when referring to the standard, i.e. all
  enterprises being active;
- the term target population has a broader definition. It applies to producing estimates for
  both the standard and the subpopulations.
Checking, cleaning, integrating and weighting the input data in a statistical DataWareHouse
(SDWH) are further discussed in chapter 5 of this deliverable and in other deliverables of the
ESSnet on DataWareHousing.
3.2 Backbone “population frame”
To determine the default target population in the SDWH, it is crucial that the population frame
is derived from the SBR. This population frame consists of all enterprises being in the SBR
during the year, regardless of whether they are active or not. This input source will be called
‘backbone: population frame’ in the remainder of this document. To derive activity status and
subpopulations, it is recommended that this backbone includes the following information:
1) Frame reference year
2) Statistical enterprise unit, including its national ID and its EGR ID [1]
3) Name/address of the enterprise
4) National ID of the enterprise
5) Date in population (mm/yr)
6) Date out of population (mm/yr)
7) NACE-code
8) Institutional sector code
9) Size class [2]
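As an illustration only, the nine recommended items could be held in one record per enterprise. The sketch below is a minimal Python layout; the class name, field names and types are assumptions made for this example and are not prescribed by the deliverable. Dates given as mm/yr are stored as the first day of the month.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FrameRecord:
    """One row of the backbone 'population frame' (hypothetical layout)."""
    frame_year: int               # 1) frame reference year
    unit_id: str                  # 2) statistical enterprise unit
    egr_id: Optional[str]         # 2) EGR ID, if one has been assigned
    name: str                     # 3) name of the enterprise
    address: str                  # 3) address of the enterprise
    national_id: str              # 4) national ID of the enterprise
    date_in: date                 # 5) date in population (first day of the month)
    date_out: Optional[date]      # 6) date out of population, None if still present
    nace_code: str                # 7) NACE-code
    sector_code: str              # 8) institutional sector code
    size_class: str               # 9) size class


example = FrameRecord(
    frame_year=2012, unit_id="SU-001", egr_id=None,
    name="Example BV", address="Example street 1",
    national_id="NL-001", date_in=date(2005, 3, 1), date_out=None,
    nace_code="4711", sector_code="S11", size_class="2",
)
```

All values in `example` are invented. Keeping `date_out` optional reflects that most enterprises in the frame have not left the population.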
Note that the population frame is crucial information to determine the default active
population. However, the activity status of enterprises cannot be derived from this backbone
itself. To estimate whether enterprises are active or not, a comparison with VAT and/or
employment data is needed. This will be discussed in the next section (3.3). Sections 3.4 and
3.5 discuss how the active population can be determined in case the statistical
DataWareHouse
- is limited to annual statistics
- includes short-term statistics, too.
3.3 Backbones “turnover” and “employment”
The results of the ESSnet on AdminData showed that VAT and social security data can be
used for turnover and employment estimates when quasi-complete. The latter is the case for
quarterly statistics for most countries on the European continent and for annual statistics for
all European countries. Note, however, that VAT can only be used for statistical purposes if a)
the data delivery from the tax office is guaranteed and b) the link with the SBR is
established. Assuming that these conditions are fulfilled, it is proposed to include backbones
of VAT-data and employment data as input data in a statistical DataWareHouse.
[1] A meaningless ID assigned by the EGR system to enterprises. It is advised to include this ID
in the DataWareHouse to enable comparability between the country-specific estimates.
[2] Could be based on employment data.
The reasons to include a VAT-data backbone and a social security data backbone in a
statistical DataWareHouse are twofold:
- Backbones of VAT and social security data are crucial to create a fully integrated dataset
  suitable for flexible outputs.
- VAT and social security data are crucial to determine the activity status of the enterprises
  and implicitly to determine the default target population.
These reasons are explained further in the remainder of this section. When (quasi) complete,
VAT and social security data can be used to produce good-quality estimates of turnover and
employment. Therefore, it is very useful to use these estimates as benchmarks when
incorporating results of sample surveys in a statistical DataWareHouse. In other words,
together with the number of enterprises, totals of turnover and employment should determine
the population frame when weighting survey results (or imputing non-observed enterprises).
This population frame can also be used when relating other datasets to it. Such an approach is
necessary to
- create a fully integrated dataset using several input data sources;
- reduce the impact of sampling errors of surveys.
The first condition is the aim of a statistical DataWareHouse. The second condition is
required to produce flexible output, especially about subgroups of the default target
population.
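To make the benchmark idea concrete, the sketch below uses a simple ratio estimator: a survey total is scaled by the ratio between the (quasi-complete) VAT turnover total and the turnover of the sampled enterprises. The ratio estimator is chosen here only as one familiar illustration; the text does not prescribe a particular weighting method, and the function and argument names are assumptions.

```python
from typing import Sequence


def ratio_estimate(sample_values: Sequence[float],
                   sample_turnover: Sequence[float],
                   vat_turnover_total: float) -> float:
    """Estimate a population total of a survey variable.

    The survey total is scaled so that the sampled turnover matches the
    turnover benchmark taken from the VAT backbone.
    """
    if len(sample_values) != len(sample_turnover):
        raise ValueError("both sequences must describe the same sample")
    turnover_in_sample = sum(sample_turnover)
    if turnover_in_sample == 0:
        raise ValueError("sampled turnover must be non-zero")
    return sum(sample_values) * vat_turnover_total / turnover_in_sample


# Hypothetical numbers: two sampled enterprises, VAT benchmark of 3000.
investment_total = ratio_estimate([10.0, 20.0], [100.0, 200.0], 3000.0)
```

Because the benchmark is close to complete, the estimate depends less on which enterprises happened to be sampled, which is the reduction in sampling error the text refers to.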
Several NSIs use VAT and social security data to determine the activity status of an
enterprise. More precisely, enterprises are considered active if VAT and/or social security
data are available for the reference period or the previous period (in case of late VAT or late
social security data). This method is preferred over a survey to determine the activity status,
because the latter might be biased due to high non-response rates among the stopped
enterprises. Summarising, VAT and social security data are crucial to determine the activity
status of the enterprises. Indirectly they are crucial to determine the default target population.
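The activity rule described above can be sketched as follows. The example assumes quarterly periods encoded as `(year, quarter)` tuples and admin data represented as a mapping from unit ID to the set of periods with a declaration; both choices, and all names, are assumptions for illustration.

```python
from typing import Dict, Set, Tuple

Period = Tuple[int, int]  # (year, quarter) with quarter in 1..4


def previous_period(period: Period) -> Period:
    """Return the quarter preceding the given quarter."""
    year, quarter = period
    return (year - 1, 4) if quarter == 1 else (year, quarter - 1)


def is_active(unit_id: str, period: Period,
              vat: Dict[str, Set[Period]],
              social_security: Dict[str, Set[Period]]) -> bool:
    """A unit is active if VAT and/or social security data exist for the
    reference period or, to allow for late data, the previous period."""
    reported = vat.get(unit_id, set()) | social_security.get(unit_id, set())
    return period in reported or previous_period(period) in reported


# Hypothetical admin data
vat = {"A": {(2011, 4)}}
social_security = {"B": {(2012, 1)}}
```

With these inputs, unit "A" counts as active in the first quarter of 2012 only through the previous-period allowance, and a unit absent from both sources is treated as not active.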
Note that several countries have incorporated VAT and social security data in the SBR
already. Even in this case, it is proposed to consider both admin data sources as separate
backbones in the statistical DataWareHouse. The reasons are twofold:
- VAT and social security data are essential factors in the estimation process within a
  statistical DataWareHouse. For the sake of transparency, crucial decisions about these data
  should preferably be taken within the statistical DataWareHouse and not outside.
- SBR, VAT and social security data may provide contradictory information, especially
  when including short-term statistics (STS) in the statistical-DWH (see chapter 3.5).
3.4 Determination of the default target population
3.4.1 Case I: Statistical DataWareHouse is limited to annual statistics
The determination of the default target population is relatively easy if the scope of the
statistical DataWareHouse is limited to annual statistics, because
the required information from the SBR and administrative data (VAT + employment) can be
selected afterwards, i.e. when the year has finished. This is because survey results and other
data sources with annual data (like accountancy data + combined results of four quarters)
become available after the year has ended. Furthermore, survey designs about production,
investments etc. are not finalised before the year has ended. As a result, no provisional
populations to link provisional data during the calendar year are needed for the statistical
DataWareHouse.
Therefore, the default target population can be determined by
- selecting all enterprises which are recorded in the SBR during the reference year;
- using the complete annual VAT and employment (admin) dataset to determine the
  activity status.
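The two steps above can be sketched as a selection on the frame followed by a filter on activity. This is a minimal illustration: the record layout, the use of full dates instead of mm/yr, and the representation of the admin data as a set of active unit IDs are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class FrameUnit:
    unit_id: str
    date_in: date               # date the unit entered the population
    date_out: Optional[date]    # None if the unit has not left


def in_frame_during(unit: FrameUnit, year: int) -> bool:
    """True if the unit was recorded in the frame at any time in `year`,
    so starters and stoppers of that year are included."""
    year_start, year_end = date(year, 1, 1), date(year, 12, 31)
    entered_in_time = unit.date_in <= year_end
    left_too_early = unit.date_out is not None and unit.date_out < year_start
    return entered_in_time and not left_too_early


def default_target_population(frame: Iterable[FrameUnit], year: int,
                              active_ids: Set[str]) -> List[str]:
    """Step 1: select units recorded during the year.
    Step 2: keep those that the annual admin data mark as active."""
    return [u.unit_id for u in frame
            if in_frame_during(u, year) and u.unit_id in active_ids]


frame = [
    FrameUnit("A", date(2005, 1, 1), None),               # continuing unit
    FrameUnit("B", date(2012, 3, 1), None),               # starter in 2012
    FrameUnit("C", date(2000, 1, 1), date(2011, 6, 30)),  # stopped before 2012
    FrameUnit("D", date(2000, 1, 1), date(2012, 2, 29)),  # stopper in 2012
    FrameUnit("E", date(2000, 1, 1), None),               # in frame, no admin data
]
population_2012 = default_target_population(frame, 2012, {"A", "B", "C", "D"})
```

In this invented example the starter "B" and the stopper "D" are both part of the 2012 population, while "E" is excluded because the admin data give no sign of activity.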
3.4.2 Case II: Statistical DataWareHouse includes short-term statistics
The determination of the default target population becomes more complicated when results of
short-term statistics are included in the statistical DataWareHouse. In this case a provisional
population frame for reference year t should be constructed at the end of year t-1, i.e. in
November or December. This population frame is used to design short-term surveys. It is also
the starting point for the SDWH. This provisional frame is called release 1. Formally, it does
not cover the entire population of year t, as it does not yet contain the starting enterprises.
During the year the population frame is regularly updated with new information from the SBR
(especially new enterprises) and the administrative data (VAT + social security data). The
frequency of these updates depends on the updates of the SBR, VAT and social security data.
At the end of year t (or at the beginning of year t+1), a regular population frame for year t can
be constructed. This regular population frame consists of all enterprises in the year and is
called release 2.
The ESSnet on Administrative Data has observed that time-lags exist between the
registration of starting/stopping enterprises in the SBR and the different admin data sources.
The impact of these time-lags differs per country, because it depends on
- the updates of the SBR, VAT and social security data;
- the quality of the underlying source information.
Despite the different impact of the time-lags, the ESSnet on AdminData has shown that these
time-lags exist in every country and lead to revisions in estimates about active enterprises
on a monthly and quarterly basis. This effect is enhanced because the admin data are not
entirely complete on a quarterly basis. These time-lag and incompleteness issues might be a
consideration for choosing a low frequency for updating the population frame in a statistical
DataWareHouse. For example, quarterly and/or bi-annual updates could be considered.
3.5 Updating the population
At the beginning of year t+1 (or later) additional admin data and survey results for reference
year t become available. Therefore, it cannot be excluded that errors are detected in the
‘release 2’ population determined at the end of the reference year. For this reason, a special
procedure for additional frame error corrections should be developed, and a final population
frame is foreseen for July of year t+1. How these updates should be incorporated in the
statistical DataWareHouse and the SBR will be discussed in chapter 6.
This updating scheme is schematically presented in figure 3.
[Figure 3: timeline diagram. Elements shown: FATS frame population of active units year T;
FATS survey population year T; FATS frame population of active units year T (revised);
undercoverage in both frame populations; overcoverage; frame error procedure. Releases 1 to 4
are marked along a month axis (Dec, Jan, Apr, Jul, Aug, Oct, Nov).]
Figure 3 Proposal for the construction of an annual population. Figure copied from van der Ven, 2012. This
figure is an example for FATS but can be generalised for the entire Data WareHouse. Please note that the release
2 in this figure is skipped for the SDWH-procedure.
4. The largest enterprises
The situation described in chapter 3 is applicable to most enterprises. However, an increasing
number of national statistical institutes (NSIs) have created a unit which is responsible for
creating a fully integrated dataset for the largest enterprises or largest enterprise groups which
dominate the economy. If such a unit, or such an integrated dataset for the largest enterprises,
exists, it could be considered as a backbone ‘largest enterprises’.
This backbone is an input for the statistical DataWareHouse and should be linked to the
backbone “population frame” in the first step of the integration phase of the statistical
DataWareHouse (GSBPM step 5.1, see figure 2). The remainder of the process is the same
for the largest enterprises as for all other enterprises.
5. Linking the data-sources to the statistical unit
5.1 Position in the statistical process of a DataWareHouse
Data-linking between the different sources is the first step in the processing phase of a
statistical DataWareHouse. As the population frame consists of the statistical enterprise unit
only, this step can be described more precisely as linking the input data to the statistical units.
Technical aspects of data-linking are described in deliverable 2.4 of the ESSnet on
DataWareHousing. The following sections address the question of which
information is required to link the several input sources to the statistical unit.
5.2 Variation in input units
Accountancy data, tax data (including VAT and social security data) and other data may be
reported for different parts within an enterprise group. These data might be reported for the
enterprise group as a whole, the underlying legal units, and tax units
consisting of other parts of the enterprise group. The variation in units and the challenge of
linking them depend on the national legislation. Therefore, the impact of this issue differs
per country. The size of the enterprise also determines the variation in units and the
complexity of linking them. For small enterprises one-to-one relationships between the
different units can be assumed, but this assumption cannot be made for medium-sized
enterprises. Nevertheless, whatever the extent of this issue in the individual countries and
whatever the determination of the statistical unit, it cannot be taken for granted that all input
data link automatically to the statistical unit. Hence, the relationship between these ‘input’
units and the statistical unit should be known before the data can be linked.
Data-linking is of less importance when using surveys, because surveys are generally based
on statistical units, as they are designed from information in the SBR.
5.3 Variation in output units
Most statistical estimates in enterprise statistics are produced on the statistical unit. Examples
are SBS, STS and most institutional statistics. However, some output is produced on different
units like local units, LKAU, KAU or enterprise groups. Again, the complexity of linking
these units depends on the country and the size of the enterprises. Nevertheless, one-to-one
relationships between these output units and the statistical enterprise unit cannot be taken for
granted. Hence, relationships between the ‘output’ units and the statistical units should be
known before flexible outputs can be generated.
5.4 The statistical unit and the process of a statistical-DWH
The simplest and most transparent statistical process can be generated by
- linking all input sources to the statistical enterprise unit at the beginning of the
  processing phase (GSBPM step 5.1, see figures 2 and 3);
- performing data cleaning, plausibility checks and data integration on statistical units only
  (GSBPM steps 5.2-5.6);
- producing statistical output (GSBPM steps 5.7-5.8) by default on the statistical unit and
  the default target population. Flexible outputs on other target populations and other units
  are also produced in these steps by using repeated weighting techniques and/or domain
  estimates. Technical aspects of these estimation methods are described in deliverable 2.8
  of the ESSnet on DataWareHousing.
Note that it is theoretically possible to perform data-analyses and data-cleaning on several
units simultaneously. However, experiences of Statistics Netherlands with cleaning VAT-data
on statistical units and ‘implementing’ these changes on the original VAT-units too, reveal
that the statistical process becomes quite complex. Therefore, it is proposed that
- linking to the statistical units is carried out at the beginning of the processing phase only;
- the creation of a fully integrated dataset is performed for statistical units only;
- statistical estimates for other units are produced at the end of the processing phase only;
- relationships between the different in- and output units on one hand and the statistical
  enterprise unit on the other hand are known (or estimated) beforehand.
5.5 The statistical unit base
As the relationship between the different in- and output units on one hand and the statistical
enterprise units on the other hand should be known (or estimated) before the processing
phase, it is recommended to include this information in a separate input source. This input
source is the so-called unit base. It describes the relationship between the different units and
the statistical enterprise unit. Figure 4 illustrates the content of a unit base.
Note that the exact contents of the unit base depend on
- legislation per country,
- output requirements and desired output of a statistical DataWareHouse,
- available input data.
It should also be mentioned that the unit base might be updated when other input data (with
different units) become available.
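A unit base can be pictured as a table of links between source units and statistical enterprise units. The sketch below moves values reported on VAT units to statistical units through such a table. The tuple layout and the allocation share per link are assumptions for illustration; the text only requires that the relationships are known or estimated.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (source_unit_id, statistical_unit_id, share of the source value allocated)
UnitLink = Tuple[str, str, float]


def to_statistical_units(values: Dict[str, float],
                         unit_base: Iterable[UnitLink]) -> Dict[str, float]:
    """Re-express values reported per source unit as totals per
    statistical enterprise unit, using the links in the unit base."""
    totals: Dict[str, float] = defaultdict(float)
    for source_id, statistical_id, share in unit_base:
        if source_id in values:
            totals[statistical_id] += values[source_id] * share
    return dict(totals)


# Hypothetical unit base: VAT2 covers two statistical units,
# and statistical unit E3 receives data from two VAT units.
unit_base = [
    ("VAT1", "E1", 1.0),
    ("VAT2", "E2", 0.5),
    ("VAT2", "E3", 0.5),
    ("VAT3", "E3", 1.0),
]
vat_turnover = {"VAT1": 100.0, "VAT2": 80.0, "VAT3": 20.0}
turnover_by_enterprise = to_statistical_units(vat_turnover, unit_base)
```

Storing the links in one place means that a correction to a relationship, for example a changed share, is made once and applies to every source that uses that link.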
The unit base is closely related to the SBR. However, as its contents are also closely related to
the available input data, we recommend considering it as a separate input source. The use of a
unit base is preferred over incorporating a statistical unit in each data source. First of all, when
doing the latter, the data-linking is implicitly done in the collection phase of a statistical
DataWareHouse. Secondly, it is much more efficient and transparent to store the relationships
between the different units in one source. This is especially the case when deficiencies in the
data-linking process are detected in a later phase of the statistical process and these deficiencies
lead to corrections in the earlier determined relationships between the units.
Figure 4 Example of a unit base.
6. Correction information in the population frame and feedback to SBR
6.1 The position of the Business Register in a statistical DataWareHouse
The position of the SBR in a statistical DataWareHouse is three-fold. More precisely,
- the SBR is the input source for the backbone ‘population frame’ of the statistical-DWH;
- the SBR is closely related to the unit base;
- the SBR is the sampling frame for the surveys. Surveys are another important data source
  of the statistical-DWH.
The last point implies that deficiencies in the population frame, which might be detected
during the statistical process, should be incorporated in the SBR. Otherwise, the same
deficiencies will return in the survey results of the next period.
The key questions are:
- at which step of the statistical process should the population frame be corrected when
  deficiencies are detected?
- when are the same corrections made in the backbones and the SBR?
The position of the SBR and its relationship with the backbones, unit base and surveys are
illustrated in fig. 5.
Figure 5 Illustration of the position of the SBR within a statistical DataWareHouse. The SBR is
basically related to three important input data of the S-DWH; the population frame, unit base and the
surveys. This figure also shows the position of a) data-integration and b) ‘weighting/calculation of
aggregates’ in the statistical process. It shows at which step in the statistical process the feedback to
the SBR preferably takes place. Note that backbones are denoted by brown cylinders and other input
data by grey cylinders.
6.2 Determination of the default target population in the statistical-DWH
As previously mentioned, the backbones ‘population frame’, ‘turnover’ and ‘employment’
are used for the determination of the default target population. As these backbones are linked
to each other at the beginning of the processing phase (GSBPM step 5.1), the determination of
the default target population can take place here. We call the result the provisional default
target population.
This provisional default target population can be used for checking, cleaning and integrating
the data at micro-level. During these steps, contradictory information might be detected
(in practice: will probably be detected). Such contradictory information may in extremis
lead to the conclusion that errors exist in the provisional default target population.
Deliverable 2.8 of the ESSnet on DataWareHousing addresses the question of how this
conclusion might be drawn, because it deals with the hierarchy between the
different data sources.
Whatever the methodology for detection, errors in the provisional default target population
have three possible origins. More specifically, they may be related:

to errors in the data-linking

to errors in the VAT- and/or employment data

to errors in the population frame.
The first two points result in an erroneous estimation of the activity status and therefore of the
number of active enterprises. It is expected that most errors in the population frame are related
to errors in

the NACE-code

the size class of the enterprise
In other words, other data sources like surveys and admin data indicate that the enterprise has
either a different activity or a different size than recorded in the SBR.
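A comparison of this kind could look like the sketch below. It only flags candidate errors and corrects nothing. The record layout, the field names `nace` and `size_class`, and all values are illustrative assumptions.

```python
def frame_discrepancies(sbr, other, fields=("nace", "size_class")):
    """Return (enterprise_id, field, sbr_value, other_value) for every disagreement.

    Only enterprises present in both sources are compared, and only for the
    fields that the other source actually reports.
    """
    found = []
    for eid in sorted(sbr.keys() & other.keys()):
        for field in fields:
            if field in other[eid] and other[eid][field] != sbr[eid][field]:
                found.append((eid, field, sbr[eid][field], other[eid][field]))
    return found

# Hypothetical SBR records and records from another source (survey or admin data).
sbr = {
    "E1": {"nace": "4711", "size_class": 2},
    "E2": {"nace": "5610", "size_class": 1},
}
other_source = {
    "E1": {"nace": "4719"},
    "E2": {"nace": "5610", "size_class": 3},
}

discrepancies = frame_discrepancies(sbr, other_source)
```

Whether a flagged discrepancy really is an error in the population frame still has to be decided by the hierarchy between the data sources.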
If the SBR, unit base and VAT/employment backbones are of good quality, the number of
errors in the provisional default target population should be limited. Moreover, data-cleaning
and data-integration at micro-level are basically independent of the number of active enterprises,
NACE-code and size class. Therefore, it is proposed to use the provisional default target
population for these steps, even after errors have been detected. Another reason for this
proposal is that errors might be detected at several stages of the data-cleaning and integration
process. Therefore, it is preferred to first collect all errors detected in this part of the process
before correcting them in the population.
Errors in the provisional default target population might become influential when weighting
the integrated microdata and calculating aggregates at the end of the processing phase.
Therefore, it is recommended to correct all errors in the population just before performing these
steps. Hence the provisional default target population is replaced by the default target
population at (the beginning of) GSBPM-step 5.7 (“weighting”).
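The proposed order of work (fix the provisional population, only log errors during cleaning and integration, apply all corrections just before weighting) can be sketched as follows. This is a minimal illustration under assumptions: the `Correction` record, the field names and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correction:
    enterprise_id: str
    field: str          # e.g. "nace", "size_class" or "active"
    new_value: object

def apply_corrections(provisional, corrections):
    """Return a corrected copy of the provisional population.

    `provisional` maps enterprise id -> record (a dict) and is left unchanged,
    so cleaning and integration keep working on a stable population.
    """
    population = {eid: dict(record) for eid, record in provisional.items()}
    for c in corrections:
        population[c.enterprise_id][c.field] = c.new_value
    return population

# Provisional default target population, determined when the backbones are linked.
provisional = {
    "E1": {"nace": "4711", "size_class": 2, "active": True},
    "E2": {"nace": "5610", "size_class": 1, "active": True},
}

# Errors detected during cleaning and integration are only collected.
error_log = [
    Correction("E1", "nace", "4719"),
    Correction("E2", "active", False),
]

# Just before weighting, the collected corrections are applied in one go.
default_target_population = apply_corrections(provisional, error_log)
```

Working on a copy keeps the provisional population available for comparison with the corrected one.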
6.3 Panel surveys and correcting the population frame
Surveys are another data source to detect errors in the default target population. In particular,
surveys about produced goods, performed services and investments can be very useful to
detect errors in the NACE-code. However, corrections of NACE-codes etc. should not lead to
selectivity in the quality of the SBR. Selectivity means in this context: some
parts of the SBR are of better quality than others. To prevent this drawback, one should be
very cautious in the use of (limited) panel survey data to correct information in the SBR.
When using panel surveys, erroneous information leading to incorrect estimates should
preferably be treated as outliers. This warning does not apply to admin data, because these data
cover almost the entire population.
6.4 Timing of feedback to the unit base and the backbones
The unit base and the backbones “population frame”, “turnover” and “employment” have a
crucial role in the linking and estimation process. Therefore, it is advisable that if the default
target population is updated due to errors in these sources, these sources themselves are
updated too. These updates are desirable to ensure that late information or later available input
data are processed with the correct population information. A disadvantage of correcting the
input data is that previously published estimates are revised when re-running the process
with improved population information. Here, previously published estimates are
estimates published before the errors in the population were detected. This drawback of
unexpected revisions can be limited by

developing a good metadata system, i.e. which data belong to which estimate

using the paradigm that the information derived from the SBR (= source for population
frame) and the backbones “turnover” and “employment” is correct unless otherwise
proven. In other words, population and corresponding input data are corrected only if the
detected errors are certain and influential

relating the timing of incorporating changes in the input data to the revision moments of
the most important statistical outputs.
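The last two measures can be combined into a simple decision rule, sketched below. The function names and the revision dates are hypothetical. How "certain" and "influential" are established is outside the scope of this sketch.

```python
from datetime import date

def should_correct(certain, influential):
    """Source information counts as correct unless an error is certain and influential."""
    return certain and influential

def scheduled_correction(detected_on, revision_moments):
    """Return the first revision moment on or after the detection date, else None."""
    upcoming = sorted(d for d in revision_moments if d >= detected_on)
    return upcoming[0] if upcoming else None

# Hypothetical revision moments of the most important statistical outputs.
revision_moments = [date(2012, 3, 31), date(2012, 9, 30)]

detected_on = date(2012, 5, 14)
if should_correct(certain=True, influential=True):
    apply_on = scheduled_correction(detected_on, revision_moments)
else:
    apply_on = None
```

In this example the correction waits until the revision moment of 30 September 2012, so the resulting revision coincides with one that users already expect.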
Due to time-lags between different data-sources, the revisions caused by corrected
information in the unit base and admin data sources are likely to be larger if the statistical
DataWareHouse covers short-term statistics, too.
6.5 Timing of feedback to the SBR
It has been argued in the previous section that updates in the provisional target population
due to proven and influential errors in the

backbone “population frame”

backbones “turnover” and “employment”

unit base
should be accompanied by updates in the corresponding backbones, too. However, the timing
of these updates should correspond with the timing of the revision moment in the most
important estimates.
The backbone “population frame” and unit base are strongly related to the SBR. Hence, the
SBR should also be updated if the population frame and unit base are updated due to proven
and influential errors. However, the timing of updating the SBR is extremely important, as the
SBR also acts as a frame for survey sampling, including for surveys falling outside the scope
of the statistical DataWareHouse.
The importance of the timing can best be illustrated with an example. If the SBR is used as
sampling frame for an STS-survey of the current year t+1 and the SBR is ‘suddenly’ updated
with information from the statistical DataWareHouse from last year t, a sudden discontinuity
arises in the STS-survey series, and its timing is incorrect. The question is
whether this discontinuity is desirable. The same applies to surveys falling outside the scope
of the statistical DataWareHouse.
Therefore, it is advisable to develop a strategy for correcting information in the SBR. A
possible strategy is:

For the errors with the biggest impact: corrections in the SBR are made simultaneously with the
corrections in the backbone ‘population frame’. However, consultation with the
stakeholders of the most important statistics outside the scope of the statistical
DataWareHouse is strongly recommended.

For less influential errors: corrections in the SBR are carried out at the end of the
calendar year when all surveys are renewed or refreshed.
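As an illustration only, this two-tier strategy can be written as a small scheduling rule. The function name is hypothetical, and the sketch assumes that "end of the calendar year" means 31 December. The recommended stakeholder consultation is not modelled.

```python
from datetime import date

def sbr_correction_date(backbone_correction_date, high_impact):
    """Pick the date on which a correction is carried through to the SBR.

    High-impact errors are corrected together with the backbone 'population
    frame'; less influential errors wait until the end of the calendar year
    (assumed here to be 31 December).
    """
    if high_impact:
        return backbone_correction_date
    return date(backbone_correction_date.year, 12, 31)

# Two hypothetical corrections made in the backbone on the same day.
corrected_in_backbone = date(2012, 6, 15)
immediate = sbr_correction_date(corrected_in_backbone, high_impact=True)
deferred = sbr_correction_date(corrected_in_backbone, high_impact=False)
```

Deferring the less influential corrections keeps the sampling frame stable during the year, so running surveys are not disturbed by them.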
7. Conclusions
Two conditions are required for a successful statistical DataWareHouse. Firstly, the population
must be well defined. Secondly, one (type of) enterprise unit should be used in the statistical
DataWareHouse, because it is, in practice, impossible to create integrated datasets for
several (types of) enterprise units. To meet these conditions, backbones about “population”,
“turnover”, “employment” and “large enterprises” are required. Furthermore, a unit base is
needed to link the different units of all individual data sources to the statistical enterprise unit.
The SBR is an indirect data source for the statistical DataWareHouse. It is an indirect source
because 1) the backbone “population” is derived from it, 2) the unit base is strongly related to
it and 3) the surveys, another important data source for the statistical DataWareHouse, are
based on it. Hence, when errors in the population are revealed after integrating different
data sources, it is desirable that these errors are corrected in the SBR, too. However, the timing
of incorporating these corrections in the SBR is extremely important due to the multiple uses of
SBR-information in data sources within or beyond the scope of the statistical DataWareHouse.