2.2.1 Guidelines on how the BR interacts with the S

ESS-NET ON MICRO DATA LINKING AND DATA WAREHOUSING
IN PRODUCTION OF BUSINESS STATISTICS
WORKING DOCUMENT
Title: Guidelines (incl. options) on how the BR interacts with the S-DWH
WP: 2
Deliverable: 2.2.1
Version: 2.0
Date: 15-3-2013
Author: Pieter Vlag
NSI: Netherlands
Contents
1. Summary
2. Introduction
2.1 Definition of a statistical DataWareHouse (according to the FPA)
2.2 The statistical DataWarehouse: architecture and layers
2.3 Linking different data sources: the population frame
2.4 Relationship between the population frame and the Business Register
3. Statistical units, backbones and default target population
3.1 Statistical units and default target population
3.2 Backbone “population frame”
3.3 Backbones “turnover” and “employment”
3.4 Determination of the default target population
3.5 Updating the population
4. The largest enterprises
5. Linking the datasources to the statistical unit
5.1 Position in the statistical process of a statistical-DWH
5.2 Variation in input units
5.3 Variation in output units
5.4 The statistical unit and the process of a statistical-DWH
5.5 The statistical unit base
6. Correction information in the population frame and feedback to SBR
6.1 The position of the Business Register in a statistical DataWareHouse
6.2 Determination of the default target population in the statistical-DWH
6.3 Panel surveys and correcting the population frame
6.4 Timing of feedback to the unit base and the backbones
6.5 Timing of feedback to the SBR
7. Conclusions
1. Summary
An important characteristic of a statistical datawarehouse is that the data come from different
data sources, like surveys, administrative data, accounting data and census data. However,
depending on the data source these input data may refer to different parts of an enterprise
like the establishment, the legal (enterprise) unit or the enterprise group. Moreover different
data sources may cover different enterprise populations. More precisely, some input data
cover all enterprises with a certain activity, while others cover only big enterprises or other
subpopulations. Hence, to use several input data sources for one statistical estimate, which is
the aim of a statistical DataWareHouse, it is crucial to ensure that these data are linked to the
same enterprise units and are compared with the same target population.
Most statistical institutes use the Statistical Business Register (SBR) to derive a so-called
population frame, which consists of all (enterprise) units with a certain specific activity. We
propose that the statistical DataWareHouse uses only the statistical enterprise unit as unit
when processing the data. The use of only one unit is proposed for the sake of governance and
clarity. The default target population in a statistical DataWareHouse is defined as the statistical
enterprise units which have been active during the reference period. This default target
population is proposed because it corresponds with the output requirements of the
European regulations. It can also be used to derive estimates
about subpopulations (like large or small enterprises only) in a statistical DataWareHouse. It
should be noted that the population frame, and not the Business Register as a whole, is an
input source for the statistical DataWareHouse.
Errors might be detected in statistical units and target populations when linking other data
sources to this information. A frequently expected error concerns the NACE-code, which
classifies the kind of activity. If influential, these errors need to be corrected in the statistical
DataWareHouse. As the different source data arrive, in practice, at different times, data-linking
will be done source by source and not for all sources at once. Hence, to keep the
process manageable, one has to decide at which step in the statistical process errors in statistical
units and the target population have to be corrected in the DataWareHouse. Another question
is when this corrected information has to be fed back to the SBR. In this document it will
be argued that this information could be corrected at the end of the processing phase and that
feedback to the SBR needs to take place once a year, unless the detected errors are really
influential.
2. Introduction
2.1 Definition of a statistical DataWareHouse (according to the FPA)
The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare
recommendations about better use of data that already exist in the statistical system. Its
ultimate aim is:
‘To create fully integrated data sets for enterprise and trade statistics at micro level:
a 'data warehouse' approach to statistics.’
The broad definition of a data warehouse to be used in this ESSnet is therefore:
‘A common conceptual model for managing all available data of interest, enabling the NSI to
(re)use this data to create new data/new outputs, to produce the necessary information and
perform reporting and analysis, regardless of the data’s source.’
One workpackage in this ESSnet (WP 2) covers all essential methodological elements for
designing, building and implementing the statistical data warehouse.
2.2 The statistical DataWarehouse: architecture and layers
Another workpackage (WP 3) covers all essential architectural and technical elements for
designing, building and implementing the statistical DataWareHouse (statistical-DWH).
Basically this workpackage has linked the GSBPM sub-processes to the statistical-DWH
concept in order to provide a Business Architecture for the statistical-DWH. Moreover, it has
proposed a modular workflow for the statistical-DWH in order to manage the information
flow between data sources and the central administration of a DataWareHouse. To do this, it
uses four functional layers:
• data source layer,
• integration layer,
• interpretation and data analysis layer,
• data presentation layer.
Figure 1 shows the GSBPM model. Figure 2 shows the relationship between the phases of the
statistical process as defined by the GSBPM and the functional layers as proposed by the
workpackage 3 team.
Note that statistical (enterprise) units, which are needed to link several input data, and the
target population, which is needed to relate the input data to statistical estimates, play an
important role in the processing phase of the GSBPM. This processing phase corresponds
with the integration layer of the statistical-DWH.
In the next chapters of this document, we’ll discuss more precisely at which steps populations
and statistical units play a crucial role.
Figure 1 A schematic sketch of the GSBPM (Generic Statistical Business Process Model). Note that
the GSBPM divides the statistical process into 9 phases. These phases are divided into subprocesses.
Figure 2 Relationships between the layers of a statistical DataWareHouse and the statistical processes
according to the GSBPM (Generic Statistical Business Process Model).
2.3 Linking different data sources: the population frame
The aim of a statistical-DWH is to create a set of fully integrated data about enterprises, which
enables a statistical institute to produce flexible and consistent output. The integrated data
come from different data sources. These data sources are collected in the collection phase of
the “Business Architecture” (fig. 2).
In practice, different data sources cover different populations in most countries. For
example, the Value Added Tax (VAT) data and corporate tax data do not include the smallest
enterprises, but are quasi-complete for the part of the enterprise population they cover. Survey
samples include information about the smallest enterprises but generally provide data for only
a limited number of small enterprises in a country. Hence, linking data of several sources is
not only a matter of linking but also a matter of relating all input data to a reference, the
so-called default target population. Another factor one has to take into account is that different
sources may have different units. For example, surveys in the Netherlands are based on
statistical units (which generally correspond with legal units), while VAT-units are based on
enterprise groups. Hence, when linking VAT-data and survey-data to the target population, it
has to be agreed to which unit these data are linked. Similar examples can be given for other
countries.
Summarising, when linking several input data in a statistical-DWH, one has to agree about
• the default target population, i.e. the reference frame to which all data sources are
linked.
• the enterprise unit to which all input data are matched.
These questions will be addressed in this deliverable. The technical aspects about linking of
several data-sources are described in deliverable 2.3 of the ESSnet on DataWareHousing
(DWH).
2.4 Relationship between the population frame and the Business Register
Member States of the European Union maintain business registers for statistical purposes as a
tool for the preparation and coordination of surveys, as a source of information for the
statistical analysis of the business population and its demography, for the use of
administrative data, and for the identification and construction of statistical units. The
Regulation (EC) No 177/2008 of the European Parliament and the Council (EC) sets out a
common framework for the harmonisation of the national business registers for statistical
purposes and Article 7 of the Regulation asks for the publication of a business register
recommendation manual. The manual aims to explain the reasoning behind the provisions of
the Regulation. It aims to provide the extra information required for the correct and consistent
interpretation of the Regulation in all countries. This second edition of the manual is derived
from the first one published in 2003 and replaces it. The manual has been updated in close
cooperation with the Member States.
The Regulation and the manual implicitly require that a business register contains at least
• a statistical unit
• a name and address of the statistical unit
• an activity code (NACE)
• a starting and a stopping date of enterprises.
The implication for the statistical-DWH is that the required information about the reference or
population frame, i.e. unit and default target population (see section 2.3), can be derived
from the SBR. Hence, the SBR is indirectly a crucial input for the statistical DataWareHouse.
The complete statistical Business Register itself is not necessarily an input source for the
statistical DataWarehouse; the population frame derived from it is.
3. Statistical units, backbones and default target population
3.1 Statistical units and default target population
Taking into account the expected recommendations of the ESSnet on Consistency and the
European regulations about the Business Register, it is proposed that the statistical enterprise
unit is the only unit to which all data sources are linked in the statistical DataWareHouse.
Taking into account the SBS-regulations, we propose that the default target population is
defined as: all enterprises with a certain kind of activity being economically active during the
reference period. The NACE-code is used to classify the kind of activity. In other words, for
annual statistics this means that the default target population consists of all enterprises active
during the year, including the starters and stoppers (and the new/stopping units due to
merging and splitting companies).
Note that this document uses the term default target population. This population corresponds
with the target population of several important obligatory statistics, like SBS. Furthermore, it
is the largest possible population for a reference period because it includes all enterprises with
some economic activity during (part of) the period. We propose that the default target
population is used as the standard to check, link, clean and weight the input data in the
processing phase of the statistical-DWH (see figs. 2/3). On the other hand, one aim of a
data warehouse is to produce flexible output. Therefore, the statistical-DWH should be able to
produce estimates about subpopulations of this standard, too.
Examples of subpopulations are: large or small enterprises only, or all enterprises active at a
certain date. From a purely methodological point of view, these subpopulations have a target
population when estimating population totals, too. Therefore, some confusion about the term
target population in a statistical DataWareHouse may arise. To prevent this confusion,
• the term default target population is used when explicitly referring to the standard, i.e. all
enterprises being active.
• the term target population has a broader definition. It refers to estimation populations for
both the standard and all subpopulations.
Checking, cleaning, integrating and weighting the input data in a statistical DataWareHouse
(SDWH) are further discussed in chapter 5 of this deliverable. They are more extensively
discussed in other deliverables of the ESSnet on DataWareHousing.
3.2 Backbone “population frame”
To determine the default target population in the SDWH, two kinds of data sources are needed:
• the population frame, i.e. a list of enterprises with a certain kind of activity during a
period.
• information to determine which enterprises of the list really performed economic
activities during a period.
As previously mentioned, the population frame is derived from the SBR. This population
frame consists of all enterprises being in the SBR during the year, regardless of whether they are
active or not. This input source is a backbone for the statistical-DWH and will be simply called
‘population frame’ in the remainder of this document. To derive activity status and
subpopulations, it is recommended that this backbone includes the following information:
1) the frame reference year
2) the statistical enterprise unit, including its national ID and its EGR ID [1]
3) the name and address of the enterprise
4) the national identification number (ID) of the enterprise
5) the date in population (mm/yr)
6) the date out of population (mm/yr)
7) the NACE-code
8) the institutional sector code
9) a size class [2]
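As an illustration only, the recommended content of the population frame backbone could be held in a record like the one sketched below. The class name FrameRecord, all field names and the helper in_frame_during are hypothetical and are not prescribed by this deliverable. The sketch also assumes that a unit counts as being in the frame for a year if it was in the population for at least one month of that year.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

YearMonth = Tuple[int, int]  # (year, month), mirroring the mm/yr dates in the list


@dataclass(frozen=True)
class FrameRecord:
    """One row of the population frame backbone (hypothetical layout)."""
    frame_year: int                  # 1) frame reference year
    unit_id: str                     # 2) national ID of the statistical enterprise unit
    egr_id: Optional[str]            # 2) EGR ID, if available
    name: str                        # 3) name of the enterprise
    address: str                     # 3) address of the enterprise
    national_id: str                 # 4) national identification number
    date_in: YearMonth               # 5) date in population
    date_out: Optional[YearMonth]    # 6) date out of population, None if still present
    nace_code: str                   # 7) NACE-code
    sector_code: str                 # 8) institutional sector code
    size_class: str                  # 9) size class

    def in_frame_during(self, year: int) -> bool:
        """True if the unit was in the population for at least one month of `year`."""
        entered_by_year_end = self.date_in <= (year, 12)
        left_before_year = self.date_out is not None and self.date_out < (year, 1)
        return entered_by_year_end and not left_before_year


record = FrameRecord(2012, "NL001", "EGR9", "Example BV", "Street 1",
                     "NL001", (2011, 5), (2012, 3), "4711", "S11", "small")
print(record.in_frame_during(2012))  # the unit left in March 2012, so it is still in the 2012 frame
```

Note that such a record says nothing about whether the unit was economically active; that has to come from other sources.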
Note that the population frame is crucial information to determine the default target
population. Only the activity status of enterprises cannot be derived from this backbone itself,
mainly because it can often only be determined afterwards. To estimate whether enterprises really
carried out economic activities, a comparison with VAT and/or employment data is done.
This will be discussed in the next section (3.3). Sections 3.4 and 3.5 discuss how the
population frame and default target population can be determined in two specific cases:
• the statistical-DWH is limited to annual statistics
• the statistical-DWH includes short-term statistics, too
3.3 Backbones “turnover” and “employment”
[1] Meaningless ID assigned by the EGR system to enterprises. It is advised to include this ID in the
DataWareHouse to enable comparability between the country-specific estimates.
[2] Could be based on employment data.
The results of the ESSnet on AdminData showed that VAT and social security data can be
used for turnover and employment estimates when quasi-complete. The latter is the case for
annual statistics and for quarterly statistics in most European countries on the continent. Note,
however, that VAT and social security data can only be used for statistical purposes if a) the
data transfer from the tax office to the statistical institute is guaranteed and b) the link with
the SBR is established. Assuming that these conditions are fulfilled, it is proposed to include
backbones of VAT-data and employment data as input data in a statistical-DWH.
VAT and social security data cover almost all enterprises in the domain covered by the SBS and
STS-regulations and are available in a timely manner (i.e. earlier than most annual statistics).
Therefore, two reasons exist to include these data sources in a statistical-DWH and consider
them as backbones.
1. VAT and social security data are crucial to determine the activity status of the enterprises
and implicitly to determine the default target population.
2. VAT and social security data are crucial to create a fully integrated dataset suitable for
flexible outputs, because these admin data sources contain information about almost all
enterprises (unlike surveys, which contain only information of a small sample of
enterprises).
The latter reason is explained further in the remainder of this section. When (quasi) complete,
VAT and social security data can be used to produce good-quality estimates of turnover and
employment. Therefore, these estimates can be used as benchmarks when incorporating results
of survey sampling in a statistical DataWareHouse. In this case totals of turnover and
employment define, together with the number of active enterprises, the basic population
characteristics. Other datasets or surveys covering more specific parts of the population should
be made consistent with these three main characteristics of the entire population. This
improves the quality of the integrated dataset in a statistical-DWH because more auxiliary
information is used when weighting survey results (or other datasets) or when imputing for
missing values. This quality improvement can be quantified by the reduction of variance in
variables not derived from the backbones. Every estimator has its variance (i.e. uncertainty);
many studies in the literature have shown that estimates derived from weighting techniques using
auxiliary information (e.g. ratio or GREG-type estimators) generally have lower sampling errors
than estimates using no auxiliary information when weighting.
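The effect of auxiliary information on weighting can be illustrated with a minimal sketch that compares a simple expansion estimator with a ratio estimator. The sketch assumes simple random sampling and a backbone turnover total that is known for the whole population. The function names and the toy figures are hypothetical, and the toy data are exactly proportional to turnover, so the gain shown is larger than would be expected in practice.

```python
from typing import Sequence


def expansion_estimate(sample_y: Sequence[float], population_size: int) -> float:
    """Estimate a population total without auxiliary information: N times the sample mean."""
    return population_size * sum(sample_y) / len(sample_y)


def ratio_estimate(sample_y: Sequence[float], sample_x: Sequence[float],
                   x_total: float) -> float:
    """Estimate a population total using the known backbone total x_total
    (e.g. VAT turnover) as auxiliary information."""
    return x_total * sum(sample_y) / sum(sample_x)


# Toy population of 5 enterprises: turnover x is known for all of them from the
# backbone, the target variable y is only observed in the sample.
turnover = [100, 200, 300, 400, 1000]       # backbone values, total 2000
target = [10, 20, 30, 40, 100]              # true total 200, unknown in practice

sample_units = [0, 1]                       # the two smallest enterprises were sampled
sample_x = [turnover[i] for i in sample_units]
sample_y = [target[i] for i in sample_units]

print(expansion_estimate(sample_y, len(turnover)))       # 75.0
print(ratio_estimate(sample_y, sample_x, sum(turnover)))  # 200.0
```

In this toy case the ratio estimator recovers the true total because the sample is calibrated on the backbone turnover, whereas the expansion estimator is far off because the sample happens to contain only small enterprises.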
Summarising, using backbones for turnover and employment in addition to the population
frame
• improves the quality of a fully integrated dataset built from several input data, as two key
variables can be estimated precisely.
• reduces the impact of sampling errors on other variables not observed in the backbones.
As the first point is the aim of a statistical DataWareHouse and the second point is
required to produce flexible output (especially about subgroups of the default target
population), this is one reason to use backbones of employment and turnover in a
statistical-DWH.
Several NSIs use VAT and social security data to determine the activity status of an
enterprise. More precisely, enterprises are considered as active if VAT and/or social security
data are available for the reference period or the previous period (in case of late VAT or late
social security data). This method is preferred over a survey to determine the activity status,
because the latter might be biased due to high non-response rates among the stopped
enterprises. Summarising, VAT and social security data are crucial to determine whether an
enterprise has been economically active or not. Hence, backbones of turnover and employment
are crucial to determine the default target population. This is the other reason to include
these backbones in a statistical-DWH.
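The activity rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular NSI: periods are represented as consecutive integers (for example quarter numbers), and the function name is_active and the input layout are hypothetical.

```python
from typing import Dict, Set


def is_active(unit_id: str, period: int,
              vat_periods: Dict[str, Set[int]],
              social_security_periods: Dict[str, Set[int]]) -> bool:
    """Return True if VAT and/or social security data exist for the unit
    in the reference period or in the previous period (late declarations)."""
    reported = (vat_periods.get(unit_id, set())
                | social_security_periods.get(unit_id, set()))
    return bool(reported & {period, period - 1})


# Hypothetical input: unit ID -> periods for which a declaration was received.
vat = {"A": {7, 8}, "B": {5}}
social_security = {"C": {7}}

for unit in ["A", "B", "C", "D"]:
    print(unit, is_active(unit, 8, vat, social_security))
# A is active (VAT in period 8), B is not (last VAT in period 5),
# C is active (social security in period 7), D has no admin data at all.
```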
Several countries have incorporated VAT and social security data in the SBR already. Even in
this case, it is proposed to consider both admin data sources as separate backbones in a
statistical-DWH that is still to be developed. The reasons are twofold:
• VAT and social security data are essential factors in the estimation process within a statistical
DataWareHouse. For the sake of transparency, crucial decisions about linking these data with
the SBR should preferably be taken within the statistical DataWareHouse and not
outside.
• SBR, VAT and social security data may provide conflicting information, especially
- in short-term statistics (STS),
- when using a frozen population frame,
- when using data sources for the SBR that are not regularly updated.
For the sake of transparency, crucial decisions about dealing with such contradictory
information should preferably be taken within the statistical-DWH and not outside,
especially because other data sources of the data warehouse may help to decide which
data source is correct in case of conflicting information.
A schematic sketch of the positions of the backbones in a statistical-DWH is provided in
figures 3 and 4.
Figure 3 Illustration of the position of the SBR and the backbones in a statistical-DWH. Note that the
population frame, one of the backbones, is derived from the SBR. The backbones “population”, admin data
based turnover (VAT) and admin data based employment are used to describe the population
characteristics. All other data sources are integrated with these characteristics at the beginning of the
processing phase. These backbones are also used for weighting when producing outputs at the end of
the processing phase.
Figure 4 The position of the SBR in case the VAT and employment data are completely integrated in
the SBR. Note that in this case, crucial parts of the data-linking are carried out outside the statistical-DWH.
3.4 Determination of the default target population
3.4.1 Case 1: the Statistical DataWareHouse is limited to annual statistics
The determination of the default target population is relatively easy if the scope of the
statistical DataWareHouse is limited to annual statistics, because
the required information from the SBR and administrative data (VAT + employment) can be
selected afterwards, i.e. when the year has finished. This is because survey designs, survey
results and other data sources with annual data (like accountancy data + combined results of
four quarters) become available after the year has ended. As a result, no provisional
populations are needed to link provisional data during the calendar year.
Therefore, the default target population can be determined by
• selecting all enterprises which are recorded in the SBR during the reference year
• using the complete annual VAT and employment (admin) dataset to determine the
activity status.
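The two steps above can be sketched as a frame selection followed by an intersection with the units that appear in the annual admin data. This is a minimal sketch: the function name, the representation of the SBR as a mapping from unit ID to first and last year of registration, and the use of plain sets for the admin data are assumptions made for illustration only.

```python
from typing import Dict, Optional, Set, Tuple


def annual_default_target_population(
        sbr_years: Dict[str, Tuple[int, Optional[int]]],
        reference_year: int,
        vat_units: Set[str],
        employment_units: Set[str]) -> Set[str]:
    """Units recorded in the SBR during the reference year that also appear
    in the annual VAT and/or employment data for that year."""
    # Step 1: population frame = all units in the SBR during (part of) the year.
    in_frame = {
        unit_id
        for unit_id, (first_year, last_year) in sbr_years.items()
        if first_year <= reference_year
        and (last_year is None or last_year >= reference_year)
    }
    # Step 2: activity status from the complete annual admin data.
    active = vat_units | employment_units
    return in_frame & active


# Hypothetical SBR extract: unit ID -> (first year in SBR, last year in SBR or None).
sbr = {"A": (2005, None), "B": (2012, 2012), "C": (2000, 2011), "D": (2010, None)}
vat_2012 = {"A", "B", "C"}
employment_2012 = {"B"}

print(sorted(annual_default_target_population(sbr, 2012, vat_2012, employment_2012)))
# ['A', 'B']: C left the SBR before 2012, D is in the frame but shows no activity.
```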
3.4.2 Case 2: the Statistical DataWareHouse includes short-term statistics
The determination of the default target population becomes more complicated when results of
short-term statistics are included in the statistical DataWareHouse. In this case a provisional
population frame for reference year t should be constructed at the end of year t-1, i.e. in
November or December. This population frame is used to design short-term surveys. It is also
the starting point for the statistical-DWH. This provisional frame is called release 1. Formally,
it does not cover the entire population of year t, as it does not yet contain the starting
enterprises.
During the year the population frame is regularly updated with new information from the SBR
(especially new enterprises) and the administrative data (VAT + social security data). The
frequency of these updates depends on the updates of the SBR, VAT and social security data.
At the end of year t (or at the beginning of year t+1), a regular population frame for year t can
be constructed. This regular population frame consists of all enterprises in the year and is
called release 2.
The ESSnet on Administrative Data has observed that time-lags exist between the
registration of starting/stopping enterprises in the SBR and the different admin data sources.
The impact of these time-lags differs per country, because it depends on
• the updates of the SBR, VAT and social security data
• the quality of the underlying data sources
Despite the different impact of the time-lags, the ESSnet on AdminData has shown that these
time-lags exist in every country and lead to revisions in estimates about active enterprises
on a monthly and quarterly basis. This effect is enhanced because the admin data are not
entirely complete on a quarterly basis. These time-lag and incompleteness issues might be a
consideration for choosing a low frequency for updating the population frame in a statistical
DataWareHouse. For example, quarterly and/or bi-annual updates could be considered.
3.5 Updating the population
At the beginning of year t+1 (or later) additional admin data and survey results for reference
year t become available. Therefore, it cannot be excluded that errors are detected in the
‘release 2’ population determined at the end of the reference year. For this reason, a special
procedure for additional frame error corrections should be developed and a final population
frame is foreseen for July of year t+1. How these updates should be incorporated in the
statistical DataWareHouse and the SBR will be discussed in chapter 6.
This updating scheme is schematically presented in figure 5.
[Figure 5 (diagram not reproduced): timeline of Releases 1 to 4 of the population frame, with the boxes
“FATS frame population of active units year T”, “FATS survey population year T” and “FATS frame
population of active units year T (revised)”, and the labels “Undercoverage”, “Overcoverage”,
“In both frame populations” and “Frame error procedure”.]
Figure 5 Proposal for the construction of an annual population in case the statistical-DWH includes short-term
statistics. Figure copied from van der Ven, 2012. Please note that the release 2 in this figure is skipped for the
SDWH-procedure.
4. The largest enterprises
An increasing number of national statistical institutes (NSIs) have created a team which is
responsible for creating a fully integrated dataset for the largest enterprises or largest enterprise
groups which dominate the economy. In contrast to other enterprises, these large enterprises
are systematically analysed at micro level and their data are made consistent across data sources.
The main motivation to create such a team is the large contribution of the largest enterprises to
the economy and to some specific statistical estimates.
If such a team, or such an integrated dataset for the largest enterprises, exists, it could be
considered as a backbone ‘largest enterprises’. The reason for this consideration is that the
dataset
• is already integrated
• covers all enterprises of a subpopulation (the largest enterprises)
• is of good quality and does not need to be re-analysed
• covers a considerable part of the total estimates, so that integrating other data sources with
this backbone increases the quality of the integrated dataset.
This backbone is an input for the statistical DataWareHouse and should be linked to the
population frame in the first step of the integration phase of the statistical DataWareHouse
(GSBPM-step 5.1 – see figure 2). The remainder of the process is the same for the largest
enterprises as for all other enterprises.
Figure 6 Illustration of the position of the backbones in a statistical-DWH when the data of the largest
enterprises are integrated at the source.
5. Linking the datasources to the statistical unit
5.1 Position in the statistical process of a statistical-DWH
Data-linking between the different sources is the first step in the processing phase of a
statistical DataWareHouse. As the population frame consists of the statistical enterprise unit
only, this step can be described more precisely as linking the input data to the statistical units.
Technical aspects of data-linking are described in deliverable 2.4 of the ESSnet on
DataWareHousing. The following sections of this chapter address the question of which
information is required to link the several input sources to the statistical unit.
5.2 Variation in input units
Accountancy data, tax data (including VAT and social security data) and other data may be
reported for different parts within an enterprise group. These data might be reported for the
enterprise group as a whole, the underlying legal units, or tax units
consisting of other parts of the enterprise group. The variation in units and the challenge of
linking them depend on the national legislation. Therefore, the impact of this issue differs
per country. The size of the enterprise also determines the variation in units and the
complexity of linking them. For small enterprises one-to-one relationships between the
different units can be assumed, but this assumption cannot be made for medium-sized
enterprises. Nevertheless, whatever the extent of this issue in the individual countries and
whatever the determination of the statistical unit, it cannot be taken for granted that all input
data link automatically to the statistical unit. Hence, the relationship between these ‘input’
units and the statistical unit should be known before the data can be linked.
Data-linking is of less importance when using surveys, because surveys are generally based
on statistical units as they are designed from information of the SBR.
5.3 Variation in output units
Most statistical estimates in enterprise statistics are produced on the statistical unit. Examples
are SBS, STS and most institutional statistics. However, some output is produced on different
units like local units, LKAU, KAU or enterprise groups. Again, the complexity of linking
these units depends on the country and the size of the enterprises. Nevertheless, one-to-one
relationships between these output units and the statistical enterprise unit cannot be taken for
granted. Hence, relationships between the ‘output’ units and the statistical units should be
known before flexible outputs can be generated.
5.4 The statistical unit and the process of a statistical-DWH
The simplest and most transparent statistical process can be achieved by
• Linking all input sources to the statistical enterprise unit at the beginning of the
processing phase (GSBPM-step 5.1 – see figures 2,3).
• Performing data cleaning, plausibility checks and data-integration on statistical units only
(GSBPM steps 5.2-5.6).
• Producing statistical output (GSBPM-steps 5.7-5.8) by default on the statistical unit and
the default target population. Flexible outputs on other target populations and other units
are also produced in these steps by using repeated weighting techniques and/or domain
estimates. Technical aspects of these estimation methods are described in deliverable 2.8
of the ESSnet on DataWareHousing.
Note that it is theoretically possible to perform data-analyses and data-cleaning on several
units simultaneously. However, experiences of Statistics Netherlands with cleaning VAT-data
on statistical units and ‘implementing’ these changes on the original VAT-units too, reveal
that the statistical process becomes quite complex. Therefore, it is proposed that
• linking to the statistical units is carried out at the beginning of the processing phase only.
• the creation of a fully integrated dataset is performed for statistical units only
• statistical estimates for other units are produced at the end of the processing phase only
• relationships between the different in- and output units on one hand and the statistical
enterprise unit on the other hand should be known (or estimated) beforehand.
5.5 The statistical unit base
As the relationship between the different in- and output units on one hand and the statistical
enterprise units on the other hand should be known (or estimated) before the processing
phase, it is recommended to include this information in a separate input source. This input
source is the so-called unit base. It describes the relationships between the different units and
• the statistical enterprise unit, which is used in the process of a statistical-DWH
• the enterprise group, which is the base for tax and legal units in some countries, such as the
Netherlands.
The relationship between the enterprise group and the statistical (enterprise) unit should of
course also be included in the unit base. The relationships between the different units are not
necessarily one-to-one. If one unit of a data source corresponds to
several statistical units, the ‘observed’ values of that data source need to be allocated
(“subdivided”) correctly over these statistical units before they can be processed further in
the statistical-DWH. To make this possible, it is recommended to include in the unit base an
estimated share of each unit (including the statistical unit) in its enterprise group. This share
may be based on employment or turnover.
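The allocation step described above can be sketched as follows. This is only an illustration under stated assumptions: the unit identifiers, the layout of the unit base as (source unit, statistical unit, share) rows and the function name `allocate` are hypothetical, and the shares are assumed to sum to 1 for each source unit.

```python
from collections import defaultdict

# Hypothetical unit base rows: (source unit id, statistical unit id, share).
# The shares are assumed to be estimated from employment or turnover and to
# sum to 1 for each source unit.
UNIT_BASE = [
    ("VAT-001", "ENT-A", 0.6),
    ("VAT-001", "ENT-B", 0.4),
    ("VAT-002", "ENT-C", 1.0),
]


def allocate(observed, unit_base, tolerance=1e-9):
    """Subdivide values observed on source units over statistical units."""
    links = defaultdict(list)
    for source_unit, stat_unit, share in unit_base:
        links[source_unit].append((stat_unit, share))

    allocated = defaultdict(float)
    for source_unit, value in observed.items():
        if source_unit not in links:
            raise KeyError(f"{source_unit} is not in the unit base")
        total_share = sum(share for _, share in links[source_unit])
        if abs(total_share - 1.0) > tolerance:
            raise ValueError(f"shares of {source_unit} sum to {total_share}")
        # Each statistical unit receives its share of the observed value.
        for stat_unit, share in links[source_unit]:
            allocated[stat_unit] += value * share
    return dict(allocated)


# Illustrative turnover values observed on VAT units (not real data).
turnover_by_vat_unit = {"VAT-001": 1000.0, "VAT-002": 250.0}
turnover_by_enterprise = allocate(turnover_by_vat_unit, UNIT_BASE)
```

In this sketch a source unit that is missing from the unit base raises an error instead of being dropped silently, so that gaps in the unit base become visible before further processing.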
The unit base can be subdivided into ‘input’ units, used to link the different datasets to the
statistical unit at the beginning of the processing phase (GSBPM step 5.1: “integrate data”), and
‘output’ units, used to produce output on units other than the statistical enterprise at the end of
the processing phase (GSBPM steps 5.7/5.8: “calculate aggregates”).
Figure 7 illustrates the content of a unit base.
Note that the exact contents of the unit base depend on
• legislation per country
• output requirements and desired output of a statistical DataWareHouse
• available input data.
It should also be mentioned that the unit base might be updated when other input data (with
different units) become available.
The unit base is closely related to the SBR. However, as its contents are also closely related to
the available input data, we recommend considering it as a separate input source. The use of a
unit base is preferred over incorporating the statistical unit in each data source. First of all,
the latter means that the data-linking is done implicitly in the collection phase of a statistical
DataWareHouse. Secondly, it is much more efficient and transparent to store the relationships
between the different units in one source. This is especially the case
• when deficiencies in the data-linking process are detected in a later phase of the statistical
process and these deficiencies lead to corrections in the earlier determined relationships
between the units
• when additional new data sources are used for the statistical-DWH.
Figure 7 Example of a unit base.
6. Correction information in the population frame and feedback to SBR
6.1 The position of the Business Register in a statistical DataWareHouse
The position of the SBR in a statistical DataWareHouse is three-fold. More precisely,
• the SBR is the input source for the backbone ‘population frame’ of the statistical-DWH
• the SBR is closely related to the unit base
• the SBR is the sampling frame for the surveys. Surveys are another important data source
of the statistical-DWH.
The last point implies that deficiencies in the population frame, which might be detected
during the statistical process, should also be corrected in the SBR. Otherwise, the same
deficiencies will return in the survey results of the next period.
The key questions are:
• at which step of the statistical process should the population frame be corrected when
deficiencies are detected?
• when are the same corrections made in the backbones?
• when are the same corrections made in the SBR (which is a source of some backbones)?
The position of the SBR and its relationship with the backbones, unit base and surveys are
illustrated in Figure 8.
Figure 8 Illustration of the position of the SBR within a statistical-DWH. The SBR is basically related
to three important input data of the S-DWH; the population frame, unit base and the surveys. This
figure also shows the position of a) data-integration and b) ‘weighting/calculation of aggregates’ in the
statistical process. It shows at which step in the statistical process the feedback to the SBR preferably
takes place. Note that backbones are denoted by brown cylinders and other input data by grey
cylinders.
6.2 Determination of the default target population in the statistical-DWH
As previously mentioned, the backbones ‘population frame’, ‘turnover’ and ‘employment’
are used for the determination of the default target population. As these backbones are linked
to each other at the beginning of the processing phase (GSBPM step 5.1), the determination of
the default target population can take place here. We call this the provisional default target
population.
This provisional default target population can be used for checking, cleaning and integrating
the data of all other data sources at micro-level. During these steps, conflicting information
between the data sources might be detected (in practice: will probably be detected).
Conflicting information may in extremis lead to the conclusion that the provisional
default target population contains errors. Deliverable 2.8 of the ESSnet on DataWareHousing
addresses the question of how this conclusion might be drawn, because it deals
with the hierarchy between the different data sources.
Whatever the methodology for detection, errors in the provisional default target population
might have three possible origins. More specifically, they may be related
• to errors in the data-linking
• to errors in the VAT and/or employment data
• to errors in the population frame.
The first two points result in an erroneous estimation of the activity status and therefore of the
number of active enterprises. It is expected that most errors in the population frame are related
to errors in
• the NACE-code
• the size class of the enterprise.
In other words, other data sources like surveys and admin data indicate that the enterprise has
either another activity than recorded in the SBR or another size than recorded in the SBR.
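As an illustration of how such conflicting information could be surfaced, the sketch below compares the NACE-code and size class recorded in the SBR with those indicated by another data source. The record layout, field names, example values and the function `find_conflicts` are hypothetical. Which source prevails is not decided here; that depends on the hierarchy between data sources.

```python
# Hypothetical records keyed by statistical unit id (illustrative values only).
sbr = {
    "ENT-A": {"nace": "47.11", "size_class": 3},
    "ENT-B": {"nace": "62.01", "size_class": 1},
}
other_source = {
    "ENT-A": {"nace": "46.39", "size_class": 3},
    "ENT-B": {"nace": "62.01", "size_class": 2},
}


def find_conflicts(sbr_records, other_records, fields=("nace", "size_class")):
    """List (unit, field, SBR value, other value) for every disagreement."""
    conflicts = []
    # Only units present in both sources can be compared.
    for unit_id in sorted(sbr_records.keys() & other_records.keys()):
        for field in fields:
            sbr_value = sbr_records[unit_id][field]
            other_value = other_records[unit_id][field]
            if sbr_value != other_value:
                conflicts.append((unit_id, field, sbr_value, other_value))
    return conflicts


conflicts = find_conflicts(sbr, other_source)
```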
If the SBR, unit base and VAT/employment backbones are of good quality, the number of
errors in the provisional default target population should be limited. Moreover, data-cleaning
and data-integration at micro-level are basically independent of the number of active enterprises,
NACE-code and size class. Therefore, it is proposed to use the provisional default target
population for these steps, even after errors have been detected. Another reason for this
proposal is that errors might be detected at several stages of the data-cleaning and integration
process. Therefore, it is preferred to first collect all errors detected in this part of the process
before correcting them in the population.
Errors in the provisional default target population might become influential when weighting
the integrated microdata and calculating aggregates at the end of the processing phase.
Therefore, it is recommended to correct all errors in the population just before performing these
steps. Hence the provisional default target population is replaced by the default target
population at (the beginning of) GSBPM step 5.7 (“weighting”).
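The proposal to collect errors first and correct them only just before weighting can be sketched as follows. The record layout, the error log as (unit, field, corrected value) entries and the function `apply_corrections` are assumptions made for illustration only.

```python
import copy


def apply_corrections(provisional_population, error_log):
    """Return the default target population: the provisional population
    with all logged corrections applied in the order they were recorded."""
    population = copy.deepcopy(provisional_population)
    for unit_id, field, corrected_value in error_log:
        population[unit_id][field] = corrected_value
    return population


# Provisional default target population, fixed at GSBPM step 5.1
# (illustrative values only).
provisional = {
    "ENT-A": {"nace": "47.11", "size_class": 3, "active": True},
    "ENT-B": {"nace": "62.01", "size_class": 1, "active": True},
}

# Errors detected during GSBPM steps 5.2-5.6 are only logged; the
# provisional population itself stays unchanged during these steps.
error_log = []
error_log.append(("ENT-A", "nace", "46.39"))
error_log.append(("ENT-B", "active", False))

# At the beginning of GSBPM step 5.7 all logged corrections are applied at once.
default_population = apply_corrections(provisional, error_log)
```

Working on a copy keeps the provisional population available, which makes it possible to trace afterwards which corrections separate it from the default target population.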
6.3 Panel surveys and correcting the population frame
Surveys are another data source to detect errors in the default target population. In particular,
surveys about produced goods, performed services and investments can be very useful to
detect errors in the NACE-code. However, corrections of NACE-codes etc. should not lead to
selectivity in the quality of the SBR. Selectivity means in this context that some
parts of the SBR are of better quality than others because they are surveyed. To prevent this
drawback, one should be very cautious in the use of (limited) panel survey data to correct
information in the SBR. When using panel surveys, erroneous information leading to
incorrect estimates should preferably be treated as outliers. This warning does not apply to
admin data, because these data cover almost the entire population.
6.4 Timing of feedback to the unit base and the backbones
The unit base and the backbones “population frame”, “turnover” and “employment” have a
crucial role in the linking and estimation process. Therefore, it is advisable that if the default
target population is updated due to errors in these sources, these sources themselves are
updated too. These updates are desirable to ensure that late information or later available input
data are processed with the correct population information. A disadvantage of correcting the
input data is that previously published estimates are revised when the process is re-run
with the improved population information. Here, previously published estimates are the
estimates published before the errors in the population were detected. This drawback of
unexpected revisions can be limited by
• developing a good metadata system, i.e. recording which data belong to which estimate
• using the paradigm that the information derived from the SBR (= source for the population
frame) and the backbones “turnover” and “employment” is correct unless proven
otherwise. In other words, the population and corresponding input data are corrected only if the
detected errors are certain and influential
• relating the timing of incorporating changes in the input data to the revision moments of
the most important statistical outputs.
Due to time-lags between different data sources, the revisions caused by corrected
information in the unit base and admin data sources are likely to be larger if the statistical
DataWareHouse covers short-term statistics, too.
6.5 Timing of feedback to the SBR
It has been argued in the previous section that updates in the provisional target populations
due to proven and influential errors in the
• backbone “population frame”
• backbones “turnover” and “employment”
• unit base
should be accompanied by updates in the corresponding backbones, too. However, the timing
of these updates should correspond with the revision moments of the most
important estimates.
The population frame and unit base are strongly related to the SBR. Hence, the SBR should
also be updated, if the population frame and unit base are updated due to proven and
influential errors. However, the timing of updating the SBR is extremely important as the
SBR also acts as a frame for survey sampling including for surveys falling outside the scope
of the statistical DWH.
The importance of the timing can best be illustrated with an example. If the SBR is used as the
sampling frame for an STS-survey of the current year t+1 and the SBR is ‘suddenly’ updated
with information from the statistical-DWH for the previous year t, a sudden and, as far as
timing is concerned, incorrect discontinuity arises in the STS-survey series. The question is whether this
discontinuity is desirable. The same applies to surveys falling outside the scope of the
statistical DataWareHouse.
Therefore, it is advisable to develop a strategy for correcting information in the SBR. A
possible strategy is:
• For the errors with the biggest impact: correcting the population frame and the SBR at the
same time (and as soon as possible). However, consultation with the stakeholders of the
most important statistics outside the scope of the statistical DataWareHouse is strongly
recommended, as these changes may have an impact on other statistics.
• For less influential errors: corrections in the SBR are carried out at the end of the
calendar year, when all surveys are renewed or refreshed. In this case, preliminary estimates
outside the statistical-DWH published within 12 months after the statistical year t are still
based on an SBR including known errors. The final estimates published more than 12 months
after statistical year t are based on an SBR corrected for known errors.
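The timing rule of this possible strategy can be expressed as a small sketch. The function name and the boolean flag `influential` are hypothetical. How an error is classified as influential, and the consultation with stakeholders, are outside the scope of the sketch.

```python
from datetime import date


def sbr_correction_date(detected_on, influential):
    """Earliest date on which a detected error is corrected in the SBR.

    Influential errors are corrected as soon as possible (together with
    the population frame); other errors wait until the end of the
    calendar year, when the surveys are renewed or refreshed.
    """
    if influential:
        return detected_on
    return date(detected_on.year, 12, 31)


# Illustrative detection date (not taken from real data).
detected = date(2013, 3, 15)
immediate = sbr_correction_date(detected, influential=True)
deferred = sbr_correction_date(detected, influential=False)
```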
7. Conclusions
Two conditions are required for a successful statistical DataWareHouse. Firstly, the population
should be well defined. Secondly, one (enterprise) unit should be used in the statistical
DataWareHouse, because it is, in practice, impossible to create integrated datasets for
several (types of) enterprise units. A unit base is needed to link the different units of all
individual data sources to the statistical enterprise unit. Backbones about “population”,
“turnover” and “employment” are desired as these three data sources cover (almost) all enterprises
and therefore provide good basic characteristics for enterprises. Linking other data sources to
these three backbones, instead of only to population totals, improves the quality of the integrated
dataset. If an NSI integrates the data of the largest enterprises at the beginning of
data processing, the integrated data from large enterprises can be considered as a backbone,
too.
The SBR is an indirect data source for the statistical DataWareHouse. It is an indirect source
because 1) the backbone “population” is derived from it, 2) the unit base is strongly related to
it and 3) the surveys, another important data source for the statistical DataWareHouse, are
drawn from it. Hence, when errors in the population are revealed after integrating different
data sources, it is desired that these errors are corrected in the SBR, too. However, the timing
of incorporating these corrections in the SBR (and other backbones) is extremely important
due to the multiple use of SBR-information in data sources within or beyond the scope of the
statistical DataWarehouse.