SBS Workshop: *Structural Business Statistics on

Title: Guidelines (including options) on how the BR interacts with the SDWH - WORKING DOCUMENT
WP: 2
Deliverable: 2.2.1
Version: 1.2
Date: 28-2-2012
Author: Pieter Vlag
NSI: Netherlands

ESS-NET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS

Contents

1. Summary
2. Introduction
   2.1 Definition of a statistical DataWareHouse (according to the FPA)
   2.2 The statistical DataWarehouse: architecture and layers
   2.3 Linking different input data: the population frame
   2.4 Relationship between the population frame and the Business Register
3. Statistical units, backbones and default target population
   3.1 Statistical units and default target population
   3.2 Backbone “population frame”
   3.3 Backbones “turnover” and “employment”
   3.4 Determination of the default target population
   3.5 Updating the population
4. The largest enterprises
5. Linking the data-sources to the statistical unit
   5.1 Position in the statistical process of a DataWareHouse
   5.2 Variation in input units
   5.3 Variation in output units
   5.4 The statistical unit and the process of a statistical-DWH
   5.5 The statistical unit base
6. Correction information in the population frame and feedback to SBR
   6.1 The position of the Business Register in a statistical DataWareHouse
   6.2 Determination of the default target population in the statistical-DWH
   6.3 Panel surveys and correcting the population frame
   6.4 Timing of feedback to the unit base and the backbones
   6.5 Timing of feedback to the SBR
7. Conclusions

1. Summary

An important characteristic of a statistical datawarehouse is that the input data come from different data sources, like surveys, administrative data, accounting data and census data.
However, depending on the data source these input data may refer to different enterprise units like the establishment, the legal (enterprise) unit or the enterprise group. Moreover, some data sources may cover different populations. More precisely, some input data cover all enterprises with a certain activity, while others cover only big enterprises or other subpopulations. Hence, to use several input data sources for one statistical estimate, which is the aim of a statistical DataWareHouse, it is crucial to ensure that these data are linked to the same enterprise units and are compared with the same target population.

Most statistical institutes use the Statistical Business Register (SBR) to derive a) the statistical enterprise units and b) the target population, i.e. the group of enterprises to which the statistical estimates refer. We propose that the statistical datawarehouse uses only the statistical enterprise as unit. The default target population is defined as all enterprises which have been active during the reference period. The use of only one unit is proposed for the sake of governance and clarity. The default target population is proposed because this default corresponds with the output requirements of the European regulations. Desired estimates about subpopulations (like large or small enterprises only) can be derived from this default.

It should be noted that the enterprise units (plus some characteristics of these units) and the corresponding default target population are input sources for the statistical DataWareHouse, and not the Business Register as a whole.

Errors might be detected in statistical units and target populations when linking other input data to this information. A frequently expected error concerns the NACE-code (= activity). If influential, these errors need to be corrected in the statistical DataWareHouse. In practice data-linking will be done source by source and not simultaneously for all sources.
This is because some sources become available much earlier than others. Hence, to keep the process manageable, one has to decide at which step in the statistical process errors in statistical units and the target population have to be corrected in the DataWareHouse. Another question is at which stage of the process this corrected information has to be fed back to the SBR. In this document it is argued that this information could be corrected at the end of the processing phase and that feedback to the SBR needs to take place once a year, unless the detected errors are really influential.

2. Introduction

2.1 Definition of a statistical DataWareHouse (according to the FPA)

The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare recommendations about better use of data that already exist in the statistical system. Its ultimate aim is: ‘To create fully integrated data sets for enterprise and trade statistics at micro level: a 'data warehouse' approach to statistics.’

The broad definition of a data warehouse to be used in this ESSnet is therefore: ‘A common conceptual model for managing all available data of interest, enabling the NSI to (re)use this data to create new data/new outputs, to produce the necessary information and perform reporting and analysis, regardless of the data’s source.’

Within this ESSnet one workpackage (WP 2) covers all essential methodological elements for designing, building and implementing the statistical data warehouse.

2.2 The statistical DataWarehouse: architecture and layers

Another workpackage (WP 3) covers all essential architectural and technical elements for designing, building and implementing the statistical data warehouse. Basically this workpackage has applied the GSBPM sub-processes to the statistical-DWH concept in order to provide a Business Architecture of the statistical-DWH.
Moreover, it proposes a modular workflow for the SDWH in order to manage the information flow between the data sources and the SDWH central administration. To do this, it uses four functional layers:
• data source layer,
• integration layer,
• interpretation and data analysis layer,
• data presentation layer.

Figure 1 shows the GSBPM model. Figure 2 shows the relationship between the phases of the statistical process as defined by the GSBPM and the functional layers as proposed by the workpackage 3 team.

Statistical (enterprise) units and the target population play an important role in the integration layer of the statistical-DWH, which corresponds with the processing phase of the GSBPM. This is because in this phase different input sources are linked to each other. Later, they are weighted to the target population in order to obtain statistical estimates. In the next chapters of this document, we discuss more precisely at which steps the population frame and statistical units play a crucial role.

Figure 1 A schematic sketch of the GSBPM (Generic Statistical Business Process Model). Note that the GSBPM divides the statistical process into 9 phases, which are divided into subprocesses.

Figure 2 Relationships between the layers of a statistical DataWareHouse and the statistical processes according to the GSBPM (Generic Statistical Business Process Model).

2.3 Linking different input data: the population frame

One aim of a statistical DataWareHouse is to create a set of fully integrated data about enterprises. These data may come from different data sources, which are collected in the collection phase of the “Business Architecture” (fig. 2). In many countries the different data sources cover different populations. For example, the Value Added Tax (VAT) data and corporate tax data do not include the smallest enterprises, but are quasi-complete for the part of the enterprise population they cover.
Some survey samples include the smallest enterprises, but they provide information about only a few enterprises within a population. Hence, linking these sources is not only a matter of linking two sources but also a matter of relating all input data to a reference, the so-called target population.

Another factor one has to take into account is that different sources may have different units. For example, surveys in the Netherlands are based on statistical units (which generally correspond with legal units), while VAT-units are based on enterprise groups. Hence, it has to be agreed to which unit VAT-data and survey-data are linked. Other examples in other countries can also be given.

Summarising, when linking several input data in a statistical DataWareHouse, one has to agree about
• the default target population, i.e. the reference to which all input data are linked,
• the enterprise unit to which all input data are matched.

These questions will be addressed in this deliverable. The technical aspects of linking several data-sources are described in deliverable 2.3 of the ESSnet on DataWareHousing (DWH).

2.4 Relationship between the population frame and the Business Register

Member States of the European Union maintain business registers for statistical purposes as a tool for the preparation and coordination of surveys, as a source of information for the statistical analysis of the business population and its demography, for the use of administrative data, and for the identification and construction of statistical units. Regulation (EC) No 177/2008 of the European Parliament and of the Council sets out a common framework for the harmonisation of the national business registers for statistical purposes, and Article 7 of the Regulation asks for the publication of a business register recommendation manual. The manual aims to explain the reasoning behind the provisions of the Regulation.
It aims to provide the extra information required for the correct and consistent interpretation of the Regulation in all countries. This second edition of the manual is derived from the first one published in 2003 and replaces it. The manual has been updated in close cooperation with the Member States.

The regulation and manual imply that the business register contains at least
• a statistical unit,
• a name and address of the statistical unit,
• an activity-code (NACE),
• a starting and a stopping date of enterprises.

The implication for the statistical DataWareHouse is that the needed population characteristics, i.e. unit and default target population (see paragraph 2.3), can be derived from the SBR. Hence, the SBR is indirectly a crucial input for the statistical DataWareHouse. The input source for the statistical DataWarehouse is not necessarily the complete statistical Business Register itself, but the population frame derived from it.

3. Statistical units, backbones and default target population

3.1 Statistical units and default target population

Taking into account the (expected) recommendations of the ESSnet on Consistency and the European regulations about the Business Register, it is proposed that the statistical enterprise unit is the standard unit to which all data sources are linked in the statistical DataWareHouse.

Taking into account the SBS-regulations, we propose that the default target population is defined as: all enterprises with a certain activity being active during the reference period. The activity is derived from the NACE-code. In other words, for annual statistics this means that the default target population consists of all enterprises active during the year, including the starters and stoppers (and the new/stopping units due to merging and splitting companies). Note that this document uses the term default target population.
This is the target population of several obligatory statistics, like SBS, and the largest possible population for a reference period as it includes all enterprises. We propose that the default target population is used as the standard to check, clean and weight the input data in the processing phase of the DataWareHouse (see figs. 2/3).

On the other hand, one aim of the statistical DataWareHouse is to produce flexible output. Therefore, the statistical DataWareHouse should be able to produce estimates about
• all enterprises being active during the reference period (= standard),
• subpopulations (of this standard).

Examples of subpopulations are: large or small enterprises only, or all enterprises active at a certain date. Formally, these subpopulations also have a target population when estimating aggregates. This leads to confusion about the term target population in a statistical DataWareHouse. To prevent this confusion,
• the term default target population is used when referring to the standard, i.e. all enterprises being active,
• the term target population has a broader definition. It applies to producing estimates for both the standard and the subpopulations.

Checking, cleaning, integrating and weighting the input data in a statistical DataWareHouse (SDWH) are further discussed in chapter 5 of this deliverable and in other deliverables of the ESSnet on DataWareHousing.

3.2 Backbone “population frame”

To determine the default target population in the SDWH, it is crucial that the population frame is derived from the SBR. This population frame consists of all enterprises being in the SBR during the year, regardless of whether they are active or not. This input source will be called ‘backbone: population frame’ in the remainder of this document.
To derive activity status and subpopulations, it is recommended that this backbone includes the following information:
1) Frame reference year
2) Statistical enterprise unit, including its national ID and its EGR ID (footnote 1)
3) Name/address of the enterprise
4) National ID of the enterprise
5) Date in population (mm/yr)
6) Date out of population (mm/yr)
7) NACE-code
8) Institutional sector code
9) Size class (footnote 2)

Note that the population frame is crucial information to determine the default active population. However, the activity status of enterprises cannot be derived from this backbone itself. To estimate whether enterprises are active or not, a comparison with VAT and/or employment data is needed. This will be discussed in the next section (3.3). Sections 3.4 and 3.5 discuss how the active population can be determined in case the statistical DataWareHouse
• is limited to annual statistics,
• includes short-term statistics, too.

3.3 Backbones “turnover” and “employment”

The results of the ESSnet on AdminData showed that VAT and social security data can be used for turnover and employment estimates when quasi-complete. The latter is the case for quarterly statistics for most countries in continental Europe and for annual statistics for all European countries. Note however that VAT can only be used for statistical purposes if a) the data delivery from the tax office is guaranteed and b) the link with the SBR is established. Assuming that these conditions are fulfilled, it is proposed to include backbones of VAT-data and employment data as input data in a statistical DataWareHouse.
Footnote 1: meaningless ID assigned by the EGR system to enterprises. It is advised to include this ID in the DataWareHouse to enable comparability between the country-specific estimates.
Footnote 2: could be based on employment data.

The reasons to include a backbone VAT-data and a backbone social security data in a statistical DataWareHouse are twofold:
• Backbones of VAT and social security data are crucial to create a fully integrated dataset suitable for flexible outputs.
• VAT and social security data are crucial to determine the activity status of the enterprises and implicitly to determine the default target population.

These reasons are explained further in the remainder of this section.

When (quasi) complete, VAT and social security data can be used to produce good-quality estimates of turnover and employment. Therefore, it is very useful to use these estimates as benchmarks when incorporating results of survey sampling in a statistical DataWareHouse. In other words, together with the number of enterprises, totals of turnover and employment should determine the population frame when weighting survey results (or imputing non-observed enterprises). This population frame can also be used when relating other datasets to it. Such an approach is necessary to
• create a fully integrated dataset using several input data,
• reduce the impact of sampling errors of surveys.

The first point is the aim of a statistical DataWareHouse. The second point is required to produce flexible output, especially about subgroups of the default target population.

Several NSIs use VAT and social security data to determine the activity status of an enterprise. More precisely, enterprises are considered as active if VAT and/or social security data are available for the reference period or the previous period (in case of late VAT or late social security data).
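The activity rule described above can be illustrated with a minimal sketch. It assumes that reference periods are numbered with consecutive integers, so that the previous period is simply `period - 1`, and that the VAT and social security backbones are already linked to the statistical enterprise unit. The names `FrameRecord`, `is_active` and `default_target_population` are hypothetical and only serve this illustration; they do not describe an existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameRecord:
    """One enterprise of the 'population frame' backbone (reduced to two fields)."""
    enterprise_id: str
    nace_code: str


def is_active(enterprise_id, period, vat_periods, social_security_periods):
    """True if VAT and/or social security data exist for the reference
    period or for the previous period (to allow for late data)."""
    reported = (vat_periods.get(enterprise_id, set())
                | social_security_periods.get(enterprise_id, set()))
    return bool(reported & {period, period - 1})


def default_target_population(frame, period, vat_periods, social_security_periods):
    """Select the frame records that are considered active in `period`."""
    return [record for record in frame
            if is_active(record.enterprise_id, period,
                         vat_periods, social_security_periods)]


# Illustrative data: periods for which each enterprise has admin data.
frame = [FrameRecord("E1", "4711"), FrameRecord("E2", "4711"),
         FrameRecord("E3", "5610")]
vat_periods = {"E1": {8}, "E2": {7}}
social_security_periods = {"E3": {5}}

population = default_target_population(frame, 8, vat_periods,
                                       social_security_periods)
print([record.enterprise_id for record in population])  # ['E1', 'E2']
```

In this example E1 has VAT data for the reference period, E2 only for the previous period (late data), and E3 has no recent admin data, so E3 is left out of the active population.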
This method is preferred over a survey to determine the activity status, because the latter might be biased due to high non-response rates among the stopped enterprises.

Summarising, VAT and social security data are crucial to determine the activity status of the enterprises. Indirectly they are crucial to determine the default target population.

Note that several countries have incorporated VAT and social security data in the SBR already. Even in this case, it is proposed to consider both admin data sources as separate backbones in the statistical DataWareHouse. The reasons are twofold:
• VAT and social security data are essential factors in the estimation process within a statistical DataWareHouse. For the sake of transparency, crucial decisions about these data should preferably be taken within the statistical DataWareHouse and not outside.
• SBR, VAT and social security data may provide contradictory information, especially when including short-term statistics (STS) in the statistical-DWH (see section 3.5).

3.4 Determination of the default target population

3.4.1 Case 1: the statistical DataWareHouse is limited to annual statistics

The determination of the default target population is relatively easy if the scope of the statistical DataWareHouse is limited to annual statistics. This is because the required information from the SBR and administrative data (VAT + employment) can be selected afterwards, i.e. when the year has finished. Survey results and other data sources with annual data (like accountancy data + combined results of four quarters) become available after the year has ended. Furthermore, survey designs about production, investments etc. are not finalised before the year has ended. As a result, the statistical DataWareHouse does not need provisional populations to link provisional data during the calendar year.
Therefore, the default target population can be determined by
• selecting all enterprises which are recorded in the SBR during the reference year,
• using the complete annual VAT and employment (admin) dataset to determine the activity status.

3.4.2 Case 2: the statistical DataWareHouse includes short-term statistics

The determination of the default target population becomes more complicated when results of short-term statistics are included in the statistical DataWareHouse. In this case a provisional population frame for reference year t should be constructed at the end of year t-1, i.e. in November or December. This population frame is used to design short-term surveys. It is also the starting point for the SDWH. This provisional frame is called release 1. Formally it does not cover the entire population of year t, as it does not contain the starting enterprises yet.

During the year the population frame is regularly updated with new information from the SBR (especially new enterprises) and the administrative data (VAT + social security data). The frequency of these updates depends on the updates of the SBR, VAT and social security data.

At the end of year t (or at the beginning of year t+1), a regular population frame for year t can be constructed. This regular population frame consists of all enterprises in the year and is called release 2.

The ESSnet on Administrative Data has observed that time-lags exist between the registration of starting/stopping enterprises in the SBR and the different admin data sources. The impact of these time-lags differs per country, because it depends on
• the updates of the SBR, VAT and social security data,
• the quality of the underlying source information.

Despite the different impact of the time-lags, the ESSnet on AdminData has shown that these time-lags exist in every country and lead to revisions in estimates about active enterprises on a monthly and quarterly basis.
This effect is enhanced because the admin data are not entirely complete on a quarterly basis. These time-lag and incompleteness issues might be a reason for choosing a low frequency for updating the population frame in a statistical DataWareHouse. For example, quarterly and/or bi-annual updates could be considered.

3.5 Updating the population

At the beginning of year t+1 (or later) additional admin data and survey results for reference year t become available. Therefore, it cannot be excluded that errors are detected in the ‘release 2’ population determined at the end of the reference year. For this reason, a special procedure for additional frame error corrections should be developed, and a final population frame is foreseen for July of year t+1. How these updates should be incorporated in the statistical DataWareHouse and the SBR will be discussed in chapter 6. This updating scheme is schematically presented in figure 3.

[Figure 3: timeline with the FATS frame population of active units for year T, the FATS survey population for year T and the revised FATS frame population of active units for year T. The figure marks undercoverage, overcoverage, the units present in both frame populations and the frame error procedure, and places releases 1 to 4 along a month axis.]

Figure 3 Proposal for the construction of an annual population. Figure copied from van der Ven, 2012. This figure is an example for FATS but can be generalised for the entire DataWareHouse. Please note that release 2 in this figure is skipped for the SDWH-procedure.

4. The largest enterprises

The situation described in chapter 3 is applicable to most enterprises. However, an increasing number of national statistical institutes (NSIs) have created a unit which is responsible for creating a fully integrated dataset for the largest enterprises or largest enterprise groups which dominate the economy. If such a unit, or such an integrated dataset for the largest enterprises, exists, it could be considered as a backbone ‘largest enterprises’.
This backbone is an input for the statistical DataWareHouse and should be linked to the backbone “population frame” in the first step of the integration phase of the statistical DataWareHouse (GSBPM-step 5.1, see figure 2). The remainder of the process is the same for the largest enterprises as for all other enterprises.

5. Linking the data-sources to the statistical unit

5.1 Position in the statistical process of a DataWareHouse

Data-linking between the different sources is the first step in the processing phase of a statistical DataWareHouse. As the population frame consists of the statistical enterprise unit only, this step can be described more precisely as linking the input data to the statistical units. Technical aspects of data-linking are described in deliverable 2.4 of the ESSnet on DataWareHousing. The following sections address the question of which information is required to link the several input sources to the statistical unit.

5.2 Variation in input units

Accountancy data, tax data (including VAT and social security data) and other data may be reported for different parts within an enterprise group. These data might be reported for the enterprise group as a whole, for the underlying legal units, or for tax units consisting of other parts of the enterprise group. The variation in units and the challenge of linking them depend on the national legislation. Therefore, the impact of this issue differs per country.

The size of the enterprise also determines the variation in units and the complexity of linking them. For small enterprises one-to-one relationships between the different units can be assumed, but this assumption cannot be made for medium-sized enterprises.

Nevertheless, whatever the extent of these issues in the individual countries and whatever the determination of the statistical unit, it cannot be taken for granted that all input data link automatically to the statistical unit.
Hence, the relationship between these ‘input’ units and the statistical unit should be known before the data can be linked. Data-linking is of less importance when using surveys, because surveys are generally based on statistical units as they are designed from information of the SBR.

5.3 Variation in output units

Most statistical estimates in enterprise statistics are produced on the statistical unit. Examples are SBS, STS and most institutional statistics. However, some output is produced on different units like local units, LKAU, KAU or enterprise groups. Again, the complexity of linking these units depends on the country and the size of the enterprises. Nevertheless, one-to-one relationships between these output units and the statistical enterprise unit cannot be taken for granted. Hence, relationships between the ‘output’ units and the statistical units should be known before flexible outputs can be generated.

5.4 The statistical unit and the process of a statistical-DWH

The simplest and most transparent statistical process can be generated by
• linking all input sources to the statistical enterprise unit at the beginning of the processing phase (GSBPM-step 5.1, see figures 2 and 3),
• performing data-cleaning, plausibility checks and data-integration on statistical units only (GSBPM-steps 5.2-5.6),
• producing statistical output (GSBPM-steps 5.7-5.8) by default on the statistical unit and the default target population. Flexible outputs on other target populations and other units are also produced in these steps by using repeated weighting techniques and/or domain estimates. Technical aspects of these estimation methods are described in deliverable 2.8 of the ESSnet on DataWareHousing.

Note that it is theoretically possible to perform data-analyses and data-cleaning on several units simultaneously.
However, experiences of Statistics Netherlands with cleaning VAT-data on statistical units and ‘implementing’ these changes on the original VAT-units too reveal that the statistical process becomes quite complex. Therefore, it is proposed that
• linking to the statistical units is carried out at the beginning of the processing phase only,
• the creation of a fully integrated dataset is performed for statistical units only,
• statistical estimates for other units are produced at the end of the processing phase only,
• relationships between the different in- and output units on the one hand and the statistical enterprise unit on the other hand are known (or estimated) beforehand.

5.5 The statistical unit base

As the relationship between the different in- and output units on the one hand and the statistical enterprise units on the other hand should be known (or estimated) before the processing phase, it is recommended to include this information in a separate input source. This input source is the so-called unit base. It describes the relationship between the different units and the statistical enterprise unit. Figure 4 illustrates the content of a unit base. Note that the exact contents of the unit base depend on
• the legislation per country,
• the output requirements and desired output of a statistical DataWareHouse,
• the available input data.

It should also be mentioned that the unit base might be updated when other input data (with different units) become available.

The unit base is closely related to the SBR. However, as its contents are also closely related to the available input data, we recommend considering it as a separate input source.

The use of a unit base is preferred over incorporating a statistical unit in each data source. First of all, when doing the latter the data-linking is implicitly done in the collection phase of a statistical DataWareHouse. Secondly, it is much more efficient and transparent to store the relationships between the different units in one source.
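As an illustration of how a unit base could be used in GSBPM-step 5.1, the sketch below stores the relationships between input units and statistical enterprise units in one table and uses it to bring VAT values onto the statistical unit. The structure is an assumption made for this example only: the names `UnitLink` and `link_to_statistical_units` are hypothetical, and the `share` column (the fraction of an input unit's value allocated to a statistical unit) is an illustrative choice that is not prescribed by this document.

```python
from collections import defaultdict
from typing import NamedTuple


class UnitLink(NamedTuple):
    """One row of a (hypothetical) unit base."""
    unit_type: str            # type of input or output unit, e.g. "VAT"
    unit_id: str              # identifier of that unit in its own source
    statistical_unit_id: str  # statistical enterprise unit it relates to
    share: float              # assumed allocation factor (illustrative)


def link_to_statistical_units(values, unit_base, unit_type):
    """Allocate values reported on input units to statistical units.

    Returns the totals per statistical unit and the list of input units
    that have no relationship in the unit base."""
    links = defaultdict(list)
    for link in unit_base:
        if link.unit_type == unit_type:
            links[link.unit_id].append(link)

    totals = defaultdict(float)
    unlinked = []
    for unit_id, value in values.items():
        if unit_id not in links:
            unlinked.append(unit_id)  # candidate for a unit base correction
            continue
        for link in links[unit_id]:
            totals[link.statistical_unit_id] += value * link.share
    return dict(totals), unlinked


# Illustrative data: V2 is a VAT-unit covering two statistical units.
unit_base = [
    UnitLink("VAT", "V1", "E1", 1.0),
    UnitLink("VAT", "V2", "E1", 0.5),
    UnitLink("VAT", "V2", "E2", 0.5),
]
vat_turnover = {"V1": 100.0, "V2": 40.0, "V3": 7.0}

totals, unlinked = link_to_statistical_units(vat_turnover, unit_base, "VAT")
print(totals)    # {'E1': 120.0, 'E2': 20.0}
print(unlinked)  # ['V3']
```

Because all relationships sit in one table, a correction of a relationship only requires changing the corresponding rows, after which the linking can be repeated for every source that uses that unit type.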
This is especially the case when deficiencies in the data-linking process are detected in a later phase of the statistical process and these deficiencies lead to corrections in the earlier determined relationships between the units.

Figure 4 Example of a unit base.

6. Correction information in the population frame and feedback to SBR

6.1 The position of the Business Register in a statistical DataWareHouse

The position of the SBR in a statistical DataWareHouse is three-fold. More precisely,
• the SBR is the input source for the backbone ‘population frame’ of the statistical-DWH,
• the SBR is closely related to the unit base,
• the SBR is the sampling frame for the surveys. Surveys are another important data source of the statistical-DWH.

The last point implies that deficiencies in the population frame, which might be detected during the statistical process, should be incorporated in the SBR. If this is not done, the same deficiencies will return in the survey results of the next period.

The key questions are:
• at which step of the statistical process should the population frame be corrected when deficiencies are detected?
• when are the same corrections made in the backbones and the SBR?

The position of the SBR and its relationship with the backbones, unit base and surveys are illustrated in fig. 5.

Figure 5 Illustration of the position of the SBR within a statistical DataWareHouse. The SBR is basically related to three important input data of the S-DWH: the population frame, the unit base and the surveys. This figure also shows the position of a) data-integration and b) ‘weighting/calculation of aggregates’ in the statistical process. It shows at which step in the statistical process the feedback to the SBR preferably takes place. Note that backbones are denoted by brown cylinders and other input data by grey cylinders.
6.2 Determination of the default target population in the statistical-DWH

As previously mentioned, the backbones ‘population frame’, ‘turnover’ and ‘employment’ are used for the determination of the default target population. As these backbones are linked to each other at the beginning of the processing phase (GSBPM step 5.1), the determination of the default target population can take place here. We call this the provisional default target population.

This provisional default target population can be used for checking, cleaning and integrating the data at micro level. During these steps, contradictory information might be detected (in practice: will probably be detected). Such contradictory information may in extremis lead to the conclusion that the provisional default target population contains errors. Deliverable 2.8 of the ESSnet on DataWareHousing addresses the question of how this conclusion might be drawn, because it deals with the hierarchy between the different data sources.

Whatever the methodology for detection, errors in the provisional default target population have three possible origins. More specifically, they may be related:

• to errors in the data linking;
• to errors in the VAT and/or employment data;
• to errors in the population frame.

The first two points result in an erroneous estimation of the activity status and therefore of the number of active enterprises. It is expected that most errors in the population frame are related to errors in:

• the NACE code;
• the size class of the enterprise.

In other words, other data sources like surveys and admin data indicate that the enterprise has either a different activity from the one recorded in the SBR or a different size from the one recorded in the SBR.

If the SBR, unit base and VAT/employment backbones are of good quality, the number of errors in the provisional default target population should be limited.
Moreover, data cleaning and data integration at micro level are basically independent of the number of active enterprises, NACE code and size class. Therefore, it is proposed to use the provisional default target population for these steps, even after errors have been detected. Another reason for this proposal is that errors might be detected at several stages of the data-cleaning and integration process. Therefore, it is preferred to first collect all errors detected in this part of the process before correcting them in the population.

Errors in the provisional default target population might become influential when weighting the integrated microdata and calculating aggregates at the end of the processing phase. Therefore, it is recommended to correct all errors in the population just before performing these steps. Hence the provisional default target population is replaced by the default target population at (the beginning of) GSBPM step 5.7 (“weighting”).

6.3 Panel surveys and correcting the population frame

Surveys are another data source to detect errors in the default target population. In particular, surveys about produced goods, performed services and investments can be very useful to detect errors in the NACE code. However, corrections of NACE codes etc. should not lead to selectivity in the quality of the SBR. Selectivity means in this context: some parts of the SBR are of better quality than others. To prevent this drawback, one should be very cautious in the use of (limited) panel survey data to correct information in the SBR. When using panel surveys, erroneous information leading to incorrect estimates should preferably be treated as outliers. This warning does not apply to admin data, because these data cover almost the entire population.

6.4 Timing of feedback to the unit base and the backbones

The unit base and the backbones “population frame”, “turnover” and “employment” have a crucial role in the linking and estimation process.
Therefore, it is advisable that if the default target population is updated due to errors in these sources, these sources themselves are updated too. These updates are desirable to ensure that late information or input data that become available later are processed with the correct population information.

A disadvantage of correcting the input data is that previously published estimates are revised when re-running the process with the improved population information. Here, previously published estimates are those published before the errors in the population were detected. This drawback of unexpected revisions can be limited by:

• developing a good metadata system, i.e. recording which data belong to which estimate;
• using the paradigm that the information derived from the SBR (= source for the population frame) and the backbones “turnover” and “employment” is correct unless proven otherwise. In other words, the population and corresponding input data are corrected only if the detected errors are certain and influential;
• relating the timing of incorporating changes in the input data to the revision moments of the most important statistical outputs.

Due to time lags between different data sources, the revisions caused by corrected information in the unit base and admin data sources are likely to be larger if the statistical DataWareHouse also covers short-term statistics.

6.5 Timing of feedback to the SBR

It has been argued in the previous section that updates in the provisional target populations due to proven and influential errors in:

• the backbone “population frame”;
• the backbones “turnover” and “employment”;
• the unit base

should be accompanied by updates in the corresponding backbones, too. However, the timing of these updates should correspond with the revision moments of the most important estimates.

The backbone “population frame” and the unit base are strongly related to the SBR.
Hence, the SBR should also be updated if the population frame and unit base are updated due to proven and influential errors. However, the timing of updating the SBR is extremely important, as the SBR also acts as a sampling frame for surveys, including surveys falling outside the scope of the statistical DataWareHouse. The importance of the timing is best illustrated with an example. If the SBR is used as the sampling frame for an STS survey of the current year t+1 and the SBR is ‘suddenly’ updated with information from the statistical DataWareHouse from last year t, a sudden (and, with respect to timing, incorrect) discontinuity in the STS survey series arises. The question is whether this discontinuity is desirable. The same applies to surveys falling outside the scope of the statistical DataWareHouse.

Therefore, it is advisable to develop a strategy for correcting information in the SBR. A possible strategy is:

• For the errors with the biggest impact: corrections in the SBR are made simultaneously with the corrections in the backbone ‘population frame’. However, consultation with the stakeholders of the most important statistics outside the scope of the statistical DataWareHouse is strongly recommended.
• For less influential errors: corrections in the SBR are carried out at the end of the calendar year, when all surveys are renewed or refreshed.

7. Conclusions

Two conditions are required for a successful statistical DataWareHouse. Firstly, the population must be well defined. Secondly, one (type of) enterprise unit should be used in the statistical DataWareHouse, because it is, in practice, impossible to create integrated datasets for several (types of) enterprise units. To meet these conditions, backbones for “population”, “turnover”, “employment” and “large enterprises” are required. Furthermore, a unit base is needed to link the different units of all individual data sources to the statistical enterprise unit.
The SBR is an indirect data source for the statistical DataWareHouse. It is an indirect source because 1) the backbone “population” is derived from it, 2) the unit base is strongly related to it and 3) the surveys, another important data source for the statistical DataWareHouse, are based on it. Hence, when errors in the population are revealed after integrating different data sources, it is desirable that these errors are corrected in the SBR, too. However, the timing of incorporating these corrections in the SBR is extremely important, because SBR information is used in multiple data sources within and beyond the scope of the statistical DataWareHouse.
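As an illustration of the correction timing proposed in sections 6.2, 6.4 and 6.5, the following Python sketch collects detected errors during cleaning and integration, applies only the certain and influential ones when the default target population is fixed before weighting, and then derives a date for the feedback to the SBR. It is a minimal sketch under assumptions: the `Correction` record, its flags, the NACE codes and the use of 31 December as "end of the calendar year" are hypothetical choices, not part of the deliverable.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Correction:
    """One detected error in the provisional default target population."""
    enterprise_id: str
    field: str          # e.g. "nace" or "size_class"
    new_value: str
    certain: bool       # the error is proven
    influential: bool   # the error affects the estimates
    high_impact: bool   # belongs to the errors with the biggest impact


def finalize_population(provisional, corrections):
    """Return the default target population used from weighting onwards.

    The provisional population is copied, not modified, so cleaning and
    integration can keep using it while errors are still being collected.
    """
    final = {eid: dict(attrs) for eid, attrs in provisional.items()}
    for c in corrections:
        # Population information is treated as correct unless the error
        # is both certain and influential.
        if c.certain and c.influential and c.enterprise_id in final:
            final[c.enterprise_id][c.field] = c.new_value
    return final


def sbr_feedback_date(correction, backbone_correction_date):
    """Date on which the correction is fed back to the SBR (assumed rule)."""
    if correction.high_impact:
        # Simultaneously with the correction in the population backbone.
        return backbone_correction_date
    # Otherwise at the end of the calendar year (assumed: 31 December).
    return date(backbone_correction_date.year, 12, 31)


# Illustrative (made-up) data.
provisional = {
    "E1": {"nace": "4711", "size_class": "small"},
    "E2": {"nace": "2511", "size_class": "large"},
}
collected = [
    Correction("E1", "nace", "4719", certain=True, influential=True, high_impact=False),
    Correction("E2", "size_class", "medium", certain=False, influential=True, high_impact=False),
    Correction("E2", "nace", "2512", certain=True, influential=True, high_impact=True),
]

final = finalize_population(provisional, collected)
feedback_dates = [sbr_feedback_date(c, date(2012, 3, 1)) for c in collected if c.certain and c.influential]
```

In this sketch the uncertain size-class correction for `E2` is left out of the default target population, and only the high-impact NACE correction reaches the SBR at the same moment as the backbone; consultation with stakeholders, which the strategy recommends, is outside what code can show.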
