Title: WP: Guidelines (incl. options) on how the BR interacts with the S-DWH
WORKING DOCUMENT 2
Deliverable: 2.2.1
Version: 2.0
Date: 15-3-2013
Author: Pieter Vlag
NSI: Netherlands

ESS-NET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS

Contents
1. Summary
2. Introduction
2.1 Definition of a statistical DataWareHouse (according to the FPA)
2.2 The statistical DataWarehouse: architecture and layers
2.3 Linking different data sources: the population frame
2.4 Relationship between the population frame and the Business Register
3. Statistical units, backbones and default target population
3.1 Statistical units and default target population
3.2 Backbone “population frame”
3.3 Backbones “turnover” and “employment”
3.4 Determination of the default target population
3.5 Updating the population
4. The largest enterprises
5. Linking the datasources to the statistical unit
5.1 Position in the statistical process of a statistical-DWH
5.2 Variation in input units
5.3 Variation in output units
5.4 The statistical unit and the process of a statistical-DWH
5.5 The statistical unit base
6. Correction information in the population frame and feedback to SBR
6.1 The position of the Business Register in a statistical DataWareHouse
6.2 Determination of the default target population in the statistical-DWH
6.3 Panel surveys and correcting the population frame
6.4 Timing of feedback to the unit base and the backbones
6.5 Timing of feedback to the SBR
7. Conclusions

1. Summary

An important characteristic of a statistical datawarehouse is that the data come from different data sources, like surveys, administrative data, accounting data and census data. However, depending on the data source, these input data may refer to different parts of an enterprise, like the establishment, the legal (enterprise) unit or the enterprise group. Moreover, different data sources may cover different enterprise populations. More precisely, some input data cover all enterprises with a certain activity, but others cover only big enterprises or other subpopulations.
Hence, to use several input data sources for one statistical estimate, which is the aim of a statistical DataWareHouse, it is crucial to ensure that these data are linked to the same enterprise units and are compared with the same target population. Most statistical institutes use the Statistical Business Register (SBR) to derive a so-called population frame, which consists of all (enterprise) units with a certain specific activity.

We propose that the statistical DataWareHouse uses only the statistical enterprise unit as unit when processing the data. The use of only one unit is proposed for the sake of governance and clarity. The default target population in a statistical DataWareHouse is defined as the statistical enterprise units which have been active during the reference period. This default target population is proposed because it corresponds with the output requirements of the European regulations. It can also be used to derive estimates about subpopulations (like large or small enterprises only) in a statistical DataWareHouse. It should be noted that the population frame, and not the Business Register as a whole, is an input source for the statistical DataWareHouse.

Errors might be detected in statistical units and target populations when linking other data sources to this information. A frequently expected error concerns the NACE-code, which classifies the kind of activity. If influential, these errors need to be corrected in the statistical DataWareHouse. As the different source data arrive, in practice, at different times, data-linking will be done source by source and not simultaneously. Hence, to keep the process manageable, one has to decide at which step in the statistical process errors in statistical units and the target population have to be corrected in the DataWareHouse. Another question is when this corrected information has to be linked back to the SBR.
In this document it will be argued that this information could be corrected at the end of the processing phase and that feedback to the SBR needs to take place once a year, unless the detected errors are really influential.

2. Introduction

2.1 Definition of a statistical DataWareHouse (according to the FPA)

The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare recommendations about better use of data that already exist in the statistical system. Its ultimate aim is: ‘To create fully integrated data sets for enterprise and trade statistics at micro level: a 'data warehouse' approach to statistics.’ The broad definition of a data warehouse to be used in this ESSnet is therefore: ‘A common conceptual model for managing all available data of interest, enabling the NSI to (re)use this data to create new data/new outputs, to produce the necessary information and perform reporting and analysis, regardless of the data’s source.’

One workpackage in this ESSnet (WP 2) covers all essential methodological elements for designing, building and implementing the statistical data warehouse.

2.2 The statistical DataWarehouse: architecture and layers

Another workpackage (WP 3) covers all essential architectural and technical elements for designing, building and implementing the statistical DataWareHouse (statistical-DWH). Basically, this workpackage has linked the GSBPM sub-processes to the statistical-DWH concept in order to provide a Business Architecture for the statistical-DWH. Moreover, it has proposed a modular workflow for the statistical-DWH in order to manage the information flow between data sources and the central administration of a DataWareHouse. To do this, it uses four functional layers:
- data source layer,
- integration layer,
- interpretation and data analysis layer,
- data presentation layer.

Figure 1 shows the GSBPM model.
Figure 2 shows the relationship between the phases of the statistical process as defined by the GSBPM and the functional layers as proposed by the workpackage 3 team. Note that statistical (enterprise) units, which are needed to link several input data, and the target population, which is needed to relate the input data to statistical estimates, play an important role in the processing phase of the GSBPM. This processing phase corresponds with the integration layer of the statistical-DWH.

In the next chapters of this document, we discuss more precisely at which steps populations and statistical units play a crucial role.

Figure 1 A schematic sketch of the GSBPM (Generic Statistical Business Process Model). Note that the GSBPM divides the statistical process into 9 phases. These phases are divided into subprocesses.

Figure 2 Relationships between the layers of a statistical DataWareHouse and the statistical processes according to the GSBPM (Generic Statistical Business Process Model).

2.3 Linking different data sources: the population frame

The aim of a statistical-DWH is to create a set of fully integrated data about enterprises, which enables a statistical institute to produce flexible and consistent output. The integrated data come from different data sources. These data sources are collected in the collection phase of the “Business Architecture” (fig. 2).

In practice, different data sources may cover different populations in most countries. For example, the Value Added Tax (VAT) data and corporate tax data do not include the smallest enterprises, but are quasi-complete for the part of the enterprise population they cover. Survey samples include information about the smallest enterprises, but generally provide data for only a limited number of small enterprises in a country. Hence, linking data of several sources is not only a matter of linking but also a matter of relating all input data to a reference, the so-called default target population.
Another factor one has to take into account is that different sources may have different units. For example, surveys in the Netherlands are based on statistical units (which generally correspond with legal units), while VAT-units are based on enterprise groups. Hence, when linking VAT-data and survey-data to the target population, it has to be agreed to which unit these data are linked. Other examples in other countries can also be given.

Summarising, when linking several input data in a statistical-DWH, one has to agree about
- the default target population, i.e. the reference frame to which all data sources are linked;
- the enterprise unit to which all input data are matched.

These questions will be addressed in this deliverable. The technical aspects of linking several data sources are described in deliverable 2.3 of the ESSnet on DataWareHousing (DWH).

2.4 Relationship between the population frame and the Business Register

Member States of the European Union maintain business registers for statistical purposes as a tool for the preparation and coordination of surveys, as a source of information for the statistical analysis of the business population and its demography, for the use of administrative data, and for the identification and construction of statistical units.

Regulation (EC) No 177/2008 of the European Parliament and the Council sets out a common framework for the harmonisation of the national business registers for statistical purposes, and Article 7 of the Regulation asks for the publication of a business register recommendation manual. The manual aims to explain the reasoning behind the provisions of the Regulation and to provide the extra information required for the correct and consistent interpretation of the Regulation in all countries. This second edition of the manual is derived from the first one published in 2003 and replaces it. The manual has been updated in close cooperation with the Member States.
The regulation and manual implicitly imply that the business register contains at least
- a statistical unit,
- a name and address of the statistical unit,
- an activity code (NACE),
- a starting and a stopping date of enterprises.

The implication for the statistical-DWH is that the required information about the reference or population frame, i.e. unit and default target population (see paragraph 2.3), can be derived from the SBR. Hence, the SBR is indirectly a crucial input for the statistical DataWareHouse. The complete statistical Business Register itself is not necessarily an input source for the statistical DataWareHouse, but the population frame derived from it is.

3. Statistical units, backbones and default target population

3.1 Statistical units and default target population

Taking into account the expected recommendations of the ESSnet on Consistency and the European regulations about the Business, it is proposed that the statistical enterprise unit is the only unit to which all data sources are linked in the statistical DataWareHouse.

Taking into account the SBS-regulations, we propose that the default target population is defined as: all enterprises with a certain kind of activity being economically active during the reference period. The NACE-code is used to classify the kind of activity. In other words, for annual statistics this means that the default target population consists of all enterprises active during the year, including the starters and stoppers (and the new/stopping units due to merging and splitting companies).

Note that this document uses the term default target population. This population corresponds with the target population of several important obligatory statistics, like SBS. Furthermore, it is the largest possible population for a reference period because it includes all enterprises with some economic activity during (part of) the period.
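The definition above can be illustrated with a minimal sketch. It assumes a hypothetical frame record holding only the items named in this section (unit ID, NACE-code, starting and stopping date) plus a hypothetical `active` flag standing in for whatever evidence of economic activity is available; the field names, the example units and the helper `default_target_population` are illustrative assumptions, not part of any SBR specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FrameUnit:
    """One statistical enterprise unit in a hypothetical population frame."""
    unit_id: str
    nace: str              # NACE activity code
    start: date            # starting date of the enterprise
    stop: Optional[date]   # stopping date, None if not stopped
    active: bool           # hypothetical flag: economic activity observed


def default_target_population(frame, nace_prefix, period_start, period_end):
    """Select units with the given activity that were active in the period."""
    selected = []
    for u in frame:
        in_activity = u.nace.startswith(nace_prefix)
        # A unit overlaps the period if it started before the period ended
        # and did not stop before the period began, so starters and
        # stoppers within the period are included.
        overlaps = u.start <= period_end and (
            u.stop is None or u.stop >= period_start
        )
        if in_activity and overlaps and u.active:
            selected.append(u.unit_id)
    return selected


# Illustrative (made-up) frame for reference year 2012, NACE division 47.
frame = [
    FrameUnit("A", "4711", date(2005, 1, 1), None, True),               # continuing
    FrameUnit("B", "4719", date(2012, 9, 1), None, True),               # starter
    FrameUnit("C", "4711", date(2001, 1, 1), date(2012, 3, 31), True),  # stopper
    FrameUnit("D", "4711", date(2001, 1, 1), date(2011, 6, 30), True),  # stopped earlier
    FrameUnit("E", "4711", date(2003, 1, 1), None, False),              # in register, inactive
    FrameUnit("F", "6201", date(2003, 1, 1), None, True),               # other activity
]

population_2012 = default_target_population(
    frame, "47", date(2012, 1, 1), date(2012, 12, 31)
)
print(population_2012)  # ['A', 'B', 'C']
```

In this sketch the starter (B) and the stopper (C) are part of the population because they were active during part of the year, while units that are only registered (E) or that stopped before the period (D) are not.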
We propose that the default target population is used as the standard to check, link, clean and weight the input data in the processing phase of the statistical-DWH (see figs. 2/3). On the other hand, one aim of a datawarehouse is to produce flexible output. Therefore, the statistical-DWH should be able to produce estimates about subpopulations of this standard, too. Examples of subpopulations are: large or small enterprises only, or all enterprises active at a certain date. From a purely methodological point of view, these subpopulations have a target population when estimating population totals, too. Therefore, some confusion about the term target population in a statistical DataWareHouse may arise. To prevent this confusion,
- the term default target population is used when explicitly referring to the standard, i.e. all enterprises being active;
- the term target population has a broader definition. It refers to estimation populations for both the standard and all subpopulations.

Checking, cleaning, integrating and weighting the input data in a statistical DataWareHouse (SDWH) are further discussed in chapter 5 of this deliverable. They are more extensively discussed in other deliverables of the ESSnet on DataWareHousing.

3.2 Backbone “population frame”

To determine the default target population in the SDWH, two kinds of data sources are needed:
- the population frame, i.e. a list of enterprises with a certain kind of activity during a period;
- information to determine which enterprises of the list really performed economic activities during a period.

As previously mentioned, the population frame is derived from the SBR. This population frame consists of all enterprises being in the SBR during the year, regardless of whether they are active or not. This input source is a backbone for the statistical-DWH and will be simply called ‘population frame’ in the remainder of this document.
To derive activity status and subpopulations, it is recommended that this backbone includes the following information:
1) the frame reference year
2) the statistical enterprise unit, including its national ID and its EGR ID [1]
3) the name and address of the enterprise
4) the national identification number (ID) of the enterprise
5) the date in population (mm/yr)
6) the date out of population (mm/yr)
7) the NACE-code
8) the institutional sector code
9) a size class [2]

[1] Meaningless ID assigned by the EGR system to enterprises. It is advised to include this ID in the DataWareHouse to enable comparability between the country-specific estimates.
[2] Could be based on employment data.

Note that the population frame is crucial information to determine the default active population. Only the activity status of enterprises cannot be derived from this backbone itself, mainly because it can often only be determined afterwards. To estimate whether enterprises really carried out economic activities, a comparison with VAT and/or employment data is done. This will be discussed in the next section (3.3). Sections 3.4 and 3.5 discuss how the population frame and the default active target population can be determined in two specific cases:
- the statistical-DWH is limited to annual statistics;
- the statistical-DWH includes short-term statistics, too.

3.3 Backbones “turnover” and “employment”

The results of the ESSnet on AdminData showed that VAT and social security data can be used for turnover and employment estimates when quasi-complete. The latter is the case for annual statistics and for quarterly statistics in most European countries on the continent. Note however that VAT and social security data can only be used for statistical purposes if a) the data transfer from the tax office to the statistical institute is guaranteed and b) the link with the SBR is established.
Assuming that these conditions are fulfilled, it is proposed to include backbones of VAT-data and employment data as input data in a statistical-DWH. VAT and social security data cover almost all enterprises in the domain covered by the SBS and STS-regulations and are available in a timely manner (i.e. earlier than most annual statistics). Therefore, two reasons exist to include these data sources in a statistical-DWH and consider them as backbones.
1. VAT and social security data are crucial to determine the activity status of the enterprises and, implicitly, to determine the default target population.
2. VAT and social security data are crucial to create a fully integrated dataset suitable for flexible outputs, because these admin data sources contain information about almost all enterprises (unlike surveys, which contain only information about a small sample of enterprises).

The latter reason is explained further in the remainder of this section. When (quasi) complete, VAT and social security data can be used to produce good-quality estimates of turnover and employment. Therefore, these estimates can be used as benchmarks when incorporating results of survey sampling in a statistical DataWareHouse. In this case totals of turnover and employment define, together with the number of active enterprises, the basic population characteristics. Other datasets or surveys covering more specific parts of the population should be made consistent with these three main characteristics of the entire population. This improves the quality of the integrated dataset in a statistical-DWH because more auxiliary information is used when weighting survey results (or other datasets) or when imputing for missing values.

This quality improvement can be quantified by the reduction of variance in variables not derived from the backbones. Every estimator has its variance (i.e. uncertainty). Many studies in the literature have shown that estimates derived from weighting techniques using auxiliary information (e.g.
ratio or GREG-type estimators) have lower sampling errors than estimates using no auxiliary information when weighting.

Summarising, using backbones for turnover and employment in addition to the population frame
- improves the quality of a fully integrated dataset built from several input data, as two key variables can be estimated precisely;
- reduces the impact of sampling errors of other variables not observed in the backbones.

As the first condition is the aim of a statistical DataWareHouse and the second condition is required to produce flexible output (especially about subgroups of the default target population), this is the first argument to use backbones of employment and turnover in a statistical-DWH.

Several NSIs use VAT and social security data to determine the activity status of an enterprise. More precisely, enterprises are considered as active if VAT and/or social security data are available for the reference period or the previous period (in case of late VAT or late social security data). This method is preferred over a survey to determine the activity status, because the latter might be biased due to high non-response rates among the stopped enterprises. Summarising, VAT and social security data are crucial to determine whether an enterprise has been economically active or not. Hence, backbones of turnover and employment are crucial to determine the default target population. This is the second reason to include these backbones in a statistical-DWH.

Several countries have incorporated VAT and social security data in the SBR already. Even in this case, it is proposed to consider both admin data sources as separate backbones in a statistical-DWH that is to be developed. The reasons are twofold. First, VAT and social security data are essential factors in the estimation process within a statistical DataWareHouse. For the sake of transparency, crucial decisions about linking these data with the SBR should preferably be taken within the statistical DataWareHouse and not outside.
Second, SBR, VAT and social security data may provide conflicting information, especially
- in short-term statistics (STS),
- when using a frozen population frame,
- when using data sources for the SBR that are not regularly updated.
For the sake of transparency, crucial decisions about dealing with such contradictory information should preferably be taken within the statistical-DWH and not outside, especially because other data sources of the datawarehouse may help to decide which data source is correct in case of conflicting information.

A schematic sketch of the positions of the backbones in a statistical-DWH is provided in figures 3 and 4.

Figure 3 Illustration of the position of the SBR and the backbones in a statistical-DWH. Note that the population, one of the backbones, is derived from the SBR. The backbones “population”, admin data based turnover (VAT) and admin data based employment are used to describe the population characteristics. All other data sources are integrated with these characteristics at the beginning of the processing phase. These backbones are also used for weighting when producing outputs at the end of the processing phase.

Figure 4 The position of the SBR in case the VAT and employment data are completely integrated in the SBR. Note that in this case, crucial parts of the data-linking are carried out outside the statistical-DWH.

3.4 Determination of the default target population

3.4.1 Case 1: the statistical DataWareHouse is limited to annual statistics

The determination of the default target population is relatively easy if the scope of the statistical DataWareHouse is limited to annual statistics. This case is relatively easy because the required information from the SBR and administrative data (VAT + employment) can be selected afterwards, i.e. when the year has finished.
This is because survey designs, survey results and other data sources with annual data (like accountancy data + combined results of four quarters) become available after the year has ended. As a result, no provisional populations are needed to link provisional data during the calendar year. Therefore, the default target population can be determined by selecting all enterprises which are recorded in the SBR during the reference year, using the complete annual VAT and employment (admin) dataset to determine the activity status.

3.4.2 Case 2: the Statistical DataWareHouse includes short-term statistics

The determination of the default target population becomes more complicated when results of short-term statistics are included in the statistical DataWareHouse. In this case a provisional population frame for reference year t should be constructed at the end of year t-1, i.e. in November or December. This population frame is used to design short-term surveys. It is also the starting point for the statistical-DWH. This provisional frame is called release 1. Formally, it does not cover the entire population of year t, as it does not contain the starting enterprises yet.

During the year the population frame is regularly updated with new information from the SBR (especially new enterprises) and the administrative data (VAT + social security data). The frequency of these updates depends on the updates of the SBR, VAT and social security data.

At the end of year t (or at the beginning of year t+1), a regular population frame for year t can be constructed. This regular population frame consists of all enterprises in the year and is called release 2.

The ESSnet on Administrative Data has observed that time-lags exist between the registration of starting/stopping enterprises in the SBR and in the different admin data sources.
The impact of these time-lags differs per country, because it depends on
- the updates of the SBR, VAT and social security data;
- the quality of the underlying data sources.

Despite the different impact of the time-lags, the ESSnet on AdminData has shown that these time-lags exist in every country and lead to revisions in estimates about active enterprises on a monthly and quarterly basis. This effect is enhanced because the admin data are not entirely complete on a quarterly basis. These time-lag and incompleteness issues might be a consideration for choosing a low frequency for updating the population frame in a statistical DataWareHouse. For example, quarterly and/or bi-annual updates could be considered.

3.5 Updating the population

At the beginning of year t+1 (or later) additional admin data and survey results for reference year t become available. Therefore, it cannot be excluded that errors are detected in the ‘release 2’ population determined at the end of the reference year. For this reason, a special procedure for additional frame error corrections should be developed and a final population frame is foreseen for July, t+1. How these updates should be incorporated in the statistical DataWareHouse and the SBR will be discussed in chapter 6. This updating scheme is schematically presented in figure 5.

[Figure 5: timeline of population frame releases 1 to 4. Labels: FATS frame population of active units year T; FATS survey population year T; FATS frame population of active units year T (revised); undercoverage; in both frame populations; overcoverage; frame error procedure.]

Figure 5 Proposal for the construction of an annual population in case the statistical-DWH includes short-term statistics. Figure copied from van der Ven, 2012. Please note that release 2 in this figure is skipped for the SDWH-procedure.

4. The largest enterprises

An increasing number of national statistical institutes (NSIs) have created a team which is responsible for creating a fully integrated dataset for the largest enterprises or largest enterprise groups which dominate the economy. In contrast to other enterprises, these large enterprises are systematically analysed at micro level and their data are made consistent across data sources. The main motivation to create such a team is the large contribution of the largest enterprises to the economy and to some specific statistical estimates.

If such a team, or such an integrated dataset for the largest enterprises, exists, it could be considered as a backbone ‘largest enterprises’. The reason for this consideration is that the dataset
- is already integrated;
- covers all enterprises of a subpopulation (the largest enterprises);
- is of good quality and does not need to be re-analysed;
- covers a considerable part of the total estimates, so that integrating other data sources with this backbone increases the quality of the integrated dataset.

This backbone is an input for the statistical DataWareHouse and should be linked to the population frame in the first step of the integration phase of the statistical DataWareHouse (GSBPM-step 5.1, see figure 2). The remainder of the process is the same for the largest enterprises as for all other enterprises.

Figure 6 Illustration of the position of the backbones in a statistical-DWH when the data of the largest enterprises are integrated at the source.

5. Linking the datasources to the statistical unit

5.1 Position in the statistical process of a statistical-DWH

Data-linking between the different sources is the first step in the processing phase of a statistical DataWareHouse. As the population frame consists of the statistical enterprise unit only, this step can be described more precisely as linking the input data to the statistical units. Technical aspects of data-linking are described in deliverable 2.4 of the ESSnet on DataWareHousing.
The remainder of this chapter addresses the question of which information is required to link the several input sources to the statistical unit.

5.2 Variation in input units

Accountancy data, tax data (including VAT and social security data) and other data may be reported for different parts within an enterprise group. These data might be reported for the enterprise group as a whole, the underlying legal units, and tax units consisting of other parts of the enterprise group. The variation in units and the challenge of linking them depend on the national legislation. Therefore, the impact of this issue differs per country. The size of the enterprise also determines the variation in units and the complexity of linking them. For small enterprises one-to-one relationships between the different units can be assumed, but this assumption cannot be made for medium-sized enterprises.

Nevertheless, whatever the extent of this issue in the individual countries and whatever the determination of the statistical unit, it cannot be taken for granted that all input data link automatically to the statistical unit. Hence, the relationship between these ‘input’ units and the statistical unit should be known before the data can be linked. Data-linking is of less importance when using surveys, because surveys are generally based on statistical units, as they are designed from information in the SBR.

5.3 Variation in output units

Most statistical estimates in enterprise statistics are produced on the statistical unit. Examples are SBS, STS and most institutional statistics. However, some output is produced on different units, like local units, LKAU, KAU or enterprise groups. Again, the complexity of linking these units depends on the country and the size of the enterprises. Nevertheless, one-to-one relationships between these output units and the statistical enterprise unit cannot be taken for granted.
Hence, relationships between the ‘output’ units and the statistical units should be known before flexible outputs can be generated.

5.4 The statistical unit and the process of a statistical-DWH

The simplest and most transparent statistical process can be generated by
- linking all input sources to the statistical enterprise unit at the beginning of the processing phase (GSBPM-step 5.1, see figures 2 and 3);
- performing data-cleaning, plausibility checks and data-integration on statistical units only (GSBPM-steps 5.2-5.6);
- producing statistical output (GSBPM-steps 5.7-5.8) by default on the statistical unit and the default target population. Flexible outputs on other target populations and other units are also produced in these steps by using repeated weighting techniques and/or domain estimates. Technical aspects of these estimation methods are described in deliverable 2.8 of the ESSnet on DataWareHousing.

Note that it is theoretically possible to perform data-analyses and data-cleaning on several units simultaneously. However, experiences of Statistics Netherlands with cleaning VAT-data on statistical units and ‘implementing’ these changes on the original VAT-units too reveal that the statistical process becomes quite complex. Therefore, it is proposed that
- linking to the statistical units is carried out at the beginning of the processing phase only;
- the creation of a fully integrated dataset is performed for statistical units only;
- statistical estimates for other units are produced at the end of the processing phase only;
- relationships between the different in- and output units on one hand and the statistical enterprise unit on the other hand should be known (or estimated) beforehand.
5.5 The statistical unit base

As the relationship between the different in- and output units on the one hand and the statistical enterprise units on the other hand should be known (or estimated) before the processing phase, it is recommended to include this information in a separate input source. This input source is the so-called unit base. It describes the relationships between the different units and
- the statistical enterprise unit, which is used in the process of a statistical-DWH;
- the enterprise group, which is the base for tax and legal units in some countries, like the Netherlands.

The relationship between the enterprise group and the statistical (enterprise) unit should of course also be included in the unit base.

It might be possible that no one-to-one relationships between the different units are observed. In case one unit of a data source corresponds with several statistical units, the ‘observed’ values of that data source need to be allocated (“subdivided”) correctly over the several statistical units before they can be processed further in the statistical-DWH. To be able to do this, it is recommended to include in the unit base an estimated share of each unit (including the statistical unit) in the enterprise group. This share may be based on employment or turnover.

The unit base can be subdivided into ‘input’ units, used to link the different datasets to the statistical unit at the beginning of the processing phase (GSBPM step 5.1: “integrate data”), and ‘output’ units, used to produce output on units other than the statistical enterprise at the end of the processing phase (GSBPM steps 5.7/5.8: “calculate aggregates”). Figure 7 illustrates the content of a unit base. Note that the exact contents of the unit base depend on
- the legislation per country;
- the output requirements and desired output of a statistical DataWareHouse;
- the available input data.

It should also be mentioned that the unit base might be updated when other input data (with different units) become available.
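The allocation of observed values over several statistical units can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the unit identifiers, the layout of `UNIT_BASE` and the function name `allocate_to_statistical_units` are hypothetical, and the shares are assumed to be stored per input unit and to sum to one.

```python
from collections import defaultdict

# Hypothetical unit base: for each input unit (e.g. a VAT unit) the
# statistical units it covers and their estimated shares
# (e.g. based on employment or turnover).
UNIT_BASE = {
    "VAT-001": {"STAT-A": 0.75, "STAT-B": 0.25},  # one-to-many relationship
    "VAT-002": {"STAT-C": 1.0},                   # one-to-one relationship
}


def allocate_to_statistical_units(observed, unit_base):
    """Subdivide observed input-unit values over statistical units."""
    allocated = defaultdict(float)
    for input_unit, value in observed.items():
        shares = unit_base[input_unit]
        total = sum(shares.values())
        # Assumption of this sketch: shares per input unit must sum to one.
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"shares of {input_unit} sum to {total}, not 1")
        for stat_unit, share in shares.items():
            allocated[stat_unit] += value * share
    return dict(allocated)


observed_turnover = {"VAT-001": 1000.0, "VAT-002": 400.0}
turnover_by_stat_unit = allocate_to_statistical_units(observed_turnover, UNIT_BASE)
print(turnover_by_stat_unit)
```

Keeping the shares in the unit base, rather than in each data source, means that a corrected share only has to be changed in one place before the allocation is re-run.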
The unit base is closely related to the SBR. However, as its contents are also closely related to the available input data, we recommend considering it as a separate input source. The use of a unit base is preferred over incorporating a statistical unit in each data source. First of all, when doing the latter the data linking is implicitly done in the collection phase of a statistical DataWareHouse. Secondly, it is much more efficient and transparent to store the relationships between the different units in one source. This is especially the case
- when deficiencies in the data-linking process are detected in a later phase of the statistical process and these deficiencies lead to corrections in the earlier determined relationships between the units;
- when additional new data sources are used for the statistical-DWH.

Figure 7 Example of a unit base.

6. Correction information in the population frame and feedback to SBR

6.1 The position of the Business Register in a statistical DataWareHouse

The position of the SBR in a statistical DataWareHouse is threefold. More precisely:
- the SBR is the input source for the backbone ‘population frame’ of the statistical-DWH;
- the SBR is closely related to the unit base;
- the SBR is the sampling frame for the surveys. Surveys are another important data source of the statistical-DWH.

The last point implies that deficiencies in the population frame, which might be detected during the statistical process, should be incorporated in the SBR. If this is not done, the same deficiencies will return in the survey results of the next period. The key questions are:
- at which step of the statistical process should the population frame be corrected when deficiencies are detected?
- when are the same corrections made in the backbones?
- when are the same corrections made in the SBR (which is a source of some backbones)?

The position of the SBR and its relationship with the backbones, unit base and surveys are illustrated in fig.
8.

Figure 8 Illustration of the position of the SBR within a statistical-DWH. The SBR is basically related to three important input data of the S-DWH: the population frame, the unit base and the surveys. This figure also shows the position of a) data integration and b) ‘weighting/calculation of aggregates’ in the statistical process. It shows at which step in the statistical process the feedback to the SBR preferably takes place. Note that backbones are denoted by brown cylinders and other input data by grey cylinders.

6.2 Determination of the default target population in the statistical-DWH

As previously mentioned, the backbones ‘population frame’, ‘turnover’ and ‘employment’ are used for the determination of the default target population. As these backbones are linked to each other at the beginning of the processing phase (GSBPM step 5.1), the determination of the default target population can take place here. We call this the provisional default target population. This provisional default target population can be used for checking, cleaning and integrating all other data sources at micro level. During these steps, conflicting information between the data sources might be detected (in practice: will probably be detected). Conflicting information may in extremis lead to the conclusion that errors in the provisional default target population do exist. Deliverable 2.8 of the ESSnet on DataWareHousing addresses the question of how this conclusion might be drawn, because it deals with the hierarchy between the different data sources.

Whatever the methodology for detection, errors in the provisional default target population might have three possible origins. More specifically, they may be related
- to errors in the data linking;
- to errors in the VAT and/or employment data;
- to errors in the population frame.

The first two points result in an erroneous estimation of the activity status and therefore of the number of active enterprises.
It is expected that most errors in the population frame are related to errors in
- the NACE code;
- the size class of the enterprise.

In other words, other data sources like surveys and admin data indicate that the enterprise has either another activity than recorded in the SBR or another size than recorded in the SBR.

If the SBR, unit base and VAT/employment backbones are of good quality, the number of errors in the provisional default target population should be limited. Moreover, data cleaning and data integration at micro level are basically independent of the number of active enterprises, NACE code and size class. Therefore, it is proposed to use the provisional default target population for these steps, even after errors have been detected. Another reason for this proposal is that errors might be detected at several stages of the data-cleaning and integration process. Therefore, it is preferred to first collect all errors detected in this part of the process before correcting them in the population.

Errors in the provisional default target population might become influential when weighting the integrated microdata and calculating aggregates at the end of the processing phase. Therefore, it is recommended to correct all errors in the population just before performing these steps. Hence, the provisional default target population is replaced by the default target population at (the beginning of) GSBPM step 5.7 (“weighting”).

6.3 Panel surveys and correcting the population frame

Surveys are another data source to detect errors in the default target population. In particular, surveys about produced goods, performed services and investments can be very useful to detect errors in the NACE code. However, it should be prevented that corrections of NACE codes etc. lead to selectivity in the quality of the SBR. Selectivity means in this context that some parts of the SBR are of better quality than others because they are surveyed.
To prevent this drawback, one should be very cautious in the use of (limited) panel survey data to correct information in the SBR. When using panel surveys, erroneous information leading to incorrect estimates should preferably be treated as outliers. This warning does not apply to admin data, because these data cover almost the entire population.

6.4 Timing of feedback to the unit base and the backbones

The unit base and the backbones “population frame”, “turnover” and “employment” have a crucial role in the linking and estimation process. Therefore, it is advisable that, if the default target population is updated due to errors in these sources, these sources themselves are updated too. These updates are desirable to ensure that late information or later available input data are processed with the correct population information.

A disadvantage of correcting the input data is that previously published estimates are revised when re-running the process with the improved population information. In this case, previously published estimates are defined as estimates published before the errors in the population were detected. This drawback of unexpected revisions can be limited by
- developing a good metadata system, i.e. one that records which data belong to which estimate;
- using the paradigm that the information derived from the SBR (= source for the population frame) and the backbones “turnover” and “employment” is correct unless proven otherwise. In other words, the population and the corresponding input data are corrected only if the detected errors are certain and influential;
- relating the timing of incorporating changes in the input data to the revision moments of the most important statistical outputs.

Due to time lags between different data sources, the revisions caused by corrected information in the unit base and admin data sources are likely to be larger if the statistical DataWareHouse covers short-term statistics, too.
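The rule that input data are corrected only if the detected errors are certain and influential, and that changes are timed to the revision moments of the main outputs, can be sketched as a simple decision procedure in Python. This is only an illustration under stated assumptions: the `Correction` record, its `certain` and `impact` fields, the threshold value and the representation of revision moments as dates are all hypothetical and are not prescribed by this document.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Correction:
    unit_id: str
    detected_on: date
    certain: bool   # has the error been proven?
    impact: float   # assumed measure of influence on the estimates


# Hypothetical threshold above which an error counts as influential.
INFLUENCE_THRESHOLD = 0.01


def accepted(correction, threshold=INFLUENCE_THRESHOLD):
    """Source data are 'correct unless proven otherwise': only certain
    and influential errors lead to a correction."""
    return correction.certain and correction.impact >= threshold


def schedule(corrections, revision_moments):
    """Assign each accepted correction to the first revision moment on or
    after its detection date; corrections without one stay unscheduled."""
    plan = {}
    for c in corrections:
        if not accepted(c):
            continue
        later = [m for m in sorted(revision_moments) if m >= c.detected_on]
        plan[c.unit_id] = later[0] if later else None
    return plan


corrections = [
    Correction("STAT-A", date(2013, 2, 10), certain=True, impact=0.05),
    Correction("STAT-B", date(2013, 2, 12), certain=False, impact=0.20),
    Correction("STAT-C", date(2013, 5, 3), certain=True, impact=0.001),
]
revision_moments = [date(2013, 3, 31), date(2013, 6, 30)]
plan = schedule(corrections, revision_moments)
print(plan)
```

In this sketch only the first correction is scheduled: the second is not proven and the third falls below the assumed influence threshold, so both leave the input data unchanged.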
6.5 Timing of feedback to the SBR

It has been argued in the previous section that updates in the provisional target populations due to proven and influential errors in
- the backbone “population frame”,
- the backbones “turnover” and “employment”,
- the unit base
should be accompanied by updates in the corresponding backbones and unit base, too. However, the timing of these updates should correspond with the revision moments of the most important estimates.

The population frame and unit base are strongly related to the SBR. Hence, the SBR should also be updated if the population frame and unit base are updated due to proven and influential errors. However, the timing of updating the SBR is extremely important, as the SBR also acts as a frame for survey sampling, including for surveys falling outside the scope of the statistical DWH.

The importance of the timing can best be illustrated with an example. If the SBR is used as the sampling frame for an STS survey of the current year t+1 and the SBR is ‘suddenly’ updated with information from the statistical-DWH for last year t, a sudden discontinuity, incorrect as regards timing, arises in the STS survey series. The question is whether this discontinuity is desirable. The same applies to surveys falling outside the scope of the statistical DataWareHouse.

Therefore, it is advisable to develop a strategy for correcting information in the SBR. A possible strategy is:
- For the errors with the biggest impact: correcting the population frame and the SBR at the same time (and as soon as possible). However, consultation with the stakeholders of the most important statistics outside the scope of the statistical DataWareHouse is strongly recommended, as these changes may have an impact on other statistics.
- For less influential errors: corrections in the SBR are carried out at the end of the calendar year, when all surveys are renewed or refreshed.
In this case, preliminary estimates outside the statistical-DWH published within 12 months after the statistical year t are still based on an SBR including known errors. The final estimates published more than 12 months after statistical year t are based on an SBR corrected for known errors.

7. Conclusions

Two conditions are required for a successful statistical DataWareHouse. Firstly, the population should be well defined. Secondly, one (enterprise) unit should be used in the statistical DataWareHouse, because it is in practice impossible to create integrated datasets for several (types of) enterprise units. A unit base is needed to link the different units of all individual data sources to the statistical enterprise unit.

Backbones for “population”, “turnover” and “employment” are desired, as these three data sources cover (almost) all enterprises and therefore provide good basic characteristics for enterprises. Linking other data sources to these three backbones, instead of only to population totals, improves the quality of the integrated dataset. If an NSI integrates the data of the largest enterprises at the beginning of data processing, the integrated data from large enterprises can be considered a backbone, too.

The SBR is an indirect data source for the statistical DataWareHouse. It is an indirect source because 1) the backbone “population” is derived from it, 2) the unit base is strongly related to it and 3) the surveys, another important data source for the statistical DataWareHouse, are based on it. Hence, when errors in the population are revealed after integrating different data sources, it is desired that these errors are corrected in the SBR, too. However, the timing of incorporating these corrections in the SBR (and other backbones) is extremely important due to the multiple use of SBR information in data sources within or beyond the scope of the statistical DataWarehouse.