Title: Data linking aspects of combining data (survey/administrative) including options for various hierarchies (S-DWH context)
WP: 2
Deliverable: 2.4.1
Version: Final version
Date: 04-06-2013
Author: Nadezda Fursova
NSI: Lithuania
ESSNET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS
Content
1. Introduction
1.1 Definition of statistical Data Warehouse (according to the FPA)
1.2 Purpose of the document
1.3 What is data linking?
    The difference between data linking and integration
2. The input data set
2.1 Statistical Business Register and Population frame
2.2 Data sources
2.2.1 Surveys
2.2.2 Combined data
2.2.3 Administrative data
2.3 Defining metadata
2.4 The statistical unit base
3. The quality of data
4. Data linking process
4.1 Deterministic linkage
4.2 Probabilistic linkage
4.3 The data linkage process
4.4 Determinants of linkage quality
4.5 Measures of quality of data linkage
5. Methodological issues
6. Link with other ESSnet
7. Conclusions
8. References
1. Introduction
1.1 Definition of statistical Data Warehouse (according to the FPA)
The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare recommendations on making better use of data that already exist in the statistical system and on creating fully integrated data sets for enterprise and trade statistics at the micro level: a 'data warehouse' approach to statistics.
The broad definition of a data warehouse to be used in this ESSnet is:
‘A common conceptual model for managing all available data of interest, enabling the NSI to
(re)use this data to create new data/new outputs, to produce the necessary information and
perform reporting and analysis, regardless of the data’s source.’
The project describes a generic Data Warehouse for Statistics (S-DWH), or statistical Data Warehouse, as a central statistical data store, regardless of the data’s source, for managing all available data of interest, enabling the NSI to:
- (re)use data to create new data/new outputs;
- perform reporting;
- execute analysis;
- produce the necessary information.
This corresponds to a central repository able to support several kinds of data (micro, macro and meta) entering the S-DWH in order to support cross-domain production processes and statistical design, fully integrated in terms of data, metadata, processes and instruments.
In practice, the S-DWH is subdivided into two main environments:
- the first is where all available information is collected and built up, usually defined by the Extraction, Transformation and Loading (ETL) functions; the aim of this environment is to create a set of fully integrated data;
- the second is the actual data warehouse, where data analysis (or mining) and reports for executives are realised; the aim of this environment is to disseminate the fully integrated data as consistent output.
Compared to ‘commercial’ Data Warehouses, the statistical Data Warehouse is, in general, more focussed on creating a set of fully integrated data (from different data sources).
Workpackage 2 (WP 2) of this ESSnet covers all essential methodological elements for designing, building and implementing the statistical Data Warehouse (S-DWH). It concentrates on the methodological aspects of creating a set of integrated data. This document describes an essential part of the creation of an integrated dataset, namely linking data from different sources.
1.2 Purpose of the document
The purpose of this document is to give an overview of data linking in a Statistical Data Warehouse, to provide information about data linking methods and useful links, to mention problems that can be met when linking data from multiple sources, and to present guidelines on methodological challenges in data linking.
Workpackage 3 (WP 3), which covers all essential architectural and technical elements for designing, building and implementing the S-DWH, applied the GSBPM sub-processes (Figure 1) to the S-DWH concept in order to provide a Business Architecture of the S-DWH. The data linking process is part of the data integration process, which is covered by sub-process 5.1 of the GSBPM.
Figure 1. A schematic sketch of the GSBPM (Generic Statistical Business Process Model).
The WP 3 team proposes to use four functional layers (Figure 2):
- data source layer,
- integration layer,
- interpretation and data analysis layer,
- data presentation layer;
to manage the information flow between the data sources and the S-DWH central administration.
The integration layer of the S-DWH corresponds to processing phase 5 of the GSBPM. The scope of this document refers to the integration layer, where different input sources are linked to each other.
Figure 2. Relationships between the layers of a statistical Data Warehouse and the statistical processes
according to the GSBPM (Generic Statistical Business Process Model).
1.3 What is data linking?
The difference between data linking and integration
Data linking means linking the different input sources (administrative data, survey data, etc.) to one population and processing these data into one consistent dataset, which greatly increases the power of analysis possible with the data.
Data integration, on the other hand, according to GSBPM sub-process 5.1 (Figure 1), is a process that integrates data from one or more sources. The input data can be from a mixture of external or internal data sources and a variety of collection modes, including extracts of administrative data. The result is a harmonised data set. Data integration typically includes:
- matching / record linkage routines, with the aim of linking data from different sources,
where those data refer to the same unit;
- prioritising, when two or more sources contain data for the same variable (with potentially
different values).
Data integration may take place at any point in this phase, before or after any of the other sub-processes. There may also be several instances of data integration in any statistical business process. Following integration, depending on data protection requirements, data may be anonymised, that is, stripped of identifiers such as name and address, to help protect confidentiality.
The data integration process puts data from disparate sources into a consistent format. Problems such as naming conflicts and inconsistencies among units of measure must be resolved. When this is achieved, data are said to be integrated.
Data integration is a big opportunity for NSIs: it opens up possibilities for reducing costs, leads to a reduced survey burden on respondents and may increase data quality. But it is also a big challenge: a lot of preparatory work must be done by NSIs; the data sources should be examined and the metadata defined before linking data. There are many issues and questions that should be analysed and answered in order to create fully integrated data sets for enterprise and trade statistics at the micro level.
If the data include error-free and unique common identifiers, such as a unique identification code of the legal entity or a social security number, record linkage is a simple file merge operation which can be done by any standard database management system. In other cases it is necessary to resort to a combination of ambiguous and error-prone identifiers such as surnames, names, addresses and NACE code information. Data quality problems with such identifiers usually yield a considerable number of unlinkable cases. In this situation the use of much more sophisticated techniques and specialised record linkage software is inevitable.
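To illustrate the simple case, here is a minimal sketch in Python/pandas of such a file merge, assuming two hypothetical input files that share an error-free enterprise identifier column (all file and column names are illustrative, not part of the ESSnet specification):

    import pandas as pd

    # Hypothetical inputs: a survey extract and an administrative extract,
    # both carrying the same error-free unique enterprise identifier.
    survey = pd.read_csv("survey_2013.csv")    # columns: enterprise_id, turnover, ...
    admin = pd.read_csv("admin_vat_2013.csv")  # columns: enterprise_id, vat_turnover, ...

    # With a unique, error-free key the linkage is a plain relational join.
    linked = survey.merge(admin, on="enterprise_id", how="inner")

    # Records that fail to link are worth inspecting rather than discarding.
    unlinked = survey[~survey["enterprise_id"].isin(admin["enterprise_id"])]
    print(len(linked), "linked;", len(unlinked), "survey records without an admin match")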
2. The input data set
The first and main step in data linking is to determine needs and check data availability. The aim is to have all available data of interest in the S-DWH. The proposed scope of the input data set covers the sources described in the following subsections.
2.1 Statistical Business Register and Population frame
In a Data Warehouse system the Statistical Business Register has a crucial role in linking data
from several sources and defining the population for all statistical output.
Member States of the European Union maintain business registers for statistical purposes as a
tool for the preparation and coordination of surveys, as a source of information for the
statistical analysis of the business population and its demography, for the use of
administrative data and for the identification and construction of statistical units.
The SBR contains at least:
- a statistical unit;
- a name and address of the statistical unit;
- an activity code (NACE);
- a starting and a stopping date of the enterprise and of its NACE code of activity.
NSIs use the SBR to derive a population frame and to relate all input data to a reference target population. As proposed in deliverable 2.2 of the ESSnet on Data Warehousing, to link several input data sets in a S-DWH we need to agree on the default target population and on the enterprise unit to which all input data are matched. The default target population is defined as the statistical enterprise units which have been active during the reference year; this target population was proposed because it corresponds with the output requirements of the European regulation. Most statistics use the SBR to derive a population frame, which consists of all units (enterprises) with a certain specific activity. The activity is derived from the NACE code. For example, for annual statistics this means that the default target population consists of all enterprises active during the year, including the starters and stoppers (and the new/stopped units due to merging and splitting companies).
This input source will be called ‘population frame’. The population frame includes the
following information to derive activity status and subpopulations:
1) Frame reference year
2) Statistical enterprise unit, including its national ID and its EGR ID¹
3) Name/address of the enterprises
4) National ID of the enterprises
5) Date in population (mm/yr)
6) Date out of population (mm/yr)
7) NACE code
8) Institutional sector code
9) Size class²

¹ A meaningless ID assigned by the EGR system to enterprises; it is advised to include this ID in the Data Warehouse to enable comparability between the country-specific estimates.
² Could be based on employment data.
The population frame is crucial information to determine the default active population.
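To make the structure concrete, the following is a minimal sketch of how a population frame record could be represented, with hypothetical field names (the list above fixes the content, not the naming; the date format is assumed here to be "mm/yyyy"):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PopulationFrameRecord:
        """One enterprise in the population frame for a reference year."""
        frame_reference_year: int
        national_id: str           # national ID of the enterprise
        egr_id: Optional[str]      # meaningless EGR ID, if available
        name: str
        address: str
        date_in: str               # "mm/yyyy" the unit entered the population
        date_out: Optional[str]    # "mm/yyyy" the unit left; None if still active
        nace_code: str             # activity code
        institutional_sector: str
        size_class: str            # e.g. based on employment data

    def active_in_year(rec: PopulationFrameRecord, year: int) -> bool:
        """Default target population: units active at some point in the year."""
        entered = int(rec.date_in.split("/")[1]) <= year
        left_before = rec.date_out is not None and int(rec.date_out.split("/")[1]) < year
        return entered and not left_before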
2.2 Data sources
One aim of a S-DWH is to create a set of fully integrated data about enterprises. These data may come from different sources such as surveys, administrative data, accounting data and census data. Different data sources cover different populations. Some data sources, like censuses, cover the whole population (of enterprises). Some cover all enterprises with a certain activity, some only big enterprises or other subpopulations. Other survey samples include small enterprises, but provide information about only a few of them. The main issue is to link these input data sources and to ensure that the data are linked to the same enterprise unit and compared with the same target population.
Main data sources:
1. Surveys (censuses, sample surveys)
2. Combined data (survey and administrative data)
3. Administrative data
2.2.1 Surveys
A survey is based on statistical data collection (a statistical questionnaire). A sample survey is more restricted in scope: the data collection is based on a sample, a subset of the total population, as opposed to a complete count of the target population, which is called a census. However, in sample surveys some sub-populations may be investigated completely while most sub-populations are subject to selected samples. Some surveys are based on both statistical data collection and administrative data; these are the so-called combined data.
Surveys as well as administrative data can be used to detect errors in the default target population.
2.2.2 Combined data
Since survey and administrative data sets have their respective advantages, a combination of both sources enhances the potential for research. Furthermore, record linkage has several advantages from a survey-methodological perspective. The administrative data are used to update the frame of active enterprises and to cover and estimate non-surveyed or non-responding enterprises.
The success of the actual linkage depends on the available information to identify a respondent in administrative records and on the quality of these identifiers. Record linkage can be performed using different linkage methods: by means of a unique identifier such as the Social Security Number or another unique common identifier, or on the basis of ambiguous and error-prone identifiers such as name, sex, date of birth, address, etc.
Before the records from both data sources are actually compared, extensive pre-processing needs to be conducted to clean up typographical errors as well as to fill in missing information. These standardisation steps should be done consistently for both the administrative and survey records.
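A minimal sketch of such pre-processing in Python, assuming hypothetical raw name strings; real standardisation rules are language- and register-specific:

    import re
    import unicodedata

    def standardise(value: str) -> str:
        """Normalise a free-text identifier before comparison across sources."""
        # Strip accents, lowercase, trim and collapse whitespace.
        value = unicodedata.normalize("NFKD", value)
        value = value.encode("ascii", "ignore").decode("ascii")
        value = value.lower().strip()
        value = re.sub(r"\s+", " ", value)
        # Remove punctuation that often differs between sources.
        return re.sub(r"[.,;:'\"/-]", "", value)

    # Hypothetical example: the same enterprise written differently in two sources.
    print(standardise("  Ménard & Co., Ltd. "))  # -> "menard & co ltd"
    print(standardise("MENARD & CO LTD"))        # -> "menard & co ltd"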
2.2.3 Administrative data
Administrative data is the set of units and data derived from an administrative source. A
traditional definition of administrative sources is that they are files of data collected by
government bodies for the purposes of administering taxes or benefits, or monitoring
populations. This narrow definition is gradually becoming less relevant as functions
previously carried out by the government sector are, in many countries, being transferred
partly or wholly to the private sector, and the availability of good quality private sector data
sources is increasing.
In many NSIs administrative data are used to update SBR data (in Lithuania these are data from the State Social Insurance Fund Board (SODRA) and VAT data from the State Tax Inspectorate), to determine the activity status of the statistical unit (annual VAT and employment datasets), to update the final survey frame, to cover non-surveyed and non-responding enterprises, and to estimate missing data.
2.3 Defining metadata
The term metadata is very broad, and a distinction is made between “structural” metadata, which define the structure of statistical data sets and metadata sets, and “reference” metadata, which describe the actual data contents, for instance the concepts and methodologies used, the unit of measure, the data quality (e.g. accuracy and timeliness) and the production and dissemination process (e.g. contact points, release policy, dissemination formats). Reference metadata may refer to specific statistical data, to entire data collections or even to the institution that provides the data.
NSIs need to define metadata before linking sources. What kind of reference metadata needs to be submitted? In Eurostat this information is presented in files based on a standardised format called ESMS (Euro SDMX Metadata Structure) (Figure 3). ESMS metadata files are used for describing the statistics released by Eurostat; they aim at documenting the methodologies, quality and statistical production processes in general.
Figure 3. ESMS (Euro SDMX Metadata Structure). It uses 21 high-level concepts, with a limited breakdown of
sub-items, strictly derived from the list of cross domain concepts in the SDMX Content Oriented Guidelines
(2009).
2.4 The statistical unit base
The unit base is closely related to the SBR. However, as its contents are also closely related to the available input data, it is considered a separate input source. This unit base describes the relationship between the different units and the statistical enterprise unit. It summarises the results of the data linking process described before. Figures 4 and 5 illustrate the contents of the unit bases of the Netherlands and Lithuania respectively.
Note that the exact contents of the unit base depend on:
- legislation per country;
- output requirements and desired output of a S-DWH;
- available input data.
It should also be mentioned that the unit base might be updated when other input data (with different units) become available.
Figure 4. Example of a unit base of the Netherlands
Figure 5. Example of a unit base of Lithuania
The unit base is essential for the data linking process. Established links between the data are needed to make the process of data integration fluid, accurate and quality assured.
3. The quality of data
Caring about data quality is the key to safeguarding and improving it. Discovering whether
data are of acceptable quality is a measurement task, and not a very easy one. There are
several criteria for defining data quality:
o relevance
o accuracy
o timeliness
o accessibility and clarity of results
o comparability
o coherence
o completeness
Data quality problems can produce unrealistic or noticeably strange answers in statistical
analysis and estimations.
4. Data linking process
As was mentioned above, if we have a unique identifier the data linking process is a simple file merge operation.
The Statistical Business Register (SBR) is the main tool of data collection for economic statistics in the Department of Statistics of Lithuania. All statistical surveys are conducted on the basis of the SBR. Some surveys are conducted without using administrative data, and some are based on both statistical data collection (statistical questionnaires) and administrative data. In Lithuania we have a unique identifier, the company ID number, so no technical problems arise in the data linking process. Other NSIs may face a different situation: if there is no such identifier, sophisticated techniques are used to link data from different sources. The result of this complex situation is summarised in the unit base, which will not be described in any further detail here.
Data linkage methods usually fall across a spectrum between deterministic and probabilistic methods. A combination of linkage methods may be used, but the choice of method depends on the types and quality of the linkage variables available on the data sets to be linked.
4.1 Deterministic linkage
Deterministic linkage ranges from the simple joining of two or more datasets by a reliable and stable identifier to sophisticated stepwise algorithmic linkage. Simple deterministic linkage uses a single identifier. A high degree of certainty is required in deterministic linkage. This high degree can be obtained if there is a unique entity identifier which uniquely identifies an individual or legal unit across the data sets. A unique identification number, such as a company ID number or Social Security number, may exist, allowing records to be matched across multiple data sets.
Deterministic linkage is based on exact matches. Variables used in deterministic linkage need to be accurate, robust, stable over time and complete, for example: first name, last name (for males), sex, date of birth, etc. Alternatively, a combination of attributes, including last name, first name, sex and date of birth, can be used to create a linkage key which is then used to match records that have the same linkage key value. This linkage key is known as a statistical linkage key (SLK).
A more sophisticated form of deterministic linkage is stepwise deterministic record linkage,
which has been developed in response to variations that often exist in the attributes that are
used in creating the SLKs for deterministic linkage. Stepwise deterministic record linkage
uses auxiliary information on the data sets to provide a platform from which variation in the
reported linkage key information can be considered. This differs from simple deterministic
linkage that relies on an exact, one-to-one character matching of linkage keys across two or
more data sets.
Another variant of deterministic linkage is “rules-based linkage” where a set of rules can be
used to classify pairs of records as matches or non-matches. Such rules can be more flexible
than using a linkage key, but their development is labour intensive and highly dependent on
the data sets to be linked.
Statistical linkage keys
The SLK is a code that replaces, for example, the person's first name and last name, to protect the person's identity. Generally most SLKs for personal statistics are constructed from last name, first name, sex and full date of birth. SLKs protect privacy and data confidentiality because they serve as an alternative to a person's name and date of birth being on the data sets to be linked. For example, a commonly used SLK is the SLK 581. It consists of the concatenation of the 2nd, 3rd and 5th letters of the family name, the 2nd and 3rd letters of the given name, the date of birth as a character string of the form "ddmmyyyy", followed by the character "1" for male or "2" for female. It comprises five characters from the person's first and last name, eight characters from the date of birth and one representing the person's sex.
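A minimal sketch of SLK 581 construction in Python, following the recipe above; note that the rule for names too short to supply the required letters varies between implementations, and the filler character '2' used here is one common convention, assumed rather than taken from this document:

    from datetime import date

    def slk_581(last_name: str, first_name: str, birth: date, sex: str) -> str:
        """Build an SLK 581: five name letters + "ddmmyyyy" + sex code."""
        def pick(name: str, positions: list[int]) -> str:
            letters = [c for c in name.upper() if c.isalpha()]
            # Assumed convention: missing letters are replaced with '2'.
            return "".join(letters[p - 1] if p <= len(letters) else "2"
                           for p in positions)

        key = pick(last_name, [2, 3, 5]) + pick(first_name, [2, 3])
        key += birth.strftime("%d%m%Y")
        key += "1" if sex.upper().startswith("M") else "2"
        return key

    # Hypothetical record: John Smith, born 1 February 1980, male.
    print(slk_581("Smith", "John", date(1980, 2, 1), "M"))  # -> "MIHOH010219801"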
Data linkage using an SLK is commonly deterministic, but this requires the variables used in constructing the SLK to be accurate, complete and as exact as possible. Difficulties with SLK construction may be experienced where variations of names are used in different databases (e.g. Smith/Smythe, Thompson/Thomson). There are two kinds of errors associated with SLKs. Firstly, there may be incomplete or missing data items on an individual's record, which means that the SLK will be incomplete. Secondly, errors in the source data may lead to the generation of multiple SLKs for the same individual, or multiple individuals may share the same SLK. Deterministic linkage and linkage with SLKs can be combined with probabilistic linkage, which requires less exacting standards of accuracy, stability and completeness.
4.2 Probabilistic linkage
Probabilistic linkage may be undertaken where there are no unique entity identifiers or SLKs, or where the linkage variables and/or entity identifiers are not as accurate, stable or complete as required for the deterministic method.
In such cases, matching and linking will depend on achieving the closest approach to unique identification by using several identifying variables. Each of these variables is only a partial identifier but, in combination, they provide a match that is sufficiently accurate for the intended purpose of linking data sets. Probabilistic linkage has a greater capacity to link records with errors in their linking variables. In the traditional probabilistic linkage approach, pairs of records are classified as matches if their common attributes predominantly agree, or as non-matches if they predominantly disagree.
Probabilistic methodologies can lead to a much better linkage of records from separate data collections than simple deterministic methodologies for statistical linkage purposes. There are 2^n possible match/non-match configurations of n linkage fields. This method of record linkage uses the M- and U-probabilities: summing the weights for agreement and disagreement over the range of linkage variables gives a comparison weight for each record pair.
M-probability (match probability)
The probability that a field agrees given that the pair of records is a true match. For any given field the same M-probability applies to all records. For example, an M-probability of 0.95 for surnames means that the probability that two records belonging to the same person agree on last name is 0.95. The reasons for the 5% disagreement are data errors, missing data and instability of the value (e.g. surname change, misspelling).
U-probability (non-match probability)
The probability that a field agrees given that the pair of records is not a true match. It is often simplified as the chance that two records will agree at random. It is defined for a specific value and will often have multiple values for each field. Typically, U-probabilities are estimated as the proportion of records with a specific value, based on the frequencies in the primary or the more comprehensive and accurate data source.
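As a sketch of how M- and U-probabilities turn into linkage weights, the following uses the standard Fellegi-Sunter formulation, where a field contributes log2(M/U) on agreement and log2((1-M)/(1-U)) on disagreement; the field list and probability values are illustrative assumptions:

    import math

    # Illustrative M/U probabilities per linkage field (assumed values).
    FIELDS = {
        # field:      (M = P(agree | match), U = P(agree | non-match))
        "surname":    (0.95, 0.01),
        "first_name": (0.90, 0.05),
        "birth_date": (0.98, 0.003),
    }

    def field_weight(field: str, agrees: bool) -> float:
        """log2(M/U) if the field agrees, log2((1-M)/(1-U)) if it disagrees."""
        m, u = FIELDS[field]
        return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

    def pair_weight(comparisons: dict[str, bool]) -> float:
        """Record pair comparison weight: the sum of the field weights."""
        return sum(field_weight(f, a) for f, a in comparisons.items())

    # Hypothetical pair: surname and birth date agree, first name disagrees.
    print(round(pair_weight({"surname": True, "first_name": False, "birth_date": True}), 2))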
4.3 The data linkage process
Data linkage can be project-based (ad hoc) or systematic. Systematic data linkage involves the maintenance of a permanent and continuously updated master linkage file and a master linkage key. The data linkage process may vary, depending on the linkage model and the linkage method. There are, however, four steps that are common to both data linkage models: data cleaning and standardisation, blocking and searching, record pair or record group comparison, and a decision model. While data cleaning and standardisation are common to both deterministic and probabilistic linkage, the other three processes are more relevant to the probabilistic method. (A combined sketch of the last three steps follows this list.)
- Data cleaning and standardisation
This process identifies and removes the errors and inconsistencies in the data, and analyses the text fields so that the data items in each data file are comparable.
- Blocking
Data linkage often involves large data sets. When two data sets are linked, the number of possible comparisons equals the product of the numbers of records in the two data sets. Blocking reduces the number of comparisons needed by only comparing record pairs where links are more likely to be found. Blocking involves selecting sets of blocking attributes, such as sex, date of birth, last name or components of first and last name, and only comparing records with the same attributes.
- Record pair comparison
During the comparison stage, record pairs are compared on each linkage field, and the level of agreement is measured. Field comparison weights are assigned to each linkage field for each record pair. The field comparison weights are then summed over the linkage variables to form a record pair comparison weight.
- Decision model
Record pair comparison weights help data linkers decide whether a record pair belongs to the same entity. This decision can be based on a single cut-off weight or on a set of lower and upper cut-off weights. Under the single cut-off weight approach, all record pairs with a comparison weight equal to or above the cut-off weight are assigned as links and all those below the cut-off weight are assigned as non-links. Under the lower and upper cut-off weights approach, all record pairs with a comparison weight at or above the upper cut-off are assigned as links and those with weights below the lower cut-off are assigned as non-links. Record pairs with comparison weights between the lower and upper cut-offs are assigned as possible links and designated for clerical review. In clerical review, data linkers manually inspect all the variables available for the record pairs whose link status cannot be automatically determined, and then decide whether the record pairs belong to the same entity.
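A compact sketch tying the last three steps together, in the same style as the weight example of section 4.2; all records, field weights and cut-off values below are illustrative assumptions:

    from collections import defaultdict

    # Hypothetical, already standardised records: (id, last_name, first_name, birth_year).
    file_a = [(1, "smith", "john", 1980), (2, "jones", "mary", 1975)]
    file_b = [(10, "smith", "jon", 1980), (11, "brown", "mary", 1975)]

    # Blocking: only compare record pairs sharing the blocking attribute (birth year).
    blocks = defaultdict(lambda: ([], []))
    for rec in file_a:
        blocks[rec[3]][0].append(rec)
    for rec in file_b:
        blocks[rec[3]][1].append(rec)

    # Record pair comparison: sum illustrative field agreement/disagreement weights.
    def pair_weight(a, b) -> float:
        weight = 6.6 if a[1] == b[1] else -3.3   # last name (assumed weights)
        weight += 4.2 if a[2] == b[2] else -2.1  # first name
        return weight

    # Decision model: lower and upper cut-offs with a clerical-review band between them.
    LOWER, UPPER = 0.0, 5.0
    for year, (recs_a, recs_b) in blocks.items():
        for a in recs_a:
            for b in recs_b:
                w = pair_weight(a, b)
                status = ("link" if w >= UPPER
                          else "non-link" if w < LOWER
                          else "possible link (clerical review)")
                print(a[0], b[0], round(w, 1), status)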
4.4 Determinants of linkage quality
Key determinants of linkage quality include:
- the quality of SLKs (in the case of deterministic linkage);
- the quality of blocking and linkage variables (in the case of probabilistic linkage).
When the linkage method is deterministic, and SLKs are used to compare the records, then
the quality of the linkage will depend on the quality of the variables used in constructing the
SLKs. In addition to having accurate, reliable and consistently reported blocking and linking
variables, the quality of data linkage also depends on the blocking and linking strategy
adopted.
Blocking and linking variables are used in constructing SLKs (deterministic linkage) or in linking the data sets (probabilistic linkage). Poor quality, i.e. blocking or linking variables that are missing, inaccurate, incomplete or inconsistently reported on the data sets to be linked, could lead to some records not being linked or to some records being linked to the wrong records.
4.5 Measures of quality of data linkage
Measures that may be used to assess data linkage quality include accuracy, sensitivity, specificity, precision and the false-positive rate. These measures are based on basic quantities that are simply defined from the results of data linkage (Figure 6).
Figure 6. Classification of matches and links
Not all of these measures are easily calculated, because their calculation depends on knowing the number of true non-matches or true negatives (TN). These are often unknowable or difficult to calculate.
The most widely used quality measures are:
- sensitivity, or the true positive rate: the proportion of matches that are correctly classified as matches. It may be defined as the proportion of all records in a file with a match in another file that are correctly accepted as links (true links). This measure is calculated as TP/(TP+FN).
- precision, or the positive predictive value: the proportion of all classified links that are true links (true positives). This measure is calculated as TP/(TP+FP).
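For instance, a minimal sketch computing these two measures from linkage results, assuming the true match status is known for an evaluation sample (the counts are made up):

    # Evaluation counts from a hypothetical clerically reviewed sample.
    TP = 940  # true links: classified as links and truly matches
    FP = 25   # false links: classified as links but not matches
    FN = 60   # missed links: true matches not classified as links

    sensitivity = TP / (TP + FN)  # true positive rate
    precision = TP / (TP + FP)    # positive predictive value

    print(f"sensitivity = {sensitivity:.3f}")  # 0.940
    print(f"precision   = {precision:.3f}")    # 0.974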
5. Methodological issues
When we link data from different sources, such as sample surveys, combined data and administrative data, we can meet problems such as missing data, overlapping data, "unlinked data", etc. Errors might be detected in statistical units and the target population when linking other data to this information. If these errors are influential, they need to be corrected in the S-DWH.
There are three kinds of missing data: structural, temporal and errors. Temporal missing data are data that are not available at this time but will be provided and incorporated within a short period. Errors are incorrect values. A frequently expected error will be the NACE code (activity).
One of the problems is conflict between sources. A data conflict occurs when two or more sources contain data for the same variable (with different values). In many cases, when two (or more) reliable sources conflict, one (or more) of those sources can be demonstrated to be unreliable. The main goal is to define a data source priority for each indicator, together with rules for determining the quality of the priority data source. It needs to be defined which data source is more reliable for which indicator, as judged by different criteria. In some cases additional analysis may be needed, and more sophisticated methods, or even manual techniques, could be used.
In Lithuania we use a data source priority for each indicator in the SBR, and once priority sources are determined they are not changed. To determine the priority source, priority rules need to be defined, such as the quality of the data source, its completeness and update time and, of course, consultation with experts.
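A minimal sketch of such priority-based conflict resolution, assuming a hypothetical priority table per indicator; in practice the ordering would follow the quality rules described above:

    # Hypothetical source priorities per indicator: the first listed source wins.
    PRIORITY = {
        "turnover":   ["vat_register", "survey", "accounting_data"],
        "employment": ["social_insurance", "survey"],
    }

    def resolve(indicator: str, values_by_source: dict) -> float:
        """Pick the value from the highest-priority source that reported one."""
        for source in PRIORITY[indicator]:
            value = values_by_source.get(source)
            if value is not None:
                return value
        raise KeyError(f"no source available for indicator {indicator!r}")

    # Conflicting turnover values for one enterprise from two sources.
    print(resolve("turnover", {"survey": 1250.0, "vat_register": 1190.0}))  # -> 1190.0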
When there is no unique identifier, we use more sophisticated methods for matching and linking on several identifiers. This may leave some data "unlinked". Poor quality of the selected linkage variables or of the probabilistic methods can lead to some records not being linked, or being linked to the wrong records; some records cannot be linked at all because of missing, incomplete or inaccurate variables.
6. Link with other ESSnet
In the document “Mapping the coverage of ESSnet projects relevant to workpackage 2 of the ESSnet on data warehousing” by Ria Sanderson, potential areas of overlap, or potential gaps, between the deliverables of Work Package 2 (WP2) of the ESSnet on Data Warehousing (hereafter, ESSnet DWH) and other projects were identified. It was also identified which of the completed and ongoing ESSnet projects are of direct relevance to WP2. The visual depiction of the links with other ESSnets is very useful for finding all information related to data linking within ESSnet projects, whether they are finished or not.
The figure shows a visualisation of the links between the deliverables of the ESSnet DWH and work packages in the other relevant ESSnet projects. The links identified as “methods” links are shown as solid black lines, and the links identified as “background” links are shown as grey lines. Each of the ESSnet projects considered is represented by a different coloured circle, as shown in the legend; the projects are the ESSnets DWH, admin data (AD), data integration (DI), common reference architecture (CORA), statistical disclosure control (SDC), statistical disclosure control harmonisation (SDC-H) and the Blue-ETS project. In the case of the ESSnet DWH, the circles represent the different deliverables, and the deliverable number is shown in each circle. For the other ESSnet projects, the numbers in the circles represent the individual work packages where links have been made.
The figure highlights the importance of the results of the data integration for deliverable 2.4
of the ESSnet DWH. There are many links that have been classified as “background” links,
meaning that a lot can be learned for this deliverable from work that is currently taking place
elsewhere, particularly in the ESSnet on Admin Data.
“Methods” links:
1. DI WP1 State of the art in data integration – provides methods for data integration in a variety of circumstances (the final report is available at http://www.cros-portal.eu/sites/default/files//ESSnetDI_WP1_v1.32.pdf). The document “introduces a formal procedure to find records belonging to the same unit, being from either different or the same source, that is, database. These procedures are based on probabilistic instead of deterministic criteria and rely on the equivalence of values when comparing those from two different records on a field-by-field basis; and then, on the probability of agreement between values given the true (and unknown) status of the pair of records, that is, whether they actually do belong to the same entity or not. Both standard and alternative approaches for probabilistic record linkage are discussed.”
2. DI WP2 Development of methods – provides methods for establishing linkage error, performing inference, and maintaining micro-consistency and macro-consistency (the final report is available at http://www.cros-portal.eu/sites/default/files//WP2.pdf). The document presents methodological developments classified according to three areas of integration activities, connected with different phases in the statistical processing of data from multiple sources. Record linkage is an obvious first stage when combining data from different sources on the same units; when no unique error-free identifiers are available, the problem of linkage errors has to be dealt with at this stage. Inference with multiple sources covers estimation problems where multiple sources are available but the sets of units in the different sources are not or only partially overlapping. Micro and macro consistency tackles the problem that information from different sources can be inconsistent both at the micro level, that is within linked records, and at the macro level, that is between estimates of aggregates.
3. DI WP3 Development of common software tools – provides software for data linking in R (the software is available at https://joinup.ec.europa.eu/software/relais/description and http://www.cros-portal.eu/content/wp3-development-common-software-tools). The project presents the improvements that have been made to the RELAIS (REcord Linkage At IStat) software for data linking, as well as the software itself. RELAIS is a toolkit providing a set of techniques for dealing with record linkage projects.
“Background” links:
4. AD WP2b Checklist for quality of admin data. Aims to provide a checklist for assessing
the quality of an admin data source
http://essnet.admindata.eu/WorkPackage?objectId=4253
5. AD WP3 Methods of estimation for variables. Aims to provide methods of estimation for variables where admin data sources are not available.
http://essnet.admindata.eu/WorkPackage?objectId=4254
6. AD WP4 Timeliness of administrative data. Aims to provide methods of estimation when
admin data is not available in time
http://essnet.admindata.eu/WorkPackage?objectId=4255
7. AD WP6 Development of quality indicators. Aims to develop quality indicators for
assessing outputs based on survey and admin sources
http://essnet.admindata.eu/WorkPackage?objectId=4257
8. AD WP7 Statistics and accounting standards. Aims to align business statistics definitions
with accounting characteristics http://essnet.admindata.eu/WorkPackage?objectId=4258
7. Conclusions
Data linking is quite a challenging process. For successful data linking, firstly, the populations of the different data sources should be well defined. Secondly, the input sources should be of high quality; frequently they are not, due to errors, variations and missing data. Another challenge is data dynamics: data may regularly and routinely change over time. The third and main condition for successful data linking is the availability of a unique identifier.
If there is no unique identifier, we need to use more sophisticated methods: deterministic or probabilistic linking. Deterministic methods work best with high-quality data, while probabilistic linkage methods can lead to much better linkage than simple deterministic methods when there are missing data and no unique identifier.
The results of data linking are summarized in a unit base which is part of a statistical data
warehouse.
The quality of data linkage should also be evaluated; this depends on the presence of a unique ID and on the accuracy, precision and false-positive rates of the linkage.
Next steps in data integration are dealing with errors, missing data and conflicting data.
8. References
1. Manfred Antoni, Institute for Employment Research (IAB), "Linking survey data with administrative employment data: The case of the German ALWA survey", http://www.cros-portal.eu/sites/default/files//NTTS2013fullPaper_89.pdf.
2. Pieter Vlag, Statistics Netherlands, "Guidelines (including options) on how the BR interacts with the S-DWH".
3. Ria Sanderson, "Mapping the coverage of ESSnet projects relevant to Work Package 2 of the ESSnet on Data Warehousing".
4. Bishop, G.R. and Khoo, J. (2007), "Research Paper: Methodology of Evaluating the Quality of Probabilistic Linking", Australian Bureau of Statistics.
5. "National Best Practice Guidelines for Indigenous Data Linkage Activities Relating to Aboriginal and Torres Strait Islander People" (2012).