Title: Data linking aspects of combining data (survey/administrative) including options for various hierarchies (S-DWH context)
WP: 2
Deliverable: 2.4.1
Version: Final version
Date: 04-06-2013
Author: Nadezda Fursova
NSI: Lithuania
ESSNET ON MICRO DATA LINKING AND DATA WAREHOUSING IN PRODUCTION OF BUSINESS STATISTICS
Content
1. Introduction
1.1 Definition of statistical Data Warehouse (according to the FPA)
1.2 Purpose of the document
1.3 What is data linking?
    The difference between data linking and integration
2. The input data set
2.1 Statistical Business Register and Population frame
2.2 Data sources
2.2.1 Surveys
2.2.2 Combined data
2.2.3 Administrative data
2.3 Defining metadata
2.4 The statistical unit base
3. The quality of data
4. Data linking process
4.1 Deterministic linkage
4.2 Probabilistic linkage
4.3 The data linkage process
4.4 Determinants of linkage quality
4.5 Measures of quality of data linkage
5. Methodological issues
6. Link with other ESSnet
7. Conclusions
8. References
1. Introduction
1.1 Definition of statistical Data Warehouse (according to the FPA)
The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare recommendations on making better use of data that already exist in the statistical system and on creating fully integrated data sets for enterprise and trade statistics at the micro level: a 'data warehouse' approach to statistics.
The broad definition of a data warehouse to be used in this ESSnet is:
‘A common conceptual model for managing all available data of interest, enabling the NSI to
(re)use this data to create new data/new outputs, to produce the necessary information and
perform reporting and analysis, regardless of the data’s source.’
The project describes a generic Data Warehouse for Statistics (S-DWH), or statistical Data Warehouse, as a central statistical data store, regardless of the data’s source, for managing all available data of interest, enabling the NSI to:
- (re)use data to create new data/new outputs;
- perform reporting;
- execute analysis;
- produce the necessary information.
This corresponds to a central repository able to support several kinds of data (micro, macro and meta) entering the S-DWH in order to support cross-domain production processes and statistical design, fully integrated in terms of data, metadata, processes and instruments.
In practice, the S-DWH is subdivided into two main environments:
- the first is where all available information is collected and built up, usually defined by the Extraction, Transformation and Loading (ETL) functions; the aim of this environment is to create a set of fully integrated data;
- the second is the actual data warehouse, where data analysis (or mining) and reports for executives are realised; the aim of this environment is to disseminate the fully integrated data as consistent output.
Compared to ‘commercial’ Data Warehouses, the statistical Data Warehouse is, in general, more focussed on creating a set of fully integrated data (from different data sources).
Workpackage 2 (WP 2) of this ESSnet covers all essential methodological elements for designing, building and implementing the statistical Data Warehouse (S-DWH). It concentrates on the methodological aspects of creating a set of integrated data. This document describes an essential part of the creation of an integrated dataset, namely linking data from different sources.
1.2 Purpose of the document
The purpose of this document is to give an overview of data linking in a Statistical Data Warehouse, to provide information about data linking methods and useful links, to mention problems that can be met when linking data from multiple sources, and to present guidelines on methodological challenges in data linking.
Workpackage 3 (WP 3), which covers all essential architectural and technical elements for designing, building and implementing the S-DWH, applied the GSBPM sub-processes (Figure 1) to the S-DWH concept in order to provide a Business Architecture of the S-DWH. The data linking process is part of the data integration process, which is covered by sub-process 5.1 of the GSBPM.
Figure 1. A schematic sketch of the GSBPM (Generic Statistical Business Process Model).
The WP 3 team proposes to use four functional layers (Figure 2):
- data source layer,
- integration layer,
- interpretation and data analysis layer,
- data presentation layer;
to manage the information flow between the data sources and the S-DWH central administration.
The integration layer of the S-DWH corresponds to processing phase 5 of the GSBPM. The scope of this document refers to the integration layer, where different input sources are linked to each other.
Figure 2. Relationships between the layers of a statistical Data Warehouse and the statistical processes
according to the GSBPM (Generic Statistical Business Process Model).
1.3 What is data linking?
The difference between data linking and integration
Data linking means linking the different input sources (administrative data, survey data, etc.) to one population and processing these data into one consistent dataset, which greatly increases the power of analysis possible with the data.
Data integration, on the other hand, according to GSBPM sub-process 5.1 (Figure 1), is a process that integrates data from one or more sources. The input data can be from a mixture of external or internal data sources and a variety of collection modes, including extracts of administrative data. The result is a harmonised data set. Data integration typically includes:
- matching / record linkage routines, with the aim of linking data from different sources,
where those data refer to the same unit;
- prioritising, when two or more sources contain data for the same variable (with potentially
different values).
Data integration may take place at any point in this phase, before or after any of the other sub-processes. There may also be several instances of data integration in any statistical business process. Following integration, depending on data protection requirements, data may be anonymised, that is, stripped of identifiers such as name and address, to help protect confidentiality.
The data integration process puts data from disparate sources into a consistent format. Problems such as naming conflicts and inconsistencies among units of measure must be resolved. When this is achieved, data are said to be integrated.
Data integration is a big opportunity for NSIs: it opens up possibilities for reducing costs, leads to a reduced survey burden on respondents and may increase data quality. But it is also a big challenge: a lot of preparatory work must be done by NSIs; the data sources should be examined and the metadata defined before linking data. There are many issues and questions that should be analysed and answered in order to create fully integrated data sets for enterprise and trade statistics at the micro level.
If the data include error-free and unique common identifiers, such as a unique identification code of the legal entity or a social security number, record linkage is a simple file merge operation which can be done by any standard database management system. In other cases it is necessary to resort to a combination of ambiguous and error-prone identifiers such as surnames, names, addresses and NACE code information. Data quality problems with such identifiers usually yield a considerable number of unlinkable cases. In this situation the use of much more sophisticated techniques and specialised record linkage software is inevitable.
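To illustrate the simple case, here is a minimal sketch in Python/pandas of such a file merge, assuming two hypothetical input files that share an error-free enterprise identifier column (all file and column names are illustrative, not part of the ESSnet specification):

    import pandas as pd

    # Hypothetical inputs: a survey extract and an administrative extract,
    # both carrying the same error-free unique enterprise identifier.
    survey = pd.read_csv("survey_2013.csv")    # columns: enterprise_id, turnover, ...
    admin = pd.read_csv("admin_vat_2013.csv")  # columns: enterprise_id, vat_turnover, ...

    # With a unique, error-free key the linkage is a plain relational join.
    linked = survey.merge(admin, on="enterprise_id", how="inner")

    # Records that fail to link are worth inspecting rather than discarding.
    unlinked = survey[~survey["enterprise_id"].isin(admin["enterprise_id"])]
    print(len(linked), "linked;", len(unlinked), "survey records without an admin match")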
2. The input data set
The first and main step in data linking is to determine needs and check data availability. The aim is to have all available data of interest in the S-DWH. The proposed scope of the input data set covers the sources described in the following subsections.
2.1 Statistical Business Register and Population frame
In a Data Warehouse system the Statistical Business Register has a crucial role in linking data
from several sources and defining the population for all statistical output.
Member States of the European Union maintain business registers for statistical purposes as a
tool for the preparation and coordination of surveys, as a source of information for the
statistical analysis of the business population and its demography, for the use of
administrative data and for the identification and construction of statistical units.
The SBR contains at least:
- a statistical unit;
- a name and address of the statistical unit;
- an activity code (NACE);
- a starting and a stopping date of the enterprise and of its NACE code of activity.
NSIs use the SBR to derive a population frame and to relate all input data to a reference target population. As proposed in deliverable 2.2 of the ESSnet on Data Warehousing, to link several input data sets in a S-DWH we need to agree on the default target population and on the enterprise unit to which all input data are matched. The default target population is defined as the statistical enterprise units which have been active during the reference year; this target population was proposed because it corresponds with the output requirements of the European regulation. Most statistics use the SBR to derive a population frame, which consists of all units (enterprises) with a certain specific activity. The activity is derived from the NACE code. For example, for annual statistics this means that the default target population consists of all enterprises active during the year, including the starters and stoppers (and the new/stopped units due to merging and splitting companies).
This input source will be called ‘population frame’. The population frame includes the
following information to derive activity status and subpopulations:
1) Frame reference year
2) Statistical enterprise unit, including its national ID and its EGR ID¹
3) Name/address of the enterprises
4) National ID of the enterprises
5) Date in population (mm/yr)
6) Date out of population (mm/yr)
7) NACE code
8) Institutional sector code
9) Size class²

¹ A meaningless ID assigned by the EGR system to enterprises; it is advised to include this ID in the Data Warehouse to enable comparability between the country-specific estimates.
² Could be based on employment data.
The population frame is crucial information to determine the default active population.
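To make the structure concrete, the following is a minimal sketch of how a population frame record could be represented, with hypothetical field names (the list above fixes the content, not the naming; the date format is assumed here to be "mm/yyyy"):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PopulationFrameRecord:
        """One enterprise in the population frame for a reference year."""
        frame_reference_year: int
        national_id: str           # national ID of the enterprise
        egr_id: Optional[str]      # meaningless EGR ID, if available
        name: str
        address: str
        date_in: str               # "mm/yyyy" the unit entered the population
        date_out: Optional[str]    # "mm/yyyy" the unit left; None if still active
        nace_code: str             # activity code
        institutional_sector: str
        size_class: str            # e.g. based on employment data

    def active_in_year(rec: PopulationFrameRecord, year: int) -> bool:
        """Default target population: units active at some point in the year."""
        entered = int(rec.date_in.split("/")[1]) <= year
        left_before = rec.date_out is not None and int(rec.date_out.split("/")[1]) < year
        return entered and not left_before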
2.2 Data sources
One aim of a S-DWH is to create a set of fully integrated data about enterprises. These data may come from different sources such as surveys, administrative data, accounting data and census data. Different data sources cover different populations. Some data sources, like censuses, cover the whole population (of enterprises). Some cover all enterprises with a certain activity, some only big enterprises or other subpopulations. Other survey samples include small enterprises, but provide information about only a few of them. The main issue is to link these input data sources and to ensure that the data are linked to the same enterprise unit and compared with the same target population.
Main data sources:
1. Surveys (censuses, sample surveys)
2. Combined data (survey and administrative data)
3. Administrative data
2.2.1 Surveys
A survey is based on statistical data collection (a statistical questionnaire). A sample survey is more restricted in scope: the data collection is based on a sample, a subset of the total population, as opposed to a complete count of the target population, which is called a census. However, in sample surveys some sub-populations may be investigated completely while most sub-populations are subject to selected samples. Some surveys are based on both statistical data collection and administrative data; these are the so-called combined data.
Surveys as well as administrative data can be used to detect errors in the default target population.
2.2.2 Combined data
Since survey and administrative data sets have their respective advantages, a combination of both sources enhances the potential for research. Furthermore, record linkage has several advantages from a survey-methodological perspective. The administrative data are used to update the frame of active enterprises and to cover and estimate non-surveyed or non-responding enterprises.
The success of the actual linkage depends on the available information to identify a respondent in administrative records and on the quality of these identifiers. Record linkage can be performed using different linkage methods: by means of a unique identifier such as the Social Security Number or another unique common identifier, or on the basis of ambiguous and error-prone identifiers such as name, sex, date of birth, address, etc.
Before the records from both data sources are actually compared, extensive pre-processing needs to be conducted to clean up typographical errors as well as to fill in missing information. These standardisation steps should be done consistently for both the administrative and survey records.
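A minimal sketch of such pre-processing in Python, assuming hypothetical raw name strings; real standardisation rules are language- and register-specific:

    import re
    import unicodedata

    def standardise(value: str) -> str:
        """Normalise a free-text identifier before comparison across sources."""
        # Strip accents, lowercase, trim and collapse whitespace.
        value = unicodedata.normalize("NFKD", value)
        value = value.encode("ascii", "ignore").decode("ascii")
        value = value.lower().strip()
        value = re.sub(r"\s+", " ", value)
        # Remove punctuation that often differs between sources.
        return re.sub(r"[.,;:'\"/-]", "", value)

    # Hypothetical example: the same enterprise written differently in two sources.
    print(standardise("  Ménard & Co., Ltd. "))  # -> "menard & co ltd"
    print(standardise("MENARD & CO LTD"))        # -> "menard & co ltd"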
2.2.3 Administrative data
Administrative data is the set of units and data derived from an administrative source. A
traditional definition of administrative sources is that they are files of data collected by
government bodies for the purposes of administering taxes or benefits, or monitoring
populations. This narrow definition is gradually becoming less relevant as functions
previously carried out by the government sector are, in many countries, being transferred
partly or wholly to the private sector, and the availability of good quality private sector data
sources is increasing.
In many NSIs administrative data are used to update SBR data (in Lithuania these are data from the State Social Insurance Fund Board (SODRA) and VAT data from the State Tax Inspectorate), to determine the activity status of the statistical unit (annual VAT and employment datasets), to update the final survey frame, to cover non-surveyed and non-responding enterprises, and to estimate missing data.
2.3 Defining metadata
The term metadata is very broad, and a distinction is made between “structural” metadata, which define the structure of statistical data sets and metadata sets, and “reference” metadata, which describe the actual data contents, for instance the concepts and methodologies used, the unit of measure, the data quality (e.g. accuracy and timeliness) and the production and dissemination process (e.g. contact points, release policy, dissemination formats). Reference metadata may refer to specific statistical data, to entire data collections or even to the institution that provides the data.
NSIs need to define metadata before linking sources. What kind of reference metadata needs to be submitted? In Eurostat this information is presented in files based on a standardised format called ESMS (Euro SDMX Metadata Structure) (Figure 3). ESMS metadata files are used for describing the statistics released by Eurostat; they aim at documenting the methodologies, quality and statistical production processes in general.
Figure 3. ESMS (Euro SDMX Metadata Structure). It uses 21 high-level concepts, with a limited breakdown of
sub-items, strictly derived from the list of cross domain concepts in the SDMX Content Oriented Guidelines
(2009).
2.4 The statistical unit base
The unit base is closely related to the SBR. However, as its contents are also closely related to the available input data, it is considered a separate input source. This unit base describes the relationship between the different units and the statistical enterprise unit. It summarises the results of the data linking process described before. Figures 4 and 5 illustrate the contents of the unit bases of the Netherlands and Lithuania respectively.
Note that the exact contents of the unit base depend on:
- legislation per country;
- output requirements and desired output of a S-DWH;
- available input data.
It should also be mentioned that the unit base might be updated when other input data (with different units) become available.
Figure 4. Example of a unit base of the Netherlands
Figure 5. Example of a unit base of Lithuania
The unit base is essential for the data linking process. Established links between the data are needed to make the process of data integration fluid, accurate and quality assured.
3. The quality of data
Caring about data quality is the key to safeguarding and improving it. Discovering whether
data are of acceptable quality is a measurement task, and not a very easy one. There are
several criteria for defining data quality:
o relevance
o accuracy
o timeliness
o accessibility and clarity of results
o comparability
o coherence
o completeness
Data quality problems can produce unrealistic or noticeably strange answers in statistical
analysis and estimations.
4. Data linking process
As was mentioned above, if we have a unique identifier the data linking process is a simple file merge operation.
The Statistical Business Register (SBR) is the main tool of data collection for economic statistics in the Department of Statistics of Lithuania. All statistical surveys are conducted on the basis of the SBR. Some surveys are conducted without using administrative data, and some are based on both statistical data collection (statistical questionnaires) and administrative data. In Lithuania we have a unique identifier, the company ID number, so no technical problems arise in the data linking process. Other NSIs may face a different situation: if there is no such identifier, sophisticated techniques are used to link data from different sources. The result of this complex situation is summarised in the unit base, which will not be described in any further detail here.
Data linkage methods usually fall across a spectrum between deterministic and probabilistic methods. A combination of linkage methods may be used, but the choice of method depends on the types and quality of the linkage variables available on the data sets to be linked.
4.1 Deterministic linkage
Deterministic linkage ranges from the simple joining of two or more datasets by a reliable and stable identifier to sophisticated stepwise algorithmic linkage. Simple deterministic linkage uses a single identifier. A high degree of certainty is required in deterministic linkage. This high degree can be obtained if there is a unique entity identifier which uniquely identifies an individual or legal unit across the data sets. A unique identification number, such as a company ID number or Social Security number, may exist, allowing records to be matched across multiple data sets.
Deterministic linkage is based on exact matches. Variables used in deterministic linkage need to be accurate, robust, stable over time and complete, for example: first name, last name (for males), sex, date of birth, etc. Alternatively, a combination of attributes, including last name, first name, sex and date of birth, can be used to create a linkage key which is then used to match records that have the same linkage key value. This linkage key is known as a statistical linkage key (SLK).
A more sophisticated form of deterministic linkage is stepwise deterministic record linkage,
which has been developed in response to variations that often exist in the attributes that are
used in creating the SLKs for deterministic linkage. Stepwise deterministic record linkage
uses auxiliary information on the data sets to provide a platform from which variation in the
reported linkage key information can be considered. This differs from simple deterministic
linkage that relies on an exact, one-to-one character matching of linkage keys across two or
more data sets.
Another variant of deterministic linkage is “rules-based linkage” where a set of rules can be
used to classify pairs of records as matches or non-matches. Such rules can be more flexible
than using a linkage key, but their development is labour intensive and highly dependent on
the data sets to be linked.
Statistical linkage keys
The SLK is a code that replaces, for example, the person's first name and last name, to protect the person's identity. Generally most SLKs for personal statistics are constructed from last name, first name, sex and full date of birth. SLKs protect privacy and data confidentiality because they serve as an alternative to a person's name and date of birth being on the data sets to be linked. For example, a commonly used SLK is the SLK 581. It consists of the concatenation of the 2nd, 3rd and 5th letters of the family name, the 2nd and 3rd letters of the given name, the date of birth as a character string of the form "ddmmyyyy", followed by the character "1" for male or "2" for female. It comprises five characters from the person's first and last name, eight characters from the date of birth and one representing the person's sex.
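A minimal sketch of SLK 581 construction in Python, following the recipe above; note that the rule for names too short to supply the required letters varies between implementations, and the filler character '2' used here is one common convention, assumed rather than taken from this document:

    from datetime import date

    def slk_581(last_name: str, first_name: str, birth: date, sex: str) -> str:
        """Build an SLK 581: five name letters + "ddmmyyyy" + sex code."""
        def pick(name: str, positions: list[int]) -> str:
            letters = [c for c in name.upper() if c.isalpha()]
            # Assumed convention: missing letters are replaced with '2'.
            return "".join(letters[p - 1] if p <= len(letters) else "2"
                           for p in positions)

        key = pick(last_name, [2, 3, 5]) + pick(first_name, [2, 3])
        key += birth.strftime("%d%m%Y")
        key += "1" if sex.upper().startswith("M") else "2"
        return key

    # Hypothetical record: John Smith, born 1 February 1980, male.
    print(slk_581("Smith", "John", date(1980, 2, 1), "M"))  # -> "MIHOH010219801"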
Data linkage using an SLK is commonly deterministic, but this requires the variables used in constructing the SLK to be accurate, complete and as exact as possible. Difficulties with SLK construction may be experienced where variations of names are used in different databases (e.g. Smith/Smythe, Thompson/Thomson). There are two kinds of errors associated with SLKs. Firstly, there may be incomplete or missing data items on an individual's record, which means that the SLK will be incomplete. Secondly, errors in the source data may lead to the generation of multiple SLKs for the same individual, or multiple individuals may share the same SLK. Deterministic linkage and linkage with SLKs can be combined with probabilistic linkage, which requires less exacting standards of accuracy, stability and completeness.
4.2 Probabilistic linkage
Probabilistic linkage may be undertaken where there are no unique entity identifiers or SLKs, or where the linkage variables and/or entity identifiers are not as accurate, stable or complete as required for the deterministic method.
In such cases, matching and linking will depend on achieving the closest approach to unique identification by using several identifying variables. Each of these variables is only a partial identifier but, in combination, they provide a match that is sufficiently accurate for the intended purpose of linking data sets. Probabilistic linkage has a greater capacity to link records with errors in their linking variables. In the traditional probabilistic linkage approach, pairs of records are classified as matches if their common attributes predominantly agree, or as non-matches if they predominantly disagree.
Probabilistic methodologies can lead to a much better linkage of records from separate data collections than simple deterministic methodologies for statistical linkage purposes. There are 2^n possible match/non-match configurations of n linkage fields. This method of record linkage uses the M- and U-probabilities: summing the weights for agreement and disagreement over the range of linkage variables gives a comparison weight for each record pair.
M-probability (match probability)
The probability that a field agrees given that the pair of records is a true match. For any given field the same M-probability applies to all records. For example, an M-probability of 0.95 for surnames means that the probability that two records belonging to the same person agree on last name is 0.95. The reasons for the 5% disagreement are data errors, missing data and instability of the value (e.g. surname change, misspelling).
U-probability (non-match probability)
The probability that a field agrees given that the pair of records is not a true match. It is often simplified as the chance that two records will agree at random. It is defined for a specific value and will often have multiple values for each field. Typically, U-probabilities are estimated as the proportion of records with a specific value, based on the frequencies in the primary or the more comprehensive and accurate data source.
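As a sketch of how M- and U-probabilities turn into linkage weights, the following uses the standard Fellegi-Sunter formulation, where a field contributes log2(M/U) on agreement and log2((1-M)/(1-U)) on disagreement; the field list and probability values are illustrative assumptions:

    import math

    # Illustrative M/U probabilities per linkage field (assumed values).
    FIELDS = {
        # field:      (M = P(agree | match), U = P(agree | non-match))
        "surname":    (0.95, 0.01),
        "first_name": (0.90, 0.05),
        "birth_date": (0.98, 0.003),
    }

    def field_weight(field: str, agrees: bool) -> float:
        """log2(M/U) if the field agrees, log2((1-M)/(1-U)) if it disagrees."""
        m, u = FIELDS[field]
        return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

    def pair_weight(comparisons: dict[str, bool]) -> float:
        """Record pair comparison weight: the sum of the field weights."""
        return sum(field_weight(f, a) for f, a in comparisons.items())

    # Hypothetical pair: surname and birth date agree, first name disagrees.
    print(round(pair_weight({"surname": True, "first_name": False, "birth_date": True}), 2))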
4.3 The data linkage process
Data linkage can be project-based (ad hoc) or systematic. Systematic data linkage involves the maintenance of a permanent and continuously updated master linkage file and a master linkage key. The data linkage process may vary, depending on the linkage model and the linkage method. There are, however, four steps that are common to both data linkage models: data cleaning and standardisation, blocking and searching, record pair or record group comparison, and a decision model. While data cleaning and standardisation are common to both deterministic and probabilistic linkage, the other three processes are more relevant to the probabilistic method. (A combined sketch of the last three steps follows this list.)
- Data cleaning and standardisation
This process identifies and removes the errors and inconsistencies in the data, and analyses the text fields so that the data items in each data file are comparable.
- Blocking
Data linkage often involves large data sets. When two data sets are linked, the number of possible comparisons equals the product of the numbers of records in the two data sets. Blocking reduces the number of comparisons needed by only comparing record pairs where links are more likely to be found. Blocking involves selecting sets of blocking attributes, such as sex, date of birth, last name or components of first and last name, and only comparing records with the same attributes.
- Record pair comparison
During the comparison stage, record pairs are compared on each linkage field, and the level of agreement is measured. Field comparison weights are assigned to each linkage field for each record pair. The field comparison weights are then summed over the linkage variables to form a record pair comparison weight.
- Decision model
Record pair comparison weights help data linkers decide whether a record pair belongs to the same entity. This decision can be based on a single cut-off weight or on a set of lower and upper cut-off weights. Under the single cut-off weight approach, all record pairs with a comparison weight equal to or above the cut-off weight are assigned as links and all those below the cut-off weight are assigned as non-links. Under the lower and upper cut-off weights approach, all record pairs with a comparison weight at or above the upper cut-off are assigned as links and those with weights below the lower cut-off are assigned as non-links. Record pairs with comparison weights between the lower and upper cut-offs are assigned as possible links and designated for clerical review. In clerical review, data linkers manually inspect all the variables available for the record pairs whose link status cannot be automatically determined, and then decide whether the record pairs belong to the same entity.
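A compact sketch tying the last three steps together, in the same style as the weight example of section 4.2; all records, field weights and cut-off values below are illustrative assumptions:

    from collections import defaultdict

    # Hypothetical, already standardised records: (id, last_name, first_name, birth_year).
    file_a = [(1, "smith", "john", 1980), (2, "jones", "mary", 1975)]
    file_b = [(10, "smith", "jon", 1980), (11, "brown", "mary", 1975)]

    # Blocking: only compare record pairs sharing the blocking attribute (birth year).
    blocks = defaultdict(lambda: ([], []))
    for rec in file_a:
        blocks[rec[3]][0].append(rec)
    for rec in file_b:
        blocks[rec[3]][1].append(rec)

    # Record pair comparison: sum illustrative field agreement/disagreement weights.
    def pair_weight(a, b) -> float:
        weight = 6.6 if a[1] == b[1] else -3.3   # last name (assumed weights)
        weight += 4.2 if a[2] == b[2] else -2.1  # first name
        return weight

    # Decision model: lower and upper cut-offs with a clerical-review band between them.
    LOWER, UPPER = 0.0, 5.0
    for year, (recs_a, recs_b) in blocks.items():
        for a in recs_a:
            for b in recs_b:
                w = pair_weight(a, b)
                status = ("link" if w >= UPPER
                          else "non-link" if w < LOWER
                          else "possible link (clerical review)")
                print(a[0], b[0], round(w, 1), status)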
4.4 Determinants of linkage quality
Key determinants of linkage quality include:
- the quality of SLKs (in the case of deterministic linkage);
- the quality of blocking and linkage variables (in the case of probabilistic linkage).
When the linkage method is deterministic, and SLKs are used to compare the records, then
the quality of the linkage will depend on the quality of the variables used in constructing the
SLKs. In addition to having accurate, reliable and consistently reported blocking and linking
variables, the quality of data linkage also depends on the blocking and linking strategy
adopted.
Blocking and linking variables are used in constructing SLKs (deterministic linkage) or in linking the data sets (probabilistic linkage). Poor quality, i.e. blocking or linking variables that are missing, inaccurate, incomplete or inconsistently reported on the data sets to be linked, could lead to some records not being linked or to some records being linked to the wrong records.
4.5 Measures of quality of data linkage
Measures that may be used to assess data linkage quality include accuracy, sensitivity, specificity, precision and the false-positive rate. These measures are based on basic quantities that are simply defined from the results of data linkage (Figure 6).
Figure 6. Classification of matches and links
Not all of these measures are easily calculated, because their calculation depends on knowing the number of true non-matches or true negatives (TN). These are often unknowable or difficult to calculate.
The most widely used quality measures are:
- sensitivity, or the true positive rate: the proportion of matches that are correctly classified as matches. It may be defined as the proportion of all records in a file with a match in another file that are correctly accepted as links (true links). This measure is calculated as TP/(TP+FN).
- precision, or the positive predictive value: the proportion of all classified links that are true links (true positives). This measure is calculated as TP/(TP+FP).
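For instance, a minimal sketch computing these two measures from linkage results, assuming the true match status is known for an evaluation sample (the counts are made up):

    # Evaluation counts from a hypothetical clerically reviewed sample.
    TP = 940  # true links: classified as links and truly matches
    FP = 25   # false links: classified as links but not matches
    FN = 60   # missed links: true matches not classified as links

    sensitivity = TP / (TP + FN)  # true positive rate
    precision = TP / (TP + FP)    # positive predictive value

    print(f"sensitivity = {sensitivity:.3f}")  # 0.940
    print(f"precision   = {precision:.3f}")    # 0.974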
5. Methodological issues
When we link data from different sources, such as sample surveys, combined data and administrative data, we can meet problems such as missing data, overlapping data, "unlinked data", etc. Errors might be detected in statistical units and the target population when linking other data to this information. If these errors are influential, they need to be corrected in the S-DWH.
There are three kinds of missing data: structural, temporal and errors. Temporal missing data are data that are not available at this time but will be provided and incorporated within a short period. Errors are incorrect values. A frequently expected error will be the NACE code (activity).
One of the problems is conflict between sources. A data conflict occurs when two or more sources contain data for the same variable (with different values). In many cases, when two (or more) reliable sources conflict, one (or more) of those sources can be demonstrated to be unreliable. The main goal is to define a data source priority for each indicator, together with rules for determining the quality of the priority data source. It needs to be defined which data source is more reliable for which indicator, as judged by different criteria. In some cases additional analysis may be needed, and more sophisticated methods, or even manual techniques, could be used.
In Lithuania we use a data source priority for each indicator in the SBR, and once priority sources are determined they are not changed. To determine the priority source, priority rules need to be defined, such as the quality of the data source, its completeness and update time and, of course, consultation with experts.
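A minimal sketch of such priority-based conflict resolution, assuming a hypothetical priority table per indicator; in practice the ordering would follow the quality rules described above:

    # Hypothetical source priorities per indicator: the first listed source wins.
    PRIORITY = {
        "turnover":   ["vat_register", "survey", "accounting_data"],
        "employment": ["social_insurance", "survey"],
    }

    def resolve(indicator: str, values_by_source: dict) -> float:
        """Pick the value from the highest-priority source that reported one."""
        for source in PRIORITY[indicator]:
            value = values_by_source.get(source)
            if value is not None:
                return value
        raise KeyError(f"no source available for indicator {indicator!r}")

    # Conflicting turnover values for one enterprise from two sources.
    print(resolve("turnover", {"survey": 1250.0, "vat_register": 1190.0}))  # -> 1190.0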
When there is no unique identifier, we use more sophisticated methods for matching and linking on several identifiers. This may leave some data "unlinked". Poor quality of the selected linkage variables or of the probabilistic methods can lead to some records not being linked, or being linked to the wrong records; some records cannot be linked at all because of missing, incomplete or inaccurate variables.
6. Link with other ESSnet
In the document “Mapping the coverage of ESSnet projects relevant to workpackage 2 of the ESSnet on data warehousing” by Ria Sanderson, potential areas of overlap, or potential gaps, between the deliverables of Work Package 2 (WP2) of the ESSnet on Data Warehousing (hereafter, ESSnet DWH) and other projects were identified. It was also identified which of the completed and ongoing ESSnet projects are of direct relevance to WP2. The visual depiction of the links with other ESSnets is very useful for finding all information related to data linking within ESSnet projects, whether they are finished or not.
The figure shows a visualisation of the links between the deliverables of the ESSnet DWH and work packages in the other relevant ESSnet projects. The links identified as “methods” links are shown as solid black lines, and the links identified as “background” links are shown as grey lines. Each of the ESSnet projects considered is represented by a different coloured circle, as shown in the legend; the projects are the ESSnets DWH, admin data (AD), data integration (DI), common reference architecture (CORA), statistical disclosure control (SDC), statistical disclosure control harmonisation (SDC-H) and the Blue-ETS project. In the case of the ESSnet DWH, the circles represent the different deliverables, and the deliverable number is shown in each circle. For the other ESSnet projects, the numbers in the circles represent the individual work packages where links have been made.
The figure highlights the importance of the results of the data integration for deliverable 2.4
of the ESSnet DWH. There are many links that have been classified as “background” links,
meaning that a lot can be learned for this deliverable from work that is currently taking place
elsewhere, particularly in the ESSnet on Admin Data.
“Methods” links:
1. DI WP1 State of the art in data integration – provides methods for data integration in a variety of circumstances (the final report is available at http://www.cros-portal.eu/sites/default/files//ESSnetDI_WP1_v1.32.pdf). The document “introduces a formal procedure to find records belonging to the same unit, being from either different or the same source, that is, database. These procedures are based on probabilistic instead of deterministic criteria and rely on the equivalence of values when comparing those from two different records on a field-by-field basis; and then, on the probability of agreement between values given the true (and unknown) status of the pair of records, that is, whether they actually do belong to the same entity or not. Both standard and alternative approaches for probabilistic record linkage are discussed.”
2. DI WP2 Development of methods – provides methods for establishing linkage error, performing inference, and maintaining micro-consistency and macro-consistency (the final report is available at http://www.cros-portal.eu/sites/default/files//WP2.pdf). The document presents methodological developments classified according to three areas of integration activities, connected with different phases in the statistical processing of data from multiple sources. Record linkage is an obvious first stage when combining data from different sources on the same units; when no unique error-free identifiers are available, the problem of linkage errors has to be dealt with at this stage. Inference with multiple sources covers estimation problems where multiple sources are available but the sets of units in the different sources are not or only partially overlapping. Micro and macro consistency tackles the problem that information from different sources can be inconsistent both at the micro level, that is within linked records, and at the macro level, that is between estimates of aggregates.
3. DI WP3 Development of common software tools – provides software for data linking in R (the software is available at https://joinup.ec.europa.eu/software/relais/description and http://www.cros-portal.eu/content/wp3-development-common-software-tools). The project presents the improvements that have been made to the RELAIS (REcord Linkage At IStat) software for data linking, as well as the software itself. RELAIS is a toolkit providing a set of techniques for dealing with record linkage projects.
“Background” links:
4. AD WP2b Checklist for quality of admin data. Aims to provide a checklist for assessing
the quality of an admin data source
http://essnet.admindata.eu/WorkPackage?objectId=4253
5. AD WP3 Methods of estimation for variables. Aims to provide methods of estimation for variables where admin data sources are not available.
http://essnet.admindata.eu/WorkPackage?objectId=4254
6. AD WP4 Timeliness of administrative data. Aims to provide methods of estimation when
admin data is not available in time
http://essnet.admindata.eu/WorkPackage?objectId=4255
7. AD WP6 Development of quality indicators. Aims to develop quality indicators for
assessing outputs based on survey and admin sources
http://essnet.admindata.eu/WorkPackage?objectId=4257
8. AD WP7 Statistics and accounting standards. Aims to align business statistics definitions
with accounting characteristics http://essnet.admindata.eu/WorkPackage?objectId=4258
7. Conclusions
Data linking is quite a challenging process. For successful data linking, firstly, the populations of the different data sources should be well defined. Secondly, the input sources should be of high quality; frequently they are not, due to errors, variations and missing data. Another challenge is data dynamics: data may regularly and routinely change over time. The third and main condition for successful data linking is the availability of a unique identifier.
If there is no unique identifier, we need to use more sophisticated methods: deterministic or probabilistic linking. Deterministic methods work best with high-quality data, while probabilistic linkage methods can lead to much better linkage than simple deterministic methods when there are missing data and no unique identifier.
The results of data linking are summarized in a unit base which is part of a statistical data
warehouse.
The quality of data linkage should also be evaluated; this depends on the presence of a unique ID and on the accuracy, precision and false-positive rates of the linkage.
Next steps in data integration are dealing with errors, missing data and conflicting data.
8. References
1. Manfred Antoni, Institute for Employment Research (IAB), "Linking survey data with administrative employment data: The case of the German ALWA survey", http://www.cros-portal.eu/sites/default/files//NTTS2013fullPaper_89.pdf.
2. Pieter Vlag, Statistics Netherlands, "Guidelines (including options) on how the BR interacts with the S-DWH".
3. Ria Sanderson, "Mapping the coverage of ESSnet projects relevant to Work Package 2 of the ESSnet on Data Warehousing".
4. Bishop, G.R. and Khoo, J. (2007), "Research Paper: Methodology of Evaluating the Quality of Probabilistic Linking", Australian Bureau of Statistics.
5. "National Best Practice Guidelines for Indigenous Data Linkage Activities Relating to Aboriginal and Torres Strait Islander People" (2012).