Report of WP1 of the CENEX on Statistical Methodology

ESSnet Statistical Methodology Project on
Integration of Survey and Administrative Data
Report of WP1. State of the art on statistical methodologies for
integration of surveys and administrative data
LIST OF CONTENTS
Preface (Mauro Scanu – ISTAT)
1. Literature review on probabilistic record linkage
1.1. Statement of the record linkage problem (Marco Fortini – ISTAT)
1.2. The probabilistic record linkage workflow (Nicoletta Cibella, Mauro Scanu, Tiziana Tuoto)
1.3. Notation and difficulties for probabilistic record linkage (Marco Fortini – ISTAT)
1.4. Decision rules and procedures (Miguel Guigo – INE)
1.5. Estimation of the distribution of matches and nonmatches (Mauro Scanu – ISTAT)
1.6. Blocking procedures (Gervasio Fernandez – INE)
1.7. Quality assessments (Nicoletta Cibella, Tiziana Tuoto – ISTAT)
1.8. Analysis of files obtained by record linkage (Miguel Guigo – INE)
2. Literature review on statistical matching
2.1. Statement of the problem of statistical matching (Marcello D’Orazio – ISTAT)
2.2. The statistical matching workflow (Marcello D’Orazio, Marco Di Zio, Mauro Scanu – ISTAT)
2.3. Statistical matching methods (Marco Di Zio – ISTAT)
2.4. Uncertainty in statistical matching (Mauro Scanu – ISTAT)
2.5. Evaluation of the accuracy of statistical matching (Marcello D’Orazio – ISTAT)
3. Literature review on micro integration processing
3.1. Micro integration processing (Eric Schulte Nordholt – CBS)
3.2. Combining data sources: micro linkage and micro integration (Eric Schulte Nordholt, Frank Linder – CBS)
3.3. Key reference on micro integration (Miguel Guigo – INE; Paul Knottnerus, Eric Schulte Nordholt – CBS)
3.4. Other references
4. Practical experiences
4.1. Record linkage of administrative and survey data for the EU-SILC survey: the Italian experience (Paolo Consolini – ISTAT)
4.2. Record linkage applied for the production of business demography data (Caterina Viviano – ISTAT)
4.3. Combination of administrative, register and survey data for Structural Business Statistics (SBS) – the Austrian concept (Gerlinde Dinges – STAT)
4.4. Record linkage applied for the computer assisted maintenance of the Business Register: the Austrian experience (Alois Haslinger – STAT)
4.5. The use of data from population registers in the 2001 Population and Housing Census: the Spanish experience (INE Spain)
4.6. Administrative data source (DBP) for population statistics based on the ISEO register in the Czech Republic (Jaroslav Kraus – CZSO)
4.7. An experiment of statistical matching between the Labour Force Survey (RFL) and the Time Use Survey (TUS) (Gianni Corsetti – Isfol)
5. Results of the survey on the use and/or development of integration methodologies in the different ESS countries
5.1. Introduction (Mauro Scanu – ISTAT)
5.2. Objective of the integration process and characteristics of the file to integrate (Luis Esteban Barbado – INE Spain)
5.3. Privacy issues (Eric Schulte Nordholt – CBS)
5.4. Problems of the integration process and methods (Nicoletta Cibella, Tiziana Tuoto – ISTAT)
5.5. Software issues (Ondrej Vozár – CZSO)
5.6. Documentation on the integration process (Nicoletta Cibella, Tiziana Tuoto – ISTAT)
5.7. Possible changes (Alois Haslinger – STAT)
5.8. Possibility to establish links between experts (Alois Haslinger – STAT)
Annex. Survey on the use and/or development of integration methodologies in the different ESS countries
Preface
(Mauro Scanu - ISTAT)
This document is the deliverable of the first work package of the Centre of Excellence on
Statistical Methodology, Area Integration of Surveys and Administrative Data (CENEX-ISAD, consisting of the NSIs of Austria, the Czech Republic, Italy, the Netherlands and
Spain). The objective of this document is to provide a complete and updated overview of the
state of the art of the methodologies regarding integration of different data sources. The
different NSIs (within the ESS) can refer to this unique document if they need to:
1) define a problem of integration of different sources according to the characteristics of the
data sets to integrate;
2) discover the different solutions available in the statistical literature;
3) understand which problems still need to be tackled, and motivate the research on these
issues;
4) look at the characteristics of many different projects that needed the integration of different
data sources.
This document consists of five chapters that can be broadly clustered in two groups.
The first three chapters are mainly methodological. They describe the state of the art
respectively for i) probabilistic record linkage, ii) statistical matching, and iii) micro
integration processing. Each chapter is essentially a collection of references: this part of the document is intended as a tool for orientation among the large number of papers on the different integration methodologies. This should not be considered a secondary issue in the production of official statistics. The main problem is that methodologies for the integration of different sources are, in most cases, still in their infancy, while the current information needs of official statistics require an increasingly sophisticated use of multiple sources for the production of statistics. Whoever is in charge of a project on the integration of different sources must be aware of all the available alternatives and should be able to justify the chosen method.
The last two chapters are an overview of integration experiences in the ESS. Chapter 4
collects detailed information on many different projects that need a joint use of two or more
sources in the participating NSIs of this CENEX. Chapter 5 illustrates the results of a survey
on the use and/or development of integration methodologies in the ESS countries. These chapters illustrate the many information needs that cannot be met by a single source of information, as well as the specific problems that must be addressed in each integration process.
1. Literature review on probabilistic record linkage
1.1. Statement of the problem of record linkage
Marco Fortini (ISTAT)
Record linkage consists of identifying the pairs of records, coming either from the same data file or from different files, that belong to the same entity, on the basis of the agreement between common indicators.
The figure, taken from Fortini et al. (2006), shows the record linkage of two data sets A and B. Links aim at connecting records belonging to the same unit by comparing some indicators (name, address, telephone). Agreement may not be perfect (as for the telephone number of the first record of the left data set and the third record of the right data set), even though the records still belong to the same unit.
A classical use of linked data in the statistical research context is the study of the relationships
between variables collected on the same individuals but coming from different sources. Other
important applications entail the removal of duplicates from data sets and the development
and management of registers. Record linkage is also a pervasive technique in business contexts, where it concerns information systems for customer relationship management and marketing. Recently, public institutions have also shown an increasing interest in e-government applications.
Regardless of the purpose of record linkage, the same logic applies in the extreme cases: when a pair of records disagrees completely on the key variables it is almost certainly composed of two different entities; conversely, a perfect agreement indicates an almost certain match. All intermediate cases, where a partial agreement between two different units may occur by chance or a partial disagreement between two records belonging to the same entity may be caused by errors in the comparison variables, have to be properly resolved according to the particular approach adopted.
A distinction between a deterministic and probabilistic approach is often made in the
literature, where the former is associated with the use of formal decision rules while the latter
makes an explicit use of probabilities for deciding when a given pair of records is actually a
match. The existence of a large number of different approaches, mainly defined in computer
science, that make use of techniques based on similarity metrics, data mining, machine
learning, etc., without defining explicitly any substantive probabilistic model, makes the
previous distinction more subtle. In the present review only the strictly probabilistic approaches are discussed, given that they naturally lend themselves to the essential task of evaluating matching errors; Gu et al. (2003) is referenced as a first attempt at an integrated view of recent developments across all the major approaches.
Bibliography
Fortini, M., Scannapieco, M., Tosco, L., and Tuoto, T., 2006. Towards an Open Source
Toolkit for Building Record Linkage Workflows. Proceedings SIGMOD 2006 Workshop on
Information Quality in Information Systems (IQIS’06), Chicago, USA, 2006.
Gu, L., Baxter, R., Vickers, D., and Rainsford C., 2003. Record linkage: Current practice and
future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences,
Canberra, Australia, April 2003. (http://citeseer.ist.psu.edu/585659.html)
1.2. The probabilistic record linkage workflow
Nicoletta Cibella, Mauro Scanu, Tiziana Tuoto (ISTAT)
Probabilistic record linkage is a complex procedure that is composed of different steps. A
workflow (adapted from the record linkage manual of Statistics New Zealand) is the following.
This document reviews in detail the papers on the different steps of the workflow.
a. Start from the (already harmonized) data sets A and B (see WP2, Section 1, for
more details).
b. Usually the overall set of pairs of records from the data sets A and B is too large,
and this causes computational and statistical problems. There are different
procedures to deal with this problem, which are listed in Section 1.6.
c. A necessary step for probabilistic record linkage is to consider how the variables
used for matching pairs remain stable from one data set to another. This
information is seldom available, but can be estimated from the data sets at hand
(Section 1.5).
d. Consider all the pairs of records in the search space created by the procedure in step b. Apply a decision rule to each pair of records (possible outcomes: link, possible link, no link). This is described in Section 1.4.
e. Link the two data sets according to the results of the previous step.
f. Evaluate the quality of the results (Section 1.7).
g. Analyse the resulting linked data set, bearing in mind that this file can contain matching errors (Section 1.8).
In the following, all the previous steps will be analyzed starting from the core of the
probabilistic record linkage problem (Section 1.3), i.e. the definition of the model that
generates the observed data, the optimal decision procedure, according to the Fellegi and
Sunter theory (Section 1.4), the estimation of the necessary parameters for the application of
the decision procedure (Section 1.5). After reviewing these aspects, the procedures for
reducing the search space of the pairs of records will be illustrated (Section 1.6). Appropriate
methods for the evaluation of the quality of the probabilistic record linkage are outlined
(Section 1.7). Finally, the problem of analysing data sets obtained by means of a record linkage procedure is discussed (Section 1.8).
1.3. Notation and technicalities for probabilistic record linkage
Marco Fortini (ISTAT)
The early contribution to modern record linkage dates back to Newcombe et al. (1959) in the
field of health studies, followed by Fellegi and Sunter (1969) where a more general and
formal definition of the problem is given. Following the latter approach, let A and B be two
partially overlapping files consisting of the same type of entities (individuals, households,
firms, etc.), respectively of size nA and nB. Let Ω be the set of all possible pairs of records coming from A and B, i.e. Ω = {(a,b): a ∈ A, b ∈ B}. Suppose also that the two files consist of vectors of variables (XA, YA) and (XB, ZB), either quantitative or qualitative, and that XA and XB are sub-vectors of k common identifiers, called key variables in what follows, so that any single unit is univocally identified by an observation x. Moreover, let γab designate the vector of indicator variables regarding the pair (a,b), so that γjab = 1 in the j-th position if xaA,j = xbB,j and 0 otherwise, j = 1,…,k. The indicators γjab will be called comparison variables.
Given the definitions above, we can formally represent record linkage as the problem of assigning the pair (a,b) to one of the two subsets M or U, which identify the matched and the unmatched sets of pairs respectively, given the state of the vector γab.
Probabilistic methods of record linkage generally assume that the observations are independent and identically distributed draws from appropriate probability distributions. Following Fellegi and Sunter (1969), there is first a dichotomous random variable that assigns each pair of records (a,b) to the matched pairs (set M) or to the unmatched ones (set U). This variable is latent (unobserved), and it is actually the target of the record linkage process. Secondly, the
comparison variables γab follow distinct distributions according to the pair status. Let m(γab)
be the distribution of the comparison variables given that the pair (a,b) is a matched pair, i.e.
(a,b) ∈ M, and u(γab) be the distribution of the comparison variables given that the pair (a,b) is an unmatched pair, i.e. (a,b) ∈ U. These distributions will be crucial for deciding the status of the record pairs.
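To make the notation concrete, the following sketch (Python) builds the comparison vector γab for every pair in Ω using exact agreement on each key variable; the records, key variables and function names are invented for illustration and are not taken from any cited toolkit.

```python
# Sketch: building comparison vectors gamma_ab from two small files A and B.
# Records and key variables here are invented for illustration.
from itertools import product

KEY_VARIABLES = ["name", "surname", "year_of_birth"]

file_a = [
    {"id": "a1", "name": "anna",  "surname": "rossi",   "year_of_birth": 1970},
    {"id": "a2", "name": "marco", "surname": "bianchi", "year_of_birth": 1985},
]
file_b = [
    {"id": "b1", "name": "anna",  "surname": "rossi",   "year_of_birth": 1971},
    {"id": "b2", "name": "paolo", "surname": "verdi",   "year_of_birth": 1962},
]

def comparison_vector(rec_a, rec_b, keys=KEY_VARIABLES):
    """Return gamma_ab: 1 where a key variable agrees exactly, 0 otherwise."""
    return tuple(int(rec_a[k] == rec_b[k]) for k in keys)

# Omega is the set of all pairs (a, b) with a in A and b in B.
for rec_a, rec_b in product(file_a, file_b):
    print(rec_a["id"], rec_b["id"], comparison_vector(rec_a, rec_b))
```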
Bibliography
Fellegi, I. P., and A. B. Sunter, 1969. A theory for record linkage. Journal of the American
Statistical Association, Volume 64, pp. 1183-1210.
Newcombe, H., Kennedy, J., Axford, S. and James, A., 1959. Automatic Linkage of Vital
Records, Science, Volume 130, pp. 954–959.
1.4. Decision rules and procedures
Miguel Guigo (INE)
1.4.1. General statement of the statistical problem
The key task of the whole record linkage process is to determine whether a pair of records
belongs to the same entity or not; hence, the quality of the whole result achieved by the
linkage procedure relies on the quality of the tool applied to make this choice, that is, the
decision rule.
From a statistical point of view, following De Groot (1970), a decision problem consists of an
experiment the actual outcome of which is unknown, and the consequences of which will
depend on that outcome and a decision taken by the statistician. Specifically, let D be the
space of all possible decisions d which might be made, let Ω be the space of all possible outcomes ω of the experiment, and let R be the space of all possible rewards or results r = r(ω, d) of the statistician's decision d and the outcome ω of the experiment. In most cases, r is actually a loss function.
We also assume that there exists a probability distribution P on the space Ω of outcomes whose value P(ω) is specified for each event ω. Then, the statistician must choose an optimal non-deterministic behaviour in an incompletely known situation. A way to face this is to minimize the expectation of the total loss, in which case the decision rule is optimal (Wald, 1950); but the statistician must also face a problem with respect to the probability distribution P, which is known to belong to a family of probability distributions, but some of whose parameters are unknown; by making some observations of the phenomenon and processing the data, the statistician has to make a decision on P. Therefore, a statistical decision rule is a transition probability distribution from a space of outcomes Ω into a space of decisions D.¹
¹ For a more formal definition of a statistical decision rule, see Chentsov (1982), p. 65.
In the case of a record linkage procedure, the space of actual outcomes consists of a real match or a real non-match for every pair of records belonging to Ω = {(a,b): a ∈ A, b ∈ B}, and the space D of all possible decisions consists of assigning or not assigning the pair as a link.
In this context, the decision rule can be seen as a two-step process, where the first stage is to organize the set of agreements between common identifiers for each pair of records (a,b) in an array γab. This amounts to a mapping from Ω onto Γ, where Γ is known as the space of comparisons. A function that returns a numerical comparison value for γjab, multiplied by a weight wj, gives a basic score on the level of coincidence for the j-th key variable, which sets the contribution of every common identifier. Procedures for measuring agreement between records (a,b) will then result in a composite weight of their closeness. Comparison functions can be more or less arbitrary, based on distance, similarity, or linear regression models, amongst others. For a more complete list of comparators, see Yancey (2004b).
Newcombe et al. (1959) and Fellegi and Sunter (1969) account for the different amount of information provided by each key variable by using a log-likelihood ratio based on the agreement probabilities. This is considered the standard procedure, as shown below. With m(γab) and u(γab) as defined in the previous section, each pair is assigned the following weight: wab = log(m(γab) / u(γab)).
Once a weighted measure of agreement is set, the following step is in its turn a mapping from Γ onto a space of states consisting of the following decisions: A1 (a link), A3 (a non-link), and A2 (a possible link), with related probabilities given that (a,b) ∈ U or (a,b) ∈ M, which can also be derived from the probability distributions m(γab) and u(γab) and the regions of Γ associated with each decision. As the weighted score increases, the associated pair (a,b) is more likely to belong to M. So, on the one hand, given an upper threshold, a larger comparison value for γab leads to considering the pair a link; on the other hand, given a lower threshold, a smaller comparison value leads to considering it a non-link.
Taking both steps into account, the problem of record linkage and the decision rule can be treated as a standard statistical hypothesis test with a critical and an acceptance region, which are obtained through the different values of γ in Γ and their respective composite weight values in R, compared with a set of fixed bounds. A probability model based on [m(γab), u(γab)] is therefore also needed in order to calibrate the error rates, i.e. μ = P{A1 | (a,b) ∈ U} and λ = P{A3 | (a,b) ∈ M}.
At this point, it is important to remark that, while Ω consists of only two disjoint subsets, M and U, the space of decisions is split into three subsets, because the probability distributions of matches and non-matches are partially overlapping. Then, for possible links, when A2 is reached, a later clerical review of the ambiguous results is needed, in order to appropriately assign these intermediate results to the link or non-link cases. An intuitive idea is that, if the main reason to implement an automatic record linkage procedure is to avoid or reduce the costs, time and errors involved in using specifically trained staff to link records manually, then the larger A2 is, the larger those costs, time consumption and errors are, and the worse the decision rule is. So, the optimal linkage rule has to maximize the probabilities of positive dispositions of comparisons (that is to say, positive links A1 and positive non-links A3) for a given pair of fixed errors μ and λ.
1.4.2. Probability model and optimal fusion rule
Following Fellegi and Sunter (1969), m(γ) and u(γ) are defined to be the conditional
probabilities of observing γ given that the record pair is, respectively, a true match or a true
non-match. Then, P{A1 | U} and P{A3 | M} are defined respectively as the sums of probabilities Σγ u(γ)P{A1 | γ} and Σγ m(γ)P{A3 | γ}. Moreover, P{A2 | γ} should be minimised in the optimal decision rule². In order to simplify notation, we write just γ instead of γab.
² For other criteria of optimality, see Gu et al. (2003) and Verykios (2000).
Then, the values of γ must be arranged so as to make the ratio R1(γ) = m(γ)/u(γ) monotonically decreasing (values of γ where m(γ) > 0 and u(γ) = 0 are placed first) and indexed as 1, 2, …, |Γ|, where |Γ| is the cardinality of the set Γ.
For a value of μ equal to the sum of u(γ) over the first n values of γ so arranged, and a value of λ equal to the sum of m(γ) over the last values of γ starting from a value n′, let Tμ = m(γn)/u(γn) be the upper cut-off threshold, and Tλ = m(γn′)/u(γn′) the lower one. Then, the optimal rule is given by:
(a,b) ∈ A1 (positive link) when the ratio R1(γ) is greater than or equal to Tμ;
(a,b) ∈ A2 (possible link) when the ratio R1(γ) lies between Tλ and Tμ;
(a,b) ∈ A3 (positive non-link) when the ratio R1(γ) is lower than or equal to Tλ.
The figure above illustrates how the optimal rule works, creating the critical, acceptance and intermediate regions. Each vertical line represents a threshold: the line on the left represents the lower bound Tλ, and the one on the right the upper bound Tμ. The areas marked FU and FM represent, respectively, the probability of false non-matches (FU) and false matches (FM), that is to say, the associated error rates.
As the figure suggests, the number of pairs (a,b) ∈ U is much greater than the number of pairs (a,b) ∈ M. Let nA and nB be the numbers of records in A and B and say, without loss of generality, nA < nB. Then nA < (nA×nB − nA). So, when estimating the u(γab) distribution, it is a common assumption that the proportion p of matched pairs is negligible.
As shown below, some additional assumptions on the behaviour of m(γab) and u(γab) can be made in order to simplify these conditional distributions, resulting in a slightly different form of the ratio R, closer to the weights proposed at the beginning of the section. Assuming independence between the components, they can be written as
m(γab) = m1(γ1ab)·m2(γ2ab)·…·mk(γkab), and u(γab) = u1(γ1ab)·u2(γ2ab)·…·uk(γkab).
Then, the decision rule can be written in terms of a log-likelihood ratio R2(γ) = log[m(γ)/u(γ)], which represents a weighted score Σj wj, where wj = log(mj / uj).
Since the reliability of the decision rule depends heavily on the accuracy of the estimates of m(γab) and u(γab), the core problem of the standard procedure for record linkage is to determine the values of these probabilities, known as matching parameters (see Scheuren and Winkler, 1993); the difficulty of empirically assessing the accuracy of the estimated parameters has led to the different approaches discussed in the following section.
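As an illustration of the rule just described, the following sketch (Python) computes the composite weight Σj wj under the conditional independence assumption and classifies a pair as link, possible link or non-link; the matching parameters mj, uj and the two thresholds are invented values for illustration, not estimates from real data.

```python
import math

# Illustrative matching parameters for three key variables (invented values):
# m[j] = P(agreement on variable j | matched pair), u[j] = P(agreement | unmatched pair).
m = [0.95, 0.90, 0.85]
u = [0.10, 0.05, 0.20]

T_UPPER = 5.0   # log-ratio threshold for "link" (T_mu), chosen for illustration
T_LOWER = -3.0  # log-ratio threshold for "non-link" (T_lambda)

def composite_weight(gamma):
    """Log-likelihood ratio sum_j w_j under conditional independence."""
    total = 0.0
    for g, mj, uj in zip(gamma, m, u):
        # agreement contributes log(m_j/u_j), disagreement log((1-m_j)/(1-u_j))
        total += math.log(mj / uj) if g == 1 else math.log((1 - mj) / (1 - uj))
    return total

def decide(gamma):
    w = composite_weight(gamma)
    if w >= T_UPPER:
        return "A1: link"
    if w <= T_LOWER:
        return "A3: non-link"
    return "A2: possible link (clerical review)"

for gamma in [(1, 1, 1), (1, 0, 1), (0, 0, 0)]:
    print(gamma, round(composite_weight(gamma), 2), decide(gamma))
```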
Bibliography
Belin, T.R. and Rubin, D.B., 1990. Calibration of errors in computer matching for Census
undercount. Proceedings of the Government Statistics Section of the American Statistical
Association, pp. 124-131.
Chentsov, N.N., 1982. Statistical Decision Rules and Optimal Inference. Translations of
Mathematical Monographs, Volume 53, 499. American Mathematical Society, Rhode Island,
U.S.
De Groot, M.H., 1970. Optimal Statistical Decisions. McGraw-Hill, New York [etc.]
Fellegi, I. P., and A. B. Sunter, 1969. A theory for record linkage. Journal of the American
Statistical Association, Volume 64, pp. 1183-1210.
Gu L., Baxter R., Vickers D., and Rainsford C., 2003. Record linkage: Current practice and
future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences,
Canberra, Australia, April 2003.
Jaro, M.A., 1989. Advances in record-linkage methodology as applied to matching the 1985
Census of Tampa, Florida. Journal of the American Statistical Association, Volume 84, pp.
414-420.
Larsen, M.D., 2004. Record Linkage of Administrative Files and Analysis of Linked Files. In
IMS-ASA's SRMS Joint Mini-Meeting on Current Trends in Survey Sampling and Official
Statistics. The Fort Radisson, Raichak, West Bengal, India.
Newcombe, H. B., Kennedy, J.M., Axford, S.J. and James, A.P., 1959. Automatic linkage of
vital records. Science, 130, pp. 954-959.
Newcombe, H. B., and Kennedy, J. M., 1962. Record linkage. Making maximum use of the
discriminating power of identifying information. Communication of the Association for
Computing Machinery, Volume 5(11), pp. 563-566.
Scheuren, F. and Winkler, W.E., 1993. Regression analysis of data files that are computer
matched – Part I. Survey Methodology, Volume 19, 1, pp. 39-58.
Verykios, V.S., 2000. A Decision Model for Cost Optimal Record Matching. In: National
Institute of Statistical Sciences Affiliates Workshop on Data Quality, Morristown, New
Jersey.
Wald, A.,1950. Statistical Decision Functions, John Wiley & Sons, New York.
Yancey, W.E., 2004b. An Adaptive String Comparator for Record Linkage. U.S. Bureau of
the Census, Statistical Research Division Report Series, n. 2004/02.
1.5. Estimation of the distributions of matches and nonmatches
Mauro Scanu (ISTAT)
As shown in the previous sections, a key role in applying probabilistic rules for record linkage is played by the distributions of the comparison variables for matches and nonmatches respectively. The problem is that these distributions are usually unknown, and need to be
estimated. Most papers deal with the problem of estimating these distributions from the data
sets to be linked. The proposed methods basically follow the approach firstly established in
Fellegi and Sunter (1969). The latter approach consists in considering all the pairs as a sample
of nA×nB records independently generated by a mixture of two distributions: one for the
matched pairs and the other for the unmatched ones. The status of matched and unmatched
pairs is randomly chosen by a latent (i.e. unobserved) dichotomous variable. This model
allows the computation of a likelihood function to be maximized in order to estimate the
unknown distributions of the comparison variables γab for matched and unmatched pairs.
Maximization of the likelihood function will require iterative methods for dealing with the
latent variable, usually the EM algorithm or some of its generalizations³.
³ The EM (Expectation Maximization) algorithm was defined in Dempster, Laird and Rubin (1977) as a method for obtaining maximum likelihood estimates from partially observed data sets (including the case of latent variables). Broadly speaking, it is an iterative procedure which starts with a preliminary value of the parameter to estimate, θ, say θ(0); fills in the missing values with their expected values under θ(0) (E step); computes the maximum likelihood estimate of θ on the completed file (M step); and iterates the E and M steps until convergence.
1.5.1. Different approaches in the estimation
The presence of a latent variable risks making the model parameters unidentifiable. For this reason, different papers have considered simplifying assumptions. In almost all cases the comparison variables γab are assumed to be dichotomous, i.e. they just report the agreement or disagreement on each key variable.
Independence between the comparison variables – This assumption is usually called the
Conditional Independence Assumption (CIA), i.e. the assumption of independence between
the comparison variables γjab, j=1,…,k, given the match status of each pair (matched or
unmatched pair). Fellegi and Sunter (1969) define a system of equations for estimating the
parameters of the distributions for matched and unmatched pairs, which gives estimates in
closed form when the comparison variables are at most three. Jaro (1989) solves this problem
for a general number of comparison variables with the use of the EM algorithm.
Dependence of comparison and latent variable defined by means of loglinear models –
Thibaudeau (1989, 1993) and Armstrong and Mayda (1993) have estimated the distributions
of the comparison variables under appropriate loglinear models of the comparison variables.
They found out that these models are more suitable than the CIA. The problem is estimating
the appropriate loglinear model. Winkler (1989, 1993) underlines that it is better to avoid
estimating the appropriate model, because tests are usually unreliable when there is a latent
variable. He suggests using a sufficiently general model, such as the loglinear model with interactions of order higher than three set to zero, and incorporating appropriate constraints during the
estimation process. For instance, an always valid constraint states that the probability of
having a matched pair is always smaller than the probability of having a nonmatch. A more
refined constraint is obviously the following:
p ≤ nA/(nA·nB) = 1/nB (taking, without loss of generality, nA ≤ nB).
Estimation of model parameters under these constraints may be performed by means of
appropriate modifications of the EM algorithm, see Winkler (1993).
Bayesian approaches – Fortini et al. (2001, 2002) look at the status of each pair (match and
nonmatch) as the parameter of interest. For this parameter and for the parameters of the latent variable that generates matches and nonmatches, they define natural prior distributions. The
Bayesian approach consists in marginalizing the posterior distribution of all these parameters
with respect to the parameters of the comparison variables (nuisance parameters). The result
is a function of the status of the different pairs that can be analysed for finding the most
probable configuration of matched and unmatched pairs.
Iterative approaches – Larsen and Rubin (2001) define an iterative approach which alternates
a model-based step and clerical review, in order to reduce as far as possible the number of records whose status is uncertain. Usually, models are selected from a set of fixed
loglinear models, through parameter estimation computed with the EM algorithm and
comparisons with “semi-empirical” probabilities by means of the Kullback-Leibler distance.
Other approaches – Different papers do not estimate the distributions of the comparison
variables on the data sets to link. In fact, they use ad hoc data sets or training sets. In this last
case, it is possible to use comparison variables more informative than the traditional
dichotomous ones. For instance, a remarkable approach is considered in Copas and Hilton
(1990), where comparison variables are defined as the pair of categories of each key variable
observed in two files to match for matched pairs (i.e. comparison variables report possible
classification errors in one of the two files to match). Unmatched pairs are such that each
component of the pair is independent of the other. In order to estimate the distribution of
comparison variables for matched pairs, Copas and Hilton need a training set. They estimate
model parameters for different models, corresponding to different classification error models.
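The following sketch illustrates the kind of EM estimation described above for the two-component mixture under the conditional independence assumption (in the spirit of Jaro, 1989); the simulated comparison vectors, starting values and function names are our own illustrative assumptions, not code from any of the cited works.

```python
# Sketch: EM estimation of p, m_j, u_j from dichotomous comparison vectors under the CIA.
import random

random.seed(0)

def prod_prob(gamma, probs):
    """Probability of the comparison vector gamma given per-variable agreement probabilities."""
    out = 1.0
    for g, pr in zip(gamma, probs):
        out *= pr if g == 1 else (1 - pr)
    return out

def simulate_pairs(n_pairs=5000, p=0.05, m=(0.9, 0.85, 0.8), u=(0.1, 0.15, 0.05)):
    """Generate comparison vectors from the two-component mixture (synthetic data)."""
    data = []
    for _ in range(n_pairs):
        probs = m if random.random() < p else u
        data.append(tuple(int(random.random() < pr) for pr in probs))
    return data

def em(data, n_iter=50):
    """EM estimation of (p, m, u) under conditional independence."""
    k = len(data[0])
    p, m, u = 0.1, [0.8] * k, [0.2] * k            # arbitrary starting values
    for _ in range(n_iter):
        # E step: posterior probability that each pair is a match
        post = []
        for gamma in data:
            pm = p * prod_prob(gamma, m)
            pu = (1 - p) * prod_prob(gamma, u)
            post.append(pm / (pm + pu))
        # M step: update mixture weight and per-variable agreement probabilities
        s = sum(post)
        p = s / len(data)
        for j in range(k):
            m[j] = sum(g * gamma[j] for g, gamma in zip(post, data)) / s
            u[j] = sum((1 - g) * gamma[j] for g, gamma in zip(post, data)) / (len(data) - s)
    return p, m, u

p_hat, m_hat, u_hat = em(simulate_pairs())
print(round(p_hat, 3), [round(x, 3) for x in m_hat], [round(x, 3) for x in u_hat])
```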
1.5.2. Quality assessment and open issues
Usually papers on record linkage do not report any quality assessment on the estimation of the
distributions of the comparison variables. However, it is necessary to report a weakness of all
the estimation methods based on a specific model (apart from that proposed by Copas and
Hilton, 1990). The assumed model fails to be true for the sample defined by the set of nA×nB pairs formed from the two data sets to link. In that case, it is not possible to state that the comparison
variables are independently generated by appropriate distributions. For more details about this
weakness, see Kelley (1984). It is not yet clear how the failure of this independence
hypothesis affects the record linkage results.
Given the presence of a latent variable, estimation is not reliable when one of the categories of
the latent variable is rare. In this case, the set of the matched pairs M should be large enough
(say, more than 5% of the overall set of nA×nB pairs). This is one of the motivations for the application of blocking procedures, as shown in the next section.
Bibliography
Armstrong, J. and Mayda, J.E., 1993. Model-based estimation of record linkage error rates.
Survey Methodology, Volume 19, pp. 137-147.
Copas, J. R., and F. J. Hilton, 1990. Record linkage: statistical models for matching computer
records. Journal of the Royal Statistical Society, A, Volume 153, pp. 287-320.
Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum Likelihood from Incomplete
Data via the EM algorithm. Journal of the Royal Statistical Society, Series B, Volume 39, pp.
1-38
Fellegi, I. P., and A. B. Sunter, 1969. A theory for record linkage. Journal of the American
Statistical Association, Volume 64, pp. 1183-1210.
Fortini, M., Liseo, B., Nuccitelli, A. and Scanu, M., 2001. On Bayesian record linkage.
Research in Official Statistics, Volume 4, pp. 185-198. Published also in Monographs of
Official Statistics, Bayesian Methods (E. George (ed.)), Eurostat, pp. 155-164.
Fortini, M., Nuccitelli, A., Liseo, B., Scanu, M., 2002. Modelling issues in record linkage: a
Bayesian perspective. Proceedings of the Section on Survey Research Methods, American
Statistical Association, pp. 1008-1013.
Jaro, M.A., 1989. Advances in record-linkage methodology as applied to matching the 1985
Census of Tampa, Florida. Journal of the American Statistical Association, Volume 84, pp.
414-420.
Kelley, R.B., 1984. Blocking considerations for record linkage under conditions of
uncertainty. Statistical Research Division Report Series, SRD Research Report No. RR-84/19.
Bureau of the Census, Washington. D.C.
Larsen, M.D. and Rubin, D.B., 2001. Iterative automated record linkage using mixture
models. Journal of the American Statistical Association, 96, pp. 32-41.
Thibaudeau, Y., 1989. Fitting log-linear models when some dichotomous variables are
unobservable. Proceedings of the Section on statistical computing, American Statistical
Association, pp. 283-288.
Thibaudeau, Y., 1993. The discrimination power of dependency structures in record linkage.
Survey Methodology, Volume 19, pp. 31-38.
Winkler, W.E., 1989a. Near automatic weight computation in the Fellegi-Sunter model of
record linkage. Proceedings of the Annual Research Conference, Washington D.C., U.S.
Bureau of the Census, pp. 145-155.
Winkler, W.E., 1989b. Frequency-based matching in Fellegi-Sunter model of record linkage.
Proceedings of the Section on Survey Research Methods, American Statistical Association,
778-783 (longer version report rr00/06 at http://www.census.gov/srd/www/byyear.html).
Winkler, W.E., 1993. Improved decision rules in the Fellegi-Sunter model of record linkage.
Proceedings of the Survey Research Methods Section, American Statistical Association, pp.
274-279.
1.6. Blocking procedures
Gervasio Fernandez (INE)
1.6.1. General Blocking procedures
Record linkage procedures require that every record from one data set be compared with all the records from the other data set; when one or both data sets are quite large, the number of pairwise comparisons grows rapidly and system requirements become correspondingly higher, in time or in resources.
There is a way to reduce those needs, by splitting records into groups or 'blocks', provided that
comparisons between elements from different blocks will not be made. So, each record from a
given block in the first data set should be compared only with records from a given block in
the second data set.
However, it must be taken into account that this reduction bears the risk of mistakenly assigning records to a block, so that some of their possible matches are never compared, i.e. they will not be properly matched. This drawback can be reduced by applying multi-pass techniques.
To find a review on blocking procedures for record linkage, see Baxter et al. (2003),
Cochinwala et al. (2001) and Gu et al. (2003).
1.6.2. Standard Blocking
A first and easy way to group records is possible when well-defined and well-coded keys are
available for both data sets, e.g. for place of birth. In this case, each record is just compared
with every other record with the same place of birth from the second data set.
Other examples of keys are the first digits of the Social Security number or the first characters of the first or last name of a person; in the latter case, phonetic-orthographic codes (e.g. Russell-Soundex, NYSIIS, ONCA, Metaphone for English words) are typically used to reduce misspelling and writing errors. The figure below shows an example using the ZIP / postal code as a key:
Blocks are defined as the sets of records sharing the same key value, where the key has been defined using the attributes available in each data set. Depending on the keys used, blocks with a large number of records can occur, and hence an ineffectively large number of comparisons; on the other hand, in the case of small blocks, true record matches can be lost, especially if the key contains misprints.
An analysis of error reduction using blocking methods can be found in Elfeky et al. (2002),
Jaro (1989) and Newcombe (1988).
Nevertheless, standard blocking procedures will not work properly unless the variables used
as a key are correctly coded and recorded. This ideal situation is not always the case, and
several alternative methods, which are introduced below, have been proposed for rearranging data into blocks.
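A minimal sketch of standard blocking (Python, with invented records) may help fix ideas: candidate pairs are generated only within blocks sharing the same key value, here the ZIP / postal code.

```python
from collections import defaultdict
from itertools import product

# Invented records: only pairs sharing the same postal code are compared.
file_a = [{"id": "a1", "zip": "28046"}, {"id": "a2", "zip": "28013"}]
file_b = [{"id": "b1", "zip": "28046"}, {"id": "b2", "zip": "08001"}]

def block_by_key(records, key):
    """Group records into blocks indexed by the value of the blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks

blocks_a = block_by_key(file_a, "zip")
blocks_b = block_by_key(file_b, "zip")

# Candidate pairs are formed only within blocks sharing the same key value.
candidate_pairs = [
    (ra["id"], rb["id"])
    for key in blocks_a.keys() & blocks_b.keys()
    for ra, rb in product(blocks_a[key], blocks_b[key])
]
print(candidate_pairs)   # here only ('a1', 'b1') is compared
```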
1.6.3. Fuzzy Blocking
When keys with misprints are present, and they give rise to losses of true record matches
because they are assigned to different blocks, fuzzy blocking methods can be applied.
With these methods, records are split into not necessarily disjoint blocks, or records are assigned to more than one block. For example, it is possible to have the date of birth in one data set and then define blocks by year of birth; if another data set only reports people's ages, it is still possible to look through the records with the appropriate year of birth. A kind of fuzzy blocking method is Bigram Indexing, used in the software Febrl (Christen and Churches, 2005b). The underlying idea is to consider, for a given key value in a record, all possible bigrams (length-two substrings), to build sub-lists of bigrams whose length is set by a threshold value, and then to assign the record to the blocks labelled by the resulting keys.
Let us give an example of a Bigram Indexing procedure where the ZIP / postal code
mentioned above is split into bigrams. The code number "28046" results in a bigram list
("28", "80", "04", "46"), which is the main list of bigrams. The set of all the sub-strings of
length 3 is as follows:
("28", "80", "04")
("28", "80", "46")
("28", "04", "46")
("80", "04", "46")
Then every record which holds the value "28046" for the key variable ZIP / postal code will
be assigned to 4 different blocks, each of them labelled with the corresponding sub-list of bigrams.
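The following sketch reproduces this Bigram Indexing example (Python); the threshold fraction of 0.75 and the helper names are illustrative assumptions, not the exact Febrl implementation.

```python
from itertools import combinations

def bigrams(value):
    """All adjacent length-two substrings of the key value."""
    return [value[i:i + 2] for i in range(len(value) - 1)]

def bigram_block_keys(value, threshold=0.75):
    """Sub-lists of the bigram list, with length set by the threshold fraction;
    each sub-list, joined into a string, labels one block for the record."""
    grams = bigrams(value)
    sub_len = max(1, int(round(threshold * len(grams))))
    return {"".join(sub) for sub in combinations(grams, sub_len)}

print(bigrams("28046"))                    # ['28', '80', '04', '46']
print(sorted(bigram_block_keys("28046")))  # four block keys, one per length-3 sub-list
```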
1.6.4. Sorted Neighbourhood
Another method consists in pooling the records to be handled into a single list and then sorting them by some external key. Every record is then compared with the records in a moving window of size w, centred upon the selected record.
This method can be used with several independent sort keys, increasing the number of
comparisons to be made. Sort keys for data sets must be found in such a way that records to
be compared stay close to each other in the re-arranged data set. Sort keys should be chosen to
be related to the elements involved in the comparison functions. The ability to sort large data sets efficiently then becomes an important issue.
A description and analysis of some algorithms and methods can be found in Hernandez and
Stolfo (1995, 1998), Monge (2000) and Neiling and Muller (2001).
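A minimal sketch of the sorted neighbourhood idea (Python, with invented records): the pooled records are sorted by a key and candidate pairs are generated only inside a moving window of size w.

```python
# Sketch of the sorted neighbourhood method: pool the records, sort them by a
# chosen key, and generate candidate pairs only inside a moving window of size w.
def sorted_neighbourhood_pairs(records, sort_key, w=3):
    ordered = sorted(records, key=lambda rec: rec[sort_key])
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + w]:
            # keep only cross-file pairs; same-file pairs would serve deduplication
            if rec["file"] != other["file"]:
                pairs.append((rec["id"], other["id"]))
    return pairs

records = [
    {"file": "A", "id": "a1", "surname": "rossi"},
    {"file": "A", "id": "a2", "surname": "russo"},
    {"file": "B", "id": "b1", "surname": "rossi"},
    {"file": "B", "id": "b2", "surname": "verdi"},
]
print(sorted_neighbourhood_pairs(records, "surname", w=3))
```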
1.6.5. Alternatives to Sorted Neighbourhood
In order to avoid re-arranging the data set time and again, and on the assumption that the data sets involved are already sorted or at least partially sorted, several methods have been proposed, among them the use of priority queues (heaps) in which representative records from the most recently used blocks are stored and consulted first when seeking the block for the next record to be processed. A description of this method can be found in Monge and Elkan (1997). Similar methods are applied by Yancey (2002) in the BigMatch system.
1.6.6. Similarity/Distance-based Clustering
Although similar to the Sorted Neighbourhood method, this technique differs from it in that, instead of a centred sliding window of size w over the sorted data set, it uses a canopy block for the record to be processed, that is, a block made of nearby records according to a similarity/distance-based function.
The basic idea is to use similarity/distance-based functions which should be easier to calculate
than the comparison function actually used, and should approximate its value. So, records that are
located far apart from each other will be non-matches.
The figure above shows the procedure for identifying canopy clusters: given two datasets A
and B (whose elements, i.e. records, are labelled with a or b respectively) and using two key variables, say BV1 and BV2, both datasets are arranged in a single list of records; then, one
record is randomly chosen as the centre of the first cluster. All the records within a certain
distance are considered to belong to the corresponding canopy cluster; then, the first record
randomly chosen and a subset of records close to it (within a smaller threshold distance) are
removed from the list, in order to avoid the proliferation of overlapping clusters.
For further details, see the papers by Bilenko and Mooney (2002), Cohen and Richman (2002), McCallum et al. (2000) and Neiling and Muller (2001).
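The following sketch gives one possible reading of the canopy procedure described above (Python); the cheap distance function, the loose and tight thresholds, and the data are invented for illustration.

```python
import random

def canopy_blocks(records, distance, loose=2.0, tight=0.5, seed=0):
    """Group records into (possibly overlapping) canopies using a cheap distance."""
    rng = random.Random(seed)
    remaining = list(records)
    canopies = []
    while remaining:
        centre = remaining.pop(rng.randrange(len(remaining)))
        canopy = [centre] + [r for r in remaining if distance(centre, r) <= loose]
        canopies.append(canopy)
        # records very close to the centre are removed so they cannot seed new canopies
        remaining = [r for r in remaining if distance(centre, r) > tight]
    return canopies

records = [{"id": k, "bv": v} for k, v in
           [("a1", 1.0), ("a2", 1.4), ("b1", 1.2), ("b2", 5.0), ("b3", 5.3)]]
cheap_distance = lambda r, s: abs(r["bv"] - s["bv"])
for canopy in canopy_blocks(records, cheap_distance):
    print([r["id"] for r in canopy])
```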
Bibliography
Baxter, R., Christen, P. and Churches T. (2003) A Comparison of fast blocking methods for
record linkage. In Proc. of ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage,
and Object Consolidation, pages 25--27, Washington, DC, USA, August 2003.
Bilenko, M. and Mooney, R.J. (2002) Learning to Combine Trained Distance Metrics for
Duplicates Detection in Databases. Technical Report AI-02-296, University of Texas at
Austin, Feb 2002.
Christen, P. and Churches, T. (2005a) A Probabilistic Deduplication, Record Linkage and
Geocoding System. In Proceedings of the ARC Health Data Mining workshop, University of
South Australia, April 2005.
Christen, P. and Churches, T. (2005b) Febrl: Freely extensible biomedical record linkage
Manual. Release 0.3 edition, Technical Report Computer Science Technical Reports no.TRCS-02-05, Department of Computer Science, FEIT, Australian National University, Canberra.
Cochinwala, M., Dalal, S., Elmagarmid, A.K. and Verykios, V.S. (2001) Record Matching:
Past, Present and Future.
Cohen, W. and Richman, J. (2002) Learning to Match and Cluster Large High-Dimensional
Data Sets for Data Integration. In Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD).
Elfeky, M., Verykios, V. and Elmagarmid, A. (2002) TAILOR: A Record Linkage Toolbox.
Proc. of the 18th Int. Conf. on Data Engineering IEEE.
Gu, L., Baxter, R., Vickers, D., and Rainsford, C. (2003). Record linkage: Current practice
and future directions. Technical Report 03/83, CSIRO Mathematical and Information
Sciences, Canberra.
Hernandez, M. and Stolfo, S. (1995) The Merge/Purge Problem for Large Databases. In Proc.
of 1995 ACT SIGMOD Conf., pp. 127–138.
Hernandez, M. and Stolfo, S. (1998) Real-world data is dirty: data cleansing and the
merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2).
Jaro, M.A. (1989) Advances in record-linkage methodology as applied to matching the 1985
Census of Tampa, Florida. Journal of the American Statistical Association, Volume 84, pp.
414-420.
Kelley, R.P. (1984) Blocking considerations for record linkage under conditions of
uncertainty. Proceedings of the Social Statistics Section, American Statistical Association, pp.
602-605.
McCallum, A., Nigam, K. and Ungar, L. (2000) Efficient clustering of high-dimensional data
sets with application to reference matching. In Proc. of the sixth ACM SIGKDD Int. Conf. on
KDD, pp. 169–178.
Monge, A.E. (2000a) Matching algorithm within a duplicate detection system. IEEE Data
Engineering Bulletin, 23(4).
Monge, A.E. (2000b) An Adaptive and Efficient Algorithm for Detecting Approximately
Duplicate Database Records.
Monge, A.E. and Elkan, C. (1997) An efficient domain-independent algorithm for detecting
approximately duplicate database records. In The proceedings of the SIGMOD 1997
workshop on data mining and knowledge discovery, May 1997.
Neiling, M. and Muller, R.M. (2001) The good into the Pot, the bad into the Crop.
Preselection of Record Pairs for Database Fusion. In Proc. of the First International Workshop
on Database, Documents, and Information Fusion, Magdeburg, Germany.
Newcombe, H.B. (1988) Handbook of Record Linkage, Oxford University Press.
Yancey, W.E. (2002) BigMatch: A program for extracting probable matches from a large file
for record linkage. RRC 2002-01. Statistical Research Division, U.S. Bureau of the Census.
Yancey, W.E. (2004) A Program for Large-Scale Record Linkage. In Proceedings of the
Section on Survey Research Methods, American Statistical Association.
Other bibliography
Christen, P. (2007) Improving data linkage and deduplication quality through nearest-neighbour based blocking. Submitted to the thirteenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD'07).
Christen, P., and Churches, T. (2004) Blind Data Linkage using n-gram Similarity
Comparisons. Proceedings of the 8th PAKDD'04 (Pacific-Asia Conference on Knowledge
Discovery and Data Mining), Sydney. Springer Lecture Notes in Artificial Intelligence,
(3056).
Christen, P., Churches, T., and Hegland, M. (2004) A Parallel Open Source Data Linkage
System. In the Proc of The Eighth Pacific-Asia Conference on Knowledge Discovery and
Data Mining, Sydney.
Christen, P., Churches, T., and Zhu, J.X. (2002) Probabilistic Name and Address Cleaning
and Standardization. Presented at the Australasian Data Mining Workshop, Canberra.
Christen, P., Churches, T., Lim, K., and Zhu, J.X (2002) Preparation of name and address data
for record linkage using hidden Markov models. BioMed Central Medical Informatics and
Decision Making. (http://www.biomedcentral.com/1472-6947/2/9).
Christen, P., et al. (2002a) Parallel Computing Techniques for High-Performance
Probabilistic Record Linkage. Proceedings of the Symposium on Health Data Linkage,
Sydney.
Christen, P., et al. (2002b) High-Performance Computing Techniques for Record Linkage.
Proceedings of the Australian Health Outcomes Conference (AHOC-2002), Canberra.
Elfeky, M.G., Verykios, V.S., Elmagarmid, A., Ghanem, M. and Huwait, H. (2003) Record
Linkage: A Machine Learning Approach, a Toolbox, and a Digital Government Web Service.
Department of Computer Sciences, Purdue University, Technical Report CSD-TR 03-024.
Gu, L., and Baxter, R. (2004) Adaptive Filtering for Efficient Record Linkage. SIAM Int.
Conf. on Data Mining, April 22-24, Orlando, Florida.
Verykios, V.S., Elfeky, M.G., Elmagarmid, A., Cochinwala and M., Dalal, S. (2000) On The
Accuracy And Completeness Of The Record Matching Process. In Sloan School of
Management, editor, Procs. of Information Quality Conference, MIT, Cambridge, MA.
Goiser, K., and Christen P. (2006) Towards Automated Record Linkage. In Proceedings of
the Fifth Australasian Data Mining Conference (AusDM2006), Sydney.
1.7. Quality assessments
Nicoletta Cibella and Tiziana Tuoto (ISTAT)
Record linkage is affected by two types of errors: the record pairs that should have been
linked but actually remain unmatched and, vice versa, the record pairs which are linked even
if they refer to two different entities.
In a statistical context, record linkage accuracy is evaluated in terms of the false match and
false non-match rates. In other contexts, as in the medical and epidemiological fields,
different measures are considered (the positive predictive value and sensitivity), although they
are algebraic transformations of the false match and false non-match rates, respectively. The
same accuracy indicators are also used in the research field of information retrieval, although
they are usually named precision and recall.
In order to define the previous indicators, let us assume that the following quantities are known:
– The number of record pairs linked correctly (true positives) nm.
– The number of record pairs linked incorrectly (false positives, Type I error) nfp.
– The number of record pairs unlinked correctly (true negatives) nu.
– The number of record pairs unlinked incorrectly (false negatives, Type II error) nfn.
– The total number of true match record pairs, Nm.
– The total number of true non-match record pairs, Nu.
The false match rate is defined as:
fmr=nfp/(nm + nfp),
i.e. the number of incorrectly linked record pairs divided by the total number of linked record
pairs. The false match rate corresponds to the type I error in a one-tailed hypothesis test. The positive predictive value is easily computed from the false match rate
(ppv=nm/(nm + nfp)=1-fmr),
and corresponds to the number of correctly linked record pairs divided by the total number of
linked record pairs.
On the other side, the false non-match rate is defined as:
fnmr=nfn/Nm,
i.e. the number of incorrectly unlinked record pairs divided by the total number of true match
record pairs. The false non-match rate corresponds to the type II error in a one-tailed hypothesis test.
Similarly as for the ppv, sensitivity can be obtained from the
fnmr (s=nm/Nm=1-fnmr)
as the number of correctly linked record pairs divided by the total number of true match
record pairs.
Some authors also recommend computing the match rate:
(nm + nfp)/Nm,
i.e. the total number of linked record pairs divided by the total number of true match record
pairs.
A different performance measure is specificity, defined as nu/Nu, i.e. the number of correctly
unlinked record pairs divided by the total number of true non-match record pairs. The
difference between sensitivity and specificity is that sensitivity measures the percentage of
correctly classified record matches, while specificity measures the percentage of correctly
classified non-matches.
As anticipated at the beginning of this section, in information retrieval the previous accuracy
measures take the name of precision and recall. Precision measures the purity of search
results, or how well a search avoids returning results that are not relevant. Recall refers to
completeness of retrieval of relevant items. Hence, precision can be defined as the number of
correctly linked record pairs divided by the total number of linked record pairs, i.e. it
coincides with the positive predicted value. Similarly, recall is defined as the number of
correctly linked record pairs divided by the total number of true match record pairs, i.e. recall
is equivalent to sensitivity. As a matter of fact, precision and recall can also be defined in
terms of non-matches.
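The following sketch (Python) computes the measures defined above from the four counts nm, nfp, nu and nfn; the counts used in the example are invented.

```python
def linkage_quality(n_m, n_fp, n_u, n_fn):
    """Accuracy measures computed from the counts defined above."""
    N_m = n_m + n_fn          # total number of true match pairs
    N_u = n_u + n_fp          # total number of true non-match pairs
    return {
        "false match rate (fmr)":          n_fp / (n_m + n_fp),
        "positive predictive value (ppv)": n_m / (n_m + n_fp),
        "false non-match rate (fnmr)":     n_fn / N_m,
        "sensitivity (recall)":            n_m / N_m,
        "specificity":                     n_u / N_u,
        "match rate":                      (n_m + n_fp) / N_m,
    }

# Invented counts, purely to show the computation.
for name, value in linkage_quality(n_m=900, n_fp=50, n_u=98950, n_fn=100).items():
    print(f"{name}: {value:.3f}")
```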
The same quality indicators can be evaluated even if the linkage procedure is performed
through techniques different from the probabilistic one, as for instance supervised or
unsupervised machine learning (Elfeky et al, 2003).
Additional performance criteria for record linkage are given by the time consumed by
software programmes and by the number of records that require manual review. The time
complexity of a record linkage algorithm is usually dominated by the number of record
comparisons. On the other hand, manual review of records is also time-consuming, expensive
and error prone.
All the performance indicators defined above have to be evaluated on actual data, using methods such as those described below.
1.7.1. Sampling and clerical review
The measures defined above can be estimated by drawing randomly (or with a purposive selection) a sample of pairs from the whole set of pairs (i.e. from both M and U). These sample
pairs are then matched with a more intensive (and accurate) procedure in order to evaluate the
accuracy of the original match (see Hogan and Wolter 1988, Ding and Fienberg 1994). As the
procedures implemented are more accurate and are performed by highly qualified personnel,
the “rematch” is considered error free and it represents the true match status. Sometimes the
whole record linkage procedure on the selected sample is done manually so as to be confident
that it is “perfect”. The bias of the original match is evaluated by means of the discrepancies
between the match and the rematch results.
Selection of pairs to include in the sample is sometimes problematic. Winkler (1995) suggests
reducing the sample size by selecting pairs of records from the area where problems arise
more frequently. This can be done by adopting a weighting strategy, exploiting the fact that the nearer the weights are to a fixed threshold, the more linkage errors occur.
Alternative procedures consist of evaluating the record linkage procedure quality by using
appropriate statistical models. These models produce an automatic estimate of the error rates,
as described in the following paragraphs.
1.7.2. Belin-Rubin procedure
Belin and Rubin (1995) propose a model for estimating false match rates for each possible
threshold value. They define a model where the distribution of observed weights is interpreted
as a mixture of weights for true and false matches. Their approach focuses on providing a
predicted probability of a match for a pair of records, with an associated standard error, as a
function of the matching weight.
Their method is particularly useful when record linkage should satisfy the following
constraint: each record of one file cannot be matched to more than one record of the other file
(one to one match constraint). In this case their procedure dramatically improves the record
linkage performance because non-matches are mostly eliminated. Generally, the method
works well when there is a good separation between the matching weights associated with
matches and non-matches and the failure of the conditional independence assumption is not
too severe. This method requires that a training sample of record pairs is available, where the
status of each record pair is perfectly known.
1.7.3. Torelli-Paggiaro estimation method
In order to avoid the use of a training sample, Torelli and Paggiaro (1999) suggest a strategy
that allows error rates to be evaluated by means of estimates of the probability of being a match for each record pair. Maximum likelihood estimates of these probabilities are computed via the
EM algorithm or some of its modifications. Torelli and Paggiaro propose to evaluate the false
non-match rate as the sum of the matching probabilities of the record pairs under the
threshold. The false match rate is computed similarly.
Quality of these error rate estimators dramatically depends on the accuracy of the probability
estimators: if these probabilities are obtained under the conditional independence assumption,
and this assumption does not hold, error rate estimators will be strongly biased.
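A stylised sketch of this idea (Python) is given below: error rates are estimated from the estimated match probabilities of the pairs on either side of the threshold. The toy probabilities and the choice of denominators are our own illustrative assumptions, not the exact Torelli-Paggiaro formulation.

```python
# Sketch: estimating error rates from estimated match probabilities and the declared links.
pairs = [
    # (estimated probability of being a match, declared link?)
    (0.99, True), (0.95, True), (0.80, True),
    (0.40, False), (0.10, False), (0.02, False),
]

# expected number of true matches over all pairs
expected_matches = sum(p for p, _ in pairs)

# expected number of true matches left below the threshold (declared non-links)
expected_missed = sum(p for p, linked in pairs if not linked)
est_false_nonmatch_rate = expected_missed / expected_matches

# expected number of non-matches above the threshold (declared links)
expected_false_links = sum(1 - p for p, linked in pairs if linked)
est_false_match_rate = expected_false_links / sum(1 for _, linked in pairs if linked)

print(round(est_false_nonmatch_rate, 3), round(est_false_match_rate, 3))
```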
1.7.4. Adjustment of statistical analyses
Generally speaking, it is important to assess the record linkage quality because linkage errors
can affect the population parameter estimates (Neter et al, 1965). Scheuren and Winkler
(1993) propose a method for adjusting statistical analyses for matching errors. In this case, the
problem is restricted to the impact of the mismatch error on the bias of the coefficient of the
standard regression model on two variables (one from each source). The estimator bias is corrected by introducing into the model the probabilities of being correctly and incorrectly matched.
Scheuren and Winkler (1997) also propose to modify the former regression estimates for the
presence of outliers, introducing an appropriate iterative solution. Lahiri and Larsen (2000)
extend the model in order to estimate also the regression coefficients when more than one
variable from a data set is considered.
Bibliography
Belin TR. and Rubin D.B., 1995. A method for calibrating false-match rates in record linkage.
Journal of the American Statistical Association, 90, 694-707.
Christen, P and Goiser, K, 2005. Assessing duplication and data linkage quality: what to
measure?, Proceedings of the fourth Australasian Data Mining Conference, Sydney,
December 2005, viewed 16 June 2006, http://datamining.anu.edu.au/linkage.html
Ding Y. and Fienberg S.E., 1994. Dual system estimation of Census undercount in the
presence of matching error, Survey Methodology, 20, 149-158.
Elfeky M.G., Verykios V.S., Elmagarmid A., Ghanem M. and Huwait H., 2003. Record
Linkage: A Machine Learning Approach, a Toolbox, and a Digital Government Web Service.
Department of Computer Sciences, Purdue University, Technical Report CSD-TR 03-024.
Hogan H. and Wolter K., 1988. Measuring accuracy in a post-enumeration survey. Survey
Methodology, 14, 99-116.
Lahiri P. and Larsen M.D., 2000. Model-based analysis of records linked using mixture
models, Proceedings of the Survey Research Methods Section, American Statistical
Association, pp. 11-19.
Neter J., Maynes E.S. and Ramanathan R., 1965. The effect of mismatching on the measurement
of response error. Journal of the American Statistical Association, 60, 1005-1027.
Scheuren F. and Winkler W.E., 1993. Regression analysis of data files that are computer
matched, Survey Methodology, 19, 39-58.
Scheuren F. and Winkler W.E., 1997. Regression analysis of data files that are computer
matched- part II, Survey Methodology, 23, 157-165.
Winkler W.E., 1993. Improved decision rules in the Fellegi-Sunter model of record linkage.
Proceedings of the Survey Research Methods Section, American Statistical Association, pp.
274-279.
Winkler W.E., 1995. Matching and record linkage. Business Survey Methods, Cox, Binder,
Chinappa, Christianson, Colledge, Kott (eds.). John Wiley & Sons, New York.
1.8. Analysis of files obtained by record linkage
Miguel Guigo (INE)
As can be inferred from the large number of applied studies in which these methods are involved
(Alvey and Jamerson, 1997), merging files through probabilistic record linkage is
not an end in itself, but a means to a wide variety of goals related to the use of administrative
microdata, sometimes even for non-statistical purposes. Applications include imputation,
improvement of survey frames, treatment of non-response, longitudinal studies,
procedures for obtaining better estimates, and so on.
Therefore, when a pair of data sets is fused, any administrative decisions as well as statistical
conclusions based on the linked file must take into account that the results are affected by two
types of errors: on the one hand, the incorrect acceptance of false matches and, on the other
hand, the incorrect rejection of true matches. Record linkage procedures must deal with the
existing trade-off between the two types of errors and/or measure their effects on the
parameter estimates of the models fitted to the resulting files.
Different approaches have tackled this problem. The first, due to Neter et al. (1965), studies
the bias in the estimates of response errors when the results of a survey are partially improved
through record checks, and shows that relatively small errors in the matching process can have
substantial effects on the results.
Scheuren and Oh (1975) focus on various problems noticed in a large-scale matching task,
namely a Census - Social Security match through the Social Security Number (SSN)4. They focus
attention on the impact of different decision rules on mismatching and erroneous non-matching.
Furthermore, they point out the constraints on developing an appropriate comparison vector when
the statistical purposes differ from the administrative aims that generated the file and that regulate its
maintenance. Nevertheless, their approach does not offer general criteria to estimate the
parameters of the distributions, such as $m(\gamma_{ab})$ and $u(\gamma_{ab})$. Their approach is to select a sample of
records, manually check their status as matched or unmatched pairs, and estimate those
parameters from the observed proportions.
More complete methodologies have been developed by Scheuren and Winkler (1993,
1996a, 1996b, 1997) through recursive processes of data editing and imputation. This
methodology focuses on building an accurate imputation model from a rather small number of
likely matched pairs, which are in turn the result of a common record linkage procedure:
once a first round of links has been made, a subset of high-scoring matches, whose error
rate is estimated to be low, is selected in order to specify a linear regression model and estimate its
parameters. Let A and B be the two data sets to be compared and suppose that, for every unit a in A,
some likely matches have been selected from B. Let also x and y be two characteristics available for
the records in A and B, respectively.
4 Although a unique common identifier is used to fuse data from the two files, various problems can arise
even when the linkage is achieved through an automated process. Scheuren and Oh report problems related to
misprints, the absence of the SSN in one of the two records that are candidates to be matches, unexplainable
changes of SSN in records known to belong to the same person, etc.
In an ordinary univariate linear regression model,
$y_i = a_0 + a_1 x_i + \varepsilon_i$,
$x_i$ and $y_i$ ought to come from the same observation, say unit $a$. Mismatched pairs of records,
however, are characterized by values of $x$ and $y$ observed on distinct units. In this case the actual
dependent variable taken from B is no longer $y$, but a new variable $z_i$ whose values are
$z_i = y_i$ if the linked record $j$ coincides with $i$, and $z_i = y_j$ otherwise. Scheuren and Winkler (1993)
consider several possible matches in B for every unit in A, so that $z_i$ also has several possible
values: $y_i$ with probability $p_i$ and $y_j$ with probability $q_{ij}$. It must be taken into account that
the estimates of the intercept $a_0$ and the slope $a_1$ are biased, and the correlation between $x$ and $y$
is understated, owing to the independence of the two variables in the cases where $z_i$ equals $y_j$
instead of $y_i$. Provided the probabilities $p_i$ and $q_{ij}$ can be estimated accurately, it is in turn
possible to obtain better estimates of the regression parameters, which can then provide feedback
for the record linkage procedure by estimating values of $y_i$ to be compared with those of the
possible matches in B. The improved record linkage step can lead to a new cycle, and so on until
convergence. Scheuren and Winkler (1996b) also deal with a variety of different scenarios,
depending on the availability of comparison variables.
Larsen (1999, 2001, 2004) and Lahiri and Larsen (2000, 2005) have widely discussed the
use of the former methodology for mixture models, trying to improve the estimates of the
probability that a pair of records is actually a match. Those estimates can be found through
maximum likelihood or Bayesian analysis, and the regression models are then adjusted by an
alternative to the bias correction method used in Scheuren and Winkler. By means of
simulated data sets, Larsen (1999) finds maximum likelihood estimates on the one hand and
posterior distributions on the other hand for the mixture model parameters. The different
values can be used to express uncertainty in the relationship between records because of their
unknown real status. Lahiri and Larsen (2000, 2005) consider the multivariate regression
model
$y_i = \mathbf{x}_i' \boldsymbol{\beta} + \varepsilon_i$,
where $\mathbf{x}_i$ is a column vector of explanatory variables belonging to some record $a$ in A, $y_i$ is
the response variable and $\boldsymbol{\beta}$ is the column vector of unknown regression coefficients. The bias
of the estimator of $\boldsymbol{\beta}$ is investigated under the assumption of existing but unidentified
mismatches between records, so that the observed values on the left-hand side of the equation
above are actually the $z_i$ described in the case study by Scheuren and Winkler. Lahiri and Larsen
(2000) propose $\hat{\boldsymbol{\beta}} = (W'W)^{-1} W'\mathbf{z}$ as an unbiased estimator, instead of the one obtained by
ordinary least squares, where $\mathbf{z}$ is the column vector of actually observed values of the response
variable and $W$ is a linear transformation of the matrix $X$ of explanatory data, with rows
$\mathbf{w}_i' = \mathbf{q}_i' X = \sum_j q_{ij} \mathbf{x}_j'$. A robust estimator based on absolute deviations is
also mentioned. Variances of the different estimators are compared in Lahiri and Larsen (2005)
via Monte Carlo simulation.
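The following sketch illustrates the adjusted estimator in the form given above, under the assumption that the matrix of linkage probabilities Q (with entries $q_{ij}$) is available; it is an illustrative implementation, not the authors' own code.

```python
import numpy as np

def lahiri_larsen_estimator(X, z, Q):
    """
    Bias-adjusted regression estimator for linked data (Lahiri and Larsen, 2000).
    X: (n, p) matrix of explanatory variables from file A (include a column of ones
       for the intercept).
    z: (n,) vector of response values taken from the linked records of file B.
    Q: (n, n) matrix of linkage probabilities, Q[i, j] = P(record i in A is linked
       to record j in B); rows should sum to one.
    """
    W = Q @ X                                        # w_i' = sum_j q_ij x_j'
    beta_adj = np.linalg.solve(W.T @ W, W.T @ z)     # adjusted estimator
    beta_ols = np.linalg.solve(X.T @ X, X.T @ z)     # naive OLS estimator, biased
    return beta_adj, beta_ols
```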
Additionally, Liseo and Tancredi (2004) develop a brief regression analysis based on a
Bayesian approach to record linkage, proposing a simulation to show that the relation between
the values of the explanatory variables $x_i$ and the actually observed values $z_i$ can provide
information useful to improve the linkage process.
Finally, Winkler (2006a) suggests that the use of a regression adjustment to improve matching
can be done by means of identifying variables that are not strictly the same, but actually
include the same information from different points of view. Based on practical experience
by Steel and Konschnik (1997), a possible application to files with company data is
proposed, pointing out that, for example, observations on receipts and on income refer to the
same accounting concept.
Bibliography
Alvey, W. and Jamerson, B. (eds.), 1997. Record Linkage Techniques – 1997. (Proceedings
of an International Record Linkage Workshop and Exposition on March 20-21, 1997 in
Arlington, Virginia) Washington, DC: Federal Committee on Statistical Methodology.
Lahiri P. and Larsen M.D., 2000. Model-based analysis of records linked using mixture
models, Proceedings of the Survey Research Methods Section, American Statistical
Association, pp. 11-19.
Lahiri P. and Larsen M.D., 2005. Regression Analysis With Linked Data. Journal of the
American Statistical Association, 100, pp. 222-230.
Larsen M.D., 1999. Multiple imputation analysis of records linked using mixture models.
Proceedings of the Survey Methods Section, Statistical Society of Canada, pp. 65-71.
Larsen, M.D., 2001. Methods for model-based record linkage and analysis of linked files. In
Proceedings of the Annual Meeting of the American Statistical Association, Mira Digital
publishing, Atlanta.
Larsen, M.D., 2004. Record Linkage of Administrative Files and Analysis of Linked Files. In
IMS-ASA's SRMS Joint Mini-Meeting on Current Trends in Survey Sampling and Official
Statistics. The Ffort Radisson, Raichak, West Bengal, India.
Larsen, M.D., 2005. Advances in Record Linkage Theory: Hierarchical Bayesian Record
Linkage Theory. 2005 Proceedings of the American Statistical Association, Survey Research
Methods Section [CD-ROM], pp. 3277- 3284. Alexandria, VA: American Statistical
Association.
Liseo, B. and Tancredi, A., 2004. Statistical inference for data files that are computer linked.
Proceedings of the International Workshop on Statistical Modelling, Firenze University Press, pp.
224-228.
Neter, J., Maynes, E.S, and Ramanathan, R., 1965. The effect of mismatching on the
measurement of response errors. Journal of the American Statistical Association, 60, pp.
1005-1027.
Scheuren, F. and Oh, H.L., 1975. Fiddling around with nonmatches and mismatches.
Proceedings of the Social Statistics Section, American Statistical Association, pp. 627-633.
Scheuren, F. and Winkler, W.E., 1993. Regression analysis of data files that are computer
matched – Part I. Survey Methodology, Volume 19, pp. 39-58.
Scheuren, F. and Winkler, W.E., 1996a. Recursive analysis of linked data files. U.S. Bureau
of the Census, Statistical Research Division Report Series, n.1996/08.
Scheuren, F. and Winkler, W.E., 1996b. Recursive Merging and Analysis of Administrative
Lists and Data, Proceedings of the Section of Government Statistics, American Statistical
Association, pp. 123–128.
Scheuren F. and Winkler W.E., 1997. Regression analysis of data files that are computer
matched – Part II, Survey Methodology, 23, pp. 157-165.
Steel, P. and Konschnik, C., 1997. Post-Matching Administrative Record Linkage Between
Sole Proprietorship Tax Returns and the Standard Statistical Establishment List. In Record
Linkage Techniques 1997, Washington, DC: National Academy Press, pp. 179-189.
Winkler, W. E., 1991. Error model for analysis of computer linked files. Proceedings of the
Section on Survey Research Methods, American Statistical Association, pp. 472-477.
Winkler, W.E., 2006a. Overview of Record Linkage and Current Research Directions. U.S.
Bureau of the Census, Statistical Research Division Report Series, n.2006/2.
2. Literature review on statistical matching
2.1 Statement of the problem of statistical matching
Marcello D’Orazio (ISTAT)
The term Statistical Matching (also data fusion or synthetic matching) refers to a series of
methods whose objective is the integration of two (or more) data sources (samples) drawn
from the same target population. The data sources are characterized by the fact that they all share a
subset of variables (the common variables) while, at the same time, each source separately observes
other subsets of variables. Moreover, there is a negligible chance that different sources observe the
same units (disjoint sets of units).
2.1.1. Differences with record linkage and preliminary definitions
This set of procedures differs from record linkage both in the inputs (i.e. the data sets to be
integrated) and in the output.
As far as the input is concerned, the data sets are usually two distinct samples without any
unit in common (no overlap between the data sources). On the contrary, record linkage
requires at least a partial overlap between the two sources.
In the simplest case of two samples, the classical statistical matching framework can be
represented in the following manner (Kadane, 1978; D’Orazio et al., 2006):

    Data source A: (X, Y) observed        Data source B: (X, Z) observed

In this situation X is the set of common variables, Y is observed only in A and not in B, and Z
is observed only in B and not in A (Y and Z are not jointly observed).
A second difference between record linkage and statistical matching is the output. In record
linkage, the objective is to recognize records belonging to the same unit in two distinct but
partially overlapping data sets. For this reason the focus is only on the X variables and on how
to deal with the possibility that these variables are reported with error.
On the contrary, statistical matching methods aim at integrating the two sources in order to
study the relationship between the two sets of variables that are not jointly observed, i.e. Y and
Z, or, more generally, to study how X, Y and Z are related. This objective can be achieved
using two seemingly distinct approaches.
Bibliography
D’Orazio, M., Di Zio, M. and Scanu, M., 2006. Statistical matching: theory and practice. John
Wiley, Chichester.
Kadane, J.B., 1978. Some statistical problems in merging data files. 1978 Compendium of
Tax Research, Office of Tax Analysis, Department of the Treasury, pp.159-171. Washington,
DC: U.S. Government Printing Office. Reprinted in Journal of Official Statistics (2001),
Volume 17, pp. 423-433.
2.2 The statistical matching workflow
Marcello D’Orazio, Marco Di Zio and Mauro Scanu (ISTAT)
The statistical matching workflow is extremely simple. It consists of three sequential steps:
harmonization of the data sets to be matched, application of a matching algorithm, and quality
evaluation of the results.
This simplicity is due to the fact that the statistical matching problem is essentially a simple
inferential problem: the estimation of joint information on variables that are never jointly
observed. This estimation problem can be either explicit or implicit in a statistical matching
approach, but it is always present. This chapter focuses mainly on the second step,
explaining the different approaches that are available according to the researcher’s goals
and the available information (Section 2.3). The following table illustrates the possibilities
available so far.
                                 Approaches to statistical matching
Stat. matching objectives      Parametric      Nonparametric      Mixed
Macro                              ✓                 ✓
Micro                              ✓                 ✓               ✓
As in every problem of inferential statistics, the estimation framework can
be either parametric or nonparametric. In the case of statistical matching there is also a third
possibility, given by a mixture of the two frameworks. This last option basically consists of a
two-step procedure.
Step 1) a parametric model is assumed and its parameters are estimated.
Step 2) a completed synthetic data set is derived through a nonparametric micro approach.
A further specific feature of statistical matching is the distinction between two kinds of
objectives, in the sequel denoted as the micro and macro approaches.
In the micro approach, the statistical matching objective is the construction of a complete
“synthetic” file. The file is complete in the sense that it contains records where X, Y and Z are
jointly present. The term “synthetic” refers to the fact that this file is not the result of a direct
observation of all the variables on a set of units belonging to the population of interest, but it
is obtained exploiting information in the observed distinct files. For example, in the case of
the data sets represented in the previous scheme, a synthetic file can be

    Synthetic file: (X, Y, Z̃)    (file A with Z filled in).
In a macro approach, the distinct data sources are used in order to provide an estimate of the
joint distribution function of the variables not jointly observed ($f(y,z)$ in the example, or
$f(x,y,z)$) or of some of its key characteristics, such as a correlation matrix ($\rho_{YZ}$), a
contingency table, a measure of association, etc.
The micro approach seems the most popular one. There are different reasons for this. Firstly,
in the initial statistical matching applications the problem was regarded as an imputation
problem (see e.g. Okner, 1972): the missing variables have to be imputed in one of the source
data sets, called the recipient (or host) file. Therefore, for each record of the recipient file the
missing variables are imputed using suitably chosen records from the other sample, the
donor file. Secondly, the micro approach has some nice features from a practical point of
view. It makes it possible to “create” large data sources to be used as input for microsimulations
(Cohen, 1991). In other cases, a synthetic data set is preferred simply because it is much easier to
analyze than two or more incomplete data sets. Sometimes building a synthetic data set may
be preferred because of the difficulty of managing complex models capable of handling several
categorical and continuous variables at the same time. Finally, a synthetic data set can be used by
different subjects with different information needs.
On the contrary, when the objective of the integration is simply the estimation of a
contingency table or of a correlation matrix for variables not jointly observed, the macro
approach can turn out to be much more efficient than the micro one. This approach basically
consists in obtaining direct estimates of the parameters of interest from the partially observed
data sets. However, there is no clear-cut distinction between the two approaches. In fact, in
some cases the synthetic file can be obtained as a by-product of the estimation of the joint
distribution of all the variables of interest. For instance, a parametric model is assumed, its
parameters are estimated, and these estimates are then used to derive a synthetic file, e.g. by
predicting the missing variables in one of the source data files.
A crucial element for tackling the statistical matching problem is the availability of
information, apart from that contained in the two data sets. It is possible to distinguish three
different situations:
CIA: it is believed that Y and Z follow a very simple model, the conditional independence of
Y and Z given X; this model can be easily estimated by means of the data sets at hand (note
that only some models are estimable using the data sets A and B).
Auxiliary information: external information is available, either on some parameters or
provided by a third data set where Y and Z are jointly observed; other models than the
conditional independence assumption become estimable.
Uncertainty: no assumptions are considered; in this case there is a particular form of
uncertainty which characterizes the statistical matching problem: uncertainty due to lack of
joint information on Y and Z.
In Section 2.3, the parametric, nonparametric, and mixed methods for the micro and macro
approaches are discussed under the CIA and the use of auxiliary information. Section 2.4 is
devoted to the assessment of uncertainty when no assumptions are made on the model
between Y and Z. Finally, Section 2.5 details how quality evaluations can be performed.
Bibliography
Cohen, M.L., 1991. Statistical matching and microsimulation models. Improving information
for social policy decisions: the uses of microsimulation modelling, Volume II, Technical
Papers (C.F. Citro and E.A. Hanushek (eds.)). National Academy Press, Washington, DC.
Okner, B. A., 1972. Constructing a new database from existing microdata sets: the 1966
merge files. Annals of Economic and Social Measurement, Volume 1, pp. 325-342.
2.3 Statistical matching methods
Marco Di Zio (ISTAT)
In the description of matching methods, different elements should be considered: the nature of
the method, i.e. parametric, nonparametric and mixed techniques; the main goal of the
integration task, i.e. micro and macro objective; the model assumed for the variables X, Y, Z in
the population. The first two characteristics have already been introduced in the previous
section. As far as model assumptions are concerned, it is necessary to distinguish between the
situations when the conditional independence assumption (CIA) between Y and Z given X
holds and when it cannot be assumed.
When the CIA holds, the structure of the joint (X, Y, Z) distribution is
$f(x, y, z) = f(y \mid x)\, f(z \mid x)\, f(x)$.
This assumption is especially important in the context of statistical matching, as most of the
methods usually applied, explicitly or implicitly, assume this structure. The reason is that the
data at hand are sufficient to estimate directly the quantities concerning the conditional
distributions of Y given X, of Z given X, and the marginal distribution of X.
The CIA cannot be tested from the data sets A and B. This implies that it has to be assumed on
the basis of general considerations on the phenomenon under study, or by testing it on
historical or similar data sets where all the variables are jointly observed. In order to avoid the
CIA, auxiliary information must be used.
In order to exemplify the concept of conditional independence, an illustrative example is
provided. Suppose we have observed the variables Gender (Y), Income (Z) and
Country (X) on a sample of units. In the case of independence of Gender and Income, if we are
interested in computing the probability of being a woman with an income higher than the
average income, it is sufficient to determine the proportion of women and the proportion of
people with an income higher than the average, and compute the product of the two
probabilities. If the conditioning variable X (Country) is also considered, the previous
statement should hold within each country, i.e. the same product rule applies to each and
every country, using that country's own proportions of women and of high incomes; this
represents the conditional independence of Gender and Income given Country. A
violation of conditional independence means that this independence relationship fails for
at least one country, for instance because women there systematically earn less than men, or vice versa.
In the following, a discussion of statistical matching methods characterised by these different
elements is given. The first part of the section deals with methods assuming the conditional
independence. The second part will discuss methods exploiting the use of auxiliary
information.
2.3.1. Conditional independence assumption
2.3.1.1 Parametric methods
In parametric methods, the model is determined by a finite number of parameters. The models
introduced in the literature for continuous variables are mainly based on the assumption of
multivariate normality of the data. The CIA ensures that the data are sufficient to estimate the
parameters of the model, but when integrating two data sets a crucial point is how to estimate the
parameters of the joint distribution from the information available in the two separate data
sets. In the following we consider only three variables X, Y and Z; the generalization to the
multivariate case is straightforward. The critical parameters are those involving relations
between the variables Y and Z, which are not jointly observed. In the multinormal case the
critical parameter is the covariance (or, analogously, the correlation) $\sigma_{YZ}$. Under conditional
independence this parameter is determined by other estimable parameters, because
$\sigma_{YZ} = \sigma_{XY}\,\sigma_{XZ} / \sigma_X^2$.
However, even in this case, in which the critical parameter is estimable, great care is needed
when combining estimates from different data sets. For instance, the naive approach of
estimating each parameter by its observed counterpart in the data sets, that is, for $\sigma_{XY}$ the
sample covariance $s_{XY}$ computed on file A ($s_{XY;A}$), for $\sigma_{XZ}$ the sample covariance
$s_{XZ}$ computed on file B ($s_{XZ;B}$), and for $\sigma_X^2$ the sample variance $s^2_{X;A \cup B}$, may
lead to unacceptable results, such as a covariance matrix which is not positive semi-definite. A
solution to this problem (Anderson, 1957) is the maximum likelihood approach, which leads, for
the estimation of the covariance $\sigma_{XY}$, to correcting the sample covariance by a factor given
by the regression coefficient of Y on X, i.e. $\hat\sigma_{XY} = \hat\beta_{YX}\, s^2_{X;A \cup B}$, where the regression
coefficient is estimated as $\hat\beta_{YX} = s_{XY;A} / s^2_{X;A}$. Analogous results hold for the pair Z and X.
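A minimal sketch of these computations for the trivariate normal case, assuming files A and B are available as simple arrays; means and the remaining variances, which are also needed for the full model, are omitted for brevity.

```python
import numpy as np

def cia_normal_estimates(xA, yA, xB, zB):
    """
    Estimates under the CIA for (X, Y, Z) approximately normal.
    File A observes (X, Y), file B observes (X, Z); Y and Z are never jointly observed.
    """
    xA, yA, xB, zB = (np.asarray(v, dtype=float) for v in (xA, yA, xB, zB))
    var_x = np.concatenate([xA, xB]).var()                   # s^2_{X; A union B}
    beta_yx = np.cov(xA, yA, bias=True)[0, 1] / xA.var()     # regression of Y on X (file A)
    beta_zx = np.cov(xB, zB, bias=True)[0, 1] / xB.var()     # regression of Z on X (file B)
    cov_xy = beta_yx * var_x                                  # sigma_XY adjusted to pooled X variance
    cov_xz = beta_zx * var_x
    cov_yz = cov_xy * cov_xz / var_x                          # CIA: sigma_YZ = sigma_XY sigma_XZ / sigma_X^2
    return cov_xy, cov_xz, cov_yz
```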
A discussion of the problems concerning the combination of estimates obtained from the
different data sets is in Moriarity and Scheuren (2001), Rubin (1974) and D’Orazio et al.
(2006a). Formulae to estimate the parameters coherently by means of a likelihood approach in
a general multivariate context are in D’Orazio et al. (2006a). A Bayesian approach is
described in Rässler (2002, 2003).
Similar considerations hold for categorical variables. Let $\theta_{ijk}$ denote the probability that X
assumes category i (for i = 1, ..., I), Y assumes category j (for j = 1, ..., J) and Z assumes
category k (for k = 1, ..., K). The CIA implies that the critical parameters involving Y and Z are
$\theta_{ijk} = \dfrac{\theta_{ij.}\,\theta_{i.k}}{\theta_{i..}} = \theta_{i..}\,\theta_{j|i}\,\theta_{k|i}$,
where $\theta_{j|i}$ and $\theta_{k|i}$ represent the conditional probabilities of Y given X and of Z given X
respectively, which can be estimated from the data sets A and B. Also in this case, how to
combine the estimates is an important issue. A coherent combination of the
estimates, under a multinomial model, is obtained through the maximum likelihood estimates
$\hat\theta_{i..} = \dfrac{n^A_{i..} + n^B_{i..}}{n_A + n_B}, \qquad \hat\theta_{j|i} = \dfrac{n^A_{ij.}}{n^A_{i..}}, \qquad \hat\theta_{k|i} = \dfrac{n^B_{i.k}}{n^B_{i..}},$
where $n_A$ is the size of data set A, $n^A_{i..}$ is the number of observations in A falling in
category i of X, and so on. The model is illustrated in detail in D’Orazio et al.
(2006a,b).
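A minimal sketch of the corresponding computation, assuming the two files have been summarised into an X-by-Y contingency table (from A) and an X-by-Z contingency table (from B).

```python
import numpy as np

def cia_multinomial_estimates(nA_xy, nB_xz):
    """
    nA_xy: (I, J) contingency table of X x Y counts from file A.
    nB_xz: (I, K) contingency table of X x Z counts from file B.
    Returns the (I, J, K) array of estimated probabilities theta_ijk under the CIA.
    """
    nA_xy = np.asarray(nA_xy, dtype=float)
    nB_xz = np.asarray(nB_xz, dtype=float)
    nA_i = nA_xy.sum(axis=1)                               # n^A_{i..}
    nB_i = nB_xz.sum(axis=1)                               # n^B_{i..}
    theta_i = (nA_i + nB_i) / (nA_xy.sum() + nB_xz.sum())  # marginal distribution of X
    theta_j_given_i = nA_xy / nA_i[:, None]                # n^A_{ij.} / n^A_{i..}
    theta_k_given_i = nB_xz / nB_i[:, None]                # n^B_{i.k} / n^B_{i..}
    return (theta_i[:, None, None]
            * theta_j_given_i[:, :, None]
            * theta_k_given_i[:, None, :])
```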
When the goal is the micro approach, the estimation of the model parameters is still required. In
this case, synthetic data are generally obtained by drawing values from the estimated model or
by imputing the estimated conditional expectation.
2.3.1.2 Nonparametric methods
Nonparametric methods differ from parametric ones in that the model structure is not
specified a priori. The term nonparametric is not meant to imply that such models completely
lack parameters, but that the number and nature of the parameters are flexible and not fixed in
advance. The nonparametric techniques most used in statistical matching are those belonging
to the family of hot deck methods. Hot deck methods are widely used for imputation, especially in
statistical agencies. They aim at completing one data set (say A), denoted as the recipient
file, by substituting each missing value with a value observed on a similar statistical unit
in the other data set (say B), denoted as the donor file. The way of measuring the similarity of
units characterizes the hot deck method. Random hot deck and distance hot deck are generally
the most used techniques in statistical matching.
Random hot deck consists in randomly choosing a donor record in the donor file for each
record in the recipient file. The random choice is often done within strata that are determined
through variables leading to a homogeneous set of units.
Distance hot deck is widely used in the case of continuous variables. In the simplest case of
a single variable X (the generalization to the multivariate case is straightforward), the donor for
the a-th observation in the recipient file A is chosen so that
$d_{ab^*} = \lvert x_a^A - x_{b^*}^B \rvert = \min_{1 \le b \le n_B} \lvert x_a^A - x_b^B \rvert$
(distances other than the one used in this example can be chosen).
An interesting variation of the distance hot deck is the constrained distance hot deck. In this
approach, each record in B can be chosen as donor only once. This guarantees the
preservation of the marginal distribution of the imputed variable (in this case the variable Z).
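A minimal sketch of unconstrained distance hot deck with a single continuous matching variable; the constrained variant would additionally prevent a donor from being used more than once, e.g. by solving an assignment problem rather than taking independent minima.

```python
import numpy as np

def distance_hot_deck(xA, xB, zB):
    """
    Unconstrained distance hot deck: for each recipient record a in A, take the
    z value of the donor b* in B minimising |x_a - x_b|.
    """
    xA, xB, zB = (np.asarray(v, dtype=float) for v in (xA, xB, zB))
    donor_idx = np.abs(xA[:, None] - xB[None, :]).argmin(axis=1)
    return zB[donor_idx]
```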
A discussion of the use of hot deck techniques is in D’Orazio et al. (2006a) and Singh et
al. (1993). The limits of these methods are examined in depth in Paass (1985) and Conti et al. (2006).
The nature of the previous nonparametric techniques, which are essentially imputation methods,
makes their use for the micro objective clear. As far as the macro objective is concerned, a clear
theoretical framework has not yet been developed. A first attempt is in D’Orazio et al. (2006a) and
Conti et al. (2006).
2.3.1.3 Mixed methods
This is a class of techniques that makes use of both parametric and nonparametric methods. More
precisely, a parametric model is initially adopted, and a completed synthetic data set is then
obtained by means of some hot deck procedure. The main reason for introducing this
approach is that it exploits the advantages of models (greater parsimony in estimation) while
the final data are ‘live’ data, i.e. really observed values. Despite this intuitive
justification, the theoretical properties of these methods are still to be investigated in depth.
In the following, two examples of mixed procedures are given.
For continuous variables, the general mixed procedure usually adopted is
1. Estimate (on B) the regression parameters of Z on X.
2. For each a-th observation in file A (a = 1, ..., $n_A$), generate an intermediate value
$\tilde z_a$ by using the regression function.
3. Impute the missing observations in file A by using distance hot deck methods, where
the distance is computed on the values $\tilde z_a$ and $z_b$ (a sketch of this procedure is given after the list).
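A minimal sketch of this two-step procedure for a single continuous X, following the steps above (regression of Z on X estimated on B, intermediate values for A, donors chosen by distance between the intermediate values and the observed z's).

```python
import numpy as np

def mixed_regression_hot_deck(xA, xB, zB):
    """
    Mixed (regression + distance hot deck) procedure under the CIA:
    1) estimate on B the simple regression of Z on X;
    2) compute intermediate values z~_a for the records in A;
    3) impute each record in A with the observed z of the donor in B closest to z~_a.
    """
    xA, xB, zB = (np.asarray(v, dtype=float) for v in (xA, xB, zB))
    beta1 = np.cov(xB, zB, bias=True)[0, 1] / xB.var()
    beta0 = zB.mean() - beta1 * xB.mean()
    z_tilde_A = beta0 + beta1 * xA                    # intermediate values for file A
    donor_idx = np.abs(z_tilde_A[:, None] - zB[None, :]).argmin(axis=1)
    return zB[donor_idx]                              # imputed 'live' values
```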
For categorical variables a procedure is
1. Estimate the expected cell frequencies through a log-linear model.
2. Impute through a hot deck procedure, by accepting the donor only if the frequency of
the cell the donor belongs to does not exceed the estimated expected cell frequency
(computed in the first step).
A non-exhaustive list of references, which also covers settings more complicated than the CIA,
includes Rubin (1986), Singh et al. (1993) and Moriarity and Scheuren (2001, 2003). Also in this
case, the nature of the methods strongly orients their use towards micro objective problems.
2.3.2. Auxiliary information
The CIA can easily be used in the statistical matching framework, because the corresponding
model can easily be estimated from the available data sets. However, the model that generates
the data can be quite different from the CIA, as when the conditional correlation coefficient of Y
and Z given X is different from zero. In this case the data sets at hand do not allow the model to
be estimated, and it is necessary to resort to auxiliary information to fill this gap. In this setting it
is important to take into account the characteristics of the auxiliary information. It may have the
form of (Singh et al., 1993):
1) a third file where either (X,Y,Z) or (Y,Z) are jointly observed;
2) a plausible value for the inestimable parameters of either (Y,Z|X) or (Y,Z).
Also in this framework, the considerations about the previously described characteristics
(parametric, nonparametric and mixed methods, micro and macro objectives) remain valid.
However, when using auxiliary information, a crucial point is how to include that information in
the method being used. Problems of coherence between the external auxiliary information
and the estimates obtained from the data at hand may arise.
2.3.2.1 Parametric methods
As far as continuous variables are concerned, let us assume that the data are normally distributed.
Moreover, let us assume that the inestimable parameter $\sigma_{YZ}$ is equal to a certain value $\sigma^*_{YZ}$
(obtained, for instance, from a past survey). In this setting it is not necessarily true that the
covariance matrix determined by this value together with the maximum likelihood estimates of the
other parameters is positive definite. In this case a constrained likelihood approach
should be used; further studies are needed for this situation.
The simplest case occurs when the auxiliary information concerns a partial parameter, e.g. the
partial correlation coefficient ($\rho_{YZ|X} = \rho^*_{YZ|X}$). In this setting, this value can be coherently
combined with the maximum likelihood estimates of $\hat\sigma^2_{Y|X}$ and $\hat\sigma^2_{Z|X}$ (the same obtained under
the CIA) through the formula
$\hat\sigma_{YZ|X} = \rho^*_{YZ|X}\, \sqrt{\hat\sigma^2_{Y|X}\, \hat\sigma^2_{Z|X}}$.
The critical covariance can then be estimated through the formula
$\hat\sigma_{YZ} = \hat\sigma_{YZ|X} + \hat\sigma_{YX}\, \hat\sigma_{ZX} / \hat\sigma^2_X$.
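A minimal sketch of this combination, assuming the auxiliary value of the partial correlation is given and that, for simplicity, the remaining quantities are estimated within the single files (the full maximum likelihood solution pools the information on X as in the CIA case).

```python
import numpy as np

def sigma_yz_from_partial(rho_star_yz_given_x, xA, yA, xB, zB):
    """
    Combine an external partial correlation rho*_{YZ|X} with quantities estimable
    from file A (X, Y) and file B (X, Z). Illustrative simplification only.
    """
    xA, yA, xB, zB = (np.asarray(v, dtype=float) for v in (xA, yA, xB, zB))
    var_x = np.concatenate([xA, xB]).var()
    cov_xy = np.cov(xA, yA, bias=True)[0, 1]
    cov_xz = np.cov(xB, zB, bias=True)[0, 1]
    # residual (conditional) variances of Y given X and of Z given X
    var_y_x = yA.var() - cov_xy ** 2 / xA.var()
    var_z_x = zB.var() - cov_xz ** 2 / xB.var()
    # sigma_{YZ|X} = rho*_{YZ|X} * sqrt(sigma^2_{Y|X} * sigma^2_{Z|X})
    cov_yz_x = rho_star_yz_given_x * np.sqrt(var_y_x * var_z_x)
    # sigma_{YZ} = sigma_{YZ|X} + sigma_{XY} sigma_{XZ} / sigma_X^2
    return cov_yz_x + cov_xy * cov_xz / var_x
```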
A generalized version of these algorithms can be found in D’Orazio et al. (2006a). A slightly
different version for estimating the parameters with a given value of $\rho_{YZ|X}$ is also given in
Rässler (2003).
As far as categorical variables are concerned, similar considerations hold. It is
worth remarking that in this context information on (Y,Z) alone is sufficient to determine at most
a log-linear model without the three-way interaction; for the latter, information on the triple
(X, Y, Z) is needed.
2.3.2.2 Nonparametric methods
Let us consider A as the recipient file, B as the donor file, and C as the auxiliary file. A
general algorithm for imputing missing data through hot deck is
1) the a-th observation of file A is imputed with a value $z_a^*$ taken from C, for a = 1, ..., $n_A$, by
means of a hot deck procedure. When file C contains all three variables, distances (or
imputation cells) are computed on (X,Y); when file C contains only the two variables (Y,Z),
distances (or imputation cells) are computed by considering Y only.
2) the a-th observation of file A is imputed with a value $z_a^{**}$ taken from B, by means of a hot
deck procedure in which the distance (or the imputation cells) is computed on the variables
$(x_a, z_a^*)$ in A and $(x_b, z_b)$ in B.
2.3.2.3 Mixed methods
Considerations concerning mixed methods are similar to those described under the CIA. The
main difference consists in the use of auxiliary information. It may be introduced in the
parametric estimation phase when an intermediate imputed value is computed (see previous
Section 2.3.1.3 on mixed methods). Auxiliary information can also be used as a constraint to be
fulfilled, for instance:
1. Estimate the regression parameters of Z on Y and X (through the auxiliary information).
2. For each a-th observation in file A (a = 1, ..., $n_A$), generate an intermediate value
$\tilde z_a$ by using the regression function.
3. Impute the missing observations in file A by using distance hot deck methods, where
the distance is computed on the values $\tilde z_a$ and $z_b$. A potential donor observation is
accepted as an actual donor if and only if the frequency of the cell it belongs to
(after discretizing the variables X, Y and Z) does not exceed the
frequency of the same cell in the auxiliary file C.
As far as the use of auxiliary information is concerned, parametric techniques based on
normal and multinomial data are described in D’Orazio et al. (2006a). Nonparametric
methods are mainly described in D’Orazio et al. (2006a) and Singh et al. (1993). Mixed
methods are illustrated in Kadane (1978), Moriarity and Scheuren (2001, 2003), D’Orazio et
al., (2006a), and Singh et al. (1993).
2.3.3. Partial auxiliary information
A final consideration concerns the use of partial auxiliary information. Sometimes it is
possible to make use of practical considerations leading to the construction of constraints. For
example, in the case of social surveys, logical rules based on law may be derived, e.g. it
cannot be accepted that a ten-year-old person is married. In general, to be useful, the
constraint has to refer to the variables never jointly observed. Nevertheless, in practice it is
common to have only partial information with respect to a specific combination of (Y,Z) (as
for structural zeros). For instance, if we represent the frequency distribution of a population
with respect to Age (in classes of years) and Marital Status (married/not married) in a
contingency table, the previous logical rule on age and marriage constrains some (but not
all) cells to be zero. This partial knowledge is not enough to determine a specific
distribution, but it can usefully be applied to decrease the degree of uncertainty about the
relationships concerning (Y,Z); see D’Orazio et al. (2006a, 2006b) and Vantaggi (2005). This
concept will be clarified in the next section, dedicated to the study of uncertainty.
Bibliography
Anderson, T.W., 1957. Maximum likelihood estimates of a multivariate normal distribution
when some observations are missing, Journal of the American Statistical Association, 52, pp.
200-203.
Conti P.L., Marella, D., Scanu M., 2006. Nonparametric evaluation of matching noise.
Proceedings of the IASC conference “Compstat 2006”, Roma, 28 August – 1 September
2006, Physica-Verlag/Springer, pp. 453-460.
D’Orazio, M., Di Zio, M. and Scanu, M., 2006a. Statistical matching for categorical data:
displaying uncertainty and using logical constraints. Journal of Official Statistics, Volume 22,
pp. 137-157.
D’Orazio, M., Di Zio, M. and Scanu, M., 2006b. Statistical Matching: Theory and Practice.
John Wiley, Chichester.
Kadane, J.B., 1978. Some statistical problems in merging data files. In 1978 Compendium of
Tax Research, Office of Tax Analysis, Department of the Treasury, pp.159-171. Washington,
DC: U.S. Government Printing Office. Reprinted in Journal of Official Statistics (2001),
Volume 17, pp. 423-433.
Moriarity, C. and Scheuren, F., 2001. Statistical matching: a paradigm for assessing the
uncertainty in the procedure. Journal of Official Statistics, Volume 17, pp. 407-422.
Moriarity, C. and Scheuren, F., 2003. A note on Rubin’s statistical matching using file
concatenation with adjusted weights and multiple imputations. Journal of Business and
Economic Statistics, Volume 21, pp. 65-73.
Paass, G., 1985. Statistical record linkage methodology, state of the art and future prospects.
Bulletin of the International Statistical Institute, Proceedings of the 45th Session, LI, Book 2.
Rässler, S., 2002. Statistical matching: a frequentist theory, practical applications, and
alternative Bayesian approaches. Springer-Verlag, New York.
Rässler, S., 2003. A non-iterative Bayesian approach to statistical matching. Statistica
Neerlandica, Volume 57(1), pp. 58-74.
Renssen, R.H., 1998. Use of statistical matching techniques in calibration estimation. Survey
Methodology, Volume 24(2), pp. 171-183.
Rubin, D.B., 1976. Inference and missing data. Biometrika, Volume 63, pp. 581-592.
Rubin, D.B., 1986. Statistical matching using file concatenation with adjusted weights and
multiple imputations. Journal of Business and Economic Statistics, Volume 4, pp. 87-94.
Singh, A.C., Mantel, H., Kinack, M. and Rowe, G., 1993. Statistical matching: use of
auxiliary information as an alternative to the conditional independence assumption. Survey
Methodology, 19, pp. 59-79.
Vantaggi, B., 2005. The role of coherence for the integration of different sources. Proceedings
of the 4th International Symposium on Imprecise Probabilities and Their Applications (F.G.
Cozman, R. Nau and T. Seidenfeld (eds.)), pp. 269-278.
2.4. Uncertainty in statistical matching
Mauro Scanu (ISTAT)
Statistical matching is essentially a problem characterized by uncertainty. The available
information (i.e. the two samples A and B) is not enough to estimate the joint distribution of
X, Y and Z, unless untestable assumptions such as the CIA are believed to hold, or external
auxiliary information is at hand.
2.4.1. Definition of uncertainty in the statistical matching context
When a practitioner avoids the use of untestable assumptions (such as the CIA) and
auxiliary information is not available, the statistical matching problem consists of evaluating
uncertainty for the joint distribution of X, Y and Z, understanding whether a unique solution
can be identified, and reducing uncertainty if possible. In order to define what uncertainty
means in statistical matching, it is necessary to introduce the notion of estimable and
inestimable parameters. In the case of samples A and B, the parameters of the marginal
distribution of X, and of the conditional distributions of Y|X and Z|X, are estimable: it is
possible to use maximum likelihood methods or appropriate unbiased and efficient estimators
on the observed data. On the contrary, there are no data for estimating any parameter of the
distribution of (Y,Z)|X. The latter parameters are inestimable from the samples A and B.
Uncertainty is defined as the set of values that the inestimable parameters can assume given
the estimates of the estimable parameters. In order to make this concept clear, consider
the following simplified example. Let X, Y and Z be normally distributed, and suppose that Z
consists of two normal variables, Z1 and Z2. In this situation almost all the
parameters of the joint distribution of X, Y, Z1 and Z2 are estimable: all the means, variances
and correlation coefficients for all pairs of variables except (Y, Z1) and (Y, Z2). The
following figure shows the uncertainty space for the two correlation coefficients when the
estimable parameters are set to specific values (similar results are in D’Orazio et al., 2006,
p. 111, and Rässler, 2004).
Figure 1 – Uncertainty space for $\rho_{YZ_1}$ and $\rho_{YZ_2}$ when $\rho_{XY} = 0.9$, $\rho_{XZ_1} = 0.3$, $\rho_{XZ_2} = 0.5$
and $\rho_{Z_1 Z_2} = 0.4$.
In this example, the uncertainty space is an ellipse. The pair of inestimable parameters may
assume any of the values in the ellipse: in other words, $\rho_{YZ_1}$ may assume values between
-0.14 and 0.68, while $\rho_{YZ_2}$ may assume values between 0.07 and 0.82. The values of the two
inestimable parameters $\rho_{YZ_1}$ and $\rho_{YZ_2}$ under the CIA correspond to the intersection of the
two ellipse axes.
2.4.2. Estimation of the uncertainty space
A seminal paper on uncertainty in statistical matching is Kadane (1978). That paper
investigates the case in which (X,Y,Z) are normally distributed. Exploration of the uncertainty
space for the inestimable parameters is performed according to the following procedure:
a. Estimable parameters are estimated consistently (so that for large sample sizes the estimates
approximately coincide with the parameter values).
b. Inestimable parameters may assume all the values that are compatible with the
estimated ones. In the normal case this concerns only the parameters of the covariance
matrix $\Sigma$ (or, equivalently, of the correlation matrix), which should be positive
semi-definite. Hence the inestimable parameter $\sigma_{YZ}$ may assume an interval of
values, defined by the inequality $\lvert\Sigma\rvert \ge 0$ (where $\lvert\Sigma\rvert$ is the determinant of the matrix $\Sigma$); a
sketch of the resulting interval is given below.
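A minimal sketch of the resulting interval in the simplest trivariate case, where the bound follows from requiring the determinant of the 3x3 correlation matrix of (X, Y, Z) to be non-negative.

```python
import numpy as np

def rho_yz_bounds(rho_xy, rho_xz):
    """
    Admissible interval for the inestimable correlation rho_YZ, obtained by
    requiring the 3x3 correlation matrix of (X, Y, Z) to be positive
    semi-definite (|Sigma| >= 0), in the spirit of Kadane (1978).
    """
    half_width = np.sqrt((1.0 - rho_xy ** 2) * (1.0 - rho_xz ** 2))
    centre = rho_xy * rho_xz        # this is also the value implied by the CIA
    return centre - half_width, centre + half_width

# Example with the values used in Figure 1 (ignoring the additional constraint
# induced by rho_Z1Z2): rho_yz_bounds(0.9, 0.3) gives roughly (-0.15, 0.69).
```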
This approach was followed by Moriarity and Scheuren (2001, 2003, 2004), who extend it to the
case in which X, Y and Z are multivariate. They also show the performance of different
imputation procedures in this context.
A different approach to evaluating uncertainty in statistical matching was followed by Rubin
(1986) and Rässler (2002). In this case the statistical matching problem is explicitly seen as a
missing data problem, and missing data are imputed by multiple imputation. Both
authors still restrict their analysis to the case of normally distributed data, so that uncertainty
reduces to the evaluation of the different values that the correlation coefficient of (Y,Z) can
assume. The method proposed by Rubin, named RIEPS by Rässler, fixes a value for $\rho_{YZ|X}$
from a grid of possible values between -1 and 1, and then reconstructs $\rho_{YZ}$ piecewise. It can be
proved that this is an improper multiple imputation approach. Rässler (2002) defines a proper
multiple imputation approach, called NIBAS (non-iterative Bayesian approach to statistical
matching). The missing Z in A and the missing Y in B are imputed m times, drawing imputations
from the predictive distribution. The resulting complete data sets are used to estimate $\rho_{YZ}$ m
times, and these m estimates describe how uncertain the inestimable parameter is. Rässler extends
this approach to the case in which X, Y and Z are multivariate, and suggests some measures for
evaluating how uncertain the inestimable parameters are. Rässler (2002) also includes S-Plus
code for the application of the NIBAS and RIEPS algorithms.
D’Orazio et al. (2006a, 2006b) study parameter uncertainty in statistical matching by means of
maximum likelihood. They consider the case of normal variables, as in the previous
references, but they also tackle the case in which X, Y and Z are categorical. In these papers the
possible reduction of uncertainty is also considered. This aspect may be studied by introducing
constraints on the parameter space; such constraints are commonly available in official
statistics when structural zeros (used for edit rules) are present. Software code in R is
described in D’Orazio et al. (2006a).
Vantaggi (2005) introduces a very promising procedure for evaluating the uncertainty
space for categorical variables and for introducing parameter constraints that reduce
uncertainty. The procedures discussed in this work are based on de Finetti's coherence
approach.
Bibliography
D’Orazio, M., Di Zio, M. and Scanu, M., 2006a. Statistical matching: theory and practice.
John Wiley, Chichester.
D’Orazio, M., Di Zio, M. and Scanu, M., 2006b. Statistical matching for categorical data:
displaying uncertainty and using logical constraints. Journal of Official Statistics, Volume 22,
pp. 137-157.
Kadane, J.B., 1978. Some statistical problems in merging data files. Department of Treasury,
Compendium of Tax Research, pp. 159-179. US Government Printing Office, Washington
DC. Reprinted in 2001, Journal of Official Statistics, Volume 17, pp. 423-433.
Moriarity, C. and Scheuren, F., 2001. Statistical matching: a paradigm for assessing the
uncertainty in the procedure. Journal of Official Statistics, Volume 17, pp. 407-422.
Moriarity, C. and Scheuren, F, 2003. A note on Rubin’s statistical matching using file
concatenation with adjusted weights and multiple imputation. Journal of Business and
Economic Statistics, Volume 21, pp. 65-73.
Moriarity, C. and Scheuren, F, 2004. Regression based matching: recent developments.
Proceedings of the Section on Survey Research Methods, American Statistical Association.
Rässler, S., 2002. Statistical matching: a frequentist theory, practical applications, and
alternative Bayesian approaches. Springer-Verlag, New York.
Rässler, S., 2003. A non-iterative Bayesian approach to statistical matching. Statistica
Neerlandica, Volume 57(1), pp. 58-74.
Rässler, S., 2004. Data fusion: identification problems, validity, and multiple imputation.
Austrian Journal of Statistics, Volume 33 (1-2), pp. 153-171.
Rubin, D.B., 1986. Statistical matching using file concatenation with adjusted weights and
multiple imputations. Journal of Business and Economic Statistics, Volume 4, pp. 87-94.
Vantaggi, B., 2005. The role of coherence for the integration of different sources. Proceedings
of the 4th International Symposium on Imprecise Probabilities and Their Applications (F.G.
Cozman, R. Nau and T. Seidenfeld (eds.)), pp. 269-278.
2.5. Evaluation of the accuracy of statistical matching
Marcello D’Orazio (ISTAT)
Evaluating the accuracy of the results obtained by applying a statistical matching technique is not
a simple task. Formally, it should consist of the estimation of the mean square error
($MSE = \mathrm{Var} + \mathrm{Bias}^2$) of the statistic used to estimate an unknown characteristic of the target
population.
Unfortunately, in statistical matching applications the accuracy of the final results depends on
the amount of errors (observation and non-observation errors) in the source data sets A and B
and on the properties of the chosen statistical matching method. If it is assumed that the
amount of errors in the source data sets is low or “under control” (few nonsampling errors and a
given sampling error), the accuracy of the final results will mainly depend on the capability of
the chosen statistical matching method to provide parameter estimates close to the true
unknown parameter values, in the case of statistical matching macro methods, or a complete
synthetic data set that can be considered as a random sample drawn from the true unknown
population, in the case of statistical matching micro methods.
At first, evaluation studies were carried out in order to assess the results of statistical
matching micro methods based on distance hot deck. Barr and Turner (1981) propose the use
of simple measures of agreement between the records joined from the two source data sets.
Furthermore, they suggest verifying whether the synthetic complete data set maintains the
structure of the original source data sets in terms of summary measures (totals, group totals,
averages, ...). Barr and Turner (1990) suggest comparing the relation between X and Z in both
the synthetic and the donor data sets, as well as some summary characteristics (mean,
variance) of the set of variables Z.
A similar suggestion is made in Rodgers (1984), although emphasis is placed on the
univariate and joint distributions of the variables Z in the synthetic data set and in the source
data file B, respectively. At a second level, the comparison should be extended to the
relationship between X and Z.
Cohen (1991) recommends carrying out a “rough” sensitivity analysis on the failure of the
conditional independence assumption (CIA) when using distance hot deck statistical matching
methods. In that paper, it is explained how to perform a sensitivity analysis when
unconstrained distance hot deck methods are applied.
Rässler (2002) proposes a framework for evaluating the “validity” of a statistical matching
procedure. The term validity is chosen to stress that evaluation should go beyond
efficiency (in terms of MSE). The framework consists of four levels of
evaluation: (1) reproduction of the values of the unknown variables Z in the recipient file; (2)
how the joint distribution of the variables X, Y and Z is reflected in the synthetic data-set; (3)
preservation of the correlation structure and of the higher moments for the joint distribution X,
Y and Z and for the marginal distributions of the X-Y and X-Z; and (4) preservation of, at least,
the marginal distribution of Z and of X-Z in the fused data file.
Obviously, level (1) can be assessed only by means of simulation studies. Levels (2) and (3)
can be assessed by means of simulation studies or by referring to external information
(distributions estimated in other studies,…). Level (4) can be easily assessed in all the
statistical matching applications by using chi-square tests or similar measures such as the
index of dissimilarity based on the absolute differences among relative cell frequencies.
Rässler (2002) suggests using the correlation coefficient of cell frequencies to avoid the
problem of zero cell frequencies in the denominator of the chi-square statistics.
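A minimal sketch of a level (4) check, computing the index of dissimilarity between the distribution of Z in the fused file and in the donor file from the two vectors of cell counts; the normalisation to the 0-1 range is a common convention.

```python
import numpy as np

def dissimilarity_index(counts_fused, counts_donor):
    """
    Index of dissimilarity between two observed distributions of Z: half the sum
    of absolute differences between relative cell frequencies; 0 = identical
    distributions, 1 = completely disjoint distributions.
    """
    p = np.asarray(counts_fused, dtype=float)
    q = np.asarray(counts_donor, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * np.abs(p - q).sum()
```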
As far as simulations are concerned, Paass (1986) suggests a procedure also known as
the folded database procedure. The variables of one of the original data sets (usually the
largest one) are partitioned into three distinct sets, say $G_1$, $G_2$ and $G_3$, playing the roles of X, Y
and Z respectively. The chosen database is randomly partitioned into two sub-samples A and B;
then $G_3$ is deleted from A and $G_2$ is deleted from B, reproducing the typical statistical matching
set-up. These two sub-samples are matched according to the chosen procedure: the result is the
folded database. The estimates obtained from the folded database are compared with those
derived from the original one so as to compute measures of accuracy. Unfortunately, this
evaluation method relies heavily on the assumption that the variables $G_1$, $G_2$ and $G_3$ behave
approximately like the target variables X, Y and Z.
Using this procedure, Paass assessed the accuracy of different statistical matching procedures
in terms of the reconstruction of the Z, X-Z and Y-Z distributions. Evaluations are performed by
means of distribution-free multivariate two-sample tests (Wald-Wolfowitz and Smirnov tests),
chi-square tests and analysis of variance.
Conti et al. (2006) investigate another measure of the performance of micro statistical matching
approaches: the matching noise. The matching noise measures the difference between the
genuine and the imputed data generation processes. If this difference is large, the imputed
data set is not appropriate for estimating relationship parameters between the matching and
imputed variables. In some cases this difference can be computed explicitly. For instance, it
can be seen that distance hot deck methods behave remarkably well when there is a linear
relationship between the matching and the imputed variables, although this performance
deteriorates for non-linear regression functions.
Bibliography
Barr, R.S. and Turner, J.S., 1981. Microdata file merging through large-scale network
technology. Mathematical Programming Study, Volume 15, pp. 1-22.
Barr, R.S. and Turner, J.S., 1990. Quality issues and evidence in statistical file merging. Data
Quality Control: Theory and Pragmatics (G.E. Liepins and V.R.R. Uppuluri (eds.)), pp. 245-313.
Marcel Dekker, New York.
Conti P.L., Marella D., Scanu M, 2006. Nonparametric evaluation of matching noise.
Proceedings of the IASC conference “Compstat 2006”, Roma, 28 August – 1 September
2006, Physica-Verlag/Springer, pp. 453-460
Paass, G., 1986. Statistical match: Evaluation of existing procedures and improvements by
using additional information. Micro analytic simulation models to support social and financial
policy (Orcutt, G.H., Merz, J. and Quinke, H. (eds.)), Elsevier Science, Amsterdam.
Rässler, S., 2002. Statistical matching: a frequentist theory, practical applications, and
alternative Bayesian approaches. Springer-Verlag, New York.
Rodgers, W.L., 1984. An evaluation of statistical matching. Journal of Business and
Economic Statistics, Volume 2, pp. 91-102.
3. Literature review on micro integration processing
3.1. Micro integration processing
Eric Schulte Nordholt (CBS)
Micro integration processing consists of putting in place all the actions needed to ensure better
quality of the matched results, in terms of both the quality and the timeliness of the matched
files. It includes defining checks, editing procedures and imputation procedures to obtain better
estimates, etc. It should be kept in mind that some sources are more reliable than
others. Some sources have better coverage than others, and there may even be conflicting
information between sources. It is therefore important to recognize the strong and weak points of
all the data sources used.
Since there are differences between sources, a micro integration process is needed to check
data and adjust incorrect data. It is believed that integrated data will provide far more reliable
results, because they are based on an optimal amount of information. Also the coverage of
(sub) populations will be better, because when data are missing in one source, another source
can be used. Another advantage of integration is that users of statistical information will get
one figure on each social phenomenon, instead of a confusing number of different figures
depending on which source has been used.
3.2. Combining data sources: micro linkage and micro integration
Eric Schulte Nordholt and Frank Linder (CBS)
3.2.1. Micro linkage
Most of the present administrative registers in the Netherlands are provided with a unique
linkage key: the so-called social security and fiscal number (SoFi-number), a personal
identifier for every (registered) Dutch inhabitant and for those abroad who receive an income
from the Netherlands and have to pay tax on it to the Dutch fiscal authorities.
To prevent misuse of the SoFi-number, Statistics Netherlands recodes it for statistical
processing into a so-called Record Identification Number (RIN-person). Personal identifiers,
such as date of birth and address, are replaced by age at the reference date and RIN-address.
This is all done in accordance with regulations of the Dutch Data Protection Authority to
protect the privacy of the citizens.
Since the SoFi-number is in use by social security administrations and tax authorities, one
may expect it to be of excellent quality. A limited number of SoFi-numbers may be registered
with incorrect values in the data files, in which case linkage with other files will fail.
However, in general, the percentage of matches is close to one hundred percent. Abuse of
SoFi-numbers, for example by illegal workers, may occur in some cases, which results in a
false match. Sometimes there are indications of a mismatch. An example of this is when the
jobs register and the central Population Register (PR) are linked and the worker turns out to
be an infant. Another example is when the FiBase (fiscal administration) shows an unusually
high income for a worker, when it is in fact the sum of the incomes of all people using the
same SoFi-number.
All social statistics data files can be linked to the PR. In practice this means that these data
files are all indirectly linked to each other via the PR. Therefore the PR can be considered the
backbone of the set of social data sources. When linking the PR and the jobs register, or the
PR and a register of social benefits, the linkage is between different statistical units (persons,
jobs, benefits). In that case multiple linkage relationships can exist, because someone can have
more than one job or can receive several social benefits.
In household sample surveys, like the Labour Force Survey (LFS), records do not have a
SoFi-number. For those surveys an alternative linkage key is used, which is often built up by
a combination of the following personal identifiers:
- sex;
- date of birth;
- address (in practice, the combination of the postal code, which mostly relates to the street, and the house number is used as a substitute for the address; the Dutch postal code consists of four digits followed by two letters).
This sort of linkage key will usually be successful in distinguishing people. However, it is not
a hundred percent unique combination of identifiers. Linking may result in a mismatch in the
case of twins of the same sex. False matches may also occur when part of the date of birth or
the postal code and house number is unknown or wrong. Another drawback is that the linkage
key is not person but address related, which may cause linkage problems if someone has
recently moved. When linking the PR and the LFS with this alternative key, and tolerating a
variation between sources in at most one of the variables sex, year of birth, month of
birth or day of birth, the result is that close to a hundred percent of the LFS records will be
linked.
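As an illustration only, the following minimal sketch (with hypothetical field names; it is not the CBS production procedure) links two records when the address part of the key agrees exactly and at most one of the person-related key variables differs.

```python
# Illustrative sketch of the tolerant key comparison described above.
# Field names are hypothetical; the real linkage strategy is more elaborate.

def links(lfs_rec: dict, pr_rec: dict, max_deviations: int = 1) -> bool:
    """Return True if the two records are considered the same person."""
    # The address part of the key (postal code + house number) must agree exactly.
    if (lfs_rec["postal_code"], lfs_rec["house_number"]) != \
       (pr_rec["postal_code"], pr_rec["house_number"]):
        return False
    # Among sex, year, month and day of birth, at most `max_deviations` may differ.
    person_fields = ["sex", "birth_year", "birth_month", "birth_day"]
    deviations = sum(lfs_rec[f] != pr_rec[f] for f in person_fields)
    return deviations <= max_deviations


# Example: same address, a deviation only in the day of birth -> still linked.
lfs = dict(sex="F", birth_year=1970, birth_month=5, birth_day=12,
           postal_code="1234AB", house_number="7")
pr = dict(sex="F", birth_year=1970, birth_month=5, birth_day=21,
          postal_code="1234AB", house_number="7")
print(links(lfs, pr))  # True
```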
In its linkage strategy, Statistics Netherlands tries to maximize the number of matches and to
minimize the number of mismatches. So, in order to achieve a higher linkage rate, more
efforts are made to link the remaining unlinked records by means of different variants of the
linkage key, for example by leaving out the house number and tolerating variations in the
numeric characters of the postal code. To keep the probability of a mismatch as small as
possible, some 'safety' devices are built into the linkage process. This last linking attempt
yields an extra one percent of matches.
In the end about two to three percent of the LFS records could not be linked to the PR.
Altogether this is a good result, but selectivity in the micro linkage process cannot be ruled out.
If the unlinked records belong to a selective subpopulation, then estimates based on the linked
records may be biased, because they do not represent the total population. Analysis in the past
has indicated that young people, in the 15-24 age bracket, show a lower linkage rate in
household sample surveys than other age groups. The reason for this is that they move more
frequently and are therefore often registered at the wrong address. The linking rate for
persons living in the four large cities Amsterdam, Rotterdam, The Hague and Utrecht is lower
than for persons living elsewhere. Ethnic minorities also have a lower linkage probability,
among other things because their date of birth is often less well registered (Arts et al., 2000).
Nowadays, the PR serves as the sampling frame for the LFS. Therefore, the matching rate is
almost a hundred percent, and linkage selectivity problems no longer occur.
3.2.2. Micro integration
Successfully linking the PR with all the other data sources mentioned makes much more
coherent information on the various demographic and socio-economic aspects of each
individual's life available. One has to keep in mind, however, that some sources are more
reliable than others. Some sources have a better coverage than others, and there may even be
conflicting information between sources. So, it is important to recognize the strong and weak
points of all the data sources used.
Since there are differences between sources, we need a micro integration process to check
data and adjust incorrect data. It is believed that integrated data will provide far more reliable
results, because they are based on an optimal amount of information. Also the coverage of
(sub) populations will be better because when data are missing in one source we can use
another source. Another advantage of integration is that users of statistical information will
get one figure on each social phenomenon, instead of a confusing number of different figures
depending on what source has been used.
During the micro integration of the data sources the following steps have to be taken (Van der
Laan, 2000):
a. harmonisation of statistical units;
b. harmonisation of reference periods;
c. completion of populations (coverage);
d. harmonisation of variables, in case of differences in definition;
e. harmonisation of classifications;
f. adjustment for measurement errors, when corresponding variables still do not have the same
value after harmonisation for differences in definitions;
g. imputations in the case of item nonresponse;
h. derivation of (new) variables; creation of variables out of different data sources;
i. checks for overall consistency.
All steps are controlled by a set of integration rules and are fully automated.
The following example shows how micro integration works when data from the
jobs register are confronted with data from the register of benefits. Both jobs and benefits are
registered on a volume basis, which means that information on their state is stored for any
moment in the year instead of for one reference day. Analysts of the jobs register know that the
commencing date and the termination date of a job are not registered very accurately. It is
important though to know whether or not there is a job at the reference date, in other words
whether or not the person is an employee. With the help of the register of benefits it is
sometimes possible to define the job period more accurately.
Suppose that someone becomes unemployed at the end of November and gets unemployment
benefits from the beginning of December. The jobs register may indicate that this person has
lost the job at the end of the year, perhaps due to administrative delay or because of payments
after job termination. The registration of benefits is believed to be more accurate. When
confronting these facts the 'integrator' could decide to change the date of termination of the
job to the end of November, because it is unlikely that the person simultaneously had a job
and benefits in December. Such decisions are made with the utmost care. If there are
convincing counter-indications from other jobs register variables, indicating that the job still
existed in December, the termination date will in general not be adjusted.
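A minimal sketch of such a rule, assuming hypothetical field names and dates and leaving out the many safeguards applied in practice, could look as follows.

```python
# Illustrative sketch of the integration rule described above (not the CBS rules).
from datetime import date, timedelta

def adjust_job_end(job_end: date, benefit_start: date,
                   counter_indication: bool) -> date:
    """Move the job termination date to just before the benefit starts,
    unless other jobs register variables indicate the job really continued."""
    if counter_indication:
        return job_end                              # trust the jobs register
    if benefit_start <= job_end:
        return benefit_start - timedelta(days=1)    # benefits register deemed more accurate
    return job_end

# Job registered until 31 December, unemployment benefit from 1 December:
print(adjust_job_end(date(2004, 12, 31), date(2004, 12, 1), counter_indication=False))
# -> 2004-11-30
```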
3.2.3. The Social Statistical Database (SSD)
The micro linkage and micro integration process of all the available data sources results in the
end in the Social Statistical Database (SSD), a whole set of integrated microdata files in their
definitive stage. The SSD contains coherent and detailed demographic and socio-economic
statistical information on persons, households, jobs and (social) benefits. A major part of the
statistical information is available on a volume basis. An extensive discussion of the SSD can
be found in Arts and Hoogteijling (2002).
In trying to imagine what the SSD looks like, one should not think of a large-scale file with
millions of records and thousands of variables. It would be very inefficient to store the
integrated data as such. Furthermore, the issue of data protection prevents Statistics
Netherlands from keeping so much information together. Instead, all the integrated files in
their final stage are kept separately. There is just one combining element, the linkage key
RIN-person, which is present in every integrated file. So, whenever users demand a selection of
variables out of the SSD set, only the files with the variables demanded will be supplied.
These can easily be extracted from the set and linked by means of the linkage key.
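The following minimal sketch (file and column names are invented) illustrates how two separately stored component files could be combined on demand via the common RIN-person key.

```python
# Sketch of an on-demand combination of SSD component files via RIN-person.
# The data frames and columns are hypothetical, not the actual SSD layout.
import pandas as pd

demography = pd.DataFrame({"rin_person": [1, 2, 3],
                           "age": [34, 51, 27],
                           "sex": ["M", "F", "F"]})
jobs = pd.DataFrame({"rin_person": [1, 1, 3],          # a person can hold several jobs
                     "job_id": [101, 102, 103],
                     "yearly_wage": [30000, 5000, 42000]})

# Only the files containing the requested variables are extracted and linked.
requested = demography.merge(jobs, on="rin_person", how="left")
print(requested)
```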
Bibliography
Arts, C.H. and E.M.J. Hoogteijling, 2002. The Social Statistical Database of 1998 and 1999.
Monthly Bulletin of Socio-economic Statistics, Vol. 2002/12 (December 2002), pp. 13-21 [in
Dutch].
Laan, P. van der, 2000. Integrating Administrative Registers and Household Surveys.
Netherlands Official Statistics, Vol. 15 (Summer 2000): Special Issue, Integrating
Administrative Registers and Household Surveys, eds. P.G. Al and B.F.M. Bakker, pp. 7-15.
Schulte Nordholt, E. and F.S. Linder, 2007. Record matching for Census purposes. In:
Statistical Journal of the IAOS, 24, 2007, pp. 163-171.
3.3. Key reference on micro integration
Miguel Guigo (INE), Paul Knottnerus (CBS) and Eric Schulte Nordholt (CBS)
National and international programs have been run to check the quality of the output.
Regardless of the actual purpose of the micro integration procedure, whether it uses model-based, donor-based, simulation-based or repeated-weighting methods, an assessment of the
source data as well as measures of the reliability of the estimators must be taken into account.
Several simulation studies show that the method of repeated weighting leads to estimates with
lower variances than usual estimation methods, due to a better use of auxiliary information.
An open issue remains on how small areas can be estimated when no register information is
available. Small area estimation, that is to say, obtaining valid and efficient
estimates of population parameters for sub-national domains, whether geographically defined
domains or categories of classifications at a very disaggregated level, is a task that can be
performed by using administrative registers (as long as the variables from the registers are
correlated at the micro level with those to be estimated). For efficient small area estimation, records
from an administrative source can play the role of auxiliary information.
When no such external information is available, estimation of parameters in those small
domains could be performed by means of some modern estimation techniques, but then the
open issue is how to keep consistency of the set of tables.
Schulte Nordholt, E., 2005. The Dutch virtual Census 2001: A new approach by combining
different sources. Statistical Journal of the United Nations Economic Commission for Europe,
Volume 22, Number 1, 2005, pp. 25-37.
Abstract
Data from many different sources were combined to produce the Dutch Census tables of
2001. Since the last Census based on a complete enumeration was held in 1971, the
willingness of the population to participate has fallen sharply. Statistics Netherlands found an
alternative in the Virtual Census, using available registers and surveys. The table results are
not only comparable with the earlier Dutch Censuses but also with those of the other countries
in the 2001 Census Round.
For the 2001 Census, more detailed information is required than was the case for earlier
Census Rounds. The acquired experience in dealing with data of various administrative
registers for statistical use enabled Statistics Netherlands to develop a Social Statistical
Database (SSD), which contains coherent and detailed demographic and socio-economic
statistical information on persons and households. The Population Register forms the
backbone of the SSD. Sample surveys are still needed for information that is not available
from registers.
To achieve overall numerical consistency across the Census tables set of 2001, the
methodologists at Statistics Netherlands developed a new estimation method that ensures
numerically consistent table sets if the data are obtained from different data sources. The
method is called repeated weighting, and is based on the repeated application of the regression
method to eliminate numerical inconsistencies among table estimates from different sources.
Key words
Census, consistent table estimates, repeated weighting
3.3.1. Definition of the problem
In this publication it is explained how the method of repeated weighting has been used to
produce consistent table estimates using available registers and surveys only.
Although statistical offices belonging to the ESS are obliged neither to conduct a Census every
ten years nor to supply census data, Eurostat has provided general advice which, in the form
of a gentlemen's agreement, constitutes the framework for compiling co-ordinated and harmonised data
for European countries. Moreover, the general trend is to move from voluntary to binding
agreements for the provision of data on population enumeration. The European Parliament will
discuss the new regulation on European Population and Housing Censuses in December 2007.
Producing Census tables from population data already available in
administrative sources is therefore a valuable option to be considered for reducing the costs of data
collection as well as for coping with non-response and low participation. On these
particular issues, Statistics Netherlands (Schulte Nordholt, 2005 and Corbey, 1994) has
documented its own experience, since the last traditional Census in the
Netherlands, in 1971, met with many privacy objections against the collection of integral
information about the population living in the Netherlands.
Regardless of issues related to cost reduction opportunities, the choice of a virtual Census,
that is to say, producing Census data based on administrative registers, can result from
a) unit non-response: a certain part of the population will not participate in a traditional
Census survey;
b) item non-response: even the part of the population that does participate will not answer
some questions;
c) the traditional correction methods for these problems fall short of what is needed to be able
to publish reliable results, that is to say, some of the cells in the set of tables cannot be
disseminated.
When using techniques such as massive imputation, there are not enough degrees of freedom to get
a sufficiently rich imputation model (Kroese and Renssen, 2000). Therefore, an alternative to
usual weighting and imputation procedures must be developed to be able to produce a
consistent set of tables using available registers and surveys only.
3.3.2. Phases characterising the problem
a. Study the required output and the available data sources: in order to produce a set of
tables concerning housing, commuting, demography, occupation, level of education and
economic activity, a complete statistical database must be built. In the case of Statistics
Netherlands, this role is played by the Social Statistical Database (SSD), whose microdata are
obtained through the integration of the central Population Register and several surveys such as the
Labour Force Survey, the Employment and Earnings Survey, and the Survey on Housing
Conditions. Schulte Nordholt, Hartgers and Gircour (2004) give details on the overall
procedure for the Dutch Virtual Census of 2001.
b. Define and apply the estimation strategy: once an appropriate statistical database made up
of several sources is available, the so-obtained set must be treated in order to
reconstruct missing values for a record or an item (total or partial non-response). Little and
Rubin (1987) introduce some overall methods of imputation. When the lack of information
affects the whole unit or record, macro integration procedures are adequate to meet data
needs.
Denk and Hackl (2003, 2004) propose some macro integration methods in the context of a
comprehensive database of enterprise data for taxation microsimulations, in order to get
micro-founded indicators. The model-based approach estimates the probability of the
occurrence of the observed values of a variable, and missing data are then randomly imputed; the
estimated probability distribution can be obtained by means of, e.g., auto-regressive models
built from previously available data in former surveys. Donor-based approaches (e.g. hot-deck,
nearest-neighbour) look for a record that is similar to the incomplete record (the so-called donor);
similarity between the two is determined via matching variables. Finally, the
simulation-based approach estimates more than a single value for each missing item in a
record.
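As an illustration of the donor-based approach, the following sketch implements a simple nearest-neighbour hot-deck imputation; the variables and the distance function are hypothetical and the procedures used in practice are considerably more refined.

```python
# Illustrative nearest-neighbour hot-deck imputation: the missing value of a
# record is copied from the complete record closest on the matching variables.
import numpy as np

def nn_hotdeck(match_vars: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Impute NaNs in `target` from the nearest complete record on `match_vars`."""
    target = target.copy()
    donors = ~np.isnan(target)
    for i in np.where(np.isnan(target))[0]:
        dist = np.linalg.norm(match_vars[donors] - match_vars[i], axis=1)
        donor_idx = np.where(donors)[0][np.argmin(dist)]   # the so-called donor
        target[i] = target[donor_idx]
    return target

X = np.array([[25, 1.0], [27, 1.2], [60, 3.0], [58, 2.8]])   # matching variables
y = np.array([20000.0, np.nan, 55000.0, np.nan])             # variable with item non-response
print(nn_hotdeck(X, y))   # [20000. 20000. 55000. 55000.]
```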
Houbiers et al. (2003) and Houbiers (2004) provide an alternative procedure called repeated
weighting (RW), whose aim is to remove numerical inconsistencies among table estimates
from different sources. It is based on the repeated application of the regression estimator and
generates a new set of weights for each table that is estimated. Let y be a variable of which the
population parameter -either total or average- ought to be obtained for a table through a set of
explanatory variables x from a register. The linear regression estimator of the population
average for y is obtained through
1
YˆREG  y s  bs X p  x s  ;
bs   X ' s X s  X ' s y s ,
WP1
44
where X p , xs , Y p and y s are, respectively, the population and sample averages of x and y, and
bs the estimated linear regression coefficients. Instead of these traditional regression
estimators, the repeated weighting procedure uses a set of coefficients in the form
1
bw  Z ' s Ws Z s  Z ' s Ws y s ,
where Zs is the matrix of sample observations on the variables in the margins of the table with
variable y. The averages of the marginal variables z have been estimated already in an earlier
table or are known from a register. Denoting these estimates or register counts by Zˆ RW , the
repeated weighting estimator of Y is defined by
YˆRW  YˆREG  bw  Zˆ RW  Zˆ REG  .


It turns out that the weights of the records in the microdata are adapted in such a way that the
new table estimate is consistent with all earlier table estimates.
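The following toy numerical sketch (invented data, working with totals rather than averages, and not the CBS implementation) illustrates the mechanics of this adjustment: the repeated weighting estimate, and the correspondingly adapted weights, reproduce the margins already fixed by earlier tables or register counts.

```python
# Toy illustration of the repeated weighting adjustment on invented data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
w = np.full(n, 5.0)                                  # starting (regression) weights
z = rng.integers(0, 2, size=(n, 2)).astype(float)    # sample values of two marginal variables
y = 10 + 3 * z[:, 0] - 2 * z[:, 1] + rng.normal(size=n)

W = np.diag(w)
Y_reg = w @ y                                        # current estimate of the total of y
Z_reg = w @ z                                        # current estimates of the marginal totals
Z_rw = np.array([520.0, 480.0])                      # margins fixed by earlier tables / registers

b_w = np.linalg.solve(z.T @ W @ z, z.T @ W @ y)      # b_w = (Z's W_s Z_s)^-1 Z's W_s y_s
Y_rw = Y_reg + b_w @ (Z_rw - Z_reg)                  # repeated weighting estimate

# Equivalently, the record weights are adapted so that the fixed margins are
# reproduced exactly while the estimate for y becomes Y_rw.
w_new = w * (1 + z @ np.linalg.solve(z.T @ W @ z, Z_rw - Z_reg))
print(np.allclose(w_new @ z, Z_rw), np.isclose(w_new @ y, Y_rw))   # True True
```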
c. Analyse the results: for a detailed case study, Schulte Nordholt (2005) gives some key
results of the 2001 Census in the Netherlands, providing tables on
population by sex, type of household and age group; population by economic activity and sex;
employees by working hours and sex; working population by occupation and sex; and
population by level of education and age group, together with a comparison with former
censuses. A deeper analysis of these results, especially focusing on historical comparisons,
regional distributions (paying special attention to the major cities) and corresponding results in
other European countries, is also available in Schulte Nordholt, Hartgers and Gircour (2004).
3.3.3. References for these different phases
a. National and Eurostat documents
b. National documents
c. Houbiers (2004), Houbiers et al. (2004) and Kroese and Renssen (2000)
d. This key reference
e. Schulte Nordholt et al. (2004)
3.3.4. Quality assessment
National and international programs have been running to check the quality of the output.
Regardless of the actual purpose of the micro integration procedure, whether it uses model-based, donor-based, simulation-based or repeated-weighting methods, an assessment of the
source data as well as measures of the reliability of the estimators must be taken into account.
Knottnerus and Van Duin (2006) give the variance formulae for the repeated weighting (RW)
estimator, and test RW estimators under various conditions. Several simulation studies, e.g.
Boonstra (2004) and Van Duin and Snijders (2003) show that the method of repeated
weighting leads to estimates with lower variances than usual estimation methods, due to a
better use of auxiliary information.
3.3.5. Open issues
It remains an open issue how small areas can be estimated when no register
information is available. Small area estimation, that is to say, obtaining valid and efficient
estimates of population parameters for sub-national domains, whether geographically defined
domains or categories of classifications at a very disaggregated level, is a task that can be
properly done by making use of administrative registers. For efficient small area estimation,
records from an administrative source can play the role of auxiliary information, see e.g.
EURAREA Consortium (2004). Saralegui et al. (2005) develop a case study applied to
quarterly data taken from the Spanish Labour Force Survey (EPA), using external
administrative sources in order to obtain area-level covariates; an overall evaluation of the output is
also provided. When no such external information is available, estimation of parameters in
those small domains could be performed by means of some modern estimation techniques
(see e.g. Rao, 2003), but then the open issue is how to keep the set of tables
consistent.
Bibliography
Corbey, P., 1994. Exit the population Census. Netherlands Official Statistics, Volume 9,
summer 1994, pp. 41-44.
Duin, C. van and V. Snijders, 2003. Simulation studies of repeated weighting. Discussion
paper 03008, Statistics Netherlands, Voorburg / Heerlen.
http://www.cbs.nl/en/publications/articles/general/discussion-papers/discussion-paper03008.pdf.
Houbiers, M., 2004. Towards a social statistical database and unified estimates at Statistics
Netherlands. Journal of Official Statistics, Volume 20, No. 1, pp. 55-75.
Houbiers, M., P. Knottnerus, A.H. Kroese, R.H. Renssen and V. Snijders, 2003. Estimating
consistent table sets: position paper on repeated weighting. Discussion paper 03005, Statistics
Netherlands, Voorburg / Heerlen.
http://www.cbs.nl/en/publications/articles/general/discussion-papers/discussion-paper03005.pdf.
Kroese, A.H. and R. H. Renssen, 2000. New applications of old weighting techniques,
constructing a consistent set of estimates based on data from different sources. ICES II,
Proceedings of the second international conference on establishment surveys, survey methods
for businesses, farms, and institutions, invited papers, June 17-21, 2000, Buffalo, New York,
American Statistical Association, Alexandria, Virginia, United States, pp. 831-840.
OECD, 2003. Education at a glance. OECD Publications, Paris, France.
Rao, J.N.K., 2003. Small area estimation. Wiley, New York, United States.
Schulte Nordholt, E., M. Hartgers and R. Gircour (Eds.), 2004. The Dutch Virtual Census of
2001, Analysis and Methodology, Statistics Netherlands, Voorburg / Heerlen, July, 2004.
http://www.cbs.nl/en-GB/menu/themas/dossiers/volkstellingen/publicaties/2001-b57-epub.htm
Statistics Netherlands, 2003. Urban Audit II, the implementation in the Netherlands. Report,
BPA no. 2192-03-SAV/II, Statistics Netherlands, Voorburg.
http://www.cbs.nl/en/publications/articles/regional/urban-audit-II-Netherlands.pdf.
3.4. Other references
Boonstra, H.J., 2004. A simulation study of repeated weighting estimation. Discussion paper
04003, Statistics Netherlands, Voorburg / Heerlen.
Denk, M. and P. Hackl, 2003. Data integration and record matching an Austrian contribution
to research in official statistics. Austrian Journal of Statistics, Volume 32, pp. 305-321.
Denk, M. and P. Hackl, 2004. Data Integration Techniques and Evaluation. Austrian Journal
of Statistics, Volume 33, Number 1&2, pp. 135-152, Vienna.
EURAREA Consortium, 2004. Main report. Enhancing Small Area Estimation Techniques to
meet European Needs, Deliverable D.7.1.4.
http://www.statistics.gov.uk/eurarea/downloads/EURAREA_PRV_1.pdf
Heerschap, N. and L. Willenborg, 2006. Towards an integrated statistical system at Statistics
Netherlands. International Statistical Review, Volume 74, Number 3, December 2006, pp.
357-378.
INE, 2002. Monte Carlo simulation to evaluate small area estimators of single person
households (Unpublished manuscript).
Kamakura, W.A. and M. Wedel, 1997. Statistical data fusion for Cross-Tabulation. Journal of
Marketing Research, Volume 34, pp. 485-498.
Knottnerus, P. and C. van Duin, 2006. Variances in repeated weighting with an application to
the Dutch Labour Force Survey. Journal of Official Statistics, Volume 22, No. 3, pp. 565-584.
Little, R.J.A. and D.B. Rubin, 1987. Statistical Analysis with Missing Data. John Wiley &
Sons, New York.
Saralegui, J., M. Herrador, D. Morales and A. Agustín Pérez, 2005. Small Area Estimation in
the Spanish Labour Force Survey. Proceedings of Challenges in Statistics Production for
Domains and Small Areas. http://www.stat.jyu.fi/sae2005/abstracts/saraleg.pdf.
4. Practical experiences
4.1. Record linkage of administrative and survey data for the EU-SILC
survey: the Italian experience
Paolo Consolini (ISTAT)
The EU-SILC (European Union Statistics on Income and Living Conditions) Italian team has
developed a pioneering strategy for the measurement of self-employment income since 2004. This
strategy consists of a multi-source data collection, based on a paper and pencil face-to-face
interview and on the linkage of administrative with survey data. The aim of combining
administrative and survey data is to improve data quality on income components (target
variables) and the corresponding earners, by means of the imputation of item non-responses and the
reduction of measurement errors. Integration of administrative and survey data at micro level is performed
by linking individuals through common key variables.
Target variables - In the first edition (survey 2004), this process involved only two
income components: self-employment income and pensions. For the
second edition (survey 2005) it also included a third one: employment income.
Target population - The target population is represented by the Italian reference population
of EU-SILC: all private households and their current members residing in Italy at the time of data
collection. Persons living in collective households and in institutions are excluded from the target
population. The analysis units are adult members (aged 15+) who live in private households (the
Eurostat database of the Italian EU-SILC personal income data refers to all adults aged 16+ living in
each sample household). The EU-SILC 2004 survey includes approximately 52.5 thousand interviewees aged 15 years
or over. Among these, about 49.2 thousand units have a tax statement or a declaration in the
administrative data sources.
Problems in the available data sets - With regard to the measurement of self-employment
incomes in household surveys there are two clear-cut statements, taken from the “Canberra
Handbook”, that depict the state of the art: “Income data for the self-employed are also
generally regarded as unreliable as a guide to living standards” (Canberra Group, 2001,
p.54); “Household surveys are notoriously bad at measuring income from capital and self-employment income” (Canberra Group, 2001, p.62).
Figure 1 below shows, in a simplified sketch, the problem of collecting self-employment
incomes when either survey or administrative data are available, and the objective is to obtain
disposable income: the shaded areas correspond to the income available to an individual for
his/her personal use.
Figure 1 - Personal gross, taxable, reported and disposable income
[Schematic diagram relating gross income, taxable income, net taxable income (administrative data), disposable income and reported income (survey data), with tax avoidance and deductions, taxes and contributions, and under-reporting accounting for the differences between them.]
The different sources of microdata on earnings from self-employment may not contain the
variable ‘disposable income’. Survey data may be affected by under-reporting. On the other
hand, administrative data gathering individual tax returns do not take account of illegal tax
evasion and may not display all the authorized deductions allowed in the calculation of
taxable income (tax avoidance).
Privacy issues – The Personal Tax Annual Register, including all the Italian tax codes, cannot
be used directly by Istat. Therefore, record linkage has to be performed by the Tax Agency
on Istat’s behalf. The exact linkage performed by the Tax Agency produces 7.8% unmatched
records, which are partially retrieved by means of auxiliary information (1.5%). Transfer of
know-how in performing record linkage from Istat to the Tax Agency could improve the
effectiveness of the matching procedure in the future.
Integration methodology - In the EU-SILC project, the standard procedure to measure net
self-employment income requires collecting “the amount of money drawn out of self-employment business” only when the profit/loss from accounting books or the taxable self-employment income (net of corresponding taxes) is not available. For the Italian EU-SILC
project, both tax and survey microdata are available, through an exact matching of
administrative and survey records. However, both sources may be affected by under-estimation of self-employment incomes. Moreover, some individuals report self-employment
incomes in only one data source. This is the case of some individuals whose professional
status at the time of the interview differs from that of the income reference period, and of
many percipients of small and/or secondary self-employment incomes (the survey data include as
self-employment income those small compensations for minor and informal services, such as the earnings of
baby-sitters, that frequently go unnoticed for tax purposes; on the other hand, some minor self-employment
incomes shown in the tax returns may be disregarded during the interview to ease the response burden). The
integration procedure consists of the following 4 phases.
1. Key (individual identifier) Each sample person has been identified with her/his tax
code (i.e. the personal identification number assigned to each individual by the Italian
tax authorities). The tax codes have been primarily retrieved out of the Population
Registers by the Statistical Offices of the Municipalities that participated in the
survey. As the information released by the local statistical offices may be missing or
inaccurate, Istat has also requested them to collect auxiliary data on the individuals to
be interviewed. Indeed, the personal tax code is uniquely determined by the values
of selected individual characteristics (name, surname, sex, date of birth and place of
birth). Thus, the collected tax codes were compared with those resulting from the
computation based on the available individual characteristics and, when necessary,
corrected.
2. Linkage of survey and tax records In a second step, the tax codes from the previous phase
were matched to those in the Personal Tax Annual Register, consisting of all the
Italian tax codes. The procedure searched for the tax codes of the persons in the EU-SILC
sample among the ones in the tax files. More precisely, linkage focuses mainly
on adults (15 years and over) who actually participated in the survey. The rate of
successfully matched records was 93.7%. In other words, the tax source covers 93.7%
of the adults interviewed for the 2004 Italian EU-SILC survey. The unmatched units
(6.3%) are either individuals with no tax code available in the Population Registers
(4.4%) or persons not included in the initial survey frame but later registered as
additional household members by the interviewers (1.8%).
3. Loading tax data The third step consisted in reading and checking the information on self-employment income included in the tax records. At this stage, two relevant sources of
microdata have been uploaded: (i) “UNICO persone fisiche” and (ii) “730” tax
returns (with few exceptions, the “UNICO persone fisiche” form must be filled in by the generality of
percipients of self-employment incomes, in particular by any person who is the sole or joint owner of an
unincorporated business for which he/she works and by those taxpayers who perceive incomes from
unincorporated businesses; the “730” form must be filled in by the percipients of secondary and/or occasional
self-employment incomes, for example an employee who adds to the wage a self-employment income from a
secondary job as a free-lancer). Once implemented, the reading procedures lead to a suitable database of tax
records that has been used to build the net (taxable) self-employment income. The
Italian tax system distinguishes between two broadly defined categories of self-employment income: ‘redditi da libera professione’ (earnings from free profession)
and ‘redditi d’impresa’ (business incomes). The latter may also include income
attributed to sham co-helpers for tax splitting purposes. Income splitting within a small
family business occurs when there is a transfer of taxable income from a person in a
higher income bracket to a person in a lower income bracket. Because of the
progressive tax schedule, it is thus possible for some self-employed people to lower
total household income liability. We define as ‘sham co-helpers’ persons who appear
as percipients of self-employment income in tax returns but, at the same time,
convincingly report themselves in the survey as inactive, non-working persons during
the income reference period (students, housewives etc.). Income received by sham co-helpers has been assigned to the active self-employed household members.
4. Imputation through integration The assumption underlying the fourth step has been
that true disposable self-employment income may be under-reported by both sources.
In order to minimise under-estimation, self-employment income has been set to the
maximum value between the net income resulting from the tax source and the net
income reported in the survey. In most cases, comparisons of self-employment income
reported in the two sources has been made at the individual level. However, for small
family businesses, comparisons have been made at the household level, that is by
comparing the sums of the self-employment income received by all household
members in the two sources.
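A minimal sketch of this imputation rule is given below; the data and column names are invented, and it assumes, as one plausible reading of the household-level comparison, that the source with the larger household total is retained for all members of a small family business.

```python
# Illustrative sketch of the 'maximum of the two sources' imputation rule.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "household": [1, 1, 2],
    "person": ["a", "b", "c"],
    "small_family_business": [True, True, False],
    "income_survey": [12000.0, 0.0, 8000.0],
    "income_tax": [7000.0, 6000.0, 9500.0],
})

# Default: person-level maximum of the two sources.
df["income_final"] = df[["income_survey", "income_tax"]].max(axis=1)

# For small family businesses the comparison is made at household level:
# here the source with the larger household total is kept for all members (assumed reading).
mask = df["small_family_business"]
hh = df.loc[mask].groupby("household")[["income_survey", "income_tax"]].transform("sum")
use_survey = hh["income_survey"] >= hh["income_tax"]
df.loc[mask, "income_final"] = np.where(use_survey,
                                        df.loc[mask, "income_survey"],
                                        df.loc[mask, "income_tax"])
print(df)
```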
Conclusions - The use of administrative data has changed the tails of the distribution of self-employment incomes (Figure 2). Indeed, with respect to survey data, the final (i.e. integrated)
dataset contains a lower percentage of self-employment incomes in the range 2,000 - 12,000
Euros per year and a higher proportion of percipients with incomes greater than 20,000 Euros.
Figure 2 - Distributions of self-employment incomes drawn by: survey, administrative and final dataset (all percipients)
[Line chart of the three distributions (Survey, Administrative, Final); x-axis: income in thousands of euro (-10 to 100), y-axis: percentage of percipients (0 to 16%).]
As a result, combining administrative and survey data brings about a rise of 15.6 % in the
number of percipients and an increase of 11.9 % in the average of self-employment income
compared to the exclusive use of survey data. When both sources report information on self-employment incomes, there is some evidence of a higher under-estimation rate on the tax data
compared to the survey data.
Bibliography
Canberra Group, 2001. Final Report and Recommendations, Ottawa, Canada.
4.2. Record linkage applied for the production of business demography
data
Caterina Viviano (ISTAT)
The Business Demography (BD) project: background - The harmonised data collection on
BD is a European project launched in 2001 with the aim of producing comparable data on
business demography for the European Union (EU). In particular it aims to satisfy the
growing requirements for structural indicators regarding births, deaths and survival. Until now
data have been produced for the reference years 1998 to 2004.
The number of enterprise births is a key variable in the analysis of business demography as
other variables such as the survival and growth of newly born enterprises are related to this
concept. The production of statistics on newly born enterprises is based on the following definition
(Commission Regulation No 2700/98):
“A count of the number of births of enterprises registered to the population concerned in the
business register corrected for errors. A birth amounts to the creation of a combination of
production factors with the restriction that no other enterprises are involved in the event.
Births do not include entries into the population due to: mergers, break-ups, split-off or
restructuring of a set of enterprises. It does not include entries into a sub-population resulting
only from a change of activity.”
The aim is to produce data on the creation of new enterprises that have started from scratch
and that have actually started activity. An enterprise creation can be considered an enterprise
birth if new production factors, new jobs in particular, are created.
The identification process - In practice, to obtain enterprise births (and, analogously, deaths) it is
necessary to carry out an identification process.
This process is described for births in the following steps:
Step1 - The populations of active enterprises (N) at reference time t, t-1 and t-2 are obtained
from the Business Register ASIA (in Italian "Archivio Statistico delle Imprese Attive", the
frozen annual file of the Statistical Archive of Active Enterprises, obtained through the
integrated use of different administrative and statistical sources).
Step2 – The new enterprises in year t (Et) are a subset of the population of active enterprises
Nt, identified by comparing the population of active enterprises in year t with the populations
of active enterprises in years t-1 and t-2, so as to exclude reactivations. New enterprises are
identified as enterprises that are present only in year t. The three populations are matched by
exact codes, i.e. the business register id code.
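In essence, this step amounts to a set difference on the business register id codes, as in the following simplified sketch (the ids are invented).

```python
# Simplified sketch of Step 2: new enterprises in year t are the active units whose
# business register id is present neither in t-1 nor in t-2 (excluding reactivations).
active_t  = {"A01", "A02", "A07", "A09"}
active_t1 = {"A01", "A02", "A05"}
active_t2 = {"A01", "A05", "A07"}   # A07 was active two years ago: a reactivation, not a birth

new_enterprises = active_t - active_t1 - active_t2
print(new_enterprises)              # {'A09'}
```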
Step3. -The identification of births is carried out by eliminating creations due to other events
than births from the population of new enterprises, that is break-ups, split-offs, mergers and
one-to-one take-overs. The method for identifying other creations compares the new
enterprises with the population of active enterprises for the current year (Nt) and the previous
year, using a linkage process. The linkage process includes matches on name, economic
activity and location (variables already available in the BR). This linkage technique allows
the application of the continuity rules, i.e. the rules establishing when two enterprises are deemed to be
the same (following Eurostat, 2003, the continuity of an enterprise depends on the continuity of its
production factors: an enterprise is considered to be continued if it is modified without any significant
change in its identity, that is in its production factors, which include the set of means, such as employment,
machines, raw material, capital management and buildings, that the enterprise uses in its production process,
leading to the output of goods and services).
The continuity rules consider three continuity factors: continuity of control, economic activity
and location. These rules generally follow the approach that if two out of three of the
continuity factors change, there is discontinuity of the enterprise.
In addition to the linkage process described above, it is also necessary to check for links
between units, which may indicate that a new enterprise is not a real birth, and to carry out
additional matching or checking using any other available information such as the database of
events of structural changes (mergers, take over), administrative sources supplying
information on links like the partnership archive by the Chamber of Commerce, and so on.
It is necessary to be aware that some activities naturally tend to be concentrated in certain
locations, such as retailing (shopping malls), construction (large sites), and the “liberal
professions” (shared premises), where there is an increased risk of false matches.
Finally, to complete the identification of the enterprise births, the largest enterprises are
investigated manually to detect whether the event actually can be considered a real birth.
The RL procedure - The data matching process to identify the pairs of records used to validate the
continuity rules is carried out applying a (non-probabilistic) record linkage (RL) technique.
The method is applied in such a way that the decision to link two or more records is the result of
a complex decision-making procedure. In this process the first decision is based on a
deterministic choice of link and non-link record pairs, according to the outcomes produced by
the compared variables. Afterwards the results are assessed, integrated and validated with other
deterministic rules, aiming to reduce the problem of multiple links and to evaluate whether the
link actually corresponds to real conditions.
The RL process can be summarized as follows:
Phase 1: treatment of variables and rules for the matching
The BR record units are compared through agreement/disagreement rules delineated for three
variables: enterprise name (N), location (L) and economic activity (S). For a better
specification, information on the legal status and the fiscal code has been used.
 Step1. Before matching, each variable is standardized and parsed. Parsing means
that a free format field - like the enterprise name - is divided into a set of components that
can be identified and that can be automatically matched. Each component is parsed using
dictionaries (for the variable Enterprise Name). For example words representing
surnames, names, activities, legal status, alias names, etc. are identified. Rules used to
establish agreement/disagreement between characters are described in the following
schemes.
Enterprise name
According to the Italian enterprise structure, an enterprise name takes different formats
according to its legal form. Aggregating the legal form (J) into 3 classes (I= Sole
proprietorship; Sp= Partnership; Sc= Limited liability company), the adopted rules are:

RULE – Enterprise Name
- A = agreement: (J=I, Sp) surname and name are the same; (J=Sp, Sc) % of matched words = 100%
- PA = partial agreement: (J=I, Sp) surname is the same; (J=Sp, Sc) % of matched words > %min but < 100%
- D = disagreement: (J=I, Sp) surname is different; (J=Sp, Sc) % of matched words <= %min
Address
Standardization of the address consists of identifying 3 components: the toponym (T), the street
name (SN) and the street number (sn). Matching rules are delineated according to the different
combinations of component outcomes:

RULE – Enterprise Address
- A = agreement: T equal or missing; SN: % of matched strings high; sn present and equal
- PA = partial agreement: T equal or missing; SN: % of matched strings high; sn missing
- D = disagreement: T differs; SN: % of matched strings high/low; sn differs
Economic activity
Nace Rev.1 codes at 4 digits are compared to produce the outcomes:

RULE – Economic activity
- A = agreement: Nace code equal
- PA = partial agreement: Nace code consistent (economic activity codes that are not equal at four digits can be consistent when the activities carried out use the same production process; for example, production and sale of bread are consistent activities)
- D = disagreement: Nace code differs
The composite comparison vector for each record pair allows identifying subpopulations of
matched records.
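The following hedged sketch shows how outcomes of this kind could be derived for the name and economic activity variables; the thresholds, the word-matching measure and the list of consistent code pairs are purely illustrative and do not reproduce the ISTAT rules exactly.

```python
# Illustrative construction of the comparison vector (N, L, S); not the ISTAT code.
def compare_name(words_a: set, words_b: set, pct_min: float = 0.6) -> str:
    """Name comparison for partnerships/companies via the share of matched words."""
    pct = len(words_a & words_b) / max(len(words_a | words_b), 1)
    if pct == 1.0:
        return "A"            # agreement
    if pct > pct_min:
        return "PA"           # partial agreement
    return "D"                # disagreement

def compare_nace(nace_a: str, nace_b: str, consistent_pairs: set) -> str:
    """Economic activity: equal 4-digit codes, consistent codes, or different."""
    if nace_a == nace_b:
        return "A"
    if (nace_a, nace_b) in consistent_pairs or (nace_b, nace_a) in consistent_pairs:
        return "PA"
    return "D"

# Example: bakery producing and selling bread (illustrative pair of consistent codes).
consistent = {("1581", "5224")}
vector = (compare_name({"panificio", "rossi"}, {"panificio", "rossi", "srl"}),
          "A",                                   # address agreement assumed for the example
          compare_nace("1581", "5224", consistent))
print(vector)    # ('PA', 'A', 'PA')
```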
 Step2: Blocking and reductions of the comparison space
A blocking strategy is important to reduce the large amount of possible record pairs to be
compared. A block means that only records belonging to the same block are compared.
The chosen blocks are: 1) municipality, as the main block; 2) then economic activity code (3
digits) and postal code (CAP, 4 digits), as alternative blocks. In order to increase the quality of the data,
postal codes have been standardized using, for the first time, a file downloaded from the Italian
Post Office site. The results of this process allowed improving the overall quality of the blocking
strategy.
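The following small sketch (invented units) illustrates the blocking idea: candidate pairs are generated only within the same block, here the municipality, which drastically reduces the number of comparisons.

```python
# Illustrative blocking by municipality: only units in the same block form candidate pairs.
from collections import defaultdict
from itertools import product

new_units = [("E1", "ROMA"), ("E2", "MILANO")]
stock     = [("S1", "ROMA"), ("S2", "ROMA"), ("S3", "TORINO")]

blocks = defaultdict(lambda: ([], []))
for uid, mun in new_units:
    blocks[mun][0].append(uid)
for uid, mun in stock:
    blocks[mun][1].append(uid)

candidate_pairs = [pair for left, right in blocks.values()
                   for pair in product(left, right)]
print(candidate_pairs)      # [('E1', 'S1'), ('E1', 'S2')] instead of all 6 possible pairs
```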
Phase 2: Implementation of the RL procedure (only deterministic):
In this process, the probabilistic step (the EM parameter estimation method and the calculation of the
weights to associate with each record pair, which allow choosing between link and non-link) has
not been applied.
The main reason lies in the estimation problems of a specific parameter that is based on a
random sample of data chosen from the population. Given that these estimates become biased
as the population size increases, there is a problem concerning the sample size and its
representativeness; stratification by small geographical areas would be necessary, but this
strategy would require longer data processing. To face this problem, alternative solutions are
under consideration.
Main steps for the choice of matched records:
 inclusion of pairs having two or three agreements according to the continuity rules;
 investigation and selection of some particular comparison vectors having partial
agreements for some components (for example in the comparison between
cooperatives/consortiums with other companies the link is accepted in presence of Name
(Partial Agreement) and Address and Nace code (Agreement));
 clustering of pairs of records and reduction of multiple matches, choosing the best
comparison vector;
 exclusion of clusters with more than 3 pairs of linked records (very large multiple
matches);
 exclusion of comparisons between a sole proprietorship and a limited liability company (in
this case a change in the control variable is strong evidence of discontinuity);
 The sub-population determined by the agreement on both location and sector of activity
(L+S) is analysed with more attention, due to its large size in comparison with the others.
A list of economic activities “at risk” is built up, so that combinations involving particular
activities are first excluded and then they are recovered after a check for telephone
numbers and the date of registration/deregistration. For example, in the year 2004, the weight
of L+S record pairs in the match between stock and new enterprises was 55%; after the
check only 9% of them were re-integrated. The list of such activities in terms of Nace
codes is the following: 45, 5262, 5263, 65, 66, 67, 70, 741, 742, 744, 7484, 6025, 633,
634, 851.
Conclusions - The RL technique is of fundamental importance in detecting links for
continuity between creations of enterprises and active units. Its application allows identifying
and cleaning data. Results are significant as it accounts for a high percentage of detected links
(Table 1).
Table 1 – Identification of enterprise births - year 2004 (BD 2004)
1. Active enterprises: 4,366,679
2. Creations of enterprises: 447,419
   because of:
   2.1 reactivations: 36,873
   2.2 errors: 1,624
   2.3 state of activity: 408,922
3. New enterprises (2.3 less Nace 7415): 408,607
   3.1 creations due to events of take-over, merger, etc.: 10,928
   3.2 creations due to continuity: 44,695
   3.3 creations due to events of changes of juridical status: 1,260
   3.4 creations due to links by administrative sources: 12,159
   3.5 exclusions due to corrections: 266
   3.T Total exclusions (3.1-3.5): 69,308
Enterprise births: 339,299
Rate of new enterprises excluded for links (3.T/3, %): 17.0
Rate of new enterprises linked for continuity (3.2/3, %): 10.9
Rate of exclusions due to continuity (3.2/3.T, %): 64.5
Bibliography
Eurostat, 2003. Business registers - recommendations manual, Office for Official Publications
of the European Communities, Luxembourg.
4.3. Combination of administrative, register and survey data for
Structural Business Statistics (SBS) - the Austrian concept
Gerlinde Dinges (Statistics Austria)
The European SBS-Regulation (Council Regulation (EC, EURATOM) No 58/97 of 20 December 1996
concerning structural business statistics) is the basis for the compilation of Structural Business Statistics
in Austria from the reference year 1997 onwards. From the reference year 2002 onwards the new
national regulation for SBS (Leistungs- und Strukturstatistik-Verordnung, BGBl. II 428/2003), based on
the Federal Statistics Act 2000, foresees a completely new data
collecting and estimation concept for SBS statistics in Austria. To satisfy the national regulation
and the requirements of the EU, the new concept is implemented by conducting a yearly cut-off
survey in combination with the use of administrative sources and statistical calculation methods,
instead of the formerly applied stratified random sampling (grossing up of 42.000 sampled units).
For NACE sections C-F information from short-term statistics (STS) can be used additionally.
The selection frame for SBS (NACE sections C-K) is the Business Register (BR) of Statistics Austria, which
provides links between the various administrative sources and the statistical units
(enterprises). Administrative data from Social Security Authorities (employment information)
and Tax Authorities (turnover) are used in combination with the survey and also as basic data
for the non-surveyed units. For NACE divisions 65 and 66 the Austrian Central Bank and the
Austrian Financial Market Supervisory Authority are single sources.
Main Target – the main target of the new concept with its combination of administrative
data, surveyed data and model based estimation is the reduction of respondent’s burden on the
one hand and the retention of high data quality on the other hand.
Strategy – the principle of data concentration on a small number of large and medium sized
enterprises in the SBS should be used (see fig. 3). Hence based on a cut-off-survey the ‘most
important information’ should be obtained from the units directly and the ‘less important’
information should be obtained by model based estimation. Thus, there is no general loss of
information and detailed information can be provided for all units on record level.
Figure 3: The principle of data concentration for selected variables (SBS 2003)
Cut-Off-Survey - In accordance with the Austrian SBS-Regulation in the new concept only
about 32.000 enterprises (12%) above a NACE-specific threshold are in the yearly SBS
Survey. For production (NACE C-F) the threshold is defined in terms of persons employed (in
general 20 depending on the coverage), for services a turnover threshold (1.5 Mio EUR for
trade and NACE 633 and 634, 750 Tsd. EUR for other services) is used.
In the service sector for the variable number of employees (as well as the breakdown by sex
and status) administrative data from social security authority is combined with surveyed data
from the SBS Survey. The variable number of employees in total is collected in the survey for
checking the link to administrative sources in the course of plausibility checks. For most of
the enterprises all details concerning employment variables are taken from Social Security
data directly. In the case of major deviations an update of the link in the business register
mostly solves the problem. In manufacturing industries and construction some main
variables (like employment, personnel expenditures, …) can be taken from STS.
Estimation Model - For smaller enterprises below legal thresholds (about 240.000 units) all
variables are estimated on record level by using the sources Business Register, Social Security
Register and Tax Register to obtain basic variables (economic activity, number of employees,
turnover) and by applying model based estimation for the other variables.
Model parameters for microdata estimation for enterprises below legal thresholds are based
on the ‘most similar’ enterprises in the surveyed SBS-data. The ideal case - the structure of
variables for enterprises above thresholds is the same as for enterprises below – cannot
generally be assumed. That can be seen from Figure 4 which shows the box plots for
personnel expenditures per employee for a selected NACE subclass and different turnover
size classes in the last SBS-Census (1995).
Figure 4: Distribution of ‘personnel expenditures’ per employee for a selected NACE subclass and different turnover size classes in the last SBS-Census (1995)
[Box plots of personnel expenditures per employee by turnover size class, distinguishing the size classes observable in the cut-off survey (the basis for parameter estimation) from those below the legal threshold, which are not observable.]
Therefore, a step by step approach based on turnover size classes and economic activities has
been applied, to allow for structural differences in model parameter estimation. If enough
enterprises are available in the cut-off-sample, the calculation of model parameters is based
on the most detailed NACE classification level (subclasses) and the smallest possible size
classes. Starting with a turnover-limit of EUR 999.000 the size classes were raised step by
step up to a maximum of 5 Million EUR. If the number of enterprises in the relevant NACE
subclass was too small then the parameters were calculated for a higher NACE aggregation.
Since outliers can have a great influence on the quality of the model adaptation, a robust
regression method (the FAST-LTS (Least Trimmed Squares) regression algorithm of Rousseeuw and
Van Driessen, 1999) was applied to estimate the main variables (e.g. personnel
expenditures, purchases of goods and services, …). For detailed variables like breakdown of
turnover or breakdown of purchases of goods and services ratio estimation has been used.
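As an illustration of the robust-regression idea (not the actual FAST-LTS implementation), the following toy sketch repeatedly refits ordinary least squares on the h observations with the smallest residuals, in the spirit of LTS concentration steps; the data are simulated.

```python
# Toy trimmed least-squares fit in the spirit of LTS (simplified stand-in for FAST-LTS).
import numpy as np

def trimmed_ls(x: np.ndarray, y: np.ndarray, h_frac: float = 0.75,
               n_iter: int = 20) -> np.ndarray:
    X = np.column_stack([np.ones_like(x), x])
    h = int(h_frac * len(y))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)           # start from ordinary LS
    for _ in range(n_iter):
        residuals = np.abs(y - X @ beta)
        keep = np.argsort(residuals)[:h]                    # concentration step
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return beta

rng = np.random.default_rng(1)
turnover = rng.uniform(1000, 5000, size=200)                # turnover of surveyed enterprises (Tsd. EUR)
pers_exp = 0.3 * turnover + rng.normal(scale=10, size=200)  # 'personnel expenditures'
pers_exp[:10] += 500                                        # a few gross outliers
print(trimmed_ls(turnover, pers_exp))                       # approximately [0, 0.3]
```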
Quality of Administrative Sources – In order to define the population (units above and
below the legal thresholds) and to get basic variables for model based estimation, employment
data from Social Security Authority by sex and status level and annual tax declaration values
or aggregated monthly tax declarations (VAT tax advance return) from administrative sources
are used. The quality and completeness of social security data is very satisfactory. Because of
missing tax declarations or incomplete links of administrative data with Business Register, for
about 15% of enterprises below thresholds turnover imputation is necessary. Additionally
turnover definitions from tax declarations and SBS do not correspond to each other by 100%.
Deviations between SBS definitions and administrative data depend on different reasons such
as foreign tax accounts, definitional differences, structural changes, group company tax
declarations, financial year records (for SBS), calendar year (tax) or deficient tax declarations
etc. However, analyses have shown that differences for small observable enterprises are rather
negligible and the large or medium sized enterprises are in the survey anyway.
Model Effects – For insufficiently covered inhomogeneous economic activities and economic
activities with a very different structure of the enterprises above and below thresholds (e.g.
NACE categories in which trade and intermediary activities are put together) a systematic bias
cannot be avoided. In this case only expert rating can increase the quality of the results
(subjective assessment of results by qualified experts of data editing staff).
Conclusions – In general the concentration principle of the surveyed data in combination
with administrative data and model based estimation works very well. For most of the NACE classes the small percentage of surveyed enterprises provides a relatively high data coverage
(ratio of the surveyed data). In this case possible model effects as a result of inhomogeneous
branches or deviations between SBS and administrative data may have less influence.
Figure 5: Coverage for basic variables on detailed NACE-level (SBS 2003)
[Three histograms showing the number of NACE classes (4-digit level) falling into each percentage coverage band, from 0-10% up to >90%, for the coverage of the number of enterprises, of turnover and of the number of employees.]
Applying the concentration principle to all economic activities will require an adaptation of
the legal thresholds for the economic activities that are not well covered. This would require a change
of the national regulation for SBS (the next change of the national regulation is planned with the
implementation of NACE Rev. 2), which is not foreseen at the moment.
The Austrian SBS method was developed with consideration of all legal, technical and mathematical circumstances, using all data sources currently available. Various test calculations were carried out in advance, and analyses have shown high-quality results for basic data like turnover and number of persons employed, and an improvement over former concepts, in particular for results at regional level. Results for all the other variables were also very satisfying. In case of basic legal changes, or when new (re)sources become available, further adaptations will be carried out.

14 FAST-LTS (Least Trimmed Squares regression) algorithm (Rousseeuw and Van Driessen, 1999).
15 The next change of the national regulation is planned with the implementation of NACE Rev. 2.
Documentation on the project methodologies
http://ec.europa.eu/comm/eurostat/ramon/nat_methods/SBS/SBS_Meth_AU.pdf
Bibliography
Rousseeuw, P.J. and A. M. Leroy. Robust regression and outlier detection, John Wiley &
Sons, Inc., August 1987.
Rousseeuw, P.J. and K. van Driessen. Computing LTS Regression for Large Data Sets,
Springer Netherlands 2006
(http://www.springerlink.com/content/06k45m57x01028x6/fulltext.pdf).
4.4. Record linkage for the computer-assisted maintenance of a Business
Register: the Austrian experience
Alois Haslinger (Statistics Austria)
Since the Federal Statistics Act 2000 came into effect, Statistics Austria has received monthly copies of at least four administrative data sources or registers (AR), all containing information about Austrian enterprises or their subunits. This information is used not only for the updating and maintenance of the Business Register (BR) but also as a partial or total surrogate for censuses and surveys. The idea is to reduce the response burden as much as possible. If any information which is needed for statistical purposes is already stored somewhere in the public administration, then Statistics Austria should take that information instead of surveying the enterprises or the citizens again.
The Austrian Business Register - The Business Register of Statistics Austria serves as an
instrument for all surveys conducted in economic statistics and even for some in social
statistics. It has been designed according to the requirements of Council Regulation No.
2186/93 on business registers for statistical purposes within the EU and contains about
410.000 enterprises including their establishments and local units. All in all it has about
560.000 active and 210.000 inactive units (i.e. enterprises, establishments and local units) and
has been in operation since mid-1995. The BR is a central register held in Statistics Austria
for statistical purposes. It is designed to comply as far as possible with European
requirements, but generally does not recognize any difference between a legal unit and an
enterprise, like the registers of most EU member states.
AR used for the maintenance of the BR - Basically four administrative registers are used for
the continuous servicing of the BR:
1. Register of the Federal Economic Chamber (FEC). Until 2000 it was the sole administrative source used for the updating of the BR. Most physical and legal persons who want to pursue a business have to apply for a trade licence. A separate licence is necessary for each different trade in which someone wishes to engage. The register of members of the Economic Chamber has about 350.000 entries. The register of the Economic Chamber is still an indispensable source of information for maintaining the quality of the BR (local units, status of enterprises). Unfortunately, not every economic unit has to become a member of the Federal Economic Chamber; this holds, for example, for physicians, lawyers and civil engineers, who have their own chambers.
2. Register of Incorporated Companies (RIC). This register is a public electronic register operated by special courts. It records and provides information on all facts about enterprises which have to be stored according to commercial law (name and address of the company, company number, legal form, enrolment of the submission of accounts, changes of the persons authorised to represent the company). The register of companies contains only about 150.000 entries of corporations or merchants who have been entered as such in the register. Generally, such a person must have a net turnover above 400.000 € per year, or above 600.000 € for food retailers or general stores.
3. Social Security Register (SSR) Each Austrian employer has to register his employees in
one of about 20 different social security insurance institutions. It depends on the region
and the kind of contract of employment, which insurance institution is responsible for a
certain employee. It is possible that the employees of an employer are registered at two or
more different insurance institutions (e.g. if an employer has local units in more than one
province). For each employment of a certain person by a certain employer, a data record is
stored in the responsible social security insurance institution containing, among other
things, the social security number of the person, an identification code of the employer, a
code of the insurance institution, sex of the person and the kind of contract of
employment. Because of the federal organisation of the social insurance system, most of the institutions are members of an umbrella organisation called the Main Association of Austrian Social Security Institutions. The Main Association has access to the employment registers of its members and additionally maintains a register which holds one record for each combination of insurance institution and employer. This register of employer accounts (SSR) contains the name of the employer, postcode, address, place of the enterprise, NUTS-3 and NACE codes, and comprises about 350.000 units. The units of this
register are not comparable with the units of the BR. Usually, one enterprise of the BR
consists of 0 to n units of the Social Security register.
4. Tax Register (TR) The most comprehensive administrative register used for the updating
of the BR is the register of the tax authorities. It contains basic information like name and
address, date of birth, sex and civil status (the last 3 primarily for persons), legal status
and economic classification according to NACE (primarily for enterprises) for about 6
million taxable units (persons, business partnerships, corporations, institutions,
associations…). The coverage of this basic tax file is much broader than that of the BR.
To get a sub-file from the basic tax-file which is comparable with the BR it has to be
merged with the turnover taxation file from the tax authorities containing about 600.000
units. Statistics Austria receives both files of the tax register monthly. Both files include a
unique subject identification key which can be used for merging the two files. The
turnover taxation file contains all units from the basic file which did a turnover tax return
in at least one of the last 3 years. A problem is the lag between a fiscal year and the time,
when all units have received their tax assessment. This lag is about 2-3 years. To get a
realistic value of total turnover in 2000 you have to wait at least until mid 2003. The
merged file of that date covers units which are no longer active at present. On the other
hand, in the merged file units of the BR are lacking which are not liable for turnover
taxation (e.g. turnover from medical activity). Nevertheless, most of the units of this
merged file are in accordance with the enterprises of the BR. From the start of 2003 on,
the problem of the time lag of the turnover returns has been reduced, because now each
enterprise with a turnover above 100.000 € in a year has to do a monthly turnover tax
WP1
60
advance return beginning with January of the next year. Therefore, new enterprises are
registered earlier than in the past in the basic tax file.
5. Other administrative sources: The staff of the unit responsible for the manual updating of the BR also uses, more or less regularly, additional administrative sources which are not supplied as an electronic file and/or not monthly. Examples of such sources are the register of the reliability of borrowers (Kreditschutzverband) and the membership directories of the Medical Chamber, the Chamber of Lawyers, of Civil Engineers, Patent Agents, Notaries, etc.
Problems in the available administrative registers and record linkage - The units of the
AR do not exactly agree with the units of the BR (enterprises, establishments and local units)
and each register has its own system of identification keys for its units. Some information in
the AR is incomplete or wrong (e.g. the NACE classification) or the timeliness of some units
in the AR is different from that in the BR. The greatest problem is the non-existence of a
unique numerical identifier for the units in different registers. The matching of the register
units has to be done by comparing mainly text fields like name of the company and address.
These fields are not standardised and have different lengths in different registers. Fortunately, both the BR and the different AR store the postal and/or municipality code for each unit, which greatly reduces the number of necessary comparisons. Generally the most suitable kind of BR unit for linkage with the units of an external AR is the enterprise; only the licences of the Federal Economic Chamber are linked with the local units of the BR. For each AR a table is loaded into the DB/2 database of the BR, in which each enterprise/local unit is assigned the identification keys of the corresponding units in the AR.
Record linkage methodology - For record linkage of the units of two registers Statistics Austria uses the bigram method. For the comparison of the name of a unit a in register A and the name of a unit b in register B, the two names are decomposed into overlapping bigrams (e.g. ‘MAYER KARL’ is decomposed into the bigrams ‘MA’, ‘AY’, ‘YE’, ‘ER’, ‘KA’, ‘AR’, ‘RL’). The length of a name is measured by the number of different bigrams of the name, and the similarity of two names by the number of bigrams belonging to both names, divided by the square root of the product of the lengths of the compared names and multiplied by 100.
The result is always a value between 0 and 100: a value of 0 stands for no common bigram, and a value of 100 signifies that the two compared phrases are identical. We have also experimented with other similarity measures but found no great differences among them. In the special situation where the text field in one register is generally shorter than the corresponding one in the other register, it can be advantageous to divide by the minimum number of bigrams in the two strings instead of by the geometric mean.
The bigram method is simple to implement, usable for any language and robust against permutations of words in a phrase.
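As a simple illustration of this measure, the following Python sketch reproduces the bigram decomposition and the similarity formula described above (the function names are chosen for exposition only; the SAS implementation used at Statistics Austria is not reproduced here):

    def bigrams(name: str) -> set:
        """Set of the different overlapping bigrams, taken within each word of the name."""
        grams = set()
        for word in name.upper().split():
            grams.update(word[i:i + 2] for i in range(len(word) - 1))
        return grams

    def bigram_similarity(a: str, b: str) -> float:
        """Similarity in [0, 100]: common bigrams divided by the geometric mean of the lengths."""
        ba, bb = bigrams(a), bigrams(b)
        if not ba or not bb:
            return 0.0
        return 100.0 * len(ba & bb) / (len(ba) * len(bb)) ** 0.5

    print(bigrams("MAYER KARL"))                     # {'MA','AY','YE','ER','KA','AR','RL'}
    print(bigram_similarity("MAYER KARL", "KARL MAYER"))  # 100.0: robust to word permutation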
Parsing - To achieve satisfying results with the bigram method it is necessary that the
compared text fields of the same unit in different registers are not written too differently.
Before the text variables of two registers are compared for similarity, they must be
standardised and parsed. This is essentially a statistical process. It is usually done by computing the frequency of all words in both texts. If the frequency of a string (like ‘corp’, ‘inc’, ‘ltd’, ‘doctor’) is very different in the two registers, one can either delete that string in both registers, abbreviate it identically in both registers, or replace it by a synonym in at least one register. Other steps are converting lowercase characters to uppercase, converting special characters, etc.
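A minimal sketch of such a frequency-based standardisation step is shown below; the replacement table is purely hypothetical and would in practice be derived from the word-frequency lists of the two registers being compared:

    from collections import Counter

    def word_frequencies(names):
        """Frequency of every word over a list of names."""
        return Counter(w for name in names for w in name.upper().split())

    # hypothetical replacement table, built after inspecting the two frequency lists
    REPLACE = {"GESELLSCHAFT": "GES", "G.M.B.H.": "GMBH", "DR.": "", "&": "UND"}

    def standardise(name: str) -> str:
        """Upper-case the name, map frequent special tokens and drop empty words."""
        words = [REPLACE.get(w, w) for w in name.upper().split()]
        return " ".join(w for w in words if w)

    print(standardise("Dr. Mayer & Sohn Gesellschaft"))  # -> "MAYER UND SOHN GES"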
Blocking - Comparing thousands or even millions of records of one register with all records of another big register would require, even for a modern computer, very long elapsed times. Therefore only a small subset of all possible pairs of units is compared and their similarity measured. For example, in the monthly matching of the Social Security units with the units of the BR, in one run the units with identical names in both registers are compared for similarity of name, postal code and legal status; in a second run the units with identical address; in a third run the units with the same postal code and the same Christian name; and in the last run the units with the same first name are compared. The total similarity of name, postal code and legal status is a weighted average of the three separate similarities. The weights and the threshold above which a pair is a candidate for a possible link are determined empirically. At the moment the weights are .70 for name, .10 for postal code and .20 for legal status, and the threshold for the total similarity is 87. Units with a total similarity above 87 are listed for manual checks.
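Expressed as code, the decision rule described above amounts to the following sketch (the weights and the threshold are the empirically determined values quoted in the text; whether a total similarity exactly equal to 87 is listed is an assumption here):

    def total_similarity(sim_name, sim_postcode, sim_legal,
                         w_name=0.70, w_postcode=0.10, w_legal=0.20):
        """Weighted average of the three partial similarities (each in [0, 100])."""
        return w_name * sim_name + w_postcode * sim_postcode + w_legal * sim_legal

    THRESHOLD = 87  # determined empirically

    def is_candidate_pair(sim_name, sim_postcode, sim_legal):
        """True if the pair should be listed for a manual check."""
        return total_similarity(sim_name, sim_postcode, sim_legal) > THRESHOLD

    print(is_candidate_pair(95, 100, 70))  # 0.70*95 + 0.10*100 + 0.20*70 = 90.5 -> True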
Monthly updating of the Austrian BR - Each month Statistics Austria receives copies of the four AR, which are incorporated into the BR in the following phases.
1. Creation of new relations. For each unit of the AR which is not yet linked with a unit in the BR, our SAS record linkage program tries to suggest a possible candidate for a positive match, namely the pair with the highest similarity above a threshold. All such pairs are listed for manual inspection.
2. Checking and Storage of new relations. If the manual check for a pair of units succeeds, then both the key of the BR unit and the key of the AR unit are stored in a separate table of the DB2 database. No other AR information is stored in the BR, because it can be connected whenever necessary by using the stored relation of the keys. In case of a failure, the pair is deleted from the list. For the manual inspection, information additional to that in the four AR is also used (e.g. search engines on the Internet).
3. Creation of new business units. The units of an AR which even after the above two steps are not linked with a unit in the BR are considered as units belonging to newly born enterprises. The SAS record linkage program tries to combine all the units in the four AR belonging to the same enterprise (it needs 6 runs to compare each AR with all the others). In the best case a new unit is found in all four AR, but usually it is found only in two or three AR. All the pairs, triples and quadruples belonging to enterprises which have at least one employee or a yearly turnover above 22.000 € are listed for manual inspection.
4. Checking and Storage of new business units If the manual check for a pair, triple or
quadruple of units succeeds then a new unit is created in the BR and the relations
between that new unit and the corresponding units in the AR are stored as in step 2.
5. Deletion of dead units. If an incorporated enterprise is deleted from the register of companies, it is also immediately cancelled in the BR. In the past, non-incorporated enterprises were checked for possible deaths only quarterly; from mid-2007 onwards the check will be done monthly. An enterprise has died if it can no longer be found in any of the AR, or if its yearly turnover in the last two years was under 22.000 €.
Conclusions - Figure 6 demonstrates the success in raising the coverage of Statistics Austria’s
BR: from 2001 onwards the number of active units of the BR has increased by one third from
300.000 to around 410.000. This increase is partially caused by a better coverage of the public
and non-profit-oriented sector and partially by a lower under-coverage of the profit-oriented
sector.
Not only coverage has increased but also the number of enterprises which are linked to one or
more AR. Now 99% of all active enterprises are linked to a unit in the tax register, 73% to a
unit of the Federal Economic Chamber, 69% to at least one employer account number of the
Social Security System and 37% to a unit of the register of Corporations.
The monthly updating of the BR from administrative data, which began in 2005, results in a steady development of the BR and a good basis for sample selection and grossing up of the data. The use of computer-assisted record linkage has reduced the amount of manual work necessary for the maintenance of the BR quite considerably. It enables the combination of information (employees, turnover, NACE code) from the AR with the BR and also reduces the response burden of enterprises.
Figure 6 – Number of enterprises in the Business Register by type of relation to AR. [Time series from 1/2001 to 1/2007 of the number of BR enterprises (BR) and of the enterprises linked to the Register of Incorporated Companies (RIC), the Federal Economic Chamber (FEC), the Social Security Register (SSR) and the Tax Register (TR).]
Bibliography
Council regulation (EEC) No 2186/93 of 22 July 1993 on Community coordination in
drawing up business registers for statistical purposes.
Bundesstatistikgesetz 2000. Federal Statistics Act of 2000, BGBL I Nr.163/1999, idF BGBL I
Nr.136/2001, Vienna.
Haslinger, A., 1997. Automatic Coding and Text Processing using N-grams. Conference of
European Statisticians. Statistical Standards and Studies – No. 48. Statistical Data Editing,
Volume No. 2, Methods and Techniques, pages 199-209. UNO, New York and Geneva.
Haslinger, A., 2004. Data Matching for the Maintenance of the Business Register of Statistics
Austria. Austrian Journal of Statistics, Volume 33, No. 1&2, pp. 55-67.
http://www.stat.tugraz.at/AJS/ausg041+2/041+2Haslinger.pdf
4.5. The use of data from population registers in the 2001 Population and
Housing Census: the Spanish experience
INE (Spain)
For the 2001 Spanish Censuses, the chosen option was a classical census with a first-time exploitation of the Continuous Population Register (CPR, known as the Padrón Continuo de Habitantes): specifically, an operation based on a thorough itinerary around the territory, strongly supported by the CPR and by a single, comprehensive questionnaire, implemented exhaustively.
The population count based on the Population Register (PR) is not immediately accepted as
the best possible, but checked and corrected against reality through the complete enumeration.
The census is thus an exhaustive evaluation of the coverage of the PR, and allows accurately
adjusting population counts and reducing the main sources of under-coverage (typical of the
classical censuses) and over-coverage (typical of some PRs).
Technical and legal measures must be applied to ensure that the PR information to be checked is treated properly and separately from the rest of the census information throughout the whole operation.
The key points of the process can be summarised as:
a) The Population census is based on the Register data in order to improve precision, cut costs and bother the citizens as little as possible, taking advantage of the fact that register data can legally be used for statistical purposes;
b) The data collected in the census questionnaires are not transferred to the Register (as this
would violate statistical secrecy);
c) The modifications entered by the inhabitants in their register are noted on specific sheets
and sent to the Council so that, after performing the necessary verifications, the Register is
updated with the proper corrections.
Purpose of the integration activities: advantages and drawbacks, and complexity of the problem - The aim of this kind of Census is to reduce both costs and response burden through the use of relevant administrative registers, complemented with an exhaustive statistical operation with a twofold aim: to improve the accuracy of population counts and, in addition, to obtain from the census the variables not available from the combination of registers.
Although a long-term aim, a Census based exclusively on administrative registers was still considered unfeasible, due to the need for delicate legislative reforms, problems of social acceptance, the lack of a common identity number for each person, and administrative information that is neither standardised nor easily exploitable.
It also presents several drawbacks: register information gives rise to rights and duties, so registers are prone to contain 'convenient' rather than 'true' information; e.g., the PR address is not always the usual residence of a person (even though legislation provides so) but their best trade-off between rights and duties. Moreover, trusting indefinitely in the reliability of PR counts without any kind of periodical check is risky, because of the cumulative errors that, to a greater or lesser extent, every PR inevitably contains (the difficulty of measuring accurately the departures from the country is, perhaps, the best example).
Furthermore, a classical Census, as a comprehensive population count with slight or no relationship with the Register, would not be appropriate either: besides not suitably maximising the potential savings brought about by supporting the information with data from the Register, it would not satisfy the common-benefit relationship established in article 79 of the Population Regulation (see INE 2001, 12-14), which states that the carrying out of the Population Census must rely on data from all the Municipal Registers, and that the municipalities, owners of the registers, must help the INE in whatever it needs.
On the other hand, a Census consisting in a combination of registers and a complete
enumeration offers
a. More precise population counts than in a classical census, thanks to the previous
information contained in the PR (preventing undercoverage) and more precise than an
exclusively register-based census, thanks to the checking against reality that complete
enumeration supplies (preventing cumulative errors of the PR).
b. Information not available through register integration, which is obtained in an exhaustive,
classical, way, allowing maximum geographical and conceptual detail.
c. The longitudinal perspective allowed through the use of registers as its main support.
d. The use of more efficient collection methods, achieved via the previous knowledge of the
location where every person is registered.
However, drawbacks also come from that intermediate-point condition.
a. They are more expensive than exclusively register-based censuses, because of the
exhaustive collection operation (anyway, it should be cheaper than classical censuses).
b. Response burden, ceteris paribus and ruling out other factors, is also somewhere between
the minimum achieved in censuses without specific collection operation and the maximum
of censuses with no previous information support.
Specification of the sources to be included and other previous conditions - The availability of a PR, at least reliable enough as an initial solution to state how many people, who and where will be counted in the census figures, is obviously needed.
It is also advisable to have other administrative registers usable for census purposes, such as the Cadastre, tax declarations, Social Security General Affiliation files, public unemployment registers, educational qualification records, and so on.
This type of census, as regards its relationship with the PR, also has two variants, depending on whether the Census is simply supported by the PR, or whether the benefits are mutual, such that the PR uses the census operation to update and improve its information. In the latter case, the law governing the PR must explicitly provide for such use of the census operation to update the PR (while preserving statistical confidentiality for the strictly census-related information)16.
So, the Population Regulation enables the INE to perform operations to control the precision
of register data (article 78), and establishes (article 79) that these operations will be performed
on the occasion of the Population censuses. Councils must be notified of the results of these
operations. The modification of the register data should be entered in a specific document,
thus avoiding the legal problems that could derive from the direct use of the census
questionnaire to collect changes to be performed in the Register.
Scope and methods - The Population census only includes persons, regardless of their
nationality, whose regular address is located in the national territory.
As regards the Housing Census, the population scope considers dwellings and group establishments. Dwellings are considered to be all venues used for human habitation: family dwellings, and those others that, although not designed for that purpose, are actually inhabited on the date the Census is performed; the latter are called Accommodations.
The operation covers the whole national territory, and the counts of the different units refer to a single census date, in this case 1 November 2001.
The infrastructure was obtained using two relevant administrative files: The Register and the
Cadastre. The former provided the location of the buildings containing main dwellings (in
which there were people who resided there regularly) and the latter allowed the identification
of the other buildings (buildings without resident persons and commercial establishments).
- The CPR (Municipal Register of Inhabitants) information, established in the Law on the Basis of Local Regimes, is limited to: name and surname, ID number, address, sex, place and date of birth, nationality and school or academic education of the persons resident in the municipality. This information was also useful to rule out some additional questions which would otherwise have been needed to establish directly, together with kinship, which is the family core and what the structure of the most complex families is.
- The Urban Cadastre database, combined with the CPR, allowed a single census itinerary and also implied economic savings, since the preparation process that was traditionally performed in years ending in 0, called the Census of Buildings and Commercial Premises, was replaced by crossing both computerised databases, thus resulting in great advantages. It also allowed gathering the characteristics of all of the units: buildings, households and individual persons. That is to say, as the Housing and the Building Censuses have been jointly performed for the first time, it has been possible to study the characteristics of the population on the basis of the kind of building used as a residence.
The different nature of the PR information, which has both administrative and statistical purposes, and of the rest of the census data, which has only statistical purposes, must be clearly explained in the questionnaires. Separate sheets for the PR information may help to emphasise this essential distinction. And all along the processing, a proper separation between the two types of information must be assured; for instance, files containing personal identifications should never contain non-PR statistical variables.
16 To avoid the violation of the Fundamental Principles of Official Statistics (see UNECE / EUROSTAT, 2007, Chapter I, paragraph 17), stating that “individual data collected by statistical agencies […] are to be used exclusively for statistical purposes”.
The 2001 Census operation included four models of questionnaires, although not all the
persons resident in the dwellings had to complete them. The following documents were used:
 Register information. This document verified whether the information the Councils hold in the Register, which is printed on this sheet, is correct.
 Dwelling Questionnaire. This questionnaire garners the most important characteristics of
main dwellings (those where a person lives regularly).
 Household questionnaire. Gathering census variables that have to be completed by all
persons (kinship, marital status, type of studies, municipality of residence on March 1st
1991, etc.).
 Individual questionnaire. Only completed by persons aged 16 years or older who study or work. It contains information on the types of work or studies and the place where they are carried out.
Pre-filling questionnaires with PR information is a complex technical task, especially when associated with the large census volume and with the constraints imposed by optical reading technology; for example, the necessity of placing the information at a very exact location of the questionnaire for it to be effective, or the convenience of using blind colours for the fixed, non-informative parts of the questionnaires.
In order to update the CPR information, the INE was in charge of channelling the proposed
changes to the Councils. Proposals for the modification of register data, which were
performed by some citizens, were compiled and sent to each Council involved. In turn, after
carrying out additional verification procedures, the Council sent the INE the accepted
variations, which were introduced in the corresponding Register. Finally, the INE, after
receiving the confirmation, consolidated the variations in their copies of the register files.
The building and updating of the CPR - The INE keeps a backup of all the Municipal Registers, which was built from the files corresponding to the last ex-novo Register Renewal (referred to 1 May 1996) and from the monthly variations in the Municipal Register data issued by the Town Councils from then on, with the aim of co-ordinating them and preventing duplicates, in fulfilment of its obligations as imposed by the law in force.
The co-ordination of all the Municipal Registers (MRs) consists of checking and adding to
that backup all those monthly variations, and then reporting to the Town Councils any
inconsistencies detected. To that end, not only the data issued by the municipalities, but also
data from other administrative sources is used.
It is important to remark that, unlike other sources with an exclusively statistical purpose, the inconsistencies that are detected are notified to the corresponding municipality; the relevant information is kept in order to know at any moment whether the municipality has modified the data or not, and the reason for it, but the corrected data are never added directly to the database.
Therefore, the external sources with which the CPR is checked are as follows:

Source | Relevant facts for the CPR
Civil Register | Births and deaths; changes of nationality, name, surname and sex
Ministry of the Interior | Issues and renewals of National Identity Cards (DNIs) or, in the case of foreigners, of the documents that replace them: temporary residence permits, residence cards (NIEs)
Ministry of Education and Science | Issues of educational certificates
Ministry of Foreign Affairs | CPR of Spaniards Resident Abroad (PERE): Spaniards registered / unregistered because of moving abroad
Some basic indicators

 Information from the Town Councils and other administrative sources used for the initial load and the (approximately) monthly updates:

Source | Item | Initial load | Monthly procedure
Town Councils | Complete MR database and variations | 40.200.000 | 1.062.902
Ministry of the Interior | DNIs | 39.015.395 | 493.973
Ministry of the Interior | Residence cards | 2.111.725 | 150.883
Ministry of Education and Science | Certificates | 28.037.758 | (1)
Ministry of Foreign Affairs | PERE | 887.857 | 12.647
Civil Register | Births | 612.848 | 34.858
Civil Register | Deaths | 1.724.561 | 30.892

(1) Only the initial file has been received.
 (Approximately) monthly cross-checks between updated administrative sources and updated CPR tables:

Monthly new updates of the CPR database (including the variations reported by the Town Councils) are themselves cross-checked against information from administrative sources (both updated databases and updates). The average number of checks is shown in the table below.
Source | Item | Regular basis | Number of checks
Ministry of the Interior | DNIs | Monthly | 888.893
Ministry of the Interior | Residence cards | Monthly | 669.465
Civil Register | Births | Monthly | 78.999
Civil Register | Deaths | Monthly | 65.321
Ministry of Foreign Affairs | PERE | Monthly | 18.725
 Monthly reversals issued to the Town Councils:
The table below shows the number of transmissions to the Town Councils that are carried out
during the (approximately) monthly procedure of checking variations; and, amongst them,
how many of them are related to the cross-checks against the different sources: first, after the
initial issue and then, when performing the monthly procedure.
Source | Item | Initial issue | Monthly procedure
Total | | - | 462.776
Ministry of the Interior | DNIs | 3.004.512 | 60.123
Ministry of the Interior | Residence cards | 521.136 | 43.479
Ministry of the Interior | Card expirations | 1.807.084 | 75.849
Civil Register | Births | 28.113 | 1.599
Civil Register | Deaths | 188.588 | Not available
Ministry of Foreign Affairs | PERE | 20.065 | 2.562
It is also planned to transmit, during 2008, the new citizenships acquired by foreign nationals and the changes in the data on name, surname or sex held by the Civil Register.
Cross-checks - The process of identifying pairs of records when browsing variation datasets
from the Town Councils and the Civil Register consists of a sequence of the following
comparisons between common identifiers:
1. Complete agreement in name, surname, date of birth and DNI.
2. For those not matched in 1): partial agreement in name, surname, date of birth and
DNI.
3. For those not matched in 2): clerical review.
4. For records that were not yet matched, a new search is performed using several
agreement criteria, and a table containing all those data issued by the corresponding
source and related to the CPR database is stored, in order to identify, monthly, all
those variations not previously matched.
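The following Python sketch illustrates the logic of these successive passes. The record layout (dictionaries with 'name', 'surname', 'birth_date' and 'dni' fields) and the rule used here for "partial agreement" (at least three of the four identifiers) are assumptions for illustration only; they are not specified in this form by the INE.

    def match_records(council_recs, civil_recs):
        """Sequence of matching passes on the common identifiers (simplified sketch)."""
        fields = ("name", "surname", "birth_date", "dni")
        key = lambda r: tuple(r[f] for f in fields)
        civil_by_key = {key(r): r for r in civil_recs}

        matched, clerical = [], []
        for rec in council_recs:
            # pass 1: complete agreement on name, surname, date of birth and DNI
            if key(rec) in civil_by_key:
                matched.append((rec, civil_by_key[key(rec)]))
                continue
            # pass 2: partial agreement (here: at least three of the four identifiers)
            partial = [c for c in civil_recs
                       if sum(rec[f] == c[f] for f in fields) >= 3]
            if len(partial) == 1:
                matched.append((rec, partial[0]))
            else:
                # passes 3 and 4: clerical review / later searches with other criteria
                clerical.append(rec)
        return matched, clerical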
Conclusions - An integration of registers and an exhaustive collection operation improves
flexibility in the content, while reducing the response burden in comparison with a classical
census with the same information. Compared with integrating registers and sample surveys,
the main advantage is the complete geographical and conceptual detail of all the variables,
whether available in the registers or not.
Information from previous censuses and related administrative data also improves appropriate
processing and editing of data as well as imputation when incoherence or missing values are
detected.
Dissemination also benefits from the previous censuses, because of the longitudinal perspective they allow.
Finally, trusting indefinitely in the PR counts without any periodical check against reality has several, very tempting, advantages (a drastic reduction of census costs and respondent burden) but also a potentially very negative consequence: population figures may drift far apart from the true figures.
However, the issue of the risks associated with the use of the census to update the population registers is very important, so it deserves a more complete assessment. It is completely true that this risk exists, but it must be weighed against the disadvantages of the opposite choice.
So, each country having a PR should weigh up the relative advantages and disadvantages and decide whether a two-way relationship between census and PR is advisable or not. If it opts for doing so, some technical and legal aspects related to privacy issues, and to the differences between PR and Census data, will arise as especially relevant.
Bibliography
Instituto Nacional de Estadística (2001): "Population and Housing Census 2001: 2001 Census
Project". INEbase18, INE, Madrid. http://www.ine.es/en/censo2001/infotec_en.htm
Instituto Nacional de Estadística (2005a): "Outline of the census type planned for 2011 in
Spain". Submission to the United Nations Statistics Division Website on 2010 World
Population and Housing Censuses. INE, Madrid.
http://unstats.un.org/unsd/Demographic/sources/census/Spainpdf.pdf

18 INEbase is the system the INE uses to store statistical information on its corporate web page for the Internet.
Instituto Nacional de Estadística (2005b): "Description of the census type consisting in a
combination of registers and a complete enumeration", in Summary of Comments on the Draft
Version of the CES Recommendations for the 2010 censuses of Population and Housing. Joint
UNECE/EUROSTAT Meeting on Population and Housing Censuses Organised in
cooperation with UNFPA, Working paper No.5. Geneva, pp 36-38.
http://www.unece.org/stats/documents/ece/ces/ge.41/2005/wp.5.e.pdf
UNECE/EUROSTAT (2007): "Appendix II: Alternative approaches to census-taking", in
Recommendations for the 2010 censuses of Population and Housing. Conference of European
Statisticians (CES). UN, New York and Geneva, pp. 153-165.
http://www.unece.org/stats/documents/ece/ces/ge.41/2007/mtg1/zip.1.e.pdf
4.6. Administrative data source (DBP) for population statistics based on
ISEO register in the Czech Republic
Jaroslav Kraus (CZSO)
The Database of Persons (DBP) is a fully historical register of all persons with a permanent place of residence on the territory of the Czech Republic. Data from ISEO (Integrated System of Persons' Evidence) are the main constitutive source for creating the DBP. According to the legislative rules, only information about persons (individuals) is involved, without any relation to other individuals. Thus, no information about families or households is accessible.
Data model - There are some basic principles for the data model:
o Full history for all defined entities;
o Sharing of entities for common data storing;
o Data-driven contents of the monitored attributes;
o Lossless management of changes of the monitored entities.
Monitoring of history - Full monitoring of the main defined entities is based on the time validity of their contents. The status of the DBP with respect to historical data is given by the attributes and by the date of the view of the DBP.
Sharing classes of entities - Sharing classes of entities guarantees that common data are stored in one defined place. This approach is used for the entities of attributes of interest and attributes of dates.
Nomenclatures - The content of the nomenclatures of the DBP defines the range of attributes of interest and dates concerning physical persons. If the number and structure of the attributes increases, it is not necessary to change the structure of the data model, but only to change the content of the nomenclatures.
Lossless management of changes of monitored entities - This approach to the data model is given by:
o Only editing of attributes and availability of the record;
o Creation of a record of changes in a separate entity;
o Impossibility of physically deleting records in the DBP;
o Identification of all changes by a time stamp and the signature of the author.
These principles guarantee the storage of all information, with the possibility of following the historical development of the information.
Data contents
Data source (ISEO) - According to the agreement between the Ministry of the Interior and the Czech Statistical Office, data from ISEO (i.e. the Integrated System of Personal Evidence) are accepted as a data source for the DBP. ISEO will be used as records without relations to each other; this means that the structure of households is not, for the time being, accessible. However, it is possible to generate supplementary records in order to obtain additional information about the population. For example, for a record with information about a deceased person, as many records are added as there are persons (e.g. members of the household) connected to the deceased person (for example spouse, children, etc.).
Process of ISEO data transformation - The data coming from ISEO are transformed at a separate workplace, where 1,…,n records are generated from ISEO. Thus two data structures are created:
Copy of original ISEO data records - This part of the DBP contains copies of the records for the given persons <person_DBP>, followed by the possible records of parents (<father>, <mother>), partner <partner> and all children (<children>).
Related persons (parents, partners, and children) are stored in the RelatedPerson structure <related_person>. There is no personal identification in these records; the rest of the structure is the same as for the records in the previous paragraph.
A relation between these two types of records is not permitted.
Derived records of (related) persons - Derived records of related persons are created from the (ISEO) related persons (and their records). This derivation means that an ISEO RelatedPerson is defined as a DBP MainPerson, with the attribute information coming from the ISEO RelatedPerson. This record is stored in the <person_related> structure:
ISEO/Father: ISEO/father → Derived/Person_DBP; ISEO/Person_DBP → Derived/child/children
ISEO/Mother: ISEO/Mother → Derived/Person_DBP; ISEO/Person_DBP → Derived/child/children
ISEO/Partner: ISEO/partner → Derived/Person_DBP; ISEO/Person_DBP → Derived/partner
ISEO/child/children: ISEO/child/children → Derived/Person_DBP; ISEO/Person_DBP → Derived/father or Derived/Mother (by ISEO/Person_DBP/sex)
This approach respects the restrictions on the use of ISEO data without losing the information about family relationships. Although the restrictions on the use of ISEO data do not allow family information to be kept directly, it is possible to construct proxies of nuclear households (parents and children). These proxy households may be a sufficient approximation for many analytical needs. However, this idea should be tested in the future.
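As an illustration of how such derived records and proxy nuclear households could be assembled, the sketch below assumes a hypothetical dictionary representation of the persons in which the relations are stored as lists of person identifiers; the real DBP data model is not reproduced here.

    def derive_records(iseo_person, persons_by_id):
        """Create derived main-person records from the related persons of one ISEO record,
        following the mapping listed above (simplified sketch)."""
        derived = []
        for relation in ("father", "mother", "partner", "children"):
            for rel_id in iseo_person.get(relation, []):
                new_main = dict(persons_by_id[rel_id])   # the related person becomes a main person
                if relation in ("father", "mother"):
                    new_main["children"] = [iseo_person["id"]]
                elif relation == "partner":
                    new_main["partner"] = [iseo_person["id"]]
                else:  # a child: the original person becomes its father or mother, by sex
                    role = "father" if iseo_person["sex"] == "M" else "mother"
                    new_main[role] = [iseo_person["id"]]
                derived.append(new_main)
        return derived

    def proxy_household(iseo_person, persons_by_id):
        """Proxy nuclear household: the person together with his/her partner and children."""
        ids = [iseo_person["id"]] + iseo_person.get("partner", []) + iseo_person.get("children", [])
        return [persons_by_id[i] for i in ids if i in persons_by_id]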
4.7. An experiment of statistical matching between Labour Force Survey
(RFL) and Time Use Survey (TUS)
Gianni Corsetti (ISFOL, Italy)
The ISTAT social surveys are a privileged framework for testing the potential of statistical matching methods and for producing a critical evaluation of them.
Taking advantage of such an opportunity, the objective that was tackled was the creation of a synthetic archive through the integration of data collected in two important ISTAT surveys: the current Labour Force Survey (RFL) and the Time Use Survey (TUS).
The creation of such a dataset could allow the study of the relationships between the specific
variables of each survey. The presence of a wide set of common variables and of similarities
in the design of the survey model helps the data integration process. The two surveys also
share many themes, so the definitions used and the questions asked were harmonized during
the planning phase.
The TUS survey is very rich and articulated as far as the organization of time use and the perception of quality of life are concerned. The survey, conducted in 2002-2003 in compliance with Eurostat's Guidelines, foresaw, along with an individual questionnaire and a family questionnaire, the collection of the daily time use of the people living in Italy through the use of a daily journal and a weekly journal.
The survey on labour forces represents the primary source of information for studying the job market.
Using the data of the specific variables of the two surveys together could allow a researcher to analyse at the same time the characteristics of the labour force and of time usage. In this way it is possible, on one side, to enrich the analysis of the job market with subjective information and with the detail of daily life organisation and, on the other side, to integrate the more general analysis of the quality of life provided by TUS with in-depth information on the characteristics of the work condition collected by the RFL.
The data - In order to undertake a first experiment of integration of the two surveys studied
through techniques of statistical combination, Tus has been considered as the recipient and
Rfl as the donor survey, respectively.
More precisely, the analysis was restricted only to the employed people, and the Tus dataset
consisted of all the records (22,312 records) of the individuals observed during the entire
collection period of the survey (April 2002 – March 2003).
As far as the RFL survey is concerned, only the observations collected during the first three
months of 2003 (30,526 records) were considered, so that the records in the two surveys could
be considered as coming from homogeneous populations.
Statistical matching was performed on individual unweighted data, while for the ex-post evaluation of the results the weights of the recipient survey, i.e. Tus, were used on the overall completed data set.
The common variables present in both surveys, which, appropriately harmonised, were used to perform the unconstrained statistical matching, are:
 Sex;
 Age groups;
 Status;
 Education;
 Family size;
 Geographical Area;
 Type of town;
 Area of employment;
 Type of contract.
The marginal distributions of the common variables in the two surveys do not show significant differences. The most considerable differences are in the weighted distributions of the variable “Family Size”, where the RFL estimates present an imbalance towards the smaller families compared to the Tus estimates, and of the variable “Type of town” for some categories.
As a matter of fact some of these variables have been used as stratification variables. This
implies that the search for a donor was performed amongst those units with an identical value
for such variables. The same stratification variables were not always used: because of
computational constraints the first two stratification variables were always kept
(“Geographical Area” and “Type of contract”), while the other variables were introduced
when necessary, so that the number of records in a stratum was small enough to allow the
software for statistical matching to execute.
Table 1 describes the seventeen strata built in this way and the corresponding numbers of records in the two archives.
Table 1 – Description of the strata created for the matching process

Stratum | Geographical Area | Type of Contract | Sex | Age | Geographical Area (detail) | Number of Records Tus 2002-2003 | Number of Records Rfl 1st Trimester 2003
1114c | North-centre | Permanent | Male | <=44 | Centre | 1.084 | 1.315
1114e | North-centre | Permanent | Male | <=44 | North-east | 1.262 | 1.956
1114o | North-centre | Permanent | Male | <=44 | North-west | 1.739 | 2.012
1115 | North-centre | Permanent | Male | >=45 | | 1.994 | 2.488
1124c | North-centre | Permanent | Female | <=44 | Centre | 907 | 1.071
1124e | North-centre | Permanent | Female | <=44 | North-east | 1.210 | 1.697
1124o | North-centre | Permanent | Female | <=44 | North-west | 1.638 | 1.774
1125 | North-centre | Permanent | Female | >=45 | | 1.451 | 1.838
1214 | North-centre | Self-employed | Male | <=44 | | 1.432 | 1.979
1215 | North-centre | Self-employed | Male | >=45 | | 1.322 | 1.856
122 | North-centre | Self-employed | Female | | | 1.333 | 1.963
2114 | South-islands | Permanent | Male | <=44 | | 2.018 | 2.974
2115 | South-islands | Permanent | Male | >=45 | | 1.241 | 1.920
2124 | South-islands | Permanent | Female | <=44 | | 1.154 | 1.692
2125 | South-islands | Permanent | Female | >=45 | | 629 | 1.030
221 | South-islands | Self-employed | Male | | | 1.395 | 2.107
222 | South-islands | Self-employed | Female | | | 503 | 854
Total | | | | | | 22.312 | 30.526
Matching technique used – In this work, we have chosen to work under the hypothesis of
conditional independence between the specific variables in Tus and RFL given the matching
variables. The synthetic archive was created through the implementation of the hot-deck
donation techniques.
Gower's similarity index is the distance function used for selecting the record from the donor
survey (Rfl) to associate to each record in the recipient survey (Tus). This index is calculated
amongst every pair of records from the donor and recipient files; if there are two or more
observations in the donor survey that present the same minimum distance from the recipient
record, then a unit is chosen randomly by assigning to each of them the same probability. The
software described in Sacco (2008) was used for the application of matching methods.
During the statistical matching process and the computation of the Gower distance, all the common variables that do not represent a stratification variable in the n-th stratum were given a weight equal to one. In this way, it was decided to give all the variables the same importance.
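The following Python/pandas sketch illustrates the unconstrained distance hot-deck within strata described above, under the simplifying assumptions that all matching variables are categorical and equally weighted; the actual application was carried out with the SAMWIN software (Sacco, 2008), not with this code.

    import numpy as np
    import pandas as pd

    def gower_distance(rec, donors, match_vars):
        """Gower distance when all matching variables are categorical: the equally
        weighted proportion of mismatching variables (values in [0, 1])."""
        mismatches = sum((donors[v] != rec[v]).astype(float) for v in match_vars)
        return mismatches / len(match_vars)

    def hot_deck_match(recipients, donors, strata, match_vars, seed=0):
        """Unconstrained distance hot-deck within strata: every recipient record gets
        the donor at minimum Gower distance; ties are broken at random with equal
        probability, as described in the text."""
        rng = np.random.default_rng(seed)
        donor_groups = dict(list(donors.groupby(strata)))
        chosen = pd.Series(index=recipients.index, dtype=object)
        for key, rec_group in recipients.groupby(strata):
            don_group = donor_groups[key]              # donors of the same stratum
            for i, rec in rec_group.iterrows():
                d = gower_distance(rec, don_group, match_vars)
                ties = d.index[d == d.min()]           # all donors at minimum distance
                chosen[i] = rng.choice(ties)
        return chosen  # donor row label assigned to each recipient record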
First Results - The reliability of the results obtained by a matching process depends on how well the hypothesis of conditional independence is preserved, which is not empirically verifiable. In this project it was not possible to use external auxiliary information to evaluate the quality of the combination and the respect of the hypothesis of conditional independence. In order to perform a first evaluation of the results, some common variables that do not play the role of matching or stratification variables were used. These are usually called “control variables”.
Table 2 shows one of the possible results obtained with the implemented statistical matching.
It is a cross-table of two specific variables: paid hours worked in an average weekday (Tus)
and the willingness to work a number of hours different to those worked in the week
preceding the data collection (Rfl)19.
Table 2 – Frequency distribution between two variables specific to the two surveys in the synthetic archive (absolute and percentual values). Rows: willingness to work a number of hours different to those worked in the week analysed (synthetic archive: variable donated by Rfl, 1st trimester 2003). Columns: hours worked in an average weekday (average generic length).

ABSOLUTE FREQUENCIES
 | 0-3 hours | 4-8 hours | 9 hours and more | Total
Yes, less hours | 901 | 1.374 | 509 | 2.784
Yes, more hours | 577 | 835 | 238 | 1.650
No, the same hours | 5.669 | 8.247 | 2.736 | 16.651
Don’t Know | 85 | 163 | 54 | 302
Total | 7.232 | 10.619 | 3.537 | 21.387

PERCENTUAL FREQUENCIES FOR 100 PEOPLE WHO WORK THE SAME HOURS
 | 0-3 hours | 4-8 hours | 9 hours and more | Total
Yes, less hours | 12,5 | 12,9 | 14,4 | 13,0
Yes, more hours | 8,0 | 7,9 | 6,7 | 7,7
No, the same hours | 78,4 | 77,7 | 77,3 | 77,9
Don’t Know | 1,2 | 1,5 | 1,5 | 1,4
Total | 100,0 | 100,0 | 100,0 | 100,0
The correctness of such a result can be verified by checking whether the marginal distribution of the combined variable is maintained compared to its original marginal distribution in Rfl. Table 3 shows that this first control is successful because, for both absolute and relative frequencies, the marginal distributions of the variable are homogeneous.
19 The reference population of this table is represented by the employed persons who had performed at least one hour of work in the week preceding the Rfl data collection.
Table 3 - Marginal distribution of the specific Rfl variable ("willingness to work a number of hours different to those worked in the week during data collection") in the original and synthetic archive (absolute and percentual values).

 | Yes, less hours | Yes, more hours | No, the same hours | Don’t Know | Total
ABSOLUTE FREQUENCIES
Rfl 1st trimester 2003 | 2.780 | 1.857 | 15.928 | 275 | 20.840
Synthetic archive: variable donated by Rfl 1st trimester 2003 | 2.692 | 1.691 | 16.757 | 276 | 21.416
PERCENTUAL FREQUENCIES
Rfl 1st trimester 2003 | 13,3 | 8,9 | 76,4 | 1,3 | 100,0
Synthetic archive: variable donated by Rfl 1st trimester 2003 | 12,6 | 7,9 | 78,2 | 1,3 | 100,0
Another quality check consists in evaluating whether the joint distribution of a specific Tus variable with a control variable is also comparable (the control variable can be either the genuinely observed variable in the TUS or the corresponding imputed variable from the RFL). In this case, the variables chosen are:
 working hours (full time, part-time);
 presence of a second job (yes or no);
 night work (yes, no or don’t know).
First of all it should be underlined that as far as the marginal frequency distributions are
concerned, the control variables used are very similar and, above all, the marginal
distributions of the control variables imputed in the synthetic archive are almost completely
unaltered in comparison to those of the original Rfl archive (Table 4).
Table 4 – Percentual marginal distribution of the shared control variables in the receiving archive, in the donating archive and in the synthetic archive (synthetic archive: variable donated by Rfl 1st trimester 2003).

Common Control Variable | Weighed data: Tus 2002-2003 | Weighed data: Synthetic archive | Weighed data: Rfl 1st trimester 2003 | Non weighed data: Tus 2002-2003 | Non weighed data: Synthetic archive | Non weighed data: Rfl 1st trimester 2003
Working Hours: Full Time | 88,7 | 87,8 | 87,3 | 88,7 | 87,5 | 87,6
Working Hours: Part Time | 11,3 | 12,2 | 12,7 | 11,3 | 12,5 | 12,4
Second Job: Yes | 3,6 | 2,6 | 2,7 | 3,7 | 2,6 | 2,6
Second Job: No | 96,4 | 97,4 | 97,3 | 96,4 | 97,4 | 97,4
Night Work: Yes | 13,7 | 11,2 | 11,4 | 13,9 | 11,3 | 11,3
Night Work: No | 86,3 | 88,5 | 88,3 | 86,1 | 88,5 | 88,3
Night Work: Don’t know | - | 0,3 | 0,3 | - | 0,3 | 0,4
Table 5 shows the contingency tables obtained by crossing the specific Tus variable (paid hours worked in an average weekday) and the control variables in the original Tus archive and in the synthetic archive (variables donated by Rfl). Also in this case the results are quite satisfactory. This allows us to state that, for the variables analysed, the joint frequency distributions produced by the statistical combination represent quite closely the corresponding distributions of the original Tus archive.
A more detailed check is obtained by observing whether the relationship that connects the variables of interest (the specific Tus variable and the common control variables) is adequately reproduced in the synthetic archive. Table 6 presents the odds ratios calculated on the previous two-way frequency distributions for each pair of categories of the specific Tus variable. The type of relationship between the variables is preserved, even if in the synthetic archive this correspondence always seems to be less intense than in the original Tus archive. There is, however, a strong differentiation in the results obtained for the three different control variables: for the “Working Hours” variable the odds ratios calculated in the two different archives are more distant than those calculated for the other two control variables.
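For reference, an odds ratio for a pair of categories of the Tus variable and the two categories of a control variable can be computed from the corresponding 2x2 sub-table of counts as in the sketch below; the counts used here are purely illustrative and are not taken from the report's tables.

    def odds_ratio(table, row_a, row_b, col_a, col_b):
        """Odds ratio of the 2x2 sub-table formed by two rows and two columns of a
        contingency table given as a dict of dicts of counts."""
        return (table[row_a][col_a] * table[row_b][col_b]) / \
               (table[row_a][col_b] * table[row_b][col_a])

    # toy counts, purely illustrative
    toy = {"0-3 hours": {"full time": 50, "part time": 30},
           "4-8 hours": {"full time": 80, "part time": 20}}
    print(round(odds_ratio(toy, "0-3 hours", "4-8 hours", "full time", "part time"), 2))  # 0.42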
Table 6 - Odds ratios calculated on the frequency distribution between the specific "Time Use" variable (hours worked in an average weekday) and the shared control variables, coming from the original Time Use archive and from the synthetic archive donated by Rfl.

Common Control Variable | Tus 2002-2003: 1st and 2nd type | Tus 2002-2003: 1st and 3rd type | Synthetic archive (Rfl 1st trimester 2003): 1st and 2nd type | Synthetic archive (Rfl 1st trimester 2003): 1st and 3rd type
Working Hours | 0,37 | 0,11 | 0,82 | 0,43
Second Job | 0,91 | 0,64 | 0,92 | 0,67
Night Work | 0,84 | 0,54 | 0,89 | 0,86
Bibliography
Sacco, G., 2008. SAMWIN: a software for statistical matching. Available on the CENEX-ISAD
webpage (http://cenex-isad.istat.it), go to public area/documents/technical reports and documentation.
5. Results of the survey on the use and/or development of
integration methodologies in the different ESS countries
5.1. Introduction
Mauro Scanu (ISTAT)
One of the objectives of the CENEX-ISAD is the investigation of the state of the art on the
use and/or development of integration methodologies in the different ESS countries. In order
to tackle this task, it was necessary to take into consideration the following facts:
1. usually the NSIs do not have a centralised office whose aim is the development and/or the
application of integration methodologies.
2. the area “integration” is rather large, including tasks such as metadata management (for the harmonization of the different sources to integrate) and imputation methodologies.
3. the use of integration procedures is justified by informative needs that may occur in the most diverse situations: most of the projects on integration are directly conducted by the managers of household or enterprise surveys.
The previous aspects did not allow the definition of a list of people in the NSIs to contact in
order to understand the state of the art of the use and/or development of integration
methodologies in the ESS. For this reason, we decided to investigate a more difficult target
population: the population of the projects that involve the integration of two or more sources.
The ESS consists of European countries, of which 34 (the 27 member countries of the EU plus Switzerland, Norway, Iceland, Turkey, Croatia, Macedonia and Liechtenstein) have been contacted. In a limited period of time (one month and a half) we received answers from 21 NSIs; 19 of them have at least one project concerning the integration of two or more sources (Greece and Macedonia declared that no integration projects are currently active).
Country | Number of projects
Italy | 5
Czech Republic | 4
Austria | 3
Spain | 3
UK | 3
France | 3
Netherlands | 2
Finland | 2
Switzerland | 2
Malta | 1
Romania | 1
Hungary | 1
Sweden | 1
Cyprus | 1
Latvia | 1
Slovenia | 1
Denmark | 1
Belgium | 1
Germany | 1
We received 37 filled in questionnaires (projects). The previous table illustrates the number of
projects of the respondents by country.
The questionnaire focuses on only one project involving integration of two or more sources.
For each project, the following topics are investigated in detail:
a) objective of the integration process and characteristics of the files to integrate;
b) privacy issues;
c) problems of the integration process and methods;
d) software issues;
e) documentation on the integration project;
f) planned changes;
g) possibility to establish a link of experts.
In the following, each topic is illustrated in a paragraph with a description of the 37 filled in
questionnaires.
5.2. Objective of the integration process and characteristics of the files
to integrate
Luis Esteban Barbado (INE)
This part of the questionnaire allowed the creation of a list of experts on the application of
methodologies whose aim is the integration of sample surveys and archives. This list is
composed of 44 names (for some projects, more than one name was available).
The other questions in this part of the questionnaire aim at describing the main characteristics of the data integration project and of the sources used for the project. The main features of the observed projects are the following: the integration process turns out to be useful especially for the construction of archives and registers; the projects are generally multipurpose, although the main objective is the reduction of the response burden; apart from some projects, they make use of no more than 10 sources; finally, it is inevitable to make use of sources which are not managed by the statistical office, although this leads to many problems in terms of harmonization and quality of the collected data.
Question 5 Main area of interest of the data integration activity
Of all the projects observed in this survey 41% can be classified as activities to improve the
statistical infrastructure (censuses or registers), and some 32% are devoted to activities related
to statistical production, where both social and economic surveys are included.
Main area of interest                          Number of projects
To carry out a business survey                 6
To carry out a social survey                   6
To produce a population or housing census      6
To produce a business census                   1
To produce a population archive/register       1
To produce a business archive/register         7
Other                                          10
Question 6 Objectives of the data integration activity
The category “Other” represents 27% of all responses. Apart from the existence of some
projects with several areas of interest, there are also notable projects which aim to address
areas not included in the questionnaire list, such as National Accounts.
Generally, they can be classified as projects devoted to satisfying specific needs of statistical
production in a wide variety of areas of interest.
Concerning the objectives, a total of 117 responses have been collected, which means an
average of slightly more than 3 objectives per project. With some exceptions (microsimulation
policies), all of the options are well represented, and they are consistent with the
current challenges and problems at Statistical Offices: how to combine the needs of
rationalising costs and information demands with the growing commitment to the provision
of more and better quality information for users.
An analysis by project typology makes it possible to detect different priorities among the
objectives. For projects related to statistical production, the aims of reducing costs and
response burden, enhancing editing and imputation processes, and improving estimation
methods are the most prominent, having been indicated for about 70% of all projects. However,
and as was to be expected, infrastructure projects are more connected to objectives related to
the maintenance or setting up of a sampling frame, and 41% of all projects indicated this objective.
Finally, regarding projects included in the category “Other”, and given their nature, a much
more uniform distribution of objectives is detected, since none of the options represents
more than 20% of the total; improvement of estimation methods (19%) and reduction of costs
and response burden (15%) are the most frequent.
Objectives                                                                           Number of responses
Traditional census activities (for instance post-enumeration surveys)                9
Reduction of costs and response burden                                              23
Enhancement of editing and imputation processes                                     11
Analysis of statistical relations                                                   16
Microsimulation policies                                                             4
Register/archive maintenance                                                        14
Improvement of estimation methods (weighting, imputation, small area estimators)    18
Set up of a sampling frame (for instance improvement of coverage)                   17
Other                                                                                5
Question 7 How many other sources have been used in the project?
Question 8 For the main sources used for the data integration project (maximum 14) specify
the following details: source name and nature
As a whole, the 37 observed projects use a total of 327 sources, which means an average of
about 9 sources per project. There are 10 projects which use more than 10 sources, and one of
them uses more than 40 sources.
Taking into account some constraints in the questionnaire (a maximum of 14 sources per
project can be documented), question 8 makes it possible to set up a basic typology of the
sources used. The distribution according to this criterion shows a clear predominance of
archive/register sources, which represent about 64% of the total. Sample sources are the
second most important group, with 20% of the total.
Among the sources of the first group, the intensive use of sources managed by Tax or Social
Security Authorities is remarkable, whatever project is considered.
It is also possible to detect the different relevance of each typology of sources according
to the field of the project, as shown in the table below.
Main area of interest                           Archive/register   Sample   Census   Other
To carry out a business / social survey         58,8%              28,0%    5,9%     7,3%
To produce a population / business census       79,7%              6,1%     12,2%    2,0%
To produce a population / business register     78,7%              17,0%    0,0%     4,3%
Other                                           49,4%              24,7%    2,5%     23,4%
Total                                           63,7%              20,4%    4,9%     11,0%
Question 9 Following the sources order given in question 8, answer the following questions
about each source: number of units (approximate figures are acceptable); specify whether the
source is managed by your institute or not; whether the source is well documented; whether
each record of the source is integrated with other records of other files, or whether the source
is used only in order to compute aggregated values (totals, frequencies) to be used in the
integration process.
The information collected in this question allows some additional conclusions. With regard to
the managing institution, the survey shows that 42% of the sources are managed by the
Statistical Office itself. This rate grows to 45% for projects related to statistical production,
where sample sources obviously play an important role.
By contrast, access to external sources is a common practice in the management of business
censuses or registers. For these projects, external sources represent, as a whole, 84%
of the total.
The quality level of the documentation available for each source must be understood as a
subjective perception of each informant, since no a priori parameters to quantify this concept
were set. Broadly, the quality of the documentation seems satisfactory for the users of the
sources: 62% of them rated it as good, and 28% as excellent. Only for projects devoted to
economic surveys is a certain degree of dissatisfaction noticed, since 26% of the sources show
a poor level of documentation.
The third criterion to be analysed was the use of the data source, either at the unit level or as
aggregate data. It should be noted that the two procedures are not mutually exclusive, as
the responses show. About 88% of all sources are used only at the unit level; 8% of
them are used jointly, that is to say, both basic units and aggregates obtained from the source
are used. This case occurred solely in projects whose main area of interest was
classified as “Other”. Finally, only 4% of the sources were used solely as aggregate data.
As far as the number of units of each source is concerned, this item shows partial
non-response or non-numerical answers in some questionnaires. Nor is it possible to know
whether each source is used in full or only in part, which makes a detailed analysis more
difficult. As was to be expected, and within these constraints, the responses show that the
sources with the largest number of individual records are found among projects related to
social statistics or population registers.
5.3. Privacy issues
Eric Schulte Nordholt (CBS)
In this section the results on the privacy issues in the questionnaire are discussed. It is
interesting to start the discussion with a comparison with the legal section of the inventory on
Statistical Disclosure Control, which was filled in last year by 25 countries in the CENEX
on SDC. Most of these countries consider the legal protection of their data to be very
important. Whether these data concern natural persons or enterprises makes no difference in
importance. Most countries pay attention to the legislative and administrative aspects of
confidentiality. Most countries also pay attention, often or very often, to the mathematical and
computing aspects of confidentiality as well as to the organisational aspects. Most countries
answered that they have a data protection law.
France introduced a data protection law (relevant to the protection of statistical data) in 1978;
the other countries participating in the inventory introduced or changed such a law in the last
fifteen years. Most countries answered that they have principles and laws on public access to
government information that are relevant to the protection of statistical data. Most countries
have specific regulations on statistical confidentiality. In most statistical offices, internal
regulations on statistical confidentiality have been introduced only recently or were changed
over the last couple of years.
Different countries have different definitions of confidential data. Some countries mention
that data must be (directly) identifiable in order to be considered confidential. In some
countries the statistical law only refers to personal data. In some countries the statistical law
does not mention the concept of confidential data at all, and sometimes different definitions
are used depending on the context. In the statistical laws of some countries a reference can be
found to the implementation of EU legislation.
Most countries have no statistics specific rules for the release of confidential data. In almost
all countries almost all enterprise data are considered confidential. Most countries have no
special rules that apply to the transmission of data to Eurostat. Most staff members of
statistical institutes in European countries have to sign confidentiality warrants. Penalties can
be imposed for (intentional) breaches of statistical confidentiality.
Only two statistical agencies answered that they very often conduct assessments of public
attitudes, perceptions and reactions to confidentiality. A majority of the statistical institutes in
this inventory uses registers as e.g. the population register and the business register very often.
Specific confidentiality rules concerning the use of these register data for statistical purposes
are applied in some countries.
In most offices a certain number of staff members have been made responsible for ensuring
statistical data confidentiality. In almost all countries universities (and research centres) have
the option to use individual data concerning natural persons for research purposes. In the
majority of the countries this option also exists for individual data concerning enterprises.
Business organisations, fiscal authorities and marketing organisations can in general not get
access to individual data. It is remarkable, however, that legal authorities such as the police can
get access to individual data in a considerable minority of the countries. Finally, for other
governmental organisations the picture is somewhat mixed. In about half of the countries
these organisations can get access to individual data.
In a majority of the countries a review panel (e.g. an ethical or statistical committee) exists to
judge whether statistical data are sufficiently safe for use by persons outside the statistical
office. Most of these review panels are internal committees. In a majority of the countries
respondents can authorise the agency to provide their own individual data to a specified third
party (informed consent).
In most countries the variables on racial or ethnic origin, political opinions and religious or
philosophical beliefs were considered as sensitive. Also data concerning health and sex life
were considered sensitive in most countries. In addition in a majority of the countries trade
union membership, data relating to offences, criminal convictions and security measures and
data related to incomes were seen as sensitive. Data about professions and educational data
were considered as sensitive in a minority of the countries only.
Special licensing agreements exist in a minority of the countries. Access under contract for
named researchers exists in a majority of the countries. About half of the countries have the
option of access only for specially sworn employees. Also about half of the countries screen
the results with respect to disclosure control but only a minority screens the users. For
statistics on persons and administrative data the option of access only in controlled
environment does not exist for the majority of the statistical institutes, but for statistics on
enterprises there is a small majority where this option exists. The option of trusted third
parties that keep the keys necessary for identification hardly exists in Europe.
Most countries release microdata concerning natural persons and enterprises. However, Public
Use Files (PUFs) exist only in a minority of the countries. On the other hand, the majority of the
countries releases Microdata Under Contract (MUCs), although not on administrative data.
Synthetic data files are hardly produced in Europe and the majority of the countries also has
no on-site facility or online access option.
Many offices have organisational, methodological and software problems concerning
statistical confidentiality development. Most countries do not receive technical assistance
from other countries to help with the implementation of disclosure control. However, many
countries would like to receive help to solve in particular their software problems.
Now we move to the two questions asked in this CENEX on ISAD on confidentiality. It is
apparent that there are legal issues connected with the use of administrative data in almost all
countries, and that each country is characterized by different regulations and restrictions on
the use of microdata.
Question 10. Is there a legal foundation which regulates the supply of administrative data and
its usage by the NSI?
Yes    32
No      5
For 32 (86%) of the 37 data integration projects there is a legal foundation which regulates the
supply of administrative data and its usage by the NSI.
Question 11. In which way regulations and laws on privacy affect the data integration
project? (please put an ‘X’ in one or more cells; multiple answers are allowed).
Problems                                                                     Number of projects
Unit identifiers have to be cancelled in some data sets                      13
Some data sets can provide only aggregate data                                7
One or more archives/samples useful for integration goals can not be used     5
Linking of some groups of administrative data is prohibited by law            4
Other                                                                         8
None of the four problems identified as answering categories in the questionnaire were
mentioned for a majority of the data integration projects. For 8 (22%) of the 37 projects the
option ‘Other’ was chosen. For those eight data integration projects it was specified what
other problem could affect the project.
- In one of the Austrian projects the identifiers in each source were or will be replaced by a
protected PIN, produced for each source by the data protection office from name, date of
birth and place of birth via the central population register. As a consequence, data twins and
persons that cannot be found become the problem.
- In another Austrian project it was stated that the Austrian federal law
(“Bundesstatistikgesetz”) forces Statistics Austria to use administrative data as main sources.
- In the Census project of the Czech Statistical Office the possibilities will become clear after
the acceptance of the Census Act.
- In one of the UK projects it was mentioned that the ONS has developed statistical
disclosure control methods specific to the outputs of research facilities.
- The Finnish Statistics Act states that, whenever possible, administrative data must be used.
The data received from administrative records for statistical purposes are confidential.
According to the Statistics Act, collected data may be released for purposes of scientific
research and statistical surveys concerning societal conditions in anonymised form (e.g.
EU-SILC microdata are transmitted to and distributed by Eurostat).
- In one of the ISTAT questionnaires it was mentioned that there were no problems with the
sources currently used. Data quality on foreign control could be improved by using additional
sources maintained by the Bank of Italy, but this is prevented by legal problems (as they are
not part of the Italian statistical system).
- In another ISTAT questionnaire it remained to be seen to what extent the required
information will be available at unit level, according to the Italian regulations on
confidentiality.
- A common identifier is missing for all administrative data in Destatis, the German
Statistical Office.
5.4. Problems of the integration process and methods
Nicoletta Cibella and Tiziana Tuoto (Istat)
This part of the questionnaire investigates the main methodological aspects of the data
integration project. It turns out that almost all the data integration projects must deal with the
problem of harmonising the different sources. Probabilistic methods as well as statistical
matching methods are still seldom used, although there are projects that apply all the
methodologies. Finally, the last questions of this section investigate if and how quality
evaluations are performed.
Question 12. Which problems affect the data integration process? (Please put an ‘X’ in one or
more cells; multiple answers are allowed).
Problems                                     Number of projects
Harmonization of statistical units           27
Harmonization of reference periods           18
Completion of coverage                       27
Harmonization of variables definitions       28
Harmonization of classifications             17
Parsing (see footnote 20)                     9
Adjustment for measurement errors            11
Imputation for item non-response             19
Derivation of new variables                   7
Check the overall consistency                18
Other:                                        2
  Harmonization of editing procedures         1
  Construction of key for the linkage         1
No response                                   1
The previous table describes almost all the collected questionnaires (some of the collected
integration projects are still under development, and their characteristics could not yet be
clearly defined).
Generally speaking, the activities regarding the harmonization phases (harmonization of
statistical units, of reference periods, of variables definitions, of classifications, completion of
coverage, parsing, derivation of new variables, harmonization of editing procedures) seem to
be the most demanding ones, involving 133 answers; on the other hand, the problems connected
with the presence of non-sampling errors (adjustment for measurement errors, imputation for
item non-response, check of the overall consistency) are less relevant, occurring 48 times.
20. Parsing divides a free-form name field into a common set of components that can be compared. Parsing
algorithms often use hints based on words that have been standardised. This approach is usually used for
comparing person names and surnames, street names, city names (see Section 20.3 of Winkler, W.E. (1995).
Matching and Record Linkage. In Business Survey Methods (Cox B.G., Binder D.A., Chinnappa B.N.,
Christianson A., Colledge M., Kott P.S. (eds.)), pp. 355-384. Wiley, New York).
Question 13. Main method used.
Method                                                                     Number of projects
Exact record linkage (linking records corresponding to the same unit
  from two data sources by merging identifying variables)                  32
Probabilistic record linkage (e.g. Fellegi-Sunter approach; linking
  records corresponding to the same unit from two data sources by
  probabilistic methods)                                                    6
Statistical matching (linking records corresponding to ‘similar units’
  from two distinct sample surveys)                                         5
Other data integration procedures                                           3
By far the most common approach in the integration projects is the application of exact record
linkage procedures: exact record linkage is considered in 32 situations, while statistical
matching is applied in 5 cases. In one case, imputation has been used in order to integrate data
(this approach can be associated with the statistical matching procedures).
Actually, some projects report the presence of more than one main integration method (for
this reason, the total of the previous table is more than 37). This is possible especially when
the integration phase should combine a large number of data sources with different
characteristics.
In the following table, the total number of projects that make use of more than one integration
method is illustrated.
Combination of methods                                                        Number of projects
Exact record linkage + other data integration procedures (imputation)        2
Exact record linkage + probabilistic record linkage                          3
Exact record linkage + statistical matching                                  3
Exact record linkage + probabilistic record linkage + statistical matching   2
Exact record linkage + probabilistic record linkage + statistical matching
  + other data integration procedures                                        1
From the joint analysis of questions 12 and 13, no evidence appears of a dependence between
the problems affecting the data integration process and the method implemented. In other
words, the issues concerning the harmonization phase or the presence of non-sampling errors
seem to affect exact record linkage, probabilistic record linkage, statistical matching and other
data integration procedures in a similar way. Since almost all the answers are concentrated on
the exact record linkage method, the comparison of the two classes of problems among
different methods cannot be investigated in depth.
Problems               Exact record linkage   Probabilistic record linkage   Statistical matching   Other
Harmonization phase    30                     6                              3                      4
Non sampling errors    20                     4                              2                      3
In the table below, the number of problems affecting the data integration projects is reported,
broken down by the adopted integration method. During the data integration process, more
than one type of problem usually arose. Only two projects declared no problem at all in
integrating different sources, while the majority of the processes faced more than 5 types of
problems simultaneously.
Number of problems arising simultaneously, by method (Exact record linkage, Probabilistic record linkage, Statistical matching, Other):
No error affects the process
Only 1 type of error
2 types of error
3 types of error
4 types of error
5 types of error
6 types of error
7 types of error
8 types of error
9 types of error
10 or more types of error
2
2
2
5
5
2
2
3
1
2
1
2
2
2
1
1
1
1
1
Question 14. Brief description of the method used.
Regarding the exact record linkage procedures, in many projects the respondents mostly
highlight the variable or variables used for merging the units. The most frequent merging
variable is a single identification code. In business areas the code is a VAT registration
number (or a tax number) or the enterprise number, whereas in socio-demographic fields the
personal identification number is obtained on the basis of a single variable or of a combination
of variables (name, place and date of birth). In addition, in a few cases the integration phase is
preceded by specific quality and editing controls (mainly on identifiers) so as to determine the
best matching variables. Sometimes the missing and/or erroneous values in the key variables
are imputed on the basis of time series or from other external sources. Finally, for the cases not
resolved by the automatic exact record linkage procedure, the links are checked and assigned
manually.
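To make the mechanics of this step concrete, the following minimal sketch (not taken from any of the reported projects) shows an exact linkage on a shared identifier using Python and pandas; the column names and values are invented for the illustration.

    # Minimal sketch of exact record linkage on a shared identifier.
    # All column names and values are illustrative only.
    import pandas as pd

    survey = pd.DataFrame({
        "vat_number": ["IT001", "IT002", "IT003"],
        "turnover":   [120.0, 85.5, 42.0],
    })
    register = pd.DataFrame({
        "vat_number": ["IT001", "IT003", "IT004"],
        "employees":  [15, 7, 30],
    })

    # Exact linkage: keep only the pairs whose key values agree perfectly.
    linked = survey.merge(register, on="vat_number", how="inner")

    # Records not resolved by the automatic procedure can be set aside
    # for clerical review, as described above.
    unresolved = survey[~survey["vat_number"].isin(register["vat_number"])]
    print(linked)
    print(unresolved)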
Few questionnaires treat the probabilistic record linkage method; in this context the link
probability weights are based on the level of agreement between the matching variables.
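As an illustration of how such agreement-based weights can be computed in the Fellegi-Sunter framework, the sketch below sums the log-likelihood ratios of agreement and disagreement over the matching variables; the m- and u-probabilities and the variable names are invented for the example.

    # Illustrative Fellegi-Sunter-type weight for a candidate pair, given
    # assumed m-probabilities P(agreement | match) and u-probabilities
    # P(agreement | non-match). The figures are made up for the example.
    import math

    m = {"surname": 0.95, "birth_date": 0.90, "municipality": 0.85}
    u = {"surname": 0.01, "birth_date": 0.05, "municipality": 0.10}

    def pair_weight(agreements):
        """Sum log2 likelihood ratios; `agreements` maps variable -> True/False."""
        total = 0.0
        for var, agree in agreements.items():
            if agree:
                total += math.log2(m[var] / u[var])
            else:
                total += math.log2((1 - m[var]) / (1 - u[var]))
        return total

    # A pair agreeing on surname and date of birth, disagreeing on municipality:
    print(pair_weight({"surname": True, "birth_date": True, "municipality": False}))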
The statistical matching methods are performed via logistic and regression models or via a
random selection of donors from classes determined by stratification and matching variables;
proxy variables are used to avoid the conditional independence assumption.
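A minimal sketch of the random-donor variant follows: for each recipient record a donor is drawn at random within the class defined by common stratification variables. The variable names (sex, age class, time use) are hypothetical and do not refer to any specific project.

    # Sketch of random hot-deck statistical matching within donation classes.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    recipients = pd.DataFrame({"sex": ["F", "M"], "age_class": ["30-39", "30-39"]})
    donors = pd.DataFrame({"sex": ["F", "F", "M"],
                           "age_class": ["30-39", "30-39", "30-39"],
                           "time_use": [3.1, 2.4, 1.8]})  # observed only for donors

    def draw_donor_value(recipient):
        # Donation class: donors sharing the recipient's stratification variables.
        pool = donors[(donors["sex"] == recipient["sex"]) &
                      (donors["age_class"] == recipient["age_class"])]
        return pool.iloc[rng.integers(len(pool))]["time_use"]

    recipients["time_use"] = recipients.apply(draw_donor_value, axis=1)
    print(recipients)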
Another data integration procedure consists of imputation of unknown information on a small
scale.
Question 15. The quality assessment of the results of the data integration process consists in
the following tools (please put an ‘X’ in one or more cells; multiple answers are allowed).
Tool                                                                   Number of projects
Quality indicator of the integration process                           20
Comparisons with respect to previous experiences                       23
Published reports (available also for people outside the institute)    12
Other                                                                   2
Some questionnaires present missing values.
Other tools for assessing the quality of the data integration procedures are:
- the comparison of the results obtained with external sources, in particular by comparing
register data with survey data at the individual level using identification numbers;
- the analysis of the percentages of non-matched individuals compared with the same data
aggregated from other surveys.
Question 16. Describe briefly the methods used for evaluating the data integration process. If
one or more quality indicators are used, please mention their definition.
The evaluation of the data integration procedures is performed mostly by means of external
sources, of experts' analyses or of previous experiences. One project is validated by clerical
review because of the small amount of data involved. The use of a re-matching procedure via
the double independence strategy gives the level of quality in only one project, whereas in
other projects some of the quality indicators used are the following (a simple numerical
illustration is sketched after the list):
- the geographical and sectorial coverage;
- the number of linked/non-linked records;
- the imputation fraction in case of failed links;
- the quality of the link when it succeeded.
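As a simple numerical illustration of the first counting indicators above, the short sketch below computes a linkage rate and an imputation fraction for failed links; all counts are invented.

    # Toy computation of two simple linkage quality indicators.
    n_records = 10_000   # records submitted to the linkage step (invented)
    n_linked  = 9_350    # records for which a link was found (invented)
    n_imputed = 410      # unlinked records completed by imputation (invented)

    linkage_rate = n_linked / n_records
    imputation_fraction = n_imputed / (n_records - n_linked)

    print(f"linked: {linkage_rate:.1%}")
    print(f"imputation fraction for failed links: {imputation_fraction:.1%}")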
In only one questionnaire is the validation of the procedure carried out by studying the
statistical properties of the linked data, in particular the impact of the weights, partial and
whole regression, model re-specification and the clerical review of outliers and anomalous
values.
In some cases the quality indicators are still being considered because of the early
development of the projects.
5.5. Software issues
Ondrej Vozár (CZSO)
This section describes the answers to the software issues related to the collected integration
processes. It is apparent that these projects generally make use of internally developed
software that can also be reused on other occasions (generalized software). A brief overview
of the software tools used for the projects is also given, as well as of the documentation on
internally developed software tools.
Question 17. Has a generalized software been used?
The following table shows the frequency of projects using generalized software
         Overall number of projects using generalized software       %
Yes      20                                                       54,1
No       15                                                       40,5
N.A.      2                                                        5,5
Total    37                                                      100,0
More than half of the projects are using generalized software (20 of 37 cases); missing
answers belong to planned projects not started yet.
Question 18. If Yes, write if any of the following software characteristics is available (please
put an ‘X’ in one or more cells; multiple answers are allowed)
The following table shows the software used, by its characteristics.

Software characteristics      Overall number      %
Free SW                        0                 0,0
Open source SW                 3                 8,1
Internally developed SW       26                70,3
N.A.                           8                21,6
Mostly internally developed software is used (26 of 37 cases). Missing answers belong mostly
to planned projects not yet started and to projects without generalized software (mostly solved
by ad hoc SQL queries, etc.).
Question 19. With reference to the software in question 17, please, describe briefly other
characteristics of the software (name, main characteristics, what phase(s) of the data
integration project does the software deal with, programming language(s), hardware and
software platforms and requirements, operating system).
A broad range of software is used (mostly database and statistical software). Commercial
statistical software tools are utilised in 12 projects. The most used is SAS (10 cases); the
remaining packages (SPSS, GAUSS and STATA) are each used in just one project.
Database software is applied in 12 projects: Oracle (5 cases), MS SQL Server (2 cases),
FoxPro (1 case), ADABASE (1 case), MS ACCESS 2003 (1 case) and SYBASE (1 case).
Combination of SAS and Oracle occurs in the following projects:
i. Database of Social Statistics (Statistics Slovenia) – linking population register,
administrative social survey data (improve social statistics surveys, improve
population register and domain estimates),
ii. Compilation of National Accounts data (Statistics Hungary) – linking of both quarterly
and yearly individual data, survey aggregates and macro aggregates.
Special software BLAISE is applied in the Social Statistics Database (Statistics
Netherlands) where different registers and social survey data are linked by exact record
linkage.
A probabilistic linkage method is implemented by means of internally developed software,
CAMS (Computer Assisted Matching System, written in Visual C++). The software was
developed for the needs of the UK 2001 One Number Census. It enables matching records
using a combination of automatic, probabilistic and clerical matching. It is run on a Windows
desktop PC with access over a secure network to the central Census Sybase databases.
Special methods and software for matching addresses are also used. The open source software
URBIS SPW is applied to link addresses among different registers in Statistics Belgium's
Microcensus 2006 project. In Statistics Austria's Business Register Maintenance project, data
from different sources are linked by names and addresses with the bigram method,
implemented both in PL/I (for batch jobs on an IBM host) and in SAS (both on the IBM host
and on PCs under Windows XP).
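The report does not specify the exact variant of the bigram method used; one common formulation compares two strings through the Dice coefficient of their sets of character bigrams, as in the following sketch (the addresses are invented).

    # Dice-coefficient similarity over character bigrams: one common way to
    # implement a "bigram method" for comparing names and addresses.
    def bigrams(text):
        text = " ".join(text.lower().split())      # normalise case and whitespace
        return {text[i:i + 2] for i in range(len(text) - 1)}

    def bigram_similarity(a, b):
        ba, bb = bigrams(a), bigrams(b)
        if not ba or not bb:
            return 0.0
        return 2 * len(ba & bb) / (len(ba) + len(bb))

    # Two spellings of the same (invented) address score higher than a different address.
    print(bigram_similarity("Mariahilfer Strasse 12", "Mariahilferstr. 12"))
    print(bigram_similarity("Mariahilfer Strasse 12", "Landstrasse 5"))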
Question 20. With reference to the software in question 17, if it is proprietary and internally
developed, please let us know if you are able to provide documentation regarding proprietary
software tools or algorithms / source codes.
Documentation on proprietary software is available in 8 projects (including 3 projects with
source code available).
Projects where both documentation and source code of proprietary software is available:
1. Statistics Italy, Non Cash Pension Benefits,
2. Statistics Italy, Social Accounting Matrix,
3. Statistics Switzerland, New Concept for the Federal Population Census (from 2010).
Projects where only documentation of proprietary software is available:
1. Statistics Denmark, Improvement of Quality of Social and Business Surveys,
2. Statistics Slovenia, Database of Social Statistics,
3. Statistics Netherlands, Social Statistical Database,
4. Statistics Hungary, Compilation of National Accounts,
5. Czech Republic, RegCensus (linking of administrative and survey data in
demographic statistics – in preparation).
5.6. Documentation on the integration project
Nicoletta Cibella and Tiziana Tuoto (Istat)
This set of questions investigates whether there is any documentation on the integration
projects, in terms of methodology and results.
Question 21. The institution/unit/subunit where I work has produced documentation
(technical reports, articles in journals, manuals,…) on the implemented methodologies.
Yes            23
No             13
In progress     1
Among the 23 projects for which documentation is available, more than half (13, to be exact)
provide documents in English.
Question 22. If yes, provide not more than three main references on the implemented
methodologies. Please specify the language, and if a main reference is not in English answer
the question whether a translation (of the abstract) is available.
The documentation available in English can be subdivided into three main topics:
1. General Papers, which regard the whole production of statistical information, not only
the integration process. In some projects the reported documentation is strictly linked
with the imputation and data editing procedures;
2. Documentation for ad hoc projects, describing only some particular aspects that arose
and/or some solutions implemented for the specific project considered;
3. Official Documentation, written to meet the requirements of Eurostat or of other
national or international statistical offices.
General Papers                        8
Documentation for ad hoc projects     2
Official Documentation                5
One questionnaire provides a Handbook of Best Practices.
Question 23. The unit where I work bases the data integration activities on documentation
(technical reports, articles on journals, manuals,…) produced by other institutes/universities,
as far as methodological issues are concerned.
Question 24. If yes, provide not more than three main references. Please specify the language,
and if a main reference is not in English answer the question whether a translation (of the
abstract) is available.
Only two projects answered that the unit where they work bases its data integration activities
on documentation produced by other institutes or universities as far as methodological issues
are concerned; they basically report the use of regulations and manuals for the treatment of
administrative data.
5.7. Possible changes
Alois Haslinger (Statistics Austria)
The following questions deal with the planned modifications of the integration projects, in
terms of the implemented methodology and software tools. It is important to underline that
almost 40% of the collected projects plan to improve some aspects of the integration process.
Question 25: Do you plan to modify the data integration process in the near future?
Modification    Number of projects
Yes             16
No (*)          21

(*) The response option ‘No’ includes the option ‘No answer’.
For 16 (43%) out of 37 data integration projects the respondents plan to modify the data
integration process in the near future.
Question 26: If ‘yes’, which aspects do you plan to change (multiple answers allowed)?
Aspects to change    Number of projects
Methods              9
Software             9
Other                6
This question addresses the aspects of the data integration process which are planned to be
modified in the near future. There is nearly a balance between the 3 mentioned aspects: The
methods will be changed in 9 projects, the software also in 9 projects and other aspects in 6
projects.
Actually, some projects plan to change more than one aspect of the integration process (for
this reason, the total of the previous table is more than 16). In the following table, the total
number of projects that plan to change at least one aspect of data integration is tabulated
according to the combination of aspects for which a change is planned:
Combination of aspects for which a change is planned    Number of projects
Only Methods                                            2
Only Software                                           1
Only other Aspects                                      4
Methods + Software                                      6
Software + Other Aspects                                1
Methods + Software + Other Aspects                      1
Clearly, there are two clusters of projects for which a change is planned: on the one hand,
there are 6 projects for which both methods and software shall be changed; on the other hand,
only other aspects shall be changed in 4 projects. Only methods shall be changed in 2 projects.
All other combinations of changes are mentioned in at most one project.
Question 27: Please, describe briefly the planned modifications.
For 15 projects details on their planned modifications have been provided.
- In Austria a further improvement and optimization of the data integration process for
producing the SBS statistics is mentioned. At the moment the methods and software for the
Test Census 2006 are being evaluated. It is not yet clear which changes are needed for the
register-based Census 2011.
- The Czech Republic tries to include more variables from income tax returns and monthly
social insurance data in its data integration projects.
- Italy tries to improve all steps in business demography, especially by creating generalized
software for that project. The feasibility of the longitudinal use of a sample survey integrated
with data on causes of death and hospitalization shall be evaluated further by using a broader
database for Italy as a whole and more recent years.
- Better and more administrative data from the State Revenue Service will be used for
imputation in the SBS of Latvia. The administrative data shall be analysed more deeply.
- Hungary plans to establish a joint database for SBS and taxation, to eliminate duplicate data
publication.
- In the UK the 2001 Census was linked to a large post-enumeration survey and the results
were compared with demographic estimates and other aggregate administrative data. For the
2011 Census much the same process is planned, but with more sources of data. The software
may be expanded to allow integration of these other sources. IT developments in the NHS will
change the tracing methods and software used for the ONS Longitudinal Study. The Virtual
Microdata Laboratory team began investigating the unit-record linking of social and personal
data. The project is a pilot and is expected to develop further the data fusion methods used
before.
- In Switzerland at least 3 administrative sources are linked with the business register. It is
planned to improve the record-linkage methods, the construction of databases and the data
treatment functionalities. For the 2010 register census a modification of data collection, data
integration, pooling technologies, small area estimations and the collation of information from
separate registers is planned.
- Germany plans an enlargement of the sources used for business register maintenance and the
development of new software.
5.8. Possibility to establish links between experts
Alois Haslinger (Statistics Austria)
The following questions investigate the opportunity to establish a connection among all the
people interested in some aspects of the integration projects. It turns out that there is great
interest in establishing a connection on methodological issues.
Question 28: Do you believe that the work of a committee/group of experts could provide a
useful external support for your current activities?
External support useful    Number of projects
Yes                        28
No (*)                      9

(*) The response option ‘No’ includes the option ‘No answer’.
The majority of project leaders (28 of 37 or 76%) appreciates the work of a group of experts
on data integration and expects support from that group for their own activities.
Question 29: If ‘yes’, which aspects should the committee/group coordinate (multiple answers
allowed)? If ‘Other’, please specify:
Main focus of work of expert group    Number of projects
Methodological aspects                27
Software developments                 13
Other                                  3
This question addresses the aspects of work which a committee/group of experts should
coordinate. Of the specified response categories, methodological aspects were marked most
frequently (27 times), followed by software developments (13 entries). Only 3 projects also
wanted other aspects as a topic for the work of the expert group.
Some respondents wanted the expert group to focus on more than one aspect (for this reason,
the total of the previous table is more than 28). In the following table, the total number of
entries in question 29 is tabulated according to the combination of aspects on which the group
of experts should focus:
Combination of work aspects            Number of projects
Only Methods                           13
Only other Aspects                      1
Methods + Software                     12
Methods + Other Aspects                 1
Methods + Software + Other Aspects      1
For 13 projects the expert group should concentrate only on methodological aspects, and
nearly as many (12) opt for methods and software. All other combinations of work aspects are
mentioned in at most one project.
Proposals for other work aspects of a new expert group were submitted in only 3 projects.
- The Czech Republic puts forward quality assessment.
- From Finland comes the idea to review the effects of using different administrative data
sources and of data integration on cross-country comparability.
- Spain remarks that several Working Groups and Seminars have been created with the aim of
discussing the best management procedures for the Business Registers in the European Union.
A high-priority matter is the development of national techniques that allow fulfilling the
requirements of the Community Regulation. The Recommendations Manual on Business
Registers (harmonised methodology) is available for all countries.
Annex. Survey on the use and/or development of integration
methodologies in the different ESS countries
The Annex includes the letter of invitation and the questionnaire disseminated to the ESS MS,
in collaboration with Eurostat.
In 2005 Eurostat launched the idea of establishing European Centres of Excellence in
the field of Statistics as a way to reinforce cooperation between National Statistical Institutes.
In this way the various institutes in Europe could benefit from each other's experiences and
together raise the level of their statistical production process.
This CENEX project will be a one and a half year project and be active from December 2006
to June 2008. The area of interest of this CENEX is integration of surveys and administrative
data (ISAD). An overview of the activities in this CENEX can be found at the CENEX website
(http://cenex-isad.istat.it).
One of the activities consists of having an overview of the state of the art in the area of interest
in different ESS countries. The objective of this overview is to identify possible areas in which
convergence can be achieved and to develop a plan to meet common needs on tools and
criteria for a harmonised treatment of data integration. In this context the CENEX on
Methodology, area ISAD, has developed a questionnaire. The aim of the questionnaire is
twofold: on the one hand, it investigates details on data integration projects; on the other
hand, it aims to collect documents, software and tools on integration of surveys and
administrative data. If possible, documentation (or details on how to download them) can be
provided together with the filled in questionnaire. Documentation will be distributed on the
CENEX-ISAD web site.
The questionnaire should be filled in by the personnel responsible for the main projects on
data integration within the institute (project manager or methodologist in charge of the
project). The questionnaire is focused on only one project. If you are responsible for
more than one project, please compile one questionnaire for each project.
By data integration project we mean any project that implies the joint use of multiple
sources of data for the data production process. The joint use of two or more sources has to
include the following aspect: the link of records at the unit level (e.g. administrative data with
records of another administrative source, a sample survey, a census, or a register/archive)
for different objectives. These objectives include: construction and maintenance of
registers/archives; enrichment of statistical surveys for improving coverage; enlargement of
the set of variables; improvement of weighting, imputation, small area estimators; analyses of
variables observed in two different data sources; construction of virtual censuses. A data
integration project does not include the substitution of a survey with a single administrative
source, or the simple collection of results from different sources (e.g. a statistical
information system containing aggregated data computed distinctly from different sources).
For additional terminological queries, please refer to the glossary at the end of this
questionnaire.
Replies should be sent to Mauro Scanu (E-mail: scanu@istat.it), project leader CENEX on
Methodology, area ISAD, before 27 April 2007.
After the processing we will publish a report of the results on the CENEX-ISAD website. We
very much appreciate your cooperation.
Mauro Scanu
CENEX-ISAD project leader
Istituto Nazionale di Statistica (ISTAT)
Via Cesare Balbo 16
00184 Roma - ITALIA
Email: scanu@istat.it
Telephone: +39 06 46732887
Fax: +39 06 46732972
Contact person details
1) Name and surname:
2) Institute:
3) e-mail (to be included in a mailing list of experts)
Data integration project details
In the following part of the questionnaire, a single project on data integration is described in
detail.
4) Please, describe briefly the project mentioning its name, if available
5) Main area of interest of the data integration activity (please put an ‘X’ in one cell only)
Main area of interest
To carry out a business survey
To carry out a social survey
To produce a population or housing census
To produce a business census
To produce a population archive/register
To produce a business archive/register
Other
If ‘Other’, please specify:
……………………………………………………………………………………………………………
……..
6) Objectives of the data integration activity (please put an ‘X’ in one or more cells; multiple
answers are allowed)
Objectives
Traditional census activities (for instance post-enumeration surveys)
Reduction of costs and response burden
Enhancement of editing and imputation processes
Analysis of statistical relations
Microsimulation policies
Register/archive maintenance
Improvement of estimation methods (weighting, imputation, small area estimators)
Set up of a sampling frame (for instance improvement of coverage)
Other
If ‘Other’, please specify:
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
…………
Sources used for the project
7) How many other sources have been used in the project?
Total number of sources
8) For the main sources used for the data integration project (maximum 14) specify the
following details: source name and nature (please put an ‘X’ in the relevant cell to specify
whether the source is an archive, a sample, a census or other)
Source name    Archive/register    Sample    Census    Other
1
2
3
4
5
6
7
8
9
10
11
12
13
14
If ‘Other’, please specify the kind of source
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
………………
9) Following the sources order given in question 8, answer the following questions about
each source: number of units (approximate figures are acceptable); specify whether the
source is managed by your institute or not; whether the source is well documented; whether
each record of the source is integrated with other records of other files, or whether the
source is used only in order to compute aggregated values (totals, frequencies) to be used in
the integration process. Please put an ‘X’ in the relevant cells.
Number of units    Managed by your institute (Yes / No)    Quality of documentation (poor / good / excellent)    Use of the data source (unit level / aggregated data)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Privacy issues
10) Is there a legal foundation which regulates the supply of administrative data and its
usage by the NSI?
Yes
No
11) In which way regulations and laws on privacy affect the data integration project? (please
put an ‘X’ in one or more cells; multiple answers are allowed)
Problems
Unit identifiers have to be cancelled in some data sets
Some data sets can provide only aggregate data
One or more archives/samples useful for integration goals can not be used
Linking of some groups of administrative data is prohibited by law
Other
If ‘Other’, please specify:
………………………………………………………………………….………………………………
……………………………………………………………………………………………………………
…………….
Questions on the data integration process
12) Which problems affect the data integration process? (Please put an ‘X’ in one or more
cells; multiple answers are allowed)
Problems
Harmonization of statistical units
Harmonization of reference periods
Completion of coverage
Harmonization of variables definitions
Harmonization of classifications
Parsing23
Adjustment for measurement errors
Imputation for item non-response
Derivation of new variables
Check the overall consistency
Other
If ‘Other’, please specify:
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
…………
23. Parsing divides a free-form name field into a common set of components that can be compared. Parsing
algorithms often use hints based on words that have been standardised. This approach is usually used for
comparing person names and surnames, street names, city names (see Section 20.3 of Winkler, W.E. (1995).
Matching and Record Linkage. In Business Survey Methods (Cox B.G., Binder D.A., Chinnappa B.N.,
Christianson A., Colledge M., Kott P.S. (eds.)), pp. 355-384. Wiley, New York).
13) Main method used (please put an ‘X’ in only one cell)
Method
Exact record linkage (linking records corresponding to the same unit
from two data sources by merging identifying variables)
Probabilistic record linkage (e.g. Fellegi- Sunter approach; linking
records corresponding to the same unit from two data sources by
probabilistic methods)
Statistical matching (linking records corresponding to ‘similar units’
from two distinct sample surveys)
Other data integration procedures
14) Brief description of the method used
…………………………………………………………………………………………………….……
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
…………………
15) The quality assessment of the results of the data integration process consists in the
following tools (please put an ‘X’ in one or more cells; multiple answers are allowed)
Tool
Quality indicator of the integration process
Comparisons with respect to previous experiences
Published reports (available also for people outside the institute)
Other
If ‘Other’, please specify
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
………………
16) Describe briefly the methods used for evaluating the data integration process. If one or
more quality indicators are used, please mention their definition.
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
………………
Software issues of the project
17) Has a generalized software been used? (Refer to the main software tool, i.e. to the
software used for the main phase of the integration project)
Yes
No
18) If Yes, write if any of the following software characteristics is available (please put an ‘X’
in one or more cells; multiple answers are allowed)
Software characteristics
Free software
Open source software
Internally developed software
19) With reference to the software in question 17, please, describe briefly other
characteristics of the software (name, main characteristics, what phase(s) of the data
integration project does the software deal with, programming language(s), hardware and
software platforms and requirements, operating system)
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
………………
20) With reference to the software in question 17, if it is proprietary and internally developed,
please let us know if you are able to provide the following (please put an ‘X’ in one or more
cells; multiple answers are allowed)
Proprietary software
Documentation regarding proprietary software tools
Algorithms/source codes
Please submit documentation and/or algorithms/source codes to Mauro Scanu
(scanu@istat.it) with the appropriate contact details. Documentation and codes with the
corresponding contact details will be made available on the webpage http://cenex-isad.istat.it.
Documentation on project methodologies
21) The institution/unit/subunit where I work has produced documentation (technical reports,
articles in journals, manuals,…) on the implemented methodologies.
No
Yes
22) If Yes, provide not more than three main references on the implemented methodologies.
Please specify the language, and if a main reference is not in English answer the question
whether a translation (of the abstract) is available.
a- …………………………………………………………………………………………………..
b- …………………………………………………………………………………………………..
c- …………………………………………………………………………………………………..
If possible, submit the documents to Mauro Scanu (scanu@istat.it) with the appropriate
contact details. Documentation with the corresponding contact details will be made available
on the webpage http://cenex-isad.istat.it.
23) The unit where I work bases the data integration activities on documentation (technical
reports, articles on journals, manuals,…) produced by other institutes/universities, as far as
methodological issues are concerned.
Yes
No
24) If Yes, provide not more than three main references. Please specify the language, and if
a main reference is not in English answer the question whether a translation (of the abstract)
is available.
a- …………………………………………………………………………………………………..
b- …………………………………………………………………………………………………..
c- …………………………………………………………………………………………………..
If possible, submit the documents to Mauro Scanu (scanu@istat.it).
Possible changes
25) Do you plan to modify the data integration process in the near future?
Yes
No
26) If Yes, which aspects do you plan to change? (Please put an ‘X’ in one or more cells;
multiple answers are allowed)
Aspects to change
Methods
Software
Other
27) Please, describe briefly the planned modifications.
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
……………………
Possibility to establish links between experts
28) Do you believe that the work of a committee/group of experts could provide a useful
external support for your current activities?
Yes
No
29) If yes, which aspects should the committee/group coordinate? (Please put an ‘X’ in one
or more cells; multiple answers are allowed)
Aspects to include
Methodological aspects
Software developments
Other
If ‘Other’, please specify:
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
……………………………………………………………………………………………………………
………………
GLOSSARY
Completion of population (coverage)
Completion of population arises when sources to be integrated present different coverage of
the population of interest.
Derivation of new variables
Every operation which aims to recode the original variables stored in the database into new,
unambiguous variables that are more appropriate for the data integration procedure.
Harmonization of classifications
Harmonization activities that deal with particular categorical variables whose response
categories are used to group units according to a hierarchical structure (e.g. NACE).
Transcoding tables are necessary when different classifications are used in different data
sources.
Harmonization of reference periods
Activities to make data from different sources refer, in terms of output, to the same period or
the same point in time. When combining different data sources, it is important to ascertain
that data refer to the same period or the same point in time. In the case of administrative data
a clear distinction should be made between the moment that a phenomenon occurs and the
moment that this phenomenon is registered. Also survey data may not always refer to the
same point in time. Harmonisation of reference periods can cover two different kinds of
actions.
Firstly, the dates in one or more sources may be incorrect. The starting date or the ending date
of a given situation may not be correctly registered. For example, the date of emigration of a
person is not always known or correct, because people do not report this to the population
register. Another example is tax data. They usually refer to a whole year, although not all
income is earned during the whole year (e.g. in the case of temporary jobs, like holiday jobs).
In that case the dates should be adjusted in order to reflect the correct reference period or the
correct average period.
Secondly, the dates in the sources are correct, but the time periods do not match. In this case
one needs to make some assumptions. For example, the occupation of a person is measured in
a survey (e.g. the Labour Force Survey) in October of a given year, but in our census we need
data on occupation on 1 January of the next year. In the case that the job of the surveyed
person in question in October still exists on 1 January, we may assume that the information
about the occupation of this person is also valid on 1 January next year. So, in this case the
reference period of the survey (October) is harmonised with the reference period of the census
(1 January).
Source: private communication from Paul van der Laan (Statistics Netherlands).
Harmonization of statistical units
All those activities that transform the units of observation/measurement of two different data
sources (related to the same target population) in order to derive units that share the same
definition across the different data sources.
Harmonization of variables definitions
Activities needed to transform similar characteristics or attributes observed for the same units
but in different data sources in order to derive variables that share the same definition across
the different data sources and therefore can be directly compared.
Imputation
Imputation is the process used to resolve problems of missing, invalid or inconsistent
responses identified during editing. This is done by changing some of the responses or
missing values on the record being edited to ensure that a plausible, internally coherent record
is created.
Source: Statistics Canada Quality Guidelines, 3rd edition, October 1998, page 38;
Working group on quality: Assessment of the quality in statistics: methodological documents Glossary, 6th meeting, October 2003, EUROSTAT.
Item non-response
Item non-response occurs either when a respondent provides some, but not all, of the
requested information, or when the reported information is not usable.
Source: FCSM, Subcommittee on Measuring and Reporting the Quality of Survey Data, Measuring and Reporting Sources of Error in Surveys, Statistical Policy Working Paper 31;
Working Group on Quality, Assessment of the quality in statistics: methodological documents - Glossary, 6th meeting, October 2003, EUROSTAT.
Measurement error
Measurement error refers to error in survey responses arising from the method of data
collection, the respondent, or the questionnaire (or other instrument). It includes the error in a
survey response as a result of respondent confusion, ignorance, carelessness, or dishonesty;
the error attributable to the interviewer, perhaps as a consequence of poor or inadequate
training, prior expectations regarding respondents' responses, or deliberate errors; and error
attributable to the wording of the questionnaire, the order or context in which the questions
are presented, and the method used to obtain the responses.
Source: Biemer P.P., Groves R.M., Lyberg L.E., Mathiowetz N.A., Sudman S. (1991). Measurement Errors in Surveys. Wiley, New York, p. 760;
Working Group on Quality, Assessment of the quality in statistics: methodological documents - Glossary, 6th meeting, October 2003, EUROSTAT.
Microsimulation
Microsimulation (also known as microanalytic simulation) is a modelling technique that
operates at the level of individual units such as persons, households, vehicles or firms. Within
the model each unit is represented by a record containing a unique identifier and a set of
associated attributes – e.g. a list of persons with known age, sex, marital and employment
status; or a list of vehicles with known origins, destinations and operational characteristics. A
set of rules (transition probabilities) is then applied to these units, leading to simulated
changes in state and behaviour. These rules may be deterministic (probability = 1), such as
changes in tax liability resulting from changes in tax regulations, or stochastic (probability
<=1), such as chance of dying, marrying, giving birth or moving within a given time period.
In either case the result is an estimate of the outcomes of applying these rules, possibly over
many time steps, including both total overall aggregate change and, crucially, the
distributional nature of any change.
Source: http://www.microsimulation.org
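A minimal sketch of a stochastic microsimulation step in Python is given below; the transition
probabilities and the single "chance of dying" rule are invented for illustration only.

import random

# Hypothetical annual probabilities of dying, by broad age class.
DEATH_PROBABILITY = {"0-64": 0.005, "65+": 0.05}

def simulate_year(persons, rng):
    # Apply the stochastic transition rule to every unit, then age the survivors.
    for person in persons:
        p = DEATH_PROBABILITY["65+" if person["age"] >= 65 else "0-64"]
        if person["alive"] and rng.random() < p:
            person["alive"] = False
        if person["alive"]:
            person["age"] += 1
    return persons

rng = random.Random(1)
population = [{"id": i, "age": 60 + i, "alive": True} for i in range(5)]
for _ in range(10):  # ten annual time steps
    population = simulate_year(population, rng)
print(sum(p["alive"] for p in population), "of 5 units still alive")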
Parsing
The process of parsing and standardisation of linking variables involves identifying the
constituent parts of the linking variables and representing them in a common standard way
through the use of look-up tables, lexicons and phonetic coding systems.
Source: Statistics New Zealand (2006) Data integration manual, Statistics New Zealand
publication, Wellington, August 2006, p. 40.
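A minimal Python sketch of parsing and standardisation is given below; the lexicon entries are
illustrative only, and a real application would also apply a phonetic coding system (e.g.
Soundex), which is omitted here.

import re

# Tiny illustrative lexicon of name-part standardisations.
LEXICON = {"BOB": "ROBERT", "WM": "WILLIAM", "ST": "SAINT"}

def parse_and_standardise(name):
    # Split the free-text linking variable into its constituent parts and map
    # each part onto a standard form through a look-up table.
    parts = re.split(r"[\s,.\-]+", name.upper().strip())
    return [LEXICON.get(p, p) for p in parts if p]

print(parse_and_standardise("Bob  Smith-Jones"))
# prints: ['ROBERT', 'SMITH', 'JONES']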
Post enumeration survey
A sample survey designed to check the accuracy of coverage and/or response of another census
or survey.
Source: Statistics New Zealand (2006) Data integration manual, Statistics New Zealand
publication, Wellington, August 2006.
Record linkage
Record linkage is the action of identifying records corresponding to the same entity from two
or more data sources, or of finding duplicates within a file. Entities of interest include
individuals, companies, geographic regions, families and households.
Record linkage is defined as exact (or deterministic) if a unique identifier or key of the entity
of interest is available in the record fields of all of the data sources to be linked. The unique
identifier is assumed error-free and there is no uncertainty in exact linkage results. A unique
identifier might either be a single variable (for example: tax number, passport number or
driver’s license number) or a combination of variables (such as name, date of birth and sex),
as long as they are of sufficient quality to be used in combination to uniquely define a record.
Record linkage is defined as probabilistic when the record identifiers contain errors or missing
information.
Source: Gu L., Baxter R., Vickers D., Rainsford D. (2003). "Record linkage: current practice
and future directions". CSIRO Mathematical and Information Sciences, Canberra, Australia;
Statistics New Zealand (2006). Data integration manual. Statistics New Zealand publication,
Wellington, August 2006.
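A minimal sketch of exact (deterministic) linkage in Python is given below; the field names
and records are hypothetical, and the identifier is assumed to be unique and error-free as in the
definition above. Probabilistic linkage, which handles errors in the identifiers, requires the
comparison and weighting machinery described elsewhere in this report and is not sketched here.

def exact_link(file_a, file_b, key):
    # Deterministic linkage: join records that share the same value of a unique,
    # error-free identifier available in both files.
    index_b = {record[key]: record for record in file_b}
    return [(a, index_b[a[key]]) for a in file_a if a[key] in index_b]

persons = [{"tax_no": "123", "name": "Rossi"}, {"tax_no": "456", "name": "Bianchi"}]
incomes = [{"tax_no": "123", "income": 31000}]
print(exact_link(persons, incomes, "tax_no"))
# prints: [({'tax_no': '123', 'name': 'Rossi'}, {'tax_no': '123', 'income': 31000})]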
Register
A complete written record containing regular entries of items and details on a particular set of
objects. Administrative registers come from administrative sources; they become statistical
registers after passing through statistical processing that makes them fit for statistical purposes
(production of register-based statistics, frame creation, etc.).
Source: Daniel W. Gillman (Ed) Common terminology of METIS. Version of: 29 September,
1999, p. 7.
Software, free
Free software is software that comes with permission for anyone to use, copy, and distribute,
either verbatim or with modifications, either gratis or for a fee. In particular, this means that
source code must be available.
Note that the term free software is sometimes used as a synonym for freeware. Freeware is
software available at no cost that typically permits redistribution but not modification (and
whose source code is not available).
Source: http://www.gnu.org/philosophy/categories.html#FreeSoftware; http://www.fsf.org/
Software, generalized
Generalized software is software developed for “horizontal” activities, namely activities that
do not depend on a specific application domain, and thus on specific functional requirements.
Software, internally developed
Internally developed software is software whose code has been entirely developed by
programmers internal to a specific organization.
Software, open source
Open source software is software released under a licence approved as open source by the
Open Source Initiative (OSI - http://www.opensource.org/index.php). Such a licence gives the
right to use, copy, modify and distribute the software's original "source code".
Open source software is often used to mean more or less the same category as free software.
However, the two classes are not identical: some open source licences are considered too
restrictive by free software standards, and there are free software licences that are not accepted
as open source. The differences are nevertheless small: nearly all free software is open source,
and nearly all open source software is free.
Source: http://www.opensource.org/docs/definition.php;
http://www.gnu.org/philosophy/categories.html#OpenSource
Statistical matching
Statistical matching, also known as data fusion or synthetical matching, aims to integrate two
(or more) sample surveys characterized by the fact that (a) the units observed in the samples
are different (disjoint sets of units); (b) some variables are commonly observed in the two
surveys (matching variables).
In order to distinguish statistical matching from record linkage, the former is also defined as a
procedure that links "similar" units from two sample surveys that have no units in common,
where similarity concerns the matching variables.
Source: D’Orazio M., Di Zio M, Scanu M. (2006). Statistical Matching: Theory and Practice.
Wiley, Chichester.
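A minimal sketch of one common statistical matching technique, distance hot-deck
(nearest-neighbour) matching, is given below in Python; the surveys, variables and distance
function are hypothetical, and the sketch ignores the uncertainty issues discussed earlier in
this report.

def nearest_neighbour_match(recipients, donors, matching_vars, donated_var):
    # Distance hot-deck matching: each recipient record receives the value of
    # the donated variable from the donor closest on the common matching variables.
    def distance(a, b):
        return sum(abs(a[v] - b[v]) for v in matching_vars)

    matched = []
    for rec in recipients:
        donor = min(donors, key=lambda d: distance(rec, d))
        matched.append({**rec, donated_var: donor[donated_var]})
    return matched

# Survey A observes (age, income); survey B observes (age, hours worked).
survey_a = [{"age": 34, "income": 28000}, {"age": 61, "income": 19000}]
survey_b = [{"age": 35, "hours": 40}, {"age": 60, "hours": 20}]
print(nearest_neighbour_match(survey_a, survey_b, ["age"], "hours"))
# each record of survey A now carries an imputed "hours" value from its nearest donor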
Unit identifiers
Any variable or set of variables that is structurally unique for each population unit (person,
place, event or other unit).
Source: Statistics New Zealand (2006). Data integration manual. Statistics New Zealand
publication, Wellington, August 2006; OECD Glossary of statistical terms