ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data

Report of WP1. State of the art on statistical methodologies for integration of surveys and administrative data

LIST OF CONTENTS

Preface (Mauro Scanu – ISTAT) III
1. Literature review on probabilistic record linkage 1
1.1. Statement of the record linkage problem (Marco Fortini – ISTAT) 1
1.2. The probabilistic record linkage workflow (Nicoletta Cibella, Mauro Scanu, Tiziana Tuoto) 2
1.3. Notation and difficulties for probabilistic record linkage (Marco Fortini – ISTAT) 3
1.4. Decision rules and procedures (Miguel Guigo – INE) 4
1.5. Estimation of the distribution of matches and nonmatches (Mauro Scanu – ISTAT) 8
1.6. Blocking procedures (Gervasio Fernandez – INE) 11
1.7. Quality assessments (Nicoletta Cibella, Tiziana Tuoto – ISTAT) 16
1.8. Analysis of files obtained by record linkage (Miguel Guigo – INE) 20
2. Literature review on statistical matching 24
2.1. Statement of the problem of statistical matching (Marcello D'Orazio – ISTAT) 24
2.2. The statistical matching workflow (Marcello D'Orazio, Marco Di Zio, Mauro Scanu – ISTAT) 25
2.3. Statistical matching methods (Marco Di Zio – ISTAT) 27
2.4. Uncertainty in statistical matching (Mauro Scanu – ISTAT) 33
2.5. Evaluation of the accuracy of statistical matching (Marcello D'Orazio – ISTAT) 36
3. Literature review on micro integration processing 39
3.1. Micro integration processing (Eric Schulte Nordholt – CBS) 39
3.2. Combining data sources: micro linkage and micro integration (Eric Schulte Nordholt, Frank Linder – CBS) 39
3.3. Key reference on micro integration (Miguel Guigo – INE; Paul Knottnerus, Eric Schulte Nordholt – CBS) 42
3.4. Other references 47
4. Practical experiences 48
4.1. Record linkage of administrative and survey data for the EU-SILC survey: the Italian experience (Paolo Consolini – ISTAT) 48
4.2. Record linkage applied for the production of business demography data (Caterina Viviano – ISTAT) 51
4.3. Combination of administrative, register and survey data for Structural Business Statistics (SBS) – the Austrian concept (Gerlinde Dinges – STAT) 55
4.4. Record linkage applied for the computer assisted maintenance of Business Register: the Austrian experience (Alois Haslinger – STAT) 59
4.5. The use of data from population registers in the 2001 Population and Housing Census: the Spanish experience (INE Spain) 64
4.6. Administrative data source (DBP) for population statistics based on ISEO register in Czech Republic (Jaroslav Kraus – CZSO) 71
4.7. An experiment of statistical matching between Labour Force Survey (RFL) and Time Use Survey (TUS) (Gianni Corsetti – Isfol) 73
5. Results of the survey on the use and/or development of integration methodologies in the different ESS countries 79
5.1. Introduction (Mauro Scanu – ISTAT) 79
5.2. Objective of the integration process and characteristics of the file to integrate (Luis Esteban Barbado – INE Spain) 80
5.3. Privacy issues (Eric Schulte Nordholt – CBS) 83
5.4. Problems of the integration process and methods (Nicoletta Cibella, Tiziana Tuoto – ISTAT) 86
5.5. Software issues (Ondrej Vozár – CZSO) 90
5.6. Documentation on the integration process (Nicoletta Cibella, Tiziana Tuoto – ISTAT) 91
5.7. Possible changes (Alois Haslinger – STAT) 92
5.8. Possibility to establish links between experts (Alois Haslinger – STAT) 94
Annex.
Survey on the use and/or development of integration methodologies in the different ESS countries 96

Preface (Mauro Scanu – ISTAT)

This document is the deliverable of the first work package of the Centre of Excellence on Statistical Methodology, Area Integration of Surveys and Administrative Data (CENEXISAD, consisting of the NSIs of Austria, the Czech Republic, Italy, the Netherlands and Spain). The objective of this document is to provide a complete and updated overview of the state of the art of the methodologies for the integration of different data sources. The different NSIs (within the ESS) can refer to this single document if they need to:
1) define a problem of integration of different sources according to the characteristics of the data sets to integrate;
2) discover the different solutions available in the statistical literature;
3) understand which problems still need to be tackled, and motivate research on these issues;
4) look at the characteristics of many different projects that required the integration of different data sources.

This document consists of five chapters that can be broadly clustered in two groups. The first three chapters are mainly methodological. They describe the state of the art for, respectively, i) probabilistic record linkage, ii) statistical matching, and iii) micro integration processing. Each chapter is essentially a collection of references: this part of the document is intended as a tool for orientation through the wide body of papers on different integration methodologies. This aspect should not be considered a secondary issue in the production of official statistics. The main problem is that methodologies for the integration of different sources are, most of the time, still in their infancy, whereas the current information needs of official statistics require an increasingly sophisticated use of multiple sources for the production of statistics. Whoever is in charge of a project on the integration of different sources must be aware of all the available alternatives and should be able to justify the chosen method.

The last two chapters are an overview of integration experiences in the ESS. Chapter 4 collects detailed information on many different projects requiring the joint use of two or more sources in the participating NSIs of this CENEX. Chapter 5 illustrates the results of a survey on the use and/or development of integration methodologies in the ESS countries. These chapters illustrate the many information needs that cannot be met by means of a single source of information, as well as the specific problems that must be treated in each integration process.

1. Literature review on probabilistic record linkage

1.1. Statement of the problem of record linkage
Marco Fortini (ISTAT)

Record linkage consists in identifying pairs of records, coming from either the same or different data files, which belong to the same entity, on the basis of the agreement between common indicators.

[Figure omitted: record linkage of two data sets A and B, taken from Fortini et al. (2006).]

The figure shows record linkage of two data sets A and B. Links aim at connecting records belonging to the same unit by comparing some indicators (name, address, telephone). Some agreement may be imperfect (as for the telephone number of the first record of the left data set and the third record of the right data set), while the records still belong to the same unit.
A classical use of linked data in the statistical research context is the study of the relationships between variables collected on the same individuals but coming from different sources. Other important applications entail the removal of duplicates from data sets and the development and management of registers. Record linkage is also a pervasive technique in a business context, where it concerns information systems for customer relationship management and marketing. Recently, an increasing interest in e-government applications has also come from public institutions.

Regardless of the record linkage purpose, the same logic is adopted in the extreme cases: when a pair of records is in complete disagreement on some key variables it will almost certainly be composed of different entities; conversely, a perfect agreement will indicate an almost certain match. All the intermediate cases, whether a partial agreement between two different units is achieved by chance or a partial disagreement between a couple of records relating to the same entity is caused by errors in the comparison variables, have to be properly resolved depending on the particular approach adopted. A distinction between a deterministic and a probabilistic approach is often made in the literature, where the former is associated with the use of formal decision rules while the latter makes explicit use of probabilities for deciding when a given pair of records is actually a match. The existence of a large number of different approaches, mainly defined in computer science, that make use of techniques based on similarity metrics, data mining, machine learning, etc., without explicitly defining any substantive probabilistic model, makes the previous distinction more subtle. In the present review only the strictly probabilistic approaches will be discussed, given that they naturally accommodate the essential task of evaluating matching errors; Gu et al. (2003) is referenced for a first attempt at an integrated view of recent developments in all the major approaches.

Bibliography
Fortini, M., Scannapieco, M., Tosco, L., and Tuoto, T., 2006. Towards an Open Source Toolkit for Building Record Linkage Workflows. Proceedings of the SIGMOD 2006 Workshop on Information Quality in Information Systems (IQIS'06), Chicago, USA, 2006.
Gu, L., Baxter, R., Vickers, D., and Rainsford, C., 2003. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia, April 2003. (http://citeseer.ist.psu.edu/585659.html)

1.2. The probabilistic record linkage workflow
Nicoletta Cibella, Mauro Scanu, Tiziana Tuoto (ISTAT)

Probabilistic record linkage is a complex procedure composed of different steps. A workflow (adapted from the one described in the record linkage manual of Statistics New Zealand) is the following.

[Figure omitted: probabilistic record linkage workflow.]

This document reviews in detail the papers on the different steps of the workflow.
a. Start from the (already harmonized) data sets A and B (see WP2, Section 1, for more details).
b. Usually the overall set of pairs of records from the data sets A and B is too large, and this causes computational and statistical problems. There are different procedures to deal with this problem, which are listed in Section 1.6.
c. A necessary step for probabilistic record linkage is to consider how the variables used for matching pairs remain stable from one data set to another.
This information is seldom available, but can be estimated from the data sets at hand (Section 1.5).
d. Consider all the pairs of records in the search space created by the procedure in step b. Apply a decision rule to each pair of records (the possible results are: link, possible link, no link). This is described in Section 1.4.
e. Link the two data sets according to the results of the previous step.
f. Evaluate the quality of the results (Section 1.7).
g. Analyse the resulting linked data sets, bearing in mind that the file can contain matching errors (Section 1.8).

In the following, all the previous steps will be analysed, starting from the core of the probabilistic record linkage problem (Section 1.3), i.e. the definition of the model that generates the observed data, then the optimal decision procedure according to the Fellegi and Sunter theory (Section 1.4), and the estimation of the parameters needed to apply the decision procedure (Section 1.5). After reviewing these aspects, the procedures for reducing the search space of the pairs of records will be illustrated (Section 1.6). Appropriate methods for the evaluation of the quality of probabilistic record linkage are then outlined (Section 1.7). Finally, the problem of analysing data sets obtained by means of a record linkage procedure is addressed (Section 1.8).

1.3. Notation and technicalities for probabilistic record linkage
Marco Fortini (ISTAT)

The early contribution to modern record linkage dates back to Newcombe et al. (1959) in the field of health studies, followed by Fellegi and Sunter (1969), where a more general and formal definition of the problem is given. Following the latter approach, let A and B be two partially overlapping files consisting of the same type of entities (individuals, households, firms, etc.), respectively of size nA and nB. Let Ω be the set of all possible pairs of records coming from A and B, i.e. Ω = {(a,b): a ∈ A, b ∈ B}. Suppose also that the two files consist of vectors of variables (XA,YA) and (XB,ZB), either quantitative or qualitative, and that XA and XB are sub-vectors of k common identifiers, called key variables in what follows, so that any single unit is univocally identified by an observation x. Moreover, let γab denote the vector of indicator variables for the pair (a,b), so that its j-th component γjab equals 1 if the j-th key variable takes the same value on record a of A and on record b of B (i.e. xAa,j = xBb,j), and 0 otherwise, j = 1,…,k. The indicators γjab will be called comparison variables.

Given the definitions above, record linkage can be formally represented as the problem of assigning the pair (a,b) to one of the two subsets M or U, which identify the matched and the unmatched sets of pairs respectively, given the state of the vector γab. Probabilistic methods of record linkage generally assume that observations are independent and identically distributed draws from appropriate probability distributions. Following Fellegi and Sunter (1969), there is first a latent dichotomous random variable that assigns each pair of records (a,b) to the matched records (set M) or to the unmatched ones (set U). This variable is unobserved, and it is actually the target of the record linkage process. Secondly, the comparison variables γab follow distinct distributions according to the pair status. Let m(γab) be the distribution of the comparison variables given that the pair (a,b) is a matched pair, i.e. (a,b) ∈ M, and u(γab) be the distribution of the comparison variables given that the pair (a,b) is an unmatched pair, i.e. (a,b) ∈ U. A minimal sketch of how the comparison vectors γab can be built is given below.
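To make the notation concrete, the following minimal sketch (in Python; the key variables, field names and records are purely illustrative and not taken from the report) builds the binary comparison vectors γab for all pairs of records of two small files, using exact agreement on each key variable.

```python
from itertools import product

# Illustrative key variables (the common identifiers X); any set of identifiers would do.
KEY_VARIABLES = ["name", "surname", "year_of_birth"]

file_a = [
    {"name": "anna", "surname": "rossi", "year_of_birth": 1970},
    {"name": "luca", "surname": "bianchi", "year_of_birth": 1982},
]
file_b = [
    {"name": "anna", "surname": "rossi", "year_of_birth": 1971},
    {"name": "maria", "surname": "verdi", "year_of_birth": 1965},
]

def comparison_vector(rec_a, rec_b):
    """gamma_ab: component j is 1 if the j-th key variable agrees exactly, 0 otherwise."""
    return tuple(int(rec_a[k] == rec_b[k]) for k in KEY_VARIABLES)

# The set Omega of all pairs (a, b), each with its comparison vector gamma_ab.
omega = {(a, b): comparison_vector(file_a[a], file_b[b])
         for a, b in product(range(len(file_a)), range(len(file_b)))}

for pair, gamma in omega.items():
    # e.g. pair (0, 0) gives (1, 1, 0): agreement on name and surname, not on year of birth
    print(pair, gamma)
```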
These distributions will be crucial for deciding the status of the record pairs.

Bibliography
Fellegi, I. P., and Sunter, A. B., 1969. A theory for record linkage. Journal of the American Statistical Association, Volume 64, pp. 1183-1210.
Newcombe, H., Kennedy, J., Axford, S. and James, A., 1959. Automatic Linkage of Vital Records. Science, Volume 130, pp. 954-959.

1.4. Decision rules and procedures
Miguel Guigo (INE)

1.4.1. General statement of the statistical problem

The key task of the whole record linkage process is to determine whether a pair of records belongs to the same entity or not; hence, the quality of the result achieved by the linkage procedure relies on the quality of the tool applied to make this choice, that is, the decision rule. From a statistical point of view, following De Groot (1970), a decision problem consists of an experiment whose actual outcome is unknown, and whose consequences depend on that outcome and on a decision taken by the statistician. Specifically, let D be the space of all possible decisions d which might be made, let Ω be the space of all possible outcomes ω of the experiment, and let R be the space of all possible rewards or results r = r(ω, d) of the statistician's decision d and the outcome ω of the experiment. In most cases, r is actually a loss function. We also assume that there exists a probability distribution P on the space of outcomes, whose value P(E) is specified for each event E in Ω. The statistician must then choose an optimal, possibly non-deterministic, behaviour in an incompletely known situation. One way to proceed is to minimize the expectation of the total loss; the decision rule is then optimal (Wald, 1950). The statistician must also face a problem with respect to the probability distribution P, which is known to belong to a family of probability distributions but some of whose parameters are unknown; by making some observations of the phenomenon and processing the data, the statistician has to make a decision on P. A statistical decision rule is therefore a transition probability distribution from a space of outcomes into a space of decisions D (for a more formal definition of a statistical decision rule, see Chentsov, 1982, p. 65).

In the case of a record linkage procedure, the space of actual outcomes consists of a real match or a real non-match for every pair of records belonging to Ω = {(a,b): a ∈ A, b ∈ B}, and the space D of all possible decisions consists of assigning or not assigning the pair as a link. In this context, the decision rule can be seen as a two-step process, where the first step is to organize the set of agreements between common identifiers for each pair of records (a,b) in an array γab. This amounts to a mapping from Ω onto Γ, where Γ is known as the space of comparisons. A function that returns a numerical comparison value for γjab, multiplied by a weight wj, gives a basic score on the level of coincidence for the j-th key variable, which sets the contribution of every common identifier. Procedures for measuring agreement between records (a,b) then result in a composite weight measuring their closeness. Comparison patterns can be more or less arbitrary, based on distance, similarity, or linear regression models, amongst others. For a more complete list of comparators, see Yancey (2004b). Newcombe et al. (1959) and Fellegi and Sunter (1969) account for the different amount of information provided by each key variable by using a log-likelihood ratio based on the agreement probabilities. This is considered the standard procedure, as shown below.
From m(γab) and u(γab) as defined in the previous section, each pair is assigned the following weight: wab = log(m(γab) / u(γab)). Once a weighted measure of agreement is set, the following step is in its turn a mapping from Γ onto a space of decisions which consists of: A1 (a link), A3 (a non-link), and A2 (a possible link), with the related probabilities given that (a,b) ∈ U or (a,b) ∈ M, which can be derived from the probability distributions m(γab) and u(γab) and from the regions of Γ associated with each decision. As the weighted score increases, the associated pair (a,b) is more likely to belong to M. So, on the one hand, given an upper threshold, a larger comparison value for γab will lead to considering the pair as a link; on the other hand, a smaller comparison value, given a lower threshold, will lead to considering it as a non-link.

Taking both steps into account, the problem of record linkage and the decision rule can be treated as an ordinary statistical hypothesis test with a critical and an acceptance region, which are obtained through the different values of γ in Γ and their respective composite weight values on R, compared with a set of fixed bounds. A probability model based on [m(γab), u(γab)] is therefore also needed in order to calibrate the error rates, i.e. μ = P{A1 | (a,b) ∈ U} and λ = P{A3 | (a,b) ∈ M}. At this point it is important to remark that, while Ω consists of only two disjoint subsets M and U, the space of decisions is split into three subsets, because the probability distributions of matches and non-matches partially overlap. For possible links, i.e. when A2 is decided, a later clerical review of the ambiguous results will be needed, in order to appropriately discriminate these intermediate results between the link and non-link cases. An intuitive idea is that, if the main reason to implement an automatic record linkage procedure is to avoid or reduce the costs, delays and errors due to the use of specifically trained staff linking records manually, then the larger A2 is, the higher those costs, time consumption and errors are, and the worse the decision rule is. So, the optimal linkage rule has to maximize the probabilities of positive dispositions of comparisons, that is to say positive links A1 and positive non-links A3, for a given pair of fixed error levels μ and λ.

1.4.2. Probability model and optimal fusion rule

Following Fellegi and Sunter (1969), m(γ) and u(γ) are defined as the conditional probabilities of observing γ given that the record pair is, respectively, a true match or a true non-match. Then μ = P{A1 | U} and λ = P{A3 | M} are defined respectively as the sums Σγ u(γ) P{A1 | γ} and Σγ m(γ) P{A3 | γ}. Moreover, P{A2 | γ} should be minimised by the optimal decision rule (for other criteria of optimality, see Gu et al., 2003, and Verykios, 2000). In order to simplify notation, we write just γ instead of γab. The values of γ must then be arranged so as to make the ratio R1(γ) = m(γ)/u(γ) monotonically decreasing (values of γ where m(γ) > 0 and u(γ) = 0 are placed first) and indexed as 1, 2, …, |Γ|, where |Γ| is the cardinality of the set Γ. For a value of μ equal to the sum of u(γ) over the first n values of γ so arranged, and a value of λ equal to the sum of m(γ) over the last values of γ counting from a value n′, let Tμ = m(γn)/u(γn) be an upper cut-off threshold, and Tλ = m(γn′)/u(γn′) a lower one.
Then, the optimal rule is given by:
– (a,b) ∈ A1 (positive link) when the ratio R1(γ) is greater than or equal to Tμ;
– (a,b) ∈ A2 (possible link) when the ratio R1(γ) lies in the region between Tλ and Tμ;
– (a,b) ∈ A3 (positive non-link) when the ratio R1(γ) is lower than or equal to Tλ.

[Figure omitted: distributions of the comparison weights for non-matches and matches, with the two cut-off thresholds and the link, possible link and non-link regions.]

The figure illustrates how the optimal rule works, creating the critical, acceptance and intermediate regions. Each vertical line represents a threshold: the line on the left represents the lower bound Tλ, and the one on the right represents the upper bound Tμ. The areas marked FU and FM represent, respectively, the probability of false non-matches (FU) and of false matches (FM), that is to say, the associated error rates. As the figure suggests, the number of pairs (a,b) ∈ U is far greater than the number of pairs (a,b) ∈ M: let nA and nB be the numbers of records in A and B and say, without loss of generality, nA < nB; then nA < (nA×nB − nA). It is therefore a common assumption, when estimating the u(γab) distribution, that the proportion p of matched pairs is negligible.

As shown below, some additional assumptions on the behaviour of m(γab) and u(γab) can be made in order to simplify those conditional distributions, resulting in a slightly different form of the ratio, closer to the weights proposed at the beginning of this section. Given independence between the components, the distributions can be written as m(γab) = m1(γ1ab)·m2(γ2ab)·…·mk(γkab) and u(γab) = u1(γ1ab)·u2(γ2ab)·…·uk(γkab). The decision rule can then be written in terms of the log-likelihood ratio R2(γ) = log[m(γ)/u(γ)], which becomes a weighted score Σj wj, where wj = log(mj/uj). In any case, since the reliability of the decision rule depends heavily on the accuracy of the estimates of m(γab) and u(γab), the core problem of the standard procedure for record linkage is to determine the values of those probabilities, known as matching parameters (see Scheuren and Winkler, 1993); the difficulty of assessing the accuracy of the estimated parameters empirically has led to the different approaches discussed in the following section.
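The following minimal sketch illustrates the decision rule under the independence decomposition just described. All numerical values are illustrative: the mj and uj probabilities would in practice have to be estimated (Section 1.5), and the two cut-off thresholds are placed by hand rather than derived from fixed error levels μ and λ.

```python
import math

# Illustrative agreement probabilities for three key variables.
m = [0.95, 0.90, 0.85]   # P(agreement on key j | pair is a match)
u = [0.05, 0.10, 0.02]   # P(agreement on key j | pair is a non-match)

def composite_weight(gamma):
    """Sum of w_j: log(m_j/u_j) on agreement, log((1-m_j)/(1-u_j)) on disagreement."""
    total = 0.0
    for j, agree in enumerate(gamma):
        if agree:
            total += math.log(m[j] / u[j])
        else:
            total += math.log((1 - m[j]) / (1 - u[j]))
    return total

T_UPPER, T_LOWER = 3.0, -3.0   # hand-picked thresholds on the log scale

def decide(gamma):
    w = composite_weight(gamma)
    if w >= T_UPPER:
        return "link"            # region A1
    if w <= T_LOWER:
        return "non-link"        # region A3
    return "possible link"       # region A2, sent to clerical review

print(decide((1, 1, 1)), decide((1, 0, 0)), decide((0, 0, 0)))
# -> link possible link non-link
```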
Bibliography
Belin, T.R. and Rubin, D.B., 1990. Calibration of errors in computer matching for Census undercount. Proceedings of the Government Statistics Section of the American Statistical Association, pp. 124-131.
Chentsov, N.N., 1982. Statistical Decision Rules and Optimal Inference. Translations of Mathematical Monographs, Volume 53, 499. American Mathematical Society, Rhode Island, U.S.
De Groot, M.H., 1970. Optimal Statistical Decisions. McGraw-Hill, New York.
Fellegi, I. P., and Sunter, A. B., 1969. A theory for record linkage. Journal of the American Statistical Association, Volume 64, pp. 1183-1210.
Gu, L., Baxter, R., Vickers, D., and Rainsford, C., 2003. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia, April 2003.
Jaro, M.A., 1989. Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, Volume 84, pp. 414-420.
Larsen, M.D., 2004. Record Linkage of Administrative Files and Analysis of Linked Files. In IMS-ASA SRMS Joint Mini-Meeting on Current Trends in Survey Sampling and Official Statistics, The Fort Radisson, Raichak, West Bengal, India.
Newcombe, H. B., Kennedy, J.M., Axford, S.J. and James, A.P., 1959. Automatic linkage of vital records. Science, 130, pp. 954-959.
Newcombe, H. B., and Kennedy, J. M., 1962. Record linkage. Making maximum use of the discriminating power of identifying information. Communications of the Association for Computing Machinery, Volume 5(11), pp. 563-566.
Scheuren, F. and Winkler, W.E., 1993. Regression analysis of data files that are computer matched – Part I. Survey Methodology, Volume 19, 1, pp. 39-58.
Verykios, V.S., 2000. A Decision Model for Cost Optimal Record Matching. In: National Institute of Statistical Sciences Affiliates Workshop on Data Quality, Morristown, New Jersey.
Wald, A., 1950. Statistical Decision Functions. John Wiley & Sons, New York.
Yancey, W.E., 2004b. An Adaptive String Comparator for Record Linkage. U.S. Bureau of the Census, Statistical Research Division Report Series, n. 2004/02.

1.5. Estimation of the distributions of matches and nonmatches
Mauro Scanu (ISTAT)

As shown in the previous sections, a key role in the application of probabilistic rules for record linkage is played by the distributions of the comparison variables for matches and nonmatches respectively. The problem is that these distributions are usually unknown and need to be estimated. Most papers deal with the problem of estimating these distributions from the data sets to be linked. The proposed methods basically follow the approach first established in Fellegi and Sunter (1969), which consists in considering all the pairs as a sample of nA×nB records independently generated by a mixture of two distributions: one for the matched pairs and the other for the unmatched ones. The status of matched or unmatched pair is assigned by a latent (i.e. unobserved) dichotomous variable. This model allows the computation of a likelihood function to be maximized in order to estimate the unknown distributions of the comparison variables γab for matched and unmatched pairs. Maximization of the likelihood function requires iterative methods for dealing with the latent variable, usually the EM algorithm or some of its generalizations. (The EM (Expectation-Maximization) algorithm was defined in Dempster, Laird and Rubin (1977) as a method for obtaining maximum likelihood estimates from partially observed data sets, including the case of latent variables. Broadly speaking, it is an iterative procedure which starts with a preliminary value of the parameter θ to estimate, say θ0; fills in the missing values with their expected values under θ0 (E step); computes the maximum likelihood estimate of θ on the completed file (M step); and iterates the E and M steps until convergence.)

1.5.1. Different approaches in the estimation

As a matter of fact, the presence of a latent variable risks making the model parameters unidentifiable. For this reason, different papers have considered simplifying assumptions. In almost all cases the comparison variables γab are assumed to be dichotomous, i.e. they just report the equality or difference of each key variable.

Independence between the comparison variables – This assumption is usually called the Conditional Independence Assumption (CIA), i.e. the assumption of independence between the comparison variables γjab, j=1,…,k, given the match status of each pair (matched or unmatched). Fellegi and Sunter (1969) define a system of equations for estimating the parameters of the distributions for matched and unmatched pairs, which gives estimates in closed form when there are at most three comparison variables. Jaro (1989) solves this problem for a general number of comparison variables with the use of the EM algorithm; a minimal sketch of this estimation strategy is given below.
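As an illustration of this estimation step, the sketch below implements a bare-bones EM algorithm for the two-component mixture under the conditional independence assumption, with dichotomous comparison vectors. It is a minimal didactic version under simplifying assumptions (illustrative starting values, no blocking, no constraints on the parameters), not the refined procedures of Winkler (1993) or Larsen and Rubin (2001) discussed below.

```python
import numpy as np

def em_fellegi_sunter(gammas, n_iter=200):
    """EM for the mixture model under the CIA.
    gammas: (n_pairs, k) array of 0/1 comparison vectors.
    Returns (p, m, u): estimated proportion of matches and per-variable
    agreement probabilities for matches (m) and non-matches (u)."""
    gammas = np.asarray(gammas, dtype=float)
    n, k = gammas.shape
    p, m, u = 0.1, np.full(k, 0.8), np.full(k, 0.2)   # illustrative starting values
    for _ in range(n_iter):
        # E step: posterior probability that each pair is a match.
        lik_m = np.prod(m ** gammas * (1 - m) ** (1 - gammas), axis=1)
        lik_u = np.prod(u ** gammas * (1 - u) ** (1 - gammas), axis=1)
        g = p * lik_m / (p * lik_m + (1 - p) * lik_u)
        # M step: update the mixture proportion and the Bernoulli parameters.
        p = g.mean()
        m = (g[:, None] * gammas).sum(axis=0) / g.sum()
        u = ((1 - g)[:, None] * gammas).sum(axis=0) / (1 - g).sum()
    return p, m, u

# Toy usage on a handful of comparison vectors for three key variables.
gammas = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0), (1, 1, 1)]
p_hat, m_hat, u_hat = em_fellegi_sunter(gammas)
print(p_hat, m_hat, u_hat)
```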
Dependence between the comparison variables and the latent variable defined by means of loglinear models – Thibaudeau (1989, 1993) and Armstrong and Mayda (1993) estimated the distributions of the comparison variables under appropriate loglinear models for the comparison variables, and found that these models are more suitable than the CIA. The problem is then selecting the appropriate loglinear model. Winkler (1989, 1993) argues that it is better to avoid selecting the model by testing, because tests are usually unreliable in the presence of a latent variable. He suggests using a sufficiently general model, such as the loglinear model with all interactions of order larger than three set to zero, and incorporating appropriate constraints during the estimation process. For instance, an always valid constraint states that the probability of having a matched pair is always smaller than the probability of having a nonmatch. A more refined constraint is obviously p ≤ min(nA, nB)/(nA×nB), since the number of matched pairs cannot exceed the size of the smaller file. Estimation of the model parameters under these constraints may be performed by means of appropriate modifications of the EM algorithm, see Winkler (1993).

Bayesian approaches – Fortini et al. (2001, 2002) look at the status of each pair (match or nonmatch) as the parameter of interest. For this parameter, and for the parameters of the distributions that generate matches and nonmatches, they define natural prior distributions. The Bayesian approach consists in marginalizing the posterior distribution of all these parameters with respect to the parameters of the comparison variables (nuisance parameters). The result is a function of the status of the different pairs that can be analysed in order to find the most probable configuration of matched and unmatched pairs.

Iterative approaches – Larsen and Rubin (2001) define an iterative approach which alternates model-based classification and clerical review in order to reduce as much as possible the number of records whose status is uncertain. Models are chosen among a set of fixed loglinear models, with parameters estimated by the EM algorithm and compared with "semi-empirical" probabilities by means of the Kullback-Leibler distance.

Other approaches – Some papers do not estimate the distributions of the comparison variables on the data sets to be linked, but use ad hoc data sets or training sets. In this case it is possible to use comparison variables that are more informative than the traditional dichotomous ones. For instance, a remarkable approach is considered in Copas and Hilton (1990), where, for matched pairs, the comparison variables are defined as the pair of categories of each key variable observed in the two files to be matched (i.e. the comparison variables report possible classification errors in one of the two files); unmatched pairs are such that each component of the pair is independent of the other. In order to estimate the distribution of the comparison variables for matched pairs, Copas and Hilton need a training set. They estimate model parameters for different models, corresponding to different classification error models.

1.5.2. Quality assessment and open issues

Papers on record linkage usually do not report any quality assessment of the estimation of the distributions of the comparison variables.
However, it is necessary to report a weakness of all the estimation methods based on a specific model (apart from the one proposed by Copas and Hilton, 1990). The assumed model fails to hold for the sample defined by the set of nA×nB pairs formed from the two data sets to be linked: in that case, it is not possible to state that the comparison variables are independently generated by the appropriate distributions. For more details about this weakness, see Kelley (1984). It is not yet clear how the failure of this independence hypothesis affects the record linkage results. Moreover, given the presence of a latent variable, estimation is not reliable when one of the categories of the latent variable is rare. In this case, the set of the matched pairs M should be large enough (say, more than 5% of the overall set of nA×nB pairs). This is one of the motivations for the application of blocking procedures, as shown in the next section.

Bibliography
Armstrong, J. and Mayda, J.E., 1993. Model-based estimation of record linkage error rates. Survey Methodology, Volume 19, pp. 137-147.
Copas, J. R., and Hilton, F. J., 1990. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society, A, Volume 153, pp. 287-320.
Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum Likelihood from Incomplete Data via the EM algorithm. Journal of the Royal Statistical Society, Series B, Volume 39, pp. 1-38.
Fellegi, I. P., and Sunter, A. B., 1969. A theory for record linkage. Journal of the American Statistical Association, Volume 64, pp. 1183-1210.
Fortini, M., Liseo, B., Nuccitelli, A. and Scanu, M., 2001. On Bayesian record linkage. Research in Official Statistics, Volume 4, pp. 185-198. Published also in Monographs of Official Statistics, Bayesian Methods (E. George (ed.)), Eurostat, pp. 155-164.
Fortini, M., Nuccitelli, A., Liseo, B., and Scanu, M., 2002. Modelling issues in record linkage: a Bayesian perspective. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 1008-1013.
Jaro, M.A., 1989. Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, Volume 84, pp. 414-420.
Kelley, R.B., 1984. Blocking considerations for record linkage under conditions of uncertainty. Statistical Research Division Report Series, SRD Research Report No. RR-84/19. Bureau of the Census, Washington, D.C.
Larsen, M.D. and Rubin, D.B., 2001. Iterative automated record linkage using mixture models. Journal of the American Statistical Association, 96, pp. 32-41.
Thibaudeau, Y., 1989. Fitting log-linear models when some dichotomous variables are unobservable. Proceedings of the Section on Statistical Computing, American Statistical Association, pp. 283-288.
Thibaudeau, Y., 1993. The discrimination power of dependency structures in record linkage. Survey Methodology, Volume 19, pp. 31-38.
Winkler, W.E., 1989a. Near automatic weight computation in the Fellegi-Sunter model of record linkage. Proceedings of the Annual Research Conference, Washington D.C., U.S. Bureau of the Census, pp. 145-155.
Winkler, W.E., 1989b. Frequency-based matching in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 778-783 (longer version: report rr00/06 at http://www.census.gov/srd/www/byyear.html).
Winkler, W.E., 1993. Improved decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 274-279.
1.6. Blocking procedures
Gervasio Fernandez (INE)

1.6.1. General blocking procedures

Record linkage procedures require every record from one data set to be compared with all the records from the other data set; when one or both data sets are large, the expected number of pairwise comparisons grows very quickly and the system requirements become correspondingly higher, in time and in resources. There is a way to reduce those needs: splitting the records into groups or 'blocks', with the proviso that comparisons between elements from different blocks will not be made. Each record from a given block in the first data set is then compared only with the records from the corresponding block in the second data set. However, it must be taken into account that this reduction bears the risk of mistakenly assigning records to a block, so that some of their possible matches will never be compared, i.e. they will not be properly matched. This drawback can be reduced by applying multi-pass techniques. For a review of blocking procedures for record linkage, see Baxter et al. (2003), Cochinwala et al. (2001) and Gu et al. (2003).

1.6.2. Standard Blocking

A first and easy way to group records is possible when well-defined and well-coded keys are available for both data sets, e.g. place of birth. In this case, each record is just compared with every record with the same place of birth in the second data set. Other examples of keys are the first digits of the Social Security number or the first characters of the first or last name of a person; in the latter case, phonetic/orthographic codes (e.g. Russell-Soundex, NYSIIS, ONCA, Metaphone for English words) are usually applied in order to reduce misspelling/writing errors.

[Figure omitted: example of standard blocking using the ZIP / postal code as a key.]

Record blocks are defined by the records sharing the same key value, where the key has been defined using the available attributes in each data set. Depending on the keys used, blocks with a large number of records can be obtained, and hence an ineffectively large number of comparisons; on the other hand, in the case of small blocks, true record matches can be lost, especially if the key contains misprints. An analysis of error reduction using blocking methods can be found in Elfeky et al. (2002), Jaro (1989) and Newcombe (1988). Nevertheless, standard blocking procedures will not work properly unless the variables used as a key are correctly coded and recorded. This ideal situation is not always the case, and several alternative methods, introduced below, have been proposed for rearranging data into blocks.

1.6.3. Fuzzy Blocking

When the keys contain misprints, so that true record matches are lost because they are assigned to different blocks, fuzzy blocking methods can be applied. With these methods, records are split into not necessarily disjoint blocks, i.e. records may be assigned to more than one block. For example, date of birth may be available in one data set, so that blocks can be defined by year of birth, while in the other data set only people's ages are available; it is then possible to look through the records with the appropriate years of birth. A well-known fuzzy blocking method is Bigram Indexing, used in the software Febrl (Christen and Churches, 2005b).
The underlying idea is to consider, for a given key value in a record, all its bigrams (sub-strings of length two) and to build all the sub-lists of bigrams whose length corresponds to a given threshold (a fraction of the length of the full bigram list); the record is then assigned to the blocks labelled by the resulting sub-lists. Let us give an example of a Bigram Indexing procedure where the ZIP / postal code mentioned above is split into bigrams. The code number "28046" results in the bigram list ("28", "80", "04", "46"), which is the main list of bigrams. The set of all its sub-lists of length 3 is as follows:
("28", "80", "04")
("28", "80", "46")
("28", "04", "46")
("80", "04", "46")
Then every record which holds the value "28046" for the key variable ZIP / postal code will be assigned to 4 different blocks, each of them labelled with the corresponding sub-list.

1.6.4. Sorted Neighbourhood

Another method consists in concatenating the data sets to be handled and then sorting the records by some external key. Every record is then compared with the records in a moving window of size w, centred upon the selected record. This method can be used with several independent sort keys, increasing the number of comparisons to be made. Sort keys must be chosen in such a way that records to be compared stay close to each other in the re-arranged data set, and should therefore be related to the elements involved in the comparison functions. The ability to sort large data sets then arises as an important issue. A description and analysis of some algorithms and methods can be found in Hernandez and Stolfo (1995, 1998), Monge (2000) and Neiling and Muller (2001).

1.6.5. Alternatives to Sorted Neighbourhood

In order to avoid re-arranging the data set time and time again, and on the assumption that the data sets involved were previously arranged, or at least partially arranged, several methods have been proposed; among them, the use of priority queues (heaps), where representative records of the blocks currently in use are stored and searched first when seeking the block for the next record to be processed. A description of this method can be found in Monge and Elkan (1997). Similar methods are applied by Yancey (2002) in the BigMatch system.

1.6.6. Similarity/Distance-based Clustering

Although similar to the Sorted Neighbourhood method, this technique differs from it in that, instead of a sliding window of size w centred in the pre-arranged data set, it uses a canopy block for the record to be processed, consisting of the records that are nearby according to a similarity/distance-based function. The basic idea is to use similarity/distance functions which are easier to calculate than the comparison function actually used, and which approximate its real value: records that are located far apart from each other will be non-matches.

[Figure omitted: identification of canopy clusters for two data sets A and B on two key variables BV1 and BV2.]

The procedure for identifying canopy clusters is as follows: given two datasets A and B, whose elements (records) are labelled a or b respectively, and using two key variables, say BV1 and BV2, both datasets are arranged in a single list of records; then one record is randomly chosen as the centre of the first cluster. All the records within a certain distance are considered to belong to the corresponding canopy cluster; then the randomly chosen record and a subset of records close to it (within a smaller threshold distance) are removed from the list, in order to avoid the proliferation of overlapping clusters.
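To fix ideas on the key-based blocking methods of this section, the sketch below builds standard blocks from an exact key and derives Bigram Indexing block labels for the same key, reproducing the "28046" example above. The records, the misprinted value and the threshold are illustrative.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "zip": "28046"},
    {"id": 2, "zip": "28046"},
    {"id": 3, "zip": "28047"},   # one misprinted digit
]

# Standard blocking: one block per exact key value; record 3 ends up alone.
standard_blocks = defaultdict(list)
for rec in records:
    standard_blocks[rec["zip"]].append(rec["id"])

# Bigram Indexing: all order-preserving sub-lists of the bigram list whose
# length is a fraction (threshold) of the full list; each sub-list labels a block.
def bigram_block_labels(key, threshold=0.75):
    bigrams = [key[i:i + 2] for i in range(len(key) - 1)]
    sub_len = max(1, int(round(threshold * len(bigrams))))
    return {"-".join(sub) for sub in combinations(bigrams, sub_len)}

fuzzy_blocks = defaultdict(set)
for rec in records:
    for label in bigram_block_labels(rec["zip"]):
        fuzzy_blocks[label].add(rec["id"])

print(dict(standard_blocks))
# Blocks that bring more than one record together: the label "28-80-04"
# also gathers record 3, despite the misprint.
print({label: ids for label, ids in fuzzy_blocks.items() if len(ids) > 1})
```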
For more detail on this clustering approach, see Bilenko and Mooney (2002), Cohen and Richman (2002), McCallum et al. (2000) and Neiling and Muller (2001).

Bibliography
Baxter, R., Christen, P. and Churches, T. (2003) A Comparison of fast blocking methods for record linkage. In Proc. of the ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 25-27, Washington, DC, USA, August 2003.
Bilenko, M. and Mooney, R.J. (2002) Learning to Combine Trained Distance Metrics for Duplicates Detection in Databases. Technical Report AI-02-296, University of Texas at Austin, February 2002.
Christen, P. and Churches, T. (2005a) A Probabilistic Deduplication, Record Linkage and Geocoding System. In Proceedings of the ARC Health Data Mining Workshop, University of South Australia, April 2005.
Christen, P. and Churches, T. (2005b) Febrl: Freely extensible biomedical record linkage Manual. Release 0.3 edition, Technical Report TR-CS-02-05, Department of Computer Science, FEIT, Australian National University, Canberra.
Cochinwala, M., Dalal, S., Elmagarmid, A.K. and Verykios, V.S. (2001) Record Matching: Past, Present and Future.
Cohen, W. and Richman, J. (2002) Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
Elfeky, M., Verykios, V. and Elmagarmid, A. (2002) TAILOR: A Record Linkage Toolbox. Proc. of the 18th Int. Conf. on Data Engineering, IEEE.
Gu, L., Baxter, R., Vickers, D., and Rainsford, C. (2003) Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra.
Hernandez, M. and Stolfo, S. (1995) The Merge/Purge Problem for Large Databases. In Proc. of the 1995 ACM SIGMOD Conf., pp. 127-138.
Hernandez, M. and Stolfo, S. (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2).
Jaro, M.A. (1989) Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, Volume 84, pp. 414-420.
Kelley, R.P. (1984) Blocking considerations for record linkage under conditions of uncertainty. Proceedings of the Social Statistics Section, American Statistical Association, pp. 602-605.
McCallum, A., Nigam, K. and Ungar, L. (2000) Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. of the Sixth ACM SIGKDD Int. Conf. on KDD, pp. 169-178.
Monge, A.E. (2000a) Matching algorithms within a duplicate detection system. IEEE Data Engineering Bulletin, 23(4).
Monge, A.E. (2000b) An Adaptive and Efficient Algorithm for Detecting Approximately Duplicate Database Records.
Monge, A.E. and Elkan, C. (1997) An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, May 1997.
Neiling, M. and Muller, R.M. (2001) The good into the Pot, the bad into the Crop. Preselection of Record Pairs for Database Fusion. In Proc. of the First International Workshop on Database, Documents, and Information Fusion, Magdeburg, Germany.
Newcombe, H.B. (1988) Handbook of Record Linkage. Oxford University Press.
Yancey, W.E. (2002) BigMatch: A program for extracting probable matches from a large file for record linkage. RRC 2002-01, Statistical Research Division, U.S. Bureau of the Census.
Yancey, W.E. (2004) A Program for Large-Scale Record Linkage. In Proceedings of the Section on Survey Research Methods, American Statistical Association.

Other bibliography
Christen, P. (2007) Improving data linkage and deduplication quality through nearest-neighbour based blocking. Submitted to the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'07).
Christen, P. and Churches, T. (2004) Blind Data Linkage using n-gram Similarity Comparisons. Proceedings of the 8th PAKDD'04 (Pacific-Asia Conference on Knowledge Discovery and Data Mining), Sydney. Springer Lecture Notes in Artificial Intelligence (3056).
Christen, P., Churches, T., and Hegland, M. (2004) A Parallel Open Source Data Linkage System. In Proc. of the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Sydney.
Christen, P., Churches, T., and Zhu, J.X. (2002) Probabilistic Name and Address Cleaning and Standardization. Presented at the Australasian Data Mining Workshop, Canberra.
Christen, P., Churches, T., Lim, K., and Zhu, J.X. (2002) Preparation of name and address data for record linkage using hidden Markov models. BioMed Central Medical Informatics and Decision Making. (http://www.biomedcentral.com/1472-6947/2/9)
Christen, P., et al. (2002a) Parallel Computing Techniques for High-Performance Probabilistic Record Linkage. Proceedings of the Symposium on Health Data Linkage, Sydney.
Christen, P., et al. (2002b) High-Performance Computing Techniques for Record Linkage. Proceedings of the Australian Health Outcomes Conference (AHOC-2002), Canberra.
Elfeky, M.G., Verykios, V.S., Elmagarmid, A., Ghanem, M. and Huwait, H. (2003) Record Linkage: A Machine Learning Approach, a Toolbox, and a Digital Government Web Service. Department of Computer Sciences, Purdue University, Technical Report CSD-TR 03-024.
Gu, L., and Baxter, R. (2004) Adaptive Filtering for Efficient Record Linkage. SIAM Int. Conf. on Data Mining, April 22-24, Orlando, Florida.
Verykios, V.S., Elfeky, M.G., Elmagarmid, A., Cochinwala, M. and Dalal, S. (2000) On The Accuracy And Completeness Of The Record Matching Process. In Sloan School of Management (ed.), Proc. of the Information Quality Conference, MIT, Cambridge, MA.
Goiser, K., and Christen, P. (2006) Towards Automated Record Linkage. In Proceedings of the Fifth Australasian Data Mining Conference (AusDM2006), Sydney.

1.7. Quality assessments
Nicoletta Cibella and Tiziana Tuoto (ISTAT)

Record linkage is affected by two types of errors: record pairs that should have been linked but actually remain unmatched and, vice versa, record pairs which are linked even if they refer to two different entities. In a statistical context, record linkage accuracy is evaluated in terms of the false match and false non-match rates. In other contexts, as in the medical and epidemiological fields, different measures are considered (the positive predictive value and sensitivity), although they are algebraic transformations of the false match and false non-match rates, respectively. The same accuracy indicators are also used in the research field of information retrieval, although they are usually named precision and recall. In order to define the previous indicators, let us assume that the following quantities are known:
– The number of record pairs linked correctly (true positives), nm.
– The number of record pairs linked incorrectly (false positives, Type I error), nfp.
– The number of record pairs unlinked correctly (true negatives), nu.
– The number of record pairs unlinked incorrectly (false negatives, Type II error), nfn.
– The total number of true match record pairs, Nm.
– The total number of true non-match record pairs, Nu.

The false match rate is defined as fmr = nfp/(nm + nfp), i.e. the number of incorrectly linked record pairs divided by the total number of linked record pairs. The false match rate corresponds to the well-known α error in a one-tail hypothesis test. The positive predictive value is easily computed from the false match rate (ppv = nm/(nm + nfp) = 1 − fmr) and corresponds to the number of correctly linked record pairs divided by the total number of linked record pairs. On the other hand, the false non-match rate is defined as fnmr = nfn/Nm, i.e. the number of incorrectly unlinked record pairs divided by the total number of true match record pairs. The false non-match rate corresponds to the β error in a one-tail hypothesis test. As for the ppv, sensitivity can be obtained from the fnmr (s = nm/Nm = 1 − fnmr) as the number of correctly linked record pairs divided by the total number of true match record pairs. Some authors also recommend computing the match rate, (nm + nfp)/Nm, i.e. the total number of linked record pairs divided by the total number of true match record pairs. A different performance measure is specificity, defined as nu/Nu, i.e. the number of correctly unlinked record pairs divided by the total number of true non-match record pairs. The difference between sensitivity and specificity is that sensitivity measures the percentage of correctly classified matches, while specificity measures the percentage of correctly classified non-matches.

As anticipated at the beginning of this section, in information retrieval the previous accuracy measures take the name of precision and recall. Precision measures the purity of the search results, i.e. how well a search avoids returning results that are not relevant, while recall refers to the completeness of the retrieval of relevant items. Hence, precision can be defined as the number of correctly linked record pairs divided by the total number of linked record pairs, i.e. it coincides with the positive predictive value. Similarly, recall is defined as the number of correctly linked record pairs divided by the total number of true match record pairs, i.e. recall is equivalent to sensitivity. As a matter of fact, precision and recall can also be defined in terms of non-matches. The same quality indicators can be evaluated even if the linkage procedure is performed through techniques different from the probabilistic one, for instance supervised or unsupervised machine learning (Elfeky et al., 2003). Additional performance criteria for record linkage are given by the time consumed by the software programmes and by the number of records that require manual review. The time complexity of a record linkage algorithm is usually dominated by the number of record comparisons. On the other hand, manual review of records is also time-consuming, expensive and error prone.

All the performance indicators defined above have to be evaluated on actual data, as shown in the following methods; a small sketch of their computation from the four basic counts is given below.
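The sketch below simply computes the accuracy measures just defined from the four basic counts. The counts themselves are illustrative: in practice they only become available through the evaluation methods described in the following subsections.

```python
def linkage_quality(n_m, n_fp, n_u, n_fn):
    """Accuracy measures from: true positives n_m, false positives n_fp,
    true negatives n_u, false negatives n_fn."""
    N_m = n_m + n_fn            # total number of true match pairs
    N_u = n_u + n_fp            # total number of true non-match pairs
    linked = n_m + n_fp         # total number of linked pairs
    return {
        "false match rate (fmr)": n_fp / linked,
        "positive predictive value / precision": n_m / linked,   # = 1 - fmr
        "false non-match rate (fnmr)": n_fn / N_m,
        "sensitivity / recall": n_m / N_m,                       # = 1 - fnmr
        "specificity": n_u / N_u,
        "match rate": linked / N_m,
    }

# Illustrative counts, e.g. as obtained from a clerically reviewed sample.
print(linkage_quality(n_m=950, n_fp=30, n_u=98970, n_fn=50))
```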
1.7.1. Sampling and clerical review

The measures defined above can be estimated by drawing, randomly or with a purposive selection, a sample of pairs from the whole set of pairs (i.e. from both M and U). These sample pairs are then matched with a more intensive (and accurate) procedure in order to evaluate the accuracy of the original match (see Hogan and Wolter 1988, Ding and Fienberg 1994). As the procedures implemented are more accurate and are performed by highly qualified personnel, the "rematch" is considered error free and is taken to represent the true match status. Sometimes the whole record linkage procedure on the selected sample is done manually, so as to be confident that it is "perfect". The bias of the original match is evaluated by means of the discrepancies between the match and rematch results. The selection of pairs to include in the sample is sometimes problematic. Winkler (1995) suggests reducing the sample size by selecting pairs of records from the area where problems arise more frequently. This can be done by adopting a weighting strategy, exploiting the fact that the closer the weights are to a fixed threshold, the more linkage errors occur. Alternative procedures consist of evaluating the quality of the record linkage procedure by using appropriate statistical models. These models produce an automatic estimate of the error rates, as described in the following paragraphs.

1.7.2. Belin-Rubin procedure

Belin and Rubin (1995) propose a model for estimating false match rates for each possible threshold value. They define a model where the distribution of observed weights is interpreted as a mixture of weights for true and false matches. Their approach focuses on providing a predicted probability of match for a pair of records, with an associated standard error, as a function of the matching weight. Their method is particularly useful when record linkage must satisfy the constraint that each record of one file cannot be matched to more than one record of the other file (one-to-one match constraint); in this case their procedure dramatically improves the record linkage performance because non-matches are mostly eliminated. Generally, the method works well when there is a good separation between the matching weights associated with matches and non-matches and the failure of the conditional independence assumption is not too severe. This method requires that a training sample of record pairs is available, in which the status of each record pair is perfectly known.

1.7.3. Torelli-Paggiaro estimation method

In order to avoid the use of a training sample, Torelli and Paggiaro (1999) suggest a strategy that allows evaluating the error rates by means of the estimated probability of being a match for each record pair. Maximum likelihood estimates of these probabilities are computed via the EM algorithm or some of its modifications. Torelli and Paggiaro propose to evaluate the false non-match rate as the sum of the matching probabilities of the record pairs below the threshold; the false match rate is computed similarly. The quality of these error rate estimators strongly depends on the accuracy of the probability estimates: if these probabilities are obtained under the conditional independence assumption, and this assumption does not hold, the error rate estimators will be strongly biased.

1.7.4. Adjustment of statistical analyses

Generally speaking, it is important to assess the record linkage quality because linkage errors can affect the population parameter estimates (Neter et al., 1965). Scheuren and Winkler (1993) propose a method for adjusting statistical analyses for matching errors.
In this case, the problem is restricted to the impact of the mismatch error on the bias of the coefficient of a standard regression model on two variables (one from each source). The estimator bias is corrected by introducing into the model the probabilities of being correctly and incorrectly matched. Scheuren and Winkler (1997) also propose to modify the former regression estimates for the presence of outliers, introducing an appropriate iterative solution. Lahiri and Larsen (2000) extend the model in order to estimate the regression coefficients also when more than one variable from a data set is considered.

Bibliography
Belin, T.R. and Rubin, D.B., 1995. A method for calibrating false-match rates in record linkage. Journal of the American Statistical Association, 90, pp. 694-707.
Christen, P. and Goiser, K., 2005. Assessing duplication and data linkage quality: what to measure? Proceedings of the Fourth Australasian Data Mining Conference, Sydney, December 2005, viewed 16 June 2006. (http://datamining.anu.edu.au/linkage.html)
Ding, Y. and Fienberg, S.E., 1994. Dual system estimation of Census undercount in the presence of matching error. Survey Methodology, 20, pp. 149-158.
Elfeky, M.G., Verykios, V.S., Elmagarmid, A., Ghanem, M. and Huwait, H., 2003. Record Linkage: A Machine Learning Approach, a Toolbox, and a Digital Government Web Service. Department of Computer Sciences, Purdue University, Technical Report CSD-TR 03-024.
Hogan, H. and Wolter, K., 1988. Measuring accuracy in a post-enumeration survey. Survey Methodology, 14, pp. 99-116.
Lahiri, P. and Larsen, M.D., 2000. Model based analysis of records linked using mixture models. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 11-19.
Neter, J., Maynes, S. and Ramanathan, R., 1965. The effect of mismatching on the measurement of response error. Journal of the American Statistical Association, 60, pp. 1005-1027.
Scheuren, F. and Winkler, W.E., 1993. Regression analysis of data files that are computer matched. Survey Methodology, 19, pp. 39-58.
Scheuren, F. and Winkler, W.E., 1997. Regression analysis of data files that are computer matched – part II. Survey Methodology, 23, pp. 157-165.
Winkler, W.E., 1993. Improved decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 274-279.
Winkler, W.E., 1995. Matching and record linkage. In: Cox, Binder, Chinnappa, Christianson, Colledge, Kott (eds.), Business Survey Methods. John Wiley & Sons, New York.

1.8. Analysis of files obtained by record linkage
Miguel Guigo (INE)

As can be inferred from the number of applied studies in which these methods are involved (Alvey and Jamerson, 1997), merging files through probabilistic record linkage is not an end in itself, but a means to a wide variety of goals related to the use of administrative microdata, sometimes even for non-statistical purposes. Applications include imputation, improvement of survey frames, treatment of non-response problems, longitudinal studies, procedures for obtaining better estimates, and so on. Therefore, when a pair of data sets is fused, any administrative decision as well as any statistical conclusion based on the linked file must take into account that the results are affected by two types of errors: on the one hand, the percentage of incorrectly accepted false matches and, on the other hand, the percentage of incorrectly rejected true matches.
Record linkage procedures must deal with the existing trade-off between these two types of errors and/or measure their effects on the parameter estimates of the models fitted to the resulting files. Different approaches have tackled the problem. The first, due to Neter et al. (1965), studies the bias in the estimates of response errors when the results of a survey are partially improved through record checks, and raises awareness that relatively small errors in the matching process can have substantial effects on the results. Scheuren and Oh (1975) focus on the problems noticed in a large-scale matching task such as a Census – Social Security match through the Social Security Number (SSN). (Although a unique common identifier is used to fuse data from the two files, problems can arise even when linkage is achieved through an automated process: Scheuren and Oh report misprints, the absence of the SSN in one of the two candidate records, unexplained changes of SSN in records known to belong to the same person, etc.) They focus attention on the impact of different decision rules on mismatching and on erroneous non-matching. Furthermore, they point out the constraints in developing an appropriate comparison vector when the statistical purposes differ from the administrative aims that generated the file and that regulate its maintenance. Nevertheless, their approach does not offer general criteria to estimate the parameters of the distributions, such as m(γab) and u(γab). Their approach is to select a sample of records, manually check their matched or unmatched status, and estimate those parameters from the observed proportions. More complete methodologies have been developed by Scheuren and Winkler (1993, 1996a, 1996b, 1997) through recursive processes of data editing and imputation. This methodology focuses on building an accurate imputation model from a fairly small number of likely matched pairs, which are in their turn the result of a common record linkage procedure: once a first round of links has been made, a subset of high-scored matches, whose error rate is estimated to be low, is selected to specify a linear regression model and estimate its parameters. Let A and B be the two data sets to be compared and suppose that, for every unit a in A, some likely matches have been selected from B. Let also x and y be two characteristics available for the records in A and B, respectively. In an ordinary univariate linear regression model, y_i = a_0 + a_1 x_i + ε_i, x_i and y_i ought to come from the same observation, say they belong to the unit a. However, the mismatched pairs of records are characterized by values of x and y observed on distinct units. In this case, the actual dependent variable in B is no longer Y, but a new variable z_i whose values are z_i = y_i when the linked record really belongs to the same unit (j = i), and z_i = y_j (j ≠ i) otherwise. Scheuren and Winkler (1993) consider several possible matches in B for every unit in A, so that z_i has several possible values, y_i with probability p_i and y_j with probability q_ij. It must be taken into account that the estimates of the intercept a_0 and the slope a_1 are biased, and the correlation between x and y is understated, because the two variables are independent in the cases where z_i is equal to y_j instead of y_i.
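A small simulation illustrates this attenuation effect (a hedged sketch with invented parameters, not an experiment from the cited papers): when a fraction of the responses is exchanged between units, the ordinary least squares slope shrinks towards zero roughly in proportion to the probability of a correct match.

import numpy as np

rng = np.random.default_rng(0)
n = 5_000
a0, a1 = 1.0, 2.0                     # true intercept and slope (illustrative values)

x = rng.normal(size=n)
y = a0 + a1 * x + rng.normal(scale=0.5, size=n)

# Simulate linkage: with probability 1 - p a record is mismatched, i.e. the
# observed response z comes from a randomly chosen other unit.
p = 0.8
z = y.copy()
wrong = rng.random(n) > p
z[wrong] = rng.permutation(y)[wrong]

slope_true = np.polyfit(x, y, 1)[0]   # close to 2.0 on correctly matched data
slope_naive = np.polyfit(x, z, 1)[0]  # attenuated towards p * a1, i.e. about 1.6
print(slope_true, slope_naive)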
Assuming that the probabilities p_i and q_ij can be estimated accurately, it is possible, in turn, to get better estimates of the parameters of the regression model, which can then provide feedback for the record linkage procedure by estimating values of y_i to be compared with those of the possible matches in B. The improved record linkage step can lead to a new cycle, until convergence. Scheuren and Winkler (1996b) also deal with a variety of different scenarios, depending on the availability of comparison variables. Larsen (1999, 2001, 2004) and Lahiri and Larsen (2000, 2005) have widely discussed the use of the former methodology with mixture models, trying to improve the estimates of the probability that a pair of records is actually a match. Those estimates can be found through maximum likelihood or Bayesian analysis, and the regression models are then adjusted by an alternative to the bias correction method used in Scheuren and Winkler. By means of simulated data sets, Larsen (1999) finds maximum likelihood estimates on the one hand and posterior distributions on the other hand for the mixture model parameters. The different values can be used to express the uncertainty in the relationship between records due to their unknown real status. Lahiri and Larsen (2000, 2005) consider the multivariate regression model y_i = x_i'β + ε_i, where x_i is a column vector of explanatory variables belonging to some record a in A, y_i is the response variable and β is the column vector of unknown regression coefficients. The bias of the estimator of β is investigated under the assumption of existing but not identified mismatches between records, so that the observed values on the left side of the equation above are actually the z_i described in the case study by Scheuren and Winkler. Lahiri and Larsen (2000) propose β̂ = (W'W)^(-1) W'z as an unbiased estimator, instead of the one obtained by ordinary least squares, where z is the column vector of actually observed values of the response variable and W is a linear transformation of the matrix X of explanatory data, with rows w_i' = q_i'X = Σ_j q_ij x_j'. A robust estimator based on absolute deviations is also mentioned. Variances of the different estimators are compared in Lahiri and Larsen (2005) via Monte Carlo simulation. Additionally, Liseo and Tancredi (2004) develop a brief regression analysis based on a Bayesian approach to record linkage, proposing a simulation to show that the relation between the values of the explanatory variables x_i and the actually observed values z_i can provide information to improve the linkage process. Finally, Winkler (2006a) suggests that a regression adjustment to improve matching can be applied using identifying variables that are not strictly the same but actually carry the same information from different points of view. Based on a practical experience by Steel and Konschnik (1997), a possible application to files with data on enterprises is proposed, pointing out that, for example, observations on receipts and on income refer to the same accounting concept. Bibliography Alvey, W. and Jamerson, B. (eds.), 1997. Record Linkage Techniques – 1997. (Proceedings of an International Record Linkage Workshop and Exposition on March 20-21, 1997 in Arlington, Virginia) Washington, DC: Federal Committee on Statistical Methodology. Lahiri P. and Larsen M.D., 2000.
Model-based analysis of records linked using mixture models, Proceedings of the Section on Survey Research Methods Section, American Statistical Association, pp. 11-19. Lahiri P. and Larsen M.D., 2005. Regression Analysis With Linked Data. Journal of the American Statistical Association, 100, pp. 222-230. Larsen M.D., 1999. Multiple imputation analysis of records linked using mixture models. Proceedings of the Survey Methods Section, Statistical Society of Canada, pp. 65-71. Larsen, M.D., 2001. Methods for model-based record linkage and analysis of linked files. In Proceedings of the Annual Meeting of the American Statistical Association, Mira Digital publishing, Atlanta. Larsen, M.D., 2004. Record Linkage of Administrative Files and Analysis of Linked Files. In IMS-ASA's SRMS Joint Mini-Meeting on Current Trends in Survey Sampling and Official Statistics. The Ffort Radisson, Raichak, West Bengal, India. Larsen, M.D., 2005. Advances in Record Linkage Theory: Hierarchical Bayesian Record Linkage Theory. 2005 Proceedings of the American Statistical Association, Survey Research Methods Section [CD-ROM], pp. 3277- 3284. Alexandria, VA: American Statistical Association. Liseo, B. Tancredi, A., 2004. Statistical inference for data files that are computer linked Proceedings of the International Workshop on Statistical Modelling - Firenze Univ. Press, pp. 224-228. Neter, J., Maynes, E.S, and Ramanathan, R., 1965. The effect of mismatching on the measurement of response errors. Journal of the American Statistical Association, 60, pp. 1005-1027. Scheuren, F. and Oh, H.L., 1975. Fiddling around with nonmatches and mismatches. Proceedings of the Social Statistics Section, American Statistical Association, pp. 627-633. Scheuren, F. and Winkler, W.E., 1993. Regression analysis of data files that are computer matched – Part I. Survey Methodology, Volume 19, pp. 39-58. Scheuren, F. and Winkler, W.E., 1996a. Recursive analysis of linked data files. U.S. Bureau of the Census, Statistical Research Division Report Series, n.1996/08. WP1 22 Scheuren, F. and Winkler, W.E., 1996b. Recursive Merging and Analysis of Administrative Lists and Data, Proceedings of the Section of Government Statistics, American Statistical Association, pp. 123–128. Scheuren F. and Winkler W.E., 1997. Regression analysis of data files that are computer matched- part II, Survey Methodology, 23, pp. 157-165. Steel, P., and Konschnik, C.,1997. Post-Matching Administrative Record Linkage Between Sole Proprietorship Tax Returns and the Standard Statistical Establishment List. In Record Linkage Techniques 1997, Washington, DC: National Academy Press, pp. 179-189. Winkler, W. E., 1991. Error model for analysis of computer linked files. Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 472-477. Winkler, W.E., 2006a. Overview of Record Linkage and Current Research Directions. U.S. Bureau of the Census, Statistical Research Division Report Series, n.2006/2. WP1 23 2. Literature review on statistical matching 2.1 Statement of the problem of statistical matching Marcello D’Orazio (ISTAT) The words Statistical Matching (or data fusion or synthetical matching) refer to a series of methods whose objective is the integration of two (or more) data sources (samples) drawn from the same target population. The data sources are characterized by the fact they all share a subset of variables (common variables) and, at the same time, each source observes distinctly other sub-sets of variables. 
Moreover, there is a negligible chance that the different sources observe the same units (the sets of units are disjoint). 2.1.1. Differences with record linkage and preliminary definitions This set of procedures differs from record linkage both in the inputs (i.e. the data sets to be integrated) and in the output. As far as the input is concerned, the data sets are usually two distinct samples without any unit in common (no overlap between the data sources). On the contrary, record linkage requires at least a partial overlap between the two sources. In the simplest case of two samples, the classical statistical matching framework can be represented in the following manner (Kadane, 1978; D'Orazio et al., 2006):

Data source A: (Y, X)        Data source B: (X, Z)

In this situation X is the set of common variables, Y is observed only in A but not in B, and Z is observed in B but not in A (Y and Z are never jointly observed). A second difference between record linkage and statistical matching is the output. In record linkage, the objective is to recognize records belonging to the same unit in two distinct but partially overlapping data sets. For this reason the focus is only on the X variables and on how to deal with the possibility that these variables are reported with error. On the contrary, statistical matching methods aim at integrating the two sources in order to study the relationship between the two sets of variables not jointly observed, i.e. Y and Z, or, more generally, to study how X, Y and Z are related. This objective can be achieved through two seemingly distinct approaches. Bibliography D'Orazio, M., Di Zio, M. and Scanu, M., 2006. Statistical matching: theory and practice. John Wiley, Chichester. Kadane, J.B., 1978. Some statistical problems in merging data files. 1978 Compendium of Tax Research, Office of Tax Analysis, Department of the Treasury, pp. 159-171. Washington, DC: U.S. Government Printing Office. Reprinted in Journal of Official Statistics (2001), Volume 17, pp. 423-433. 2.2 The statistical matching workflow Marcello D'Orazio, Marco Di Zio and Mauro Scanu (ISTAT) The statistical matching workflow is extremely simple. It consists of three sequential steps: harmonization of the data sets to match, application of a matching algorithm, and quality evaluation of the results. This simplicity is due to the fact that the statistical matching problem is essentially a simple inferential problem: the estimation of joint information on the variables that are never jointly observed. This estimation problem can be either explicit or implicit in a statistical matching approach; nevertheless, it is always present. This chapter focuses mainly on the second step, explaining all the different approaches that are available, according to the researcher's goals and the available information (Section 2.3). The following scheme illustrates the possibilities available so far.

Stat. matching objectives: Macro, Micro
Approaches to statistical matching: Parametric, Nonparametric, Mixed

As in every inferential statistics problem, the specification of the estimation framework can be either parametric or nonparametric. In the case of statistical matching there is also a third possibility, given by a mixture of the two frameworks. This last option basically consists of a two-step procedure: Step 1) a parametric model is assumed and its parameters are estimated; Step 2) a completed synthetic data set is derived through a nonparametric micro approach.
On the contrary, it is a specific feature of statistical matching to distinguish two kinds of objectives, in the sequel denoted as micro and macro approaches. In the micro approach, the statistical matching objective is the construction of a complete “synthetic” file. The file is complete in the sense that it contains records where X, Y and Z are jointly present. The term “synthetic” refers to the fact that this file is not the result of a direct WP1 25 observation of all the variables on a set of units belonging to the population of interest, but it is obtained exploiting information in the observed distinct files. For example, in the case of the data sets as in the previous figure, a synthetic file can be Y X ~ Z Synthetic file (file A with Z filled in). In a macro approach, the distinct data sources are used in order to provide an estimate of the joint distribution function of the variables not jointly observed ( f y, z in the example, or f x, y, z ) or of some of its key characteristics such as a correlation matrix ( ρ YZ ), a contingency table, a measure of association, etc. The micro approach seems the most popular one. There are different reasons for this. Firstly, in the initial statistical matching applications, the problem is regarded as an imputation problem (see e.g. Okner 1972): the missing variables have to be imputed in one of the source data sets called recipient (or host file). Therefore, for each record of the recipient file the missing variables are imputed using records (chosen suitably) from the other sample, the donor file. Secondly, the micro approach has some nice features from a practical point of view. It permits to “create” huge data sources to be used as input for micro simulations (Cohen, 1991). In other cases, a synthetic data set is preferred just because it is much easier to analyze than two or more incomplete data sets. Sometimes building a synthetic data set may be preferred due to the difficulties in managing complex models capable to handle at the same time several categorical and continuous variables. Finally, a synthetic data set can be used by different subjects with different informative needs. On the contrary, when the objective of the integration is simply the estimation of a contingency table or of a correlation matrix for variables not jointly observed, the macro approach can result to be much more efficient than the micro one. This approach basically consists in obtaining direct estimates of the parameters of interest from the partially observed data sets. However, there is not a clear-cut distinction between the two approaches. In fact, in some cases the synthetic file can be obtained as a by-product of the estimation of the joint distribution of all the variables of interest. For instance, a parametric model is assumed, its parameters are estimated and then these estimates are used to derive a synthetic file, e.g. by predicting the missing variables in one of the source data files. A crucial element for tackling the statistical matching problem is the availability of information, apart from that contained in the two data sets. It is possible to distinguish three different situations: CIA: it is believed that Y and Z follow a very simple model, the conditional independence of Y and Z given X; this model can be easily estimated by means of the data sets at hand (note that only some models are estimable using the data sets A and B). 
Auxiliary information: external information is available, either on some parameters or in the form of a third data set where Y and Z are jointly observed; models other than the conditional independence assumption become estimable. Uncertainty: no assumptions are made; in this case there is a particular form of uncertainty which characterizes the statistical matching problem: uncertainty due to the lack of joint information on Y and Z. In Section 2.3, the parametric, nonparametric and mixed methods for the micro and macro approaches are discussed under the CIA and under the use of auxiliary information. Section 2.4 is devoted to the assessment of uncertainty when no assumptions are made on the model between Y and Z. Finally, Section 2.5 details how quality evaluations can be performed. Bibliography Cohen, M.L., 1991. Statistical matching and microsimulation models. Improving information for social policy decisions: the uses of microsimulation modelling, Volume II, Technical Papers (C.F. Citro and E.A. Hanushek (eds.)). National Academy Press, Washington, DC. Okner, B. A., 1972. Constructing a new database from existing microdata sets: the 1966 merge files. Annals of Economic and Social Measurement, Volume 1, pp. 325-342. 2.3 Statistical matching methods Marco Di Zio (ISTAT) In the description of matching methods, different elements should be considered: the nature of the method, i.e. parametric, nonparametric and mixed techniques; the main goal of the integration task, i.e. the micro or macro objective; and the model assumed for the variables X, Y, Z in the population. The first two characteristics have already been introduced in the previous section. As far as model assumptions are concerned, it is necessary to distinguish between the situations where the conditional independence assumption (CIA) between Y and Z given X holds and those where it cannot be assumed. When the CIA holds, the structure of the joint (X, Y, Z) distribution is f(x, y, z) = f(y | x) f(z | x) f(x). This assumption is especially important in the context of statistical matching, as most of the methods usually applied assume this structure, explicitly or implicitly. The reason is that the data at hand are sufficient to estimate directly the quantities concerning the conditional distributions of Y given X and of Z given X, and the marginal distribution of X. The CIA cannot be tested from the data sets A and B. This implies that it has to be assumed on the basis of general considerations on the phenomenon under study, or by testing it on some historical or similar data sets where all the variables are jointly observed. In order to avoid the CIA, auxiliary information must be used. In order to exemplify the concept of conditional independence, an illustrative example is provided. Let us suppose we have observed the variables Gender (Y), Income (Z) and Country (X) on a sample of units. In the case of independence of Gender and Income, if we are interested in computing the probability of being a woman with an income higher than the average income, it is sufficient to determine the proportion of women and the proportion of people with an income higher than the average, and compute the product of the two probabilities. If we also consider the conditioning variable X (Country), the previous statement should be true within each country, i.e. the simple product of the proportions applies to each and every country, allowing for differences in the proportions of women and of high incomes across countries; this represents the conditional independence of Gender and Income given the Country.
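A small numerical illustration of this factorisation (with invented proportions, purely for exposition; the figures are not taken from the report):

# Hypothetical within-country proportions, for illustration only.
countries = {
    # country: (share of women, share with income above the overall average)
    "IT": (0.51, 0.40),
    "NL": (0.50, 0.45),
}
pop_share = {"IT": 0.6, "NL": 0.4}    # hypothetical population shares of each country

# Under conditional independence of Gender and Income given Country,
# the joint proportion within a country is the product of the two margins.
joint_by_country = {c: w * h for c, (w, h) in countries.items()}

# The overall joint probability mixes the within-country products by country size;
# in general it is NOT the product of the overall marginal proportions.
overall_joint = sum(pop_share[c] * joint_by_country[c] for c in countries)
print(joint_by_country, overall_joint)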
The violation of conditional independence means that this independence relationship is not true in at least one country, for instance because women systematically earn less than men, or vice versa. In the following, a discussion of the statistical matching methods characterised by these different elements is given. The first part of the section deals with methods assuming conditional independence. The second part discusses methods exploiting the use of auxiliary information. 2.3.1. Conditional independence assumption 2.3.1.1 Parametric methods In parametric methods, the model is determined by a finite number of parameters. Models introduced in the literature for continuous variables are mainly based on the assumption of multinormality of the data. The CIA ensures that the data are sufficient to estimate the parameters of the model, but when integrating two data sets it is a crucial point how to estimate the parameters of the joint distribution from the information available in the two separate data sets. In the following we consider only three variables X, Y and Z; the generalization to the multivariate case is straightforward. The critical parameters are those involving the relations between the variables Y and Z that are not jointly observed. In the multinormal case the critical parameter is the covariance (or, analogously, the correlation) σ_YZ. Under conditional independence, this parameter is determined by other estimable parameters, because σ_YZ = σ_XY σ_XZ / σ²_X. However, even though in this case the critical parameter is estimable, great care should be taken when combining estimates from different data sets. For instance, the naive estimation of the parameters through their observed counterparts in the data sets, that is, for σ_XY the sample covariance s_XY computed on file A (s_XY;A), for σ_XZ the sample covariance s_XZ computed on file B (s_XZ;B), and for σ²_X the sample variance s²_X;A∪B, may lead to unacceptable results, such as a covariance matrix that is not positive semi-definite. A solution to this problem (Anderson, 1957) is to use the maximum likelihood approach which leads, for the estimation of the covariance σ_YX, to correcting the sample covariance by a factor that is the regression coefficient of Y on X, i.e. σ̂_YX = β̂_YX s²_X;A∪B, where the regression coefficient is estimated as β̂_YX = s_YX;A / s²_X;A. Analogous results hold for the pair Z and X. A discussion of the problems concerning the combination of estimates obtained from different data sets is in Moriarity and Scheuren (2001), Rubin (1974) and D'Orazio et al. (2006a). Formulae to estimate the parameters coherently by means of a likelihood approach in a general multivariate context are in D'Orazio et al. (2006a). A Bayesian approach is described in Rässler (2002, 2003).
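A minimal sketch of this maximum likelihood recipe (simulated data and hypothetical variable names; not code from the cited references): the regression coefficients of Y on X and Z on X are estimated on the respective files, rescaled by the variance of X pooled over both files, and σ_YZ is then derived under the CIA.

import numpy as np

rng = np.random.default_rng(1)

# Two separate files sharing only X: A observes (X, Y), B observes (X, Z).
xA = rng.normal(size=1_000); yA = 1.5 * xA + rng.normal(size=1_000)
xB = rng.normal(size=1_500); zB = -0.8 * xB + rng.normal(size=1_500)

# Pooled variance of X over A union B.
s2x_pooled = np.var(np.concatenate([xA, xB]), ddof=1)

# Regression coefficients estimated on the file where each pair is observed.
beta_yx = np.cov(yA, xA, ddof=1)[0, 1] / np.var(xA, ddof=1)
beta_zx = np.cov(zB, xB, ddof=1)[0, 1] / np.var(xB, ddof=1)

# ML-style estimates of the covariances with X, rescaled by the pooled variance,
# and the CIA-induced estimate of the never jointly observed covariance sigma_YZ.
sigma_yx = beta_yx * s2x_pooled
sigma_zx = beta_zx * s2x_pooled
sigma_yz_cia = sigma_yx * sigma_zx / s2x_pooled
print(sigma_yx, sigma_zx, sigma_yz_cia)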
Similar considerations hold for categorical variables. Let θ_ijk denote the probability that X assumes category i (for i = 1,…, I), Y assumes category j (for j = 1,…, J) and Z assumes category k (for k = 1,…, K). The CIA implies that the critical parameters involving Y and Z satisfy θ_ijk = θ_ij. θ_i.k / θ_i.. = θ_j|i θ_k|i θ_i.. , where θ_j|i and θ_k|i represent the conditional probabilities of Y given X and of Z given X respectively, which can be estimated through the data sets A and B. Also in this case, how to combine the estimates is an important issue. A coherent combination of the estimates, under a multinomial model, is obtained through the maximum likelihood estimates θ̂_i.. = (n^A_i.. + n^B_i..) / (n_A + n_B), θ̂_j|i = n^A_ij. / n^A_i.. , θ̂_k|i = n^B_i.k / n^B_i.. , where n_A is the size of data set A, n^A_i.. is the number of observations assuming category i in data set A, and so on. The model is illustrated in detail in D'Orazio et al. (2006a,b). When the goal is a micro approach, the estimation of the model parameters is still required. In this case, synthetic data are generally obtained by drawing values from the estimated model or by imputing the estimated conditional expectation. 2.3.1.2 Nonparametric methods Nonparametric methods differ from parametric ones in that the model structure is not specified a priori. The term nonparametric is not meant to imply that such models completely lack parameters, but that the number and nature of the parameters are flexible and not fixed in advance. The nonparametric techniques most used in statistical matching are those belonging to the hot deck methods. Hot deck methods are widely used for imputation, especially in statistical agencies. They aim at the completion of one data set (say A), denoted as the recipient file, by substituting a missing value with a value observed on a similar statistical unit in the other data set (say B), denoted as the donor file. The way of measuring the similarity of units characterizes the hot deck method. Random hot deck and distance hot deck are generally the techniques most used in statistical matching. Random hot deck consists in randomly choosing a donor record in the donor file for each record in the recipient file. The random choice is often made within strata determined through variables leading to homogeneous sets of units. Distance hot deck is widely used in the case of continuous variables. In the simplest case of one variable X (the generalization to the multivariate case is straightforward), the donor for the a-th observation in the recipient file A is chosen so that d_ab* = |x^A_a − x^B_b*| = min_{1≤b≤n_B} |x^A_a − x^B_b| (distances other than the one used in this example can be chosen). An interesting variation of the distance hot deck is the constrained distance hot deck. In this approach, each record in B can be chosen as a donor only once. This guarantees the preservation of the marginal distribution of the imputed variable (in this case the variable Z). A discussion of the use of hot deck techniques is in D'Orazio et al. (2006a) and Singh et al. (1993). The limits of those methods are examined in depth in Paass (1985) and Conti et al. (2006). The nature of the previous nonparametric techniques, which are essentially imputation methods, makes their use for the micro objective natural. As far as the macro objective is concerned, a clear theoretical framework has not yet been developed; a first attempt is in D'Orazio et al. (2006a) and Conti et al. (2006).
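A minimal sketch of unconstrained and constrained distance hot deck on a single matching variable (toy data and invented function names; real applications use multivariate distances and donation classes, and the constrained variant described in the literature minimises the total matching distance rather than proceeding greedily as done here):

def distance_hot_deck(x_recipient, x_donor, z_donor, constrained=False):
    # Impute Z in the recipient file from the nearest donor on X.
    # If constrained is True, each donor is used at most once, which helps
    # preserve the marginal distribution of the imputed variable Z.
    available = set(range(len(x_donor)))
    imputed = []
    for xa in x_recipient:
        candidates = available if constrained else range(len(x_donor))
        b_star = min(candidates, key=lambda b: abs(xa - x_donor[b]))
        imputed.append(z_donor[b_star])
        if constrained:
            available.remove(b_star)
    return imputed

# Toy example: recipient file A (X only), donor file B (X and Z).
print(distance_hot_deck([1.0, 2.1, 2.0], [0.9, 2.2, 3.0], [10, 20, 30]))
print(distance_hot_deck([1.0, 2.1, 2.0], [0.9, 2.2, 3.0], [10, 20, 30], constrained=True))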
2.3.1.3 Mixed methods This is a class of techniques that makes use of both parametric and nonparametric methods. More precisely, a parametric model is initially adopted, and a completed synthetic data set is then obtained by means of some hot deck procedure. The main reason for introducing this approach is that it exploits the advantages of models (being more parsimonious for the estimation) while the final data are 'live' data, i.e. really observed values. Despite this intuitive justification, the theoretical properties of these methods are still to be investigated in depth. In the following, two examples of mixed procedures are given. For continuous variables the general mixed procedure usually adopted is: 1. Estimate (on B) the regression parameters of Z on X. 2. For each of the a-th observations in file A (a = 1,…, n_A), generate an intermediate value ž_a by using the regression function. 3. Impute the missing observations in file A by using distance hot deck methods, where the distance is computed on the values ž_a and z_b. For categorical variables a procedure is: 1. Estimate the expected cell frequencies through a log-linear model. 2. Impute through a hot deck procedure, accepting the donor only if the frequency of the cell the donor belongs to does not exceed the estimated expected cell frequency (computed in the first step). A non-exhaustive list of references relating to settings more complicated than the CIA is Rubin (1986), Singh et al. (1993), Moriarity and Scheuren (2001, 2003). Also in this case, the nature of the methods strongly orients their use towards micro objective problems. 2.3.2. Auxiliary information The CIA is easily used in the statistical matching framework, because the corresponding model can be easily estimated from the available data sets. However, the model that generates the data can be quite different from the CIA, as when the conditional correlation coefficient of Y and Z given X is different from zero. In this case, the data sets at hand are unable to estimate the model, and it is necessary to resort to auxiliary information to fill this gap. In this setting it is important to take into account the characteristics of the auxiliary information. It may have the form of (Singh et al., 1993): 1) a third file where either (X,Y,Z) or (Y,Z) are jointly observed; 2) a plausible value for the inestimable parameters of either (Y,Z|X) or (Y,Z). Also in this framework, the considerations about the previously described characteristics (parametric, nonparametric and mixed methods; micro and macro objectives) remain valid. However, when using auxiliary information, it is a crucial point how to include this information in the method that is going to be used. A problem of coherence between the external auxiliary information and the estimates obtained from the data at hand may arise. 2.3.2.1 Parametric methods As far as continuous variables are concerned, let us assume that the data are normally distributed. Moreover, let us assume that the inestimable parameter σ_YZ is equal to a certain value σ*_YZ (obtained, for instance, from a past survey). In this setting it is not necessarily true that the covariance matrix determined by this value and by the maximum likelihood estimates of the other parameters is positive definite. In fact, in this case a constrained likelihood approach should be used; further studies are needed for this case. The simplest case occurs when the auxiliary information concerns the partial parameter, e.g. the partial correlation coefficient (ρ_YZ|X = ρ*_YZ|X). In this setting, this value can be coherently combined with the maximum likelihood estimates σ̂²_Y|X and σ̂²_Z|X (the same obtained under the CIA) through the formula σ̂_YZ|X = ρ*_YZ|X √(σ̂²_Y|X σ̂²_Z|X). The critical covariance can then be estimated through the formula σ̂_YZ = σ̂_YZ|X + σ̂_YX σ̂_ZX / σ̂²_X. A generalized version of these algorithms can be found in D'Orazio et al. (2006a). A slightly different version for estimating the parameters with a given value of ρ_YZ|X is also given in Rässler (2003). As far as categorical variables are concerned, similar considerations hold. It is worth remarking that in this context information on Y and Z alone is sufficient to determine at most a loglinear model without the triple interaction.
In the latter case information on the triple Y, Z and X is needed. 2.3.2.2 Nonparametric methods Let us consider A as the recipient file, B as the donor file, and C as the auxiliary file. A general algorithm for imputing missing data through hot deck is 1) the ath observation of file A is imputed with a value za*, taken from C, for a=1,…,na by means of a hot deck procedure. When file C contains all the three variables distances (or imputation cells) are computed on (X,Y). When file C contains only the two variables (Y,Z), distances (or the imputation cells) are computed by only considering Y. 2) the ath observation of file A is imputed with a value za**, taken from B, by means of a hot deck procedure, that takes into account for computing the distance (or imputation cells) the variables (xa,za*) in A and the variables (xb, zb) in B. 2.3.2.3 Mixed methods Considerations concerning mixed methods are similar to those described under the CIA. The main difference consists in the use of auxiliary information. It may be introduced in the parametric estimation phase when an intermediate imputed value is computed (see previous WP1 31 section 1.3 on Mixed Methods). Auxiliary information can also be used as a constraint to be fulfilled, for instance 1. Estimate the regression parameters of Z on Y and X (through auxiliary information). 2. For each of the ath observations in file A (a=1,…,na), generate an intermediate value ža by using the regression function. 3. Impute the missing observations in file A by using distance hot deck methods, where distance is computed on the values ža and zb. The potential donor observation is accepted as an actual donor if and only if the frequency of the cell it belongs to (having introduced a discretization for the variables X, Y and Z) does not exceed the frequency of the same cell in the auxiliary file C. As far as the use of auxiliary information is concerned, parametric techniques based on normal and multinomial data are described in D’Orazio et al. (2006a). Nonparametric methods are mainly described in D’Orazio et al. (2006a) and Singh et al. (1993). Mixed methods are illustrated in Kadane (1978), Moriarity and Scheuren (2001, 2003), D’Orazio et al., (2006a), and Singh et al. (1993). 2.3.3. Partial auxiliary information A final consideration is about the use of partial auxiliary information. Sometimes, it is possible to make use of practical considerations leading to the construction of constraints. As an example, in the case of social surveys, logical rules based on law may be derived, e.g. it cannot be accepted that a ten years old person is married. In general, to be useful, the constraint has to refer to the variables never jointly observed. Nevertheless, in practice it is common to have only partial information. with respect to a specific combination of (Y,Z) (as for structural zeros). For instance, if we represent the frequency distribution of a population with respect to Age (in class of years) and the Marital Status (married/non married) in a contingency table, the previous logical rule about the age/marry results to constrain some (not all) cells at zero. This partial knowledge is not general enough to determine a specific distribution, but it can be usefully applied to decrease the degree of uncertainty about the relationships concerning (Y,Z), see D’Orazio et al. (2006a, 2006b) and Vantaggi (2005). This concept will be clarified in the next section dedicated to the study of uncertainty. Bibliography Anderson, T.W., 1957. 
Maximum likelihood estimates or a multivariate normal distribution when some observations are missing, Journal of the American Statistical Association, 52, pp. 200-203. Conti P.L., Marella, D., Scanu M., 2006. Nonparametric evaluation of matching noise. Proceedings of the IASC conference “Compstat 2006”, Roma, 28 August – 1 September 2006, Physica-Verlag/Springer, pp. 453-460. D’Orazio, M., Di Zio, M. and Scanu, M., 2006a. Statistical matching for categorical data: displaying uncertainty and using logical constraints. Journal of Official Statistics, Volume 22, pp. 137-157. D’Orazio, M., Di Zio, M. and Scanu, M., 2006b. Statistical Matching: Theory and Practice. John Wiley, Chichester. WP1 32 Kadane, J.B., 1978. Some statistical problems in merging data files. In 1978 Compendium of Tax Research, Office of Tax Analysis, Department of the Treasury, pp.159-171. Washington, DC: U.S. Government Printing Office. Reprinted in Journal of Official Statistics (2001), Volume 17, pp. 423-433. Moriarity, C. and Scheuren, F., 2001. Statistical matching: a paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, Volume 17, pp. 407-422. Moriarity, C. and Scheuren, F., 2003. A note on Rubin’s statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics, Volume 21, pp. 65-73. Paass, G., 1985. Statistical record linkage methodology, state of the art and future prospects. Bulletin of the International Statistical Institute, Proceedings of the 45th Session, LI, Book 2. Rässler, S., 2002. Statistical matching: a frequentist theory, practical applications, and alternative Bayesian approaches. Springer-Verlag, New York. Rässler, S., 2003. A non-iterative Bayesian approach to statistical matching. Statistica Neerlandica, Volume 57(1), pp. 58-74. Renssen, R.H., 1998. Use of statistical matching techniques in calibration estimation. Survey Methodology, Volume 24(2), pp. 171-183. Rubin, D.B., 1976. Inference and missing data. Biometrika, Volume 63, pp. 581-592. Rubin, D.B., 1986. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics, Volume 4, pp. 87-94. Singh, A.C., Mantel, H., Kinack, M. and Rowe, G., 1993. Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology, 19, pp. 59-79. Vantaggi, B., 2005. The role of coherence for the integration of different sources. Proceedings of the 4th International Symposium on Imprecise Probabilities and Their Applications (F.G. Cozman, R. Nau and T. Seidenfeld (eds.)), pp. 269-278. 2.4. Uncertainty in statistical matching Mauro Scanu (ISTAT) Statistical matching is essentially a problem characterized by uncertainty. The available information (i.e. the two samples A and B) is not enough to estimate the joint distribution of X, Y and Z, unless non testable assumptions as the CIA are believed to hold, or external auxiliary information is at hand. 2.4.1. Definition of uncertainty in the statistical matching context When a practitioner avoids the use of non testable assumptions (as the CIA) and when auxiliary information is not available, the statistical matching problem consists of evaluating WP1 33 uncertainty for the joint distribution of X, Y and Z, understanding whether a unique solution can be taken into account, reducing uncertainty if possible. 
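Before turning to the formal definitions, the normal case discussed below admits a simple closed-form bound that conveys the idea: if only the correlations of Y and Z with X are estimable, positive semi-definiteness of the correlation matrix confines ρ_YZ to an interval centred on the conditional independence value ρ_XY ρ_XZ. A hedged sketch for the three-variable case (standard algebra, not code from the cited works; the four-variable example in Figure 1 below adds further constraints):

import math

def rho_yz_bounds(rho_xy, rho_xz):
    # Admissible interval for the inestimable correlation rho_YZ:
    # positive semi-definiteness of the 3x3 correlation matrix of (X, Y, Z)
    # requires rho_YZ to lie within +/- sqrt((1 - rho_xy^2)(1 - rho_xz^2))
    # of the conditional independence value rho_xy * rho_xz.
    centre = rho_xy * rho_xz
    half_width = math.sqrt((1 - rho_xy**2) * (1 - rho_xz**2))
    return centre - half_width, centre + half_width

# Example: with rho_XY = 0.9 and rho_XZ = 0.3 the data alone confine rho_YZ
# to a fairly wide interval centred on the CIA value 0.27.
print(rho_yz_bounds(0.9, 0.3))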
In order to define what uncertainty means in statistical matching, it is necessary to introduce the notion of estimable and inestimable parameters. In the case of samples A and B, the parameters of the marginal distribution of X and of the conditional distributions of Y|X and Z|X are estimable: it is possible to use maximum likelihood methods or appropriate unbiased and efficient estimators on the observed data. On the contrary, there are no data for estimating any parameter of the distribution of (Y,Z)|X. The latter parameters are inestimable for the samples A and B. Uncertainty is defined as the set of values that the inestimable parameters can assume given the estimates of the estimable parameters. In order to make this concept clear, let us consider the following simplified example. Let X, Y and Z be normally distributed, and suppose that Z is composed of two normal variables, Z1 and Z2. In this situation, almost all the parameters of the joint distribution of X, Y, Z1 and Z2 are estimable: all the means, the variances and the correlation coefficients for all the pairs of variables except (Y, Z1) and (Y, Z2). The following figure shows the uncertainty space for the two correlation coefficients given that the estimable parameters are set to specific values (similar results are in D'Orazio et al., 2006, p. 111, and Rässler, 2004). Figure 1 – Uncertainty space for ρ_YZ1 and ρ_YZ2 when ρ_XY = 0.9, ρ_XZ1 = 0.3, ρ_XZ2 = 0.5, and ρ_Z1Z2 = 0.4. In this example, the uncertainty space is an ellipse. The pair of inestimable parameters may assume any of the values in the ellipse: in other words, ρ_YZ1 may assume values between 0.14 and 0.68, while ρ_YZ2 may assume values between 0.07 and 0.82. The values of the two inestimable parameters ρ_YZ1 and ρ_YZ2 under the CIA correspond to the intersection of the two ellipse axes. 2.4.2. Estimation of the uncertainty space A seminal paper on uncertainty in statistical matching is Kadane (1978). That paper investigates the case where (X,Y,Z) are normally distributed. Exploration of the uncertainty space for the inestimable parameters is performed according to the following procedure: a. the estimable parameters are estimated consistently (so that for large sample sizes the estimates approximately coincide with the parameter values); b. the inestimable parameters may assume all those values that are compatible with the estimated ones. In the normal case, this problem concerns only the parameters in the variance matrix (or, equivalently, the correlation matrix). This matrix should be positive semi-definite. Hence the inestimable parameter ρ_YZ may assume an interval of values, defined by the inequality |Σ| ≥ 0 (where |Σ| is the determinant of the correlation matrix Σ). This approach was followed by Moriarity and Scheuren (2001, 2003, 2004), who extend it to the case where X, Y and Z are multivariate. They also show the performance of different imputation procedures in this context. A different approach to evaluating uncertainty in statistical matching was followed by Rubin (1986) and Rässler (2002). In this case, the statistical matching problem is explicitly seen as a missing data problem, and the missing data are imputed by means of multiple imputation. Both authors still restrict their analysis to the case of normally distributed data, so that uncertainty reduces to the evaluation of the different values that the correlation coefficient of (Y,Z) can assume. The method proposed by Rubin, named RIEPS by Rässler, fixes a value on a grid of possible values for ρ_YZ|X between -1 and 1, and then reconstructs ρ_YZ piecewise.
It can be proved that this is an improper multiple imputation approach. Rässler (2002) defines a proper multiple imputation approach, called NIBAS (non-iterative Bayesian approach to statistical matching). Missing Z in A and missing Y in B are imputed m times drawing imputations from the predictive distribution. The resulting complete data sets are used in order to estimate YZ m times. These m estimates describe how uncertain the inestimable parameter is. Rässler extends this approach also to the case X, Y and Z are multivariate, and suggests some measures for evaluating how uncertain the inestimable parameters are. Rässler (2002) also includes S-Plus codes for the application of the algorithms NIBAS and RIEPS. D’Orazio et al (2006a and b) study parameter uncertainty in statistical matching by means of maximum likelihood. They consider the case of normal variables, as in the previous references, but they also tackle the case X, Y and Z are categorical. In these papers the problem of the possible reduction of uncertainty is also considered. This aspect may be studied introducing constraints on the parameter space. These constraints are normally used in official statistics if structural zeros (used for edit rules) are present. Software codes in R are described in D’Orazio et al. (2006a). Vantaggi (2005) introduces a very promising procedure for the evaluation of the uncertain space for categorical variables and the introduction of parameter constraints for reducing uncertainty. The procedures discussed in this work are based on the De Finetti coherence approach. Bibliography D’Orazio, M. Di Zio, M. and Scanu, M., 2006a. Statistical matching. Theory and practice. John Wiley, Chichester. WP1 35 D’Orazio, M., Di Zio, M. and Scanu, M., 2006b. Statistical matching for categorical data: displaying uncertainty and using logical constraints. Journal of Official Statistics, Volume 22, pp. 137-157. Kadane, J.B., 1978. Some statistical problems in merging data files. Department of Treasury, Compendium of Tax Research, pp. 159-179. US Government Printing Office, Washington DC. Reprinted in 2001, Journal of Official Statistics, Volume 17, pp. 423-433. Moriarity, C. and Scheuren, F., 2001. Statistical matching: a paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, Volume 17, pp. 407-422. Moriarity, C. and Scheuren, F, 2003. A note on Rubin’s statistical matching using file concatenation with adjusted weights and multiple imputation. Journal of Business and Economic Statistics, Volume 21, pp. 65-73. Moriarity, C. and Scheuren, F, 2004. Regression based matching: recent developments. Proceedings of the Section on Survey Research Methods, American Statistical Association. Rässler, S., 2002. Statistical matching: a frequentist theory, practical applications, and alternative Bayesian approaches. Springer-Verlag, New York. Rässler, S., 2003. A non-iterative Bayesian approach to statistical matching. Statistica Neerlandica, Volume 57(1), pp. 58-74. Rässler, S., 2004. Data fusion: identification problems, validity, and multiple imputation. Austrian Journal of Statistics, Volume 33 (1-2), pp. 153-171. Rubin, D.B., 1986. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics, Volume 4, pp. 87-94. Vantaggi, B., 2005. The role of coherence for the integration of different sources. Proceedings of the 4th International Symposium on Imprecise Probabilities and Their Applications (F.G. Cozman, R. 
Nau and T. Seidenfeld (eds.)), pp. 269-278. 2.5. Evaluation of the accuracy of statistical matching Marcello D'Orazio (ISTAT) Evaluating the accuracy of the results obtained by applying a statistical matching technique is not a simple task. Formally, it should consist of the estimation of the Mean Square Error (MSE = Var + Bias²) of the statistic used to estimate an unknown characteristic of the target population. Unfortunately, in statistical matching applications the accuracy of the final results will depend both on the amount of errors (observation and non-observation errors) in the source data sets A and B and on the properties of the chosen statistical matching method. If it is assumed that the amount of errors in the source data sets is low or "under control" (few nonsampling errors and a given sampling error), the accuracy of the final results will mainly depend on the capability of the chosen statistical matching method to provide parameter estimates close to the true unknown parameter values, in the case of statistical matching macro methods, or a complete synthetic data set that can be considered as a random sample drawn from the true unknown population, in the case of statistical matching micro methods. The first evaluation studies were carried out in order to assess the results of statistical matching micro methods based on distance hot deck. Barr and Turner (1981) propose the use of simple measures of agreement between the records joined from the two source data sets. Furthermore, they suggest verifying whether the synthetic complete data set maintains the structure of the original source data sets in terms of summary measures (totals, group totals, averages, …). Barr and Turner (1990) suggest comparing the relation between X and Z in the synthetic and in the donor data sets, as well as some summary characteristics (mean, variance) of the set of variables Z. A similar suggestion is made in Rodgers (1984), although the emphasis there is placed on the univariate and joint distributions of the variables Z in the synthetic data set and in the source data file B, respectively. At a second level, the comparison should be extended to the relationship between X and Z. Cohen (1991) recommends carrying out a "rough" sensitivity analysis on the failure of the conditional independence assumption (CIA) when using distance hot deck statistical matching methods. In that paper, it is explained how to perform a sensitivity analysis when unconstrained distance hot deck methods are applied. Rässler (2002) proposes a framework for evaluating the "validity" of a statistical matching procedure. The term validity is chosen to stress that evaluation should go beyond efficiency (in terms of MSE). The framework consists of four levels of evaluation: (1) reproduction of the values of the unknown variables Z in the recipient file; (2) how well the joint distribution of the variables X, Y and Z is reflected in the synthetic data set; (3) preservation of the correlation structure and of the higher moments of the joint distribution of X, Y and Z and of the marginal distributions of X-Y and X-Z; and (4) preservation of, at least, the marginal distribution of Z and of X-Z in the fused data file. Obviously, level (1) can be assessed only by means of simulation studies. Levels (2) and (3) can be assessed by means of simulation studies or by referring to external information (distributions estimated in other studies, …).
Level (4) can be easily assessed in all the statistical matching applications by using chi-square tests or similar measures such as the index of dissimilarity based on the absolute differences among relative cell frequencies. Rässler (2002) suggests using the correlation coefficient of cell frequencies to avoid the problem of zero cell frequencies in the denominator of the chi-square statistics. As far as simulations are concerned, Paass (1986) suggests a procedure that is also known as the folded database procedure. The variables of one of the original data sets (usually the largest one) are partitioned in three distinct sets, say G , G and G . The chosen database is randomly partitioned in two sub-samples A and B, then G is deleted from A and G is deleted in B, reproducing the typical statistical matching set-up. These two sub-samples are matched according to the chosen procedure: the result is the folded database. The estimates obtained from the folded database are compared with those derived from the original one so to compute measures of accuracy. Unfortunately, this evaluation method relies heavily on the assumption that the variables G , G and G behave approximately as the target variables X, Y and Z. Using this procedure, Paass assessed the accuracy of different statistical matching procedures in terms of reconstruction of the Z, X-Z and Y-Z distributions. Evaluations are performed by means of distribution free multivariate two-sample tests (Wald-Wolfowiz and Smirnov tests), chi-square tests and analysis of variance. WP1 37 Conti et al. (2006) investigate another measure of performance of micro statistical matching approaches: the matching noise. The matching noise registers the difference between the genuine and the imputed data generation processes. If this difference is large, the imputed data set is not appropriate for estimation of relationship parameters between matching and imputed variables. In same cases, this difference can be computed explicitly. For instance, it can be seen how distance hot deck methods behave remarkably well when there is a linear relationship between the matching and the imputed variables, although this performance deteriorates for non linear regression functions. Bibliography Barr, R.S. and Turner, J.S., 1981. Microdata file merging through large-scale network technology. Mathematical Programming Study, Volume 15, pp. 1-22. Barr, R.S. and Turner, J.S., 1990. Quality issues and evidence in statistical file merging. Data Quality Control: Theory and Pragmatics (G.E. Liepins and V.R.R. Uppuluri (eds.)), pp. 245313. Marcel Dekker, New York. Conti P.L., Marella D., Scanu M, 2006. Nonparametric evaluation of matching noise. Proceedings of the IASC conference “Compstat 2006”, Roma, 28 August – 1 September 2006, Physica-Verlag/Springer, pp. 453-460 Paass, G., 1986. Statistical match: Evaluation of existing procedures and improvements by using additional information. Micro analytic simulation models to support social and financial policy (Orcutt, G.H., Merz, J. and Quinke, H. (eds.)), Elsevier Science, Amsterdam. Rässler, S., 2002. Statistical matching: a frequentist theory, practical applications, and alternative Bayesian approaches. Springer-Verlag, New York. Rodgers, W.L., 1984. An evaluation of statistical matching. Journal of Business and Economic Statistics, Volume 2, pp. 91-102. WP1 38 3. Literature review on micro integration processing 3.1. 
Micro integration processing Eric Schulte Nordholt (CBS) Micro integration processing consists of putting in place all the necessary actions aimed to ensure better quality of the matched results as quality and timeliness of the matched files. It includes defining checks, editing procedures to get better estimates, imputation procedures to get better estimates, etc. It should be kept in mind that some sources are more reliable than others. Some sources have a better coverage than others, and there may even be conflicting information between sources. So, it is important to recognize the strong and weak points of all the data sources used. Since there are differences between sources, a micro integration process is needed to check data and adjust incorrect data. It is believed that integrated data will provide far more reliable results, because they are based on an optimal amount of information. Also the coverage of (sub) populations will be better, because when data are missing in one source, another source can be used. Another advantage of integration is that users of statistical information will get one figure on each social phenomenon, instead of a confusing number of different figures depending on which source has been used. 3.2. Combining data sources: micro linkage and micro integration Eric Schulte Nordholt and Frank Linder (CBS) 3.2.1. Micro linkage Most of the present administrative registers in the Netherlands are provided with a unique linkage key. It is the so-called social security and fiscal number (SoFi-number), a personal identifier for every (registered) Dutch inhabitant and those abroad who receive an income from the Netherlands and have to pay tax over it to the Dutch fiscal authorities. To prevent misuse of the SoFi-number, Statistics Netherlands recodes it for statistical processing into a so-called Record Identification Number (RIN-person). Personal identifiers, such as date of birth and address, are replaced by age at the reference date and RIN-address. This is all done in accordance with regulations of the Dutch Data Protection Authority to protect the privacy of the citizens. Since the SoFi-number is in use by social security administrations and tax authorities, one may expect it to be of excellent quality. A limited amount of SoFi-numbers may be registered with incorrect values in the data files, in which case linkage with other files is doomed to fail. However, in general, the percentage of matches is close to one hundred percent. Abuse of SoFi-numbers, for example by illegal workers, may occur in some cases, which results in a false match. Sometimes there are indications of a mismatch. An example of this is when the jobs register and the central Population Register (PR) are linked and the worker turns out to be an infant. Another example is, when the FiBase (fiscal administration) shows an unusually high income for a worker, when it is in fact the sum of the incomes of all people using the same SoFi-number. WP1 39 All social statistics data files can be linked to the PR. In practice this means that these data files are all indirectly linked to each other via the PR. Therefore the PR can be considered the backbone in the set of social data sources. When linking the PR and the jobs register, or the PR and a register of social benefits, it is a linkage between different statistical units (persons, jobs, benefits). In that case multiple linkage relationships can exist because someone can have more than one job or can benefit from several social benefits. 
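As a toy illustration of this "backbone" role (hypothetical pseudonymised keys and invented column names, not the actual CBS production system), a person-level extract of the PR can be linked one-to-many to a job-level register through the RIN:

import pandas as pd

# Hypothetical extracts: the population register (one row per person) and the
# jobs register (one row per job; a person may hold several jobs).
pr = pd.DataFrame({"rin": [1, 2, 3], "age": [34, 51, 27], "sex": ["F", "M", "F"]})
jobs = pd.DataFrame({"rin": [1, 1, 3], "job_id": [101, 102, 103], "wage": [2100, 800, 3000]})

# One-to-many linkage: every job record receives the person characteristics
# from the PR; persons without a job simply do not appear in the result.
linked = jobs.merge(pr, on="rin", how="left", validate="many_to_one")
print(linked)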
In household sample surveys, like the Labour Force Survey (LFS), records do not have a SoFi-number. For those surveys an alternative linkage key is used, which is often built up by a combination of the following personal identifiers: - sex; - date of birth; - address5. This sort of linkage key will usually be successful in distinguishing people. However, it is not a hundred percent unique combination of identifiers. Linking may result in a mismatch in the case of twins of the same sex. False matches may also occur when part of the date of birth or the postal code and house number is unknown or wrong. Another drawback is that the linkage key is not person but address related, which may cause linkage problems if someone has recently moved. When linking the PR and the LFS with this alternative key, and tolerating a variation between sources in a maximum of one of the variables sex, year of birth, month of birth or day of birth, the result is that close to hundred percent of the LFS records will be linked. In its linkage strategy, Statistics Netherlands tries to maximize the number of matches and to minimize the number of mismatches. So, in order to achieve a higher linkage rate, more efforts are made to link the remaining unlinked records by means of different variants of the linkage key. For example, leaving out the house number and tolerating variations in the numeric characters of the postal code. To keep the probability of a mismatch as small as possible, some 'safety' devices are built in the linkage process. This last linking attempt accomplishes an extra one percent matches. In the end about two to three percent of the LFS records could not be linked to the PR. All together this is a good result, but selectivity in the micro linkage process is not to be ruled out. If the unlinked records belong to a selective subpopulation, then estimates based on the linked records may be biased, because they do not represent the total population. Analysis in the past has indicated that the young people, in the 15-24 age bracket, show a lower linkage rate in household sample surveys than other age groups. The reason for this is that they move more frequently, therefore they are often registered at the wrong address. The linking rate for persons living in the four large cities Amsterdam, Rotterdam, The Hague and Utrecht is lower than for persons living elsewhere. Ethnic minorities also have a lower linkage probability, among other things because their date of birth is often less well registered (Arts et al., 2000). Nowadays, the PR is serving as a sampling frame for the LFS. Therefore, the matching rate is almost hundred percent, and no more linkage selectivity problems occur. 3.2.2. Micro integration Successfully linking the PR with all the other data sources mentioned, makes much more coherent information on the various demographic and socio-economic aspects of each 5 In fact, the combination of a postal code (mostly related to the street) and house number is used as substitute for the address. The postal code in the Netherlands consists of four figures, followed by two letters. WP1 40 individual's life available. One has to keep in mind, however, that some sources are more reliable than others. Some sources have a better coverage than others, and there may even be conflicting information between sources. So, it is important to recognize the strong and weak points of all the data sources used. 
3.2.2. Micro integration

Successfully linking the PR with all the other data sources mentioned makes much more coherent information available on the various demographic and socio-economic aspects of each individual's life. One has to keep in mind, however, that some sources are more reliable than others. Some sources have a better coverage than others, and there may even be conflicting information between sources. So, it is important to recognize the strong and weak points of all the data sources used. Since there are differences between sources, we need a micro integration process to check data and adjust incorrect data. It is believed that integrated data will provide far more reliable results, because they are based on an optimal amount of information. Also the coverage of (sub)populations will be better, because when data are missing in one source we can use another source. Another advantage of integration is that users of statistical information will get one figure on each social phenomenon, instead of a confusing number of different figures depending on which source has been used.

During the micro integration of the data sources the following steps have to be taken (Van der Laan, 2000):
a. harmonisation of statistical units;
b. harmonisation of reference periods;
c. completion of populations (coverage);
d. harmonisation of variables, in case of differences in definition;
e. harmonisation of classifications;
f. adjustment for measurement errors, when corresponding variables still do not have the same value after harmonisation for differences in definitions;
g. imputations in the case of item nonresponse;
h. derivation of (new) variables: creation of variables out of different data sources;
i. checks for overall consistency.

All steps are controlled by a set of integration rules and are fully automated. An example follows of how micro integration works in the case in which data from the jobs register are confronted with data from the register of benefits. Both jobs and benefits are registered on a volume basis, which means that information on their state is stored for any moment in the year instead of for one reference day. Analysts of the jobs register know that the commencing date and the termination date of a job are not registered very accurately. It is important though to know whether or not there is a job at the reference date, in other words whether or not the person is an employee. With the help of the register of benefits it is sometimes possible to define the job period more accurately. Suppose that someone becomes unemployed at the end of November and gets unemployment benefits from the beginning of December. The jobs register may indicate that this person lost the job at the end of the year, perhaps due to administrative delay or because of payments after job termination. The registration of benefits is believed to be more accurate. When confronting these facts the 'integrator' could decide to change the date of termination of the job to the end of November, because it is unlikely that the person simultaneously had a job and benefits in December. Such decisions are made with the utmost care. As soon as there are convincing counter-indications from other jobs register variables, indicating that the job was still there in December, the termination date will in general not be adjusted.
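The jobs/benefits example above can be read as a simple integration rule. A minimal sketch of such a rule is given below; the function name, the one-day adjustment and the counter-indication flag are illustrative assumptions, not the actual CBS rule set.

    from datetime import date, timedelta

    def reconcile_job_end(job_end, benefit_start, counter_indication=False):
        # If an unemployment benefit starts before the registered job end and there is no
        # convincing counter-indication that the job was still there, move the job end date
        # to the day before the benefit starts (the benefits register is taken as more accurate).
        if counter_indication:
            return job_end
        if benefit_start <= job_end:
            return benefit_start - timedelta(days=1)
        return job_end

    # Job registered until the end of the year, unemployment benefit from the beginning of December:
    print(reconcile_job_end(date(2000, 12, 31), date(2000, 12, 1)))  # 2000-11-30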
3.2.3. The Social Statistical Database (SSD)

The micro linkage and micro integration of all the available data sources result in the end in the Social Statistical Database (SSD), a whole set of integrated microdata files in their definitive stage. The SSD contains coherent and detailed demographic and socio-economic statistical information on persons, households, jobs and (social) benefits. A major part of the statistical information is available on a volume basis. An extensive discussion of the SSD can be found in Arts and Hoogteijling (2002).

In trying to imagine what the SSD looks like, one should not think of a large-scale file with millions of records and thousands of variables. It would be very inefficient to store the integrated data as such. Furthermore, the issue of data protection prevents Statistics Netherlands from keeping so much information together. Instead, all the integrated files in their final stage are kept separately. There is just one combining element, which is the linkage key RIN-person, present in every integrated file. So, whenever users demand a selection of variables out of the SSD set, only the files with the variables demanded will be supplied. These can easily be extracted from the set and linked by means of the linkage key.

Bibliography

Arts, C.H. and E.M.J. Hoogteijling, 2002. The Social Statistical Database of 1998 and 1999. Monthly Bulletin of Socio-economic Statistics, Vol. 2002/12 (December 2002), pp. 13-21. [in Dutch]
Laan, P. van der, 2000. Integrating Administrative Registers and Household Surveys. Netherlands Official Statistics, Vol. 15 (Summer 2000): Special Issue, Integrating Administrative Registers and Household Surveys, eds. P.G. Al and B.F.M. Bakker, pp. 7-15.
Schulte Nordholt, E. and F.S. Linder, 2007. Record matching for Census purposes. Statistical Journal of the IAOS, 24, pp. 163-171.

3.3. Key reference on micro integration
Miguel Guigo (INE), Paul Knottnerus (CBS) and Eric Schulte Nordholt (CBS)

National and international programs have been run to check the quality of the output. Regardless of the actual purpose of the micro integration procedure, either through model-based, donor-based, simulation-based or repeated-weighting coefficients, an assessment of the source data as well as measures of the reliability of the estimators must be taken into account. Several simulation studies show that the method of repeated weighting leads to estimates with lower variances than usual estimation methods, due to a better use of auxiliary information. An open issue remains how small areas can be estimated when no register information is available. The estimation of small areas, that is to say, obtaining a valid and efficient estimation of population parameters for sub-national domains, either geographically-based domains or categories in classifications at a very disaggregated level, is a task that can be performed by use of administrative registers (as long as the variables from the registers are correlated at the micro level with those to be estimated). For efficient small area estimation, records from an administrative source can play the role of auxiliary information. When no such external information is available, estimation of parameters in those small domains could be performed by means of some modern estimation techniques, but then the open issue is how to keep the set of tables consistent.

Schulte Nordholt, E., 2005. The Dutch virtual Census 2001: A new approach by combining different sources. Statistical Journal of the United Nations Economic Commission for Europe, Volume 22, Number 1, 2005, pp. 25-37.

Abstract

Data from many different sources were combined to produce the Dutch Census tables of 2001. Since the last Census based on a complete enumeration was held in 1971, the willingness of the population to participate has fallen sharply. Statistics Netherlands found an alternative in the Virtual Census, using available registers and surveys. The table results are not only comparable with the earlier Dutch Censuses but also with those of the other countries in the 2001 Census Round. For the 2001 Census, more detailed information is required than was the case for earlier Census Rounds.
The acquired experience in dealing with data of various administrative registers for statistical use enabled Statistics Netherlands to develop a Social Statistical Database (SSD), which contains coherent and detailed demographic and socio-economic statistical information on persons and households. The Population Register forms the backbone of the SSD. Sample surveys are still needed for information that is not available from registers. To achieve overall numerical consistency across the Census table set of 2001, the methodologists at Statistics Netherlands developed a new estimation method that ensures numerically consistent table sets when the data are obtained from different data sources. The method is called repeated weighting, and is based on the repeated application of the regression method to eliminate numerical inconsistencies among table estimates from different sources.

Key words: Census, consistent table estimates, repeated weighting

3.3.1. Definition of the problem

This publication explains how the method of repeated weighting has been used to produce consistent table estimates using available registers and surveys only. Although statistical offices belonging to the ESS are not obliged to conduct a Census every ten years, nor to supply census data, Eurostat has provided general advice which, in the form of a gentlemen's agreement, constitutes the framework for compiling co-ordinated and harmonised data for European countries. Moreover, the general trend is to reach not only voluntary but binding agreements in order to provide data on population enumeration. The European Parliament will discuss the new regulation on European Population and Housing Censuses in December 2007. Producing Census tables from already available data on the population from administrative sources is therefore a valuable option to be considered, both for reducing the costs related to data collection and for facing non-response and low participation problems. On these particular issues, Statistics Netherlands (Schulte Nordholt, 2005 and Corbey, 1994) has provided documentation on its own experiences, since the last traditional Census in the Netherlands, in 1971, met with many privacy objections against the collection of integral information about the population living in the Netherlands.

Regardless of issues related to cost reduction opportunities, the choice of a virtual Census - that is to say, producing Census data based on administrative registers - can result from considering at least three key aspects of a traditional Census:
a) unit non-response: a certain part of the population will not participate in a traditional Census survey;
b) item non-response: even the part of the population that does participate will not answer some questions;
c) to deal with these problems, traditional correction methods fall short of what is needed to publish reliable results, that is to say, some of the cells in the set of tables cannot be disseminated.

When using techniques such as massive imputation, there are not enough degrees of freedom to get a sufficiently rich imputation model (Kroese and Renssen, 2000). Therefore, an alternative to the usual weighting and imputation procedures must be developed in order to be able to produce a consistent set of tables using available registers and surveys only.

3.3.2. Phases characterising the problem
a. Study the required output and the available data sources: in order to produce a set of tables concerning housing, commuting, demography, occupation, level of education and economic activity, a complete statistical database must be built. In the case of Statistics Netherlands, this role is played by the Social Statistical Database (SSD), whose microdata are obtained through the integration of the central Population Register and several surveys such as the Labour Force Survey, the Employment and Earnings Survey, and the Survey on Housing Conditions. Schulte Nordholt, Hartgers and Gircour (2004) give details of the overall procedure for the Dutch Virtual Census of 2001.

b. Define and apply the estimation strategy: once an appropriate statistical database (a database made up of several sources) is available, the resulting set must be treated in order to reconstruct missing values for a record or an item (total or partial non-response). Little and Rubin (1987) introduce some overall methods of imputation. When the lack of information affects the whole unit or record, macro integration procedures are adequate to meet data needs. Denk and Hackl (2003, 2004) propose some macro integration methods in the context of a comprehensive database of enterprise data for taxation microsimulations, in order to get micro-founded indicators. The model-based approach estimates the probability of the occurrence of observed values of a variable, and missing data are then randomly imputed; the estimated probability distribution can be obtained by means of, e.g., auto-regressive models built from previously available data in former surveys. Donor-based approaches (e.g. hot-deck, nearest-neighbour) look for a record that is similar to the incomplete record (the so-called donor); similarity between the two is determined via matching variables. Finally, the simulation-based approach estimates more than a single value for each missing item in a record.

Houbiers et al. (2003) and Houbiers (2004) provide an alternative procedure called repeated weighting (RW), whose aim is to remove numerical inconsistencies among table estimates from different sources. It is based on the repeated application of the regression estimator and generates a new set of weights for each table that is estimated. Let y be a variable whose population parameter - either total or average - ought to be obtained for a table through a set of explanatory variables x from a register. The linear regression estimator of the population average of y is obtained through

$\hat{Y}_{REG} = \bar{y}_s + b_s' (\bar{X}_p - \bar{x}_s)$,  with  $b_s = (X_s' X_s)^{-1} X_s' y_s$,

where $\bar{X}_p$, $\bar{x}_s$, $\bar{Y}_p$ and $\bar{y}_s$ are, respectively, the population and sample averages of x and y, and $b_s$ are the estimated linear regression coefficients. Instead of these traditional regression coefficients, the repeated weighting procedure uses a set of coefficients of the form

$b_w = (Z_s' W_s Z_s)^{-1} Z_s' W_s y_s$,

where $Z_s$ is the matrix of sample observations on the variables in the margins of the table with variable y and $W_s$ is the diagonal matrix of the current weights of the sample records. The averages of the marginal variables z have already been estimated in an earlier table or are known from a register. Denoting these estimates or register counts by $\hat{Z}_{RW}$, the repeated weighting estimator of Y is defined by

$\hat{Y}_{RW} = \hat{Y}_{REG} + b_w' (\hat{Z}_{RW} - \hat{Z}_{REG})$.

It turns out that the weights of the records in the microdata are adapted in such a way that the new table estimate is consistent with all earlier table estimates.
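To make the algebra of the repeated weighting adjustment concrete, the following Python/numpy sketch applies it to a toy sample, here written in terms of totals rather than averages (the algebra is the same). The numbers, the single margin variable and the weights are invented for illustration; this is not the production implementation used at Statistics Netherlands.

    import numpy as np

    # Toy sample: 6 records, current weights w (the diagonal of W_s), study variable y,
    # and one categorical margin z coded as two dummy columns.
    w = np.array([10., 12., 8., 15., 9., 11.])
    y = np.array([20., 35., 15., 40., 25., 30.])
    Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]], dtype=float)

    W = np.diag(w)
    t_reg = Z.T @ w                      # margin totals with the current weights (Z_hat_REG)
    t_rw = np.array([33., 32.])          # the same margins as estimated in an earlier table (Z_hat_RW)

    b_w = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)   # b_w = (Z' W Z)^-1 Z' W y
    y_reg = w @ y                                     # current estimate of the total of y
    y_rw = y_reg + b_w @ (t_rw - t_reg)               # repeated weighting estimate

    # Equivalently, adapt the weights so that the new table reproduces the earlier margins:
    w_new = w + W @ Z @ np.linalg.solve(Z.T @ W @ Z, t_rw - t_reg)
    assert np.allclose(Z.T @ w_new, t_rw)             # consistent with the earlier margins
    assert np.allclose(w_new @ y, y_rw)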
c. Analyse the results: for a detailed case study, Schulte Nordholt (2005) gives some key results of the 2001 Census in the Netherlands: tables on population by sex, type of household and age group; population by economic activity and sex; employees by working hours and sex; working population by occupation and sex; and population by level of education and age group, together with a comparison with former censuses. A deeper analysis of these results, focusing especially on historical comparisons, regional distributions - paying special attention to the major cities - and corresponding results in other European countries, is also available in Schulte Nordholt, Hartgers and Gircour (2004).

3.3.3. References for these different phases

a. National and Eurostat documents
b. National documents
c. Houbiers (2004), Houbiers et al. (2003) and Kroese and Renssen (2000)
d. This key reference
e. Schulte Nordholt et al. (2004)

3.3.4. Quality assessment

National and international programs have been run to check the quality of the output. Regardless of the actual purpose of the micro integration procedure, either through model-based, donor-based, simulation-based or repeated-weighting coefficients, an assessment of the source data as well as measures of the reliability of the estimators must be taken into account. Knottnerus and Van Duin (2006) give the variance formulae for the repeated weighting (RW) estimator, and test RW estimators under various conditions. Several simulation studies, e.g. Boonstra (2004) and Van Duin and Snijders (2003), show that the method of repeated weighting leads to estimates with lower variances than usual estimation methods, due to a better use of auxiliary information.

3.3.5. Open issues

It remains an open issue how small areas can be estimated when no register information is available. The estimation of small areas, that is to say, obtaining a valid and efficient estimation of population parameters for sub-national domains, either geographically-based domains or categories in classifications at a very disaggregated level, is a task that can be properly done by making use of administrative registers. For efficient small area estimation, records from an administrative source can play the role of auxiliary information, see e.g. EURAREA Consortium (2004). Saralegui et al. (2005) develop a case study applied to quarterly data taken from the Spanish Labour Force Survey (EPA), using external administrative sources in order to obtain area-level covariates; an overall evaluation of the output is also provided. When no such external information is available, estimation of parameters in those small domains could be performed by means of some modern estimation techniques (see e.g. Rao, 2003), but then the open issue is how to keep the set of tables consistent.

Bibliography

Corbey, P., 1994. Exit the population Census. Netherlands Official Statistics, Volume 9, summer 1994, pp. 41-44.
Duin, C. van and V. Snijders, 2003. Simulation studies of repeated weighting. Discussion paper 03008, Statistics Netherlands, Voorburg / Heerlen. http://www.cbs.nl/en/publications/articles/general/discussion-papers/discussion-paper03008.pdf
Houbiers, M., 2004. Towards a social statistical database and unified estimates at Statistics Netherlands. Journal of Official Statistics, Volume 20, No. 1, pp. 55-75.
Houbiers, M., P. Knottnerus, A.H. Kroese, R.H. Renssen and V. Snijders, 2003. Estimating consistent table sets: position paper on repeated weighting.
Discussion paper 03005, Statistics Netherlands, Voorburg / Heerlen. http://www.cbs.nl/en/publications/articles/general/discussion-papers/discussion-paper03005.pdf
Kroese, A.H. and R.H. Renssen, 2000. New applications of old weighting techniques, constructing a consistent set of estimates based on data from different sources. ICES II, Proceedings of the second international conference on establishment surveys, survey methods for businesses, farms, and institutions, invited papers, June 17-21, 2000, Buffalo, New York, American Statistical Association, Alexandria, Virginia, United States, pp. 831-840.
OECD, 2003. Education at a glance. OECD Publications, Paris, France.
Rao, J.N.K., 2003. Small area estimation. Wiley, New York, United States.
Schulte Nordholt, E., M. Hartgers and R. Gircour (Eds.), 2004. The Dutch Virtual Census of 2001, Analysis and Methodology. Statistics Netherlands, Voorburg / Heerlen, July 2004. http://www.cbs.nl/en-GB/menu/themas/dossiers/volkstellingen/publicaties/2001-b57-epub.htm
Statistics Netherlands, 2003. Urban Audit II, the implementation in the Netherlands. Report, BPA no. 2192-03-SAV/II, Statistics Netherlands, Voorburg. http://www.cbs.nl/en/publications/articles/regional/urban-audit-II-Netherlands.pdf

3.4. Other references

Boonstra, H.J., 2004. A simulation study of repeated weighting estimation. Discussion paper 04003, Statistics Netherlands, Voorburg / Heerlen.
Denk, M. and P. Hackl, 2003. Data integration and record matching: an Austrian contribution to research in official statistics. Austrian Journal of Statistics, Volume 32, pp. 305-321.
Denk, M. and P. Hackl, 2004. Data Integration Techniques and Evaluation. Austrian Journal of Statistics, Volume 33, Number 1&2, pp. 135-152, Vienna.
EURAREA Consortium, 2004. Main report. Enhancing Small Area Estimation Techniques to meet European Needs, Deliverable D.7.1.4. http://www.statistics.gov.uk/eurarea/downloads/EURAREA_PRV_1.pdf
Heerschap, N. and L. Willenborg, 2006. Towards an integrated statistical system at Statistics Netherlands. International Statistical Review, Volume 74, Number 3, December 2006, pp. 357-378.
INE, 2002. Monte Carlo simulation to evaluate small area estimators of single person households (unpublished manuscript).
Kamakura, W.A. and M. Wedel, 1997. Statistical data fusion for Cross-Tabulation. Journal of Marketing Research, Volume 34, pp. 485-498.
Knottnerus, P. and C. van Duin, 2006. Variances in repeated weighting with an application to the Dutch Labour Force Survey. Journal of Official Statistics, Volume 22, No. 3, pp. 565-584.
Little, R.J.A. and D.B. Rubin, 1987. Statistical Analysis with Missing Data. John Wiley & Sons, New York.
Saralegui, J., M. Herrador, D. Morales and A. Agustín Pérez, 2005. Small Area Estimation in the Spanish Labour Force Survey. Proceedings of Challenges in Statistics Production for Domains and Small Areas. http://www.stat.jyu.fi/sae2005/abstracts/saraleg.pdf

4. Practical experiences

4.1. Record linkage of administrative and survey data for the EU-SILC survey: the Italian experience
Paolo Consolini (ISTAT)

The EU-SILC (European Union Statistics on Income and Living Conditions) Italian team has developed a pioneering strategy for the measurement of self-employment income since 2004. This strategy consists of a multi-source data collection, based on a paper-and-pencil face-to-face interview and on the linkage of administrative with survey data.
The aim of combining administrative and survey data is to improve data quality on income components (target variables) and the respective earners, by means of imputation of item non-responses and reduction of measurement errors. Integration of administrative and survey data at micro level is performed by linking individuals through common key variables.

Target variables - In the first edition (survey 2004), this process involved only two income components: self-employment income and pensions. For the second edition (survey 2005) a third one was included: employment income.

Target population - The target population is the Italian reference population of EU-SILC: all private households and their current members residing in Italy at the time of data collection. Persons living in collective households and in institutions are excluded from the target population. The analysis units are adult members (aged 15+) who live in private households [6]. The EU-SILC 2004 survey includes approximately 52.5 thousand interviewees aged 15 years or over. Among these, about 49.2 thousand units have a tax statement or a declaration in the administrative data sources.

[6] The Eurostat database of the Italian EU-SILC personal income data refers to all adults aged 16+ living in each sample household.

Problems in the available data sets - With regard to the measurement of self-employment incomes in household surveys there are two clear-cut statements, taken from the "Canberra Handbook", that depict the state of the art: "Income data for the self-employed are also generally regarded as unreliable as a guide to living standards" (Canberra Group, 2001, p. 54); "Household surveys are notoriously bad at measuring income from capital and self-employment income" (Canberra Group, 2001, p. 62). Figure 1 below shows, in a simplified sketch, the problem of collecting self-employment incomes when either survey or administrative data are available and the objective is to obtain disposable income: the shaded areas correspond to the income available to an individual for his/her personal use.

Figure 1 - Personal gross, taxable, reported and disposable income. [The figure relates the income concepts: gross income reduced by taxes and contributions gives disposable income; net taxable income (administrative data) is affected by tax avoidance and deductions; income reported in the survey is affected by under-reporting.]

The different sources of microdata on earnings from self-employment may not contain the variable 'disposable income'. Survey data may be affected by under-reporting. On the other hand, administrative data gathering individual tax returns do not take account of illegal tax evasion and may not display all the authorized deductions allowed in the calculation of taxable income (tax avoidance).

Privacy issues - The Personal Tax Annual Register, including all the Italian tax codes, cannot be used directly by Istat. Therefore, record linkage has to be performed by the tax agency on Istat's behalf. The exact linkage performed by the Tax Agency produces 7.8% unmatched records, which are partially retrieved by means of auxiliary information (1.5%). Transfer of know-how in performing record linkage from Istat to the Tax Agency could improve the effectiveness of the matching procedure in the future.
Integration methodology - In the EU-SILC project, the standard procedure to measure net self-employment income requires collecting "the amount of money drawn out of the self-employment business" only when the profit/loss from accounting books or the taxable self-employment income (net of corresponding taxes) is not available. For the Italian EU-SILC project, both tax and survey microdata are available, through an exact matching of administrative and survey records. However, both sources may be affected by underestimation of self-employment incomes. Moreover, some individuals report self-employment incomes in only one data source. This is the case for some individuals whose professional status at the time of the interview is different from that of the income reference period, and for many percipients of small and/or secondary self-employment incomes [7]. The integration procedure consists of the following four phases.

[7] The survey data include as self-employment incomes those small compensations for minor and informal services that frequently go unnoticed for tax purposes, for example the earnings of baby-sitters. On the other hand, some minor self-employment incomes shown in the tax returns may be disregarded during the interview to ease the response burden.

1. Key (individual identifier). Each sample person has been identified by her/his tax code (i.e. the personal identification number assigned to each individual by the Italian tax authorities). The tax codes have been primarily retrieved from the Population Registers by the Statistical Offices of the Municipalities that participated in the survey. As the information released by the local statistical offices may be missing or inaccurate, Istat has also requested them to collect auxiliary data on the individuals to be interviewed. Indeed, the personal tax code is univocally determined by the values of selected individual characteristics (name, surname, sex, date of birth and place of birth). Thus, the collected tax codes were compared with those resulting from the computation based on the available individual characteristics and, when necessary, corrected.

2. Linkage of survey and tax records. In a second step, the tax codes of the previous phase were matched to those in the Personal Tax Annual Register, consisting of all the Italian tax codes. The procedure searched for the tax codes of the persons in the EU-SILC sample among the ones in the tax files. More precisely, linkage focuses mainly on adults (15 years and over) who actually participated in the survey. The rate of successfully matched records was 93.7%. In other words, the tax source covers 93.7% of the adults interviewed for the 2004 Italian EU-SILC survey. The unmatched units (6.3%) are either individuals with no tax code available in the Population Registers (4.4%) or persons not included in the initial survey frame but later registered as additional household members by the interviewers (1.8%).

3. Loading tax data. The third step consisted in reading and checking the information on self-employment income included in the tax records. At this stage, two relevant sources of microdata have been uploaded: (i) "UNICO persone fisiche" and (ii) "730" tax returns [8]. Once implemented, the reading procedures lead to a suitable database of tax records that has been used to build the net (taxable) self-employment income.
The Italian tax system distinguishes between two broadly defined categories of self-employment income: 'redditi da libera professione' (earnings from the liberal professions) and 'redditi d'impresa' (business incomes). The latter may also include income attributed to sham co-helpers for tax splitting purposes. Income splitting within a small family business occurs when there is a transfer of taxable income from a person in a higher income bracket to a person in a lower income bracket. Because of the progressive tax schedule, it is thus possible for some self-employed people to lower the total household tax liability. We define as 'sham co-helpers' persons who appear as percipients of self-employment income in tax returns but, at the same time, convincingly report themselves in the survey as inactive, non-working persons during the income reference period (students, housewives, etc.). Income received by sham co-helpers has been assigned to the active self-employed household members.

4. Imputation through integration. The assumption underlying the fourth step has been that true disposable self-employment income may be under-reported by both sources. In order to minimise under-estimation, self-employment income has been set to the maximum value between the net income resulting from the tax source and the net income reported in the survey. In most cases, comparisons of the self-employment income reported in the two sources have been made at the individual level. However, for small family businesses, comparisons have been made at the household level, that is by comparing the sums of the self-employment income received by all household members in the two sources (an illustrative sketch of this rule is given at the end of this section, after the bibliography).

[8] With few exceptions, the 'UNICO persone fisiche' form must be filled in by the generality of percipients of self-employment incomes, in particular by any person who is the sole or joint owner of an unincorporated business for which he/she works and by those taxpayers who perceive incomes from unincorporated businesses. The '730' form must be filled in by the percipients of secondary and/or occasional self-employment incomes (for example, an employee who adds to the wage a self-employment income from a secondary job as a free-lancer).

Conclusions - The use of administrative data has changed the tails of the distribution of self-employment incomes (Figure 2). Indeed, with respect to survey data, the final (i.e. integrated) dataset contains a lower percentage of self-employment incomes in the range 2,000 - 12,000 Euros per year and a higher proportion of percipients with incomes greater than 20,000 Euros.

Figure 2 - Distributions of self-employment incomes drawn from the survey, administrative and final datasets (all percipients). [Histogram of the percentage of percipients by yearly income, in thousands of euro from -10 to 100, for the three datasets.]

As a result, combining administrative and survey data brings about a rise of 15.6% in the number of percipients and an increase of 11.9% in the average self-employment income compared with the exclusive use of survey data. When both sources report information on self-employment incomes, there is some evidence of a higher under-estimation rate in the tax data compared with the survey data.

Bibliography

Canberra Group, 2001. Final Report and Recommendations, Ottawa, Canada.
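The sketch announced in phase 4 above is given here. It applies the maximum rule at the individual level and, for a small family business, at the household level; the field names and amounts are hypothetical and the treatment of missing values is simplified.

    def integrate_self_employment(survey_net, tax_net):
        # Phase 4 rule: both sources are assumed to under-report, so take the maximum
        # of the net self-employment income from the survey and from the tax source.
        return max(survey_net or 0.0, tax_net or 0.0)

    print(integrate_self_employment(12_000, 18_500))  # 18500: the tax value is retained

    def integrate_household(members):
        # For small family businesses, compare the two sources at household level:
        # sum the members' incomes within each source, then take the maximum of the totals.
        survey_total = sum(m.get("survey_net") or 0.0 for m in members)
        tax_total = sum(m.get("tax_net") or 0.0 for m in members)
        return max(survey_total, tax_total)

    household = [{"survey_net": 25_000, "tax_net": 10_000},
                 {"survey_net": 0, "tax_net": 9_000}]
    print(integrate_household(household))  # 25000.0: survey total exceeds the tax total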
4.2. Record linkage applied for the production of business demography data
Caterina Viviano (ISTAT)

The Business Demography (BD) project: background - The harmonised data collection on BD is a European project launched in 2001 with the aim of producing comparable data on business demography for the European Union (EU). In particular it aims to satisfy the growing requirements for structural indicators regarding births, deaths and survival. Until now data have been produced for reference years 1998 to 2004.

The number of enterprise births is a key variable in the analysis of business demography, as other variables such as the survival and growth of newly born enterprises are related to this concept. The production of statistics on newly born enterprises is based on a definition (Commission Regulation No 2700/98): "A count of the number of births of enterprises registered to the population concerned in the business register corrected for errors. A birth amounts to the creation of a combination of production factors with the restriction that no other enterprises are involved in the event. Births do not include entries into the population due to: mergers, break-ups, split-off or restructuring of a set of enterprises. It does not include entries into a sub-population resulting only from a change of activity." The aim is to produce data on the creation of new enterprises that have started from scratch and that have actually started activity. An enterprise creation can be considered an enterprise birth if new production factors, new jobs in particular, are created.

The identification process - In practice, to obtain enterprise births (as well as deaths) it is necessary to carry out an identification process. This process is described for births in the following steps:

Step 1 - The populations of active enterprises (N) at reference times t, t-1 and t-2 are obtained from the Business Register ASIA (in Italian "Archivio Statistico delle Imprese Attive", the frozen annual file of the Statistical Archive of Active Enterprises, obtained through the integrated use of different administrative and statistical sources).

Step 2 - The new enterprises in year t (Et) are a subset of the population of active enterprises Nt, identified by comparing the population of active enterprises in year t with the populations of active enterprises in years t-1 and t-2, to exclude reactivations. New enterprises are identified as enterprises that are present only in year t. The three populations are matched by exact codes, i.e. the business register id code.

Step 3 - The identification of births is carried out by eliminating, from the population of new enterprises, creations due to events other than births, that is break-ups, split-offs, mergers and one-to-one take-overs. The method for identifying other creations compares the new enterprises with the populations of active enterprises for the current year (Nt) and the previous year, using a linkage process. The linkage process includes matches on name, economic activity and location (variables already available in the BR). This linkage technique allows the application of the continuity rules [9], i.e. rules establishing when two enterprises are deemed to be the same. The continuity rules consider three continuity factors: continuity of control, economic activity and location. These rules generally follow the approach that if two out of three of the continuity factors change, there is discontinuity of the enterprise.
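A minimal sketch of the two-out-of-three continuity rule just described; in practice the boolean inputs would come from the comparison of the linked records (control, economic activity, location).

    def is_continuation(same_control, same_activity, same_location):
        # The enterprise is deemed to be continued (no real birth) if at most one
        # of the three continuity factors has changed.
        changes = [same_control, same_activity, same_location].count(False)
        return changes < 2

    print(is_continuation(True, True, False))   # True: only the location changed -> continuity
    print(is_continuation(True, False, False))  # False: two factors changed -> discontinuity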
In addition to the linkage process described above, it is also necessary to check for links between units, which may indicate that a new enterprise is not a real birth, and to carry out additional matching or checking using any other available information, such as the database of events of structural changes (mergers, take-overs), administrative sources supplying information on links like the partnership archive of the Chamber of Commerce, and so on. It is necessary to be aware that some activities naturally tend to be concentrated in certain locations, such as retailing (shopping malls), construction (large sites), and the "liberal professions" (shared premises), where there is an increased risk of false matches. Finally, to complete the identification of enterprise births, the largest enterprises are investigated manually to detect whether the event can actually be considered a real birth.

[9] Following Eurostat (2003), the continuity of an enterprise depends on the continuity of its production factors. An enterprise is considered to be continued if it is modified without any significant change in its identity, that is, in terms of its production factors. The production factors include the set of means (employment, machines, raw material, capital management, buildings) that the enterprise uses in its production process, leading to the output of goods and services.

The RL procedure - The data matching process to identify pairs of records and to validate the continuity rules is carried out by applying a (non-probabilistic) record linkage (RL) technique. The method is applied in such a way that the decision to link two or more records is the result of a complex decision-making procedure. In this process the first decision is based on a deterministic choice of link and non-link record pairs, according to the outcomes produced by the compared variables. Afterwards results are assessed, integrated and validated with other deterministic rules, aiming to reduce the problem of multiple links and to evaluate the actual correspondence of the link with real conditions. The RL process can be summarized as follows:

Phase 1: treatment of variables and rules for the matching. The BR record units are compared through agreement/disagreement rules delineated for three variables: enterprise name (N), location (L) and economic activity (S). For a better specification, information on legal status and fiscal code has also been used.

Step 1. Before matching, each variable is standardized and parsed. Parsing means that a free-format field - like the enterprise name - is divided into a set of components that can be identified and automatically matched. Each component is parsed using dictionaries (for the variable Enterprise Name); for example, words representing surnames, names, activities, legal status, alias names, etc. are identified. The rules used to establish agreement/disagreement between the compared fields are described in the following schemes.

Enterprise name. According to the Italian enterprise structure, an enterprise name takes different formats according to its legal form.
Aggregating the legal form (J) into 3 classes (I = sole proprietorship; Sp = partnership; Sc = limited liability company), the adopted rules are:

RULE - Enterprise Name
- A = agreement: (J = I, Sp) surname and name are the same; (J = Sp, Sc) percentage of matched words is 100%
- PA = partial agreement: (J = I, Sp) surname is the same; (J = Sp, Sc) percentage of matched words is above the minimum threshold but below 100%
- D = disagreement: (J = I, Sp) surname is different; (J = Sp, Sc) percentage of matched words is at or below the minimum threshold

Address. Standardization of the address consists of identifying 3 components: toponym (T), street name (sN) and street number (sn). Matching rules are delineated according to different combinations of the component outcomes:

RULE - Enterprise Address
- A = agreement and equal: T equal or missing; percentage of matched strings of sN high; sn present
- PA = partial agreement: T equal or missing; percentage of matched strings of sN high; sn missing
- D = disagreement: T differs; percentage of matched strings of sN high or low; sn differs

Economic activity. Nace rev.1 codes at 4 digits are compared to produce the outcomes:

RULE - Economic activity
- A = agreement: Nace code equal
- PA = partial agreement: Nace code consistent [10]
- D = disagreement: Nace code differs

The composite comparison vector for each record pair allows identifying sub-populations of matched records (an illustrative sketch of these outcome rules is given below, after the description of Phase 2).

Step 2: blocking and reduction of the comparison space. A blocking strategy is important to reduce the large number of possible record pairs to be compared. A block means that only records belonging to the same block are compared. The chosen blocks are: 1) municipality, as the main block; 2) economic activity code (3 digits) and postal code (CAP, 4 digits), as alternative blocks. In order to increase the quality of the data, postal codes have been standardized using, for the first time, a file downloaded from the Italian Post Office site. The results of this process allowed improving the overall quality of the blocking strategy.

Phase 2: implementation of the RL procedure (deterministic only). In this process, the probabilistic step (EM parameter estimation; calculation of weights associated with each record pair that allow choosing between link and non-link) has not been applied. The main reason is due to estimation problems for a specific parameter that is based on a random sample of data chosen from the population. Given that these estimates become biased as the population size increases, there is a problem concerning the sample size and its representativeness; stratification by small geographical areas would be necessary but this strategy would require longer data processing. To face this problem alternative solutions are under consideration.
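The sketch referred to above illustrates how an outcome (A/PA/D) could be derived for the enterprise name of partnerships and limited companies and combined into a comparison vector. The denominator of the percentage of matched words and the minimum threshold are not spelled out in the report, so the choices below are illustrative assumptions.

    def name_outcome(words_a, words_b, min_pct=60.0):
        # Percentage of matched words, computed here against the shorter parsed name (assumption).
        a, b = set(words_a), set(words_b)
        pct = 100.0 * len(a & b) / max(min(len(a), len(b)), 1)
        if pct == 100.0:
            return "A"
        if pct > min_pct:
            return "PA"
        return "D"

    # A composite comparison vector (name, location, economic activity) for one record pair:
    vector = (name_outcome(["ROSSI", "MARIO", "SNC"], ["ROSSI", "MARIO", "COSTRUZIONI", "SNC"]),
              "A",   # location outcome, e.g. from the address rule
              "PA")  # economic activity outcome, e.g. consistent NACE codes
    print(vector)  # ('A', 'A', 'PA')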
Main steps for the choice of matched records:
- inclusion of pairs having two or three agreements according to the continuity rules;
- investigation and selection of some particular comparison vectors having partial agreements for some components (for example, in the comparison between cooperatives/consortiums and other companies the link is accepted in the presence of Name (partial agreement) and Address and Nace code (agreement));
- clustering of pairs of records and reduction of multiple matches by choosing the better comparison vector;
- exclusion of clusters with more than 3 couples of linked records (very large multiple matches);
- exclusion of comparisons between sole proprietorships and limited liability companies (in this case a change in the control variable is strong evidence of discontinuity).

The sub-population determined by the agreement on both Location and Sector of activity (L+S) is analysed with more attention, due to its large size in comparison with the others. A list of economic activities "at risk" is built up, so that combinations carrying those particular activities are first excluded and then recovered after a check on telephone numbers and date of registration/deregistration. For example, in the year 2004 the weight of L+S record pairs in the match between stock and new enterprises was 55%; after the check only 9% of them were re-integrated. The list of such activities in terms of Nace codes is the following: 45, 5262, 5263, 65, 66, 67, 70, 741, 742, 744, 7484, 6025, 633, 634, 851.

[10] Economic activity codes not equal at four digits can be consistent when the activities carried out use the same production process (for example, production and sale of bread are consistent activities).

Conclusions - The RL technique is of fundamental importance in detecting links for continuity between creations of enterprises and active units. Its application allows identifying and cleaning the data. The results are significant, as the technique accounts for a high percentage of detected links (Table 1).

Table 1 - Identification of enterprise births - year 2004 (BD 2004)
1. Active enterprises: 4,366,679
2. Creations of enterprises, because of: 447,419
   2.1 reactivations: 36,873
   2.2 errors: 1,624
   2.3 state of activity: 408,922
3. New enterprises (2.3 less Nace 7415): 408,607
   3.1 creations due to events of take-over, merger, etc.: 10,928
   3.2 creations due to continuity: 44,695
   3.3 creations due to events of changes of juridical status: 1,260
   3.4 creations due to links by administrative sources: 12,159
   3.5 exclusions due to corrections: 266
   3.T Total exclusions (3.1-3.5): 69,308
Enterprise births: 339,299
Rate of new enterprises excluded for links (3.T/3., %): 17.0
Rate of new enterprises linked for continuity (3.2/3., %): 10.9
Rate of exclusions due to continuity (3.2/3.T, %): 64.5

Bibliography

Eurostat, 2003. Business registers - recommendations manual. Office for Official Publications of the European Communities, Luxembourg.

4.3. Combination of administrative, register and survey data for Structural Business Statistics (SBS) - the Austrian concept
Gerlinde Dinges (Statistics Austria)

The European SBS-Regulation [11] is the basis for the compilation of Structural Business Statistics in Austria from the reference year 1997 onwards. From reference year 2002 onwards the new national regulation [12], based on the Federal Statistics Act 2000, foresees a completely new data collection and estimation concept for SBS statistics in Austria.
To satisfy the national regulation and the requirements of the EU, the new concept is implemented by conducting a yearly cut-off survey in combination with the use of administrative sources and statistical calculation methods, instead of the formerly applied stratified random sampling (grossing up of 42,000 sampled units). For NACE sections C-F, information from short-term statistics (STS) can additionally be used. The selection frame for SBS [13] is the Business Register (BR) of Statistics Austria, which provides links between the various administrative sources and the statistical units (enterprises). Administrative data from the Social Security Authorities (employment information) and the Tax Authorities (turnover) are used in combination with the survey and also as basic data for the non-surveyed units. For NACE divisions 65 and 66, the Austrian Central Bank and the Austrian Financial Market Supervisory Authority are the single sources.

[11] Council Regulation (EC, EURATOM) No 58/97 of 20 December 1996 concerning structural business statistics.
[12] National Regulation for SBS (Leistungs- und Strukturstatistik-Verordnung, BGBl. II 428/2003).
[13] NACE sections C-K.

Main target - The main target of the new concept, with its combination of administrative data, surveyed data and model-based estimation, is the reduction of respondent burden on the one hand and the retention of high data quality on the other hand.

Strategy - The principle of data concentration on a small number of large and medium-sized enterprises in the SBS should be used (see Figure 3). Hence, based on a cut-off survey, the 'most important' information should be obtained from the units directly and the 'less important' information should be obtained by model-based estimation. Thus, there is no general loss of information and detailed information can be provided for all units at record level.

Figure 3: The principle of data concentration for selected variables (SBS 2003).

Cut-off survey - In accordance with the Austrian SBS-Regulation, in the new concept only about 32,000 enterprises (12%) above a NACE-specific threshold are included in the yearly SBS survey. For production (NACE C-F) the threshold is defined in terms of persons employed (in general 20, depending on the coverage); for services a turnover threshold (1.5 million EUR for trade and NACE 633 and 634, 750,000 EUR for other services) is used. In the service sector, for the variable number of employees (as well as the breakdown by sex and status) administrative data from the social security authority are combined with surveyed data from the SBS survey. The variable number of employees in total is collected in the survey for checking the link to administrative sources in the course of plausibility checks. For most of the enterprises all details concerning employment variables are taken from Social Security data directly. In the case of major deviations, an update of the link in the business register mostly solves the problem. In manufacturing industries and construction some main variables (like employment, personnel expenditures, ...) can be taken from STS.
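A rough sketch of the NACE-specific cut-off rule described above, deciding whether an enterprise falls into the yearly SBS survey. The mapping of 'trade' to NACE section G and the neglect of coverage-dependent exceptions are simplifying assumptions.

    def in_cutoff_survey(nace_section, nace_group, persons_employed, turnover_eur):
        # Production (NACE C-F): threshold of 20 persons employed (in general).
        if nace_section in {"C", "D", "E", "F"}:
            return persons_employed >= 20
        # Trade (section G, assumed) and NACE 633/634: 1.5 million EUR turnover.
        if nace_section == "G" or nace_group in {"633", "634"}:
            return turnover_eur >= 1_500_000
        # Other services: 750,000 EUR turnover.
        return turnover_eur >= 750_000

    print(in_cutoff_survey("D", None, 35, 900_000))   # True: production unit with 35 employees
    print(in_cutoff_survey("K", "741", 5, 500_000))   # False: below the services threshold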
Estimation model - For the smaller enterprises below the legal thresholds (about 240,000 units) all variables are estimated at record level, by using the Business Register, the Social Security Register and the Tax Register to obtain the basic variables (economic activity, number of employees, turnover) and by applying model-based estimation for the other variables. Model parameters for the microdata estimation for enterprises below the legal thresholds are based on the 'most similar' enterprises in the surveyed SBS data. The ideal case - the structure of the variables for enterprises above the thresholds is the same as for enterprises below - cannot generally be assumed. That can be seen from Figure 4, which shows box plots of personnel expenditures per employee for a selected NACE subclass and different turnover size classes in the last SBS Census (1995).

Figure 4: Distribution of 'personnel expenditures per employee' for a selected NACE subclass and different turnover size classes in the last SBS-Census (1995). [Box plots by turnover size class, distinguishing the size classes observable in the cut-off survey (the basis for parameter estimation) from those below the legal threshold, which are not observable.]

Therefore, a step-by-step approach based on turnover size classes and economic activities has been applied, to allow for structural differences in the model parameter estimation. If enough enterprises are available in the cut-off sample, the calculation of the model parameters is based on the most detailed NACE classification level (subclasses) and the smallest possible size classes. Starting with a turnover limit of EUR 999,000, the size classes were raised step by step up to a maximum of 5 million EUR. If the number of enterprises in the relevant NACE subclass was too small, then the parameters were calculated for a higher NACE aggregation. Since outliers can have a great influence on the quality of the model adaptation, a robust regression method (LTS [14]) was applied to estimate the main variables (e.g. personnel expenditures, purchases of goods and services, ...). For detailed variables like the breakdown of turnover or the breakdown of purchases of goods and services, ratio estimation has been used (a minimal illustrative sketch follows at the end of this description, before the conclusions).

Quality of administrative sources - In order to define the population (units above and below the legal thresholds) and to get basic variables for the model-based estimation, employment data from the Social Security Authority by sex and status level and annual tax declaration values or aggregated monthly tax declarations (VAT advance returns) from administrative sources are used. The quality and completeness of the social security data are very satisfactory. Because of missing tax declarations or incomplete links of administrative data with the Business Register, turnover imputation is necessary for about 15% of the enterprises below the thresholds. Additionally, turnover definitions from tax declarations and SBS do not correspond to each other by 100%. Deviations between SBS definitions and administrative data have different reasons, such as foreign tax accounts, definitional differences, structural changes, group company tax declarations, financial year records (for SBS) versus calendar year (tax), or deficient tax declarations, etc. However, analyses have shown that the differences for small observable enterprises are rather negligible, and the large and medium-sized enterprises are in the survey anyway.

Model effects - For insufficiently covered inhomogeneous economic activities and for economic activities with a very different structure of the enterprises above and below the thresholds (e.g. NACE categories in which trade and intermediary activities are put together), a systematic bias cannot be avoided. In this case only expert rating can increase the quality of the results (subjective assessment of the results by qualified experts of the data editing staff).
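The sketch announced above shows ratio estimation for a detail variable within one estimation cell (a NACE subclass by turnover size class): the ratio of the detail variable to turnover, observed for the surveyed enterprises in the cell, is applied to the administrative turnover of the non-surveyed units. Cell definition and numbers are illustrative.

    def ratio_estimate(detail_surveyed, turnover_surveyed, turnover_below_threshold):
        # Cell ratio estimated from the cut-off survey, applied to tax-register turnover
        # of the enterprises below the legal threshold in the same cell.
        ratio = sum(detail_surveyed) / sum(turnover_surveyed)
        return [ratio * t for t in turnover_below_threshold]

    # Surveyed enterprises of the cell: a turnover-breakdown item and total turnover (in 1,000 EUR);
    # three non-surveyed units with turnover known from the tax register.
    print(ratio_estimate([40.0, 55.0, 30.0], [100.0, 140.0, 80.0], [50.0, 20.0, 35.0]))
    # [19.53, 7.81, 13.67] (approximately)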
Conclusions - In general, the concentration principle of surveyed data in combination with administrative data and model-based estimation works very well. For most of the NACE classes the small percentage of surveyed enterprises provides a relatively high data coverage (ratio of the surveyed data). In this case possible model effects as a result of inhomogeneous branches or deviations between SBS and administrative data may have less influence.

Figure 5: Coverage for basic variables at detailed NACE level (SBS 2003). [Three histograms of the number of NACE classes (4-digit level) by percentage coverage band (0-10% up to >90%), for the coverage of the number of enterprises, of turnover, and of the number of employees.]

Applying the concentration principle to all economic activities will require an adaptation of the legal thresholds for the economic activities that are not well covered. This would cause a change of the national regulation [15] for SBS, which is not foreseen at the moment. The Austrian SBS method was developed in consideration of all legal, technical and mathematical circumstances, using all data sources available at the moment. Various test calculations were carried out in advance, and analyses have shown high quality results for basic data like turnover and number of persons employed and an improvement over former concepts, in particular for results at regional level. The results for all the other variables were also very satisfactory. In case of basic legal changes, or when new (re)sources become available, further adaptations will be carried out.

[14] FAST-LTS (Least Trimmed Squares regression) algorithm (Rousseeuw and Van Driessen, 1999).
[15] The next change of the national regulation is planned with the implementation of NACE Rev. 2.

Documentation on the project methodologies: http://ec.europa.eu/comm/eurostat/ramon/nat_methods/SBS/SBS_Meth_AU.pdf

Bibliography

Rousseeuw, P.J. and A.M. Leroy, 1987. Robust regression and outlier detection. John Wiley & Sons, Inc., August 1987.
Rousseeuw, P.J. and K. van Driessen, 2006. Computing LTS Regression for Large Data Sets. Springer Netherlands, 2006. http://www.springerlink.com/content/06k45m57x01028x6/fulltext.pdf

4.4. Record linkage for the computer-assisted maintenance of a Business Register: the Austrian experience
Alois Haslinger (Statistics Austria)

Since the Federal Statistics Act 2000 became effective, Statistics Austria receives monthly copies of at least four administrative data sources or registers (AR), all covering information about Austrian enterprises or subunits of them. This information is used not only for the updating and maintenance of the Business Register (BR) but also as a partial or total surrogate for censuses and surveys. The idea is to reduce the response burden as much as possible: if information which is needed for statistical purposes is already stored somewhere in the public administration, then Statistics Austria should take that information instead of surveying the enterprises or the citizens again.
The Austrian Business Register - The Business Register of Statistics Austria serves as an instrument for all surveys conducted in economic statistics and even for some in social statistics. It has been designed according to the requirements of Council Regulation No. 2186/93 on business registers for statistical purposes within the EU and contains about 410,000 enterprises, including their establishments and local units. All in all it has about 560,000 active and 210,000 inactive units (i.e. enterprises, establishments and local units) and has been in operation since mid-1995. The BR is a central register held at Statistics Austria for statistical purposes. It is designed to comply as far as possible with European requirements but, like the registers of most EU member states, it generally does not recognize any difference between a legal unit and an enterprise.

AR used for the maintenance of the BR - Basically four administrative registers are used for the continuous servicing of the BR:

1. Register of the Federal Economic Chamber (FEC). Until 2000 it was the sole administrative source used for the updating of the BR. Most natural and legal persons who want to pursue a business have to apply for a trade licence. A separate licence is necessary for each different trade in which someone wishes to engage. The register of members of the Economic Chamber has about 350,000 entries. The register of the Economic Chamber is still an indispensable source of information for maintaining the quality of the BR (local units, status of enterprises). Unfortunately, not every economic unit has to become a member of the Federal Economic Chamber; this holds, for example, for physicians, lawyers and civil engineers, who have their own chambers.

2. Register of Incorporated Companies (RIC). This register is a public electronic register which is operated by special courts. It keeps record of, and informs about, all facts concerning enterprises which have to be stored according to commercial law (name and address of the company, company number, legal form, enrolment of the submission of accounts, changes of the persons authorised to represent the company). The register of companies contains only about 150,000 entries of corporations or merchants who have been entered as such in the register. Generally, such a person must have a net turnover above 400,000 € per year, or above 600,000 € for food retailers or general stores.

3. Social Security Register (SSR). Each Austrian employer has to register his employees with one of about 20 different social security insurance institutions. Which insurance institution is responsible for a certain employee depends on the region and the kind of employment contract. It is possible that the employees of an employer are registered at two or more different insurance institutions (e.g. if an employer has local units in more than one province). For each employment of a certain person by a certain employer, a data record is stored at the responsible social security insurance institution containing, among other things, the social security number of the person, an identification code of the employer, a code of the insurance institution, the sex of the person and the kind of employment contract. Because of the federal organisation of the social insurance system, most of the institutions are members of an umbrella organisation called the Main Association of Austrian Social Security Institutions.
The Main Association has access to the employment registers of its members and additionally maintains a register which holds one record for each combination of insurance institution and employer. This register of employer accounts (SSR) contains the name of the employer, postcode, address, place of the enterprise, NUTS-3 and NACE codes, and contains about 350.000 units. The units of this register are not comparable with the units of the BR. Usually, one enterprise of the BR consists of 0 to n units of the Social Security register. 4. Tax Register (TR) The most comprehensive administrative register used for the updating of the BR is the register of the tax authorities. It contains basic information like name and address, date of birth, sex and civil status (the last 3 primarily for persons), legal status and economic classification according to NACE (primarily for enterprises) for about 6 million taxable units (persons, business partnerships, corporations, institutions, associations…). The coverage of this basic tax file is much broader than that of the BR. To get a sub-file from the basic tax-file which is comparable with the BR it has to be merged with the turnover taxation file from the tax authorities containing about 600.000 units. Statistics Austria receives both files of the tax register monthly. Both files include a unique subject identification key which can be used for merging the two files. The turnover taxation file contains all units from the basic file which did a turnover tax return in at least one of the last 3 years. A problem is the lag between a fiscal year and the time, when all units have received their tax assessment. This lag is about 2-3 years. To get a realistic value of total turnover in 2000 you have to wait at least until mid 2003. The merged file of that date covers units which are no longer active at present. On the other hand, in the merged file units of the BR are lacking which are not liable for turnover taxation (e.g. turnover from medical activity). Nevertheless, most of the units of this merged file are in accordance with the enterprises of the BR. From the start of 2003 on, the problem of the time lag of the turnover returns has been reduced, because now each enterprise with a turnover above 100.000 € in a year has to do a monthly turnover tax WP1 60 advance return beginning with January of the next year. Therefore, new enterprises are registered earlier than in the past in the basic tax file. 5. Other administrative sources: The staff of the unit responsible for the manual updating of the BR uses also more or less regularly additional administrative sources which are not supplied in an electronic file and/or not monthly. Examples of such sources are the register of reliability of the borrowers (Kreditschutzverband), membership directories of the Medical Chamber, the Chamber of Lawyers, of Civil Engineers, Patent Agents, Notaries etc.,… Problems in the available administrative registers and record linkage - The units of the AR do not exactly agree with the units of the BR (enterprises, establishments and local units) and each register has its own system of identification keys for its units. Some information in the AR is incomplete or wrong (e.g. the NACE classification) or the timeliness of some units in the AR is different from that in the BR. The greatest problem is the non-existence of a unique numerical identifier for the units in different registers. 
The matching of the register units therefore has to be done by comparing mainly text fields, like the name of the company and its address. These fields are not standardised and are of different length in different registers. Fortunately, both the BR and the different AR store the postal and/or municipality code for each unit, which greatly reduces the number of necessary comparisons. Generally the most suitable kind of BR unit for linkage with the units of an external AR is the enterprise; only the licences of the Federal Economic Chamber are linked with the local units of the BR. For each AR a table is loaded in the DB/2 database of the BR, in which each enterprise/local unit is assigned the identification keys of the corresponding units in the AR.
Record linkage methodology - For record linkage of the units of two registers Statistics Austria uses the bigram method. For the comparison of the name of a unit a in register A and the name of a unit b in register B, the two names are decomposed into overlapping bigrams (e.g. ‘MAYER KARL’ is decomposed into the bigrams ‘MA’, ‘AY’, ‘YE’, ‘ER’, ‘KA’, ‘AR’, ‘RL’). The length of a name is measured by the number of different bigrams it contains, and the similarity of two names by the number of bigrams belonging to both names, divided by the square root of the product of the lengths of the two names and multiplied by 100. The result is always a value between 0 and 100: a value of 0 means no common bigram, and a value of 100 signifies that the two compared phrases are identical. We have also experimented with other similarity measures but found no great differences between them. In the special situation where the text field in one register is generally shorter than the corresponding one in the other register, it can be advantageous to divide by the minimum number of bigrams in the two strings instead of by the geometric mean. The bigram method is simple to implement, usable for any language and robust against permutation of words in a phrase.
Parsing - To achieve satisfying results with the bigram method it is necessary that the compared text fields of the same unit in different registers are not written too differently. Before the text variables of two registers are compared for similarity, they must be standardised and parsed. This is essentially a statistical process. It is usually done by computing the frequency of all words in both texts. If the frequency of a string (like ‘corp’, ‘inc’, ‘ltd’, ‘doctor’) is very different in the two registers, one can either delete that string in both registers, abbreviate it identically in both registers, or replace it by a synonym in at least one register. Other steps are the conversion of lowercase characters to uppercase, the conversion of special characters, etc.
Blocking - The comparison of thousands or even millions of records of one register with all records of another big register would require, even for a modern computer, very long elapsed times. Therefore only a small subset of all possible pairs of units is compared and their similarity measured. For example, in the monthly matching of the Social Security units with the units of the BR, in one run the units with identical names in both registers are compared for similarity of name, postal code and legal status; in a second run the units with identical address; in a third run units with the same postal code and the same Christian name; and in the last run units with the same first name are compared.
The total similarity of name, postal code and legal status is a weighted average of the three separate similarities. The weights, and the threshold above which a pair is a candidate for a possible link, are determined empirically. At the moment the weights are .70 for name, .10 for postal code and .20 for legal status, and the threshold for the total similarity is 87. Units with a total similarity above 87 are listed for manual checks.
Monthly updating of the Austrian BR - Each month Statistics Austria receives copies of the four AR, which are incorporated in the BR in the following five phases.
1. Creation of new relations. Our SAS record linkage program tries to suggest, for each unit of the AR which is not yet linked with a unit in the BR, a possible candidate for a positive match. It is the pair with the highest similarity above a threshold. All these pairs are listed for manual inspection.
2. Checking and storage of new relations. If the manual check for a pair of units succeeds, then both the key of the BR unit and the key of the AR unit are stored in a separate table of the DB2 database. No other information of the AR is stored in the BR, because it can be connected whenever necessary by using the stored relation of the keys. In case of a failure, the pair is deleted from the list. For the manual inspection, information additional to that in the four AR is also used (e.g. search engines on the Internet).
3. Creation of new business units. The units of an AR which even after the above two steps are not linked with a unit in the BR are considered as units belonging to newly born enterprises. The SAS record linkage program tries to combine all the units in the four AR belonging to the same enterprise (it needs six runs to compare each AR with all the others). At best a new unit is found in all four AR, but usually it is found only in two or three AR. All the pairs, triples and quadruples belonging to enterprises which have at least one employee or a yearly turnover above 22.000 € are listed for manual inspection.
4. Checking and storage of new business units. If the manual check for a pair, triple or quadruple of units succeeds, then a new unit is created in the BR and the relations between that new unit and the corresponding units in the AR are stored as in step 2.
5. Deletion of dead units. If an incorporated enterprise is deleted from the register of companies, it is also immediately cancelled in the BR. In the past, non-incorporated enterprises have been checked only quarterly for possible deaths; from mid-2007 onwards we will check monthly for possible deaths. An enterprise has died if it can no longer be found in any of the AR or if the yearly turnover in the last two years was under 22.000 €.
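The bigram similarity, the weighted total similarity and the threshold-based suggestion of candidate links described above can be sketched in a few lines. The following Python fragment is only an illustration (the production system is a SAS program, as stated in the text); the record fields `name`, `postcode` and `legal_status`, and the scoring of postal code and legal status as simple agreement (100) or disagreement (0), are assumptions, since the report does not specify how the two non-name similarities are computed.

```python
import math

def bigrams(name: str) -> set:
    """Distinct within-word bigrams of an (already standardised) name."""
    grams = set()
    for word in name.upper().split():
        grams.update(word[i:i + 2] for i in range(len(word) - 1))
    return grams

def bigram_similarity(a: str, b: str, use_min: bool = False) -> float:
    """Similarity in [0, 100]: common bigrams divided by the geometric mean of
    the two lengths (or by the shorter length when one field is systematically
    truncated), multiplied by 100."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    denom = min(len(ga), len(gb)) if use_min else math.sqrt(len(ga) * len(gb))
    return 100.0 * len(ga & gb) / denom

def total_similarity(unit_a: dict, unit_b: dict) -> float:
    """Weighted average of the name, postal code and legal status similarities
    (weights .70/.10/.20 as reported above); exact agreement on the last two
    fields is scored 100, disagreement 0 (an assumption)."""
    return (0.70 * bigram_similarity(unit_a["name"], unit_b["name"])
            + 0.10 * (100.0 if unit_a["postcode"] == unit_b["postcode"] else 0.0)
            + 0.20 * (100.0 if unit_a["legal_status"] == unit_b["legal_status"] else 0.0))

def suggest_link(ar_unit: dict, br_units: list, threshold: float = 87.0):
    """Phase 1 of the monthly update: propose, for an unlinked AR unit, the BR
    unit with the highest total similarity above the threshold (or None)."""
    best = max(br_units, key=lambda br: total_similarity(ar_unit, br), default=None)
    if best is not None and total_similarity(ar_unit, best) > threshold:
        return best
    return None
```

In practice the candidate search would of course be restricted to the blocks described above (identical name, identical address, same postal code and Christian name, etc.) rather than run against all BR units.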
Conclusions - Figure 6 demonstrates the success in raising the coverage of Statistics Austria's BR: from 2001 onwards the number of active units of the BR has increased by one third, from 300.000 to around 410.000. This increase is partially caused by a better coverage of the public and non-profit-oriented sector and partially by a lower under-coverage of the profit-oriented sector. Not only has coverage increased, but also the number of enterprises which are linked to one or more AR. Now 99% of all active enterprises are linked to a unit in the tax register, 73% to a unit of the Federal Economic Chamber, 69% to at least one employer account number of the Social Security system and 37% to a unit of the Register of Incorporated Companies. The monthly updating of the BR from administrative data, which began in 2005, results in a steady development of the BR and a good basis for sample selection and grossing up of the data. The use of computer-assisted record linkage has reduced the amount of manual work necessary for the maintenance of the BR quite considerably. It enables the combination of information (employees, turnover, NACE code) from the AR with the BR and also reduces the response burden of enterprises.
Figure 6 – Number of enterprises in the Business Register by type of relation to AR. [The figure plots, monthly from 1/2001 to 1/2007, the number of enterprises (0 to 500.000) in the BR and the number of BR enterprises linked to the RIC, FEC, SSR and TR.]
Bibliography
Council Regulation (EEC) No 2186/93 of 22 July 1993 on Community coordination in drawing up business registers for statistical purposes.
Bundesstatistikgesetz 2000. Federal Statistics Act of 2000, BGBL I Nr. 163/1999, idF BGBL I Nr. 136/2001, Vienna.
Haslinger, A., 1997. Automatic Coding and Text Processing using N-grams. Conference of European Statisticians, Statistical Standards and Studies No. 48, Statistical Data Editing, Volume No. 2, Methods and Techniques, pp. 199-209. UNO, New York and Geneva.
Haslinger, A., 2004. Data Matching for the Maintenance of the Business Register of Statistics Austria. Austrian Journal of Statistics, Volume 33, No. 1&2, pp. 55-67. http://www.stat.tugraz.at/AJS/ausg041+2/041+2Haslinger.pdf
4.5. The use of data from population registers in the 2001 Population and Housing Census: the Spanish experience
INE (Spain)
For the 2001 Spanish Censuses, the chosen option was a classical census with a first-time exploitation of the Continuous Population Register (CPR, known as Padrón Continuo de Habitantes): an operation based on a thorough itinerary around the territory, strongly supported by the CPR and by a single, comprehensive questionnaire, implemented exhaustively. The population count based on the Population Register (PR) is not immediately accepted as the best possible, but is checked and corrected against reality through the complete enumeration. The census is thus an exhaustive evaluation of the coverage of the PR, and allows population counts to be adjusted accurately, reducing the main sources of under-coverage (typical of classical censuses) and over-coverage (typical of some PRs). Technical and legal measures must be applied to ensure that the PR information to be checked is treated, throughout the whole operation, in a proper way, different from the rest of the census information. The key points of the process can be summarised as: a) the Population census is based on the Register data in order to improve precision, cut costs and bother the citizens as little as possible, taking advantage of the fact that register data can legally be used for statistical purposes; b) the data collected in the census questionnaires are not transferred to the Register (as this would violate statistical secrecy); c) the modifications entered by the inhabitants in their register data are noted on specific sheets and sent to the Council so that, after performing the necessary verifications, the Register is updated with the proper corrections.
Purpose of the integration activities: advantages and drawbacks, and complexity of the problem - The aim of this kind of Census is to reduce both costs as response burden, through the use of relevant administrative registers, and complemented with an exhaustive statistical operation, with a twofold aim: to improve the accuracy of population counts, and, in addition, to obtain from the census variables not available from the combination of registers. Although a long-term aim, a Census based exclusively on administrative registers was still considered unfeasible; due to the need for delicate legislative reforms, problems for social acceptance, lack of a common identity number for each person, and non-standardised and easily exploitable administrative information. WP1 64 It also presents several drawbacks: register information gives rise to rights and duties. So, they are prone to contain 'convenient' rather than 'true' information’; e.g., PR address is not always the usual residence of a person (even though legislation provides so) but their best trade off between rights and duties. Moreover, trusting indefinitely in the reliability of PR counts without any kind of periodical check is risky, because of cumulative errors that, to more or less extent, every PR inevitably contains (difficulties in measuring accurately the departures from the country is, perhaps, the best example). Furthermore, a classical Census, as a comprehensive population count, with slight or no relationship with the Register, would not be appropriate either: as well as not suitably maximising the potential savings brought about by supporting the information with data from the Register, it would not satisfy the common benefit relationship established in article 79 of the Population Regulation (see INE 2001, 12-14), which states that the carrying out of the Population Census must rely on data from all the Municipal Registers, and municipalities, owners of the registers, must help the INE in whatever it needs. On the other hand, a Census consisting in a combination of registers and a complete enumeration offers a. More precise population counts than in a classical census, thanks to the previous information contained in the PR (preventing undercoverage) and more precise than an exclusively register-based census, thanks to the checking against reality that complete enumeration supplies (preventing cumulative errors of the PR). b. Information not available through register integration, which is obtained in an exhaustive, classical, way, allowing maximum geographical and conceptual detail. c. The longitudinal perspective allowed through the use of registers as its main support. d. The use of more efficient collection methods, achieved via the previous knowledge of the location where every person is registered. However, drawbacks also come from that intermediate-point condition. a. They are more expensive than exclusively register-based censuses, because of the exhaustive collection operation (anyway, it should be cheaper than classical censuses). b. Response burden, ceteris paribus and ruling out other factors, is also somewhere between the minimum achieved in censuses without specific collection operation and the maximum of censuses with no previous information support. Specification of the sources to be included and other previous conditions - Availability of a PR, at least reliable enough as an initial solution to state how many people, whom and where will be counted in census figures, is obviously needed. 
It is also advisable to have other administrative registers usable for census purposes, such as the Cadastre, tax declarations, Social Security general affiliation files, public unemployment registers, educational qualification records, and so on. This type of census, as regards its relationship with the PR, has two variants, depending on whether the Census is simply supported by the PR, or whether the benefits are mutual, such that the PR uses the census operation to update and improve its information. In the latter case, the law governing the PR must explicitly provide for such use of the census operation to update the PR (while preserving statistical confidentiality for the strictly census-related information)16. Accordingly, the Population Regulation enables the INE to perform operations to control the precision of register data (article 78), and establishes (article 79) that these operations will be performed on the occasion of the Population censuses. Councils must be notified of the results of these operations. The modification of the register data should be entered in a specific document, thus avoiding the legal problems that could derive from the direct use of the census questionnaire to collect changes to be performed in the Register.
Scope and methods - The Population census only includes persons, regardless of their nationality, whose regular address is located in the national territory. As regards the Housing Census, the population scope considers dwellings and group establishments. Dwellings are considered to be all venues used for human habitation that are family dwellings, plus those others that, although not designed for that purpose, are actually inhabited on the date the Census is performed; the latter are called Accommodations. The research covers the whole national territory, and the counts of the different units refer to a single census date, in this case 1 November 2001. The infrastructure was obtained using two relevant administrative files: the Register and the Cadastre. The former provided the location of the buildings containing main dwellings (in which there were people who resided there regularly) and the latter allowed the identification of the other buildings (buildings without resident persons and commercial establishments).
- The CPR (Municipal Register of Inhabitants) information, established in the Law on the Bases of the Local Regime, is limited to: name and surname, ID number, address, sex, place and date of birth, nationality, and school or academic education of the persons resident in the municipality. This information was also useful to rule out some additional questions which would otherwise have been needed to establish directly, together with kinship, which is the family nucleus and what is the structure of the most complex families.
- The Urban Cadastre database, combined with the CPR, allowed a single census itinerary and also implied economic savings, since the preparation process traditionally performed in years ending in 0, the Censuses of Buildings and Commercial Premises, was replaced by crossing the two computerised databases, with great advantages. It also allowed the characteristics of all of the units to be gathered: buildings, households and individual persons. That is to say, as the Housing and the Building Censuses were jointly performed for the first time, it has been possible to study the characteristics of the population on the basis of the kind of building used as a residence.
The different nature of PR information, with both administrative as statistical purposes, and the rest of census data, only with statistical purposes, must be clearly explained in the questionnaires. Separated sheets for PR information may help to emphasize this essential distinction. And all along the processing, a proper separation between the two types of information must be assured; for instance, files containing personal identifications should never contain statistical non- PR variables. 16 To avoid the violation of the Fundamental Principles of Official Statistics (see UNECE / EUROSTAT, 2007, Chapter I, paragraph 17), stating that “individual data collected by statistical agencies […] are to be used exclusively for statistical purposes”. WP1 66 The 2001 Census operation included four models of questionnaires, although not all the persons resident in the dwellings had to complete them. The following documents were used: Register information. This document verified if the information the Councils have in the Register, which is printed on this sheet, is correct. Dwelling Questionnaire. This questionnaire garners the most important characteristics of main dwellings (those where a person lives regularly). Household questionnaire. Gathering census variables that have to be completed by all persons (kinship, marital status, type of studies, municipality of residence on March 1st 1991, etc.). Individual questionnaire. Only completed by persons aged 16 years old or older who study or work. Containing information on the types of work or studies, and the place where they are carried out. Pre-filling questionnaires with PR information is a complex technical task, especially when associated with the large census volume and with the constraints imposed by optical reading technology; for example the necessity of finding the information in a very exact location of the questionnaire to be effective, or the convenience of using blind colours for the fixed, noninformative, parts of the questionnaires. In order to update the CPR information, the INE was in charge of channelling the proposed changes to the Councils. Proposals for the modification of register data, which were performed by some citizens, were compiled and sent to each Council involved. In turn, after carrying out additional verification procedures, the Council sent the INE the accepted variations, which were introduced in the corresponding Register. Finally, the INE, after receiving the confirmation, consolidated the variations in their copies of the register files. The building and updating of the CPR - The INE keeps a backup of all the Municipal Registers which was built from the files corresponding to the last ex-novo Register Renewal referred to 1 May 1996- and the monthly variations occurred in the Municipal Register data and issued by the Town Councils from then on, with the aim of co-ordinating them and preventing duplicates, in fulfilment of its obligations as imposed by the law in force. The co-ordination of all the Municipal Registers (MRs) consists of checking and adding to that backup all those monthly variations, and then reporting to the Town Councils any inconsistencies detected. To that end, not only the data issued by the municipalities, but also data from other administrative sources is used. 
It is important to remark that, unlike other sources with an exclusively statistical purpose, inconsistencies are detected and notified to the corresponding municipality; the relevant information is kept in order to know at any moment whether the municipality has modified the data or not, and the reason for it, but the corrected data are never added directly to the database. The external sources against which the CPR is checked are the following:

Source | Relevant facts for the CPR
Civil Register | Births and deaths; changes of nationality, name, surname and sex
Ministry of the Interior | Issues and renewals of National Identity Cards (DNIs) or, in the case of foreigners, the documents that replace them: temporary residence permits, residence cards (NIEs)
Ministry of Education and Science | Issues of educational certificates
Ministry of Foreign Affairs | CPR of Spaniards Resident Abroad (PERE): Spaniards registered/unregistered because of moving abroad

Some basic indicators
Information from the Town Councils and other administrative sources used for the initial load and the (approximately) monthly updates:

Source | Regular basis | Initial load | Monthly procedure
Town Councils | Complete MR database and variations | 40.200.000 | 1.062.902
Ministry of the Interior | DNIs | 39.015.395 | 493.973
Ministry of the Interior | Residence cards | 2.111.725 | 150.883
Ministry of Education and Science | Certificates | 28.037.758 | (1)
Ministry of Foreign Affairs | PERE | 887.857 | 12.647
Civil Register | Births | 612.848 | 34.858
Civil Register | Deaths | 1.724.561 | 30.892
(1) Only the initial file has been received.

(Approximately) monthly cross-checks between updated administrative sources and updated CPR tables
Monthly new updates of the CPR database (including variations reported by the Town Councils) are themselves cross-checked against information from administrative sources (both updated databases and updates). The total amount of checks, on average, is shown in the table below:

Source | Regular basis | Monthly
Ministry of the Interior | DNIs | 888.893
Ministry of the Interior | Residence cards | 669.465
Civil Register | Births | 78.999
Civil Register | Deaths | 65.321
Ministry of Foreign Affairs | PERE | 18.725

Monthly reversals issued to the Town Councils
The table below shows the number of transmissions to the Town Councils carried out during the (approximately) monthly procedure of checking variations and, amongst them, how many are related to the cross-checks against the different sources: first after the initial issue and then when performing the monthly procedure:

Source | Regular basis | Initial issue | Monthly procedure
Total | | - | 462.776
Ministry of the Interior | DNIs | 3.004.512 | 60.123
Ministry of the Interior | Residence cards | 521.136 | 43.479
Ministry of the Interior | Card expirations | 1.807.084 | 75.849
Civil Register | Births | 28.113 | 1.599
Civil Register | Deaths | 188.588 | 20.065
Ministry of Foreign Affairs | PERE | Not available | 2.562

It is also planned to transmit, during 2008, new citizenships acquired by foreign nationals and changes in the data on name, surname or sex held by the Civil Register.
Cross-checks - The process of identifying pairs of records when browsing variation datasets from the Town Councils and the Civil Register consists of a sequence of the following comparisons between common identifiers: 1. Complete agreement in name, surname, date of birth and DNI. 2. For those not matched in 1): partial agreement in name, surname, date of birth and DNI. 3. For those not matched in 2): clerical review. 4.
For records that were not yet matched, a new search is performed using several agreement criteria, and a table containing all those data issued by the corresponding source and related to the CPR database is stored, in order to identify, monthly, all those variations not previously matched. Conclusions - An integration of registers and an exhaustive collection operation improves flexibility in the content, while reducing the response burden in comparison with a classical census with the same information. Compared with integrating registers and sample surveys, the main advantage is the complete geographical and conceptual detail of all the variables, whether available in the registers or not. Information from previous censuses and related administrative data also improves appropriate processing and editing of data as well as imputation when incoherence or missing values are detected. Dissemination also gets benefits from the previous censuses, because of the longitudinal perspective it allows. Finally, to trust indefinitely in the PR counts without any periodical check against reality has several, very tempting, advantages (drastic reduction of census costs and respondent burden) but also a potentially very negative consequence: population figures may become very apart from true figures. However, the issue of the risks associated with the use of the census to update the population registers is very important, so it deserves a more complete assessment. It is completely true that this risk exists, but it must be compared with the disadvantages of the opposite. So, each country having a PR should weigh up the relative advantages and disadvantages and decide whether it is advisable a two-way relationship between census and PR or not. In case of opting for doing so, some technical and legal aspects related to privacy issues, and differences between PR and Census data, will arise as especially relevant. Bibliography Instituto Nacional de Estadística (2001): "Population and Housing Census 2001: 2001 Census Project". INEbase18, INE, Madrid. http://www.ine.es/en/censo2001/infotec_en.htm Instituto Nacional de Estadística (2005a): "Outline of the census type planned for 2011 in Spain". Submission to the United Nations Statistics Division Website on 2010 World Population and Housing Censuses. INE, Madrid. 18 INEbase is the system the INE uses to store statistical information on its corporate web page for the Internet. WP1 70 http://unstats.un.org/unsd/Demographic/sources/census/Spainpdf.pdf Instituto Nacional de Estadística (2005b): "Description of the census type consisting in a combination of registers and a complete enumeration", in Summary of Comments on the Draft Version of the CES Recommendations for the 2010 censuses of Population and Housing. Joint UNECE/EUROSTAT Meeting on Population and Housing Censuses Organised in cooperation with UNFPA, Working paper No.5. Geneva, pp 36-38. http://www.unece.org/stats/documents/ece/ces/ge.41/2005/wp.5.e.pdf UNECE/EUROSTAT (2007): Appendix II: Alternative approaches to census-taking", in Recommendations for the 2010 censuses of Population and Housing. Conference of European Statisticians (CES). UN, New York and Geneva, pp. 153-165. http://www.unece.org/stats/documents/ece/ces/ge.41/2007/mtg1/zip.1.e.pdf 4.6. 
Administrative data source (DBP) for population statistics based on the ISEO register in the Czech Republic
Jaroslav Kraus (CZSO)
The Database of Persons (DBP) is a fully historical register of all persons with a permanent place of residence on the territory of the Czech Republic. Data from ISEO (Integrated System of Persons' Evidence) are the main constitutive source for creating the DBP. According to the legislative rules, only information about persons (individuals) is involved, without any relation to other individuals. Thus, no information about families or households is accessible.
Data model - The data model is based on some basic principles:
o Full history for all defined entities;
o Sharing of entities for common data storage;
o Data-driven contents of monitored attributes;
o Lossless management of changes of monitored entities.
Monitoring of history - Full monitoring of the main defined entities is based on the time validity of their contents. The status of the DBP with respect to historical data is given by the attributes and by the date of the view of the DBP.
Sharing classes of entities - Sharing classes of entities guarantees that common data are stored in one defined place. This approach is used for the entities of attributes of interest and attributes of dates.
Nomenclatures - The content of the nomenclatures of the DBP gives the range of the attributes of interest and dates concerning physical persons. If the number and structure of the attributes increase, it is not necessary to change the structure of the data model, but only the content of the nomenclatures.
Lossless management of changes of monitored entities - This approach to the data model is given by:
o Only editing of attributes and availability of records;
o Creation of a record of changes in a separate entity;
o Impossibility of physically deleting records in the DBP;
o Identification of all changes by time stamp and signature of the author.
These principles guarantee the storage of all information, with the possibility of following its historical development.
Data contents
Data source (ISEO) - According to the agreement between the Ministry of the Interior and the Czech Statistical Office, data from ISEO (i.e. the Integrated System of Persons' Evidence) are accepted as a data source for the DBP. ISEO is used as a set of records without relations to each other; this means that the structure of households is not, for the time being, accessible. However, there is the possibility of generating supplementary records in order to obtain additional information about the population. For example, for a record containing information about a deceased person, as many records are added as there are persons (e.g. members of the household, such as spouse and children) connected to the deceased person.
Process of ISEO data transformation - The data coming from ISEO are transformed at a separate workplace, where 1,…,n records are generated from each ISEO record. Thus two data structures are created.
Copy of original ISEO data records - This part of the DBP contains copies of the records for the given persons <person_DBP>, followed by the possible records of the parents (<father>, <mother>), partner <partner> and all children (<children>). Related persons (parents, partners and children) are stored in the RelatedPerson structure <related_person>. There is no personal identification of these records; the rest of the structure is the same as for the records in the previous paragraph. Relations between these two types of records are not permitted.
Derived records of (related) persons - Derived records of related persons are created from the (ISEO) related persons (and their records). This derivation means that an ISEO RelatedPerson is defined as a DBP MainPerson whose attribute information comes from the ISEO RelatedPerson. These records are stored in the <person_related> structure, with the following mapping from ISEO records to derived records:
o ISEO/Father: ISEO/father becomes Derived/Person_DBP and ISEO/Person_DBP becomes Derived/child(ren);
o ISEO/Mother: ISEO/Mother becomes Derived/Person_DBP and ISEO/Person_DBP becomes Derived/child(ren);
o ISEO/Partner: ISEO/partner becomes Derived/Person_DBP and ISEO/Person_DBP becomes Derived/partner;
o ISEO/child/children: ISEO/child/children becomes Derived/Person_DBP and ISEO/Person_DBP becomes Derived/father or Derived/Mother, according to ISEO/Person_DBP/sex.
This approach respects the restrictions on the use of ISEO data without losing the family-relationship information carried by the records. The restrictions on the use of ISEO data do not make it possible to keep full household information, but it is possible to construct proxy households corresponding to nuclear families (parents and children). These proxy households may be a sufficient approximation for many analytical needs; however, this idea should be tested in the future.
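The derivation rule described above can be illustrated with a short sketch. The record layout and field names below (IseoRecord, person_DBP, sex coded "M") are assumptions made only for illustration; they are not the actual DBP schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# A minimal, hypothetical shape for an ISEO record: the main person plus
# unidentified related persons (father, mother, partner, children).
@dataclass
class IseoRecord:
    person: dict                      # attributes of the main person, incl. "sex"
    father: Optional[dict] = None
    mother: Optional[dict] = None
    partner: Optional[dict] = None
    children: List[dict] = field(default_factory=list)

def derive_related_records(rec: IseoRecord) -> List[dict]:
    """Promote each related person to a derived main-person record and attach
    the original ISEO person in the inverse role, following the mapping above."""
    derived = []
    for parent in (rec.father, rec.mother):
        if parent is not None:
            # a parent becomes a derived main person, the original person its child
            derived.append({"person_DBP": parent, "children": [rec.person]})
    if rec.partner is not None:
        # a partner becomes a derived main person, the original person its partner
        derived.append({"person_DBP": rec.partner, "partner": rec.person})
    for child in rec.children:
        # a child becomes a derived main person; the original person becomes
        # its father or mother according to the original person's sex
        role = "father" if rec.person.get("sex") == "M" else "mother"
        derived.append({"person_DBP": child, role: rec.person})
    return derived
```

Collecting the original record together with its derived records gives exactly the proxy nuclear household mentioned above.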
4.7. An experiment of statistical matching between the Labour Force Survey (RFL) and the Time Use Survey (TUS)
Gianni Corsetti (ISFOL, Italy)
The ISTAT social surveys are a privileged framework for testing the potential of statistical matching methods and producing a critical evaluation of them. Taking advantage of this opportunity, the objective of creating a synthetic archive through the integration of the data collected in two important ISTAT surveys was tackled: the current Labour Force Survey (RFL) and the Time Use Survey (TUS). Such a dataset could allow the study of the relationships between the variables specific to each survey. The presence of a wide set of common variables, and of similarities in the design of the survey models, helps the data integration process. The two surveys also share many themes, so the definitions used and the questions asked were harmonized during the planning phase. The TUS survey is very rich and articulated as far as the organisation of time use and the perception of the quality of life are concerned. The survey, conducted in 2002-2003 in compliance with Eurostat's guidelines, foresaw, along with an individual questionnaire and a family questionnaire, the collection of daily time use for people living in Italy through the use of a daily journal and a weekly journal. The Labour Force Survey represents the primary source of information for studying the job market. Using the specific variables of the two surveys together could allow a researcher to analyse at the same time the characteristics of the labour force and of time use. In this way it is possible, on one side, to enrich the analysis of the job market with subjective information and with the detail of the organisation of daily life and, on the other side, to integrate the more general analysis of the quality of life provided by the TUS with in-depth explanations of the characteristics of the working condition collected by the RFL.
The data - In order to undertake a first experiment of integration of the two surveys through statistical matching techniques, the Tus was considered as the recipient and the Rfl as the donor survey. More precisely, the analysis was restricted to employed people only, and the Tus dataset consisted of all the records (22,312 records) of the individuals observed during the entire collection period of the survey (April 2002 – March 2003).
As far as the RFL survey is concerned, only the observations collected during the first three months of 2003 (30,526 records) were considered, so that the records in the two surveys could be regarded as coming from homogeneous populations. Statistical matching was performed on individual non-weighted data, while for the ex-post evaluation of the results the weights of the recipient survey, i.e. the Tus, were used on the overall completed data set. The common variables present in both surveys which, appropriately harmonised, were used to perform the unconstrained statistical matching are: Sex; Age groups; Status; Education; Family size; Geographical area; Type of town; Area of employment; Type of contract. The marginal distributions of the common variables in the two surveys do not show significant differences. The most considerable differences are in the weighted distributions of the variable “Family size”, where the RFL estimates are unbalanced towards the smaller families compared to the Tus estimates, and of the variable “Type of town” for some categories. Some of these variables have been used as stratification variables. This implies that the search for a donor was performed only amongst those units with an identical value for such variables. The same stratification variables were not always used: because of computational constraints the first two stratification variables (“Geographical area” and “Type of contract”) were always kept, while the other variables were introduced when necessary, so that the number of records in a stratum was small enough to allow the statistical matching software to execute. Table 1 describes the strata built in this way and the corresponding number of records in the two archives.

Table 1 – Description of the strata created for the matching process

Stratum | Geographical area | Type of contract | Sex | Age | Geographical area (detail) | Number of records, Tus 2002-2003 | Number of records, Rfl 1st trimester 2003
1114c | North-centre | Permanent | Male | <=44 | Centre | 1.084 | 1.315
1114e | North-centre | Permanent | Male | <=44 | North-east | 1.262 | 1.956
1114o | North-centre | Permanent | Male | <=44 | North-west | 1.739 | 2.012
1115 | North-centre | Permanent | Male | >=45 | - | 1.994 | 2.488
1124c | North-centre | Permanent | Female | <=44 | Centre | 907 | 1.071
1124e | North-centre | Permanent | Female | <=44 | North-east | 1.210 | 1.697
1124o | North-centre | Permanent | Female | <=44 | North-west | 1.638 | 1.774
1125 | North-centre | Permanent | Female | >=45 | - | 1.451 | 1.838
1214 | North-centre | Self-employed | Male | <=44 | - | 1.432 | 1.979
1215 | North-centre | Self-employed | Male | >=45 | - | 1.322 | 1.856
122 | North-centre | Self-employed | Female | - | - | 1.333 | 1.963
2114 | South-islands | Permanent | Male | <=44 | - | 2.018 | 2.974
2115 | South-islands | Permanent | Male | >=45 | - | 1.241 | 1.920
2124 | South-islands | Permanent | Female | <=44 | - | 1.154 | 1.692
2125 | South-islands | Permanent | Female | >=45 | - | 629 | 1.030
221 | South-islands | Self-employed | Male | - | - | 1.395 | 2.107
222 | South-islands | Self-employed | Female | - | - | 503 | 854
Total | | | | | | 22.312 | 30.526

Matching technique used – In this work, we have chosen to work under the hypothesis of conditional independence between the specific variables of the Tus and the RFL given the matching variables. The synthetic archive was created through the implementation of hot-deck donation techniques. Gower's similarity index is the distance function used for selecting the record from the donor survey (Rfl) to associate with each record in the recipient survey (Tus). This index is calculated for every pair of records from the donor and recipient files; if two or more observations in the donor survey present the same minimum distance from the recipient record, then one of them is chosen randomly, assigning to each the same probability. The software described in Sacco (2008) was used for the application of the matching methods. During the statistical matching process and the computation of the Gower distance, all the common variables that are not stratification variables in the given stratum were given a weight equal to one; in this way, all the variables were given the same importance.
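The stratified, unconstrained hot-deck matching just described can be sketched as follows. This is only an illustrative Python fragment, not the SAMWIN software actually used (Sacco, 2008): the variable names in MATCHING_VARS and the assumption that records are dictionaries grouped by stratum are hypothetical, and since the common variables are all categorical and receive unit weights, the Gower distance reduces here to the share of mismatching variables.

```python
import random

# Hypothetical names for the non-stratification common variables.
MATCHING_VARS = ["sex", "age_group", "status", "education",
                 "family_size", "type_of_town", "area_of_employment"]

def gower_distance(rec_a: dict, rec_b: dict, variables=MATCHING_VARS) -> float:
    """Gower distance for categorical variables with unit weights:
    1 - (number of agreeing variables / number of variables)."""
    agreements = sum(rec_a[v] == rec_b[v] for v in variables)
    return 1.0 - agreements / len(variables)

def hot_deck_match(recipients: list, donors_by_stratum: dict, rng=random) -> list:
    """Unconstrained hot-deck matching: for each recipient record, pick the
    closest donor within the same stratum; ties are broken at random."""
    matched = []
    for rec in recipients:
        donors = donors_by_stratum[rec["stratum"]]
        d_min = min(gower_distance(rec, d) for d in donors)
        nearest = [d for d in donors if gower_distance(rec, d) == d_min]
        donor = rng.choice(nearest)
        # append the donor's specific variables to the recipient record
        matched.append({**rec, **{k: v for k, v in donor.items() if k not in rec}})
    return matched
```

Because the matching is unconstrained, the same donor record may serve several recipients, which is why the marginal distributions of the donated variables have to be checked afterwards, as done below.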
First results - The reliability of the results obtained by a matching process depends on how well the hypothesis of conditional independence, which is not empirically verifiable, is preserved. In this project it was not possible to use external auxiliary information to evaluate the quality of the combination and the respect of the hypothesis of conditional independence. In order to perform a first evaluation of the results, some common variables that do not play the role of matching or stratification variables were used. These are usually called “control variables”. Table 2 shows one of the possible results obtained with the implemented statistical matching. It is a cross-table of two specific variables: paid hours worked in an average weekday (Tus) and the willingness to work a number of hours different from those worked in the week preceding the data collection (Rfl)19.

Table 2 – Frequency distribution between two variables specific to the two surveys in the synthetic archive (absolute and percentage values)
Rows: willingness to work a number of hours different from those worked in the week analysed (synthetic archive, variable donated by Rfl 1st trimester 2003). Columns: hours worked in an average weekday (average generic length, Tus).

Absolute frequencies | 0-3 hours | 4-8 hours | 9 hours and more | Total
Yes, less hours | 901 | 1.374 | 509 | 2.784
Yes, more hours | 577 | 835 | 238 | 1.650
No, the same hours | 5.669 | 8.247 | 2.736 | 16.651
Don't know | 85 | 163 | 54 | 302
Total | 7.232 | 10.619 | 3.537 | 21.387

Percentage frequencies per 100 persons in the same class of hours worked | 0-3 hours | 4-8 hours | 9 hours and more | Total
Yes, less hours | 12,5 | 12,9 | 14,4 | 13,0
Yes, more hours | 8,0 | 7,9 | 6,7 | 7,7
No, the same hours | 78,4 | 77,7 | 77,3 | 77,9
Don't know | 1,2 | 1,5 | 1,5 | 1,4
Total | 100,0 | 100,0 | 100,0 | 100,0

19 The reference population of this table is represented by those employed who had performed at least one hour of work in the week preceding the Rfl data collection.

The correctness of such a result can be verified by checking whether the marginal distribution of the donated variable is maintained with respect to its original marginal distribution in the Rfl. Table 3 shows that this first control is successful, because for both absolute and relative frequencies the marginal distributions are homogeneous.

Table 3 – Marginal distribution of the specific Rfl variable in the original and synthetic archive (absolute and percentage values)

Willingness to work a number of hours different from those worked in the week during data collection | Rfl 1st trimester 2003 | Synthetic archive: variable donated by Rfl 1st trimester 2003
Absolute frequencies
Yes, less hours | 2.780 | 2.692
Yes, more hours | 1.857 | 1.691
No, the same hours | 15.928 | 16.757
Don't know | 275 | 276
Total | 20.840 | 21.416
Percentage frequencies
Yes, less hours | 13,3 | 12,6
Yes, more hours | 8,9 | 7,9
No, the same hours | 76,4 | 78,2
Don't know | 1,3 | 1,3
Total | 100,0 | 100,0
Another quality check consists in evaluating whether the joint distribution of a specific Tus variable with a control variable is also comparable (the control variable can be either the genuinely observed variable in the Tus or the corresponding imputed variable from the Rfl). In this case, the variables chosen are: working hours (full time, part time); presence of a second job (yes or no); night work (yes, no or don't know). First of all it should be underlined that, as far as the marginal frequency distributions are concerned, the control variables used are very similar and, above all, the marginal distributions of the control variables imputed in the synthetic archive are almost completely unaltered in comparison with those of the original Rfl archive (Table 4).

Table 4 – Percentage marginal distribution of the shared control variables in the receiving archive, in the donating archive and in the synthetic archive
(Synthetic = variable donated by Rfl 1st trimester 2003)

Common control variable | Tus 2002-2003 (weighted) | Synthetic (weighted) | Rfl 1st trim. 2003 (weighted) | Tus 2002-2003 (non-weighted) | Synthetic (non-weighted) | Rfl 1st trim. 2003 (non-weighted)
Working hours – Full time | 88,7 | 87,8 | 87,3 | 88,7 | 87,5 | 87,6
Working hours – Part time | 11,3 | 12,2 | 12,7 | 11,3 | 12,5 | 12,4
Second job – Yes | 3,6 | 2,6 | 2,7 | 3,7 | 2,6 | 2,6
Second job – No | 96,4 | 97,4 | 97,3 | 96,4 | 97,4 | 97,4
Night work – Yes | 13,7 | 11,2 | 11,4 | 13,9 | 11,3 | 11,3
Night work – No | 86,3 | 88,5 | 88,3 | 86,1 | 88,5 | 88,3
Night work – Don't know | - | 0,3 | 0,3 | - | 0,3 | 0,4

Table 5 shows the contingency tables obtained by crossing the specific Tus variable (paid hours worked in an average weekday) and the control variables, in the original Tus archive and in the synthetic archive (variables donated by the Rfl). Also in this case the results are quite satisfactory. This allows us to state that, for the variables analysed, the joint frequency distributions produced by the statistical combination reproduce quite closely the corresponding distributions of the original Tus archive.
A stricter check is obtained by observing whether the relationship that connects the variables of interest (the specific Tus variable and the common control variables) is adequately reproduced in the synthetic archive. Table 6 presents the odds ratios calculated on the previous two-way frequency distributions for each pair of categories of the specific Tus variable. The type of relationship between the variables is preserved, even if in the synthetic archive this relationship always seems to be less intense than in the Tus archive. There is, however, a strong differentiation in the results obtained for the three control variables: for the “Working hours” variable the odds ratios calculated in the two archives are more distant than those calculated for the other two control variables.

Table 6 – Odds ratios calculated on the frequency distribution between the specific “Time Use” variable and the shared control variables, in the original Time Use archive and in the synthetic archive donated by Rfl

Common control variable | Tus 2002-2003: 1st and 2nd type | Tus 2002-2003: 1st and 3rd type | Synthetic archive: 1st and 2nd type | Synthetic archive: 1st and 3rd type
Working hours | 0,37 | 0,11 | 0,82 | 0,43
Second job | 0,91 | 0,64 | 0,92 | 0,67
Night work | 0,84 | 0,54 | 0,89 | 0,86
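Each figure in Table 6 is an ordinary odds ratio computed from a 2x2 sub-table that crosses a binary control variable with one pair of categories of the hours-worked variable, once in the Tus archive and once in the synthetic archive. The sketch below is only a worked illustration with hypothetical counts (they are not taken from the report's tables); it shows the kind of comparison summarised in Table 6.

```python
def odds_ratio(table_2x2):
    """Odds ratio of a 2x2 contingency table [[a, b], [c, d]] = (a*d)/(b*c)."""
    (a, b), (c, d) = table_2x2
    return (a * d) / (b * c)

# Hypothetical counts: full time / part time crossed with one pair of
# hours-worked categories, in the donor archive and in the synthetic archive.
tus_counts       = [[600, 300], [80, 15]]
synthetic_counts = [[580, 310], [70, 30]]

print(round(odds_ratio(tus_counts), 2))        # e.g. 0.38: association in the original archive
print(round(odds_ratio(synthetic_counts), 2))  # e.g. 0.80: weaker (closer to 1) after matching
```

A ratio drifting towards 1 in the synthetic archive signals that the association between the two variables has been attenuated by the matching, which is exactly the pattern discussed above for the “Working hours” variable.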
Bibliography
Sacco, G., 2008. SAMWIN: a software for statistical matching. Available on the CENEX-ISAD webpage (http://cenex-isad.istat.it), under public area/documents/technical reports and documentation.
5. Results of the survey on the use and/or development of integration methodologies in the different ESS countries
5.1. Introduction
Mauro Scanu (ISTAT)
One of the objectives of the CENEX-ISAD is the investigation of the state of the art on the use and/or development of integration methodologies in the different ESS countries. In order to tackle this task, it was necessary to take into consideration the following facts:
1. usually the NSIs do not have a centralised office whose aim is the development and/or the application of integration methodologies;
2. the area “integration” is rather large, including tasks such as metadata management (for the harmonization of the different sources to integrate) and imputation methodologies;
3. the use of integration procedures is justified by informative needs that may occur in the most diverse situations: most of the projects on integration are directly conducted by survey managers on households or enterprises.
The previous aspects did not allow the definition of a list of people in the NSIs to contact in order to understand the state of the art of the use and/or development of integration methodologies in the ESS. For this reason, we decided to investigate a more difficult target population: the population of the projects that involve the integration of two or more sources. The ESS consists of European countries, 34 of which (the 27 member countries of the EU plus Switzerland, Norway, Iceland, Turkey, Croatia, Macedonia and Liechtenstein) have been contacted. In a limited period of time (one month and a half) we received answers from 21 NSIs, 19 of which have at least one project concerning the integration of two or more sources (Greece and Macedonia declared that no integration projects are currently active).

Country | Number of projects
Italy | 5
Czech Republic | 4
Austria | 3
Spain | 3
UK | 3
France | 3
Netherlands | 2
Finland | 2
Switzerland | 2
Malta | 1
Romania | 1
Hungary | 1
Sweden | 1
Cyprus | 1
Latvia | 1
Slovenia | 1
Denmark | 1
Belgium | 1
Germany | 1

We received 37 filled-in questionnaires (projects). The previous table illustrates the number of projects of the respondents by country. Each questionnaire focuses on only one project involving the integration of two or more sources. For each project, the following topics are investigated in detail: a) objective of the integration process and characteristics of the files to integrate; b) privacy issues; c) problems of the integration process and methods; d) software issues; e) documentation on the integration project; f) planned changes; g) possibility to establish a link of experts. In the following, each topic is illustrated in a paragraph with a description of the 37 filled-in questionnaires.
5.2. Objective of the integration process and characteristics of the files to integrate
Luis Esteban Barbado (INE)
This part of the questionnaire allowed the creation of a list of experts on the application of methodologies whose aim is the integration of sample surveys and archives.
This list is composed of 44 names (for some projects, more than one name was available). The other questions in this part of the questionnaire aim at describing the main characteristics of the data integration project and of the sources used for the project. The main features of the observed projects are the following: the integration process turns out to be useful especially for the construction of archives and registers; the projects are generally multipurpose, although the main objective is the reduction of the response burden; apart from some projects, they make use of not more than 10 sources; finally, it is inevitable to make use of sources which are not managed by the statistical office, although this leads to many problems in terms of harmonization and quality of the collected data.
Question 5 – Main area of interest of the data integration activity
Of all the projects observed in this survey, 41% can be classified as activities to improve the statistical infrastructure (censuses or registers), and some 32% are devoted to activities related to statistical production, where both social and economic surveys are included.

Main area of interest | Number of projects
To carry out a business survey | 6
To carry out a social survey | 6
To produce a population or housing census | 6
To produce a business census | 1
To produce a population archive/register | 1
To produce a business archive/register | 7
Other | 10

Question 6 – Objectives of the data integration activity
The category "other" represents 27% of all responses. Apart from the existence of some projects with several areas of interest, there are also outstanding projects which aim to give an answer to areas not included in the questionnaire list, for example National Accounts. Generally, they can be classified as projects devoted to satisfying specific needs of statistical production in a wide variety of areas of interest. Concerning the objectives, a total of 117 responses have been collected (see the table below), which means an average of a bit more than 3 objectives per project. Generally all of the options, with some exceptions (microsimulation policies), are well represented, and they are consistent with the current challenges and problems at Statistical Offices: how to combine the need to rationalize costs and information demands with the growing commitment to the provision of more and better quality information for users. An analysis by project typology allows different priorities for the objectives to be detected. For projects related to statistical production, the aims of reducing costs and response burden, enhancement of editing and imputation processes, and improvement of estimation methods are the most outstanding; they were indicated for about 70% of all such projects. However, and as was to be foreseen, infrastructure projects are more connected to objectives related to maintaining or setting up a sampling frame, and 41% of all projects indicated this objective. Finally, regarding projects included in the category "other", and given their nature, a much more uniform distribution of objectives is detected, since none of the options represents more than 20% of the total; improvement of estimation methods (19%) and reduction of costs and response burden (15%) are the most frequent.
Objectives | Number of responses
Traditional census activities (for instance post-enumeration surveys) | 9
Reduction of costs and response burden | 23
Enhancement of editing and imputation processes | 11
Analysis of statistical relations | 16
Microsimulation policies | 4
Register/archive maintenance | 14
Improvement of estimation methods (weighting, imputation, small area estimators) | 18
Set up of a sampling frame (for instance improvement of coverage) | 17
Other | 5

Question 7 – How many other sources have been used in the project?
Question 8 – For the main sources used for the data integration project (maximum 14), specify the following details: source name and nature
As a whole, the 37 observed projects use a total of 327 sources, which means an average of about 9 sources per project. There are 10 projects which use more than 10 sources, and one of them uses more than 40 sources. Taking into account some constraints of the questionnaire (only a maximum of 14 sources per project can be documented), question 8 allows a basic typology of the sources used to be set up. The distribution according to this criterion shows a clear predominance of archive/register sources, which represent about 64% of the total. The sample sources are the second most important group, with 20% of the total. Among the sources of the first group, the intensive use of sources managed by Tax or Social Security authorities is remarkable, whatever project is considered. It is also possible to detect the different relevance of each typology of source according to the field of the project, as shown in the table below.

Main area of interest | Archive/register | Sample | Census | Other
To carry out a business / social survey | 58,8% | 28,0% | 5,9% | 7,3%
To produce a population / business census | 79,7% | 6,1% | 12,2% | 2,0%
To produce a population / business register | 78,7% | 17,0% | 0,0% | 4,3%
Other | 49,4% | 24,7% | 2,5% | 23,4%
Total | 63,7% | 20,4% | 4,9% | 11,0%

Question 9 – Following the sources order given in question 8, answer the following questions about each source: number of units (approximate figures are acceptable); whether the source is managed by your institute or not; whether the source is well documented; whether each record of the source is integrated with other records of other files, or whether the source is used only in order to compute aggregated values (totals, frequencies) to be used in the integration process
The information collected in this question allows some additional conclusions. With regard to the managing institution, the survey shows that 42% of the sources are managed by the statistical office itself. This rate grows to 45% for projects related to statistical production, where obviously sample sources play an important role. On the contrary, access to external sources is common practice in the management of business censuses or registers: for these projects, external sources represent, as a whole, 84% of the total. The quality level of the documentation available for each source must be understood as a subjective perception of each informant, since a priori parameters to quantify this concept have not been set. Broadly, the quality of the documentation seems satisfactory enough for the users of the sources: 62% of them rated it as good, and 28% as excellent. Only for projects related to the production of economic surveys is a certain degree of disappointment noticed, since 26% of the sources show a poor quality level. The third criterion to be analysed was the use of the data source, either at the unit level or as aggregate data.
From the outset, it is noticeable that the two procedures are not mutually exclusive, as the responses show. About 88% of all sources are used only at the unit level; 8% of them are used jointly, that is to say, both basic units and aggregates obtained from the source are used. This case occurred solely in projects whose main area of interest was classified as "Other". Finally, only 4% of the sources were used solely as aggregate data. As far as the number of units in each source is concerned, it should be mentioned that this item shows partial non-response or non-numerical information in some questionnaires. Nor is it possible to know whether each source is used in full or only partially, which makes a detailed analysis more difficult. As was to be expected, and taking these constraints into account, the responses show that the sources with the largest numbers of individual records are found among projects related to social statistics or population registers.

5.3. Privacy issues
Eric Schulte Nordholt (CBS)

In this section the results on the privacy issues in the questionnaire are discussed. It is interesting to start the discussion with a comparison with the legal section of the inventory on Statistical Disclosure Control that was filled in last year by 25 countries in the CENEX on SDC. Most of these countries consider the legal protection of their data to be very important. It makes no difference in importance whether these data concern natural persons or enterprises. Most countries pay attention to the legislative and administrative aspects of confidentiality. Most countries also pay attention, often or very often, to the mathematical and computing aspects of confidentiality, as well as to the organisational aspects. Most countries answered that they have a data protection law. France introduced a data protection law (relating to the protection of statistical data) in 1978; the other countries participating in the inventory introduced or changed such a law in the last fifteen years. Most countries answered that they have principles and laws on public access to government information relating to the protection of statistical data. Most countries have specific regulations on statistical confidentiality. Internal regulations of the statistical office on statistical confidentiality have existed only recently or were changed over the last couple of years in most statistical offices. Different countries have different definitions of confidential data. Some countries mention the aspect that data should be (directly) identifiable in order to consider them confidential. In some countries the statistical law only refers to personal data. In some countries the statistical law does not mention the concept of confidential data at all, and sometimes different definitions are used depending on the context. In the statistical laws of some countries a reference can be found to the implementation of EU legislation. Most countries have no statistics-specific rules for the release of confidential data. In almost all countries almost all enterprise data are considered confidential. Most countries have no special rules that apply to the transmission of data to Eurostat. Most staff members of statistical institutes in European countries have to sign confidentiality warrants. Penalties can be imposed for (intentional) breaches of statistical confidentiality. Only two statistical agencies answered that they very often conduct assessments of public attitudes, perceptions and reactions to confidentiality.
A majority of the statistical institutes in this inventory use registers, such as the population register and the business register, very often. Specific confidentiality rules concerning the use of these register data for statistical purposes are applied in some countries. In most offices a certain number of staff members have been made responsible for ensuring statistical data confidentiality. In almost all countries universities (and research centres) have the option to use individual data concerning natural persons for research purposes. In the majority of the countries this option also exists for individual data concerning enterprises. Business organisations, fiscal authorities and marketing organisations can in general not get access to individual data. It is remarkable, however, that legal authorities such as the police can get access to individual data in a considerable minority of the countries. Finally, for other governmental organisations the picture is somewhat mixed: in about half of the countries these organisations can get access to individual data. In a majority of the countries a review panel (e.g. an ethical or statistical committee) exists to judge whether statistical data are sufficiently safe for use by persons outside the statistical office. Most of these review panels are internal committees. In a majority of the countries respondents can authorise the agency to provide their own individual data to a specified third party (informed consent). In most countries the variables on racial or ethnic origin, political opinions and religious or philosophical beliefs were considered sensitive. Data concerning health and sex life were also considered sensitive in most countries. In addition, in a majority of the countries trade union membership, data relating to offences, criminal convictions and security measures, and data related to incomes were seen as sensitive. Data about professions and educational data were considered sensitive in a minority of the countries only. Special licensing agreements exist in a minority of the countries. Access under contract for named researchers exists in a majority of the countries. About half of the countries have the option of access only for specially sworn employees. Also, about half of the countries screen the results with respect to disclosure control, but only a minority screen the users. For statistics on persons and administrative data the option of access only in a controlled environment does not exist in the majority of the statistical institutes, but for statistics on enterprises there is a small majority where this option exists. The option of trusted third parties that keep the keys necessary for identification hardly exists in Europe. Most countries release microdata concerning natural persons and enterprises. However, Public Use Files (PUFs) exist only in a minority of the countries, although the majority of the countries release Microdata Under Contract (MUCs), though not on administrative data. Synthetic data files are hardly produced in Europe, and the majority of the countries also have no on-site facility or online access option. Many offices have organisational, methodological and software problems concerning the development of statistical confidentiality measures. Most countries do not receive technical assistance from other countries to help with the implementation of disclosure control. However, many countries would like to receive help to solve, in particular, their software problems.
Now we move to the two questions asked in this CENEX on ISAD concerning confidentiality. It is apparent that there are legal issues connected with the use of administrative data in almost all countries, and that each country is characterized by different regulations and restrictions on the use of microdata.

Question 10. Is there a legal foundation which regulates the supply of administrative data and its usage by the NSI?
- Yes: 32
- No: 5

For 32 (86%) of the 37 data integration projects there is a legal foundation which regulates the supply of administrative data and its usage by the NSI.

Question 11. In which way do regulations and laws on privacy affect the data integration project? (Please put an 'X' in one or more cells; multiple answers are allowed.)
- Unit identifiers have to be cancelled in some data sets: 13
- Some data sets can provide only aggregate data: 7
- One or more archives/samples useful for integration goals cannot be used: 5
- Linking of some groups of administrative data is prohibited by law: 4
- Other: 8

None of the four problems identified as answer categories in the questionnaire was mentioned for a majority of the data integration projects. For 8 (22%) of the 37 projects the option 'Other' was chosen, and for those eight data integration projects it was specified what other problem could affect the project. In one of the Austrian projects the identifiers in each source were (or will be) replaced by a protected PIN, produced by the data protection office for each source on the basis of name, date of birth and birth place via the central population register (a schematic sketch of such a key derivation is given at the end of this question's discussion); with this approach, data twins and persons who cannot be found remain a problem. In another Austrian project it was stated that the Austrian federal law ("Bundesstatistikgesetz") obliges Statistics Austria to use administrative data as main sources. In the Census project of the Czech Statistical Office the possibilities will be clear after acceptance of the Census Act. In one of the UK projects it was mentioned that the ONS has developed statistical disclosure control methods specific to the outputs of research facilities. The Finnish Statistics Act states that, whenever possible, administrative data must be used; the data received from administrative records for statistical purposes are confidential. According to the Statistics Act, data collected may be released for purposes of scientific research and statistical surveys concerning societal conditions in anonymised form (e.g. EU-SILC microdata is transmitted to and distributed by Eurostat). In one of the ISTAT questionnaires it was mentioned that there were no problems with the sources currently used; data quality on foreign control could be improved by using additional sources maintained by the Bank of Italy, but this is prevented by legal problems (as the Bank of Italy is not part of the Italian statistical system). In another ISTAT questionnaire it remained to be seen to what extent the required information will be available at unit level, according to Italian regulations on confidentiality. Finally, a common identifier is missing for all administrative data at Destatis, the German Statistical Office.
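As an illustration only (the questionnaire does not describe the actual Austrian procedure in technical detail), a protected linkage key of the kind mentioned above can be sketched as a keyed hash of the identifying variables, so that the same person receives the same pseudonym in every source while names, dates of birth and birth places themselves are not propagated. The secret key, field names and data below are invented for the example.

```python
import hashlib
import hmac

# Secret key held by the party that produces the protected PINs
# (an invented value for this sketch).
SECRET_KEY = b"example-secret-held-by-data-protection-office"

def protected_pin(name, date_of_birth, birth_place):
    """Derive a pseudonymous linkage key from identifying variables."""
    # Light standardisation so that trivial differences in spacing or case
    # do not produce different PINs; real systems apply far more cleaning.
    message = "|".join(
        part.strip().upper() for part in (name, date_of_birth, birth_place)
    ).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

print(protected_pin("Maria Huber", "1980-02-17", "Wien"))
```

Exactly as the respondent notes, such a scheme still leaves "data twins" (different persons with identical identifying variables) and persons who cannot be found in the central register as open problems.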
5.4. Problems of the integration process and methods
Nicoletta Cibella and Tiziana Tuoto (Istat)

This part of the questionnaire investigates the main methodological aspects of the data integration projects. It turns out that almost all the data integration projects must deal with the problem of harmonising the different sources. Probabilistic record linkage methods as well as statistical matching methods are still seldom used, although there are projects that apply all the methodologies. Finally, the last questions of this section investigate if and how quality evaluations are performed.

Question 12. Which problems affect the data integration process? (Please put an 'X' in one or more cells; multiple answers are allowed.)
- Harmonization of statistical units: 27
- Harmonization of reference periods: 18
- Completion of coverage: 27
- Harmonization of variables definitions: 28
- Harmonization of classifications: 17
- Parsing (see footnote 20): 9
- Adjustment for measurement errors: 11
- Imputation for item non-response: 19
- Derivation of new variables: 7
- Check the overall consistency: 18
- Other: 2 (harmonization of editing procedures: 1; construction of a key for the linkage: 1)
- No response: 1

Footnote 20: Parsing divides a free-form name field into a common set of components that can be compared. Parsing algorithms often use hints based on words that have been standardised. This approach is usually used for comparing person names and surnames, street names, city names (see Section 20.3 of Winkler, W.E. (1995). Matching and Record Linkage. In Business Survey Methods (Cox B.G., Binder D.A., Chinnappa B.N., Christianson A., Colledge M., Kott P.S. (eds.)), pp. 355-384. Wiley, New York).

The previous table covers almost all the collected questionnaires (some of the collected integration projects are still under development, and their characteristics could not yet be clearly defined). Generally speaking, the activities regarding the harmonization phase (harmonization of statistical units, of reference periods, of variables definitions, of classifications, completion of coverage, parsing, derivation of new variables, harmonization of editing procedures) seem to be the most demanding ones, involving 133 answers; on the other side, problems connected with the presence of non-sampling errors (adjustment for measurement errors, imputation for item non-response, check of the overall consistency) are less relevant, occurring 48 times.

Question 13. Main method used.
- Exact record linkage (linking records corresponding to the same unit from two data sources by merging identifying variables): 32
- Probabilistic record linkage (e.g. Fellegi-Sunter approach; linking records corresponding to the same unit from two data sources by probabilistic methods): 6
- Statistical matching (linking records corresponding to 'similar units' from two distinct sample surveys): 5
- Other data integration procedures: 3

By far the most common approach in the observed projects is the application of exact record linkage procedures. Record linkage procedures are considered in 32 situations, while statistical matching is applied in 5 cases. In one case, imputation has been used in order to integrate data (this approach can be associated with the statistical matching procedures). Actually, some projects report the presence of more than one main integration method (for this reason, the total of the previous table is more than 37). This is possible especially when the integration phase has to combine a large number of data sources with different characteristics; a minimal illustration of an exact linkage step is sketched below.
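To make the dominant approach concrete, the following is a minimal sketch of an exact record linkage step, assuming pandas and entirely invented identifiers and variables; it simply merges two sources on a shared key and sets aside the records that the automatic exact procedure cannot resolve.

```python
import pandas as pd

# Two illustrative sources sharing a unit identifier ("tax_number").
survey = pd.DataFrame({
    "tax_number": ["A001", "A002", "A003"],
    "turnover":   [120, 340, 55],
})
register = pd.DataFrame({
    "tax_number": ["A001", "A003", "A004"],
    "nace_code":  ["C10", "G47", "F41"],
})

# Exact record linkage: records are linked if and only if the key agrees exactly.
linked = survey.merge(register, on="tax_number", how="inner")

# Units not resolved by the automatic exact procedure.
unlinked = survey[~survey["tax_number"].isin(register["tax_number"])]

print(linked)
print(unlinked)
```

In practice, as the projects report, such unresolved cases are then checked and assigned manually or passed to a probabilistic procedure.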
In the following table, the number of projects that make use of more than one integration method is shown.

Combination of methods (number of projects):
- Exact record linkage + other data integration procedures (imputation): 2
- Exact record linkage + probabilistic record linkage: 3
- Exact record linkage + statistical matching: 3
- Exact record linkage + probabilistic record linkage + statistical matching: 2
- Exact record linkage + probabilistic record linkage + statistical matching + other data integration procedures: 1

From the joint analysis of questions 12 and 13, no evidence appears of a dependence between the problems affecting the data integration process and the method implemented. In other words, the issues concerning the harmonization phase or the presence of non-sampling errors seem to affect exact record linkage, probabilistic record linkage, statistical matching and other data integration procedures in a similar way. Since almost all the answers are concentrated on the exact record linkage method, the comparison of the two classes of problems among different methods cannot be investigated deeply.

Problems affecting the projects, by integration method (number of projects):
- Harmonization phase: exact record linkage 30, probabilistic record linkage 6, statistical matching 3, other 4
- Non-sampling errors: exact record linkage 20, probabilistic record linkage 4, statistical matching 2, other 3

In the next table the number of problems which affect the data integration projects is reported, broken down by the adopted integration method. During the data integration process more than one type of problem usually arose. Only two projects declared no problem at all in integrating different sources, while the majority of the processes faced more than 5 types of problems simultaneously.

Number of problem types arising simultaneously, by method (exact record linkage, probabilistic record linkage, statistical matching): no error affects the process; only 1 type of error; 2 types of error; 3 types of error; 4 types of error; 5 types of error; 6 types of error; 7 types of error; 8 types of error; 9 types of error; 10 or more types of error; other – 2 2 2 5 5 2 2 3 1 2 1 2 2 2 1 1 1 1 1.

Question 14. Brief description of the method used.

Regarding the exact record linkage procedures, in many projects the respondents highlight mostly the variable or variables used for merging the units. The most frequent merging variable is a single identification code. In business areas the code is a VAT registration number (or a tax number) or the enterprise number, whereas in socio-demographic fields the personal identification number is obtained on the basis of a single variable or of a combination of variables (name, place and date of birth). In addition, in a few cases the integration phase is preceded by specific quality and editing controls (mainly on the identifiers) so as to determine the best matching variables. Sometimes the missing and/or erroneous values in the key variables are imputed on the basis of time series or from other external sources. Finally, for the cases not resolved by the automatic exact record linkage procedure, the links are checked and assigned manually. Only a few questionnaires describe the probabilistic record linkage method; in this context the link probability weights are based on the level of agreement between the matching variables. The statistical matching methods are performed via logistic and regression models or via a random selection of donors from classes determined by stratification and matching variables; proxy variables are used to avoid the conditional independence assumption. Another data integration procedure consists of the imputation of unknown information on a small scale.
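The agreement-based weights mentioned for the probabilistic record linkage projects can be illustrated with a minimal Fellegi-Sunter-style sketch: each matching variable contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m and u are the probabilities of agreement among true matches and among non-matches. The variables and the m and u values below are purely illustrative assumptions, not figures from any of the reported projects.

```python
import math

# Illustrative m- and u-probabilities for three matching variables:
# m = P(agreement | records belong to the same unit)
# u = P(agreement | records belong to different units)
M_U = {
    "surname":       (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "municipality":  (0.90, 0.05),
}

def link_weight(record_a, record_b):
    """Sum of log2 likelihood ratios over the matching variables."""
    weight = 0.0
    for var, (m, u) in M_U.items():
        if record_a.get(var) == record_b.get(var):
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight

a = {"surname": "ROSSI", "date_of_birth": "1970-05-12", "municipality": "Roma"}
b = {"surname": "ROSSI", "date_of_birth": "1970-05-12", "municipality": "Milano"}
print(link_weight(a, b))  # high positive weight -> candidate link
```

Pairs whose total weight exceeds an upper threshold are accepted as links, pairs below a lower threshold are rejected, and the pairs in between are typically sent to clerical review.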
Question 15. The quality assessment of the results of the data integration process consists of the following tools (please put an 'X' in one or more cells; multiple answers are allowed).
- Quality indicator of the integration process: 20
- Comparisons with respect to previous experiences: 23
- Published reports (available also for people outside the institute): 12
- Other: 2

Some questionnaires present missing values. Other tools for assessing the quality of the data integration procedures are:
- the comparison of the results obtained with external sources, in particular by comparing register data with survey data at the individual level using identification numbers;
- the analysis of the percentages of unmatched individuals compared with the same data aggregated from other surveys.

Question 16. Describe briefly the methods used for evaluating the data integration process. If one or more quality indicators are used, please mention their definition.

The evaluation of the data integration procedures is performed mostly by means of external sources, of experts' analyses or of previous experiences. One project is validated by clerical review because of the small amount of data involved. The use of a re-matching procedure via the double independence strategy gives the level of quality in only one project, whereas in others some of the quality indicators used are:
- the geographical and sectorial coverage;
- the number of linked/non-linked records;
- the imputation fraction in case of a failed link;
- the quality of the link when it succeeded.

In only one questionnaire is the validation of the procedure carried out by studying the statistical properties of the linked data, in detail the impact of the weights, partial and whole regression, model re-specification and the clerical review of outliers and anomalous values. In some cases the quality indicators are still being considered because of the early development of the projects.
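As a concrete reading of two of the indicators listed above (the number of linked/non-linked records and the imputation fraction in case of a failed link), the following sketch computes them from invented counts; the exact definitions used in the individual projects are not reported in the questionnaires, so these formulas are only one plausible interpretation.

```python
def linkage_quality_indicators(n_source, n_linked, n_imputed_after_failed_link):
    """Simple descriptive indicators of a linkage step (illustrative definitions)."""
    n_unlinked = n_source - n_linked
    return {
        "linked": n_linked,
        "non_linked": n_unlinked,
        "link_rate": n_linked / n_source,
        "imputation_fraction_failed_links": (
            n_imputed_after_failed_link / n_unlinked if n_unlinked else 0.0
        ),
    }

# Invented figures: 10,000 survey records, 9,200 linked to the register,
# 600 of the 800 unlinked records completed by imputation.
print(linkage_quality_indicators(10_000, 9_200, 600))
```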
5.5. Software issues
Ondrej Vozár (CZSO)

This section describes the answers on the software issues related to the collected integration processes. It is apparent that these projects generally make use of internally developed software, which can also be reused on other occasions (generalized software). A brief overview of the software tools used for the projects is given, as well as of the documentation on internally developed software tools.

Question 17. Has a generalized software been used?

Number of projects using generalized software:
- Yes: 20 (54.1%)
- No: 15 (40.5%)
- N.A.: 2 (5.4%)
- Total: 37 (100.0%)

More than half of the projects use generalized software (20 of 37 cases); the missing answers belong to planned projects not yet started.

Question 18. If yes, write if any of the following software characteristics is available (please put an 'X' in one or more cells; multiple answers are allowed).

Software used, by characteristics:
- Free software: 0 (0.0%)
- Open source software: 3 (8.1%)
- Internally developed software: 26 (70.3%)
- N.A.: 8 (21.6%)

Mostly internally developed software is used (26 of 37 cases). The missing answers belong mostly to planned and not yet started projects and to projects without generalized software (mostly solved by ad hoc SQL queries and the like).

Question 19. With reference to the software in question 17, please describe briefly other characteristics of the software (name, main characteristics, what phase(s) of the data integration project the software deals with, programming language(s), hardware and software platforms and requirements, operating system).

A broad range of software is used (mostly databases and statistical software). Commercial statistical software tools are utilised in 12 projects. The most used is SAS (10 cases); the remaining tools are each used in just 1 project: SPSS, GAUSS and STATA. Database software is applied in 12 projects: Oracle (5 cases), MS SQL Server (2 cases), FoxPro (1 case), ADABASE (1 case), MS ACCESS 2003 (1 case) and SYBASE (1 case). The combination of SAS and Oracle occurs in the following projects:
i. Database of Social Statistics (Statistics Slovenia) – linking the population register and administrative social survey data (to improve social statistics surveys, the population register and domain estimates);
ii. Compilation of National Accounts data (Statistics Hungary) – linking of both quarterly and yearly individual data, survey aggregates and macro aggregates.
The special software BLAISE is applied in the Social Statistics Database (Statistics Netherlands), where different registers and social survey data are linked by exact record linkage. A probabilistic linkage method is implemented by means of internally developed software, CAMS (Computer Assisted Matching System, in Visual C++). The software was developed for the needs of the UK 2001 One Number Census. It enables matching records using a combination of automatic, probabilistic and clerical matching. It is run on a Windows desktop PC with access over a secure network to the central Census Sybase databases. Special methods and software for matching addresses are also used. The open source software URBIS SPW is applied to link addresses among different registers in Statistics Belgium's Microcensus 2006 project. In the Statistics Austria Business Register Maintenance project, data from different sources are linked by names and addresses with the bigram method, implemented both in PL/I (for batch jobs on the IBM host) and in SAS (both IBM host and PC under Windows XP).
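The bigram method mentioned for the Austrian project can be illustrated with a generic sketch based on the Dice coefficient over character bigrams; the actual PL/I and SAS implementations are not documented here, and the names and addresses below are invented.

```python
def bigrams(text):
    """Set of character bigrams of an upper-cased, blank-collapsed string."""
    s = " ".join(text.upper().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient on character bigrams: 1.0 = identical, 0.0 = disjoint."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# Slightly different spellings of the same address.
print(bigram_similarity("Muellergasse 12, Wien", "Müllergasse 12 Wien"))
```

Two strings spelled slightly differently still share most of their bigrams, so the similarity stays high and the pair can be proposed as a candidate link even when names and addresses do not agree exactly.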
Question 20. With reference to the software in question 17, if it is proprietary and internally developed, please let us know if you are able to provide documentation regarding proprietary software tools or algorithms / source codes.

Documentation on proprietary software is available in 8 projects (including 3 projects with source code available). Projects where both documentation and source code of the proprietary software are available:
1. Statistics Italy, Non Cash Pension Benefits;
2. Statistics Italy, Social Accounting Matrix;
3. Statistics Switzerland, New Concept for the Federal Population Census (from 2010).
Projects where only documentation of the proprietary software is available:
1. Statistics Denmark, Improvement of Quality of Social and Business Surveys;
2. Statistics Slovenia, Database of Social Statistics;
3. Statistics Netherlands, Social Statistical Database;
4. Statistics Hungary, Compilation of National Accounts;
5. Czech Republic, RegCensus (linking of administrative and survey data in demographic statistics – in preparation).

5.6. Documentation on the integration project
Nicoletta Cibella and Tiziana Tuoto (Istat)

This set of questions investigates whether there is any documentation on the integration projects, in terms of methodology and results.

Question 21. The institution/unit/subunit where I work has produced documentation (technical reports, articles in journals, manuals, ...) on the implemented methodologies.
- Yes: 23
- No: 13
- In progress: 1

Among the 23 projects for which documentation is available, more than half (13, to be exact) provide documents in English.

Question 22. If yes, provide not more than three main references on the implemented methodologies. Please specify the language, and if a main reference is not in English answer the question whether a translation (of the abstract) is available.

The documentation available in English can be subdivided into three main topics:
1. General papers, which regard the whole production of statistical information, not only the integration process. In some projects the reported documentation is strictly linked to the imputation and data editing procedures;
2. Documentation for an ad hoc project, describing only some particular aspects that arose and/or some solutions implemented for the specific project considered;
3. Official documentation, written to meet the requirements of Eurostat or of other national or international statistical offices.

- General papers: 8
- Documentation for an ad hoc project: 2
- Official documentation: 5

One questionnaire provides a Handbook of Best Practices.

Question 23. The unit where I work bases the data integration activities on documentation (technical reports, articles in journals, manuals, ...) produced by other institutes/universities, as far as methodological issues are concerned.
Question 24. If yes, provide not more than three main references. Please specify the language, and if a main reference is not in English answer the question whether a translation (of the abstract) is available.

Only two projects answered that the unit where they work bases its data integration activities on documentation produced by other institutes/universities as far as methodological issues are concerned; they basically refer to regulations and manuals for the treatment of administrative data.

5.7. Possible changes
Alois Haslinger (Statistics Austria)

The following questions deal with the planned modifications of the integration projects, in terms of the implemented methodology and software tools. It is important to underline that more than 40% of the collected projects plan to improve some aspects of the integration process.

Question 25. Do you plan to modify the data integration process in the near future?
- Yes: 16
- No (including 'No answer'): 21

For 16 (43%) of the 37 data integration projects the respondents plan to modify the data integration process in the near future.

Question 26. If 'yes', which aspects do you plan to change (multiple answers allowed)?
- Methods: 9
- Software: 9
- Other: 6

This question addresses the aspects of the data integration process which are planned to be modified in the near future. There is nearly a balance between the three mentioned aspects: the methods will be changed in 9 projects, the software also in 9 projects and other aspects in 6 projects. Actually, some projects plan to change more than one aspect of the integration process (for this reason, the total of the previous table is more than 16).
In the following table, the projects that plan to change at least one aspect of the data integration process are tabulated according to the combination of aspects for which a change is planned:
- Only methods: 2
- Only software: 1
- Only other aspects: 4
- Methods + software: 6
- Software + other aspects: 1
- Methods + software + other aspects: 1

There are clearly two clusters of projects for which a change is planned. On the one hand, there are 6 projects for which both methods and software shall be changed; on the other hand, only other aspects shall be changed in 4 projects. Only methods shall be changed in 2 projects. All other combinations of changes are mentioned in at most one project.

Question 27. Please describe briefly the planned modifications.

For 15 projects details on the planned modifications have been provided. In Austria a further improvement and optimization of the data integration process for producing the SBS statistics is mentioned. At the moment the methods and software for the Test Census 2006 are being evaluated; it is not yet clear which changes are needed for the register-based Census 2011. The Czech Republic tries to include more variables from the income tax return and monthly data of the social insurance in its data integration projects. Italy tries to improve all steps in business demography, especially to create generalized software for that project. The feasibility of the longitudinal use of a sample survey integrated with data on causes of death and hospitalization shall be evaluated further by using a broader database for Italy as a whole and more recent years. Better and more administrative data from the State Revenue Service will be used for imputation in the SBS of Latvia; the administrative data shall be analyzed more deeply. Hungary plans to establish a joint database for SBS and taxation, to eliminate duplicate data publication. In the UK the 2001 Census was linked to a large post-enumeration survey and the results compared with demographic estimates and other aggregate administrative data; for the 2011 Census much the same process is planned, but with more sources of data, and the software may be expanded to allow integration of these other sources. IT developments in the NHS will change the tracing methods and software used for the ONS Longitudinal Study. The Virtual Microdata Laboratory team has begun investigating the unit-record linking of social and personal data; the project is a pilot and is expected to develop further the data fusion methods used before. In Switzerland at least 3 administrative sources are linked with the business register; it is planned to improve the record linkage methods, the construction of databases and the data treatment functionalities. For the 2010 register census a modification of data collection, data integration, pooling technologies, small area estimation and the collation of information from separate registers is planned. Germany plans an enlargement of the sources used for business register maintenance and the development of new software.

5.8. Possibility to establish links between experts
Alois Haslinger (Statistics Austria)

The following questions investigate the opportunity to establish a connection between all the people interested in some aspects of the integration projects. It turns out that there is great interest in establishing a connection on methodological issues.
Question 28. Do you believe that the work of a committee/group of experts could provide a useful external support for your current activities?
- Yes: 28
- No (including 'No answer'): 9

The majority of project leaders (28 of 37, or 76%) would appreciate the work of a group of experts on data integration and expect support from that group for their own activities.

Question 29. If 'yes', which aspects should the committee/group coordinate (multiple answers allowed)? If 'Other', please specify.

Main focus of the work of the expert group:
- Methodological aspects: 27
- Software developments: 13
- Other: 3

This question addresses the aspects of work which a committee/group of experts should coordinate. Among the specified response categories, 'methodological aspects' was marked most frequently (27 times), followed by 'software developments' (13 entries). Only 3 projects also wanted other aspects as a topic for the work of the expert group. Some respondents wanted the expert group to focus on more than one aspect (for this reason, the total of the previous table is more than 28). In the following table, the entries in question 29 are tabulated according to the combination of aspects on which the group of experts should focus:
- Only methods: 13
- Only other aspects: 1
- Methods + software: 12
- Methods + other aspects: 1
- Methods + software + other aspects: 1

For 13 projects the expert group should concentrate only on methodological aspects; nearly as many (12) opt for methods and software. All other combinations of work aspects are mentioned in at most one project. Only in 3 projects are proposals for other work aspects of a new expert group submitted. The Czech Republic puts forward quality assessment. From Finland comes the idea to review the effects of using different administrative data sources and of data integration on cross-country comparability. Spain remarks that several working groups and seminars have been created with the aim of discussing the best management procedures for the Business Registers in the European Union; a high-priority matter is the development of national techniques that allow the requirements of the Community Regulation to be fulfilled. The Recommendations Manual on Business Registers (harmonised methodology) is available for all countries.

Annex. Survey on the use and/or development of integration methodologies in the different ESS countries

The Annex includes the letter of invitation and the questionnaire disseminated to the ESS Member States, in collaboration with Eurostat.

In 2005 Eurostat launched the idea of establishing European Centres of Excellence in the field of statistics as a way to reinforce cooperation between National Statistical Institutes. In this way the various institutes in Europe could benefit from each other's experiences and together raise the level of their statistical production processes. This CENEX will be a one-and-a-half-year project, active from December 2006 to June 2008. The area of interest of this CENEX is the integration of surveys and administrative data (ISAD). An overview of the activities in this CENEX can be found at the CENEX website (http://cenex-isad.istat.it). One of the activities consists of providing an overview of the state of the art in the area of interest in the different ESS countries.
The objective of this overview is to identify possible areas in which convergence can be achieved and to develop a plan to meet common needs on tools and criteria for a harmonised treatment of data integration. In this context the CENEX on Methodology, area ISAD, has developed a questionnaire. The aim of the questionnaire is twofold: on the one hand, it investigates details on data integration projects; on the other hand, it aims to collect documents, software and tools on integration of surveys and administrative data. If possible, documentation (or details on how to download them) can be provided together with the filled in questionnaire. Documentation will be distributed on the CENEX-ISAD web site. The questionnaire should be filled in by the personnel responsible for the main projects on data integration within the institute (project manager or methodologist in charge of the project). The questionnaire is focused on only one project. If you are responsible for more than one project, please compile one questionnaire for each project. By data integration project we mean any project that implies the joint use of multiple sources of data for the data production process. The joint use of two or more sources has to include the following aspect: the link of records at the unit level (e.g. administrative data with records of another administrative source, a sample survey, a census, or a register/archive) for different objectives. These objectives include: construction and maintenance of registers/archives; enrichment of statistical surveys for improving coverage; enlargement of the set of variables; improvement of weighting, imputation, small area estimators; analyses of variables observed in two different data sources; construction of virtual censuses. A data integration project does not include the substitution of a survey with a single administrative source, or the simple collection of results from different sources (e.g. a statistical information system containing aggregated data computed distinctly from different sources). For additional terminological queries, please refer to the glossary at the end of this questionnaire. Replies should be sent to Mauro Scanu (E-mail: scanu@istat.it), project leader CENEX on Methodology, area ISAD, before 27 April 2007. After the processing we will publish a report of the results on the CENEX-ISAD website. We very much appreciate your cooperation. Mauro Scanu CENEX-ISAD project leader Istituto Nazionale di Statistica (ISTAT) Via Cesare Balbo 16 00184 Roma - ITALIA Email: scanu@istat.it Telephone: +39 06 46732887 Fax: +39 06 46732972 WP1 97 Contact person details 1) Name and surname: 2) Institute: 3) e-mail (to be included in a mailing list of experts) Data integration project details In the following part of the questionnaire, a single project on data integration is described in detail. 4) Please, describe briefly the project mentioning its name, if available 5) Main area of interest of the data integration activity (please put an ‘X’ in one cell only) Main area of interest To carry out a business survey To carry out a social survey To produce a population or housing census To produce a business census To produce a population archive/register To produce a business archive/register Other WP1 98 If ‘Other’, please specify: …………………………………………………………………………………………………………… …….. 
6) Objectives of the data integration activity (please put an ‘X’ in one or more cells; multiple answers are allowed) Objectives Traditional census activities (for instance post-enumeration surveys) Reduction of costs and response burden Enhancement of editing and imputation processes Analysis of statistical relations Microsimulation policies Register/archive maintenance Improvement of estimation methods (weighting, imputation, small area estimators) Set up of a sampling frame (for instance improvement of coverage) Other If ‘Other’, please specify: …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… ………… Sources used for the project 7) How many other sources have been used in the project? Total number of sources WP1 99 8) For the main sources used for the data integration project (maximum 14) specify the following details: source name and nature (please put an ‘X’ in the relevant cell to specify whether the source is an archive, a sample, a census or other) Source name Archive/ register Sample Census Other 1 2 3 4 5 6 7 8 9 10 11 12 13 14 If ‘Other’, please specify the kind of source …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… ……………… 9) Following the sources order given in question 8, answer the following questions about each source: number of units (approximate figures are acceptable); specify whether the source is managed by your institute or not; whether the source is well documented; whether WP1 100 each record of the source is integrated with other records of other files, or whether the source is used only in order to compute aggregated values (totals, frequencies) to be used in the integration process. Please put an ‘X’ in the relevant cells. Number of units Managed by your institute Yes No Quality of documentation Use of the data source Poor Unit level good 1 2 3 4 5 6 7 8 9 10 11 12 13 14 WP1 101 excellent Aggregated data Privacy issues 10) Is there a legal foundation which regulates the supply of administrative data and its usage by the NSI? Yes No 11) In which way regulations and laws on privacy affect the data integration project? (please put an ‘X’ in one or more cells; multiple answers are allowed) Problems Unit identifiers have to be cancelled in some data sets Some data sets can provide only aggregate data One or more archives/samples useful for integration goals can not be used Linking of some groups of administrative data is prohibited by law Other If ‘Other’, please specify: ………………………………………………………………………….……………………………… …………………………………………………………………………………………………………… ……………. WP1 102 Questions on the data integration process 12) Which problems affect the data integration process? (Please put an ‘X’ in one or more cells; multiple answers are allowed) Problems Harmonization of statistical units Harmonization of reference periods Completion of coverage Harmonization of variables definitions Harmonization of classifications Parsing23 Adjustment for measurement errors Imputation for item non-response Derivation of new variables Check the overall consistency Other If ‘Other’, please specify: …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… ………… 23 Parsing divides a free-form name field into a common set of components that can be compared. Parsing algorithms often use hints based on words that have been standardised. 
This approach is usually used for comparing person names and surnames, street names, city names (see Section 20.3 of Winkler, W.E. (1995). Matching and Record Linkage. In Business Survey Methods (Cox B.G., Binder D.A., Chinnappa B.N., Christianson A., Colledge M., Kott P.S. (eds.)), pp. 355-384. Wiley, New York). WP1 103 13) Main method used (please put an ‘X’ in only one cell) Method Exact record linkage (linking records corresponding to the same unit from two data sources by merging identifying variables) Probabilistic record linkage (e.g. Fellegi- Sunter approach; linking records corresponding to the same unit from two data sources by probabilistic methods) Statistical matching (linking records corresponding to ‘similar units’ from two distinct sample surveys) Other data integration procedures 14) Brief description of the method used …………………………………………………………………………………………………….…… …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… ………………… 15) The quality assessment of the results of the data integration process consists in the following tools (please put an ‘X’ in one or more cells; multiple answers are allowed) Tool Quality indicator of the integration process Comparisons with respect to previous experiences Published reports (available also for people outside the institute) Other If ‘Other’, please specify …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… WP1 104 …………………………………………………………………………………………………………… ……………… 16) Describe briefly the methods used for evaluating the data integration process. If one or more quality indicators are used, please mention their definition. …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… ……………… WP1 105 Software issues of the project 17) Has a generalized software been used? (Refer to the main software tool, i.e. to the software used for the main phase of the integration project) Yes No 18) If Yes, write if any of the following software characteristics is available (please put an ‘X’ in one or more cells; multiple answers are allowed) Software characteristics Free software Open source software Internally developed software 19) With reference to the software in question 17, please, describe briefly other characteristics of the software (name, main characteristics, what phase(s) of the data integration project does the software deal with, programming language(s), hardware and software platforms and requirements, operating system) …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… ……………… 20) With reference to the software in question 17, if it is proprietary and internally developed, please let us know if you are able to provide the following (please put an ‘X’ in one or more cells; multiple answers are allowed) Proprietary software Documentation regarding proprietary software tools Algorithms/source codes WP1 106 Please submit documentation and/or algorithms/source codes to Mauro Scanu (scanu@istat.it) with the appropriate contact details. Documentation and codes with the corresponding contact details will be made available on the webpage http://cenexisad.istat.it. WP1 107 Documentation on project methodologies 21) The institution/unit/subunit where I work has produced documentation (technical reports, articles in journals, manuals,…) on the implemented methodologies. 
No Yes 22) If Yes, provide not more than three main references on the implemented methodologies. Please specify the language, and if a main reference is not in English answer the question whether a translation (of the abstract) is available. a- ………………………………………………………………………………………………….. b- ………………………………………………………………………………………………….. c- ………………………………………………………………………………………………….. If possible, submit the documents to Mauro Scanu (scanu@istat.it) with the appropriate contact details. Documentation with the corresponding contact details will be made available on the webpage http://cenex-isad.istat.it. WP1 108 23) The unit where I work bases the data integration activities on documentation (technical reports, articles on journals, manuals,…) produced by other institutes/universities, as far as methodological issues are concerned. Yes No 24) If Yes, provide not more than three main references. Please specify the language, and if a main reference is not in English answer the question whether a translation (of the abstract) is available. a- ………………………………………………………………………………………………….. b- ………………………………………………………………………………………………….. c- ………………………………………………………………………………………………….. If possible, submit the documents to Mauro Scanu (scanu@istat.it). WP1 109 Possible changes 25) Do you plan to modify the data integration process in the near future? Yes No 26) If Yes, which aspects do you plan to change? (Please put an ‘X’ in one or more cells; multiple answers are allowed) Aspects to change Methods Software Other 27) Please, describe briefly the planned modifications. …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… …………………… WP1 110 Possibility to establish links between experts 28) Do you believe that the work of a committee/group of experts could provide a useful external support for your current activities? Yes No 29) If yes, which aspects should the committee/group coordinate? (Please put an ‘X’ in one or more cells; multiple answers are allowed) Aspects to include Methodological aspects Software developments Other If ‘Other’, please specify: …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… …………………………………………………………………………………………………………… ……………… WP1 111 GLOSSARY Completion of population (coverage) Completion of population arises when sources to be integrated present different coverage of the population of interest. Derivation of new variables Every single operation which aims to recode the original variables stored in the database into a new unambiguous one which is more appropriate for the data integration procedure. Harmonization of classifications Harmonization activities that deal with particular categorical variables whose response categories are used to group units according to a hierarchical structure (e.g. NACE). Transcoding tables are necessary when different classifications are used in different data sources. Harmonization of reference periods Activities to make data from different sources in terms of output refer to the same period or the same point in time. When combining different data sources, it is important to ascertain that data refer to the same period or the same point in time. In the case of administrative data a clear distinction should be made between the moment that a phenomenon occurs and the moment that this phenomenon is registered. Also survey data may not always refer to the same point in time. 
Harmonisation of reference periods can cover two different kinds of actions. Firstly, the dates in one or more sources may be incorrect. The starting date or the ending date of a given situation may not be correctly registered. For example, the date of emigration of a person is not always known or correct, because people do not report this to the population register. Another example is tax data. They usually refer to a whole year, although not all income is earned during the whole year (e.g. in the case of temporary jobs, like holiday jobs). In that case the dates should be adjusted in order to reflect the correct reference period or the correct average period. Secondly, the dates in the sources are correct, but the time periods do not match. In this case one needs to make some assumptions. For example, the occupation of a person is measured in a survey (e.g. the Labour Force Survey) in October of a given year, but in our census we need data on occupation on 1 January of the next year. In the case that the job of the surveyed person in question in October still exists on 1 January, we may assume that the information about the occupation of this person is also valid on 1 January next year. So, in this case the reference period of the survey (October) is harmonised with the reference period of the census (1 January). Source: private communication from Paul van der Laan (Statistics Netherlands). Harmonization of statistical units All those activities that transform the units of observation/measurement of two different data sources (related to the same target population) in order to derive units that share the same definition across the different data sources. Harmonization of variables definitions WP1 112 Activities needed to transform similar characteristics or attributes observed for the same units but in different data sources in order to derive variables that share the same definition across the different data sources and therefore can be directly compared. Imputation Imputation is the process used to resolve problems of missing, invalid or inconsistent responses identified during editing. This is done by changing some of the responses or missing values on the record being edited to ensure that a plausible, internally coherent record is created. Source: Statistics Canada Quality Guidelines, 3rd edition, October 1998, page 38; Working group on quality: Assessment of the quality in statistics: methodological documents Glossary, 6th meeting, October 2003, EUROSTAT. Item non-response Item non-response occurs either when a respondent provides some, but not all, of the requested information, or when the reported information is not usable. Source: FCSM, Subcommittee on Measuring and Reporting the Quality of Survey Data, Measuring and reporting sources of error in survey, Statistical Policy working paper 31; Working group on quality: Assessment of the quality in statistics: methodological documents Glossary, 6th meeting, October 2003, EUROSTAT. Measurement error Measurement error refers to error in survey responses arising from the method of data collection, the respondent, or the questionnaire (or other instrument). 
It includes the error in a survey response as a result of respondent confusion, ignorance, carelessness, or dishonesty; the error attributable to the interviewer, perhaps as a consequence of poor or inadequate training, prior expectations regarding respondents' responses, or deliberate errors; and error attributable to the wording of the questionnaire, the order or context in which the questions are presented, and the method used to obtain the responses. Source: Biemer P.P., Groves R.M., Lyberg L.E., Mathiowetz N.A., Sudman S. (1991) Measurement errors in survey. Wiley, New York, p. 760; Working group on quality: Assessment of the quality in statistics: methodological documents Glossary, 6th meeting, October 2003, EUROSTAT Microsimulation Microsimulation (also known as microanalytic simulation) is a modelling technique that operates at the level of individual units such as persons, households, vehicles or firms. Within the model each unit is represented by a record containing a unique identifier and a set of associated attributes – e.g. a list of persons with known age, sex, marital and employment status; or a list of vehicles with known origins, destinations and operational characteristics. A set of rules (transition probabilities) are then applied to these units leading to simulated changes in state and behaviour. These rules may be deterministic (probability = 1), such as WP1 113 changes in tax liability resulting from changes in tax regulations, or stochastic (probability <=1), such as chance of dying, marrying, giving birth or moving within a given time period. In either case the result is an estimate of the outcomes of applying these rules, possibly over many time steps, including both total overall aggregate change and, crucially, the distributional nature of any change. Source: http://www.microsimulation.org Parsing The process of parsing and standardisation of linking variables involves identifying the constituent parts of the linking variables and representing them in a common standard way through the use of look-up tables, lexicons and phonetic coding systems. Source: Statistics New Zealand (2006) Data integration manual, Statistics New Zealand publication, Wellington, August 2006, p. 40. Post enumeration survey A sample survey aimed to check the accuracy of coverage and/or response of another census or survey. Source: Statistics New Zealand (2006) Data integration manual, Statistics New Zealand publication, Wellington, August 2006. Record linkage Record linkage is the action of identifying records corresponding to the same entity from two or more data sources or finding duplicates within files. Entities of interest include individuals, companies, geographic region, families, or households. Record linkage is defined as exact (or deterministic) if a unique identifier or key of the entity of interest is available in the record fields of all of the data sources to be linked. The unique identifier is assumed error-free and there is no uncertainty in exact linkage results. A unique identifier might either be a single variable (for example: tax number, passport number or driver’s license number) or a combination of variables (such as name, date of birth and sex), as long as they are of sufficient quality to be used in combination to uniquely define a record. Record linkage is defined as probabilistic when there are errors or lacking information in the record identifiers. Source: Gu L., Baxter R., Vickers D., Rainsford D. (2003). 
“Record linkage: current practice and future directions”. CSIRO Mathematical and information sciences, Canberra Australia. Statistics New Zealand (2006) Data integration manual, Statistics New Zealand publication, Wellington, August 2006. Register Complete written record containing regular entries of items and details on particular set of objects. Administrative registers come from administrative sources and become statistical registers after passing through statistical processing in order to make it fit for statistical purposes (production of register based statistics, frame creation, etc.). Source: Daniel W. Gillman (Ed) Common terminology of METIS. Version of: 29 September, 1999, p. 7. WP1 114 Software, free Free software is software that comes with permission for anyone to use, copy, and distribute, either verbatim or with modifications, either gratis or for a fee. In particular, this means that source code must be available. Notice that free software may be used as synonymous of freeware software. Freeware software is software at no cost that typically permits redistribution but not modification (and the source code is not available). Source: http://www.gnu.org/philosophy/categories.html#FreeSoftware; http://www.fsf.org/ Software, generalized Generalized software is software developed for “horizontal” activities, namely activities that do not depend on a specific application domain, and thus on specific functional requirements. Software, internally developed Internally developed software is software whose code has been entirely developed by programmers internal to a specific organization. Software, open source Open source software is software released with an open source licence by the Open Source Initiative (OSI - http://www.opensource.org/index.php). The licence gives the right to use, copy, modify and distribute the software's original “source code”. Open source software is often used to mean more or less the same category as free software. However, it is not exactly the same class of software: some open source licenses are considered too restrictive with respect to free software licences, and there are free software licenses that are not accepted as open source ones. However, the differences are small: nearly all free software is open source, and nearly all open source software is free. Source: http://www.opensource.org/docs/definition.php; http://www.gnu.org/philosophy/categories.html#OpenSource Statistical matching Statistical matching, also known as data fusion or synthetical matching, aims to integrate two (or more) sample surveys characterized by the fact that (a) the units observed in the samples are different (disjoint sets of units); (b) some variables are commonly observed in the two surveys (matching variables). In order to distinguish statistical matching from record linkage, the former is also defined as procedure that links “similar” units from two sample surveys without any unit in common. (where similarity concerns the matching variables). WP1 115 Source: D’Orazio M., Di Zio M, Scanu M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester. Unit identifiers Any variable or set of variables that is structurally unique for each population unit (person, place, event or other unit). Source: Statistics New Zealand (2006). Data integration manual. Statistics New Zealand publication, Wellington, August 2006; OECD Glossary of statistical terms WP1 116