Confidentiality and Anonymization of Microdata 1 MOLLA HUNEGNAW STATISTICIAN AFRICAN CENTRE FOR STATISTICS ECASTATS.UNECA.ORG United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Contents 2 Microdata Microdata Confidentility Anonymization Anonymization techniques Anonymization tools Conclusion United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Microdata 3 The term microdata can refer to data about an individual, person, household, business or other entity. Microdata may be data collected by surveys, censuses or obtained from administrative records. Microdata are collected using valuable resources of a country. Microdata are usually of significance yet access to countries’ microdata is often limited by the fear that confidentiality protection cannot be guaranteed. Microdata must be used and reused. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Microdata 4 Microdata access obstacles: legal obstacles (in many countries the statistical legislation still in use is out dated and does not recognise the dissemination of electronic data particularly microdata technical and financial obstacles including in-house capacity to handle the complex aspects of microdata dissemination such as data anonymization, political obstacles, and psychological obstacles - the tendency to control access perhaps because of concerns over its mis-interpretation or because ‘data is power’. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Confidentiality 5 Confidentiality refers to limiting data access and disclosure to authorized users and preventing access by or disclosure to unauthorized ones Confidentiality is also related to the broader concept of data privacy -- limiting access to individuals’ personal information. United Nations Regional Seminar on Census Data Archiving for Africa Confidentiality 6 The sixth United Nations Fundamental Principle of Official Statistics about confidentiality. “Individual data collected by statistical agencies for statistical compilation, whether or not they refer to natural or legal persons, are to be strictly confidential and used exclusively for statistical purposes.” United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Confidentiality 7 How to make sure confidentiality Legal arrangements Legislation be in place before any microdata are released. Legally enforceable agreement may be one of the requirements of access. It should be possible to set up an arrangement where a prior agreement needs to be signed even where access to microdata is available online. Technical arrangements The risk of confidentiality can be reduced by the use of a number of techniques. The downside is that these techniques may reduce the usefulness of the underlying microdata. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Confidentiality 8 Technical arrangements Remote Access Facilities (RAFs) The key characteristic is that users do not have access to the microdata itself but tasks using that microdata can be submitted remotely over the Internet. Data laboratories /Safe sites Effective in controlling identification risk whilst enabling users access, particularly for data sets where release of a confidentialised microdata file is not possible. Lack of convenience to the user, including sometimes being forced to use unfamiliar data analysis software. They are also expensive for the NSO to manage. Anonymization techniques Removing or modifying the identifying variables contained in the dataset United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Anonymization 9 Anonymization is process of removing or modifying the identifying variables contained in the microdata dataset. Identifying variables are those describe characteristic of a person Direct identifiers, which are variables such as names, addresses, or identity card numbers. Indirect identifiers, which are characteristics that may be shared by several respondents, and whose combination could lead to the reidentification of one of them. Anonymizing the data consists in: determining which variables are potential identifiers (this relies on one's personal judgment), and in modifying the level of precision of these variables to reduce the risk of re-identification to an acceptable level. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Anonymization techniques 10 Statistical disclosure limitation techniques can be mainly classified in two categories: data reduction: aim at increasing the number of individuals in the sample sharing the same or similar identifying characteristics presented by the investigated statistical unit. Avoids the presence of unique or rare recognizable individuals. data perturbation: Such methods achieve data protection from a twofold perspective. First, if the data are modified, re-identification by means of record linkage or matching algorithms is harder and uncertain. Secondly, even when an intruder is able to re-identify a unit, he/she cannot be confident that the disclosed data are consistent with the original data. Synthetic microdata: Synthetic microdata are an alternative approach to data protection, and are produced by using data simulation algorithms. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Anonymization techniques 11 Data Reduction Removing variables: The first obvious application of this method is the removal of direct identifiers from the datasets (race, religion, HIV, etc.) Removing records: Removing records can be adopted as an extreme measure of data protection when the unit is identifiable in spite of the application of other protection techniques. Global recoding: The global recoding method consists in aggregating the values observed in a variable into pre-defined classes (for example, recoding the age into five-year age groups, or the number of employees in three-size classes: small, medium and large). United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Anonymization techniques 12 Data reduction Top and bottom coding: A special case of global recoding that can be applied to numerical or ordinal categorical variables. The variables "Salary" and "Age" are two typical examples. The highest values of these variables are usually very rare and therefore identifiable. Local suppression: Local suppression consists in replacing the observed value of one or more variables in a certain record with a missing value. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Anonymization techniques 13 Data perturbation Micro-aggregation: replace an observed value with the average computed on a small group of units (small aggregate or microaggregate), including the investigated one. The units belonging to the same group will be represented in the released file by the same value. Methods in micro-aggregation include individual ranking and multivariate micro-aggregation. The easiest way to group records before aggregating them is to sort the units according to their similarity and the values resulting from this criterion, and to aggregate consecutive units into fixed size groups. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Anonymization techniques 14 Data perturbation Data swapping: a perturbation technique for categorical microdata, and aimed at protecting tabulation stemming from the perturbed microdata file. Data swapping consists in altering a proportion of the records in a file by swapping values of a subset of variables between selected pairs of records (swap pairs). Post-randomization (PRAM): induces uncertainty in the values of some variables by exchanging them according to a probabilistic mechanism. PRAM can therefore be considered as a randomized version of data swapping. Adding noise: Adding noise consists in adding a random value ε, with zero mean and predefined variance σ2, to all values in the variable to be protected. Generally, methods based on adding noise are not considered very effective in terms of data protection. Resampling: a protection method for numerical microdata that consists in drawing with replacement t samples of n values from the original data, sorting the sample and averaging the sampled values. Data protection level guaranteed by this procedure is generally considered quite low. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Anonymization techniques 15 Synthetic Microdata An alternative approach to data protection, and are produced by using data simulation algorithms. The rationale for this approach is that synthetic data do not pose problems with regard to statistical disclosure control because they do not contain real data but preserve certain statistical properties. Generally, users are not keen to work with synthetic data as they cannot be confident of the results of their statistical analysis. Nevertheless, this approach can also help to producing “test microdata set.” In this case, synthetic data files would be released to allow users to test their statistical procedures to successively access “true” microdata in a safe site. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Anonymization tools 16 Several computer software have been developed. The broadly used software is µ-Argus. µ-Argus was created by Statistics Netherlands and further developed. Many of the statistical disclosure methods described above can be applied by µ-Argus. The program is interactive and goes through several steps such as the identification of the dataset (metadata); the selection and computation of frequency tables on which several statistical Disclosure Control methods are based; the selection of possible recoding and inspection of the results (global recoding); the selection and application of other protection methods and finally of the anonymisation microdata file including the production of a data process report. IHSN has also developed tools and guidelines on microdata anonymization. These tools are developed as Stata, SPSS and/or SAS programs. A free software for data swapping, the Data Swapping Toolkit (DSTK), is also available. The DSTK is a comprehensive software package for performing and analyzing data swapping on categorical data. Several other software are also used in specialized areas like health, retailers, ecommerce, etc. United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 Conclusion 17 There is always some chance of identification, even if very small. Software now exists which can estimate the proportion of records which are unique and therefore at some risk of identification. The decision lies in the NSO to release microdata file, whether it an anonymised microdata file, through a remote access facility, or through a data laboratory or via electronic media. the risk of identification is sufficiently small; the adjustments made to the data items have not damaged the microdata file for research purposes; and the variables that have been collapsed are the most appropriate, taking into account both the needs of researchers and the identification risk. Supporting legislation is the key to protect confidentiality and access to microdata is the key United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 References 18 1. 2. 3. 4. 5. 6. 7. 8. Anonymized Microdata Files, Eurostat International Household Survey Network Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, October 2011, UNECE IPUMS-International Statistical Disclosure Controls: 159 Census Microdata Samples in Dissemination Managing Statistical Confidentiality & Microdata Access: Principles and Guidelines of Good Practice, United Nations Economic Commission For Europe Meeting of the African Association of Statistical Data Archivists, AASDA United Nations Fundamental Principles of Official Statistics, UNSD Data Privacy Through Optimal k-Anonymization United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011 19 Thankyou United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011