Confidentiality and Anonymization of Microdata
MOLLA HUNEGNAW
STATISTICIAN
AFRICAN CENTRE FOR STATISTICS
ECASTATS.UNECA.ORG
United Nations Regional Seminar on Census Data Archiving for Africa, Addis Ababa, 20-23 September 2011
Contents
• Microdata
• Microdata confidentiality
• Anonymization
• Anonymization techniques
• Anonymization tools
• Conclusion
Microdata
• The term microdata refers to data about an individual person, household, business or other entity.
• Microdata may be collected through surveys or censuses, or obtained from administrative records.
• Microdata are collected using valuable resources of a country.
• Microdata are usually of significant value, yet access to countries’ microdata is often limited by the fear that confidentiality protection cannot be guaranteed.
• Microdata must be used and reused.
Microdata
• Microdata access obstacles:
  - legal obstacles: in many countries the statistical legislation still in use is outdated and does not recognise the dissemination of electronic data, particularly microdata;
  - technical and financial obstacles, including the in-house capacity to handle the complex aspects of microdata dissemination such as data anonymization;
  - political obstacles; and
  - psychological obstacles: the tendency to control access, perhaps because of concerns over misinterpretation or because ‘data is power’.
Confidentiality
• Confidentiality refers to limiting data access and disclosure to authorized users and preventing access by, or disclosure to, unauthorized ones.
• Confidentiality is also related to the broader concept of data privacy: limiting access to individuals’ personal information.
Confidentiality
• The sixth United Nations Fundamental Principle of Official Statistics concerns confidentiality:

“Individual data collected by statistical agencies for statistical compilation, whether they refer to natural or legal persons, are to be strictly confidential and used exclusively for statistical purposes.”
Confidentiality
• How to ensure confidentiality
  - Legal arrangements
    Legislation should be in place before any microdata are released.
    A legally enforceable agreement may be one of the requirements of access. It should be possible to set up an arrangement where a prior agreement needs to be signed even where access to microdata is available online.
  - Technical arrangements
    The risk of a confidentiality breach can be reduced by the use of a number of techniques.
    The downside is that these techniques may reduce the usefulness of the underlying microdata.
Confidentiality
• Technical arrangements
  - Remote Access Facilities (RAFs)
    The key characteristic is that users do not have access to the microdata themselves, but tasks using the microdata can be submitted remotely over the Internet.
  - Data laboratories / safe sites
    These are effective in controlling identification risk while still giving users access, particularly for datasets where release of a confidentialised microdata file is not possible.
    The drawbacks are a lack of convenience for the user, who is sometimes forced to use unfamiliar data analysis software, and the expense to the national statistical office (NSO) of managing them.
  - Anonymization techniques
    Removing or modifying the identifying variables contained in the dataset.
Anonymization
• Anonymization is the process of removing or modifying the identifying variables contained in a microdata dataset.
• Identifying variables are those that describe characteristics of a person or other unit. They are either:
  - direct identifiers: variables such as names, addresses or identity card numbers; or
  - indirect identifiers: characteristics that may be shared by several respondents, and whose combination could lead to the re-identification of one of them.
• Anonymizing the data consists in:
  - determining which variables are potential identifiers (this relies on one's personal judgment); and
  - modifying the level of precision of these variables to reduce the risk of re-identification to an acceptable level.
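One simple way to gauge re-identification risk from indirect identifiers is to count how many records share each combination of their values. The sketch below is illustrative only: the function names, the sample records and the threshold k are assumptions, not part of the presentation, and real tools use more refined risk measures.

```python
from collections import Counter

def group_sizes(records, identifiers):
    """Count how many records share each combination of indirect identifiers."""
    return Counter(tuple(r[v] for v in identifiers) for r in records)

def at_risk(records, identifiers, k=2):
    """Return the records whose identifier combination is shared by fewer than k records."""
    sizes = group_sizes(records, identifiers)
    return [r for r in records if sizes[tuple(r[v] for v in identifiers)] < k]

# Hypothetical records; "age", "sex" and "district" are judged to be indirect identifiers.
records = [
    {"age": 34, "sex": "F", "district": "A"},
    {"age": 34, "sex": "F", "district": "A"},
    {"age": 71, "sex": "M", "district": "B"},
    {"age": 34, "sex": "M", "district": "A"},
]

risky = at_risk(records, ["age", "sex", "district"])
print(len(risky), "of", len(records), "records are unique on the chosen identifiers")
```

Reducing the precision of the identifiers (for example, using only the district) shrinks the set of at-risk records, which is the trade-off the second step above refers to.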
Anonymization techniques
• Statistical disclosure limitation techniques fall mainly into two categories, data reduction and data perturbation, with synthetic microdata as an alternative approach:
  - Data reduction: these methods aim to increase the number of individuals in the sample who share the same or similar identifying characteristics as the unit under investigation, so that unique or rare, recognizable individuals are avoided.
  - Data perturbation: these methods protect data in two ways. First, if the data are modified, re-identification by means of record linkage or matching algorithms becomes harder and less certain. Second, even when an intruder is able to re-identify a unit, he or she cannot be confident that the disclosed data are consistent with the original data.
  - Synthetic microdata: an alternative approach to data protection, in which the released data are produced by data simulation algorithms.
Anonymization techniques
• Data reduction
  - Removing variables: the first obvious application of this method is the removal of direct identifiers from the dataset. A variable can also be removed when it is too sensitive for public use (race, religion, HIV status, etc.).
  - Removing records: this can be adopted as an extreme measure of data protection when a unit remains identifiable in spite of the application of other protection techniques.
  - Global recoding: this method consists in aggregating the values observed in a variable into pre-defined classes (for example, recoding age into five-year age groups, or the number of employees into three size classes: small, medium and large).
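The two global recoding examples above can be sketched as follows. The function names and the employee cut-offs (9 and 49) are assumptions chosen for illustration; the presentation does not specify class boundaries.

```python
def recode_age(age, width=5):
    """Map an exact age to a fixed-width band label such as '30-34'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def recode_size(employees, small_max=9, medium_max=49):
    """Map a number of employees to one of three size classes (cut-offs are assumed)."""
    if employees <= small_max:
        return "small"
    if employees <= medium_max:
        return "medium"
    return "large"

ages = [0, 4, 5, 34, 71]
print([recode_age(a) for a in ages])       # ['0-4', '0-4', '5-9', '30-34', '70-74']
print([recode_size(n) for n in (3, 20, 250)])  # ['small', 'medium', 'large']
```

The recoding is "global" because the same classes are applied to every record in the file, not only to the records at risk.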
Anonymization techniques

• Data reduction
  - Top and bottom coding: a special case of global recoding that can be applied to numerical or ordinal categorical variables. The variables "Salary" and "Age" are two typical examples: their highest values are usually very rare and therefore identifiable. Top coding groups these values into a single open-ended category (for example, "age 75 and above"); bottom coding does the same for the lowest values.
  - Local suppression: this consists in replacing the observed value of one or more variables in a certain record with a missing value.
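A minimal sketch of both techniques follows. The threshold of 75, the rule "suppress when fewer than k records share the combination" and all names are assumptions made for illustration; production tools choose which values to suppress more carefully so as to lose as little information as possible.

```python
from collections import Counter

def top_bottom_code(value, bottom=None, top=None):
    """Replace values beyond the given thresholds with the threshold itself."""
    if top is not None and value > top:
        return top
    if bottom is not None and value < bottom:
        return bottom
    return value

def suppress_locally(records, identifiers, target, k=2):
    """Return copies of the records, with `target` set to None (missing) wherever
    the combination of identifier values is shared by fewer than k records."""
    sizes = Counter(tuple(r[v] for v in identifiers) for r in records)
    protected = []
    for r in records:
        copy = dict(r)
        if sizes[tuple(r[v] for v in identifiers)] < k:
            copy[target] = None
        protected.append(copy)
    return protected

# Top coding: every age above 75 is released as 75 (read as "75 and above").
ages = [3, 34, 71, 98]
print([top_bottom_code(a, top=75) for a in ages])  # [3, 34, 71, 75]

# Local suppression: only the rare occupation is blanked, in one record only.
records = [
    {"district": "A", "occupation": "farmer"},
    {"district": "A", "occupation": "farmer"},
    {"district": "A", "occupation": "judge"},
]
protected = suppress_locally(records, ["district", "occupation"], "occupation")
print(protected[2])  # {'district': 'A', 'occupation': None}
```

Unlike global recoding, local suppression changes only the records at risk and leaves the rest of the file untouched.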
Anonymization techniques
• Data perturbation
  - Micro-aggregation: replace an observed value with the average computed on a small group of units (a small aggregate or micro-aggregate), including the unit under investigation. The units belonging to the same group are represented in the released file by the same value.
  - Methods of micro-aggregation include individual ranking and multivariate micro-aggregation.
  - The easiest way to group records before aggregating them is to sort the units by a similarity criterion and then aggregate consecutive units into fixed-size groups.
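The "sort, then aggregate consecutive units" approach can be sketched for a single numerical variable as below. The function name, the group size of 3 and the choice to merge a short final group into the previous one are assumptions for illustration, not a description of any particular tool.

```python
from statistics import mean

def microaggregate(values, group_size=3):
    """Replace each value by the mean of its group after sorting by value.
    A final group smaller than group_size is merged into the previous one."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    if len(groups) > 1 and len(groups[-1]) < group_size:
        tail = groups.pop()
        groups[-1].extend(tail)
    result = [None] * len(values)
    for group in groups:
        average = mean([values[i] for i in group])
        for i in group:
            result[i] = average
    return result

incomes = [10, 52, 11, 50, 12, 51, 90]
released = microaggregate(incomes, group_size=3)
print(released)
```

Every released value is shared by at least `group_size` units, and the overall mean of the variable is preserved because each group keeps its own total.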
Anonymization techniques
• Data perturbation
  - Data swapping: a perturbation technique for categorical microdata, aimed at protecting tabulations derived from the perturbed microdata file. Data swapping consists in altering a proportion of the records in a file by swapping the values of a subset of variables between selected pairs of records (swap pairs).
  - Post-randomization (PRAM): induces uncertainty in the values of some variables by exchanging them according to a probabilistic mechanism. PRAM can therefore be considered a randomized version of data swapping.
  - Adding noise: consists in adding a random value ε, with zero mean and predefined variance σ², to all values of the variable to be protected. Generally, methods based on adding noise are not considered very effective in terms of data protection.
  - Resampling: a protection method for numerical microdata that consists in drawing, with replacement, t samples of n values from the original data, sorting each sample and averaging the sampled values. The level of data protection guaranteed by this procedure is generally considered quite low.
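The first three perturbation methods above can be sketched in a few lines each. Everything here is an assumption made for illustration: the function names, the random choice of swap pairs, the example transition probabilities and the use of Gaussian noise (the text only requires zero mean and variance σ²).

```python
import random

def swap_values(records, variable, proportion, rng):
    """Data swapping: exchange `variable` between randomly chosen pairs of records.
    Roughly `proportion` of the records take part in a swap."""
    out = [dict(r) for r in records]
    n_pairs = int(len(out) * proportion) // 2
    chosen = rng.sample(range(len(out)), 2 * n_pairs)
    for a, b in zip(chosen[::2], chosen[1::2]):
        out[a][variable], out[b][variable] = out[b][variable], out[a][variable]
    return out

def pram(values, transition, rng):
    """PRAM: replace each category by a random draw from its row of the
    transition matrix, given as {category: {new_category: probability}}."""
    out = []
    for v in values:
        categories = list(transition[v])
        weights = [transition[v][c] for c in categories]
        out.append(rng.choices(categories, weights=weights)[0])
    return out

def add_noise(values, sigma, rng):
    """Adding noise: add a zero-mean Gaussian value with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in values]

rng = random.Random(2011)  # fixed seed so the sketch is reproducible
records = [{"id": i, "region": r} for i, r in enumerate("AABBBCCD")]
swapped = swap_values(records, "region", 0.5, rng)

transition = {
    "urban": {"urban": 0.9, "rural": 0.1},
    "rural": {"urban": 0.2, "rural": 0.8},
}
perturbed = pram(["urban", "rural", "urban"], transition, rng)
noisy = add_noise([100.0, 250.0, 90.0], sigma=5.0, rng=rng)
```

Swapping leaves the overall frequency of each category unchanged, whereas PRAM changes it in a way that is known in expectation from the transition matrix.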
Anonymization techniques
• Synthetic microdata
  - Synthetic microdata are an alternative approach to data protection and are produced by using data simulation algorithms. The rationale for this approach is that synthetic data do not pose problems with regard to statistical disclosure control, because they do not contain real data yet preserve certain statistical properties.
  - Generally, users are not keen to work with synthetic data as they cannot be confident of the results of their statistical analysis. Nevertheless, this approach can help to produce a “test microdata set”: synthetic data files are released so that users can test their statistical procedures before going on to access the “true” microdata in a safe site.
Anonymization tools
• Several software packages have been developed. A widely used one is µ-Argus.
• µ-Argus was created by Statistics Netherlands and has since been developed further.
• Many of the statistical disclosure methods described above can be applied with µ-Argus. The program is interactive and goes through several steps: the identification of the dataset (metadata); the selection and computation of frequency tables, on which several statistical disclosure control methods are based; the selection of possible recodings and inspection of the results (global recoding); the selection and application of other protection methods; and finally the production of the anonymised microdata file, together with a report on the data process.
• IHSN has also developed tools and guidelines on microdata anonymization. These tools are developed as Stata, SPSS and/or SAS programs.
• A free software package for data swapping, the Data Swapping Toolkit (DSTK), is also available. The DSTK is a comprehensive package for performing and analyzing data swapping on categorical data.
• Several other software packages are used in specialized areas such as health, retail and e-commerce.
Conclusion
• There is always some chance of identification, even if very small.
• Software now exists that can estimate the proportion of records that are unique and therefore at some risk of identification.
• The decision to release a microdata file lies with the NSO, whether as an anonymised microdata file, through a remote access facility, through a data laboratory or via electronic media. Before release, the NSO should be satisfied that:
  - the risk of identification is sufficiently small;
  - the adjustments made to the data items have not damaged the microdata file for research purposes; and
  - the variables that have been collapsed are the most appropriate, taking into account both the needs of researchers and the identification risk.
• Supporting legislation is the key to protecting confidentiality, and access to microdata is the key
References
1. Anonymized Microdata Files, Eurostat
2. International Household Survey Network
3. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, October 2011, UNECE
4. IPUMS-International Statistical Disclosure Controls: 159 Census Microdata Samples in Dissemination
5. Managing Statistical Confidentiality & Microdata Access: Principles and Guidelines of Good Practice, United Nations Economic Commission for Europe
6. Meeting of the African Association of Statistical Data Archivists, AASDA
7. United Nations Fundamental Principles of Official Statistics, UNSD
8. Data Privacy Through Optimal k-Anonymization
Thank you