Reiter, J. P. and Kinney, S. K. (2011)

Abstract
There are enormous benefits to sharing epidemiological data. However, doing so could risk the
confidentiality of data subjects' identities and sensitive attributes. We present a primer on methods of
sharing confidential data, including both restricted access and restricted data methods.
1. Introduction
In a recent editorial, the editors of Epidemiology invited authors to make their analytic and simulation
codes, questionnaires, and data used in analyses available to other researchers [1]. Much has been
written about the need for data sharing and reproducible research, and many journals and funding
agencies have explicit data sharing policies [2, 3]. When data are confidential, however, investigators
typically cannot release them as collected, because doing so could reveal data subjects' identities or
values of sensitive attributes, thereby violating ethical and potentially legal obligations to protect
confidentiality.
At first glance, safely sharing confidential data seems a straightforward task: simply strip unique
identifiers like names, addresses, and identification numbers before releasing the data. However, these
actions alone may not suffice when other readily available variables, such as geographic or demographic
data, remain on the file. These quasi-identifiers can be used to match units in the released data to other
databases. For example, Sweeney [4] showed that 97% of the records in publicly available voter registration lists for Cambridge, Massachusetts, could be uniquely identified using birth date and nine-digit ZIP code. By matching on the information in these lists, she was able to identify Massachusetts Governor William Weld in an anonymized medical database. As the amount of information readily available to the public continues to expand, e.g., via the Internet and private companies, investigators releasing large-scale epidemiologic data run the risk of similar breaches.
In this commentary, we present a primer on techniques for sharing confidential data. We classify
techniques into two broad classes: restricted access, in which access is provided only to trusted users,
and restricted data, in which the original data are somehow altered before sharing.
2. Restricted Access Strategies
If one or more researchers have been trusted to analyze confidential data, it seems logical that they
should be able to share the data with additional trusted researchers. However, researchers may be
prevented by law or by the policies of their institution or sponsoring agency from sharing the data, even
with their colleagues. Without procedures and rules for sharing data, there is little an individual
researcher can do. The means by which unaltered confidential data are shared with external
researchers, while preserving confidentiality, are referred to as restricted access methods.
The primary restricted access methods employed by most data stewards, including government
agencies and individual investigators, are licensing agreements and restricted data centers [5]. Under
the licensing model, researchers with legitimate research questions that can be answered with the
confidential data enter into an agreement with the data steward under which they are provided with a copy
of the data for use at their home institution. Typically, agreements specify how the data should be
protected and may require certification of destruction at project completion. Fees may be charged to
cover data processing and administrative costs. In some cases, such as licensing agreements with the
National Center for Education Statistics, the data are slightly altered to introduce a small amount of
uncertainty, but not enough to concern analysts.
In the data center model, researchers must carry out their research in a physically and electronically
secure facility controlled by the data steward. Often, researchers are required to submit a proposal
describing the research plans, the confidential data needed, and in some cases how their research will
benefit the data steward. In addition to travel expenses, researchers may be required to pay
substantial fees to access the data center, and they cannot remove any output without approval from
the data steward. The U.S. Census Bureau and National Center for Health Statistics provide researchers
access to confidential data in this manner.
An alternative is a secure online data center, such as the NORC Virtual Data Enclave. The confidential
data are stored on secure servers that approved researchers can access via remote login from prespecified IP addresses. The researcher can see and analyze the actual data, but the server disables local
saving, printing of the data and analysis results, and cut-and-paste operations. As with a physical data
enclave, analysis results are checked for potential disclosure violations by NORC staff before being
approved for publication outside the enclave. The virtual enclave has advantages over licensing models,
in that the risks of breaches are centrally managed, e.g., there are no misplaced CD-ROMs or
unapproved file sharing.
A variant on the virtual data enclave is the remote analysis server. This is a query-based system that
provides the results of analyses of the confidential data without actually allowing users to see the
confidential values, thus reducing disclosure risks [6-8]. Examples include the U.S. Census Bureau's
American FactFinder, which is available to anyone but produces only tabular summaries, or the
Australian Remote Access Data Laboratory, which restricts users but allows more flexible analyses.
Other query-based methods under research include verification servers [9], which provide users with
feedback on the quality of analyses based on redacted public use data, and output perturbation via
differential privacy [10], whereby noise is added to query results so that they do not provide information
about any individual with certainty.
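To make output perturbation concrete, here is a minimal sketch in the spirit of differential privacy [10], assuming a simple counting query: a count has sensitivity one, so Laplace noise with scale 1/epsilon protects any single individual for that query. The records, predicate, and value of epsilon are all illustrative.

```python
import numpy as np

def dp_count(records, predicate, epsilon=0.5, rng=None):
    """Answer a counting query with Laplace noise added. A count has
    sensitivity 1 (adding or removing one person changes it by at most 1),
    so Laplace noise with scale 1/epsilon suffices for this single query."""
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative query against toy records: how many subjects are over 65?
records = [{"age": a} for a in (34, 70, 68, 51, 77)]
print(dp_count(records, lambda r: r["age"] > 65))  # true answer 3, plus noise
```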
3. Restricted Data Strategies
While restricted access policies use trust and physical protection, restricted data policies provide
unfettered access to data that have been modified before release. The key issue for data stewards is
determining how much alteration is needed to ensure protection without completely destroying the
usefulness of the data. To make informed decisions about this trade-off, data stewards can quantify the disclosure risk and data usefulness associated with particular release strategies. For example, when two competing policies result in approximately the same disclosure risk, the steward can select the one with higher data usefulness. Quantifiable metrics can help data stewards decide when the risks are sufficiently low, and
the usefulness is adequately high, to justify releasing the altered data [11, 12].
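As a sketch of how such metrics might guide the choice, the snippet below compares hypothetical candidate releases in the spirit of the risk-utility map of [12]; the candidate names, scores, and threshold are all invented for the example.

```python
# Hypothetical (risk, utility) scores for three candidate releases; the
# names, numbers, and threshold are invented for the example.
candidates = {
    "coarsened": {"risk": 0.04, "utility": 0.80},
    "swapped":   {"risk": 0.05, "utility": 0.70},
    "noisy":     {"risk": 0.02, "utility": 0.60},
}
RISK_THRESHOLD = 0.05  # the steward's own tolerance for identification risk

acceptable = {k: v for k, v in candidates.items() if v["risk"] <= RISK_THRESHOLD}
best = max(acceptable, key=lambda k: acceptable[k]["utility"])
print(best)  # among acceptably low-risk candidates, release the most useful
```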
The literature on disclosure risk assessment highlights two main types of disclosures, namely (i)
identification disclosures, which occur when an ill-intentioned user of the data, henceforth called an
intruder, correctly identifies individual records in the released data, and (ii) attribute disclosures, which
occur when an intruder learns the values of sensitive variables for individual records in the data.
Attribute disclosures usually are preceded by identification disclosures (for example, when original values of attributes are released, intruders who correctly identify records learn the attribute values), so the focus typically is on assessing identification disclosure risk. See the reports of the National Research Council [13, 14] and the Federal Committee on Statistical Methodology [15] for more information about attribute disclosure risks [16].
Identification disclosure risk measures are often based on estimates of the probabilities that individuals
can be identified in the released data. Probabilities of identification are easily interpreted: the larger the
probability, the greater the risk. Stewards determine their own threshold for unsafe probabilities. A
variety of approaches have been used to estimate these probabilities [17-22]. Individuals who are
unique in the population (as opposed to the sample) are particularly at risk [23]. Therefore, much
research has gone into estimating the probability that a sample unique record is in fact a population
unique record [24, 25].
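A crude first screen along these lines is to flag records that are unique in the sample on a set of quasi-identifiers, since every population unique is also a sample unique. The sketch below does this with pandas on invented data; a serious assessment would go on to estimate the probability that each sample unique is a population unique [24, 25].

```python
import pandas as pd

def sample_uniques(df, quasi_identifiers):
    """Return the records whose combination of quasi-identifier values
    is unique in the sample; every population unique is among these."""
    return df[~df.duplicated(subset=quasi_identifiers, keep=False)]

# Toy data; age band, 3-digit ZIP, and sex are the assumed quasi-identifiers.
df = pd.DataFrame({
    "age_group": ["30-39", "30-39", "70-79", "70-79"],
    "zip3":      ["021",   "021",   "021",   "278"],
    "sex":       ["F",     "F",     "F",     "F"],
})
print(sample_uniques(df, ["age_group", "zip3", "sex"]))  # the last two rows
```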
Data usefulness is usually assessed with two general approaches: (i) comparing broad differences
between the original and released data, and (ii) comparing differences in specific models between the
original and released data. Broad difference measures essentially quantify some statistical distance
between the distributions of the data on the original and released files, for example a Kullback-Leibler or
Hellinger distance [26]. As the distance between the distributions grows, the overall quality of the
released data generally drops [27, 28].
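For instance, a broad difference measure can be computed directly from the cell proportions of the same cross-tabulation on the two files. A minimal sketch of the Hellinger distance, with invented proportions:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions:
    0 means identical, 1 means disjoint support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Cell proportions of the same cross-tabulation on the original and
# released files (invented numbers).
original = [0.50, 0.30, 0.15, 0.05]
released = [0.48, 0.31, 0.16, 0.05]
print(hellinger(original, released))  # small distance: little broad damage
```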
Comparison of measures based on specific models is often done informally. For example, the steward
looks at the similarity of point estimates and standard errors of regression coefficients after fitting the
same regression on the original data and on the data proposed for release. If the results are considered close, for example when the confidence intervals obtained from the two models largely overlap, the released data
have high utility for that particular analysis [29]. Such measures are closely tied to how the data are
used, but they provide a limited representation of the overall quality of the released data. Thus, it is
prudent to examine models that represent a wide range of uses of the released data.
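One way to formalize "largely overlapping" is the interval-overlap utility measure proposed in [29]: average, over the two intervals, the fraction of each covered by their intersection. A minimal sketch, with invented intervals:

```python
def ci_overlap(orig_ci, released_ci):
    """Average relative overlap of two confidence intervals: 1 when they
    coincide, near 0 when the two analyses effectively disagree."""
    lo = max(orig_ci[0], released_ci[0])
    hi = min(orig_ci[1], released_ci[1])
    overlap = max(0.0, hi - lo)
    return 0.5 * (overlap / (orig_ci[1] - orig_ci[0])
                  + overlap / (released_ci[1] - released_ci[0]))

# 95% intervals for the same coefficient, fit on original vs released data
# (invented numbers).
print(ci_overlap((1.2, 2.0), (1.1, 1.9)))  # high overlap: high utility
```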
Risks can be reduced by altering data using one or more statistical disclosure limitation strategies.
Common approaches include coarsening, data swapping, noise addition, and synthetic data. Examples of
coarsening include releasing geography as aggregated regions, reporting exact values only below or
above certain thresholds (known as top or bottom coding), collapsing levels of categorical variables, and
rounding numerical values. Such recoding reduces disclosure risks by turning atypical records, which generally are most at risk, into typical records. Recodes frequently are specified to reduce estimated
probabilities of identification to acceptable levels, e.g., no cell in a cross-tabulation of a set of key
categorical variables has fewer than k individuals, where k often is three or five.
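The sketch below illustrates two such recodes, banding age into decades and top-coding income at $100,000, and then checks the k-threshold rule on a cross-tabulation of key variables; the toy data, the key variables, and the value of k are all illustrative.

```python
import pandas as pd

K = 3  # minimum allowed cell size in the key cross-tabulation

df = pd.DataFrame({  # toy confidential file
    "age":    [23, 27, 24, 61, 65, 68, 62],
    "sex":    ["F", "F", "F", "M", "M", "M", "M"],
    "income": [40_000, 52_000, 48_000, 250_000, 61_000, 75_000, 58_000],
})

released = df.copy()
released["age"] = (released["age"] // 10 * 10).astype(str) + "s"  # decade bands
released["income"] = released["income"].clip(upper=100_000)       # top-coding

# Verify the k-threshold rule on the key variables before release.
print(released.groupby(["age", "sex"]).size().min() >= K)  # True
```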
Data swapping refers to switching the data values for selected records with those for other records [30].
The protection afforded by swapping is based in large part on perception: data stewards expect that
intruders will be discouraged from linking to external files, since any apparent matches could be
incorrect due to the swapping. The details of the swapping mechanism are almost always kept secret to
reduce the potential for reverse engineering.
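A minimal sketch of the mechanics, swapping a single identifying variable between randomly paired records; the variable, swap rate, and data are illustrative, and in practice the swapped records and the rate would be kept secret:

```python
import numpy as np
import pandas as pd

def swap_values(df, column, swap_rate=0.1, rng=None):
    """Swap one variable's values between randomly paired records.
    Which records were swapped, and at what rate, stays secret."""
    rng = rng or np.random.default_rng()
    out = df.copy()
    n_pairs = int(len(df) * swap_rate / 2)
    chosen = rng.choice(df.index.to_numpy(), size=2 * n_pairs, replace=False)
    a, b = chosen[:n_pairs], chosen[n_pairs:]
    out.loc[a, column] = df.loc[b, column].to_numpy()
    out.loc[b, column] = df.loc[a, column].to_numpy()
    return out

# Toy example: swap the geography of roughly 40% of records.
df = pd.DataFrame({"county": list("ABCDEFGHIJ"), "cases": range(10)})
print(swap_values(df, "county", swap_rate=0.4, rng=np.random.default_rng(1)))
```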
Adding random noise to sensitive or identifying data values can reduce disclosure risks because errors
are introduced into the released data, making it difficult for intruders to match on the released values
with confidence [31, 32]. For numerical data, the noise typically is generated from a distribution with mean zero to ensure that point estimates of means remain unbiased, although the variances associated with those estimates may increase. However, zero-mean noise still can lead to attenuation in point estimates of correlations and regression coefficients.
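The simulation below illustrates both effects: the released mean stays essentially unbiased while the correlation with an unperturbed outcome attenuates toward zero. The distributions and noise scale are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(50, 10, n)                # confidential numeric variable
y = 2.0 * x + rng.normal(0, 5, n)        # a related outcome, left unperturbed

sigma = 10.0                             # noise standard deviation (illustrative)
x_released = x + rng.normal(0, sigma, n) # zero-mean noise: mean stays unbiased

print(x.mean(), x_released.mean())       # nearly identical means
print(np.corrcoef(x, y)[0, 1],           # ~0.97 on the original data
      np.corrcoef(x_released, y)[0, 1])  # noticeably attenuated
```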
Partially synthetic data methods replace some collected values, such as sensitive values at high risk of
disclosure or values of key identifiers, with multiple imputations [8, 33-35]. For example, suppose that the steward wants to replace income when it exceeds $100,000, because the steward considers only these individuals to be at risk of disclosure, and is willing to release all other values as collected. (Alternatively, the steward could synthesize the key identifiers for the individuals with incomes exceeding $100,000, with the goal of reducing the risk that these individuals are identified.) The
steward generates replacement values for the incomes over $100,000 by randomly simulating from the
distribution of income conditional on all other variables. To avoid bias, this distribution must be
conditional on income exceeding $100,000. The distribution is estimated using the collected data and
possibly other relevant information. The result is one synthetic data set. The steward repeats this
process multiple times and releases the multiple datasets to the public. The multiple datasets enable secondary analysts to account in their inferences for the uncertainty introduced by simulating replacement values. Among
the restricted data techniques described here, partial synthesis is the newest and least commonly used.
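A minimal sketch of this recipe, assuming a simple lognormal regression of income on age as the imputation model; a serious implementation would condition on all available variables, draw the model parameters as well, and treat the truncation at the cutoff more carefully.

```python
import numpy as np
import pandas as pd

def partially_synthesize(df, m=5, cutoff=100_000, rng=None):
    """Sketch of partial synthesis: fit a lognormal regression of income
    on age using only records above the cutoff (so the fitted model is
    already conditional on exceeding it), then replace just those incomes
    with simulated draws, m times. Purely illustrative."""
    rng = rng or np.random.default_rng()
    at_risk = df["income"] > cutoff
    X = np.column_stack([np.ones(at_risk.sum()), df.loc[at_risk, "age"]])
    z = np.log(df.loc[at_risk, "income"].to_numpy())
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    sigma = np.sqrt(np.mean((z - X @ beta) ** 2))

    releases = []
    for _ in range(m):                       # one draw per synthetic data set
        synth = df.copy()
        synth["income"] = synth["income"].astype(float)
        draw = np.exp(X @ beta + rng.normal(0, sigma, len(z)))
        synth.loc[at_risk, "income"] = np.maximum(draw, cutoff)  # stay in range
        releases.append(synth)
    return releases
```

Secondary analysts would fit their models on each of the m released files and combine the results using the appropriate combining rules for partially synthetic data (see [34]).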
A stronger version of synthetic data is full synthesis, in which all values in the released data are
simulated [36, 37]. This approach may become appealing in the future if confidentiality concerns grow
to the point where no original data can be released in unrestricted public use files.
4. Conclusion
To reproduce research conducted on confidential data, researchers likely will need access to the same
confidential data used by the original investigators. It is possible in some cases for restricted data
methods to be used, but for the purpose of reproducibility, restricted access methods hold more
promise. This is not to say that restricted data methods do not have their place, as they enable many
other benefits of data sharing.
For both restricted data and restricted access, the methods that we described may be too cumbersome
for individual researchers to implement independently. We believe it is incumbent upon research
institutions and sponsors, journals, and other advocates of reproducibility to facilitate sharing of
confidential data by establishing procedures and policies for sharing confidential data and perhaps
creating confidential data repositories. Authors who obtain their confidential data from such
repositories could comply with a data sharing invitation such as the one put forth by Epidemiology by
providing detailed instructions for obtaining the confidential data.
References
1 Hernán MA, Wilcox AJ. Epidemiology, data sharing, and the challenge of scientific replication [commentary]. Epidemiology. 2009; 20(2): 167-168.
2 Donoho DL. An invitation to reproducible computational research. Biostatistics. 2010; 11(3): 385-388.
3 Sedransk N, Young LJ, Kelner KL, Moffitt RA, Thakar A, Raddick J, Ungvarsky EJ, Carlson RW, Apweiler R, Cox LH, Nolan D, Soper K, Spiegelman C. Make research data public? Not always so simple: A dialogue for statisticians and science editors. Statistical Science. 2010; 25(1): 41-50.
4 Sweeney L. Computational disclosure control: Theory and practice. Ph.D. thesis, Department of Computer Science, Massachusetts Institute of Technology; 2001.
5 Kinney SK, Karr AF, Gonzalez JF. Data confidentiality: The next five years summary and guide to papers. Journal of Privacy and Confidentiality. 2009; 1(2): Article 1.
6 Gomatam S, Karr AF, Reiter JP, Sanil AP. Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access analysis servers. Statistical Science. 2005; 20: 163-177.
7 Reiter JP. Model diagnostics for remote-access regression servers. Statistics and Computing. 2003; 13: 371-380.
8 Reiter JP. New approaches to data dissemination: A glimpse into the future (?). Chance. 2004; 17(3): 12-16.
9 Reiter JP, Oganian A, Karr AF. Verification servers: Enabling analysts to assess the quality of inferences from public use data. Computational Statistics and Data Analysis. 2009; 53: 1475-1482.
10 Dwork C. Differential privacy. 33rd International Colloquium on Automata, Languages, and Programming, Part II. Berlin: Springer; 2006: 1-12.
11 Willenborg L, de Waal T. Elements of Statistical Disclosure Control. New York: Springer-Verlag; 2001.
12 Duncan GT, Keller-McNulty SA, Stokes SL. Disclosure risk vs. data utility: The R-U confidentiality map. Technical report, U.S. National Institute of Statistical Sciences; 2001.
13 National Research Council. Expanding Access to Research Data: Reconciling Risks and Opportunities. Panel on Data Access for Research Purposes, Committee on National Statistics, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press; 2005.
14 National Research Council. Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data. Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data, Committee on the Human Dimensions of Global Change, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press; 2007.
15 Federal Committee on Statistical Methodology. Statistical policy working paper 22: Report on statistical disclosure limitation methodology (second version). Washington, DC: Confidentiality and Data Access Committee, Office of Management and Budget, Executive Office of the President; 2005.
16 Lambert D. Measures of disclosure risk and harm. Journal of Official Statistics. 1993; 9: 313-331.
17 Duncan GT, Lambert D. Disclosure-limited data dissemination. Journal of the American Statistical Association. 1986; 81: 10-28.
18 Duncan GT, Lambert D. The risk of disclosure for microdata. Journal of Business and Economic Statistics. 1989; 7: 207-217.
19 Fienberg SE, Makov UE, Sanil AP. A Bayesian approach to data disclosure: Optimal intruder behavior for continuous data. Journal of Official Statistics. 1997; 13: 75-89.
20 Domingo-Ferrer J, Torra V. Disclosure risk assessment in statistical microdata via advanced record linkage. Statistics and Computing. 2003; 13: 111-133.
21 Reiter JP. Estimating risks of identification disclosure for microdata. Journal of the American Statistical Association. 2005; 100: 1103-1113.
22 Shlomo N, Skinner CJ. Assessing the protection provided by misclassification-based disclosure limitation methods for survey microdata. Annals of Applied Statistics. 2010; 4(3): 1291-1310.
23 Bethlehem JG, Keller WJ, Pannekoek J. Disclosure control of microdata. Journal of the American Statistical Association. 1990; 85: 38-45.
24 Elamir E, Skinner CJ. Record-level measures of disclosure risk for survey microdata. Journal of Official Statistics. 2006; 22(3): 525-539.
25 Skinner CJ, Shlomo N. Assessing identification risk in survey microdata using log-linear models. Journal of the American Statistical Association. 2008; 103: 989-1001.
26 Shlomo N. Statistical disclosure control for census frequency tables. International Statistical Review. 2007; 75: 199-217.
27 Domingo-Ferrer J, Torra V. A quantitative comparison of disclosure control methods for microdata. In: Doyle P, Lane J, Zayatz L, Theeuwes J, eds. Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Amsterdam: North-Holland; 2001: 111-133.
28 Woo M, Reiter JP, Oganian A, Karr AF. Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality. 2009; 1(1): 111-124.
29 Karr AF, Kohnen CN, Oganian A, Reiter JP, Sanil AP. A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician. 2006; 60: 224-232.
30 Dalenius T, Reiss SP. Data-swapping: A technique for disclosure control. Journal of Statistical Planning and Inference. 1982; 6: 73-85.
31 Brand R. Micro-data protection through noise addition. In: Domingo-Ferrer J, ed. Inference Control in Statistical Databases. New York: Springer; 2002: 97-116.
32 Kim JJ. A method for limiting disclosure in microdata based on random noise and transformation. Proceedings of the Section on Survey Research Methods of the American Statistical Association. 1986: 370-374.
33 Little RJA. Statistical analysis of masked data. Journal of Official Statistics. 1993; 9: 407-426.
34 Reiter JP, Raghunathan TE. The multiple adaptations of multiple imputation. Journal of the American Statistical Association. 2007; 102: 1462-1471.
35 Rubin DB. Statistical disclosure limitation. Journal of Official Statistics. 1993; 9: 461-468.
36 Raghunathan TE, Reiter JP, Rubin DB. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics. 2003; 19: 1-16.
37 Reiter JP. Releasing multiply imputed, synthetic public use microdata: An illustration and empirical study. Journal of the Royal Statistical Society, Series A. 2005; 168: 185-205.
About the Authors
Jerome Reiter is the Mrs. Alexander Hehmeyer Associate Professor of Statistical Science at Duke
University and the current chair of the American Statistical Association Committee on Privacy and
Confidentiality. Satkartar Kinney is a research scientist at the National Institute of Statistical Sciences.
Their research focuses on methodology for protecting confidential data. They have worked closely with
the Census Bureau to develop a public use file for the Longitudinal Business Database—the first ever
public use, establishment-level data product in the U.S.