Abstract

There are enormous benefits to sharing epidemiological data. However, doing so could risk the confidentiality of data subjects' identities and sensitive attributes. We present a primer on methods of sharing confidential data, including both restricted access and restricted data approaches.

1. Introduction

In a recent editorial, the editors of Epidemiology invited authors to make their analytic and simulation codes, questionnaires, and data used in analyses available to other researchers [1]. Much has been written about the need for data sharing and reproducible research, and many journals and funding agencies have explicit data sharing policies [2, 3]. When data are confidential, however, investigators typically cannot release them as collected, because doing so could reveal data subjects' identities or values of sensitive attributes, thereby violating ethical and potentially legal obligations to protect confidentiality.

At first glance, safely sharing confidential data seems a straightforward task: simply strip unique identifiers like names, addresses, and identification numbers before releasing the data. However, these actions alone may not suffice when other readily available variables, such as geographic or demographic data, remain on the file. These quasi-identifiers can be used to match units in the released data to other databases. For example, Sweeney [4] showed that 97% of the records in publicly available voter registration lists for Cambridge, Massachusetts, could be uniquely identified using birth date and nine-digit ZIP code. By matching on the information in these lists, she was able to identify Massachusetts Governor William Weld in an anonymized medical database. As the amount of information readily available to the public continues to expand, e.g., via the Internet and private companies, investigators releasing large-scale epidemiologic data run the risk of similar breaches.

In this commentary, we present a primer on techniques for sharing confidential data. We classify techniques into two broad classes: restricted access, in which access is provided only to trusted users, and restricted data, in which the original data are somehow altered before sharing.

2. Restricted Access Strategies

If one or more researchers have been trusted to analyze confidential data, it seems logical that they should be able to share the data with additional trusted researchers. However, researchers may be prevented by law or by the policies of their institution or sponsoring agency from sharing the data, even with their colleagues. Without procedures and rules for sharing data, there is little an individual researcher can do. The means by which unaltered confidential data are shared with external researchers, while preserving confidentiality, are referred to as restricted access methods. The primary restricted access methods employed by most data stewards, including government agencies and individual investigators, are licensing agreements and restricted data centers [5].

Under the licensing model, researchers with legitimate research questions that can be answered with the confidential data enter into an agreement with the data steward, under which they are provided a copy of the data for use at their home institution. Typically, agreements specify how the data should be protected and may require certification of destruction at project completion. Fees may be charged to cover data processing and administrative costs.
In some cases, such as licensing agreements with the National Center for Education Statistics, the data are slightly altered to introduce a small amount of uncertainty, but not enough to concern analysts.

In the data center model, researchers must carry out their research in a physically and electronically secure facility controlled by the data steward. Often, researchers are required to submit a proposal describing the research plans, the confidential data needed, and in some cases how their research will benefit the data steward. In addition to travel expenses, researchers may be required to pay substantial fees to access the data center, and they cannot remove any output without approval from the data steward. The U.S. Census Bureau and National Center for Health Statistics provide researchers access to confidential data in this manner.

An alternative is a secure online data center, such as the NORC Virtual Data Enclave. The confidential data are stored on secure servers that approved researchers can access via remote login from prespecified IP addresses. The researcher can see and analyze the actual data, but the server disables local saving, printing of the data and analysis results, and cut-and-paste operations. As with a physical data enclave, analysis results are checked for potential disclosure violations by NORC staff before being approved for publication outside the enclave. The virtual enclave has advantages over licensing models, in that the risks of breaches are centrally managed, e.g., there are no misplaced CD-ROMs or unapproved file sharing.

A variant on the virtual data enclave is the remote analysis server. This is a query-based system that provides the results of analyses of the confidential data without actually allowing users to see the confidential values, thus reducing disclosure risks [6-8]. Examples include the U.S. Census Bureau's American FactFinder, which is available to anyone but produces only tabular summaries, and the Australian Remote Access Data Laboratory, which restricts users but allows more flexible analyses. Other query-based methods under research include verification servers [9], which provide users with feedback on the quality of analyses based on redacted public use data, and output perturbation via differential privacy [10], whereby noise is added to query results so that they do not reveal information about any individual with certainty.

3. Restricted Data Strategies

While restricted access policies rely on trust and physical protection, restricted data policies provide unfettered access to data that have been modified before release. The key issue for data stewards is determining how much alteration is needed to ensure protection without completely destroying the usefulness of the data. To make informed decisions about this trade-off, data stewards can quantify the disclosure risk and data usefulness associated with particular release strategies. For example, when two competing policies result in approximately the same disclosure risk, the steward can select the one with higher data usefulness. Quantifiable metrics can help data stewards decide when the risks are sufficiently low, and the usefulness is adequately high, to justify releasing the altered data [11, 12].
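To make this trade-off concrete, the sketch below scores two hypothetical coarsened releases of a toy data set on one crude risk proxy (the share of records that are unique on a set of quasi-identifiers) and one crude utility proxy (how little the mean of an analysis variable shifts). The data, variable names, recodes, and the idea of picking a candidate by these two numbers are ours for illustration only; real assessments rely on the more careful risk and utility measures discussed in the paragraphs that follow.

```python
# Illustrative sketch only: comparing two candidate releases on a crude
# disclosure-risk proxy and a crude data-usefulness proxy. All data,
# variable names, and recodes here are hypothetical.
from collections import Counter

def risk_proxy(records, quasi_identifiers):
    """Share of records that are unique on the quasi-identifiers
    (a stand-in for estimated identification probabilities)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)

def utility_proxy(original, released, variable):
    """Negative absolute shift in the mean of one analysis variable,
    so that larger (less negative) values indicate higher usefulness."""
    mean = lambda recs: sum(r[variable] for r in recs) / len(recs)
    return -abs(mean(original) - mean(released))

original = [{"age": a, "zip": z} for a, z in
            [(34, "27701"), (35, "27701"), (34, "27708"), (71, "27705")]]

# Candidate A: age in 10-year bands. Candidate B: 5-year bands plus 3-digit zip.
candidate_a = [dict(r, age=10 * (r["age"] // 10)) for r in original]
candidate_b = [dict(r, age=5 * (r["age"] // 5), zip=r["zip"][:3]) for r in original]

quasi_ids = ["age", "zip"]
for name, candidate in [("A", candidate_a), ("B", candidate_b)]:
    print(name,
          "risk:", round(risk_proxy(candidate, quasi_ids), 2),
          "utility:", round(utility_proxy(original, candidate, "age"), 2))
# A steward could keep only candidates whose risk falls below a chosen
# threshold and, among those, release the one with the highest utility.
```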
The literature on disclosure risk assessment highlights two main types of disclosures: (i) identification disclosures, which occur when an ill-intentioned user of the data, henceforth called an intruder, correctly identifies individual records in the released data, and (ii) attribute disclosures, which occur when an intruder learns the values of sensitive variables for individual records in the data. Attribute disclosures usually are preceded by identification disclosures; for example, when original attribute values are released, an intruder who correctly identifies a record learns that record's attribute values. Hence, the focus typically is on assessing identification disclosure risk. See the reports of the National Research Council [13, 14] and the Federal Committee on Statistical Methodology [15] for more information about attribute disclosure risks [16].

Identification disclosure risk measures are often based on estimates of the probabilities that individuals can be identified in the released data. Probabilities of identification are easily interpreted: the larger the probability, the greater the risk. Stewards determine their own thresholds for unsafe probabilities. A variety of approaches have been used to estimate these probabilities [17-22]. Individuals who are unique in the population (as opposed to the sample) are particularly at risk [23]. Therefore, much research has gone into estimating the probability that a sample-unique record is in fact a population-unique record [24, 25].

Data usefulness is usually assessed with two general approaches: (i) comparing broad differences between the original and released data, and (ii) comparing differences in specific models fit to the original and released data. Broad difference measures essentially quantify a statistical distance between the distributions of the data on the original and released files, for example a Kullback-Leibler or Hellinger distance [26]. As the distance between the distributions grows, the overall quality of the released data generally drops [27, 28]. Comparison of measures based on specific models is often done informally. For example, the steward examines the similarity of point estimates and standard errors of regression coefficients after fitting the same regression on the original data and on the data proposed for release. If the results are considered close, for example when the confidence intervals obtained from the two models largely overlap, the released data have high utility for that particular analysis [29]. Such measures are closely tied to how the data are used, but they provide a limited representation of the overall quality of the released data. Thus, it is prudent to examine models that represent a wide range of uses of the released data.

Risks can be reduced by altering the data using one or more statistical disclosure limitation strategies. Common approaches include coarsening, data swapping, noise addition, and synthetic data. Examples of coarsening include releasing geography as aggregated regions, reporting exact values only below or above certain thresholds (known as top or bottom coding), collapsing levels of categorical variables, and rounding numerical values. Coarsening, also called recoding, reduces disclosure risks by turning atypical records, which generally are most at risk, into typical records. Recodes frequently are specified to reduce estimated probabilities of identification to acceptable levels, e.g., so that no cell in a cross-tabulation of a set of key categorical variables has fewer than k individuals, where k often is three or five.
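As an illustration of the kind of threshold rule just described, the sketch below checks whether every cell of a cross-tabulation of key categorical variables contains at least k records and, if not, applies one round of coarsening (age bands with top coding, and collapsing a geographic code). The variable names, recodes, and the value of k are hypothetical and chosen only to keep the example small.

```python
# Illustrative sketch only: enforce a minimum cell count of k in the
# cross-tabulation of key categorical variables by coarsening.
# Variable names, recodes, and k are hypothetical.
from collections import Counter

def min_cell_count(records, key_variables):
    """Smallest cell count in the cross-tabulation of the key variables."""
    cells = Counter(tuple(r[v] for v in key_variables) for r in records)
    return min(cells.values())

def coarsen(record):
    """One round of recoding: 10-year age bands top-coded at 80, and
    collapsing the county code to a single-letter region."""
    r = dict(record)
    r["age"] = "80+" if r["age"] >= 80 else f"{10 * (r['age'] // 10)}s"
    r["county"] = r["county"][:1]
    return r

records = [{"age": 23, "county": "A01", "sex": "F"},
           {"age": 27, "county": "A02", "sex": "F"},
           {"age": 31, "county": "A01", "sex": "M"},
           {"age": 35, "county": "A02", "sex": "M"},
           {"age": 84, "county": "B07", "sex": "M"},
           {"age": 86, "county": "B03", "sex": "M"}]

k = 2  # stewards often use 3 or 5; 2 keeps this toy example small
keys = ["age", "county", "sex"]
if min_cell_count(records, keys) < k:
    release = [coarsen(r) for r in records]
else:
    release = records
assert min_cell_count(release, keys) >= k
```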
Data swapping refers to switching the data values for selected records with those for other records [30]. The protection afforded by swapping is based in large part on perception: data stewards expect that intruders will be discouraged from linking to external files, since any apparent matches could be incorrect because of the swapping. The details of the swapping mechanism are almost always kept secret to reduce the potential for reverse engineering.

Adding random noise to sensitive or identifying data values can reduce disclosure risks because errors are introduced into the released data, making it difficult for intruders to match on the released values with confidence [31, 32]. For numerical data, the noise typically is generated from a distribution with mean zero, so that point estimates of means remain unbiased, although the variance of those estimates may increase. However, zero-mean noise can still attenuate point estimates of correlations and regression coefficients.

Partially synthetic data methods replace some collected values, such as sensitive values at high risk of disclosure or values of key identifiers, with multiple imputations [8, 33-35]. For example, suppose that the steward wants to replace income whenever it exceeds $100,000, because the steward considers only these individuals to be at risk of disclosure, and is willing to release all other values as collected. (Alternatively, the steward could synthesize the key identifiers of the individuals with incomes exceeding $100,000, with the goal of reducing the risk that these individuals are identified.) The steward generates replacement values for the incomes over $100,000 by randomly simulating from the distribution of income conditional on all other variables. To avoid bias, this distribution must be conditional on income exceeding $100,000. The distribution is estimated using the collected data and possibly other relevant information. The result is one synthetic data set. The steward repeats this process multiple times and releases the multiple datasets to the public. The multiple datasets enable secondary analysts to account in their inferences for the uncertainty introduced by simulating replacement values. Among the restricted data techniques described here, partial synthesis is the newest and least commonly used.

A stronger version of synthetic data is full synthesis, in which all values in the released data are simulated [36, 37]. This approach may become appealing in the future if confidentiality concerns grow to the point where no original data can be released in unrestricted public use files.
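The sketch below illustrates the partially synthetic income example described above under deliberately simple assumptions: incomes above $100,000 are replaced with draws from a lognormal model fit within levels of a single covariate, and the conditioning on exceeding $100,000 is enforced by rejection sampling. The variable names and data are hypothetical, and a real application would use a much richer imputation model conditional on all other variables, as in the methods cited above.

```python
# Illustrative sketch only: partially synthetic data for the income example.
# Incomes above $100,000 are replaced by draws from a simple lognormal model
# fit within levels of one covariate; variable names and data are hypothetical.
import math
import random
import statistics

THRESHOLD = 100_000
m = 5  # number of synthetic data sets to release

records = [{"sector": "health", "income": 150_000},
           {"sector": "health", "income": 225_000},
           {"sector": "health", "income": 120_000},
           {"sector": "retail", "income": 45_000},
           {"sector": "retail", "income": 110_000},
           {"sector": "retail", "income": 130_000},
           {"sector": "retail", "income": 62_000}]

# Fit the log-scale mean and spread of the at-risk incomes within each sector.
params = {}
for sector in {r["sector"] for r in records}:
    logs = [math.log(r["income"]) for r in records
            if r["sector"] == sector and r["income"] > THRESHOLD]
    if logs:
        params[sector] = (statistics.mean(logs),
                          statistics.stdev(logs) if len(logs) > 1 else 0.25)

def draw_income(sector):
    """Draw a replacement income conditional on exceeding the threshold;
    rejection sampling keeps the conditioning simple in this sketch."""
    mu, sigma = params[sector]
    while True:
        value = math.exp(random.gauss(mu, sigma))
        if value > THRESHOLD:
            return round(value)

# Repeat the simulation m times, replacing only the at-risk incomes.
synthetic_datasets = [
    [dict(r, income=draw_income(r["sector"])) if r["income"] > THRESHOLD
     else dict(r) for r in records]
    for _ in range(m)]
```

As in the text, secondary analysts would analyze each of the m released data sets and combine the estimates so that their inferences reflect the uncertainty added by the simulation.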
4. Conclusion

To reproduce research conducted on confidential data, researchers likely will need access to the same confidential data used by the original investigators. Restricted data methods can be used in some cases, but for the purpose of reproducibility, restricted access methods hold more promise. This is not to say that restricted data methods do not have their place, as they enable many other benefits of data sharing. For both restricted data and restricted access, the methods that we described may be too cumbersome for individual researchers to implement independently. We believe it is incumbent upon research institutions and sponsors, journals, and other advocates of reproducibility to facilitate the sharing of confidential data by establishing procedures and policies for doing so, and perhaps by creating confidential data repositories. Authors who obtain their confidential data from such repositories could comply with a data sharing invitation such as the one put forth by Epidemiology by providing detailed instructions for obtaining the confidential data.

References

1. Hernan MA, Wilcox AJ. Epidemiology, data sharing, and the challenge of scientific replication [commentary]. Epidemiology. 2009; 20(2): 167-168.
2. Donoho DL. An invitation to reproducible computational research. Biostatistics. 2010; 11(3): 385-388.
3. Sedransk N, Young LJ, Kelner KL, Moffitt RA, Thakar A, Raddick J, Ungvarksy EJ, Carlson RW, Apweiler R, Cox LH, Nolan D, Soper K, Spiegelman C. Make research data public? – Not always so simple: A dialogue for statisticians and science editors. Statistical Science. 2010; 25(1): 41-50.
4. Sweeney L. Computational disclosure control: Theory and practice. PhD thesis, Department of Computer Science, Massachusetts Institute of Technology; 2001.
5. Kinney SK, Karr AF, Gonzalez JF. Data confidentiality: The next five years summary and guide to papers. Journal of Privacy and Confidentiality. 2009; 1(2): Article 1.
6. Gomatam S, Karr AF, Reiter JP, Sanil AP. Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access analysis servers. Statistical Science. 2005; 20: 163-177.
7. Reiter JP. Model diagnostics for remote-access regression servers. Statistics and Computing. 2003; 13: 371-380.
8. Reiter JP. New approaches to data dissemination: A glimpse into the future (?). Chance. 2004; 17(3): 12-16.
9. Reiter JP, Oganian A, Karr AF. Verification servers: Enabling analysts to assess the quality of inferences from public use data. Computational Statistics and Data Analysis. 2009; 53: 1475-1482.
10. Dwork C. Differential privacy. In: 33rd International Colloquium on Automata, Languages, and Programming, Part II. Berlin: Springer; 2006: 1-12.
11. Willenborg L, de Waal T. Elements of Statistical Disclosure Control. New York: Springer-Verlag; 2001.
12. Duncan GT, Keller-McNulty SA, Stokes SL. Disclosure risk vs. data utility: The R-U confidentiality map. Technical report. U.S. National Institute of Statistical Sciences; 2001.
13. National Research Council. Expanding Access to Research Data: Reconciling Risks and Opportunities. Panel on Data Access for Research Purposes, Committee on National Statistics, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press; 2005.
14. National Research Council. Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data. Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data, Committee on the Human Dimensions of Global Change, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press; 2007.
15. Federal Committee on Statistical Methodology. Statistical policy working paper 22: Report on statistical disclosure limitation methodology (second version). Washington, DC: Confidentiality and Data Access Committee, Office of Management and Budget, Executive Office of the President; 2005.
16. Lambert D. Measures of disclosure risk and harm. Journal of Official Statistics. 1993; 9: 313-331.
17. Duncan GT, Lambert D. Disclosure-limited data dissemination. Journal of the American Statistical Association. 1986; 81: 10-28.
18. Duncan GT, Lambert D. The risk of disclosure for microdata. Journal of Business and Economic Statistics. 1989; 7: 207-217.
19. Fienberg S, Makov UE, Sanil AP. A Bayesian approach to data disclosure: Optimal intruder behavior for continuous data. Journal of Official Statistics. 1997; 13: 75-89.
20. Domingo-Ferrer J, Torra V. Disclosure risk assessment in statistical microdata via advanced record linkage. Statistics and Computing. 2003; 13: 111-133.
21. Reiter JP. Estimating risks of identification disclosure for microdata. Journal of the American Statistical Association. 2005; 100: 1103-1113.
22. Shlomo N, Skinner CJ. Assessing the protection provided by misclassification-based disclosure limitation methods for survey microdata. Annals of Applied Statistics. 2009; 4: 1291-1310.
23. Bethlehem JG, Keller WJ, Pannekoek J. Disclosure control of microdata. Journal of the American Statistical Association. 1990; 85: 38-45.
24. Elamir E, Skinner CJ. Record-level measures of disclosure risk for survey microdata. Journal of Official Statistics. 2006; 85: 38-45.
25. Skinner CJ, Shlomo N. Assessing identification risk in survey microdata using log-linear models. Journal of the American Statistical Association. 2008; 103: 989-1001.
26. Shlomo N. Statistical disclosure control for census frequency tables. International Statistical Review. 2007; 75: 199-217.
27. Domingo-Ferrer J, Torra V. A quantitative comparison of disclosure control methods for microdata. In: Doyle P, Lane J, Zayatz L, Theeuwes J, eds. Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Amsterdam: North-Holland; 2001: 111-133.
28. Woo M, Reiter JP, Oganian A, Karr AF. Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality. 2009; 1(1): 111-124.
29. Karr AF, Kohnen CN, Oganian A, Reiter JP, Sanil AP. A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician. 2006; 60: 224-232.
30. Dalenius T, Reiss SP. Data-swapping: A technique for disclosure control. Journal of Statistical Planning and Inference. 1982; 6: 73-85.
31. Brand R. Micro-data protection through noise addition. In: Domingo-Ferrer J, ed. Inference Control in Statistical Databases. New York: Springer; 2002: 97-116.
32. Kim JJ. A method for limiting disclosure in microdata based on random noise and transformation. In: Proceedings of the Section on Survey Research Methods, American Statistical Association; 1986: 370-374.
33. Little RJA. Statistical analysis of masked data. Journal of Official Statistics. 1993; 9: 407-426.
34. Reiter JP, Raghunathan TE. The multiple adaptations of multiple imputation. Journal of the American Statistical Association. 2007; 102: 1462-1471.
35. Rubin DB. Statistical disclosure limitation. Journal of Official Statistics. 1993; 9: 461-468.
36. Raghunathan TE, Reiter JP, Rubin DB. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics. 2003; 19: 1-16.
37. Reiter JP. Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study. Journal of the Royal Statistical Society, Series A. 2005; 168: 185-205.

About the Authors

Jerome Reiter is the Mrs. Alexander Hehmeyer Associate Professor of Statistical Science at Duke University and the current chair of the American Statistical Association Committee on Privacy and Confidentiality. Satkartar Kinney is a research scientist at the National Institute of Statistical Sciences. Their research focuses on methodology for protecting confidential data.
They have worked closely with the Census Bureau to develop a public use file for the Longitudinal Business Database, the first ever public use, establishment-level data product in the U.S.