Preserving Patient Privacy in Biomedical Data Analysis

by

Sean Kenneth Simmons

B.S., University of Texas (2011)

Submitted to the Department of Mathematics in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Applied Mathematics at the Massachusetts Institute of Technology, September 2015.

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Signature redacted. Department of Mathematics, July 21st, 2015.

Certified by: Signature redacted. Bonnie Berger, Professor of Applied Mathematics, Thesis Supervisor.

Accepted by: Signature redacted. Peter Shor, Chairman, Applied Mathematics Committee.

Abstract

Submitted to the Department of Mathematics on July 21st, 2015, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Applied Mathematics.

The growing number of large biomedical databases and electronic health records promise to be an invaluable resource for biomedical researchers. Recent work, however, has shown that sharing this data, even when aggregated to produce p-values, regression coefficients, count queries, and minor allele frequencies (MAFs), may compromise patient privacy. This raises a fundamental question: how do we protect patient privacy while still making the most out of their data?

In this thesis, we develop various methods to perform privacy preserving analysis on biomedical data, with an eye towards genomic data. We begin by introducing a model-based measure, PrivMAF, that allows us to decide when it is safe to release MAFs. We modify this measure to deal with perturbed data, and show that we are able to achieve privacy guarantees while adding less noise (and thus preserving more useful information) than previous methods.

We also consider using differentially private methods to preserve patient privacy. Motivated by cohort selection in medical studies, we develop an improved method for releasing differentially private medical count queries. We then turn our eyes towards differentially private genome-wide association studies (GWAS). We improve the runtime and utility of various privacy preserving methods for genome analysis, bringing these methods much closer to real world applicability. Building on this result, we develop differentially private versions of more powerful statistics based on linear mixed models.

Thesis Supervisor: Bonnie Berger
Title: Professor of Applied Mathematics

Acknowledgments

First and foremost, I would like to thank my thesis supervisor, Bonnie Berger, for all her help and guidance throughout the years. I couldn't have done it without her and her constant encouragement! I would also like to thank everyone in the Berger lab, past and present, who have not only helped me grow as a researcher, but who were there to help me keep my sanity. Thank you all!
In particular, thanks to PoRu, George and Jian for helping get me started on research; to Bianca, Sumaiya, Sepehr, Deniz, and William for all the conversations I had with them about things both biology related and not; to Noah for all the advice (in terms of writing papers, performing research, and deciding what I should do career-wise) over the years, as well as to Rachel Daniels for all her help in editing; to Yaron for all the interesting conversations we've had over the past year about RNA-protein interactions, among many other topics; and to Patrice for everything (you really do keep the lab from exploding!). I am also indebted to Jadwiga for being a great collaborator and source of advice over the past few years; I always looked forward to our weekly meetings! Thanks also to Vinod Vaikuntanathan and Jon Kelner for agreeing to be on my thesis committee. I'm also indebted to all the administrators in the math department who have helped me over the years, and to my friends in the REFS program.

Thanks also to all the great advisers and mentors I've had over the years outside of MIT. This includes Dr. Cline, Dr. Laude and Dr. Vick at UT for all their great advice and interesting conversations. In addition, I owe a debt to Dr. Blanchet-Sadri at UNCG for the three summers that introduced me to research and lots of interesting problems. Along the same lines, I want to thank my mentors at NKU and the NSA for their mentorship during my stays there, and Dr. Gordon and Dr. Helleloid at UT for all they did as undergraduate research advisers (albeit at different points in my education).

I wouldn't have made it through my PhD without all my friends in the CS and Math departments at MIT, including John, Hans, Ruthi, Padma, Adrian, and many others. Thanks for all the companionship and advice! Thanks also to my friends from outside MIT, Jessie, Rachel, Anna, Ariel, Dan, and many others, for helping me keep my sanity. In particular, thanks to Lupe. I'm not sure how I would have kept my sanity these past few months without you around!

I would also like to thank my parents for all they have done over the years; I wouldn't be here without their support. I'd even like to thank my sisters: a little sibling rivalry always helps to motivate! Finally, I would like to thank the NSF and the MIT mathematics department for their generous funding.

Contents

1 Introduction
  1.1 Model Based Approaches
  1.2 Model Free Approaches
  1.3 The Place of Differential Privacy in Biomedical Research

2 Background
  2.1 Biology Background
    2.1.1 Basic Genetics
    2.1.2 GWAS
    2.1.3 The Rise of Electronic Health Records
  2.2 Privacy Background
    2.2.1 Differential Privacy
    2.2.2 Other Approaches to Privacy
  2.3 Privacy Concerns and Biomedical Data
  2.4 Previous Applications of Privacy Preserving Approaches to Biomedical Data
    2.4.1 HIPAA and Other Legislative Approaches
    2.4.2 Access Control
    2.4.3 Differential Privacy
  2.5 Other approaches
3 One Size Doesn't Fit All: Measuring Individual Privacy in Aggregate Genomic Data
  3.1 Introduction
    3.1.1 Previous work
    3.1.2 Our Contribution
  3.2 Methods
    3.2.1 The Underlying Model
    3.2.2 Measuring Privacy of MAF
    3.2.3 Measuring Privacy of Truncated Data
    3.2.4 Measuring Privacy of Adding Noise
    3.2.5 Choosing the Size of the Background Population
    3.2.6 Release Mechanism
    3.2.7 Simulated Data
  3.3 Results
    3.3.1 Privacy and MAF
    3.3.2 Privacy and Truncation
    3.3.3 Privacy and Adding Noise
    3.3.4 Worst Case Versus Average
    3.3.5 Comparing β to α
    3.3.6 Reidentification Using PrivMAF
  3.4 Conclusion
  3.5 Derivation of Methods
    3.5.1 Basic Model
    3.5.2 PrivMAF
    3.5.3 PrivMAF for Data with Noise Added
    3.5.4 PrivMAF for Data with Truncation
    3.5.5 Comparison to previous approaches
    3.5.6 A Release Mechanism: Allele Leakage Guarantee Test
    3.5.7 Changing the Assumptions
    3.5.8 Estimating the parameters

4 Improved Privacy Preserving Counting Queries in Medical Databases
  4.1 Introduction
  4.2 Previous Work
  4.3 Exponential Mechanism
  4.4 Derivation of Mechanism
  4.5 Theoretical Comparison
  4.6 Results
  4.7 Choosing ε
  4.8 Conclusion

5 Picking Top SNPs Privately with the Allelic Test Statistic
  5.1 Introduction
    5.1.1 Previous Work
  5.2 Our Contributions
  5.3 Set Up
  5.4 GWAS Data
  5.5 Picking Top SNPs with the Neighbor Mechanism
  5.6 Fast Neighbor Distance Calculation with Private Genotype Data
    5.6.1 Method Description
    5.6.2 Proof Overview
    5.6.3 Proof for Significant SNPs
    5.6.4 Proof for Non-Significant SNPs
  5.7 Results: Applying Neighbor Mechanism to Real-world Data
    5.7.1 Measuring Utility
    5.7.2 Comparison to Other Approaches
    5.7.3 Comparison to Arbitrary Boundary Value
    5.7.4 Runtime
  5.8 Output Perturbation
    5.8.1 Calculating Sensitivity
    5.8.2 Output Perturbation
    5.8.3 Input Perturbation
  5.9 Picking Top SNPs with the Laplacian Mechanism

6 Correcting for Population Structure in Differentially Private GWAS
  6.1 Introduction
  6.2 Previous Work
  6.3 Our Contributions
  6.4 GWAS and LMM
  6.5 Achieving Differential Privacy Attempt One: Laplace Mechanism
  6.6 Achieving Differential Privacy Attempt Two: Exponential Mechanism
  6.7 Results: Testing Our Method
  6.8 Picking Top SNPs
    6.8.1 The Approach
    6.8.2 Application to Data
  6.9 Estimating σ_e and σ_g in a differentially private manner
  6.10 Conclusion and Future Work

7 Conclusion

List of Figures

1-1 An adversary cannot get access to medical data directly. Instead, by looking at published analyses (or querying the system with seemingly legitimate queries) they are able to gain access to private information. The data under consideration might be part of a large biomedical repository, a hospital's health records, or some other source.

2-1 An example of a snippet of the genome

2-2 An example of a SNP

3-1 PrivMAF applied to the WTCCC dataset. In all plots we take n = 1,000 research subjects and a background population of size N = 100,000. (a) Our privacy measure PrivMAF increases with the number of SNPs. The blue line corresponds to releasing MAFs with no rounding, the green line to releasing MAFs rounded to one decimal digit, and the red line to releasing MAFs rounded to two decimal digits. Rounding to two digits appears to add very little to privacy, whereas rounding to one digit achieves much greater privacy gains. (b) The blue line corresponds to releasing MAFs with no noise, the red line to releasing MAFs perturbed with noise at ε = .5, and the green line at ε = .1. Adding noise corresponding to ε = .5 seems to add very little to privacy, whereas taking ε = .1 achieves much greater privacy gains.

3-2 Truncating simulated data to demonstrate scaling. We plot our privacy measure PrivMAF versus the number of SNPs for simulated data with n = 10,000 subjects and a background population of size N = 1,000,000. The green line corresponds to releasing MAFs with no rounding, the blue line to releasing MAFs rounded to three decimal digits, and the red line to releasing MAFs rounded to two decimal digits. Rounding to three digits seems to add very little to privacy, whereas rounding to two digits achieves much greater privacy gains.

3-3 Worst Case Versus Average Case PrivMAF. Graph of the number of SNPs, denoted m, versus PrivMAF. The blue curve is the maximum value of PrivMAF(d, MAF(D)) taken over all d ∈ D for a set of n = 1,000 randomly chosen participants in the British Birth Cohort, while the green curve is the average value of PrivMAF(d, MAF(D)) in the same set. The maximum value of PrivMAF far exceeds the average; by the time m = 1,000 it is almost five times larger.

3-4 ALGT applied to the WTCCC dataset. A graph of the uncorrected threshold, α, versus the corrected threshold, β = β(α), from ALGT is given in blue. The green line corresponds to an uncorrected threshold. We see that for some choices of α, correction may be desired. For example, for α = .05 the corrected threshold is approximately β = .03.
Here we again use the British Birth Cohort with n = 1,000 study participants, m = 1,000 SNPs, and a background population of size N = 100,000.

3-5 ROC Curves of PrivMAF and Likelihood Ratio. ROC curves obtained using PrivMAF (green triangles) and the likelihood ratio method (red circles) to reidentify individuals in the WTCCC British Birth Cohort with n = 1,000 study participants and 1,000 SNPs.

3-6 ROC Curves of PrivMAF with Truncation. ROC curves obtained using PrivMAF for reidentification of unperturbed data (in red, AUC = .686), data truncated after two decimal digits (aka k = 2, in blue, AUC = .682), and data truncated after one decimal digit (aka k = 1, in green, AUC = .605). We see that truncation can greatly decrease the effectiveness of reidentification. Note that the ROC of the unperturbed data here is different from that in the previous figure. This is because we used a different random division of our data in each case.

3-7 ROC Curves of PrivMAF with noisy data. ROC curves obtained using PrivMAF for reidentification of unperturbed data (in red, AUC = .696), with noise corresponding to ε = .5 (in green, AUC = .693), and with ε = .1 (in blue, AUC = .656). We see that adding noise can decrease the effectiveness of reidentification. Note that the ROC of the unperturbed data here is different from that in the previous figures. This is because we used a different random division of our data in each case.

4-1 Here we plot privacy parameter ε versus the risk (where the risk is on a log scale) for the naive exponential mechanism (blue) and our mechanism with search parameters k = 10 (green) and k = 100 (red). We see that in all cases ours performs much better than the naive exponential mechanism.

4-2 We plot r_max, the maximum value returned by our algorithm, versus the runtime (both on a log scale) for the naive exponential mechanism (blue) and our algorithm with search parameters k = 10 (green) and k = 100 (red). We see that, though our algorithm is slower, it still runs in a few minutes in all cases.

4-3 We plot privacy parameter ε versus the parameter μ (which increases with utility) for both the naive exponential mechanism (blue) as well as our analysis with k = 10 (green) and k = 100 (red). Note that μ corresponds to utility (a higher μ means a higher utility), while a higher ε corresponds to less privacy. We see that our analysis shows that a given ε corresponds to a larger μ, which means that in many algorithms that balance utility and privacy when choosing ε, our analysis will result in adding less noise to the answer of the count query.

5-1 Our algorithm for finding the solution of our relaxed optimization problem relies on the fact that there are only two possible types of solutions: (a) extreme point solutions and (b) tangent point solutions. Our algorithm finds all such extreme points and tangent points, and iterates over them to find the solution to our relaxed optimization problem.

5-2 We measure the performance of our modified neighbor method for picking top SNPs (red) as well as the score based (blue) and Laplacian based (green) methods for m_ret (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15 for varying values of ε. For m_ret = 3, 5 we consider ε between 0 and 5, while in the other cases we consider ε between 0 and 30.
We see that in all four graphs our method leads to the best performance by far. These results are averaged over 20 iterations.

5-3 We measure the performance of our modified neighbor method for picking top SNPs (in red) as well as the traditional neighbor method with cutoffs corresponding to a Bonferroni corrected p-value of .05 (in green) and .01 (in blue) for m_ret (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15 for varying values of ε. For m_ret = 3, 5 we consider ε between 0 and 5, while in the other cases we consider ε between 0 and 30. We see that in the first three cases the traditional method slightly outperforms ours. When m_ret = 15, however, the traditional methods can only get maximum utility around .85, whereas ours can get utility arbitrarily close to 1. This shows how we are able to overcome one of the major concerns about the neighbor method with only minimal cost. These results are averaged over 20 iterations.

5-4 Comparing two forms of output perturbation in scenario 2: the first coming from applying the Laplace mechanism directly to the allelic test statistic (green), the other applying it to the square root of the allelic test statistic then squaring the result (blue), comparing the L1 error on the y axis with ε on the x. We first apply it to 1000 random SNPs (a), then to the top ten highest scoring SNPs (b). We see that in both cases applying the mechanism to the square root outperforms the standard approach.

5-5 Comparing the output perturbation of the allelic test statistic for scenarios 1 and 2, comparing the L1 error on the y axis with ε on the x. In scenario 2 (the blue curve) we add the noise to the square root then square the result, whereas for scenario 1 (the green curve) we apply the Laplacian mechanism directly to the test statistic (this choice is motivated by the previous figures). We first apply it to 1000 random SNPs (a), then to the top ten highest scoring SNPs (b). We see that in scenario 2 we require much less noise than scenario 1.

5-6 We measure the performance of the Laplacian method for picking top SNPs in scenarios 1 (in blue) and 2 (in green) with m_ret (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15 for varying values of ε. For m_ret = 3, 5 we consider ε between 1 and 10, while in the other cases we consider ε between 10 and 100. We see in all four graphs that scenario 2 leads to the best performance. Scenario 1, which is the one that appeared in previous work, leads to a greater loss of utility. These results are averaged over 100 iterations.

5-7 Comparing the output perturbation of the allelic test statistic in scenario 2 (blue) to the input perturbation method in scenario 1 (green). We see that in this case, as opposed to previous cases, scenario 1 outperforms scenario 2 despite requiring stronger privacy guarantees. This demonstrates that input perturbation is preferable to output perturbation.

5-8 Comparing the input perturbation of the allelic test statistic for scenarios 1 (green) and 2 (blue), comparing the L1 error on the y axis with ε on the x. In all cases we use input perturbation. We see that scenario 1 requires more noise to be added.
6-1 Comparing the output perturbation of the Laplacian based method (green) with our neighbor based method (blue) with σ² = .5, with both (a) 1000 random SNPs and (b) the causative SNPs. We see that our method performs better in both cases for the choices of ε considered.

6-2 We measure the performance of the three methods for picking top SNPs using score (blue), neighbor (red) and Laplacian (green) based methods with m_ret (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15 for varying values of ε between 10 and 100. We see in all four graphs that the score method leads to the best performance, followed by the neighbor mechanism. These results are averaged over 20 iterations.

List of Tables

5.1 We demonstrate the runtime of our exact method as well as the approximate method for various boundaries (where the boundary at m_ret is the average of the m_ret-th and (m_ret + 1)-st highest scoring SNPs), as well as the average L1 error per SNP that comes from using the approximate method. We see that our exact method is much faster than the approximate method. In addition, its runtime is fairly steady for all choices of m_ret. We see that the approximate method becomes faster and more accurate for larger m_ret; this makes sense since the average SNP will be closer to the boundary, so there will be less loss. These results are averaged over 20 trials.

Symbol       Description
MAF          Minor Allele Frequency
SNP          Single Nucleotide Polymorphism
EHR          Electronic health record
PrivMAF      Our privacy measure
σ_e^2        Variance due to the environment
σ_g^2        Variance due to genetics
θ            The tuple (σ_e, σ_g)
K            The matrix (σ_g^2 / m) X X^T + σ_e^2 I_n
D            Study cohort data (usually genomic)
X            The normalized genotype matrix
1_n          The n by n matrix of all ones
I_n          The n-dimensional identity matrix
N(μ, Σ)      The multivariate normal distribution with mean μ, covariance Σ
exp(x)       The exponential of x
P(S)         The probability of S
Lap(λ)       The one-dimensional Laplacian distribution with mean 0 and scale λ
Lap_n(λ)     An n-dimensional random variable whose entries are drawn from Lap(λ)
Δq           The sensitivity of q
R            The number of cases in a case-control study
S            The number of controls in a case-control study
r_i          The number of cases with genotype i at a given SNP, i ∈ {0, 1, 2}
s_i          The number of controls with genotype i at a given SNP, i ∈ {0, 1, 2}
Y            The allelic test statistic
det          The determinant
m            The number of SNPs under consideration
m_ret        The number of SNPs a given algorithm returns
PPV          Positive Predictive Value

Chapter 1

Introduction

In the last few decades, the use of sensitive patient data in biomedical research has been on the rise [55, 24, 103, 28, 87, 72]. Spurred by both the genomics revolution and the rise of electronic health records (among other new sources of medical data [71]), this valuable data holds the promise of revolutionizing medical research, giving rise to the possibilities of personalized medicine and understanding the genetic basis of various diseases. This growth, however, has led to privacy concerns among patients and care providers alike [80, 79, 3, 10, 83]. Medical records, for example, often contain sensitive data (disease status, etc.) that patients might not want revealed. It has been shown that this concern leads many patients to withhold valuable medical information from doctors.
In order to get patients to participate in research we need to find ways of balancing privacy with the needs of the medical community. At first glance this seems like an easy task: de-identifying the data (removing identifiers such as names, social security numbers, etc.) is an obvious solution. Again and again, however, it has been shown that such approaches can be circumvented [18, 91, 59]. Perhaps the most famous example of this occurred when Latanya Sweeney used voter registration records to reidentify supposedly anonymized medical records belonging to the Governor of Massachusetts [92]. More recently, Gymrek et al. were able to use deidentified genomic data in conjunction with public databases to reidentify many participants in genetic studies [59].

Figure 1-1: An adversary cannot get access to medical data directly. Instead, by looking at published analyses (or querying the system with seemingly legitimate queries) they are able to gain access to private information. The data under consideration might be part of a large biomedical repository, a hospital's health records, or some other source.

Surprisingly, even aggregate data can be a privacy risk. Korolova showed that count queries can give away private information about Facebook users [70]. In the biomedical realm it has been shown that various genomic statistics (such as MAFs [60] and regression coefficients [63]) can lead to privacy concerns as well. This realization has led various groups, including the National Institutes of Health (NIH), to pull aggregate data from public databases and put it in access-controlled databases [57]. Though such precautions help preserve privacy, they also take a toll on the scientific community. The medical community seems to be at an impasse: how can we allow researchers access to medical data (data that could possibly save lives) while still ensuring the privacy of the individuals involved?

In this thesis we consider various methods for allowing researchers access to confidential medical data while still preserving patient privacy. After introducing some important biological and privacy background in Chapter 2, we develop possible approaches to this problem.

1.1 Model Based Approaches

The work of Homer et al. [60] demonstrated that even simple aggregate statistics from genomic data can lead to privacy issues. In particular, they showed that minor allele frequencies (MAF, the prevalence of a given allele in the study population) can give away private information about study participants. This realization shocked the biological community, leading many (including the NIH) to move public aggregate data into controlled-access repositories. Since then, many researchers have looked into methods that allow the release of this aggregate data while still ensuring privacy [19, 108, 88, 107, 95, 89]. Though many of these methods are powerful, they have multiple drawbacks: either requiring so much noise to be added to the data that it destroys utility, or not ensuring privacy for all study participants.

In Chapter 3 we introduce a new, model-based method that aims to overcome both of these barriers. This method, known as PrivMAF, allows us to measure the privacy lost by each individual in a study after releasing MAFs. We are able to apply this measure to real data, showing that the level of privacy loss varies wildly between individuals.
PrivMAF can also be modified to measure the privacy gained by adding noise to the MAFs from a study, allowing us to ensure privacy with less utility loss than required by model free methods such as differential privacy (see below). Finally, we go on to show how PrivMAF can be used to decide when it is safe to release the MAFs from a study, giving confidence to researchers who want to ensure privacy while still being able to perform useful research.

1.2 Model Free Approaches

Model based approaches to privacy allow us to use our beliefs about possible adversaries (assumptions about what they know, what they don't, etc.) in order to increase the amount of data that we can safely release. Unfortunately, such approaches do not protect against all possible adversaries. To do that, model free approaches, such as differential privacy [17, 95], are required. Loosely speaking, differential privacy works by releasing a perturbed statistic that cannot be used to distinguish between a database containing a certain individual and one not containing that individual, thus ensuring privacy (see Chapter 2 for a more formal treatment). It has been suggested [67, 97] that differentially private statistics could be used as a way to allow researchers access to EHR and other medical data without violating privacy. In Chapters 4-6 we look at various ways this can be achieved in practice, thus improving on the state of the art both in terms of accuracy and computational efficiency.

One of the main areas where differential privacy has been considered is in study design. Researchers often want to query medical databases to figure out how many patients in a particular hospital qualify as study participants. In order to allow this while still preserving privacy, various systems, such as i2b2 and STRIDE [103, 28], release noisy versions of these count queries. Unfortunately, most of these noisy mechanisms are ad hoc and do not provide formal privacy guarantees. More recently, it has been suggested that one could release differentially private versions of these count queries to help preserve privacy while retaining utility [97]. In Chapter 4, we show how a modified version of this differentially private mechanism allows for more utility while still ensuring privacy.

With the projected rise of genotyping in the clinic [103], it has been suggested that the data resulting from such tests might be used to further genomic research. Again, however, such uses raise privacy concerns for those involved. As a result, various authors [95, 106, 67] have studied ways of using this genomic data in a privacy preserving way. In Chapters 5 and 6 we further investigate this line of research, improving on the state of the art.

Most previous work has focused on using differentially private versions of the allelic test statistic to perform genomic studies in a privacy preserving fashion [67, 95, 66, 105, 102]. In Chapter 5 we improve upon these approaches. Our main contribution is to improve the accuracy and computational efficiency of the neighbor method, first introduced by Johnson and Shmatikov [67], for picking high scoring SNPs. We are able to use an adaptively selected boundary value to remove problems arising with the accuracy of this method. Moreover, we present a new algorithm, based loosely on convex optimization, to show that the neighbor method can be made computationally feasible (more specifically, we show that, for a given SNP, the neighbor distance can be calculated in constant time).
We then go on to compare the utility trade-offs for two different privacy scenarios that occur in the literature: one in which we assume that all genomic data is private versus one in which only genomic data from the case cohort is private [95, 105]. We see that, if we allow some leakage of information about the control cohort, we can greatly improve the Laplacian based method for picking high scoring SNPs introduced by Uhler et al. [95]. Moreover, in this case we see that we can modify the traditional approach for output perturbation in order to better estimate the allelic test statistic for a chosen SNP. The final contribution of this chapter is to show that, in both scenarios, one can better estimate the allelic test statistic using input perturbation instead of output perturbation. This novelty allows us to get better estimates for the significance of a given SNP.

The allelic test statistic is a powerful tool for better understanding the genetic underpinnings of disease. In more diverse populations, however, the presence of multiple subpopulations (people of different ethnicities, etc.) can lead to many false positives. Alternative statistics have been suggested for dealing with this problem [21, 27, 23, 94, 43]. Many of these are based on linear mixed models (LMM) [27, 23, 94, 43]. Though there has been some work on fitting LMM in a differentially private way [4], no work has attempted to use differentially private LMM in order to perform association analysis.
In a nutshell, the idea is that users who might want to use a large medical database to help design a study (for example, to come up with hypotheses to test, researchers must first find participants with certain traits for a study) or validate results can do so by sending queries to the database and getting differentially private answers to those queries. 26 This ability also allows researchers to better determine if a given dataset would be useful to them before going through the, often arduous, task of requesting access [96]. This approach allows researchers access to the database while minimizing privacy concerns. As an added bonus, since the queries are being used as a preliminary step, as opposed to being part of a rigorous analysis, there is less concern about the ethical implications of returning inaccurate results. Furthermore, systems such a GUPT [441 can be used to allow users to make any query on the database they desire-although using such a general architecture does greatly decrease accuracy. Therefore, it makes sense to come up with specialized methods such as the ones we introduce here for dealing with common queries. 27 Go Cl Chapter 2 Background In this thesis we will be using several concepts from basic genetics and the privacy community. To help orient the reader we give a brief overview of these concepts here. After introducing some basic concepts related to genomics and electronic health records in the first few sections, we go on to introduce some basic concepts from the privacy community. Finally, we finish off the chapter with an overview of some of the privacy concerns facing the biomedical community today. 2.1 2.1.1 Biology Background Basic Genetics A large portion of human biological information is coded in the genome. For our purposes the genome can be thought of as a string consisting of the letters A, C, T and G. Each of these letters is known as a nucleotide. Note that each location of the human genome has two copies, one inherited from our father, one from our mother (ignoring mitochondrial DNA, cancer, etc). A single nucleotide polymorphism (SNP) is a location in the genome that differs between different people in the human population. For a given SNP, an allele is one possible value of the genome at that location. For most common SNPs there are at most two alleles, one more common one (the major allele) and one less common (the 29 ACTTTG CCCCATAAAA ACTTTGCCCCGTAAAA Figure 2-1: An example of a snippet of the genome A CTTTG CC CCAT A A A A ACTTTGCCCCGTAAAA Figure 2-2: An example of a SNP minor allele). At a given SNP each individual can have either 0, 1 or 2 copies of the minor allele. As such, the genotype of an individual is often thought of as a high dimensional vector with one dimension for each SNP, where each entry in that vector is either 0, 1 or 2. 2.1.2 GWAS One of the main aims of modern genomics is to link genomic polymorphism to disease [27]. Increasingly, it is becoming possible to achieve this goal with genome-wide association studies (GWAS). The basic idea behind GWAS is simple: A researcher collects a large cohort of people, and measures some trait, referred to as the phenotype (either a continuous trait like height or a discrete one like disease status). The researcher also collects genotype data about each individual. It is then a simple matter to take the data and perform a statistical test (such as linear regression, logistic regression, x 2 , allelic test statistic, etc.) to see which SNPs are related to the trait under consideration. 
Unfortunately, there are some complications that arise when performing this analysis naively. The fact that human genetic history is quite messy (due to the many subpopulations of humans that have existed over time) leads to many associations that are not relevant to the underlying biology. This problem is known as population structure.

As an example of how population structure may complicate things, consider performing a GWAS related to height [43]. Assume we have a study cohort containing Northern European and East Asian individuals. Thanks to an accident of history, Northern Europeans tend to be taller than East Asians. Moreover, Northern Europeans are more likely to have a mutated version of the lactase gene that allows them to drink milk as adults. This would lead a naive analysis to suggest that height is related to the lactase gene, an association that seems meaningless biologically (discounting, perhaps, the effects of drinking milk on height).

In order to avoid these false positives, numerous methods have been suggested. These include rescaling the results of the χ² test to account for such associations [16] and adding principal components as covariates in linear or logistic regression [21, 94]. A more recent solution has been to use linear mixed models to correct for this structure [27, 23, 94, 43]. Assume that X is a given n × m genotype matrix with each column corresponding to a SNP and each row to an individual (normalized so each column has mean 0 and variance 1; other normalizations are also used, but we use this one for simplicity), y a phenotype, and ỹ the mean centered phenotype. Then a linear mixed model is of the form

$$\tilde{y} = X\beta + \epsilon,$$

where $\beta \sim N(0, (\sigma_g^2/m) I_m)$ and $\epsilon \sim N(0, \sigma_e^2 I_n)$ for some parameters $\sigma_g$ and $\sigma_e$. One can fit this model and use it to try to find which SNPs are associated with y (more details are given in later chapters). Furthermore, the estimated parameters σ_e and σ_g are of biological interest, since they help researchers figure out how much of a trait's variance is due to genetics (aka how heritable it is) [30].

This model has two main upsides. The first is that it is able to help correct for population stratification [31]. At the same time, this model gives the user more statistical power by including information about all the SNPs instead of considering them one at a time. Note that there are many different variations of this model [27, 23, 94, 43], each of which has different drawbacks and benefits in terms of statistical power, ability to correct for population stratification, and run time. As a final note, the above model assumes y is a continuous phenotype. It turns out, however, that this model can also be applied to discrete traits (such as disease status) using the liability threshold framework [48]. (A short simulation sketch of this model is given at the end of Section 2.1.)

2.1.3 The Rise of Electronic Health Records

In the past decade electronic health records (EHR) have become standard. Driven both by new technologies and various government programs [1], the hope was that, by replacing paper records, EHR would improve medical care for everyone. One of the major benefits of EHR is that they make secondary use of clinical data for scientific studies feasible. Programs such as i2b2, eMERGE and STRIDE [103, 28, 72] have been established to help use this patient data to gain insight into the basic biology behind human disease. Unfortunately, this data tends to be noisy and incomplete, leading to many biases. Despite these drawbacks there have been many success stories, such as using these records to identify disease subtypes [24] or to better identify victims of domestic abuse [85]. Moreover, in order to make better use of these records there have been attempts, such as the MIMIC database [87], to release deidentified versions of these records that academics can use in their research without going through an involved application process.
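As promised in Section 2.1.2, here is a minimal simulation sketch of the linear mixed model above. The cohort size, SNP count, and variance components are illustrative values of ours, not values used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 1000                       # individuals, SNPs (illustrative)
sigma_g2, sigma_e2 = 0.6, 0.4          # illustrative variance components

# Draw genotypes in {0, 1, 2}, then normalize each column to mean 0, variance 1
# (with these MAFs and n = 500, monomorphic columns are vanishingly unlikely).
p = rng.uniform(0.05, 0.5, size=m)     # minor allele frequencies
G = rng.binomial(2, p, size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

# y~ = X beta + eps, with beta ~ N(0, (sigma_g^2 / m) I_m), eps ~ N(0, sigma_e^2 I_n)
beta = rng.normal(0.0, np.sqrt(sigma_g2 / m), size=m)
eps = rng.normal(0.0, np.sqrt(sigma_e2), size=n)
y = X @ beta + eps

# Marginally y ~ N(0, K) with K = (sigma_g^2 / m) X X^T + sigma_e^2 I_n,
# so the empirical variance of y should be close to sigma_g2 + sigma_e2.
K = (sigma_g2 / m) * X @ X.T + sigma_e2 * np.eye(n)
print(y.var(), np.diag(K).mean())
```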
Despite these drawbacks there have been many success stories- such as using these records to identify disease subtypes [24] or using them to better identify victims of domestic abuse [85J. Moreover, in order to make better use of these records there have been attempts, such as the MIMIC database 1871, to release deidentified versions of these records that academics can use in their research without going through an involved application process. 32 2.2 2.2.1 Privacy Background Differential Privacy Assume that we have a dataset D = (di,... , dn) E Dn and want to calculate f(D) for some f f: D" -+ Q , where Q and ID are both sets. This is simple enough (assuming is easy to calculate). It may be the case, however, that f(D) releases private information about di for some i. For example, if D is a set of patients with a given condition, then f(D) may reveal the fact that di is in D, and thus has the condition. In order to deal with this worry we want to release a perturbed version of f, let us call it F, that does not have the same privacy concerns. This idea is formalized using differential privacy 117]. We say that D and D' = (d', ... , d') are neighboring databases if both are the same size and differ in exactly one entry (aka there is exactly one i such that di # d'). We then have the following definition. Definition 1. A random function F : IDn -+ Q is c-differentially private for some e > 0 if, for all neighboring databases D and D' and all sets S C Q, we have that P(F(D) E S) < exp(c)P(F(D') E S) Intuitively, the above definition says that, if D and D' differ by one entry, then F(D) and F(D') are statistically hard to distinguish. This property ensures that no individual has too large an influence on F(D), so they can not lose much privacy. The parameter c is a privacy parameter: the closer to 0 it is the more privacy is ensured, while the larger it is the weaker the privacy guarantee. This means we would like to set e as small as possible, but unfortunately this comes at the cost of having less useful outputs. The problem of figuring out the correct e to use is quite tricky and often ill-defined [621. Our goal is to find a differentially private F that closely approximates f. One of the simplest ways to do this is with what is known as the Laplacian mechanism [17]. Formally, if Q C Rn , we define the sensitivity of a function be equal to 33 f, denoted Af, to Af = max If(D) - (Xi,..., X,) x E Rn is defined as D,D' neighbors where the L 1 norm for a vector = f(D')I1 n lXl1 Z=Xi More than that, let Lapn(A) be a random variable that returns an n dimensional vector with density equal to P(Lapn(A) = x) = exp (Wi) then A is equal to the standard deviation of the coordinate functions of Lapn(A). The Laplacian Mechanism is then achieved by letting F(D) = f(D) + Lapn (f Theorem 1. If F is defined as above then F is c-differentially private. Exponential Mechanism Though the Laplacian mechanism was the first such mechanism suggested, many others have appeared over the years ([20, 55, 78, 11, 451, just to name a few). In particular, we are interested in coming up with estimates of f when there exists some loss function, q, such that f(D) = argmincq(D,c). One common approach is known as the exponential mechanism. Theorem 2. Let f and q be as above, and define F so that F(D) is chosen according to P(F(D) = c) oc exp where Aq = max D,D' neighbors,cEQ q(D,c) q(D, c) - q(D', c)|. Then F is c-differentially private. In many cases the exponential mechanism gives us a F that closely approximates 34 f [78]. 
Unfortunately, it is difficult to sample from F, so the exponential mechanism is not always a practical choice. 2.2.2 Other Approaches to Privacy For the sake of completeness we will say a few words about some other paradigms for private data release. One common approach is to deanonymize the data. This approach involves re- moving certain known identifiers from the data before sharing it. These identifiers can include anything from names to zip codes to ages. This approach has the upside of preserving almost all the utility in a dataset, though in many cases it has led to privacy breaches when data fields not thought to be identifying are connected to outside information [92]. To deal with this shortcoming the idea of k-anonymity (as well as many variations there of) was suggested [92]. In a nutshell, k-anonymity works by taking the data and, through various transformations (removing data entries, generalizing data fields, etc), generates a version of the dataset such that every record in the dataset matches k - 1 other records. This prevents an adversary from determining which of these k records belongs to a given individual. Although this improves privacy there have been attacks on it in various instances 1731. Moreover, it does come at the cost of having less useful data. Another idea that has been proposed is auditing [29]. This framework, also known as trust but verify, involves sharing data with certain individuals (aka a controlled access framework), but making sure to go back into the records to check that the users did not abuse this access. The approach has the advantage of giving users all the utility of the dataset with increased privacy protections. This framework can help deter and punish misconduct, but does not actively prevent it. On the down side, this approach is often cumbersome, since it involves users applying for access, an often painful process. In addition, it is not always clear what behaviors should be flagged as misconduct, though there is active research on figuring this out [29]. Finally, we should mention cryptographic approaches to privacy. Often it is the 35 case that the final output of an analysis does not threaten privacy, but that the database itself is still private. This situation does not present a problem if one user holds all of the data and that user has the computational tools needed to analyze it. There are times, however, when the user may want to either outsource their computation, or to combine their data with data from other users. The simple solution to these problems is for the data owner to share their data. This solution, however, clearly violates privacy. Luckily, there are cryptographic solutions to this problem. If the data owner wants to outsource data analysis they can use homomorphic or functional encryption [56, 8]. Similarly, if numerous users want to pool their data they can use multiparty computations [104]. Unfortunately, both of these approaches are rather computationally intensive at the moment, so are probably not useful for analysis of large, high-dimensional datasets such as those present in genomics. 2.3 Privacy Concerns and Biomedical Data Over the years there have been many different examples of how carelessness in the way data is shared can lead to privacy concerns. Unfortunately, we do not have the time to go into them all here- though various reviews exist for any interested readers [19, 76]- and instead give a brief overview of some of the most important. 
2.3 Privacy Concerns and Biomedical Data

Over the years there have been many different examples of how carelessness in the way data is shared can lead to privacy concerns. Unfortunately, we do not have the time to go into them all here (though various reviews exist for any interested readers [19, 76]), and instead give a brief overview of some of the most important.

Perhaps the most prevalent source of privacy risk comes from human error. There have been numerous cases of medical professionals losing laptops with sensitive data, and other mistakes that have led to privacy breaches. If these were the only privacy issues that existed, then biomedical privacy would not be a very interesting area of research. It turns out, however, that other, less obvious privacy concerns exist.

Perhaps the first of these to be exposed was what is known as a linkage attack [93, 91]. The idea is simple: an adversary has access to a deidentified database and some outside database that is not deidentified. The adversary can then use the outside database to reidentify the individuals who are in the deidentified database by looking for fields that occur in both the outside database and the private database. One of the first demonstrations of this was performed by Latanya Sweeney [91]. After Massachusetts released deidentified medical health records, Sweeney was able to link these records to the voter registration rolls using zip code and date of birth. To add a bit of flair to this result, she found the record of the governor of Massachusetts, and sent it to him by mail.

Similar linkage attacks can occur when sharing genetic data. Gymrek et al. [59] recently showed how it is possible, using online ancestry databases, to reidentify supposedly deidentified genomic sequences. Using data from the Y chromosome they were able to determine the surname of a large percentage of participants in some online genomic databases. Using other information (such as genealogy information, state of birth, etc.), they could go even further and narrow the candidates down to a few individuals, sometimes even reidentifying the sample completely.

One suggested method for protecting genomic data has been to withhold sensitive or identifying parts of the genome, and only release the rest. It turns out that such measures can be defeated [82]. In particular, since mutations at nearby locations in the genome are not independent (due to linkage disequilibrium), one can reconstruct private data contained in the genome.

Even aggregate genomic data, our main concern here, is not free of privacy concerns. Homer et al. [60] showed that MAFs could be used to determine if individuals had participated in a genetic study. Similar work has demonstrated that releasing regression coefficients [63] can lead to privacy loss as well.

Why do medical researchers care about privacy concerns? Beyond the obvious ethical reasons, there is also the worry that privacy breaches will lead to lack of trust, resulting in fewer individuals being willing to participate in studies. Previous research has already shown that privacy concerns lead many individuals not to go to their doctor with medical concerns [83], and privacy breaches would likely make this worse. Moreover, privacy concerns have led several agencies (such as the NIH) to hide medical data in controlled access repositories, something that many believe hurts the scientific enterprise. It might be hoped that a better understanding of the boundary between medicine and privacy can lead to a loosening of these restrictions.

2.4 Previous Applications of Privacy Preserving Approaches to Biomedical Data

As one might imagine, the concern about privacy in biomedical data analysis has led to a slew of suggested approaches for dealing with these issues. Such approaches range from completely policy based to new cryptographic methods.
Below we give a brief overview of some of these approaches.

2.4.1 HIPAA and Other Legislative Approaches

Various legislative approaches have been suggested to help deal with the issue of privacy and medical data. In the US, the main such legislation is HIPAA, the Health Insurance Portability and Accountability Act of 1996 [2]. HIPAA sets up various requirements for releasing medical data publicly. For releasing individual level data, the act requires either that a statistician looks at the data and declares it safe to release, or that the data meets the safe harbor requirement, which requires various pieces of identifying information to be removed.

2.4.2 Access Control

Access control is one of the most common approaches in biomedicine to protecting patient data, from EHR to the NIH and beyond [103, 39]. In a nutshell, access control involves taking data and storing it away from prying eyes. In order to get access to the data, researchers have to apply and pass some kind of background check. After passing this check they are then given access to the data. Increasingly there have been suggestions that, instead of allowing researchers to download data, researchers should instead be allowed to submit code to the repositories, which perform the analysis for them [52]. This approach helps prevent the loss of control which occurs after data is downloaded by researchers, and allows for the possibility of auditing to find out if researchers are abusing their privileges; such auditing methods have been suggested in many places in the literature [51, 58].

2.4.3 Differential Privacy

Differential privacy (see above) has also been suggested as a possible solution to the privacy conundrum. Methods have been developed for performing differentially private medical count queries [97, 65], statistical tests on medical databases [99], genomic studies [66, 67, 95, 106, 46, 53], and model fitting [32]. There has even been a competition, known as iDASH, to help come up with better methods for performing differentially private GWAS [66]. Though such methods have been steadily improving, they have yet to see many real applications. The closest that we are aware of is in the case of study design. Databases such as i2b2 and STRIDE often return perturbed count queries to researchers as a way of preserving privacy. Note that, although much of this research has been encouraging, there is still a long way to go [37]. In a similar vein, there has been work using methods similar to differential privacy to protect patient location data [49, 68].

2.5 Other approaches

k-anonymity is one of the most common privacy methods applied to medical data, including applying it to GWAS studies [26] and many other areas. Similarly, there have been model based approaches to release DNA sequences without violating relatives' privacy [38].

There has been a lot of interest in being able to combine various private databases to perform a joint analysis without losing any privacy. Such techniques are known as multiparty computations. Based on cryptography [104], such methods allow users to perform joint operations without losing privacy, though they come at a cost to performance. In the medical realm such approaches have been suggested for many applications, including drug monitoring [33], GWAS studies [36, 50], and paternity testing [25], among many others [42, 14, 40].
There have also been attempts to allow users to outsource computation to the cloud in privacy preserving ways using homomorphic encryption (that is to say, encryption that allows someone to compute on the data without decrypting it). Such approaches have been applied to mapping genomic sequences to a reference sequence [5] and simple genomic analysis [6]. There has also been work on figuring out novel schemes to encrypt genomes so that even brute force attacks will fail [54], in order to allow users to outsource genotype storage to the cloud.

A final area of interest has been in devising new methods for deidentifying patient records. There have been numerous methods investigated for deidentifying medical notes, either by removing identifiers to make them HIPAA compliant, or by removing other identifying information not covered by HIPAA [22, 41]. Though such methods are an interesting area of research, it is not yet clear how much private information actually leaks through.

Chapter 3

One Size Doesn't Fit All: Measuring Individual Privacy in Aggregate Genomic Data

3.1 Introduction

Note: The work in this chapter was presented at the GenoPri workshop at the IEEE Symposium on Security and Privacy 2015.

Recent research has shown that sharing aggregate genomic data, such as p-values, regression coefficients, and minor allele frequencies (MAFs), may compromise participant privacy in genomic studies [60, 19, 108, 90, 63]. In particular, Homer et al. showed that, given an individual's genotype and the MAFs of the study participants, an interested party can determine with high confidence if the individual participated in the study (recall that the MAF is the frequency with which the least common allele occurs at a particular location in the genome). Following the initial realization that aggregate data can be used to reveal information about study participants, subsequent work has led to even more powerful methods for determining if an individual participated in a study based on MAFs [98, 88, 64, 9]. These methods work by comparing an individual's genotype to the MAF in the study and to the MAF in the background population. If their genotype is more similar to the MAF in the study, then it is likely that the individual was in the study.

This raises a fundamental question: how do researchers know when it is safe to release aggregate genomic data? To help answer this question we introduce a new model-based measure, PrivMAF, that provides provable privacy guarantees for MAF data obtained from genomic studies. Unlike many previous privacy measures, PrivMAF gives an individual privacy measure for each study participant, not just an average measure. These individual measures can then be combined to measure the worst case privacy loss in the study. Our measure also allows us to quantify the privacy gains achieved by perturbing the data, either by adding noise or binning.

3.1.1 Previous work

Several methods have been proposed to help determine when MAFs are safe to release. The simplest method, one suggested for regression coefficients [75], is to just choose a certain number and release the MAFs for at most that many single nucleotide polymorphisms (SNPs, i.e. locations in the genome with multiple alleles). Sankararaman et al. [88] suggested calculating the sensitivity and specificity of the likelihood ratio test to help decide if the MAFs for a given dataset are safe to release. More recently, Craig et al.
This raises a fundamental question: how do researchers know when it is safe to release aggregate genomic data? To help answer this question we introduce a new model-based measure, PrivMAF, that provides provable privacy guarantees for MAF data obtained from genomic studies. Unlike many previous privacy measures, PrivMAF gives an individual privacy measure for each study participant, not just an average measure. These individual measures can then be combined to measure the worst case privacy loss in the study. Our measure also allows us to quantify the privacy gains achieved by perturbing the data, either by adding noise or binning.

3.1.1 Previous work

Several methods have been proposed to help determine when MAFs are safe to release. The simplest method, one suggested for regression coefficients [75], is to just choose a certain number and release the MAFs for at most that many single nucleotide polymorphisms (SNPs, i.e., locations in the genome with multiple alleles). Sankararaman et al. [88] suggested calculating the sensitivity and specificity of the likelihood ratio test to help decide if the MAFs for a given dataset are safe to release. More recently, Craig et al. [13] advocated a similar approach, using the Positive Predictive Value (PPV) rather than sensitivity and specificity. These measures provide a powerful set of tools to help determine the amount of privacy lost after releasing a given dataset. One limitation of these approaches, however, is that they ignore the fact that a given piece of aggregate data might reveal different amounts of information about different individual study participants, and instead look at an average measure of privacy over all participants. For the unlucky few who lose a lot of privacy in a given study, a privacy guarantee for the average participant is not very comforting. The only sure way to avoid potentially harmful repercussions is to produce provable privacy guarantees for all participants when releasing sensitive research data.

Some researchers have recently suggested k-anonymity [92, 108, 74] or differential privacy [17, 95] based approaches, which allow release of a transformed version of the aggregate data in such a way that privacy is preserved. The idea behind these methods is that perturbing the data decreases the amount of private information released. Though such approaches do give improved privacy guarantees, they limit the usefulness of the results, as the data has often been perturbed beyond its usefulness; thus, there is a need to develop methods that perturb the data as little as possible in order to maximize its utility.

Identifying individuals whose genomic information has been included in an aggregate result can have real-world repercussions. Consider, for example, studies of the genetics of drug abuse [35]. If the MAFs of the cases (e.g. people who had abused drugs) were released, then knowing someone contributed genetic material would be enough to tell that they had abused drugs. Along the same lines, there have been numerous genome-wide association studies (GWAS) related to susceptibility to numerous STDs, including HIV [77]. Since many patients would want to keep their HIV status secret, these studies need to use care in deciding what kind of information they give away. Such privacy concerns have led the NIH and the Wellcome Trust, among others, to move genomic data from public databases to access-controlled repositories [84, 57, 107]. Such restrictions are clearly not optimal, since ready access to biomedical databases has been shown to enable a wide range of secondary research [101, 83].

Many types of biomedical research data may compromise individuals' privacy, not just MAFs [19, 76, 75, 59, 93, 18, 91]. For instance, even if we limit ourselves to genomic data, there are several broad categories of privacy challenges that depend on the particular data available: determining from an individual's genotype and aggregated data whether they participated in a GWAS [90], determining from an individual's genotype whether they are in a gene-expression database [63], or, alternately, determining an individual's identity from just genotype and public demographic information [59].

3.1.2 Our Contribution

We introduce a privacy statistic, our measure PrivMAF, which provides provable privacy guarantees for all individuals in a given study when releasing MAFs for unperturbed or minimally perturbed (but still useful) data. The guarantee we give is straightforward: given only the MAFs and some knowledge about the background population, PrivMAF measures the probability of a particular individual being in the study.
This guarantee implies that, if $d$ is any individual and $\mathrm{PrivMAF}(d, MAF)$ is the score of our statistic, then, under reasonable assumptions, knowledge of the minor allele frequencies implies that $d$ participated in the study with probability at most $\mathrm{PrivMAF}(d, MAF)$. Intuitively, this measure bounds how confident an adversary can be in concluding that a given individual is in our study cohort based off the available information.

Moreover, the PrivMAF framework can measure privacy gains achieved by perturbing MAF data. Even though it is preferable to release unperturbed MAFs, there may be situations in which releasing perturbed statistics is the only option that ensures the required level of privacy, such as when the number of SNPs whose data we want to release is very large. With this scenario in mind, PrivMAF can be modified to measure the amount of privacy lost when releasing perturbed MAFs. In particular, the statistic we obtain allows us to measure the privacy gained by adding noise to (common in differential privacy) or binning (truncating) the MAFs. To our knowledge, PrivMAF is the first method for measuring the amount of privacy gained by binning MAFs. In addition, our method shows that much less noise is necessary to achieve reasonable differential privacy guarantees, at the cost of adding realistic assumptions about what information potential adversaries have access to, thus providing more useful data.

In addition to developing PrivMAF, we apply our statistic to genotype data from the Wellcome Trust Case Control Consortium's (WTCCC) British Birth Cohort. This allows us to demonstrate our method on both perturbed and unperturbed data. Moreover, we use PrivMAF to show that, as claimed above, different individuals in a study can experience very different levels of privacy loss after the release of MAFs.

3.2 Methods

3.2.1 The Underlying Model

Our method assumes a model implicitly described by Craig et al. [13], with respect to how the data were generated and what knowledge is publicly available. PrivMAF assumes a large background population. Like previous works, we assume this population is at Hardy-Weinberg (H-W) equilibrium. We choose a subset ($B$) of this larger population, consisting of all individuals who might reasonably be believed to have participated in the study. Finally, the smallest set, denoted $D$, consists of all individuals who actually participated in the study. As an example, consider performing a GWAS at a hospital in Britain. The underlying population might be all people of British ancestry; $B$, the set of all patients at the hospital; and $D$, all study participants. As a technical aside, it should be noted that, breaking with standard convention, we allow repetitions in $D$ and $B$. Moreover, we assume that the elements in $D$ and $B$ are ordered. In our model $B$ is chosen uniformly at random from the underlying population, and $D$ is chosen uniformly at random from $B$.

An individual's genotype $d = (d_1, \ldots, d_m)$ can be viewed as a vector in $\{0, 1, 2\}^m$, where $m$ is the number of SNPs we are considering releasing. Let $p_j$ be the minor allele frequency of SNP $j$ in the underlying population. We assume that each of the SNPs is chosen independently. By definition of H-W equilibrium, for any $d \in B$, the probability that $d_j = i$ for $i \in \{0, 1, 2\}$ is $\binom{2}{i} p_j^i (1 - p_j)^{2-i}$.

Let $MAF_j(D) = \frac{1}{2n}\sum_{d \in D} d_j$ be the minor allele frequency of SNP $j$ in $D$, the frequency with which the least common allele occurs at SNP $j$, and let $MAF(D) = (MAF_1(D), \ldots, MAF_m(D))$.
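A minimal simulation of this generative model makes the three layers concrete: a background pool B drawn under H-W equilibrium, a study D subsampled from B, and the released MAF vector. This is an illustrative sketch, not the thesis's released code; all function names are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_population(size, p):
        """Draw genotypes in {0,1,2}: under H-W, each d_j is Binomial(2, p_j)."""
        return rng.binomial(2, p, size=(size, len(p)))

    def sample_study(B, n):
        """Choose n study participants uniformly from the pool B, without repetition."""
        idx = rng.choice(len(B), size=n, replace=False)
        return B[idx]

    def maf(D):
        """MAF_j(D) = (1/2n) * sum over participants of d_j."""
        return D.sum(axis=0) / (2.0 * len(D))

    p = rng.uniform(0.05, 0.5, size=100)   # background MAFs for m=100 SNPs
    B = sample_population(1000, p)          # candidate pool, N=1,000
    D = sample_study(B, 100)                # study cohort, n=100
    print(maf(D)[:5])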
We assume the parameters $\{p_j\}$, the size of $B$ (denoted $N$), and the size of $D$ (denoted $n$) are publicly known. We are trying to determine whether releasing $MAF(D)$ publicly will lead to a breach of privacy.

Note that our model does assume the SNPs are independent, even though this is not always the case due to linkage disequilibrium (LD). This independence assumption is made in most previous approaches. We can, however, extend PrivMAF to take LD into account by using a Markov chain based model (see Section 3.5.7). The original WTCCC paper [12] looked at the dependency between SNPs in their dataset and found that there are limited dependencies between close-by SNPs. In situations where LD is an issue one can often avoid such complications by picking one representative SNP for each locus in the genome.

3.2.2 Measuring Privacy of MAF

Consider an individual $d \in B$. We want to determine how likely it is that $d \in D$ based on publicly released information. We assume that it is publicly known that $d \in B$. This is a realistic assumption, since it corresponds to an attacker believing that $d$ may have participated in the study. This inspires us to use

$$P(d \in \tilde{D} \mid MAF(\tilde{D}) = MAF(D),\; d \in \tilde{B}) \tag{3.1}$$

as the measure of privacy for individual $d$, where $\tilde{D}$ and $\tilde{B}$ are drawn from the same distribution as $D$ and $B$. Informally, $\tilde{D}$ and $\tilde{B}$ are random variables that represent our adversary's a priori knowledge about $D$ and $B$. More precisely, we calculate an upper bound on Equation 3.1, denoted by $\mathrm{PrivMAF}(d, MAF(D))$. In practice we use the approximation

$$\mathrm{PrivMAF}(d, MAF(D)) \approx \frac{1}{1 + (N - n)\frac{P_n(x(D))}{n\, P_{n-1}(x(D) - d)}}$$

where $x(D) = 2n\,MAF(D)$ and

$$P_n(x) = \prod_{i=1}^{m} \binom{2n}{x_i} p_i^{x_i} (1 - p_i)^{2n - x_i}.$$

It should be noted that, for reasonable parameters, this upper bound is almost tight. We can then let

$$\mathrm{PrivMAF}(D) = \max_{d \in D} \mathrm{PrivMAF}(d, MAF(D)).$$

Informally, for all $d \in D$, $\mathrm{PrivMAF}(D)$ bounds the probability that $d$ participated in our study given only publicly-available data and $MAF(D)$. All derivations are given in Section 3.5.

This measure allows a user to choose some privacy parameter, $\alpha$, and release the data if and only if $\mathrm{PrivMAF}(D) < \alpha$. It is worth noting, however, that deciding whether or not to release the data gives away a little bit of information about $D$, which can weaken our privacy guarantee. While in practice this seems to be a minor issue, we develop a method to correct for it in Section 3.5.6.

3.2.3 Measuring Privacy of Truncated Data

In order to deal with privacy concerns it is common to release perturbed versions of the data. This can be achieved by adding noise (as in differential privacy), binning (truncating results), or similar approaches. Here we show how PrivMAF can be extended to perturbed data.

We first consider truncated data. Let $MAF^{trunc(k)}_j(D)$ be obtained by taking the minor allele frequency of the $j$th SNP and truncating it to $k$ decimal digits. For example, if $k = 1$ then .111 would become .1, and if $k = 2$ it would become .11. We are interested in

$$P(d \in \tilde{D} \mid MAF^{trunc(k)}(\tilde{D}) = MAF^{trunc(k)}(D),\; d \in \tilde{B}).$$

As above, we can calculate an upper bound, denoted by $\mathrm{PrivMAF}^{trunc(k)}(d, MAF^{trunc(k)}(D))$. The approximation we use to calculate this is given in Section 3.5. We then have

$$\mathrm{PrivMAF}^{trunc(k)}(D) = \max_{d \in D} \mathrm{PrivMAF}^{trunc(k)}(d, MAF^{trunc(k)}(D)).$$

For each $d \in D$, this measure upper bounds the probability that individual $d$ participated in our study given only publicly-available data and knowledge of $MAF^{trunc(k)}(D)$.
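To make the approximation above concrete, the following sketch computes PrivMAF(d, MAF(D)) and the study-wide maximum in log space for numerical stability. It is an illustrative implementation, not the released PrivMAF package, and it assumes the D and p arrays from the simulation sketch earlier.

    import numpy as np
    from scipy.stats import binom

    def log_P(x, p, n):
        """log P_n(x) = sum_i log[ C(2n, x_i) p_i^{x_i} (1-p_i)^{2n-x_i} ]."""
        return binom.logpmf(x, 2 * n, p).sum()

    def privmaf(d, x, p, n, N):
        """Approximate PrivMAF(d, MAF(D)), with x = 2n * MAF(D)."""
        log_ratio = log_P(x, p, n) - log_P(x - d, p, n - 1)
        return 1.0 / (1.0 + (N - n) * np.exp(log_ratio) / n)

    def privmaf_study(D, p, N):
        """Worst-case PrivMAF over all study participants."""
        n = len(D)
        x = D.sum(axis=0)
        return max(privmaf(d, x, p, n, N) for d in D)

For example, privmaf_study(D, p, N=1000) on the simulated cohort gives the worst-case bound that would be compared against the release threshold.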
3.2.4 Measuring Privacy of Adding Noise

Another way to achieve privacy guarantees on released data is by perturbing the data using random noise (this is a common way of achieving differential privacy). Though there are many approaches to generating this noise, most famously by drawing it from the Laplace distribution [17], we investigate one standard approach to adding noise that is used to achieve differential privacy when releasing integer values [20].

Consider $\epsilon > 0$. Let $\eta$ be an integer valued random variable such that $P(\eta = i)$ is proportional to $e^{-\epsilon|i|}$. Let $MAF^\epsilon_j(D) = MAF_j(D) + \frac{\eta_j}{2n}$, where $\eta_1, \ldots, \eta_m$ are independently and identically distributed (iid) copies of $\eta$. It is worth noting that each coordinate $MAF^\epsilon_j(D)$ is $2\epsilon$-differentially private, so releasing all $m$ coordinates is $2m\epsilon$-differentially private. Recall [17]:

Definition 1. Let $n$ be an integer, $\Omega$ and $\Sigma$ sets, and $X$ a random function that maps $n$ element subsets of $\Omega$ (we call such subsets "databases of size $n$") into $\Sigma$. We say that $X$ is $\epsilon$-differentially private if, for all databases $D$ and $D'$ of size $n$ that differ in exactly one element and all $S \subset \Sigma$, we have that

$$P(X(D) \in S) \le \exp(\epsilon) P(X(D') \in S).$$

Using the same framework as above we can define $\mathrm{PrivMAF}^\epsilon(d, MAF^\epsilon(D))$ and $\mathrm{PrivMAF}^\epsilon(D)$ to measure the amount of privacy lost by releasing $MAF^\epsilon(D)$. As above, the approximation we use to calculate this is given in Section 3.5.

Figure 3-1: PrivMAF applied to the WTCCC dataset. In all plots we take n=1,000 research subjects and a background population of size N=100,000. (a) Our privacy measure PrivMAF increases with the number of SNPs. The blue line corresponds to releasing MAFs with no rounding, the green line to releasing MAFs rounded to one decimal digit, and the red line to releasing MAFs rounded to two decimal digits. Rounding to two digits appears to add very little to privacy, whereas rounding to one digit achieves much greater privacy gains. (b) The blue line corresponds to releasing MAFs with no noise, the red line to releasing MAF with ε = .5, and the green line to releasing MAF with ε = .1. Adding noise corresponding to ε = .5 seems to add very little to privacy, whereas taking ε = .1 achieves much greater privacy gains.

3.2.5 Choosing the Size of the Background Population

One detail we did not go into above is the choice of $N$, the number of people who could reasonably be assumed to have participated in the study. This parameter depends on the context, and giving a realistic estimate of it is critical. In most applications the background population from which the study is drawn is fairly obvious. That being said, one needs to be careful of any other information released publicly about participants: just listing a few facts about the participants can greatly reduce $N$, thus greatly weakening the bounds on privacy guarantees (since the amount of privacy lost by an individual is roughly inversely proportional to $N - n$).

Note that $N$ can be considered one of the main privacy parameters of our method. The smaller the $N$, the stronger the adversary we are protected against; therefore we want to make $N$ as large as possible while still ensuring the privacy we need. In our method, an adversary who has limited his pool of possible contenders to fewer than $N$ individuals before we publish the MAFs can be considered to have already achieved a privacy breach; thus it is a practitioner's job to choose $N$ small enough that such a breach is unlikely.
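A minimal sketch of this noise mechanism follows. The two-sided geometric sampler below is a standard construction (the difference of two iid geometric variables has exactly the distribution P(η = i) ∝ exp(-ε|i|)); it is illustrative code, not the thesis's implementation.

    import numpy as np

    rng = np.random.default_rng(1)

    def two_sided_geometric(eps, size):
        """Sample integers with P(eta = i) proportional to exp(-eps * |i|)."""
        q = 1.0 - np.exp(-eps)
        return rng.geometric(q, size) - rng.geometric(q, size)

    def noisy_maf(D, eps):
        """Release MAF_j(D) + eta_j / (2n) for each SNP j."""
        n = len(D)
        x = D.sum(axis=0)
        return (x + two_sided_geometric(eps, len(x))) / (2.0 * n)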
3.2.6 Release Mechanism

Often one might like to use PrivMAF to decide if it is safe to release a set of MAFs from a study. This can be done by choosing $\alpha$ between 0 and 1 and releasing the MAFs if and only if $\mathrm{PrivMAF}(D) < \alpha$. The act of deciding to release $D$ or not, however, gives away a little information. In practice this is unlikely to be an issue, but in theory it can lead to privacy breaches. This issue can be dealt with by releasing the MAFs if and only if $\mathrm{PrivMAF}(D) \le \beta(\alpha)$, where $\beta = \beta(\alpha)$ is chosen so that

$$\alpha \ge \frac{1}{1 + \left(\frac{1}{\beta} - 1\right) P_\beta - \max_{d \in \{0,1,2\}^m} P(d \in \tilde{B} - \tilde{D})}$$

where

$$P_\beta = P\left(\max_{d \in \tilde{D}} \mathrm{PrivMAF}(d, MAF(\tilde{D})) \le \beta \;\middle|\; x(\tilde{D}) = x\right).$$

We call this release mechanism the Allele Leakage Guarantee Test (ALGT). Unlike the naive release mechanism, ALGT gives us the following privacy guarantee:

Theorem 3. Choose $\beta$ as above. Then, if $\mathrm{PrivMAF}(D) \le \beta$, for any choice of $d \in D$ we get that

$$P\left(d \in \tilde{D} \;\middle|\; d \in \tilde{B},\; x(\tilde{D}) = x,\; \max_{d' \in \tilde{D}} \mathrm{PrivMAF}(d', MAF(\tilde{D})) \le \beta\right)$$

is less than or equal to $\alpha$.

Note that the choice of $\alpha$ determines the level of privacy achieved. Picking this level is left to the practitioner; perhaps an approach similar to that taken by Hsu et al. [62] is appropriate. A more detailed proof of the privacy result above can be found in Section 3.5.

3.2.7 Simulated Data

In what follows, all simulated genotype data was created by choosing a study size, denoted $n$, and a number of SNPs, denoted $m$. For each SNP a random number $p$ in the range .05 to .5 was chosen uniformly at random to be the MAF in the background population. Using these MAFs we then generated the genotypes of $n$ individuals independently. Note that all computations were run on a single core of a machine with 48GB RAM and a 3.47GHz XEON X5690 CPU liquid cooled and overclocked to 4.4GHz.

3.3 Results

3.3.1 Privacy and MAF

As a case study we tested PrivMAF on data from the Wellcome Trust Case Control Consortium (WTCCC)'s 1958 British Birth Cohort [12]. This dataset consists of genotype data from 1500 British citizens born in 1958. We first looked at the privacy guarantees given by PrivMAF for the WTCCC data for varying numbers of SNPs (blue curve, Fig. 3-1a), quantifying the relationship between the number of SNPs released and privacy lost. The data were divided into two sets: one of size 1,000 used as the study participants, the other of size 500 used to estimate our model parameters (the $p_i$'s). We assumed that participants were drawn from a background population of 100,000 individuals ($N = 100{,}000$; see Methods for more details). Releasing the MAFs of a small number of SNPs results in very little loss of privacy. If we release 1,000 SNPs, however, we find that there exists a participant in our study who loses most of their privacy: based on only the MAFs and public information we can conclude they participated in the study with 90% confidence.

In addition, we considered the behavior of PrivMAF as the size of the population from which our sample was drawn increases. From the formula for our statistic we see that PrivMAF approaches 0 as the background population size, $N$, increases, since there are more possibilities for who could be in the study, while it goes to 1 as $N$ decreases towards $n$.

Figure 3-2: Truncating simulated data to demonstrate scaling. We plot our privacy measure PrivMAF versus the number of SNPs for simulated data with n=10,000 subjects and a background population of size N=1,000,000. The green line corresponds to releasing MAFs with no rounding, the blue line to releasing MAFs rounded to three decimal digits, and the red line to releasing MAFs rounded to two decimal digits. Rounding to three digits seems to add very little to privacy, whereas rounding to two digits achieves much greater privacy gains.
3.3.2 Privacy and Truncation

Next we tested PrivMAF on perturbed WTCCC MAF data, showing that both adding noise and binning result in large increases in privacy. First we considered perturbing our data by binning. We bin by truncating the unperturbed MAFs, first to one decimal digit ($MAF^{trunc(1)}$, $k = 1$) and then to two decimal digits ($MAF^{trunc(2)}$, $k = 2$). As depicted in Fig. 3-1a, we see that truncating to two digits gives us very little in terms of privacy guarantees, while truncating to one digit gives substantial gains. In practice, releasing the MAFs truncated to one digit may render the data useless for most purposes. It seems reasonable to conjecture, however, that as the size of GWAS continues to increase, similar gains can be made with less sacrifice.

As a demonstration of how population size affects the privacy gained by truncation, we generated simulated data for 10,000 study participants and 10,000 SNPs, choosing $N$ to be one million. We then ran a similar experiment to the one performed on the truncated WTCCC data, except with $k = 2$ and $k = 3$; we found the $k = 2$ case had privacy guarantees similar to those seen in the $k = 1$ case on the real data (Fig. 3-2). For example, we see that if we release all 10,000 SNPs with no rounding then PrivMAF is near 0.35, while when $k = 2$ it is below 0.2 (almost a factor of two difference).

3.3.3 Privacy and Adding Noise

We also applied our method to data perturbed by adding noise to each SNP's MAF (Fig. 3-1b). We used $\epsilon = 0.1$ and $0.5$ as our noise perturbation parameters (see Methods). We see that when $\epsilon = 0.5$, adding noise to our data resulted in very small privacy gains. When we change our privacy parameter to $\epsilon = 0.1$, however, we see that the privacy gains are significant. For example, if we were to release 500 unperturbed SNPs then $\mathrm{PrivMAF}(D)$ would be over 0.4, while $\mathrm{PrivMAF}^{.1}(D)$ is still under 0.2.

The noise mechanism we use here gives us $2m\epsilon$-differential privacy (see Methods), where $m$ is the number of SNPs released. For $\epsilon = .1$, if $m = 200$ then the result is 40-differentially private, which is a nearly useless privacy guarantee in most cases. Our measure, however, shows that the privacy gains are quite large in practice. This suggests that PrivMAF allows one to use less noise to get reasonable levels of privacy, at the cost of having to make some reasonable assumptions about what information is publicly available.

Figure 3-3: Worst Case Versus Average Case PrivMAF. Graph of the number of SNPs, denoted m, versus PrivMAF. The blue curve is the maximum value of PrivMAF(d, MAF(D)) taken over all d ∈ D for a set of n = 1,000 randomly chosen participants in the British Birth Cohort, while the green curve is the average value of PrivMAF(d, MAF(D)) in the same set. The maximum value of PrivMAF far exceeds the average; by the time m = 1000 it is almost five times larger.
3.3.4 Worst Case Versus Average

As stated earlier, the motivation for PrivMAF is that previous methods do not measure privacy for each individual in a study but instead provide a more aggregate measure of privacy loss. This observation led us to wonder exactly how much the privacy risk differs between individuals in a given study. To test this, we compared the maximum and mean scores of PrivMAF(d, MAF(D)) in the WTCCC example for varying values of $m$, the number of released SNPs. The result is pictured in Figure 3-3. The difference is stark: the person with the largest loss of privacy (pictured in blue) loses much more privacy than the average participant (pictured in green). By the time $m = 1000$ the participant with the largest privacy loss is almost five times as likely to be in the study as the average participant. This result clearly illustrates why worst case, and not just average, privacy should be considered.

3.3.5 Comparing β to α

To justify our release test ALGT, we compared the naive threshold, $\alpha$, to the corrected threshold, $\beta$ (Figure 3-4). For larger values of $\alpha$ we see that the two thresholds are fairly close. As $\alpha$ decreases, however, the two quantities start to diverge, with the corrected threshold decreasing much faster than the naive one. Moreover, we see that when $\alpha$ is roughly 0.04, $\beta$ suddenly drops to around 0 and remains at that level for all smaller $\alpha$; this behavior is due to the negligible probability that a study population would have a PrivMAF less than .04 given this choice of parameters. This suggests that, in most cases, using $\alpha$ instead of $\beta$ will not reduce privacy by too much.

Figure 3-4: ALGT applied to the WTCCC dataset. A graph of the uncorrected threshold, α, versus the corrected threshold, β = β(α), from ALGT is given in blue. The green line corresponds to an uncorrected threshold. We see that for some choices of α, correction may be desired. For example, for α = .05 the corrected threshold is approximately β = .03. Here we again use the British Birth Cohort with n=1,000 study participants, m=1,000 SNPs, and a background population of size N=100,000.

3.3.6 Reidentification Using PrivMAF

Thus far we have presented PrivMAF as a means of helping ensure participant privacy. As it turns out, PrivMAF can also be used in exactly the opposite way, as a means of compromising subjects' privacy. To do this, choose some threshold $\gamma$. For a given genotype $d$ we predict $d \in D$ if and only if $\mathrm{PrivMAF}(d, MAF(D)) > \gamma$. Used in this way, our approach performs comparably to previous approaches; we plot the ROC curve of the likelihood ratio test [86] as well as the ROC curve obtained by using our test statistic (see Figure 3-5). We see that both methods perform similarly. Since it is known that the likelihood ratio test gives the highest power of any test at a given false positive rate, this suggests that our privacy measure is doing a good job of measuring how much privacy is lost by releasing the minor allele frequencies of a given dataset.

Figure 3-5: ROC Curves of PrivMAF and Likelihood Ratio. ROC curves obtained using PrivMAF (green triangles) and the likelihood ratio method (red circles) to reidentify individuals in the WTCCC British Birth Cohort with n=1,000 study participants and 1,000 SNPs.
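A sketch of this threshold classifier and the ROC curve it traces is below. It is illustrative only: it assumes PrivMAF scores have already been computed (for instance with the hypothetical privmaf helper from the earlier sketch), with held-out non-participants serving as negatives.

    import numpy as np

    def roc_points(scores_in, scores_out):
        """Sweep the threshold gamma over all observed scores and record
        (false positive rate, true positive rate) pairs."""
        points = []
        for gamma in sorted(np.concatenate([scores_in, scores_out])):
            tpr = np.mean(scores_in > gamma)   # participants correctly flagged
            fpr = np.mean(scores_out > gamma)  # non-participants falsely flagged
            points.append((fpr, tpr))
        return points

    # scores_in: PrivMAF(d, MAF(D)) for true participants d in D;
    # scores_out: the same score computed for held-out individuals.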
Note that we can also use this as a reidentification attack on perturbed data, using the perturbed PrivMAF. The results of this analysis are shown in Figure 3-6 for truncated data and Figure 3-7 for data with noise added.

Figure 3-6: ROC Curves of PrivMAF with Truncation. ROC curves obtained using PrivMAF for reidentification of unperturbed data (in red, AUC=.686), data truncated after two decimal digits (k = 2, in blue, AUC=.682), and data truncated after one decimal digit (k = 1, in green, AUC=.605). We see that truncation can greatly decrease the effectiveness of reidentification. Note that the ROC of the unperturbed data here differs from that in the previous figure because we used a different random division of our data in each case.

Figure 3-7: ROC Curves of PrivMAF with noisy data. ROC curves obtained using PrivMAF for reidentification of unperturbed data (in red, AUC=.696), with noise corresponding to ε = .5 (in green, AUC=.693), and with ε = .1 (in blue, AUC=.656). We see that adding noise can decrease the effectiveness of reidentification. Note that the ROC of the unperturbed data here differs from that in the previous figures because we used a different random division of our data in each case.

3.4 Conclusion

To facilitate genomic research, many scientists would prefer to release even more data from studies [101, 86]. Though tempting, this approach can sacrifice study participants' privacy. As highlighted in the introduction, several different classes of methods have been previously employed to balance privacy with the utility of data. Sensitivity/PPV based methods are dataset specific, but only give average-case privacy guarantees. Because our method provides worst-case privacy guarantees for all individuals, we are able to ensure improved anonymity for individuals. Thus, PrivMAF can provide stronger privacy guarantees than sensitivity/PPV based methods. Moreover, since our method for deciding which SNPs to release takes into account the genotypes of individuals in our study, it allows us to release more data than any method based solely on MAFs with comparable privacy guarantees.

Our findings demonstrate that differential privacy may not always be the method of choice for preserving privacy of genomic data. Notably, perturbing the data appears to provide major gains in privacy, though these gains come at the cost of utility. That said, our results suggest that, when $n$ is large, truncating minor allele frequencies may result in privacy guarantees without the loss of too much utility. Moreover, the method of binning we used here is very simple; it might be worth considering how other methods of binning may achieve similar privacy guarantees while resulting in less perturbation on average. We further show that adding noise can result in improved privacy, even if the amount of noise we add does not provide reasonable levels of differential privacy.

Note that our method is based off a certain model of how the data is generated, a model similar to those used in previous approaches. It will not protect against an adversary that has access to insider information.
This caveat, however, seems unavoidable if we do not want to turn to differential privacy or similar approaches that perturb the data to a greater extent to get privacy guarantees, thus greatly limiting data utility.

Having presented results on moderate-sized real datasets, we also tested the ability of PrivMAF to scale as genomic datasets grow. In particular, we ran our algorithm on larger artificial datasets (with 10,000 individuals and 1,000 SNPs) and found that our PrivMAF implementation still runs in a short amount of time (19.14 seconds on the artificial dataset of size 10,000 described above, with a running time of $O(mn)$, where $n$ is the study size and $m$ is the number of SNPs).

Though our work focuses on the technical aspects related to preserving privacy, a related and equally important aspect comes from the policy side. Methods similar to those presented here offer the biomedical community the tools it needs to ensure privacy; however, the community must determine appropriate privacy protections (ranging from the release of all MAF data to the use of controlled access repositories) and in what contexts (i.e., do studies of certain populations, such as children, require extra protection?). It is our hope that our work helps inform this debate. Our tool could, for example, be used in combination with controlled access repositories to release the MAFs of a limited number of SNPs, depending on what privacy protections are deemed reasonable.

Our work addresses the critical need to provide privacy guarantees to study participants and patients by introducing a quantitative measurement of the privacy lost by release of aggregate data, and thus may encourage release of genomic data. A Python implementation of our method is available at http://groups.csail.mit.edu/cb/PrivMAF/.

3.5 Derivation of Methods

3.5.1 Basic Model

Before describing the model underlying our results we want to motivate it. Often the size of an underlying population is so large compared to that of the study population that we might as well consider the underlying population to be infinite (consider, for example, the population of all people of English ancestry versus the participants in the British Birth Cohort). In practice, however, it might be that we know the study participants are drawn from some smaller subpopulation (for example, the British Birth Cohort is drawn from the population of all children born in Britain during a certain week in 1958). This subpopulation is small enough that we cannot consider it infinite. Therefore we can think of our study population as being generated by first generating this smaller subpopulation out of an infinite background population, then choosing the study participants out of this smaller population. This is the point of view our model takes, and it is formally described below. It is worth noting that, breaking with standard notation, we assume all sets are ordered and can have repetitions.

Assume that the genotypes of study participants are drawn from some theoretical infinite population. We have $m$ SNPs, which we label $1, \ldots, m$, each of which is independent of the others. Let $p_i$ be the minor allele frequency of the $i$th SNP in our infinite population, and assume our population is in Hardy-Weinberg (H-W) equilibrium. We first produce a small background population $B = \{b_1, \ldots, b_N\}$, where each $b_j \in \{0, 1, 2\}^m$ ($B$ is the finite set of people who in reality might have participated in the study), and where each member of the population is generated independently of the others.
Our study population, denoted $D = \{d_1, \ldots, d_n\}$, is a population of size $n$ produced from the background population by choosing $n$ members of $B$ uniformly at random with no repetitions (note that, since $B$ can have repetitions in it, it is possible to have $d_i = d_j$ even if $i \ne j$; this is because it is possible to have $k \ne l$ such that $b_k = b_l$). It is worth noting that the marginal probability distribution on $D$ is exactly the same as the probability distribution we would get by generating $D$ directly from the infinite population.

3.5.2 PrivMAF

Let $MAF_i(D)$ be the minor allele frequency of the $i$th SNP in our population $D$. We want to release $MAF(D) = (MAF_1(D), \ldots, MAF_m(D))$, where $x_i = \sum_{d \in D} d_i$ is the number of times the minor allele occurs at SNP $i$ in our study population. To simplify notation let $x(D) = 2n\,MAF(D)$.

We want some kind of measure of how much privacy is lost by each study participant after releasing $MAF(D)$. We achieve this goal by measuring the probability that an individual participated in the study given the data released. For a given individual $d$ we want to calculate how likely it is under our model that $d$ is in $D$ given $x(D)$. Note that (in practice) we know that $d \in B$ (that is to say, if an adversary is trying to figure out whether $d \in D$ they already know $d \in B$), so what we want to calculate is the probability that $d$ is in $D$ conditional on $d$ being in $B$ and on $x(\tilde{D})$ equaling $x(D)$. More formally, we want to consider

$$P(d \in \tilde{D} \mid d \in \tilde{B},\; x(\tilde{D}) = x)$$

where $\tilde{D}$ and $\tilde{B}$ have the same distribution as $D$ and $B$. We would like to devise a formula to calculate an upper bound on this probability. First we need to build a few tools.

Let $\tilde{B} - \tilde{D}$ be the set of all people in $\tilde{B}$ who are not in $\tilde{D}$. Note that $\tilde{B} - \tilde{D}$ and $\tilde{D}$ are independent random variables, so

$$P(d \in \tilde{B},\; d \notin \tilde{D} \mid x(\tilde{D}) = x) = P(d \in \tilde{B} - \tilde{D})\,(1 - P(d \in \tilde{D} \mid x(\tilde{D}) = x)).$$

We also see, since $d \in \tilde{D}$ implies $d \in \tilde{B}$, that

$$P(d \in \tilde{D},\; d \in \tilde{B} \mid x(\tilde{D}) = x) = P(d \in \tilde{D} \mid x(\tilde{D}) = x).$$

Using Bayes' rule and some algebra we see that

$$P(d \in \tilde{D} \mid d \in \tilde{B},\; x(\tilde{D}) = x) = \frac{P(d \in \tilde{D} \mid x(\tilde{D}) = x)}{P(d \in \tilde{D} \mid x(\tilde{D}) = x) + P(d \in \tilde{B} - \tilde{D})(1 - P(d \in \tilde{D} \mid x(\tilde{D}) = x))} = \frac{1}{1 + P(d \in \tilde{B} - \tilde{D})\left(\frac{1}{P(d \in \tilde{D} \mid x(\tilde{D}) = x)} - 1\right)} \tag{3.2}$$

The next step is to consider $P(d \in \tilde{D} \mid x(\tilde{D}) = x)$. This equals

$$1 - \prod_{j=1}^{n} P(d \ne \tilde{d}_j \mid x(\tilde{D}) = x) = 1 - \left(1 - P(d = \tilde{d}_1 \mid x(\tilde{D}) = x)\right)^n.$$

Then note that

$$P(d = \tilde{d}_1 \mid x(\tilde{D}) = x) = \frac{P(d = \tilde{d}_1,\; x(\tilde{D}) = x)}{P(x(\tilde{D}) = x)} = \frac{P(d = \tilde{d}_1)\, P(x(\tilde{d}_2, \ldots, \tilde{d}_n) = x - d)}{P(x(\tilde{D}) = x)} \tag{3.3}$$

Let $P_n(x) = P(x(\tilde{D}) = x)$; then equation 3.3 equals

$$P(d = \tilde{d}_1)\,\frac{P_{n-1}(x - d)}{P_n(x)}.$$

Substituting this into equation 3.2 we get that

$$P(d \in \tilde{D} \mid d \in \tilde{B},\; x(\tilde{D}) = x) = \frac{1}{1 + P(d \in \tilde{B} - \tilde{D})\left(\frac{1}{1 - (1 - P(d = \tilde{d}_1 \mid x(\tilde{D}) = x))^n} - 1\right)}.$$

Using the fact that $(1 - z)^n \ge 1 - nz$ when $0 \le z \le 1$ (this follows from the inclusion-exclusion principle), we get that this is

$$\le \frac{1}{1 - P(d \in \tilde{B} - \tilde{D}) + \frac{P(d \in \tilde{B} - \tilde{D})\, P_n(x)}{n\, P(d = \tilde{d}_1)\, P_{n-1}(x - d)}} = \mathrm{PrivMAF}(d, MAF(D)).$$

It is worth mentioning that the above upper bound is likely to be fairly tight, since $z = P(d = \tilde{d}_1 \mid x(\tilde{D}) = x)$ is in practice likely to be very small (especially when the dataset is anywhere near safe to release, since $P(d = \tilde{d}_1 \mid x(\tilde{D}) = x) \le P(d = \tilde{d}_1 \mid d \in \tilde{B},\; x(\tilde{D}) = x)$).

This quantity, $\mathrm{PrivMAF}(d, MAF(D))$, is our measure of privacy. Note that for realistic choices of $n$, $N$, $p$ and $m$ we get that $P(d \in \tilde{B} - \tilde{D})$ is approximately equal to $(N - n)P(d = \tilde{d}_1)$ and that $P(d \in \tilde{B} - \tilde{D}) \ll 1$, so $1 - P(d \in \tilde{B} - \tilde{D}) \approx 1$. Plugging this in, we get the measure

$$\mathrm{PrivMAF}(d, MAF(D)) \approx \frac{1}{1 + \frac{(N - n)\, P_n(x)}{n\, P_{n-1}(x - d)}}$$

which is what we use in practice.
Moreover, we see that

$$P_n(x) = \prod_{i=1}^{m} \binom{2n}{x_i} p_i^{x_i} (1 - p_i)^{2n - x_i}.$$

This allows us to calculate $\mathrm{PrivMAF}(d, MAF(D))$ easily.

PrivMAF allows us to determine how much privacy is lost by a particular individual. What we want is the total privacy loss from releasing a study. It makes sense to look at the maximum loss to any individual in our study, which is to say $\max_{d \in D} \mathrm{PrivMAF}(d, MAF(D))$. We call this quantity PrivMAF. If PrivMAF is bounded above by $\alpha$ then, for any participant $d \in D$, an adversary can be at most $\alpha$ confident that $d$ actually participated in the study, which is the privacy guarantee we want.

Naively it seems like calculating PrivMAF for $m$ SNPs and $n$ individuals has complexity $O(mn^2)$. By using the cancellation in $P_n(x)$ it only ends up taking $O(mn)$ time, which is asymptotically optimal.

3.5.3 PrivMAF for Data with Noise Added

Note that the above framework can be generalized to measure the privacy loss present in releasing a noisy version of $MAF(D)$. In particular, let $\eta$ be some random variable. Then we can let $MAF^\eta_j(D) = MAF_j(D) + \frac{\eta_j}{2n}$, where $\eta_1, \ldots, \eta_m$ are iid random variables distributed as $\eta$. We want to measure how well $MAF^\eta(D)$ preserves privacy. As above, we are interested in $P(d \in \tilde{D} \mid MAF^\eta(\tilde{D}) = MAF^\eta(D),\; d \in \tilde{B})$. The same derivation used in the previous section implies that this probability is upper bounded by

$$\mathrm{PrivMAF}^\eta(d, MAF^\eta(D)) = \frac{1}{1 - P(d \in \tilde{B} - \tilde{D}) + \frac{P(d \in \tilde{B} - \tilde{D})\, P^\eta_n(MAF^\eta(D))}{n\, P(d = \tilde{d}_1)\, P^\eta_{n-1}(MAF^\eta(D) - d)}}$$

where

$$P^\eta_n(v) = \prod_{j=1}^{m} \sum_{i=0}^{2n} \binom{2n}{i} p_j^i (1 - p_j)^{2n - i}\, P(\eta = 2n v_j - i).$$

Note that the same approximations used in the previous section apply here. In this paper we let $P(\eta = i)$ be chosen proportional to $e^{-\epsilon|i|}$, where $i$ is an integer and $\epsilon$ is a user-chosen privacy parameter (relating to $\epsilon$-differential privacy guarantees). We can then let $MAF^\epsilon = MAF^\eta$ and $\mathrm{PrivMAF}^\epsilon = \mathrm{PrivMAF}^\eta$.

3.5.4 PrivMAF for Data with Truncation

Similarly we can consider the gain in privacy we get by rounding our MAFs. More specifically, consider $k \ge 1$; we let $MAF^{trunc(k)}(D)$ be the result of truncating each entry in $MAF(D)$ after $k$ decimal digits. More formally,

$$MAF^{trunc(k)}(D) = \lfloor MAF(D) \cdot 10^k \rfloor / 10^k.$$

In the below we let $v = MAF^{trunc(k)}(D)$ to make the equations more readable. In order to measure privacy we want to calculate $P(d \in \tilde{D} \mid MAF^{trunc(k)}(\tilde{D}) = v,\; d \in \tilde{B})$. As above, we can upper bound this by

$$\mathrm{PrivMAF}^{trunc(k)}(d, v) = \frac{1}{1 - P(d \in \tilde{B} - \tilde{D}) + \frac{P(d \in \tilde{B} - \tilde{D})\, P(MAF^{trunc(k)}(\tilde{D}) = v)}{n\, P(d = \tilde{d}_1)\, P(MAF^{trunc(k)}(\tilde{D}) = v \mid d = \tilde{d}_1)}}.$$

Note

$$P(MAF^{trunc(k)}(\tilde{D}) = v) = \prod_{j=1}^{m} P(MAF_j^{trunc(k)}(\tilde{D}) = v_j)$$

and

$$P(MAF^{trunc(k)}(\tilde{D}) = v \mid d = \tilde{d}_1) = \prod_{j=1}^{m} P(MAF_j^{trunc(k)}(\tilde{D}) = v_j \mid d = \tilde{d}_1).$$

If $S_k(v_j) = \{x \mid \frac{x}{2n} \text{ truncates to } v_j\}$, then

$$P(MAF_j^{trunc(k)}(\tilde{D}) = v_j) = \sum_{i \in S_k(v_j)} \binom{2n}{i} p_j^i (1 - p_j)^{2n - i}$$

and

$$P(MAF_j^{trunc(k)}(\tilde{D}) = v_j \mid d = \tilde{d}_1) = \sum_{i \in S_k(v_j)} \binom{2(n-1)}{i - d_j} p_j^{i - d_j} (1 - p_j)^{2n - i + d_j - 2}.$$

This allows us to calculate $\mathrm{PrivMAF}^{trunc(k)}(d, MAF^{trunc(k)}(D))$, just as we wanted.
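A small sketch of the truncation-bin probability just derived is below (illustrative, not the released code). It enumerates S_k(v) with exact integer arithmetic to avoid floating point pitfalls at bin edges:

    from scipy.stats import binom

    def trunc_bin_prob(v, k, p, n):
        """P(MAF_j^trunc(k)(D) = v): sum the Binomial(2n, p_j) pmf over the bin
        S_k(v) of all counts i whose frequency i/(2n) truncates to v."""
        target = int(round(v * 10**k))          # v is already truncated to k digits
        S = [i for i in range(2 * n + 1)
             if (i * 10**k) // (2 * n) == target]  # exact integer arithmetic
        return sum(binom.pmf(i, 2 * n, p) for i in S)

    # e.g. probability that a SNP with background MAF 0.3, in a study of n=100
    # people, has a released one-digit-truncated MAF equal to 0.3:
    print(trunc_bin_prob(0.3, 1, 0.3, 100))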
3.5.5 Comparison to previous approaches

Our approach is the first to give privacy guarantees for all individuals in a study. The method endorsed by Sankararaman et al. [86] provides guarantees of a sort: since the log likelihood test gives the best power for a given false positive ratio, it ensures that the power of any test cannot be too big. The problem with their guarantee is that it is an aggregate guarantee, and does not ensure the safety of all participants. Our approach, on the other hand, does ensure privacy for all involved. Our method also takes into account the size of the pool from which our study is drawn, something the likelihood approach does not take into account but which is important in measuring privacy. The PPV approach suggested by Craig et al. [76] does take the background population size into account, but again does not come with any privacy guarantees that hold for all participants. We also believe that our method gives a more intuitive measure of privacy than previous ones (though of course this is subjective). One might argue that the difference between the worst case and average case privacy loss is not that large, but our experiments do not seem to support this claim.

Zhou et al. [7] have also presented work with strong privacy guarantees; however, they examined the frequency and likelihood of pairs of alleles rather than MAFs. Moreover, they give guarantees of a combinatorial nature (using k-anonymity), whereas ours are probabilistic in nature.

3.5.6 A Release Mechanism: Allele Leakage Guarantee Test

Motivation

As mentioned in the paper, we want to use the above measure to decide if it is safe to release $MAF(D)$ (note that we could do something similar for perturbed MAFs, but do not do so here). Assume we want to bound the probability of an adversary figuring out whether someone took part in the study by at most $\alpha$, where $0 < \alpha < 1$. A first guess at how to do this might be to look at PrivMAF, and release if and only if it is at most $\alpha$. Though in practice this approach seems to work well, in theory we can run into trouble. The problem is that the decision of whether or not to release gives away a little information about $D$ and thus destroys our probability guarantees. (To understand why, note that deciding to release if and only if PrivMAF is less than $\alpha$ means that any data being released gives away two pieces of information: the value of MAF and the fact that PrivMAF is less than $\alpha$. If PrivMAF is greater than $\alpha$ with non-negligible probability, say 50 percent or so, this extra bit of information can actually be very informative.)

An obvious fix is to release if and only if

$$\max_{d \in \{0,1,2\}^m} \mathrm{PrivMAF}(d, MAF(D)) \le \alpha.$$

In this case the decision of whether or not to release gives no more information than outputting $MAF(D)$ by itself. It turns out that this quantity is easy to calculate, and gives us the security guarantee we want. Unfortunately it is also overkill: the worst-case behavior is often much worse than the average case, so this policy is likely to tell us a dataset is not safe to release even when it is.

This leads us to propose another solution. For any choice of $\beta$ we can define

$$P_\beta = P\left(\max_{d \in \tilde{D}} \mathrm{PrivMAF}(d, MAF(\tilde{D})) \le \beta \;\middle|\; x(\tilde{D}) = x\right)$$

where the probability is taken over the choice of $\tilde{D}$. Choose $\beta$ so that

$$\alpha \ge \frac{1}{1 + \left(\frac{1}{\beta} - 1\right) P_\beta - \max_{d \in \{0,1,2\}^m} P(d \in \tilde{B} - \tilde{D})}$$

and release the data if and only if $\mathrm{PrivMAF}(D)$ is less than or equal to $\beta$. This release test is what is referred to as the Allele Leakage Guarantee Test (ALGT) in the paper. We can show that ALGT gives us the privacy we require without too much overkill. On the other hand, it is much slower than the above methods, since calculating $P_\beta$ is slow (as described below).

Derivation

ALGT tells us that, given both $MAF(D)$ and the knowledge leaked by the decision to release, from the adversary's view the probability that $d \in D$ is at most $\alpha$ for any choice of $d \in D$. More formally:

Theorem 4. Choose $\beta$ as above. Then, if $\mathrm{PrivMAF}(D) \le \beta$, for any choice of $d \in D$ we get that

$$P\left(d \in \tilde{D} \;\middle|\; d \in \tilde{B},\; x(\tilde{D}) = x,\; \max_{d' \in \tilde{D}} \mathrm{PrivMAF}(d', MAF(\tilde{D})) \le \beta\right) \le \alpha.$$

Proof. The proof basically comes down to repeated applications of the definition of conditional probability, independence, and Bayes' rule. Let $R$ be the event that

$$\max_{d' \in \tilde{D}} \mathrm{PrivMAF}(d', MAF(\tilde{D})) \le \beta.$$
Then

$$P(d \in \tilde{D} \mid d \in \tilde{B},\; x(\tilde{D}) = x,\; R) = \frac{P(d \in \tilde{D} \mid x(\tilde{D}) = x, R)}{P(d \in \tilde{D} \mid x(\tilde{D}) = x, R) + P(d \in \tilde{B} - \tilde{D})(1 - P(d \in \tilde{D} \mid x(\tilde{D}) = x, R))}$$

$$= \frac{1}{1 + P(d \in \tilde{B} - \tilde{D})\left(\frac{1}{P(d \in \tilde{D} \mid x(\tilde{D}) = x, R)} - 1\right)}. \tag{3.4}$$

To simplify this, note that

$$P(d \in \tilde{D} \mid x(\tilde{D}) = x, R) = \frac{P(R,\; d \in \tilde{D} \mid x(\tilde{D}) = x)}{P_\beta} \le \frac{P(d \in \tilde{D} \mid x(\tilde{D}) = x)}{P_\beta} \le \frac{\mathrm{PrivMAF}(d, MAF(D))\, P(d \in \tilde{B} \mid x(\tilde{D}) = x)}{P_\beta}.$$

To simplify this further we look at $\frac{P(d \in \tilde{B} \mid x(\tilde{D}) = x)}{P(d \in \tilde{B} - \tilde{D})}$, which we see equals

$$\frac{P(d \in \tilde{D} \mid x(\tilde{D}) = x) + P(d \in \tilde{B} - \tilde{D})(1 - P(d \in \tilde{D} \mid x(\tilde{D}) = x))}{P(d \in \tilde{B} - \tilde{D})} = 1 + \frac{P(d \in \tilde{D} \mid x(\tilde{D}) = x)(1 - P(d \in \tilde{B} - \tilde{D}))}{P(d \in \tilde{B} - \tilde{D})}.$$

Using the fact that $P(d \in \tilde{B} - \tilde{D}) = P(d \in \tilde{B} - \tilde{D} \mid x(\tilde{D}) = x)$, this becomes

$$\frac{1}{1 - P(d \in \tilde{D} \mid d \in \tilde{B}, x(\tilde{D}) = x) + P(d \in \tilde{D} \mid d \in \tilde{B}, x(\tilde{D}) = x)\, P(d \in \tilde{B} - \tilde{D})} \le \frac{1}{1 - \mathrm{PrivMAF}(d, MAF(D)) + \mathrm{PrivMAF}(d, MAF(D))\, P(d \in \tilde{B} - \tilde{D})}.$$

Combining the two bounds above,

$$P(d \in \tilde{B} - \tilde{D})\left(\frac{1}{P(d \in \tilde{D} \mid x(\tilde{D}) = x, R)} - 1\right) \ge \left(\frac{1}{\mathrm{PrivMAF}(d, MAF(D))} - 1\right) P_\beta - P(d \in \tilde{B} - \tilde{D})(1 - P_\beta).$$

Putting it all together we see that

$$P(d \in \tilde{D} \mid d \in \tilde{B},\; x(\tilde{D}) = x,\; R) \le \frac{1}{1 + \left(\frac{1}{\mathrm{PrivMAF}(d, MAF(D))} - 1\right) P_\beta - P(d \in \tilde{B} - \tilde{D})(1 - P_\beta)} \le \frac{1}{1 + \left(\frac{1}{\beta} - 1\right) P_\beta - \max_{d' \in \{0,1,2\}^m} P(d' \in \tilde{B} - \tilde{D})} \le \alpha,$$

where the second inequality uses $\mathrm{PrivMAF}(d, MAF(D)) \le \beta$ and $1 - P_\beta \le 1$, and the last follows from our choice of $\beta$. This is what we wanted. $\square$

Note that, in practice, since $\max_{d \in \{0,1,2\}^m} P(d \in \tilde{B} - \tilde{D}) \ll 1$, we choose an approximate $\beta$ such that

$$\alpha = \frac{1}{1 + \left(\frac{1}{\beta} - 1\right) P_\beta}.$$

Scalability of ALGT

We have presented results on moderate-sized datasets. We have also run our algorithm on larger artificial datasets (with 10,000 individuals and 1,000 SNPs) and found that our ALGT implementation still runs in a reasonable amount of time, completing in just over 8 hours on a single core (Methods). Although our current implementation runs on a single core, the PrivMAF framework permits parallelization of the Monte Carlo sampling that is the major computational bottleneck in our pipeline (i.e., computing $\beta$), and thus is able to benefit from any parallel or distributed computing system. As dataset sizes grow, we expect to be able to keep pace by computing the PrivMAF statistic more efficiently.

3.5.7 Changing the Assumptions

The above model makes a few assumptions (assumptions that are present in most previous work that we are aware of). In particular, it assumes that there is no linkage disequilibrium (LD) (which is to say that the SNPs are independently sampled), that the genotypes of individuals are independent of one another (that there are no relatives, population stratification, etc. in the population), and that the background population is in Hardy-Weinberg (H-W) equilibrium. The assumption that genotypes of different individuals are independent of one another is difficult to remove, and we do not consider it here. We can, however, remove either the assumption of H-W equilibrium or of SNPs being independent.

First consider the case of H-W equilibrium. Let us consider the $i$th SNP, and let $p_i$ be the minor allele frequency. We also let $p_{0,i}$, $p_{1,i}$ and $p_{2,i}$ be the probabilities of having zero, one, or two copies of the minor allele, respectively. Assuming the population is in H-W equilibrium is the same as assuming that $p_{0,i} = (1 - p_i)^2$, $p_{1,i} = 2p_i(1 - p_i)$, and $p_{2,i} = p_i^2$. Dropping this assumption, we see that all of the calculations above still hold, except we get that

$$\Pr(x_i(\tilde{D}) = x_i) = \sum_{c} \binom{n}{c}\binom{n - c}{x_i - 2c}\, p_{2,i}^{c}\, p_{1,i}^{x_i - 2c}\, p_{0,i}^{n - x_i + c}$$

where we use the convention that $\binom{a}{b} = 0$ when $b < 0$.
This allows us to remove the assumption of H-W equilibrium. Unfortunately there are two problems with this approach. The first is statistical: instead of having to estimate just one parameter per SNP ($p_i$), we have to estimate two ($p_{0,i}$ and $p_{1,i}$, since $p_{2,i}$ can be calculated from the other two). The other problem is that calculating $\Pr(x_i(\tilde{D}) = x_i)$ suddenly becomes much more computationally intensive, so much so that it is prohibitive for large datasets.

In order to drop the assumption of no LD, we can model the genome as a Markov model (one could also use a hidden Markov model instead, which allows for more complex relationships, but for simplicity's sake we will only talk about Markov models, since the generalization to HMMs is straightforward). In such a model the state of a given SNP depends only on the state of the previous SNP. To specify such a model we need to specify the probability distribution of the first SNP, and for each subsequent SNP we need to specify its distribution conditional on the previous SNP. It is then straightforward to modify our framework to deal with this model. As above, however, this requires us to estimate many parameters and is also much more time consuming; thus it is not likely to be useful in practice.

Estimating P_β

Unfortunately, $P_\beta = P(\mathrm{PrivMAF}(\tilde{D}) \le \beta \mid x(\tilde{D}) = x)$ is not so easy to calculate. We use a Monte Carlo type approach: we sample $\tilde{D}$ conditional on $x(\tilde{D}) = x$, then estimate $P_\beta$ as the percentage of the generated $\tilde{D}$ for which $\max_{d \in \tilde{D}} \mathrm{PrivMAF}(d, MAF(\tilde{D})) \le \beta$.

This approach requires us to be able to sample $\tilde{D}$ such that $x(\tilde{D}) = x$. In order to do this, consider $t_i = \#\{j \mid d_{j,i} = 2\}$, where $d_{j,i}$ is the genotype of $d_j$ at SNP $i$. Then the probability that $t_i = t$ is proportional to

$$\binom{n}{t}\binom{n - t}{x_i - 2t}\, p_i^{2t}\,(2p_i(1 - p_i))^{x_i - 2t}\,(1 - p_i)^{2(n + t - x_i)}$$

where we hold to the convention that $\binom{r}{m} = 0$ if $r < 0$, $m < 0$, or $r < m$. This allows us to sample $t_i$. Knowing $t_i$, we can then calculate the number of $j$ such that $d_{j,i} = 1$ (namely $x_i - 2t_i$) and the number that equal 0 (namely $n + t_i - x_i$). We can then randomly choose $t_i$ individuals to have $d_{j,i} = 2$, and similarly for $d_{j,i} = 1$ and $d_{j,i} = 0$. Repeating this process for all of the SNPs gives us a random sample of $\tilde{D}$ conditional on $x(\tilde{D}) = x$.

Of course, Monte Carlo estimation is often very slow. What alternatives do we have? One is to note that, conditional on $x(\tilde{D}) = x$, $\mathrm{PrivMAF}(\tilde{d}_1, MAF(\tilde{D}))$ is a monotone function of a sum $M$ of per-SNP terms depending on $\tilde{d}_{1,i}$, and that as $m$ goes to infinity we get that (under reasonable assumptions, such as all MAFs being at least .05 and $n$ fixed)

$$\frac{M - E[M]}{\sqrt{\mathrm{var}(M)}} \to \chi$$

where $E[M]$ is the expected value of $M$, $\mathrm{var}(M)$ is its variance (both of which we can calculate), and $\chi$ is a unit normal centered at 0. This result follows from considering the distribution, $Q$, on the pairs $(x_i, p_i)$. Using the Central Limit Theorem, it is straightforward to show the result holds when $Q$ has finite support. One can then use a limiting argument to show it holds for more general $Q$ (we do not include the details here). This fact gives us a means of estimating $P(\mathrm{PrivMAF}(\tilde{d}_1, MAF(\tilde{D})) \le \beta \mid x(\tilde{D}) = x)$, which can then be used to estimate $P_\beta$. Unfortunately this is only an asymptotic bound, and experiments show that it often gives poor estimates in practice, so we have chosen not to use it. It can be hoped, however, that more robust approximations are possible to speed up this calculation.
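A sketch of this conditional sampler is below (illustrative; it computes the unnormalized weights for t in log space and then assigns genotypes to random individuals, one SNP at a time):

    import numpy as np
    from scipy.special import gammaln

    rng = np.random.default_rng(2)

    def log_binom(a, b):
        """log C(a, b), with -inf when the coefficient is zero."""
        b = np.asarray(b, dtype=float)
        out = gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1)
        return np.where((b < 0) | (b > a), -np.inf, out)

    def sample_column(x_i, p_i, n):
        """Sample one SNP's genotype column conditional on its minor allele count x_i."""
        t = np.arange(x_i // 2 + 1)  # candidate counts of genotype-2 individuals
        logw = (log_binom(n, t) + log_binom(n - t, x_i - 2 * t)
                + 2 * t * np.log(p_i)
                + (x_i - 2 * t) * np.log(2 * p_i * (1 - p_i))
                + 2 * (n + t - x_i) * np.log(1 - p_i))
        w = np.exp(logw - logw.max())
        t_i = rng.choice(t, p=w / w.sum())
        col = np.array([2] * t_i + [1] * (x_i - 2 * t_i) + [0] * (n + t_i - x_i))
        rng.shuffle(col)
        return col

Repeating sample_column for every SNP and stacking the columns yields one draw of D with the required minor allele counts.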
Approximating β

Calculating $\beta$ can be quite time consuming, so one might be tempted to avoid calculating it. One way to do this could be to use $\alpha$ instead of $\beta$. Experiments on simulated datasets show that there is some $\beta_0$ so that if $\alpha$ is above $\beta_0$ then $\beta$ is about equal to $\alpha$, while below $\beta_0$ we see $\beta$ quickly decays to 0 (this can be seen, for example, in Figure 3-4). This implies that, if PrivMAF is significantly below $\alpha$ (where we do not attempt to define "significantly below" here), then we should expect $\beta$ to be close to $\alpha$, so PrivMAF should be below $\beta$ as well. This is a heuristic, but this line of reasoning seems like it could lead to something more reliable; more work is needed to know for sure.

3.5.8 Estimating the parameters

The above model requires estimates of the $p_i$. How are they estimated? The straightforward method is to take another collection of individuals (our reference population) drawn from the same background population as our study participants. The minor allele frequencies of this population can then serve as an estimate of the minor allele frequencies for the background population. Alternatively, we can estimate the $p_i$ parameters from the union of this collection of individuals with the study participants, a method advocated by some previous papers. An alternative approach is to use Bayesian methods to place a prior on $p_i$, which can then be updated based on the data in the outside population. We can then use this posterior probability on $p_i$ to estimate $P(x_i(\tilde{D}) = x_i)$. In our results we used the naive approach, though arguments can be made for the other two.

The other parameter that one must consider is $N$, the size of the background population. This depends a lot on the context, and giving a realistic estimate of it is critical. In most applications the background population from which the study is drawn is fairly obvious. That being said, one needs to be careful of any other information released in the paper about participants: just listing a few facts about the participants can greatly reduce $N$, greatly weakening the bounds on privacy guarantees (since the probability of a privacy compromise is roughly inversely proportional to $N - n$).

Chapter 4

Improved Privacy Preserving Counting Queries in Medical Databases

4.1 Introduction

The rise of electronic health records (EHR) has led to increased interest in using clinical data as a source of valuable information for biomedical research [72, 28, 87, 103, 24]. In particular, there is interest in using this data to improve study design by, for example, helping identify cohorts of patients that can be included in medical studies. A first step in this selection is figuring out how many patients in a given database might be eligible to participate. The answer to such count queries can be used in budgeting and study planning.

Unfortunately, even this simple application of EHRs raises concerns over patient privacy. It seems hard to believe that releasing a few count queries can lead to a major loss of privacy. Previous work has shown, however, that similar count queries can undermine user privacy on social networking websites such as Facebook [70], a result that can be extended to medical databases. One could, for example, ask for the number of 25 year old males on a certain medication who do not have HIV. If the answer returned is zero, you know that any 25 year old male patient on that medication is HIV positive, a very private piece of information. In order to help deal with this problem, it has been suggested that instead of releasing raw counts, institutions should release perturbed versions [97].
Our contribution is to present an improved mechanism for releasing differentially private answers to count queries.

4.2 Previous Work

One approach for ensuring privacy in this situation is to use a trust-but-verify framework, where medical professionals are trusted to make queries, but the query records are kept and checked to ensure that no abuses have taken place. The process of checking these records is known as auditing. There have been numerous approaches suggested to produce such an auditable system [29]. Recently, some have started to work on auditing frameworks specific to the biomedical setting [51, 58]. A trust-but-verify approach allows medical professionals to get access to the data that they need while decreasing abuse. This does not prevent all abuses, however, instead relying on fear of punishment to prevent bad behavior. Moreover, it is not always clear (if it is ever clear) which queries should raise red flags in an audit and which should not.

Because of these drawbacks it has been suggested that, instead of releasing raw counts, one could release counts perturbed by a small amount of noise in the hope that this noise will thwart such privacy threats. These ideas have been included in i2b2 and STRIDE [28, 80, 79, 97]. Both of these methods work by adding truncated Gaussian noise to the query results. Unfortunately, the addition of Gaussian noise is ad hoc and not based off privacy guarantees. In order to remedy this issue, Vinterbo et al. [97] suggested using the exponential mechanism as a means to produce differentially private answers to count queries. Although this is not the first attempt to use differential privacy in a medical research context, it is the first we are aware of to do so as a way to improve cohort selection [100, 15, 55, 65]. In this work, we modify their method, which is described below, to get a mechanism for releasing count data in a privacy preserving manner while simultaneously preserving higher levels of utility.

4.3 Exponential Mechanism

Our approach builds on the one of Vinterbo et al. [97]. In a nutshell, they assume that there is a database consisting of $n$ patients, and they want to return a perturbed version of the number of people in that database with a certain condition. To accomplish this goal, they introduce a loss function $q$ defined by

$$q(y, c) = \begin{cases} \beta_+ (c - y)^{\alpha_+} & \text{if } c > y \\ \beta_- (y - c)^{\alpha_-} & \text{else} \end{cases}$$

where $\alpha_+$, $\alpha_-$, $\beta_+$, $\beta_-$ are parameters given by the user, $c$ is an integer in the range $[r_{min}, r_{max}]$, and $y$ is the actual count we want to release. As opposed to the Laplace mechanism, which uses the $L_1$ error as a loss function, this loss function allows users to weigh over-estimates and under-estimates differently. In what follows we assume $\alpha_+ \le 1$ and $\alpha_- \le 1$. No such assumption was made in previous works, but in practice it seems realistic, since if $\alpha_+$ or $\alpha_-$ is larger than 1 it follows from Vinterbo et al. that $q$ is very sensitive, which results in very inaccurate query results.

Note that $q$ has sensitivity $\Delta q = \max(\beta_+, \beta_-)$. Therefore, if we define $X_w$ to be a random function such that $P(X_w(y) = c)$ is proportional to $\exp(-w\,q(y, c))$, then $X_w$ is $2\Delta q\, w$-differentially private [78]. This is the mechanism introduced by Vinterbo et al. [97]. Their analysis, however, is based off of the general analysis of the exponential mechanism. By providing a more refined analysis, we are able to modify the above algorithm to give better utility for a given choice of $\epsilon$.
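A minimal sketch of this baseline mechanism follows (illustrative only: exact sampling over the finite range [r_min, r_max], with weights normalized in a numerically safe way; all function names are assumptions).

    import numpy as np

    rng = np.random.default_rng(3)

    def loss(y, c, a_plus, a_minus, b_plus, b_minus):
        """q(y, c): asymmetric loss penalizing over- and under-estimates differently."""
        d = np.abs(c - y).astype(float)
        return np.where(c > y, b_plus * d ** a_plus, b_minus * d ** a_minus)

    def exp_mechanism(y, w, r_min, r_max, a_plus, a_minus, b_plus, b_minus):
        """Sample c with P(c) proportional to exp(-w * q(y, c))."""
        c = np.arange(r_min, r_max + 1)
        logw = -w * loss(y, c, a_plus, a_minus, b_plus, b_minus)
        probs = np.exp(logw - logw.max())
        return rng.choice(c, p=probs / probs.sum())

    # With sensitivity dq = max(b_plus, b_minus), taking w = eps / (2 * dq)
    # makes the release eps-differentially private by the standard analysis.
    print(exp_mechanism(y=100, w=0.25, r_min=3, r_max=10**6,
                        a_plus=1, a_minus=1, b_plus=2, b_minus=1))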
4.4 Derivation of Mechanism

In order to improve upon previous algorithms, we give a new analysis of the privacy preserving properties of the exponential mechanism. This result is summed up in Theorem 5.

Theorem 5. Assume that $\alpha_+, \alpha_- \in [0, 1]$, $\beta_+ > 0$, $\beta_- > 0$, and that

$$q(y, c) = \begin{cases} \beta_+(c - y)^{\alpha_+} & \text{if } c > y \\ \beta_-(y - c)^{\alpha_-} & \text{else.} \end{cases}$$

Let $X(y)$ be a random variable defined so that $P(X(y) = c)$ is proportional to $\exp(-w\,q(y, c))$ for all integers $c \in [r_{min}, r_{max}]$ and $y \in [0, n]$. Then we have that, if

$$w \le \min\left(\frac{\epsilon}{\beta_+}, \frac{\epsilon}{\beta_-}\right) = \frac{\epsilon}{\Delta q}$$

and

$$P(X(r_{max}) = r_{max}) \le \exp(\epsilon)\, P(X(r_{max} - 1) = r_{max})$$

and

$$P(X(r_{min}) = r_{min}) \le \exp(\epsilon)\, P(X(r_{min} + 1) = r_{min}),$$

then $X(y)$ is $\epsilon$-differentially private.

Proof. Let

$$Z_y = \sum_{c = r_{min}}^{r_{max}} \exp(-w\,q(y, c)),$$

so that by definition $P(X(y) = c) = \exp(-w\,q(y, c))/Z_y$. For a given $c$ we are interested in the ratios $\frac{P(X(y) = c)}{P(X(y+1) = c)}$ and $\frac{P(X(y+1) = c)}{P(X(y) = c)}$. There are three cases: the first is when $y \ge r_{max}$, the second when $y < r_{min}$, and the third when $y \in [r_{min}, r_{max} - 1]$.

Consider the case when $y < r_{min}$. Then every $c$ in the range satisfies $c > y$, so $q(y + 1, c) \le q(y, c)$ and hence $Z_y \le Z_{y+1}$. Thus

$$\frac{P(X(y+1) = c)}{P(X(y) = c)} = \frac{Z_y}{Z_{y+1}}\exp(w\beta_+((c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+})) \le \exp(w\beta_+((c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+})).$$

Since $\alpha_+ \le 1$ and $c - y > c - y - 1 \ge 0$, we have $(c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+} \le 1$, so the above is at most $\exp(w\beta_+) \le \exp(\epsilon)$. On the other hand,

$$\frac{P(X(y) = c)}{P(X(y+1) = c)} = \frac{Z_{y+1}}{Z_y}\exp(-w\beta_+((c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+})) \le \frac{Z_{y+1}}{Z_y}.$$

Since $q$ has sensitivity bounded by $\Delta q = \max(\beta_-, \beta_+)$, we see that $\frac{Z_{y+1}}{Z_y} \le \exp(w\Delta q) \le \exp(\epsilon)$, so putting the above together,

$$\frac{P(X(y) = c)}{P(X(y+1) = c)} \le \exp(\epsilon).$$

By a similar argument, when $y \ge r_{max}$ both ratios are at most $\exp(\epsilon)$.

Finally, consider the case when $r_{min} \le y < r_{max}$. Then note that

$$Z_{y+1} = Z_y - \exp(-w\,q(y, r_{max})) + \exp(-w\,q(y+1, r_{min})) = Z_y - \exp(-w\beta_+(r_{max} - y)^{\alpha_+}) + \exp(-w\beta_-(y - r_{min} + 1)^{\alpha_-}).$$

We will consider $\frac{P(X(y) = c)}{P(X(y+1) = c)} = \exp(w(q(y+1, c) - q(y, c)))\frac{Z_{y+1}}{Z_y}$. If $c > y$ then $q(y, c) \ge q(y+1, c)$, so $\exp(w(q(y+1, c) - q(y, c))) \le 1$; if $c \le y$ then $\exp(w(q(y+1, c) - q(y, c))) \le \exp(w\beta_-)$, with equality achieved when $c = y$. So in either case it suffices to bound $\frac{Z_{y+1}}{Z_y}$.

Note that $\exp(-w\beta_+(r_{max} - y)^{\alpha_+})$ is increasing in $y$ while $\exp(-w\beta_-(y - r_{min} + 1)^{\alpha_-})$ is decreasing in $y$. Thus it is easy to see that there exists a $y_0 \in (r_{min}, r_{max})$ such that $-\exp(-w\beta_+(r_{max} - y)^{\alpha_+}) + \exp(-w\beta_-(y - r_{min} + 1)^{\alpha_-}) \le 0$ if $y \ge y_0$, while $-\exp(-w\beta_+(r_{max} - y)^{\alpha_+}) + \exp(-w\beta_-(y - r_{min} + 1)^{\alpha_-}) > 0$ if $y < y_0$. Thus if $y \ge y_0$ we see that $\frac{Z_{y+1}}{Z_y} \le 1$, so

$$\frac{P(X(y) = c)}{P(X(y+1) = c)} \le \exp(w\beta_-) \le \exp(\epsilon).$$

If, on the other hand, $y < y_0$, this implies that $Z_y \le Z_{y+1}$; so by induction we see that $Z_y \ge Z_{r_{min}}$, while

$$-\exp(-w\beta_+(r_{max} - y)^{\alpha_+}) + \exp(-w\beta_-(y - r_{min} + 1)^{\alpha_-}) \le -\exp(-w\beta_+(r_{max} - r_{min})^{\alpha_+}) + \exp(-w\beta_-)$$

and thus

$$\frac{Z_{y+1}}{Z_y} \le 1 + \frac{-\exp(-w\beta_+(r_{max} - r_{min})^{\alpha_+}) + \exp(-w\beta_-)}{Z_{r_{min}}} = \frac{Z_{r_{min}+1}}{Z_{r_{min}}}.$$

It follows that

$$\frac{P(X(y) = c)}{P(X(y+1) = c)} \le \exp(w\beta_-)\frac{Z_{r_{min}+1}}{Z_{r_{min}}} = \frac{P(X(r_{min}) = r_{min})}{P(X(r_{min}+1) = r_{min})} \le \exp(\epsilon).$$

Thus $P(X(y) = c) \le \exp(\epsilon) P(X(y+1) = c)$ for all $y$ and $c$. A symmetric argument, using the condition at $r_{max}$, shows that $P(X(y+1) = c) \le \exp(\epsilon) P(X(y) = c)$. So $X$ is $\epsilon$-differentially private, as desired. $\square$

This allows us to define Algorithm 1. The basic idea of the algorithm is that it searches for the largest $w$ such that $X_w$ is $\epsilon$-differentially private. The parameter $k$ controls how long we continue this search.
The choice of k affects how precise an approximation of the optimal w we get (the w returned by the algorithm differs from the optimal one by a relative error of at most 1/k).

Corollary 1. Algorithm 1 is ε-differentially private.

Proof. By the proof of correctness of the exponential mechanism we know that c₀ ≤ exp(ε), so the algorithm always returns a value. Moreover, by Theorem 5, this value is ε-differentially private. □

4.5 Theoretical Comparison

As k (the precision of our algorithm) increases the running time increases, but we get a better and better estimate of the optimal value of w for our given ε.

Algorithm 1 An ε-differentially private estimate of a count query
Require: y, ε, α₊, α₋, β₊, β₋, r_max, r_min, k
Ensure: ε-differential privacy
  Let S = {w₀, ..., w_k}, where w_i = (1 + i/k)·ε/(2Δq).
  for i = 0, ..., k do
    X_i = X_{w_i}
  end for
  for i = 0, ..., k do
    a_i = P(X_i(r_max) = r_max)/P(X_i(r_max − 1) = r_max)
    b_i = P(X_i(r_min) = r_min)/P(X_i(r_min + 1) = r_min)
    c_i = max(a_i, b_i)
  end for
  Let i₀ be the largest i such that c_i ≤ exp(ε).
  return Z_ε(y) = X_{i₀}(y)

Note that, if one wanted to improve performance, instead of searching over all w_i one could use a binary search to find the largest w_i ∈ S such that c_i ≤ exp(ε).

We would like to see how our algorithm compares to the naive exponential mechanism of [97]. To do so, we first need to prove the following theorem:

Theorem 6. If w₁ ≥ w₂ then, for any c ≥ 0, we have

    P(q(y, X_{w₁}(y)) ≤ c) ≥ P(q(y, X_{w₂}(y)) ≤ c),

with equality if and only if P(q(y, X_{w₂}(y)) ≤ c) = 1.

Proof. If q(y, x) ≤ c then

    exp(−w₁·q(y, x)) ≥ exp((w₂ − w₁)c)·exp(−w₂·q(y, x)),

and if q(y, x) > c then

    exp(−w₁·q(y, x)) < exp((w₂ − w₁)c)·exp(−w₂·q(y, x)).

For a given w let

    a_w = Σ_{x: q(y,x) ≤ c} exp(−w·q(y, x))   and   b_w = Σ_{x: q(y,x) > c} exp(−w·q(y, x)).

If b_{w₂} = 0 then P(q(y, X_{w₂}(y)) ≤ c) = 1. Otherwise, by the above, a_{w₁} ≥ exp((w₂ − w₁)c)·a_{w₂} and b_{w₁} < exp((w₂ − w₁)c)·b_{w₂}, so

    P(q(y, X_{w₁}(y)) ≤ c) = a_{w₁}/(a_{w₁} + b_{w₁}) > exp((w₂ − w₁)c)·a_{w₂} / (exp((w₂ − w₁)c)·a_{w₂} + exp((w₂ − w₁)c)·b_{w₂}) = P(q(y, X_{w₂}(y)) ≤ c),

just as we wanted. □

This theorem implies that Algorithm 1 always performs at least as well as the naive exponential mechanism in terms of q. More formally:

Corollary 2. For a given ε and c we have

    P(q(y, Z_ε(y)) ≤ c) ≥ P(q(y, X_{ε/(2Δq)}(y)) ≤ c).

Proof. There exists a w, depending only on ε, k and q, such that w ≥ ε/(2Δq) and Z_ε(y) = X_w(y). Thus the corollary follows from Theorem 6. □

4.6 Results

In order to test our algorithm we implemented it in Python (the code is available at http://groups.csail.mit.edu/cb/DPCount/). We also implemented the naive exponential mechanism in Python, so that we could compare the performance of the two methods.

In order to measure the performance of our mechanism we look at the associated risk. The risk of a random function X at y is defined as the expected value of q(y, X(y)) for fixed y. By using the mechanisms described above we are implicitly trying to minimize the risk, so it makes sense to use it as a measure of how well a mechanism performs.

Figure 4-1: We plot privacy parameter ε versus the risk (where the risk is on a log scale) for the naive exponential mechanism (blue) and our mechanism with search parameters k = 10 (green) and k = 100 (red). We see that in all cases ours performs much better than the naive exponential mechanism.
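Since X_w has finite support, this risk can be computed exactly rather than estimated by sampling. The following small Python sketch (our own illustration) does so:

    import numpy as np

    def risk(w, y, r_min, r_max, qfun):
        """Exact risk E[q(y, X_w(y))] of the exponential mechanism at y."""
        cs = np.arange(r_min, r_max + 1)
        q = np.array([qfun(y, c) for c in cs])
        weights = np.exp(-w * (q - q.min()))
        probs = weights / weights.sum()
        return float((probs * q).sum())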
As a first test of how our method stacks up, we compared the risk of our mechanism at a given ε with the risk of the naive exponential mechanism. To do this we chose β₊ = 2, β₋ = 1, α₊ = α₋ = 1, r_min = 3, and r_max = 10⁶, with y = 100 (these parameters are based on [97]). We then plot ε versus risk for the naive mechanism as well as for our mechanism with k = 10 and k = 100 (see Figure 4-1). This figure demonstrates that for all choices of ε our method results in much less risk than the naive approach.

Though our method preserves more utility than the naive exponential mechanism for a given level of privacy, it does come at a slight cost: it takes longer to run. The runtime remains practical, however (Figure 4-2). This is partly because, instead of using the linear search for i₀ given in the algorithm, we implemented a binary search procedure to cut down on running time. Figure 4-2 was generated by measuring the running time of both our mechanism (with k varying) and the naive exponential mechanism for varying choices of r_max. We used the same parameters as in Figure 4-1, and set ε = 1.0. We see that our mechanism does take longer to run, and the running time increases as k increases. On the other hand, even in the worst case considered (k = 100, r_max = 10⁷), our method only takes a few minutes. Since we are using these queries for cohort identification, waiting a few extra minutes for greatly increased utility seems a worthwhile trade-off. Moreover, it seems likely that the search step can be sped up further by using estimates of P(X_w(a) = b), based on the fact that P(|X_w(a) − a| > c) dies off quickly as c increases for reasonable w.

4.7 Choosing ε

The above algorithm assumed we chose some privacy parameter ε and then tried to optimize the utility as much as possible. Often we might instead want to balance utility and privacy [61, 34]. Roughly speaking, the utility of X_w increases as w increases, while privacy increases as ε decreases, so balancing utility with privacy can be done by comparing w to its corresponding ε. Using the analysis given by McSherry et al. [78] one would take w = ε/(2Δq). Theorem 5, however, shows that this is not optimal. We can therefore use our result to choose a better w for a desired privacy-utility trade-off. This can be seen in Figure 4-3, where the ε guaranteed by our analysis for a given w is compared to the ε given by the naive analysis (with the parameters used in Figure 4-1). We see that our method shows that more privacy is guaranteed for less loss in utility, a fact that is useful when setting ε.

Figure 4-2: We plot r_max, the maximum value returned by our algorithm, versus the runtime (both on a log scale) for the naive exponential mechanism (blue) and our algorithm with search parameters k = 10 (green) and k = 100 (red). We see that, though our algorithm is slower, it still runs in a few minutes in all cases.

Figure 4-3: We plot privacy parameter ε versus the weight w (which increases with utility) for the naive exponential mechanism (blue) as well as our analysis with k = 10 (green) and k = 100 (red). A higher w corresponds to higher utility, while a higher ε corresponds to less privacy. Our analysis shows that a given ε corresponds to a larger w, which means that in many algorithms that balance utility and privacy when choosing ε, our analysis will result in adding less noise to the answer of the count query.

It is worth noting that many people may query the system again and again. Although differential privacy ensures that no one query reveals too much information, it may be the case that, taken together, a set of differentially private queries can still lead to privacy breaches. Therefore it is necessary to limit the number of queries
every user makes to the system. After a user uses up all of their queries (that is, uses up their privacy budget), then, in order to gain access to more queries, they must have their previous queries audited. Note that this system runs into some of the same problems as a purely audit based system, but it greatly increases the effort needed to violate someone's privacy. Moreover, it can be hoped that the added noise makes attempts to violate privacy more obvious, since the behavior required to violate privacy will be more abnormal than without the noise.

It is worth noting that, in this case, we might be able to stretch the privacy budget further than one would naively expect. For a given query there may be no sensitive information given away about a person. For example, if we asked the system how many individuals had HIV, then this query would not reveal sensitive information about individuals without HIV. This insight suggests that it might be possible to base the number of allowed queries on the number of sensitive queries made about each patient instead of the total number of queries, though it is not obvious how to do this in a privacy preserving manner.

4.8 Conclusion

The emergence of EHRs offers both the promise of great utility and the threat of privacy loss. Here we provide one possible method to help balance the needs of the medical community with the rights of patients. Though we improve upon previous results, there are still many questions left to be answered.

Even though perturbing data increases privacy, it does not eliminate the possibility of something being revealed. To deal with this potential privacy breach we need methods for auditing query records to see if any malfeasance is being committed. We also ignored the question of what level of privacy (that is, what choice of ε) is appropriate. There are various methods to choose this parameter (such as cost-benefit analysis [34]), and it is up to the medical community to decide which makes the most sense in this context.

Finally, it should be noted that we addressed only one aspect of secondary use of EHRs. Electronic records have the possibility of providing a cornucopia of useful information, but there is a need to consider how to balance these gains with patient privacy, whether through new privacy preserving methods like the above or through new laws and regulations.

Chapter 5

Picking Top SNPs Privately with the Allelic Test Statistic

5.1 Introduction

Genome-wide association studies (GWAS) are a cornerstone of genotype-phenotype association in humans. These studies use various statistical tests to measure which polymorphisms in the genome are important for a given phenotype and which are not. With the increasing collection of genomic data in the clinic there has been a push towards using this information to validate classical GWAS findings and generate new ones [103]. Unfortunately, there is growing concern that the results of these studies might lead to loss of privacy for those who participate in them [60, 19, 75]. These privacy concerns have led some to suggest using statistical tests that are differentially private [66, 105, 102, 67, 95].
On the bright side, such methods, properly used, can help ensure a high degree of privacy. These privacy gains, however, have traditionally come at a high cost in utility and efficiency. Moreover, since the genome is extremely high dimensional, this cost is especially pronounced, as was noted in previous works [95]. In order to help balance utility and privacy, new methods are needed that provide greater utility than current methods while achieving equal or greater privacy.

Here we improve upon the state of the art in differentially private GWAS. We build on the work of Johnson and Shmatikov [67], which applied the ideas of differential privacy to common analysis approaches in case-control GWAS. In particular, we show how to use nonconvex optimization to overcome many of the limitations of their method for picking high scoring SNPs in a differentially private way, making the approach computationally tractable [67]. Secondly, we demonstrate how to give improved significance estimates for the chosen SNPs using input, as opposed to output, perturbation based methods. Taken together, these results substantially advance our ability to perform differentially private GWAS.

5.1.1 Previous Work

Previous works have looked at using differentially private versions of the Pearson χ² and allelic test statistics (defined below) to find high scoring SNPs, beginning with the work of Uhler et al. [95]. Since then numerous others have worked on this problem [66, 105, 102, 67], and there has even been a competition where teams attempted to improve on the state of the art [66]. These works focused on three different approaches for picking high scoring SNPs: a neighbor distance based one, a Laplacian mechanism based one, and a score based one [95]. These studies have suggested the score based method is an improvement on the Laplacian based method. The relation between the neighbor based method and the other two is more complicated, however. Though it often outperforms them, it turns out that the ranking of SNPs favored by the neighbor method is not always the same as that favored by the other methods. Moreover, the neighbor method is more computationally demanding, leading others to use approximate versions of it [95]. Some of these works have also resorted to weaker privacy definitions, assuming the control group's genome is publicly available [105].

Beyond just choosing high scoring SNPs, others have also looked at ways of estimating significance after choosing the SNPs of interest. This goal has been achieved by calculating the sensitivity of the allelic test statistic and applying the Laplace mechanism directly to it, or by performing similar procedures for p-values [95].

5.2 Our Contributions

We significantly improve upon the promising neighbor distance based mechanism for releasing top SNPs [67]. We introduce an adaptive threshold approach which overcomes issues arising from the fact that the neighbor mechanism might favor a different ordering than that given by the allelic test statistic. We then introduce a faster algorithm for calculating the neighbor distance used in this method, making it tractable for large datasets. This algorithm works in three steps: (i) stating the problem as an optimization problem; (ii) solving a relaxation of this problem in constant time; and (iii) rounding the relaxed solution to a solution of the original problem. We also show how to obtain accurate estimates of the allelic test statistic by focusing on the input, rather than the output, of the pipeline.
In particular, we show that the input perturbation based method greatly improves accuracy over traditional output perturbation based techniques. We demonstrate this capability in two different set ups, one corresponding to protecting all genomic information (scenario 1) and one corresponding to protecting only the information from the case cohort (scenario 2). In addition, we look at methods for improving output perturbation in scenario 2. Finally, we apply our methods to real GWAS data, demonstrating both our greatly improved computational performance and accuracy compared to the state of the art.

5.3 Set Up

Assume we have a case-control cohort. For a given SNP let r₀, r₁ and r₂ be the number of individuals in the case cohort with 0, 1 or 2 copies of the minor allele, respectively. Similarly let s₀, s₁ and s₂ be the corresponding quantities for the control cohort, and n₀, n₁ and n₂ the same quantities over the entire study population. Let R be the number of cases, S the number of controls, and N the total number of participants. We assume that R, S and N are known. The allelic test statistic is given by

    Y = 2N((2r₀ + r₁)S − (2s₀ + s₁)R)² / (RS(2n₀ + n₁)(n₁ + 2n₂)).

Note that Y depends only on x = 2r₀ + r₁ and y = 2s₀ + s₁, so we can overload notation and let

    Y(x, y) = 2N(xS − yR)² / (RS(x + y)(2N − x − y)).

This statistic is commonly used in GWAS studies to help determine which SNPs are associated with a given condition [95]; we give a short computational sketch of it below. (Note that the allelic test statistic ignores the effects of population stratification and similar issues, a problem we address in the next chapter.)

We consider two different scenarios in this chapter. In scenario 1 we assume that all genomic and phenotypic data is private; that is to say, only R, S and N are known. In scenario 2 we do not try to hide information about the control cohort and instead only try to hide the case cohort; that is to say, s₀, s₁ and s₂ are also known. This scenario has been investigated in previous work [105, 95]. Unless otherwise stated, assume we are in scenario 1.

5.4 GWAS Data

In this work we test our method and others on a rheumatoid arthritis (RA) dataset, NARAC-1, from Plenge et al. [47]. We performed quality control, removing all SNPs with minor allele frequency (MAF) less than 0.05 and considering only SNPs that were successfully called for all individuals. After quality control the dataset contains 893 cases and 1244 controls, with a total of 62,441 SNPs to be considered.

5.5 Picking Top SNPs with the Neighbor Mechanism

In practice we will not know ahead of time which SNPs are related to a given phenotype. Thus we would like to pick the top SNPs (that is, those with the largest scores) in a differentially private way. Previous works have used three different methods, namely a Laplacian based method, the score method, and the neighbor method. Here we will focus on the neighbor method, which is given in Algorithm 2.

Let D and D' be databases of size n. The neighbor distance between D and D' is the smallest value of k such that there exists a sequence of databases D₀, ..., D_k where, for all i, D_i and D_{i+1} differ in exactly one entry, with D₀ = D and D_k = D'. It is worth noting that this defines a metric (a distance metric is a measure of distance that obeys several mathematical properties, including the triangle inequality) on the space of all databases of size n. In both scenarios the neighbor distance equals the number of entries in which D and D' differ. Sometimes, to simplify notation, we will let ρ(D, D') denote the neighbor distance.
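Before turning to the neighbor method itself, here is the promised minimal Python sketch (our own illustration; the function name is hypothetical) of the allelic test statistic Y(x, y) from Section 5.3. Later sketches in this chapter reuse it.

    def allelic_test(x, y, R, S, N):
        """Allelic test statistic Y(x, y), where x = 2*r0 + r1 counts one
        allele among the R cases and y = 2*s0 + s1 counts the same allele
        among the S controls; N = R + S is the number of participants."""
        num = 2 * N * (x * S - y * R) ** 2
        den = R * S * (x + y) * (2 * N - x - y)
        return num / den

    # Example usage with made-up counts: 100 cases, 150 controls
    print(allelic_test(80, 90, 100, 150, 250))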
The neighbor method for picking SNPs works by picking a threshold w. All SNPs with an allelic score higher than w are considered significant, while all others are considered not significant. The neighbor distance of a given SNP to the threshold w is the minimum number of changes needed to flip that SNP from significant to not significant or vice versa; that is to say, the minimum neighbor distance to a database in which the SNP lies on the other side of the threshold. We can then use this distance measure to pick our SNPs in a differentially private manner, as shown in Algorithm 2. This algorithm relies on two pieces of information: a score function q and a definition of allowed datasets. Here we will assume that q is the allelic test statistic, and in our set up all datasets with R cases and S controls are allowed.

Note that this definition relies on an arbitrary choice of w. Because of this it is often the case that the SNPs favored by the neighbor method are not the same as those with the highest allelic score [105, 95]. To deal with this problem we instead pick the threshold in a differentially private way. This method is codified in Algorithm 3. It ensures that, as ε₁ and ε₂ increase, the probability of Algorithm 3 returning the correct SNPs goes to one. This behaviour holds because the probability that w_dp is in a small neighborhood of w goes to 1 as ε₁ goes to infinity. Since w separates the m_ret highest scoring SNPs from the rest, the probability that d_i ≥ 1 for the m_ret highest scoring SNPs and d_i ≤ 0 for the rest also goes to 1 as ε₁ goes to infinity. On the other hand, note that if d_i ≥ 1 for the m_ret highest scoring SNPs and d_i ≤ 0 for the rest, then as ε₂ goes to infinity the probability of picking the top m_ret SNPs goes to 1. Putting the above together gives us our result.

Algorithm 2 The neighbor method for picking top m_ret SNPs
Require: Data set D, number of SNPs to return m_ret, privacy value ε, score function q, and boundary w.
Ensure: A list of m_ret SNPs that is ε-differentially private.
  for i = 1, ..., m do
    if q(i, D) > w then
      d_i = min({ρ(D, D') : q(i, D') ≤ w, |D'| = |D|})
    else
      d_i = 1 − min({ρ(D, D') : q(i, D') > w, |D'| = |D|})
    end if
  end for
  Let w_i = exp(ε·d_i/2) for all i.
  Choose m_ret SNPs without replacement, where Pr(choose SNP i) ∝ w_i.
  return the chosen SNPs

Algorithm 3 Our modified neighbor method for picking top m_ret SNPs
Require: Data set D, number of SNPs to return m_ret, privacy values ε₁ and ε₂, and score function q.
Ensure: A list of m_ret SNPs that is (ε₁ + ε₂)-differentially private.
  Let w be the mean score of the m_ret-th and (m_ret + 1)-st highest scoring SNPs.
  Let w_dp be an ε₁-differentially private estimate of w.
  return the SNPs chosen by Algorithm 2 with ε = ε₂ and boundary value w_dp.

Theorem 7. For any choice of q, Algorithm 3 is (ε₁ + ε₂)-differentially private.

Proof. To prove that Algorithm 3 is (ε₁ + ε₂)-differentially private it suffices to prove that w_dp is ε₁-differentially private, since the second step of the algorithm is simply the exponential mechanism with privacy parameter ε₂. This fact, however, follows trivially. □

It should be noted that, in practice, we choose ε and then let ε₁ = 0.1ε and ε₂ = 0.9ε. There is no deep motivation for this choice, and it would be interesting to investigate the trade-offs involved between the two parameters. A small sketch of this two-step procedure follows.
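The following Python sketch ties Algorithms 2 and 3 together. The helper names, the Laplace-based threshold estimate, and the `sens` parameter for the threshold's sensitivity are our own illustrative assumptions, not the thesis's exact implementation; `distance_to(i, w)` is assumed to return the signed neighbor distance d_i defined above.

    import numpy as np

    def pick_top_snps(d, m_ret, eps, rng):
        """Algorithm 2's selection step: exponential mechanism over the
        signed threshold distances d_i, sampling without replacement."""
        d = np.asarray(d, dtype=float)
        chosen = []
        for _ in range(m_ret):
            weights = np.exp(eps * (d - d.max()) / 2)  # shift for stability
            weights[chosen] = 0.0                      # no repeats
            probs = weights / weights.sum()
            chosen.append(rng.choice(len(d), p=probs))
        return chosen

    def modified_neighbor_method(scores, distance_to, m_ret, eps, rng, sens=1.0):
        """Algorithm 3: pick the boundary privately, then run Algorithm 2,
        splitting the budget as eps1 = 0.1*eps and eps2 = 0.9*eps."""
        eps1, eps2 = 0.1 * eps, 0.9 * eps
        top = np.sort(scores)[::-1]
        w = (top[m_ret - 1] + top[m_ret]) / 2       # mean of the two boundary scores
        w_dp = w + rng.laplace(0.0, sens / eps1)    # eps1-DP threshold estimate
        d = [distance_to(i, w_dp) for i in range(len(scores))]
        return pick_top_snps(d, m_ret, eps2, rng)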
In previous works two main issues have been raised concerning the application of the neighbor method. The first, which we addressed above, is that the order of the SNPs might differ from the order given by the allelic test statistic. The other main issue is runtime. In the following sections we show that the runtime is much less of an issue than some have argued. We then implement these algorithms, using them to compare the utility of our modified neighbor method to previous methods.

5.6 Fast Neighbor Distance Calculation with Private Genotype Data

5.6.1 Method Description

The major computational bottleneck of the neighbor method for picking high scoring SNPs has been the calculation of neighbor distance. This bottleneck has led some to calculate approximate neighbor distances ([95]) while others have calculated neighbor distances under stronger assumptions, leading to weaker privacy guarantees ([105], aka scenario 2). We are able to overcome this bottleneck using Algorithm 4, a new method for calculating the neighbor distance that involves only a constant number of arithmetic operations per SNP.

To understand our approach, fix a given SNP. Assume we have p = (r₀, r₁, r₂, s₀, s₁, s₂) and a threshold w to which we want to calculate the neighbor distance. Note that the neighbor distance can be expressed as the solution to the following optimization problem:

    minimize over p' ∈ Z⁶:  (1/2)‖p − p'‖₁
    subject to  p'_i ≥ 0 for all i
                r'₀ + r'₁ + r'₂ = R,  s'₀ + s'₁ + s'₂ = S
                x' = 2r'₀ + r'₁,  y' = 2s'₀ + s'₁
                u_w(p)·(Y(x', y') − w) ≤ 0

where u_w(p) denotes the sign of Y(p) − w. By removing the integrality constraints and projecting down onto two dimensions we get the following relaxation:

    minimize  g(x, y) = g₁(x) + g₂(y)
    subject to  0 ≤ x ≤ 2R,  0 ≤ y ≤ 2S
                u_w(p)·(Y(x, y) − w) ≤ 0

where

    g₁(x) = (x − 2r₀ − r₁)/2            if 2(r₀ + r₂) + r₁ ≥ x > 2r₀ + r₁
    g₁(x) = (2r₀ + r₁ − x)/2            if r₁ ≤ x ≤ 2r₀ + r₁
    g₁(x) = r₂ + x − 2(r₀ + r₂) − r₁    if 2R ≥ x > 2(r₀ + r₂) + r₁
    g₁(x) = r₀ + r₁ − x                 otherwise

and

    g₂(y) = (y − 2s₀ − s₁)/2            if 2(s₀ + s₂) + s₁ ≥ y > 2s₀ + s₁
    g₂(y) = (2s₀ + s₁ − y)/2            if s₁ ≤ y ≤ 2s₀ + s₁
    g₂(y) = s₂ + y − 2(s₀ + s₂) − s₁    if 2S ≥ y > 2(s₀ + s₂) + s₁
    g₂(y) = s₀ + s₁ − y                 otherwise.

We say that (x, y) is feasible if it satisfies the constraints of this relaxed problem. Algorithm 4 first solves the relaxed problem by iterating over a small set of possible solutions (each of which can be found in constant time using the quadratic formula and some basic facts about convex optimization) and then rounds to find a solution to the original problem. A proof of correctness as well as a few other details are given in the rest of this section.

Note that our algorithm assumes that w ≥ 2N/(2N − 1). This requirement, however, is not a problem, since in practice it corresponds to a rather large p-value (greater than .05 for N ≥ 3). To accommodate this requirement, the only change we need to make in our neighbor picking algorithm is to round w_dp up to 2N/(2N − 1) if this condition is not met.

It is also worth noting that this algorithm relies on being able to check, for a given δ, whether there exists a feasible x, y ∈ Z with β₁(x) + β₂(y) = δ in constant time, where

    β₁(x) = ⌈g₁(x)⌉ + 1   if r₁ = 0 and x − 2r₀ − r₁ is odd
    β₁(x) = ⌈g₁(x)⌉       otherwise

and

    β₂(y) = ⌈g₂(y)⌉ + 1   if s₁ = 0 and y − 2s₀ − s₁ is odd
    β₂(y) = ⌈g₂(y)⌉       otherwise.

We show how to check this condition below.

Theorem 8. Algorithm 4 is correct and can be made to run in constant time.

Proof. This is proven in the rest of the section. □
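As a sanity check on these piecewise definitions, here is a small Python sketch (our own illustration) of g₁ and of β₁ as defined in the text; g₂ and β₂ are symmetric with the s counts.

    import math

    def g1(x, r0, r1, r2):
        """Minimum number of single-individual genotype changes among the
        cases needed to move x' = 2*r0' + r1' to the target value x."""
        base = 2 * r0 + r1
        if base < x <= 2 * (r0 + r2) + r1:
            return (x - base) / 2                 # r2 -> r0 moves, +2 each
        if r1 <= x <= base:
            return (base - x) / 2                 # r0 -> r2 moves, -2 each
        if x > 2 * (r0 + r2) + r1:
            return r2 + x - 2 * (r0 + r2) - r1    # r2 exhausted, then r1 -> r0
        return r0 + r1 - x                        # r0 exhausted, then r1 -> r2

    def beta1(x, r0, r1, r2):
        """Integer-feasible version of g1, per the text's definition."""
        g = g1(x, r0, r1, r2)
        if r1 == 0 and (x - 2 * r0 - r1) % 2 == 1:
            return math.ceil(g) + 1
        return math.ceil(g)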
5.6.2 Proof Overview

In this subsection and those that follow we prove that Algorithm 4 works. To do this, we fix a particular SNP, which allows us to specify our database by a tuple p = (r₀, r₁, r₂, s₀, s₁, s₂). Let w be the threshold to which we want to calculate the distance.

The first thing to notice is that the neighbor distance from p to a database p' = (r'₀, r'₁, r'₂, s'₀, s'₁, s'₂) is equal to (1/2)‖p − p'‖₁. This allows us to state the neighbor distance problem as the following optimization problem:

(5.1)
    minimize over p' ∈ Z⁶:  (1/2)‖p − p'‖₁
    subject to  p'_i ≥ 0 for all i
                r'₀ + r'₁ + r'₂ = R,  s'₀ + s'₁ + s'₂ = S
                x' = 2r'₀ + r'₁,  y' = 2s'₀ + s'₁
                u_w(p)·(Y(x', y') − w) ≤ 0

where u_w(p) denotes the sign of Y(p) − w.

Algorithm 4 Calculates the neighbor distance for SNPs in constant time
Require: p = (r₀, r₁, r₂, s₀, s₁, s₂) with p_i ≥ 0 for all i; N, R and S defined as usual; and threshold w ≥ 2N/(2N − 1).
  Let g(x, y) = g₁(x) + g₂(y) be defined as in the text.
  Let C denote the curve defined by 2N(xS − yR)² = wRS(x + y)(2N − x − y).
  Find the set P of all points p ∈ [0, 2R] × [0, 2S] on the curve C whose tangent line has slope in {1, 2, 1/2}.
  Using the quadratic formula find Q, the set of all p = (p₀, p₁) ∈ ([0, 2R] × [0, 2S]) ∩ C with either p₀ ∈ {2(r₀ + r₂) + r₁, 2r₀ + r₁, r₁, 0, 2R} or p₁ ∈ {2(s₀ + s₂) + s₁, 2s₀ + s₁, s₁, 0, 2S}.
  Let γ = min over p ∈ P ∪ Q of ⌈g(p)⌉.
  if Y(p) < w then
    return γ
  end if
  for δ ∈ {γ, γ + 1, ..., γ + 5} do
    if there exists a feasible x, y ∈ Z with β₁(x) + β₂(y) = δ then
      return δ
    end if
  end for

We begin by removing the integrality constraints, giving us:

(5.2)
    minimize over p' ∈ R⁶:  (1/2)‖p − p'‖₁
    subject to the same constraints as in Equation 5.1.

Since Y(p') depends only on x = 2r'₀ + r'₁ and y = 2s'₀ + s'₁, we would like to reduce this to a two dimensional optimization problem. To do this reduction, we consider g(x, y) = g₁(x) + g₂(y), with g₁ and g₂ the piecewise linear functions defined in Section 5.6.1. The importance of g is demonstrated by the following theorem:

Theorem 9. Consider (x, y) ∈ [0, 2R] × [0, 2S]. Then

    2g(x, y) = min{‖p − p'‖₁ : p' feasible, x = 2r'₀ + r'₁, y = 2s'₀ + s'₁}.

Informally, g(x, y) is the minimum neighbor distance from D to a database with x = 2r'₀ + r'₁ and y = 2s'₀ + s'₁.

Proof. It suffices to show that 2g₁(x) is the minimum cost ‖(r₀, r₁, r₂) − (r'₀, r'₁, r'₂)‖₁ of reaching a database with x = 2r'₀ + r'₁, and that 2g₂(y) is the analogous quantity for y; we prove the claim for g₁(x), the other case being similar. Consider the case x > 2r₀ + r₁, the other case again being similar. Since changing the s_i does not change 2r'₀ + r'₁, we can assume that the s_i stay fixed.

First note that

    2g₁(x) ≥ min{‖p − p'‖₁ : p' feasible, x = 2r'₀ + r'₁},

since 2g₁(x) is achieved by increasing r'₀ and decreasing r'₂ until either x = 2r'₀ + r'₁ or r'₂ = 0, in which case we then decrease r'₁ and increase r'₀ until x = 2r'₀ + r'₁.

Moreover, this number of steps is optimal. To see this, consider a p' that achieves the above minimum, and note we can write

    (r'₀, r'₁, r'₂) − (r₀, r₁, r₂) = u₁(1, 0, −1) + u₂(1, −1, 0).

Then 2r₀ + r₁ + 2u₁ + u₂ = 2r'₀ + r'₁ = x. Meanwhile

    ‖(r₀, r₁, r₂) − (r'₀, r'₁, r'₂)‖₁ = |u₁| + |u₂| + |u₁ + u₂| = |u₁| + |x − 2r₀ − r₁ − 2u₁| + |x − 2r₀ − r₁ − u₁|.

The fact that p' is nonnegative is equivalent to u₁ ≤ r₂, u₂ ≤ r₁ and u₁ + u₂ ≥ −r₀.

Consider the case that 2u₁ < x − 2r₀ − r₁. This implies u₂ > 0. If u₁ = r₂ this gives ‖(r₀, r₁, r₂) − (r'₀, r'₁, r'₂)‖₁ = 2g₁(x).
If, on the other hand, u₁ < r₂, choose ε > 0 so that u₁ + ε ≤ r₂. Let v₁ = u₁ + ε and v₂ = u₂ − 2ε. Then if

    (r''₀, r''₁, r''₂) = (r₀, r₁, r₂) + v₁(1, 0, −1) + v₂(1, −1, 0)

we see that

    ‖(r₀, r₁, r₂) − (r''₀, r''₁, r''₂)‖₁ < ‖(r₀, r₁, r₂) − (r'₀, r'₁, r'₂)‖₁

while 2r''₀ + r''₁ = x and r''_i ≥ 0 for all i. This contradicts the minimality of p'. Applying similar arguments to the other cases gives us our result. □

It is worth noting for later that this theorem has the following corollary:

Corollary 3. Consider (x, y) ∈ [0, 2R] × [0, 2S] integral. Then

    2β₁(x) + 2β₂(y) = min{‖p − p'‖₁ : p' ∈ Z⁶ feasible, x = 2r'₀ + r'₁, y = 2s'₀ + s'₁}.

This result implies that the optimal solution to our relaxed problem is equal to the optimal solution of

(5.3)
    minimize over x, y:  g(x, y) = g₁(x) + g₂(y)
    subject to  0 ≤ x ≤ 2R,  0 ≤ y ≤ 2S
                u_w(p)·(Y(x, y) − w) ≤ 0

while the solution to our initial integral problem is equal to the solution of

(5.4)
    minimize over x, y ∈ Z:  β₁(x) + β₂(y)
    subject to  0 ≤ x ≤ 2R,  0 ≤ y ≤ 2S
                u_w(p)·(Y(x, y) − w) ≤ 0.

It is important to note that {p : 2N(p₀S − p₁R)² ≤ wRS(p₀ + p₁)(2N − p₀ − p₁)} is a convex set; more specifically, it is the convex hull of an ellipse. To see this, note that if w < 0 it is the empty set, and if w = 0 it is a line. If, on the other hand, w > 0, it is easy to see that 2N(xS − yR)² ≤ wRS(x + y)(2N − x − y) can be rewritten as

    (x, y)Q(x, y)ᵀ + q₁x + q₂y + q₃ ≤ 0

where Q is a positive semidefinite matrix. This implies that the region is a (filled in) ellipse, and thus convex.

In order to solve the relaxed problem we iterate through two sets of possible solutions, one corresponding to extreme points and one corresponding to tangent points (Figure 5-1). Using this solution, we are able to find the exact neighbor distance.

5.6.3 Proof for Significant SNPs

We first prove that Algorithm 4 works as advertised on significant SNPs (that is to say, when Y(p) > w). We will first prove correctness, then prove that it can be made to run in constant time.

Theorem 10. Algorithm 4 returns the neighbor distance for significant SNPs.

Proof. Assume we are looking at a significant SNP. Note that |g(x, y) − β₁(x) − β₂(y)| ≤ 4. Assume that δ is the solution to the optimization problem in Equation 5.3, and that (x', y') is the associated argmin. Then (x', y') must lie on the level set

    D_δ = {(a, b) : 0 ≤ a ≤ 2R, 0 ≤ b ≤ 2S, g(a, b) = δ}.

Figure 5-1: (a) Tangent point solutions; (b) extreme point solutions. Our algorithm for finding the solution δ of the relaxed optimization problem relies on the fact that there are only two possible types of solutions: tangent points and extreme points. The algorithm finds all such points and iterates over them to find the solution to the relaxed problem.

Note, however, that this set is the union of line segments, where each line segment has slope in {1, −1, 2, −2, .5, −.5}. Either (x', y') is an endpoint of one of these segments or it is in the interior of one of them. At the same time it must also be the case that

    2N(x'S − y'R)² = wRS(x' + y')(2N − x' − y').

Let C be the curve defined by this equation. It is important to note that C is a smooth curve, so we can find its derivatives. If (x', y') is in the interior of a line segment l ⊂ D_δ, we see that l must lie tangent to C, so (x', y') must be one of the points of C whose tangent line has slope in {1, −1, 2, −2, .5, −.5}.
Moreover, we know that since 0 < x' < 2R and 0 < y' < 2S the slope of C at (x', y') is positive, so (x', y') must be one of the points of C whose tangent line has slope in {1, .5, 2}, the set we denoted by P in Algorithm 4. Therefore either (x', y') ∈ P or (x', y') is the endpoint of some segment in D_δ. If it is an endpoint, however, it must be that (x', y') ∈ Q. Therefore (x', y') ∈ P ∪ Q, so the quantity γ calculated by Algorithm 4 is equal to ⌈g(x', y')⌉.

We next want to show that there exists an integer pair (x, y) with 0 ≤ x ≤ 2R, 0 ≤ y ≤ 2S, |x − x'| ≤ 1, |y − y'| ≤ 1, and

    2N(xS − yR)² ≤ wRS(x + y)(2N − x − y).

We will prove this in the case that x'S − y'R ≥ 0 and R ≤ S, the other cases being similar. Form (x, y) by rounding x' down and y' up. If xS − yR ≥ 0 it must be that (x, y) is contained in the triangle whose vertices are (x', y'), (0, 0) and (2R, 2S). Note that if p is one of these vertices then

    (2N/RS)(p₀S − p₁R)² ≤ w(p₀ + p₁)(2N − p₀ − p₁),

so it follows by convexity that

    (2N/RS)(xS − yR)² ≤ w(x + y)(2N − x − y).

Therefore consider the case xS − yR < 0. Let E be the set of all p ∈ [0, 2R] × [0, 2S] with

    (2N/RS)(p₀S − p₁R)² ≤ w(p₀ + p₁)(2N − p₀ − p₁).

Note that, since w ≥ 2N/(2N − 1), the points (0, 2) and (2R, 2S − 2) must be in the feasible region E. Therefore, since (0, 2), (0, 0), (2R, 2S − 2) and (2R, 2S) are all in E, by convexity the parallelogram formed by them, denoted W, must also be in E. Let F be the closure of [0, 2R] × [0, 2S] − W. Note that (x', y') ∈ F, since (x', y') is on the boundary of E. Similarly, if (x, y) ∉ E then (x, y) ∈ [0, 2R] × [0, 2S] − W ⊆ F. Note, however, that F has two components, E₀ and E₁, where p₀S − p₁R > 0 if (p₀, p₁) ∈ E₀ and p₀S − p₁R < 0 if p ∈ E₁. This implies (x', y') ∈ E₀ and (x, y) ∈ E₁. Note, however, that the L2 distance between any point in E₁ and any point in E₀ is at least √2, while by definition ‖(x, y) − (x', y')‖₂ < √2. This is a contradiction, so it must be that (x, y) ∈ E, as desired.

Note that

    g(x', y') ≤ g(x, y) ≤ g(x', y') + 2,

and furthermore

    β₁(x) + β₂(y) < g(x, y) + 4 ≤ g(x', y') + 6;

since β₁(x) + β₂(y) is an integer, it is at most γ + 5. Finally, note that if (a, b) ∈ E with (a, b) ∈ Z² then

    g(x', y') ≤ g(a, b) ≤ β₁(a) + β₂(b).

This implies that the neighbor distance lies between γ and γ + 5, so Algorithm 4 returns the neighbor distance. □

Theorem 11. Algorithm 4 requires a constant number of arithmetic operations.

Proof. First, note that P can be generated in a constant number of operations using ideas from basic calculus. We can also generate Q in constant time: first iterate over all p₀ ∈ {2(r₀ + r₂) + r₁, 2r₀ + r₁, r₁, 0, 2R} and, for each of them, use the quadratic formula to find the p₁ with Y(p₀, p₁) = w, then do the same for each p₁ ∈ {2(s₀ + s₂) + s₁, 2s₀ + s₁, s₁, 0, 2S}. Since |P ∪ Q| ≤ 26 and g takes constant time to calculate, it follows that calculating γ takes a constant number of arithmetic operations.

All that remains to prove is that, for a given integer δ, deciding whether there is an integer point (x, y) ∈ [0, 2R] × [0, 2S] with

    (2N/RS)(xS − yR)² ≤ w(x + y)(2N − x − y)  and  β₁(x) + β₂(y) = δ

can be done in a constant number of arithmetic operations. To see this, consider the case when r₁ > 0 and s₁ > 0, the other cases being similar.
Note that we can write

    β₁(x) = (x − 2r₀ − r₁)/2          if 2(r₀ + r₂) + r₁ ≥ x > 2r₀ + r₁ and x − 2r₀ − r₁ is even
    β₁(x) = (x − 2r₀ − r₁ + 1)/2      if 2(r₀ + r₂) + r₁ ≥ x > 2r₀ + r₁ and x − 2r₀ − r₁ is odd
    β₁(x) = (2r₀ + r₁ − x)/2          if r₁ ≤ x ≤ 2r₀ + r₁ and x − 2r₀ − r₁ is even
    β₁(x) = (2r₀ + r₁ − x + 1)/2      if r₁ ≤ x ≤ 2r₀ + r₁ and x − 2r₀ − r₁ is odd
    β₁(x) = r₂ + x − 2(r₀ + r₂) − r₁  if 2R ≥ x > 2(r₀ + r₂) + r₁
    β₁(x) = r₀ + r₁ − x               otherwise

and

    β₂(y) = (y − 2s₀ − s₁)/2          if 2(s₀ + s₂) + s₁ ≥ y > 2s₀ + s₁ and y − 2s₀ − s₁ is even
    β₂(y) = (y − 2s₀ − s₁ + 1)/2      if 2(s₀ + s₂) + s₁ ≥ y > 2s₀ + s₁ and y − 2s₀ − s₁ is odd
    β₂(y) = (2s₀ + s₁ − y)/2          if s₁ ≤ y ≤ 2s₀ + s₁ and y − 2s₀ − s₁ is even
    β₂(y) = (2s₀ + s₁ − y + 1)/2      if s₁ ≤ y ≤ 2s₀ + s₁ and y − 2s₀ − s₁ is odd
    β₂(y) = s₂ + y − 2(s₀ + s₂) − s₁  if 2S ≥ y > 2(s₀ + s₂) + s₁
    β₂(y) = s₀ + s₁ − y               otherwise.

The above tells us we can define c₀, ..., c₇ with c_i ∈ {0, 1}, intervals U₀, ..., U₇ in R, and rationals v₀, ..., v₇, d₀, ..., d₇ ∈ Q so that

    β₁(x) = v_i·x + d_i  iff  x ∈ U_i and x ≡ c_i (mod 2).

Similarly we can define c'₀, ..., c'₇ with c'_j ∈ {0, 1}, intervals U'₀, ..., U'₇ in R, and rationals v'₀, ..., v'₇, d'₀, ..., d'₇ ∈ Q so that

    β₂(y) = v'_j·y + d'_j  iff  y ∈ U'_j and y ≡ c'_j (mod 2).

Thus β₁(x) + β₂(y) = δ if and only if, for some i, j ∈ {0, ..., 7}, we have

    v_i·x + v'_j·y + d_i + d'_j = δ

subject to x ∈ U_i, y ∈ U'_j, x ≡ c_i (mod 2) and y ≡ c'_j (mod 2). For a given i and j we can easily check whether such an integral x and y exist.

More explicitly, let α = (1, −v_i/v'_j) and β = (0, (δ − d_i − d'_j)/v'_j); then αt + β is a parameterization of

    {(x, y) : v_i·x + v'_j·y + d_i + d'_j = δ}.

Since v_i, v'_j ∈ {±1, ±1/2}, either α has integer entries, or it has some entry that is not an integer but is an integer divided by two. Since α₀ = 1, we see that if αt + β is integral then t must be an integer. Moreover, αt + β is integral if and only if α(t + 4) + β is, and if both are integral they are equal to one another mod 2.

We can easily find the interval [γ₁, γ₂] so that αt + β ∈ U_i × U'_j if and only if t ∈ [γ₁, γ₂]. Similarly, using the quadratic formula we can find the interval [f₁, f₂] so that Y(αt + β) ≤ w if and only if t ∈ [f₁, f₂]. If we let [r, s] be the intersection of these two intervals, it follows trivially that, if (x, y) = αt + β, then

    (2N/RS)(xS − yR)² ≤ w(x + y)(2N − x − y)  and  β₁(x) + β₂(y) = δ

if and only if t ∈ [r, s]. It is easy to check in constant time whether there is a t₀ ∈ [r, s] such that

    αt₀ + β ≡ (c_i, c'_j) (mod 2);

this follows from the fact that if t is a solution then so is t + 4, hence such a t₀ exists iff there exists one in {⌈r⌉, ..., ⌈r⌉ + 4} that also satisfies the required conditions.

In conclusion, we can check in constant time whether, given an integral δ, there is an integer (x, y) ∈ [0, 2R] × [0, 2S] with (2N/RS)(xS − yR)² ≤ w(x + y)(2N − x − y) and β₁(x) + β₂(y) = δ, by iterating over all choices of i and j. Therefore the algorithm runs in constant time. □

5.6.4 Proof for Non-Significant SNPs

Note that an almost identical argument to the above can be used for non-significant SNPs if we are willing to do a similar rounding procedure. In order to make the algorithm more efficient, however, we show that a much simpler procedure works. Assume we are given a threshold w and a database p = (r₀, r₁, r₂, s₀, s₁, s₂), where p_i ≥ 0 for all i and Y(p) < w. Then we can calculate ρ(p, {v : Y(v) ≥ w}) in constant time. In order to perform this calculation, consider Equation 5.3. We need the following:

Definition 2. A real valued function f is quasi-convex if, for every a ∈ R, f⁻¹((−∞, a)) is a convex set.

Lemma 1. Y : R² → R is continuous and quasi-convex on [0, 2R] × [0, 2S].

Proof.
We know that Y is continuous from the previous section (assuming we set Y(0, 0) = Y(2R, 2S) = 0). To see that it is quasi-convex, note that Y(x, y) ≤ a if and only if 2N(xS − yR)² ≤ aRS(x + y)(2N − x − y). If a < 0 this defines the empty set, and if a = 0 it defines a line. If, on the other hand, a > 0, it is easy to see that 2N(xS − yR)² ≤ aRS(x + y)(2N − x − y) can be rewritten as

    (x, y)Q(x, y)ᵀ + q₁x + q₂y + q₃ ≤ 0

where Q is positive semidefinite. This implies the region is a (filled in) ellipse, and thus convex. □

Theorem 12. Algorithm 4 is correct on non-significant SNPs and runs in constant time.

Proof. The proof that it runs in constant time is the same as for significant SNPs, so we need only prove correctness. Assume that δ is the solution to Equation 5.3. Let

    C_δ = {(x, y) : g(x, y) ≤ δ}.

Then C_δ is a convex polygon. Furthermore, by definition of δ we see that

    w = max over (x, y) ∈ C_δ of Y(x, y).

Since Y is quasi-convex and C_δ is a convex polygon, these facts imply that there is some extreme point of C_δ, call it (x, y), such that Y(x, y) = w. Since (x, y) is an extreme point of C_δ it must be that x ∈ {2(r₀ + r₂) + r₁, 2r₀ + r₁, r₁, 0, 2R} or y ∈ {2(s₀ + s₂) + s₁, 2s₀ + s₁, s₁, 0, 2S}, so (x, y) ∈ Q by definition. Thus

    δ = min over p ∈ Q of g(p) = min over p ∈ P ∪ Q of g(p).

Assume x ∈ {2(r₀ + r₂) + r₁, 2r₀ + r₁, r₁, 0, 2R}, the other case being similar. Then this assumption implies g₁(x) = β₁(x) by definition. We therefore consider y. By the proof of Theorem 9 we know that there exist s'₀, s'₁, s'₂ ≥ 0 such that s'₀ + s'₁ + s'₂ = S, 2s'₀ + s'₁ = y, and

    2g₂(y) = ‖(s₀, s₁, s₂) − (s'₀, s'₁, s'₂)‖₁.

Moreover, the proof implies that either s'₀ = 0, s'₂ = 0, or s'₁ = s₁. Assume that s'₁ = s₁, the other cases being similar. Then let us define

    u = (u₀, u₁, u₂) = (⌈s'₀⌉, s'₁, ⌊s'₂⌋)  and  v = (v₀, v₁, v₂) = (⌊s'₀⌋, s'₁, ⌈s'₂⌉),

and note that 2u₀ + u₁ ≥ 2s'₀ + s'₁ ≥ 2v₀ + v₁, where y = 2s'₀ + s'₁. So by quasi-convexity either w ≤ Y(x, y) ≤ Y(x, 2u₀ + u₁) or w ≤ Y(x, y) ≤ Y(x, 2v₀ + v₁). Assume w ≤ Y(x, y) ≤ Y(x, 2u₀ + u₁), the other case being identical. Note that by construction

    ‖u − (s'₀, s'₁, s'₂)‖₁ ≤ 1,

so

    (1/2)‖u − (s₀, s₁, s₂)‖₁ ≤ (1/2)(‖u − (s'₀, s'₁, s'₂)‖₁ + ‖(s'₀, s'₁, s'₂) − (s₀, s₁, s₂)‖₁) ≤ 1/2 + g₂(y).

Note, however, that (1/2)‖u − (s₀, s₁, s₂)‖₁ is an integer, so it must be that

    (1/2)‖u − (s₀, s₁, s₂)‖₁ ≤ ⌈g₂(y)⌉.

Note also that (r₀, r₁, r₂, u₀, u₁, u₂) is feasible given the constraints in Equation 5.1, with associated cost bounded above by

    g₁(x) + ⌈g₂(y)⌉ = ⌈g₁(x) + g₂(y)⌉ = ⌈δ⌉,

so the optimal value for Equation 5.1 is bounded above by ⌈δ⌉. At the same time, since δ is the solution to Equation 5.3, the solution to Equation 5.1 must be greater than or equal to δ; since it is an integer, it must be greater than or equal to ⌈δ⌉. Putting this together proves that the optimal solution to Equation 5.1 equals

    ⌈δ⌉ = ⌈min over p ∈ P ∪ Q of g(p)⌉ = γ,

and thus Algorithm 4 is correct for non-significant SNPs. □

5.7 Results: Applying the Neighbor Mechanism to Real-world Data

5.7.1 Measuring Utility

We apply our improved neighbor mechanisms to the rheumatoid arthritis GWAS dataset described earlier in this chapter. In order to do so we use the following standard measure of performance [105]. Let A be the top m_ret scoring SNPs, and let B be the m_ret SNPs returned by some differentially private algorithm. We then measure the utility of the mechanism by considering

    |A ∩ B| / |A|,

that is to say, the percentage of returned SNPs that are correct. The closer this quantity is to one, the better.
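A one-line Python sketch of this utility measure (our own illustration):

    def utility(top_true, top_returned):
        """Fraction of the true top SNPs recovered: |A intersect B| / |A|."""
        a, b = set(top_true), set(top_returned)
        return len(a & b) / len(a)

    print(utility([1, 2, 3, 4], [2, 3, 4, 9]))  # 0.75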
Note that this is only one measure of utility. One might also consider others; after all, the difference between the m_ret-th highest scoring SNP and the next highest scoring SNP may be small, and this measure does not account for that. We use this measure both because of its simplicity and because it has been used in previous works [105].

5.7.2 Comparison to Other Approaches

We first want to see how our method compares utility-wise with the other methods that have been developed, namely the score and Laplacian methods [95]. In order to test these methods we ran our algorithm and both of the other algorithms for various m_ret and ε to compare utility. The results can be seen in Figure 5-2. We see that in all cases our modified neighbor method (red) outperforms the Laplacian (green) and score (blue) based methods by a large margin.

Figure 5-2: We measure the performance of our modified neighbor method for picking top SNPs (red) as well as the score based (blue) and Laplacian based (green) methods for m_ret (the number of SNPs being returned) equal to a. 3, b. 5, c. 10 and d. 15, for varying values of ε. For m_ret = 3, 5 we consider ε between 0 and 5, while in the other cases we consider ε between 0 and 30. We see that in all four graphs our method leads to the best performance by far. These results are averaged over 20 iterations.

5.7.3 Comparison to an Arbitrary Boundary Value

We also compared our modified neighbor method to the traditional neighbor method with a predefined cutoff w. In particular, we consider cutoffs corresponding to Bonferroni corrected p-values of .05 and .01. The results are pictured in Figure 5-3. When m_ret = 15 we see that as ε increases the utility of our method (red) increases towards one, while the utility of the other methods (green for .05, blue for .01) seems to plateau around .85. This result demonstrates the advantages of using adaptively chosen boundaries, even if in some cases (m_ret ∈ {3, 5, 10}) doing so leads to slightly decreased utility. Moreover, by changing the balance between ε₁ and ε₂ it seems plausible that even this slight decrease can be overcome.

5.7.4 Runtime

We test the runtime of our method on real data. In particular, we look at how long it takes to calculate the neighbor distance for all SNPs, since this is the time consuming step. In the past others have had to implement approximate versions of the neighbor distance to make it run in a reasonable time [95]. We implemented a simple hill climbing algorithm similar to Uhler et al. [95]. We then tested both methods for various boundaries based on the number of SNPs we want to return (see Table 5.1). We see that our method is much faster than the approximate method, taking only about 3 seconds total to estimate the neighbor distances for all SNPs, regardless of the choice of m_ret. Moreover, we see that the approximate method gives results that can greatly differ from the exact ones, as demonstrated by the average error per SNP (that is to say, how far off the approximate result is from the actual result). This is in contrast to our method, which has no error at all.

5.8 Output Perturbation

Aside from picking high scoring SNPs, we are also interested in estimating their scores.
In the past these estimates have been achieved by applying the Laplacian mechanism to the output of the allelic test statistic. It turns out, however, that in practice this choice is not optimal. Here we show how to achieve improved output perturbation in scenario 2 by perturbing the square root of the allelic test statistic, and we apply this to the Laplacian based mechanism for picking high scoring SNPs (Algorithm 5). In the next section we go even further, showing how to improve performance using input perturbation.

Figure 5-3: We measure the performance of our modified neighbor method for picking top SNPs (in red) as well as the traditional neighbor method with cutoffs corresponding to Bonferroni corrected p-values of .05 (in green) and .01 (in blue) for m_ret (the number of SNPs being returned) equal to a. 3, b. 5, c. 10 and d. 15, for varying values of ε. For m_ret = 3, 5 we consider ε between 0 and 5, while in the other cases we consider ε between 0 and 30. We see that in the first three cases the traditional method slightly outperforms ours. When m_ret = 15, however, the traditional methods can only achieve a maximum utility of around .85, whereas ours can get utility arbitrarily close to 1. This shows how we are able to overcome one of the major concerns about the neighbor method with only minimal cost. These results are averaged over 20 iterations.

Table 5.1: We demonstrate the runtime of our exact method as well as the approximate method for various boundaries (where the boundary at m_ret is the average of the m_ret-th and (m_ret + 1)-st highest scoring SNPs), as well as the average L1 error per SNP that comes from using the approximate method. We see that our exact method is much faster than the approximate method; in addition, its runtime is fairly steady for all choices of m_ret. We see the approximate method is faster and more accurate for larger m_ret; this makes sense since the average SNP will be closer to the boundary, so there is less loss. These results are averaged over 20 trials.

    m_ret | Our Runtime  | Approx Method Runtime | Approx Method Error
    ------+--------------+-----------------------+--------------------
      3   | 3.0 seconds  | 71.15 seconds         | 22.15
      5   | 3.0 seconds  | 53.4 seconds          | 13.77
     10   | 3.05 seconds | 38.2 seconds          | 7.62
     15   | 3.05 seconds | 31.85 seconds         | 5.76

Algorithm 5 The Laplacian method for picking top m_ret SNPs
Require: Data set D, number of SNPs to return m_ret, privacy value ε, and score function q that takes in a SNP and a dataset and returns a score.
Ensure: A list of m_ret SNPs.
  Let Δq = max over i = 1, ..., m and neighboring D ~ D' of |q(i, D) − q(i, D')|.
  Let s_i = q(i, D) + Lap(0, 2·m_ret·Δq/ε) for all i.
  return the m_ret SNPs with the highest s_i.

5.8.1 Calculating Sensitivity

In order to apply output perturbation to a function q we need to calculate q's sensitivity. We discuss this below for both the allelic test statistic and its square root.
Instead, let Q be the set of all solutions to Let Q' = { [wJ w f"(x) = 0 such that 0 < x < 2R, unioned with{0, 2R, E Q} U { Fwl A \/Y = (2so+si)R w E Q}. Then it follows from basic calculus based that max XEO',yE{x+2,x-2,x-1,x+1},Oy 2Rjx-yj 2 If(x) - f(y)I Note that Q contains at most six elements and can be found easily (it comes down to finding the roots of a degree four polynomial), and as such it is easy to calculate maxEQ',yE{0,...,2R},jx-yl 2 If(x) - f(y)I in constant time. In the remaining cases we estimate the sensitivity by brute force over all possible values of 2r 0 + r1 and 2so + s, under the corresponding scenarios. We did not bother deriving the sensitivity analytically because these methods are shown to be less than optimal in the following sections. 5.8.2 Output Perturbation We first consider how Laplacian based output perturbation affects accuracy. Previous methods have worked by estimating the sensitivity of the allelic test statistic and applying the Laplacian mechanism to it directly [95]. Since we are now able to calculate the sensitivity of all of the above methods, we want to test how well each of them performs. In order to do this comparison we ran each method on our GWAS data and compared the errors, where the error is measured by L1 error. We shall see 120 3535 30 30 2525 25o 020 15 15 10 10 5 5 .2 0.3 0 0.5 0.6 0.7 0.8 0.9 03 1.02 0.4 0.5 0.6 0.7 0.8 0.9 Epsilon Epsilon (a) All SNPs (b) 10 Highest Scoring SNPs 1.0 Figure 5-4: Comparing two forms of output perturbation in scenario 2- the first coming from applying the Laplace mechanism directly to the allelic test statistic (green), the other applying it to the square root of the allelic test statistic then squaring the result (blue), comparing the L 1 error on the y axis with E on the x. We first apply it to 1000 random SNPs (a), then to the top ten highest scoring SNPs (b). We see that in both cases applying the statistic to the square root outperforms the standard approach. that, in scenarios 2, our square root based approach outperforms previous approaches. Scenario 2 We apply the methods to scenario 2. The results are pictured in Figure 5-4. In Figure 5-4(a) we choose 1000 SNPs at random and apply the Laplacian mechanism both directly to the allelic test statistic (the green curve) and to the square root of the allelic test statistic and square the result (the blue curve). We see that using the square root is the preferable method. We also consider how the two approaches compare on high scoring SNPs (since we are most interested in these SNPs). In order to test this we measured the error on the 10 highest scoring SNPs, the result being pictured in Figure 5-4(b). We see that in this case applying the Laplace mechanism to the square root still outperforms the standard approach by a large amount. 121 40 35 35 30 30 25 25 -2 15 15 10 10 5 5 02 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Epsilon Epsilon (a) All SNPs (b) 10 Highest Scoring SNPs 1.0 Figure 5-5: Comparing the output perturbation of the allelic test statistic for scenarios 1 and 2, comparing the L1 error on the y axis with on the x. In scenarios 2 (the blue curve) we add the noise to the square root then square the result, where as for scenario 1 (the green curve) we apply the Laplacian mechanism directly to the test statistic (this choice is motivated by the previous figures). We first apply it to 1000 random SNPs (a), then to the top ten highest scoring SNPs (b). 
Figure 5-5: Comparing the output perturbation of the allelic test statistic in scenarios 1 and 2, with the L1 error on the y axis and ε on the x axis. In scenario 2 (the blue curve) we add the noise to the square root and then square the result, whereas for scenario 1 (the green curve) we apply the Laplacian mechanism directly to the test statistic (these choices are motivated by the previous figures). We first apply the methods to 1000 random SNPs (a), then to the ten highest scoring SNPs (b). We see that scenario 2 requires much less noise than scenario 1.

Across Scenario Comparison

We now want to compare across scenarios. In particular, we wonder whether choosing not to hide genomic information can lead to utility gains. To test this we took the best approach under each of the above regimes and compared the results (e.g., for scenario 2 we used the square root approach, while for scenario 1 we added the noise directly to the statistic). The results are plotted in Figure 5-5. In Figure 5-5(a) we plot the result of applying output perturbation to 1000 random SNPs under scenario 1 (green curve) and scenario 2 (blue curve), while in Figure 5-5(b) we do the same with the ten highest scoring SNPs. In both cases we see that scenario 2 gives us huge gains in utility.

5.8.3 Picking Top SNPs with the Laplacian Mechanism

We can apply the above output perturbation methods to picking high scoring SNPs using Algorithm 5. The first set up is with q equal to the allelic test statistic in scenario 1; the second uses the square root of the allelic test statistic in scenario 2. The motivation for considering these is that, as we saw above, these are the two output perturbation approaches that perform best on GWAS data. The results of this test are pictured in Figure 5-6. We apply both methods to the RA dataset, using m_ret ∈ {3, 5, 10, 15} for Figures 5-6(a), 5-6(b), 5-6(c), and 5-6(d) respectively. In each of the figures the blue curve plots utility against ε for scenario 1 and the green curve plots utility against ε for scenario 2. We see that, in all cases, scenario 2 performs best, greatly outperforming scenario 1, which is the set up previously considered in the literature [95].

Figure 5-6: We measure the performance of the Laplacian method for picking top SNPs in scenarios 1 (in blue) and 2 (in green) with m_ret (the number of SNPs being returned) equal to a. 3, b. 5, c. 10 and d. 15, for varying values of ε. For m_ret = 3, 5 we consider ε between 1 and 10, while in the other cases we consider ε between 10 and 100. We see in all four graphs that scenario 2 leads to the best performance; scenario 1, which is the set up that appeared in previous work, leads to a greater loss of utility. These results are averaged over 100 iterations.

5.9 Input Perturbation

We already saw what happens when, instead of perturbing the allelic test statistic with the Laplace mechanism, we perturb the square root of the allelic test statistic and square the result. In this section we consider using the Laplace mechanism to perturb the input instead of the output and releasing the result. Our results indicate that we get vast improvements using input, as opposed to output, perturbation.

The method works as follows. Let x = 2r₀ + r₁ and y = 2s₀ + s₁. If x' and y' are the corresponding quantities for a neighboring database, then |x − x'| + |y − y'| ≤ 2. Therefore we can let

    x_dp = x + Lap(0, 2/ε)   and   y_dp = y + Lap(0, 2/ε);

then (x_dp, y_dp) is an ε-differentially private estimate of (x, y). We can then estimate Y in a differentially private manner using the equation

    Y_dp = 2N(x_dp·S − y_dp·R)² / (RS(x_dp + y_dp)(2N − x_dp − y_dp)).
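A minimal Python sketch of this input perturbation approach (our own illustration), again reusing allelic_test; the scenario 2 variant perturbs only the case count x, since y is public there.

    import numpy as np

    def dp_score_input(x, y, R, S, N, eps, rng, protect_controls=True):
        """Input perturbation: add Laplace noise to the allele counts and
        plug the noisy counts into the allelic test statistic.
        Scenario 1: both x and y are private (joint sensitivity 2).
        Scenario 2 (protect_controls=False): only x is perturbed."""
        x_dp = x + rng.laplace(0.0, 2.0 / eps)
        y_dp = y + rng.laplace(0.0, 2.0 / eps) if protect_controls else y
        return allelic_test(x_dp, y_dp, R, S, N)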
We compare the error of this estimate with that of the methods obtained above (we do not include the output perturbation method under scenario 1, since the output method under scenario 2 outperforms it). The results are shown in Figure 5-7. We see that input perturbation under scenario 1 actually performs much better than the output perturbation method in scenario 2 (and thus than the output perturbation in scenario 1).

Figure 5-7: Comparing the output perturbation of the allelic test statistic in scenario 2 (blue) to the input perturbation method in scenario 1 (green). We see that in this case, as opposed to previous cases, scenario 1 outperforms scenario 2 despite requiring stronger privacy guarantees. This demonstrates that input perturbation is preferable to output perturbation.

What about input perturbation in scenario 2? There we can use

    Y_dp = 2N(x_dp·S − yR)² / (RS(x_dp + y)(2N − x_dp − y)).

We compare both input perturbation methods in Figure 5-8. In this figure we see that scenario 1 (the green line) performs worse than scenario 2 (blue), both when choosing 1000 random SNPs and when considering only the highest scoring SNPs. Taken together, these results show that input perturbation seems preferable in practice to output perturbation.

Figure 5-8: Comparing the input perturbation of the allelic test statistic for scenarios 1 (green) and 2 (blue), with the L1 error on the y axis and ε on the x axis. We see that scenario 1 requires more noise to be added.

Chapter 6

Correcting for Population Structure in Differentially Private GWAS

6.1 Introduction

The motivation behind genome-wide association studies (GWAS) is not only to find associations between alleles and phenotypes, but to find associations that are biologically meaningful. This task, however, can be difficult, thanks to systematic differences between different human populations. It is often the case that biologically meaningful mutations are inherited jointly with mutations that have no such meaning, leading to false GWAS hits. A classic example of this phenomenon is given by the lactase gene. This gene is responsible for the ability to digest lactose (such as in milk), and is more common in Northern Europeans than in East Asians. Northern Europeans are also, on average, taller than East Asians. Thus a naive statistical method would imply that this gene is related to height, though in practice it is not believed to be.

In order to avoid this so-called population stratification, various methods have been suggested (see Chapter 2 for more details). One increasingly popular approach is to use linear mixed models (LMMs) to perform association studies. It has been shown that LMMs help correct for population structure in practice [31]. In this chapter we extend the idea of privacy preserving GWAS to LMMs.

We focus on preventing a particular type of attack, namely phenotype disclosure, that is to say, disclosure of private phenotype data.
This means we ignore any leaked information revealing whether an individual participated in the study, and we also ignore any leakage of genotype information that might occur. We feel this is justified for numerous reasons. First, in many cases study participation is not private information (for example, it does not hurt to know someone is in the control population), so there is no reason to spend energy trying to hide it. We choose to ignore leaked genotype information both for practical reasons (it makes the analysis approachable, and all known attacks against GWAS test statistics result in disclosure of phenotype or participation) and because genotype information is easy to obtain by other means (by acquiring a physical sample and genotyping it). It is up to the particular user to decide if such limitations are acceptable.

6.2 Previous Work

In the previous chapter we discussed the previous work that has been done on applying differential privacy to GWAS [66, 67, 95, 106, 46]. These works have focused on the Pearson and allelic test statistics. Though these statistics can be useful in many situations, they are not considered to be the correct statistics to use in the presence of population structure (see Chapter 2).

In order to deal with population structure, numerous techniques have been suggested, including genomic control [16], EigenStrat [21], and linear mixed models (LMMs) [31]. In recent years there has been growing interest in using LMMs for this task. This interest has largely been spurred by increasingly efficient algorithms that allow LMMs to be applied to larger and larger data sets [27, 23, 94, 43].

Previous work on differentially private LMMs has focused on estimating either variance components (using somewhat heuristic methods) [4] or regression coefficients [11]. Here we instead focus on using LMMs to determine which SNPs are significant, touching only briefly on the estimation of variance components.

More specifically, we consider the approach taken by EMMA [27]. Here we assume our phenotype y is generated by

y = Zδ + Xβ + ε

where X is a normalized genotype matrix, Z is the covariate matrix, δ is a vector of fixed effects, ε is drawn from N(0, σ_e² I_n), and β is drawn from N(0, (σ_g²/m) I_m), with n the number of individuals and m the number of SNPs. Using either maximum likelihood (ML) or restricted maximum likelihood (REML) we estimate σ_e² and σ_g². In this chapter we ignore covariates (that is, we assume Z is the null matrix), so REML and ML coincide. Let σ̂_e² and σ̂_g² be these estimates.

We want to test the null hypothesis that the ith SNP has no effect. In principle this test should be done by considering the model

y = x_i δ_i + X_{−i} β + ε

where x_i is the ith column of X and X_{−i} is X with the ith column removed. We can then test the null hypothesis that δ_i = 0 by fitting this model and calculating

(x_iᵀ K_i⁻¹ y)² / (x_iᵀ K_i⁻¹ x_i), where K_i = σ̂_e² I_n + (σ̂_g²/m) X_{−i} X_{−i}ᵀ.

This statistic is approximately χ² distributed.

In practice this approach is time consuming. Moreover, it is much more difficult to come up with a privacy preserving version of it. Therefore, instead of re-estimating the ML estimates of σ_g² and σ_e² for each i using y = x_i δ_i + X_{−i} β + ε, we estimate them once for

y = Xβ + ε

and then use the statistic

(x_iᵀ K⁻¹ y)² / (x_iᵀ K⁻¹ x_i), where K = σ̂_e² I_n + (σ̂_g²/m) X Xᵀ.

This is similar to the approach taken by EMMA [27], except ours uses a different statistic. It has been shown to be a reasonable statistic when no one SNP has too large of an effect.
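As a reference point for the rest of the chapter, a minimal numpy sketch of this single-fit statistic might look as follows, assuming X is the n × m normalized genotype matrix and the variance components have already been estimated (the function name and interface are ours, not the thesis implementation).

import numpy as np

def lmm_chi2_stats(X, y, sigma_e2, sigma_g2):
    """Approximate chi^2 statistics (x_i^T K^{-1} y)^2 / (x_i^T K^{-1} x_i)
    under the single-fit LMM, computed for every SNP i at once."""
    n, m = X.shape
    K = sigma_e2 * np.eye(n) + (sigma_g2 / m) * (X @ X.T)
    K_inv_y = np.linalg.solve(K, y)          # K^{-1} y
    K_inv_X = np.linalg.solve(K, X)          # column i is K^{-1} x_i
    num = (X.T @ K_inv_y) ** 2               # (x_i^T K^{-1} y)^2
    den = np.einsum('ni,ni->i', X, K_inv_X)  # x_i^T K^{-1} x_i
    return num / den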
6.3 Our Contributions

In this chapter we present, to our knowledge, the first attempt to use linear mixed models to perform association studies in a privacy preserving manner. More specifically, we focus on estimating the χ² statistic introduced in the previous section.

In order to estimate this we make a few assumptions. We assume that σ_e² and σ_g² have already been estimated, either in a privacy preserving manner or not (we touch briefly on how to do this later in the chapter). We also assume that we have some bound on the values that the phenotype, y, can take. This bound can be obtained either by using known information about the phenotype (its range in the background population, or, for disease status, the fact that it can take on only a small number of values), or by releasing information about the range of the phenotype in the study population.

As mentioned above, we do not attempt to prevent the leakage of genotype information, only phenotype information (this goal can be seen as similar to our releasing genotype information for the control cohort in the previous chapter). How can we justify this assumption? There are three motivations. The first is that genomic information can be collected easily by gathering biological samples from an individual, whereas phenotype data might be more difficult to get a hold of. Second, it seems that known attacks against GWAS statistics result in either knowledge about phenotype [63] or knowledge about participation in a study [60]. In many studies, however, knowledge about participation is not particularly damaging- for example, if the study consists of individuals in an EHR database, then all participation tells us is that the given individual has their genotype on record with that particular hospital. The final reason for not trying to protect genomic data is that it makes the analysis much easier, and seems to enable us to get away with adding less noise than we would have to otherwise. Depending on the application this may or may not be a safe assumption, but it is a first step.

6.4 GWAS and LMM

As mentioned above, we assume that we know bounds on y- that is to say, we know numbers a and b such that a ≤ y_i ≤ b for all i. There are numerous ways to choose a and b. If the trait y is bounded for some reason (such as y being a 0-1 trait like disease status) this is easy. One could also use prior knowledge about the trait to choose a and b so that all y_i are between a and b with high probability (those that are not can be either ignored or rounded to the interval [a, b]). The approach we take below is to set a = min_i(y_i) and b = max_i(y_i). These choices release a little information about our cohort, but in practice the effect on privacy is likely negligible (again this should be considered on a case by case basis; strictly speaking one should use the induced neighborhood mechanism [69], though we ignore that here- it can be accommodated by suitably rescaling ε in the following mechanisms).

With the above setup we would like to compute the χ² statistic for the ith SNP. More formally, if K is as above, we want to calculate

(x_iᵀ K⁻¹ R y)² / (x_iᵀ K⁻¹ x_i)

where R = I_n − (1/n) 1_n 1_nᵀ centers y. In order to calculate this in an ε-differentially private way, we note this statistic equals (p_i y)², where

p_i = x_iᵀ K⁻¹ R / √(x_iᵀ K⁻¹ x_i).

Therefore, in order to give a differentially private estimate of the χ² statistic it suffices to get a differentially private estimate of p_i y and square it.
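Since p_i depends only on the genotypes, which we are not trying to protect, and not on y, it can be precomputed once and treated as fixed. A small numpy sketch of this precomputation, under the same assumptions and naming conventions as the sketch above:

import numpy as np

def projection_vectors(X, sigma_e2, sigma_g2):
    """Return P whose row i is p_i = x_i^T K^{-1} R / sqrt(x_i^T K^{-1} x_i),
    so that (P @ y)[i] ** 2 is the chi^2 statistic for SNP i."""
    n, m = X.shape
    K = sigma_e2 * np.eye(n) + (sigma_g2 / m) * (X @ X.T)
    R = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    K_inv_X = np.linalg.solve(K, X)            # column i is K^{-1} x_i
    # K is symmetric, so x_i^T K^{-1} = (K^{-1} x_i)^T
    denom = np.sqrt(np.einsum('ni,ni->i', X, K_inv_X))
    return (K_inv_X.T @ R) / denom[:, None]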
This seems like a difficult task, since K depends on the participants' genomes, and since we are inverting K things get messy. Note, however, that since we are not trying to hide leakage of genomic information, we can assume that p_i is fixed and it is only y that is changing. Therefore, what we are trying to come up with is a random function X: [a, b]ⁿ → ℝ such that, if y and y' differ in exactly one coordinate, then for all S ⊆ ℝ we have

P(X(y) ∈ S) ≤ exp(ε) P(X(y') ∈ S).

Below we introduce two methods to achieve differentially private estimates of p_i y. We will see that each method works well in a certain domain- that is to say, for different choices of ε and different amounts of population stratification.

6.5 Achieving Differential Privacy, Attempt One: Laplace Mechanism

We begin by applying the Laplace mechanism. If f_i(y) = p_i y, then f_i has sensitivity Δf_i = max_j |p_ij| (b − a), where p_ij is the jth coordinate of p_i. We can then apply the Laplace mechanism to define F_{ε,i}(y) as a random variable drawn from a Laplace distribution with mean p_i y and scale Δf_i / ε.

Theorem 13. F_{ε,i} is ε-differentially private.

6.6 Achieving Differential Privacy, Attempt Two: Exponential Mechanism

We also use the exponential mechanism [78]. To this end, we consider a function q_i: [a, b]ⁿ × ℝ → ℤ⁺ ∪ {∞} which will serve as a loss function. Informally, q_i(y, c) is the number of coordinates of y that need to be changed to reach a y' with p_i y' = c (this is similar to the method proposed in [67]). This intuition can be formalized as

q_i(y, c) = min over y' ∈ [a,b]ⁿ with p_i y' = c of ‖y − y'‖₀.

Note that q_i(y, c) = ∞ if there is no such y', and that q_i has sensitivity Δq = 1. Given ε, we can then define a random function G_{ε,i} so that G_{ε,i}(y) has a density P(G_{ε,i}(y) = c) proportional to exp(−(ε/2) q_i(y, c)).

Theorem 14. G_{ε,i}(y) is ε-differentially private.

Proof. This follows directly from McSherry et al. [78].

Having defined G_{ε,i}(y), we want to be able to sample from it. This task can be achieved with Algorithm 6 by setting w = ε/2.

Algorithm 6 Sampling from a distribution with density proportional to exp(−w q_i(y, c))
Require: y, w, p_i, a, b
Ensure: A sample from the density proportional to exp(−w q_i(y, c))
  Let û_j = max(p_ij(b − y_j), p_ij(a − y_j)) for all j.
  Let l̂_j = min(p_ij(b − y_j), p_ij(a − y_j)) for all j.
  Let i_1, …, i_n be a permutation of 1, …, n such that û_{i_1} ≥ û_{i_2} ≥ ⋯ ≥ û_{i_n}.
  Let j_1, …, j_n be a permutation of 1, …, n such that l̂_{j_1} ≤ l̂_{j_2} ≤ ⋯ ≤ l̂_{j_n}.
  Let u_k = û_{i_k} and l_k = l̂_{j_k} for all k.
  Let U_k = p_i y + Σ_{t=1}^k u_t and L_k = p_i y + Σ_{t=1}^k l_t for k = 1, …, n, with U_0 = L_0 = p_i y.
  Choose k ∈ {1, …, n} with probability proportional to exp(−wk)(U_k − U_{k−1} + L_{k−1} − L_k).
  Choose x uniformly at random from [L_k, L_{k−1}) ∪ (U_{k−1}, U_k].
  Return x.

Theorem 15. If we let w = ε/2, then Algorithm 6 returns a sample from G_{ε,i}(y).

Proof. It suffices to show that, for any w > 0, Algorithm 6 returns a sample from a distribution W with density proportional to exp(−w q_i(y, c)). Let U_k, L_k, l_k and u_k be as in Algorithm 6. Assume that y and y' differ in at most k coordinates. Then

p_i y' − p_i y = Σ_j p_ij (y'_j − y_j) ≤ u_1 + ⋯ + u_k,

so p_i y' ≤ U_k, and similarly p_i y' ≥ L_k. Thus if q_i(y, c) ≤ k then L_k ≤ c ≤ U_k. It is easy to see, however, that if L_k ≤ c ≤ U_k then q_i(y, c) ≤ k, so q_i(y, c) = k if and only if c ∈ [L_k, L_{k−1}) ∪ (U_{k−1}, U_k]. Therefore the probability that q_i(y, W) = k is proportional to exp(−wk)(L_{k−1} − L_k + U_k − U_{k−1}). Moreover, the density of W at c conditional on q_i(y, c) = k must, by definition, be uniform on {c : q_i(y, c) = k} and 0 elsewhere. Therefore we can sample from W by first picking k ∈ {1, …, n} with probability proportional to exp(−wk)(L_{k−1} − L_k + U_k − U_{k−1}), then choosing c uniformly at random from [L_k, L_{k−1}) ∪ (U_{k−1}, U_k]. This is exactly what Algorithm 6 does.
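A direct transcription of Algorithm 6 into numpy may help make the shell structure in the proof concrete. This is a sketch under the conventions above (p plays the role of the fixed vector p_i, and w = ε/2 yields ε-differential privacy); for very large w·n the weights exp(−wk) can underflow, so a more careful implementation would work in log space.

import numpy as np

def sample_algorithm_6(y, w, p, a, b, rng=None):
    """Sample from the density proportional to exp(-w * q(y, c)), where
    q(y, c) counts how many coordinates of y (each kept in [a, b]) must
    change for p . y to equal c."""
    rng = np.random.default_rng() if rng is None else rng
    center = p @ y
    up = np.maximum(p * (b - y), p * (a - y))    # u-hat_j: largest upward move
    lo = np.minimum(p * (b - y), p * (a - y))    # l-hat_j: largest downward move
    u = np.sort(up)[::-1]                        # u_k, in decreasing order
    l = np.sort(lo)                              # l_k, in increasing order
    U = center + np.cumsum(u)                    # U_k
    L = center + np.cumsum(l)                    # L_k
    U_prev = np.concatenate(([center], U[:-1]))  # U_{k-1}
    L_prev = np.concatenate(([center], L[:-1]))  # L_{k-1}
    n = len(y)
    widths = (U - U_prev) + (L_prev - L)         # length of the k-th shell
    weights = np.exp(-w * np.arange(1, n + 1)) * widths
    k = rng.choice(n, p=weights / weights.sum())
    # uniform draw over [L_k, L_{k-1}) union (U_{k-1}, U_k]
    left = L_prev[k] - L[k]
    x = rng.uniform(0.0, widths[k])
    return L[k] + x if x < left else U_prev[k] + (x - left)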
Note that the above result is based on the general analysis of the exponential mechanism [78]. By giving an analysis tailored to our situation, however, we are able to improve upon this algorithm, producing Algorithm 7. To do that we need the following theorem.

Algorithm 7 An ε-differentially private estimate of p_i y
Require: y, ε, p_i, a, b, k
  Let w_x = (ε/2)(1 + x/k) for x = 0, …, k.
  Let v_1, …, v_n be a permutation of |p_{i1}|, …, |p_{in}| in decreasing order.
  Let d_x = exp(w_x) + (exp(−w_x) − 1) v_1 / (Σ_{j=1}^n v_j exp(−w_x j)) for x = 0, …, k.
  Let e_x = w_x + log(d_x) for x = 0, …, k.
  Let w = max{w_x : e_x ≤ ε}.
  Run Algorithm 6 with y, w, p_i, a, b as above and return the answer.

Theorem 16. Let W be a random function so that W(y) has density proportional to exp(−w q_i(y, c)). If we assume that |p_ij| decreases as j increases, and let

d_w = exp(w) + (exp(−w) − 1) |p_{i1}| / (Σ_{j=1}^n |p_{ij}| exp(−wj)),

then W is (w + log(d_w))-differentially private. More than that, this is tight in the sense that, if ε' < w + log(d_w), then W is not ε'-differentially private.

Proof. Let Z(y) be the normalization constant, so that P(W = c) = exp(−w q_i(y, c))/Z(y) is the density of W at c. By the same argument used to justify the exponential mechanism [78], the theorem follows if we can show that Z(y')/Z(y) ≤ d_w for all neighboring y and y'. To simplify notation let v = p_i and q = q_i, with |v_1| ≥ |v_2| ≥ ⋯ ≥ |v_n|.

Let Z_j = |{c : q(y, c) ≤ j}|. We first want to show that

Z_j ≥ Σ_{k=1}^j |v_k| (b − a).

To see this, let U_k, L_k, u_k and l_k be as in Algorithm 6. By the proof of Theorem 15,

Z_j = Σ_{k=1}^j (u_k − l_k).

By the definition of u_k and l_k this is at least

Σ_{k=1}^j [max(v_k(b − y_k), v_k(a − y_k)) − min(v_k(b − y_k), v_k(a − y_k))] = Σ_{k=1}^j |v_k|(b − a),

which is just what we wanted. Note, however, that since

Z(y) = Σ_{j=1}^n exp(−wj)[Z_j − Z_{j−1}] = exp(−wn) Z_n + Σ_{j=1}^{n−1} (exp(−wj) − exp(−w(j+1))) Z_j,

we can apply the above bound and some basic algebra to show that

Z(y) ≥ Σ_{j=1}^n |v_j| (b − a) exp(−wj).

Assume now that y and y' differ in the kth coordinate. Choose r_1, r_2, r'_1, r'_2 such that i_{r_1} = j_{r_2} = i'_{r'_1} = j'_{r'_2} = k, where u', l', i' and j' are defined for y' as u, l, i and j are for y. Note that either r_1 > r'_1 and r'_2 > r_2, or r_1 < r'_1 and r'_2 < r_2. We assume without loss of generality that r_1 > r'_1 and r'_2 > r_2. We can then write

Z(y) = Σ_{j=1}^n (u_j − l_j) exp(−wj).

To simplify notation let

A_1 = Σ_{j=1}^{r'_1 − 1} u_j exp(−wj) + Σ_{j=r_1+1}^{n} u_j exp(−wj) − Σ_{j=1}^{r_2 − 1} l_j exp(−wj) − Σ_{j=r'_2+1}^{n} l_j exp(−wj),

A_2 = Σ_{j=r'_1}^{r_1 − 1} u_j exp(−wj),

A_3 = − Σ_{j=r_2+1}^{r'_2} l_j exp(−wj).

We see that

Z(y) = A_1 + A_2 + A_3 + u_{r_1} exp(−w r_1) − l_{r_2} exp(−w r_2).

Note that i_x = i'_x if x > r_1 or x < r'_1, while j_x = j'_x if x > r'_2 or x < r_2. Similarly û_x = û'_x and l̂_x = l̂'_x for x ≠ k (since y_x = y'_x). Putting this together we see that

Z(y') = A_1 + exp(−w) A_2 + exp(w) A_3 + u'_{r'_1} exp(−w r'_1) − l'_{r'_2} exp(−w r'_2)
      ≤ A_1 + A_2 + exp(w) A_3 + u'_{r'_1} exp(−w r'_1) − l'_{r'_2} exp(−w r'_2)
      = Z(y) − u_{r_1} exp(−w r_1) + l_{r_2} exp(−w r_2) + (exp(w) − 1) A_3 + u'_{r'_1} exp(−w r'_1) − l'_{r'_2} exp(−w r'_2).

Since we are bounding Z(y'), we want to make this last expression as big as possible. Note, however, that

A_3 ≤ Z(y) − (u_1 − l_1) exp(−w) ≤ Z(y) − |v_1|(b − a) exp(−w),

where the second inequality follows from the fact that u_1 − l_1 ≥ |v_1|(b − a) (proved above). Plugging this into the above inequality, we get

Z(y') ≤ Z(y) − u_{r_1} exp(−w r_1) + l_{r_2} exp(−w r_2) + (exp(w) − 1)(Z(y) − |v_1|(b − a) exp(−w)) + u'_{r'_1} exp(−w r'_1) − l'_{r'_2} exp(−w r'_2).

Note that

u'_{r'_1} exp(−w r'_1) − l'_{r'_2} exp(−w r'_2) ≤ (u'_{r'_1} − l'_{r'_2}) exp(−w) = |v_k|(b − a) exp(−w) ≤ |v_1|(b − a) exp(−w),

while −u_{r_1} exp(−w r_1) + l_{r_2} exp(−w r_2) ≤ 0. Plugging this in and doing some algebra, we see that

Z(y') ≤ Z(y) exp(w) + |v_1|(b − a)(exp(−w) − 1).

Dividing through by Z(y), and using the fact proved above that Z(y) ≥ Σ_{j=1}^n |v_j|(b − a) exp(−wj), gives

Z(y')/Z(y) ≤ exp(w) + (exp(−w) − 1)|v_1|(b − a) / (Σ_{j=1}^n |v_j|(b − a) exp(−wj)) = exp(w) + (exp(−w) − 1)|p_{i1}| / (Σ_j |p_{ij}| exp(−wj)) = d_w.

To see that this bound is tight, consider y defined so that y_j = a if p_ij < 0 and y_j = b otherwise, and let y' be defined so that y'_j = y_j for j > 1, with y'_1 = a if y_1 = b and y'_1 = b otherwise.
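The calibration step in Algorithm 7 is inexpensive; a sketch is given below, assuming the tailored bound of Theorem 16 and a linear scan over the grid. The grid size and the underflow guard are our choices, and, as noted after Corollary 4 below, a binary search can replace the scan.

import numpy as np

def calibrate_w(p, eps, k=128):
    """Find the largest grid value w in [eps/2, eps] whose tailored privacy
    cost w + log(d_w) from Theorem 16 stays below eps (Algorithm 7)."""
    v = np.sort(np.abs(p))[::-1]              # |p_i1| >= |p_i2| >= ...
    j = np.arange(1, len(v) + 1)

    def cost(w):
        den = np.sum(v * np.exp(-w * j))
        if den <= 0.0:                        # numerical underflow guard
            return np.inf
        d = np.exp(w) + (np.exp(-w) - 1.0) * v[0] / den
        return w + np.log(d)

    grid = (eps / 2.0) * (1.0 + np.arange(k + 1) / k)
    feasible = [w for w in grid if cost(w) <= eps]
    return max(feasible)                      # w = eps/2 is always feasible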
Corollary 4. Algorithm 7 is ε-differentially private.

Note that Algorithm 7 can be sped up by doing a binary search for w instead of a linear one- our Python implementation takes advantage of this fact.

6.7 Results: Testing Our Method

We test our method on real genotype data with simulated phenotype data. The genotype data consists of all the genotypes in HapMap. We generate the phenotype by choosing σ_e² and σ_g², calculating the normalized genotype matrix X, and generating β from N(0, (σ_g²/m) I_m) and ε from N(0, σ_e² I_n), where m is the number of SNPs (in this case we use m = 10,000). Then our phenotype is

y = Xβ + ε.

We then apply our method to the resulting phenotype, and see how well both our method and the standard Laplace mechanism do at approximating the χ² statistic on 1000 randomly sampled SNPs. We use (a, b) = (min(y), max(y)) for simplicity. We first consider what happens when σ_e² = σ_g² = .5. We see that, for 1.0 ≥ ε ≥ .1, our method performs better than the Laplace approach (Figure 6-1), both on random SNPs and causative SNPs.

Figure 6-1: Comparing the output perturbation of the Laplacian based method (green) with our neighbor based method (blue) with σ_e² = σ_g² = .5, on both (a) 1000 random SNPs and (b) the causative SNPs. We see that our method performs better in both cases for the choices of ε considered.

6.8 Picking Top SNPs

6.8.1 The Approach

Note that we can use the same methods as in the previous chapter to pick the top SNPs, applied to the score function q_i(y) = |p_i y|. In particular, note that we can use the method introduced in Section 6.6 to calculate the distance needed for the neighbor mechanism.

We can, however, improve upon the Laplacian and score methods by using a batch mechanism. These improved approaches are described in Algorithms 8 and 9. Note that Δq is easy to calculate: for each coordinate j = 1, …, n, pick the m_ret SNPs with the largest values of |p_ij|, sum those values to get Δq_j, and return (b − a) times the maximum over all j.

Algorithm 8 The Laplacian method for picking the top m_ret SNPs with linear mixed models
Require: Data set y, X; number of SNPs to return m_ret; privacy value ε; parameters σ_e² and σ_g².
Ensure: A list of m_ret SNPs
  Calculate the p_i, i = 1, …, m.
  Let Δq = Δq_{m_ret} = max over j = 1, …, n of max over D ⊆ {1, …, m} with |D| = m_ret of Σ_{i∈D} |p_ij| (b − a).
  Let s_i = |p_i y| + Lap(0, 2Δq/ε) for all i.
  Return the m_ret SNPs with the highest s_i.
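A compact rendering of Algorithm 8, assuming the rows of P are the precomputed vectors p_i from earlier in the chapter (interface ours):

import numpy as np

def laplacian_top_snps(P, y, m_ret, eps, a, b, rng=None):
    """Batch Laplacian mechanism (Algorithm 8): return m_ret SNP indices."""
    rng = np.random.default_rng() if rng is None else rng
    # Delta_q: worst coordinate j, summing the m_ret largest |p_ij| over SNPs
    abs_desc = -np.sort(-np.abs(P), axis=0)          # each column in decreasing order
    delta_q = (b - a) * abs_desc[:m_ret].sum(axis=0).max()
    scores = np.abs(P @ y) + rng.laplace(0.0, 2 * delta_q / eps, size=P.shape[0])
    return np.argsort(scores)[::-1][:m_ret]          # SNPs with highest noisy scores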
Algorithm 9 The score method for picking the top m_ret SNPs with linear mixed models
Require: Data set y, X; number of SNPs to return m_ret; privacy value ε; parameters σ_e² and σ_g².
Ensure: A list of m_ret SNPs
  Calculate the p_i, i = 1, …, m.
  Let Δq = Δq_{m_ret} = max over j = 1, …, n of max over D ⊆ {1, …, m} with |D| = m_ret of Σ_{i∈D} |p_ij| (b − a).
  Let s_i = exp(ε |p_i y| / (2Δq)).
  Pick m_ret SNPs without replacement, where the probability of picking SNP i is proportional to s_i.
  Return the m_ret SNPs chosen above.

Theorem 17. Algorithms 8 and 9 are ε-differentially private.

Proof. The proofs are almost the same as those used by Uhler et al. [95].

6.8.2 Application to Data

We use the same data as in Section 6.7, with σ_e² = σ_g² = .5. We compare all three methods for picking the top m_ret SNPs for m_ret ∈ {3, 5, 10, 15} (Figure 6-2a-d). We see that in all four cases the score method performs best, followed by the neighbor mechanism, followed by the Laplacian based method. More than that, the dominance of the score method grows with m_ret- we believe this is due to the batch approach we take to the score mechanism, which allows us to add less noise than otherwise required. Strangely enough, this batch approach does not have as large an effect on the Laplacian method.

Figure 6-2: We measure the performance of the three methods for picking top SNPs, using score (blue), neighbor (red) and Laplacian (green) based methods, with m_ret (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15, for varying values of ε between 10 and 100. We see that in all four graphs the score method leads to the best performance, followed by the neighbor mechanism. These results are averaged over 20 iterations.

6.9 Estimating σ_e² and σ_g² in a differentially private manner

In practice, it seems unlikely that releasing σ_e² and σ_g² will lead to any privacy concerns, so it seems reasonable to release them in most cases. In some cases, however, we would like to release differentially private versions of them. This is of particular interest since it has been shown that the statistical power of LMMs is increased when, instead of releasing a single estimate of σ_e² and σ_g², we re-estimate them for each SNP (or each chromosome), with the corresponding SNP (or chromosome) removed.

Again we assume that we know some a and b so that each coordinate of y is between a and b. We want to find the σ = (σ_1², σ_2²) that minimizes

q_σ(y) = yᵀ K_σ⁻¹ y + log(det(K_σ)), where K_σ = σ_1² I_n + (σ_2²/m) X Xᵀ.

Note that this can also be modified to include covariates, but we do not do so here. We then apply a standard approach known as sample-and-aggregate, given in Algorithm 10. Details about this sample-and-aggregate approach can be found in previous work on differential privacy [81, 4].

Algorithm 10 An ε-differentially private estimate of σ
Require: y, ε, a, b, k
  Partition {1, …, n} into k roughly equally sized pieces S_1, …, S_k.
  Let v be an (ε/2)-differentially private estimate of var(y) (this can easily be done with the Laplace mechanism; if the result is negative, set it to 0).
  Let σ_i be the σ ∈ [0, v]² that minimizes q_i, where q_i(σ, y) = y_iᵀ K_{i,σ}⁻¹ y_i + log(det(K_{i,σ})), y_i is y restricted to the individuals in S_i, and similarly for K_{i,σ}.
  Return (1/k) Σ_{i=1}^k σ_i + (Lap(0, 4v/(kε)), Lap(0, 4v/(kε))).
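The sketch below puts Algorithm 10 together end to end. The even split of ε between estimating v and perturbing the aggregate, the sensitivity bound (b − a)²/n for the empirical variance, and the final noise scale 4v/(kε) are our calibration assumptions rather than statements taken verbatim from the analysis above.

import numpy as np
from scipy.optimize import minimize

def dp_variance_components(X, y, eps, a, b, k, rng=None):
    """Sample-and-aggregate sketch of Algorithm 10 for (sigma_e^2, sigma_g^2)."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = X.shape
    # eps/2 budget: private upper bound v on the variance components; changing
    # one y_j within [a, b] moves the empirical variance by at most (b-a)^2/n
    v = np.var(y) + rng.laplace(0.0, (b - a) ** 2 / (n * (eps / 2)))
    v = max(v, 1e-6)                     # keep the box [0, v]^2 non-degenerate

    def neg_loglik(sig, Xs, ys):
        # q_sigma(y) = y^T K^{-1} y + log det K, K = s1^2 I + (s2^2/m) X X^T
        K = sig[0] * np.eye(len(ys)) + (sig[1] / m) * (Xs @ Xs.T)
        _, logdet = np.linalg.slogdet(K)
        return ys @ np.linalg.solve(K, ys) + logdet

    fits = []
    for S in np.array_split(rng.permutation(n), k):   # the pieces S_1, ..., S_k
        res = minimize(neg_loglik, x0=[v / 2, v / 2], args=(X[S], y[S]),
                       bounds=[(1e-9, v), (1e-9, v)])
        fits.append(res.x)
    sigma_bar = np.mean(fits, axis=0)
    # eps/2 budget: one individual sits in a single piece, so each coordinate
    # of the average moves by at most v/k under a neighboring database
    return sigma_bar + rng.laplace(0.0, 4 * v / (k * eps), size=2)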
6.10 Conclusion and Future Work

In this chapter we propose the first differentially private mechanism for releasing the results of statistical tests using LMMs in GWAS. Though our method effectively deals with the case of releasing a small number of SNPs, it runs into the same trouble that occurs so often in differential privacy- namely, that the noise soon overtakes the utility as the number of queries grows. Hopefully future work can help improve this trade-off.

There are numerous directions in which one could try to extend this work. For example, our method for estimating the variance components is based on a general mechanism, and it might be hoped that more specialized approaches could improve on it. In particular, it might be of interest to look at more robust estimators of σ_e² and σ_g² (as opposed to the REML) and try to use them to estimate the variance components. This would allow us to move away from the approximate, EMMA-like framework employed here to a framework more similar to the state of the art in LMM testing.

Chapter 7

Conclusion

In this thesis we have explored various methods for preserving patient privacy in biomedical research. We have focused on two overarching approaches. One approach involves modeling possible adversaries to measure privacy risks. The other focuses on preserving privacy in the absence of such a model.

Under the first approach we introduced a measure, known as PrivMAF, that allows us to measure the probability of re-identification after publishing the MAFs from a study. This probability was calculated using realistic models of what knowledge the adversary has and how the patient data was generated.

The second approach was realized using the idea of differential privacy (see earlier chapters for a definition). In particular, we showed how to improve the utility of differentially private medical count queries, providing a possible means for better designing studies while still preserving patient privacy. We also applied the ideas of differential privacy to GWAS. In particular, we showed how to improve both the runtime and accuracy of previous methods for differentially private GWAS. Moreover, we extended these methods to new, more powerful GWAS statistics.

It is not yet clear which approach is the most appropriate for ensuring biomedical privacy. On the one hand, model free approaches (such as differential privacy) don't rely on assumptions about what the adversary knows, and only make limited assumptions about the data. This gives us stronger privacy guarantees that don't break down when we overlook new sources of information an attacker might use. At the same time, this increased privacy leads to a loss of accuracy that might be avoided by model based approaches- a loss that is especially pronounced in high dimensional data, such as the kind we find in genomics. In the end, new methods are needed to help overcome the limitations of both approaches. This thesis is one small step in that direction.

Bibliography

[1] http://www.healthit.gov/policy-researchers-implementers/meaningful-useregulations.
[2] http://www.hhs.gov/ocr/privacy/.

[3] http://www.wisn.com/nurses-fired-over-cell-phone-photos-of-patient/8076340.

[4] J Abowd, M Schneider, and L Vilhuber. Differential privacy applications to bayesian and linear mixed model estimation. Journal of Privacy and Confidentiality, 5(1):73-105, 2013.

[5] M Atallah, F Kerschbaum, and W Du. Secure and private sequence comparisons. ACM Workshop Privacy in Electron Soc, pages 39-44, 2003.

[6] E Ayday, J Raisaro, P McLaren, J Fellay, and J Hubaux. Privacy-preserving computation of disease risk by using genomic, clinical, and environmental data. USENIX, 2013.

[7] R Bell, P Franks, P Duberstein, R Epstein, M Feldman, E Fernandez y Garcia, and R Kravi. Suffering in silence: reasons for not disclosing depression in primary care. Ann Fam Med, (9):439-446, 2011.

[8] D Boneh, A Sahai, and B Waters. Functional encryption: Definitions and challenges. Proceedings of Theory of Cryptography Conference (TCC), 2011.

[9] R Braun, W Rowe, C Schaefer, J Zhan, and K Buetow. Needles in the haystack: identifying individuals present in pooled genomic data. PLoS Genet., (10), 2009.

[10] S Brenner. Be prepared for the big genome leak. Nature, 498:139, 2013.

[11] K Chaudhuri, C Monteleoni, and A Sarwate. Differentially private empirical risk minimization. The Journal of Machine Learning Research, 12:1069-1109, 2011.

[12] The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature, 447:661-683, 2007.

[13] D Craig, R Goor, Z Wang, J Paschall, J Ostell, M Feolo, S Sherry, and T Manolio. Assessing and mitigating risk when sharing aggregate genetic variant data. Nat Rev Genet, 12(10):730-736, 2011.

[14] G Danezis and E De Cristofaro. Simpler protocols for privacy-preserving disease susceptibility testing. GenoPri, 2014.

[15] F Dankar and K El Emam. Practicing differential privacy in health care: A review. Transactions on Data Privacy, 5:35-67, 2014.

[16] B Devlin and K Roeder. Genomic control for association studies. Biometrics, 55(4):997-1004, 1999.

[17] C Dwork and R Pottenger. Towards practicing privacy. J Am Med Inform Assoc, 20(1):102-108, 2013.

[18] K El Emam, E Jonker, L Arbuckle, and B Malin. A systematic review of re-identification attacks on health data. PLoS ONE, 6(12), 2011.

[19] Y Erlich and A Narayanan. Routes for breaching and protecting genetic privacy. Nature Reviews Genetics, 15:409-421, 2014.

[20] A Ghosh et al. Universally utility-maximizing privacy mechanisms. SIAM J Comput, 41(6):1673-1693, 2012.

[21] A Price et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet, 38:904-909, 2006.

[22] B Anandan et al. t-plausibility: Generalizing words to desensitize text. Transactions on Data Privacy, 5(3):505-534, 2012.

[23] C Lippert et al. Fast linear mixed models for genome-wide association studies. Nature Methods, 8:833-835, 2011.

[24] F Doshi-Velez et al. Comorbidity clusters in autism spectrum disorder: An electronic health records time-series analysis. PEDIATRICS, 133:e54-e63, 2014.

[25] F Hormozdiari et al. Privacy preserving protocol for detecting genetic relatives using rare variants. Bioinformatics, 30(12):204-211, 2014.

[26] G Loukides et al. Anonymization of electronic medical records for validating genome-wide association studies. PNAS, (17):7898-7903, 2010.

[27] H Kang et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet., 42:348-54, 2010.
[28] H Lowe et al. Stride - an integrated standards-based translational research informatics platform. AMIA Annu Symp Proc., pages 391-395, 2009.

[29] J Feigenbaum et al. Towards a formal model of accountability. NSPW, 2011.

[30] J Yang et al. Common snps explain a large proportion of the heritability for human height. Nat. Genet., 42:565-569, 2010.

[31] J Yang et al. Advantages and pitfalls in the application of mixed-model association methods. Nat Genet, 46(2):100-6, 2014.

[32] J Zhang et al. Privgene: Differentially private model fitting using genetic algorithms. SIGMOD, 2013.

[33] K El Emam et al. A secure distributed logistic regression protocol for the detection of rare adverse drug events. JAMIA, 7(7), 2012.

[34] Khokhar et al. Quantifying the costs and benefits of privacy-preserving health data publishing. JBI, 50:107-121, 2014.

[35] L Bierut et al. ADH1B is associated with alcohol dependence and alcohol consumption in populations of european and african ancestry. Mol Psychiatry, 17(4):445-450, 2012.

[36] L Kamm et al. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics, 29(7), 2013.

[37] M Fredrikson et al. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. USENIX, 2014.

[38] M Humbert et al. Reconciling utility with privacy in genomics. WPES, pages 11-20, 2014.

[39] M Mailman et al. The ncbi dbgap database of genotypes and phenotypes. Nature Genet., pages 1181-1186, 2007.

[40] M Wolfson et al. Datashield: resolving a conflict in contemporary bioscience - performing a pooled analysis of individual-level data without sharing the data. Int J Epidemiol., 39(5):1372-1382, 2010.

[41] Meystre et al. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol., 10(70), 2010.

[42] P Baldi et al. Countering gattaca: efficient and secure testing of fully-sequenced human genomes. Proc. 18th ACM Conf. Comput. Commun. Security, pages 691-702, 2011.

[43] P Loh et al. Efficient bayesian mixed model analysis increases association power in large cohorts. Nat. Genet., pages 284-290, 2015.

[44] P Mohan et al. Gupt: privacy preserving data analysis made easy. SIGMOD, pages 349-360, 2012.

[45] R Bhaskar et al. Discovering frequent patterns in sensitive data. ACM SIGKDD, 2010.

[46] R Chen et al. A private dna motif finding algorithm. JBI, 50:122-132, 2014.

[47] R Plenge et al. New England Journal of Medicine, pages 1199-1209, 2007.

[48] S Lee et al. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet., 88:294-305, 2011.

[49] S Wieland et al. Revealing the spatial distribution of a disease while preserving privacy. PNAS, 105(46):17608-17613, 2008.

[50] W Xie et al. Securema: Protecting participant privacy in genetic association meta-analysis. Bioinformatics, 30(23):3334-3341, 2014.

[51] Y Chen et al. Auditing medical record accesses via healthcare interaction networks. Proceedings of the AMIA Symposium, pages 93-102, 2012.

[52] Y Erlich et al. Redefining genomic privacy: trust and empowerment. PLOS Biology, 12(11):e1001983, 2014.

[53] Y Zhao et al. Choosing blindly but wisely: differentially private solicitation of dna datasets for disease marker discovery. JAMIA, 22:100-108, 2015.

[54] Z Huang et al. Genoguard: Protecting genomic data against brute-force attacks. 36th IEEE Symposium on Security and Privacy, 2015.

[55] J Gardner et al. Share: System design and case studies for statistical health information release. JAMIA, 20:109-116, 2013.

[56] C Gentry. Fully homomorphic encryption using ideal lattices. STOC, 2009.

[57] N Gilbert. Researchers criticize genetic data restrictions. Nature, 2008.

[58] S Gupta et al. Modeling and detecting anomalous topic access. Proceedings of the 11th IEEE International Conference on Intelligence and Security Informatics, pages 100-105, 2013.

[59] M Gymrek, A McGuire, D Golan, E Halperin, and Y Erlich. Identifying personal genomes by surname inference. Science, 339(6117):321-324, 2013.

[60] N Homer, S Szelinger, M Redman, D Duggan, W Tembe, J Muehling, J Pearson, D Stephan, S Nelson, and D Craig. Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS Genet, 4(8), 2008.

[61] J Hsu et al. Differential privacy: an economic method for choosing epsilon. CoRR, page abs/1402.3329, 2014.

[62] J Hsu, M Gaboardi, A Haeberlen, S Khanna, A Narayan, B Pierce, and A Roth. Differential privacy: an economic method for choosing epsilon. Proceedings of 27th IEEE Computer Security Foundations Symposium, 2014.

[63] H Im, E Gamazon, D Nicolae, and N Cox. On sharing quantitative trait gwas results in an era of multiple-omics data and the limits of genomic privacy. Am J Hum Genet, 90(4):591-598, 2012.

[64] K Jacobs, M Yeager, S Wacholder, D Craig, P Kraft, D Hunter, J Paschal, T Manolio, M Tucker, R Hoover, G Thomas, S Chanock, and N Chatterjee. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nat Genet, 41(11):1253-1257, 2009.

[65] X Jiang et al. Privacy technology to support data sharing for comparative effectiveness research. Medical Care, 51:58-64, 2013.

[66] Y Jiang et al. A community assessment of privacy preserving techniques for human genomes. BMC Medical Informatics and Decision Making, 14(S1), 2014.

[67] A Johnson and V Shmatikov. Privacy-preserving data exploration in genome-wide association studies. KDD, pages 1079-1087, 2013.

[68] H-W Jung and K El Emam. A linear programming model for preserving privacy when disclosing patient spatial information for secondary purposes. International Journal of Health Geographics, 13(16), 2014.

[69] D Kifer and A Machanavajjhala. No free lunch in data privacy. SIGMOD, pages 193-204, 2011.

[70] A Korolova. Privacy violations using microtargeted ads: A case study. JPC, 3, 2011.

[71] V Lampos, T De Bie, and N Cristianini. Flu detector - tracking epidemics on twitter. Machine Learning and Knowledge Discovery in Databases, 6323:599-602, 2010.

[72] A Lemke et al. Community engagement in biobanking: Experiences from the emerge network. Genomics Soc Policy, 6:35-52, 2010.

[73] N Li, T Li, and S Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. ICDE, 2007.

[74] G Loukides, A Gkoulalas-Divanis, and B Malin. Anonymization of electronic medical records for validating genome-wide association studies. PNAS, 107(17):7898-7903, 2010.

[75] T Lumley and K Rice. Potential for revealing individual-level information in genome-wide association studies. J Am Med Assoc, 303(7):659-660, 2010.

[76] B Malin, K El Emam, and C O'Keefe. Biomedical data privacy: problems, perspectives and recent advances. J Am Med Inform Assoc, 1:2-6, 2013.

[77] D Manen, A Wout, and H Schuitemaker. Genome-wide association studies on hiv susceptibility, pathogenesis and pharmacogenomics. Retrovirology, 9(70):1-8, 2012.
Share: System design and case studies for statistical health information release. JA MIA, 20:109-116, 2013. [56] C Gentry. Fully homomorphic encryption using ideal lattices. STOC, 2009. [57] N Gilbert. Researchers criticize genetic data restrictions. Nature, 2008. [58] S Gupta and et al. Modeling and detecting anomalous topic access. Proceedings of the 11th IEEE International Conference on Intelligence and Security Informatics, pages 100-105, 2013. [59] M Gymrek, A McGuire, D Golan, E Halperin, and Y Erlich. Identifying personal genomes by surname inference. Science, 339(6117):321-324, 2013. [60] N. Homer, S. Szelinger, M.Redman, D. Duggan, W. Tembe, J. Muehling, J. Pearson, D. Stephan, S. Nelson, , and D. Craig. Resolving individual's contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS Genet, 4(8), 2008. [61] J Hsu and et al. Differential privacy: an economic method for choosing epsilon. CoRR, page abs/1402.3329, 2014. 150 [62] J Hsu, M Gaboardi, A Haeberlen, S Khanna, A Narayan, B Pierce, and A Roth. Differential privacy: an economic method for choosing epsilon. Proceedings of 27th IEEE Computer Security Foundations Symposium, 2014. [631 H Im, E Gamazon, D Nicolae, and N Cox. On sharing quantitative trait gwas results in an era of multiple-omics data and the limits of genomic privacy. Am J Hum Genet, 90(4):591-598, 2012. [641 K Jacobs, M Yeager, S Wacholder, D Craig, P Kraft, D Hunter, J Paschal, T Manolio, M Tucker, R Hoover, G Thomas, S Chanock, and N Chatterjee. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nat Genet, 41(11):1253-1257, 2009. [651 X Jiang and et al. Privacy technology to support data sharing for comparative effectiveness research. Medical Care, 51:58-64, 2013. [661 Y Jiang and et al. A community assessment of privacy preserving techniques for human genomes. BMC Medical Informatics and Decision Making, 14(S1), 2014. [67] A Johnson and V. Shmatikov. Privacy-preserving data exploration in genomewide association studies. KDD, pages 1079-1087, 2013. [681 H-W Jung and K El Emam. A linear programming model for preserving privacy when disclosing patient spatial information for secondary purposes. International Journal of Health Geographics, 13(16), 2014. [691 D Kifer and A Machanavajjhala. No free lunch in data privacy. SIGMOD, pages 193-204, 2011. [70] A Korolova. Privacy violations using microtargeted ads: A case study. JPC, 3, 2011. [711 V Lampos, T De Bie, and N Cristianini. Flu detector - tracking epidemics on twitter. Machine Learning and Knowledge Discovery in Databases, 6323:599-62, 2010. [721 A Lemke and et al. Community engagement in biobanking: Experiences from the emerge network. Genomics Soc Policy, 6:35-52, 2010. [731 N Li, T Li, , and S Venkatasubramanian. anonymity and 1-diversity. ICDE, 2007. t-closeness: Privacy beyond k- [74] G Loukides, A Gkoulalas-Divanis, and B Malin. Anonymization of electronic medical records for validating genome-wide association studies. PNAS, 107(17):7898-7903, 2010. [751 T Lumley and K Rice. Potential for revealing individual-level information in genome-wide association studies. J Am Med Assoc, 303(7):659-660, 2010. 151 [76] B Malin, K El Emam, and C O'Keefe. Biomedical data privacy: problems, perspectives and recent advances. J Am Med Inform Assoc, 1:2-6, 2013. [77] D Manen, A Wout, and H Schuitemaker. Genome-wide association studies on hiv susceptibility, pathogenesis and pharmacogenomics. 
Retrovirology, 9(70):1- 8, 2012. [781 F McSherry and K Talwar. Mechanism design via differential privacy. Proceedings of the 48th Annual Symposium of Foundations of Computer Science, 2007. [79] S Murphy and H Chueh. A security architecture for query tools used to access large biomedical databases. JAMIA, 9:552-556, 2002. [80] S Murphy and et al. Strategies for maintaining patient privacy in i2b2. JAMIA, 18:103-108, 2011. [811 K Nissim, S Raskhodnikova, and A Smith. Smooth sensitivity and sampling in private data analysis. STOC, pages 75-84, 2007. [821 D Nyholt, C Yu, and P Visscher. On jim watson's apoe status: genetic infor- mation is hard to hide. Eur. J. Hum. Genet., 17:147-149, 2009. [83] J Oliver, M Slashinski, T Wang, P Kelly, S Hilsenbeck, and A McGuirea. Balancing the risks and benefits of genomic data sharing: genome research partic- ipantshAZ perspectives. Public Health Genom, 15(2):106-114, 2012. 1841 E Ramos, C Din-Lovinescu, E Bookman, L McNeil, C Baker, G Godynskiy, E Harris, T Lehner, C McKeon, J Moss, V Starks, S Sherry, T Manolio, and L Rodriguez. A mechanism for controlled access to gwas data: experience of the gain data access committee. Am J Hum Genet, 92(4):479-488, 2013. [85] B Reis, I Kohane, and K Madl. Longitudinal histories as predictors of future diagnoses of domestic abuse. BMJ, 339:b3677, 2009. [861 L Rodriguez, L Brooks, J Greenberg, and E Green. The complexities of genomic identifiability. Science, (339):275-276, 2013. [87] M Saeed and et al. Multiparameter intelligent monitoring in intensive care ii (mimic-ii): A public-access intensive care unit database. Crit Care Med, 39:952-960, 2011. [88] S Sankararaman, G Obozinski, M Jordan, and E Halperin. Genomic privacy and the limits of individual detection in a pool. Nat Genet, 41:965-967, 2009. [89] A Sarwate and et al. Sharing privacy-sensitive access to neuroimaging and genetics data: a review and preliminary validation. Frontiersin Neuroinformatics, 8(35):doi:10.3389/fninf.2014.00035, 2014. 152 190] E Schadt, S Woo, and K Hao. Bayesian method to predict individual snp genotypes from gene expression data. Nat Genet, 44(5):603-608, 2012. [911 L Sweeney. Simple demographics often identify http://dataprivacylab.org/projects/identifiability/, 2010. people uniquely. [92] L Sweeney. K-anonymity: a model for protecting privacy. InternationalJournal on Uncertainty, Fuzziness and Knowledge-based Systems, 10:557-570, 2011. [93] L Sweeney, A Abu, and J Winn. Identifying participants in the personal genome project by name. SSRN Electronic Journal, pages 1-4, 2013. [94] G Tucker, A Price, and B Berger. Improving power in gwas while addressing confounding from population stratification with pc-select. Genetics, 197(3):1044-1049, 2014. [95] C Uhler, S Fienberg, and A Slavkovic. Privacy-preserving data sharing for genome-wide association studies. Journal of Privacy and Confidentiality, 5(1):137-166, 2013. [96] T van Schaik et al. The need to redefine genomic data sharing: A focus on data accessibility. Applied and TranslationalGenomics, 3:100-104, 2014. [97] S Vinterbo and et al. Protecting count queries in study design. JAMIA, 19:750757, 2012. [98] P Visscher and W Hill. The limits of individual identification from sample allele frequencies: theory and statistical analysis. PLoS Genet, 5(10), 2009. [99] D Vu and A Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. 2009. [100] D Vu and A Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. Data Mining Workshop, 2009. 
[101] L Walker, H Starks, K West, and S Fullerton. dbgap data access requests: a call for greater transparency. Sci Transl Med, 3(113):1-4, 2011.

[102] S Wang, N Mohammed, and R Chen. Differentially private genome data dissemination through top-down specialization. BMC Medical Informatics and Decision Making, 14(S1), 2014.

[103] G Weber et al. The shared health research information network (shrine): A prototype federated query tool for clinical data repositories. JAMIA, 16:624-630, 2009.

[104] A Yao. Protocols for secure computations (extended abstract). FOCS, pages 160-164, 1982.

[105] F Yu and Z Ji. Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to idash healthcare privacy protection challenge. BMC Medical Informatics and Decision Making, 14(S1), 2014.

[106] F Yu, M Rybar, C Uhler, and S Fienberg. Differentially private logistic regression for detecting multiple-snp association in gwas databases. Privacy in Statistical Databases, 8744:170-184, 2014.

[107] E Zerhouni and E Nabel. Protecting aggregate genomic data. Science, 321(5898):1278, 2008.

[108] X Zhou, B Peng, Y Li, Y Chen, H Tang, and X Wang. To release or not to release: evaluating information leaks in aggregate human-genome data. ESORICS, pages 607-627, 2011.