Preserving Patient Privacy in Biomedical Data Analysis

by

Sean Kenneth Simmons

B.S., University of Texas (2011)

Submitted to the Department of Mathematics
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Applied Mathematics
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Mathematics, July 21st, 2015 (signature redacted)

Certified by: Bonnie Berger, Professor of Applied Mathematics, Thesis Supervisor (signature redacted)

Accepted by: Peter Shor, Chairman, Applied Mathematics Committee (signature redacted)
Preserving Patient Privacy in Biomedical Data Analysis
by
Sean Kenneth Simmons
Submitted to the Department of Mathematics
on July 21st, 2015, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Applied Mathematics
Abstract
The growing number of large biomedical databases and electronic health records
promise to be an invaluable resource for biomedical researchers. Recent work, however, has shown that sharing this data, even when aggregated to produce p-values, regression coefficients, count queries, and minor allele frequencies (MAFs), may compromise patient privacy. This raises a fundamental question: how do we protect
patient privacy while still making the most out of their data?
In this thesis, we develop various methods to perform privacy preserving analysis
on biomedical data, with an eye towards genomic data. We begin by introducing a model-based measure, PrivMAF, that allows us to decide when it is safe to release
MAFs. We modify this measure to deal with perturbed data, and show that we are
able to achieve privacy guarantees while adding less noise (and thus preserving more
useful information) than previous methods.
We also consider using differentially private methods to preserve patient privacy.
Motivated by cohort selection in medical studies, we develop an improved method
for releasing differentially private medical count queries. We then turn our eyes
towards differentially private genome wide association studies (GWAS). We improve
the runtime and utility of various privacy preserving methods for genome analysis,
bringing these methods much closer to real-world applicability. Building on this result, we develop differentially private versions of more powerful statistics based on linear mixed models.
Thesis Supervisor: Bonnie Berger
Title: Professor of Applied Mathematics
Acknowledgments
First and foremost, I would like to thank my thesis supervisor, Bonnie Berger, for
all her help and guidance throughout the years. I couldn't have done it without her
and her constant encouragement! I would also like to thank everyone in the Berger
lab, past and present, who have not only helped me grow as a researcher, but who
were there to help me keep my sanity (thank you all!). In particular, thanks to Po-Ru, George and Jian for helping get me started on research; to Bianca, Sumaiya,
Sepehr, Deniz, and William for all the conversations I had with them about things
both biology and not biology related; to Noah for all the advice (in terms of writing
papers, performing research, and deciding what I should do career-wise) over the years,
as well as to Rachel Daniels for all her help in editing; to Yaron for all the interesting
conversations we've had over the past year about RNA-Protein interactions, among
many other topics; and to Patrice for everything (you really do keep the lab from
exploding!). I am also indebted to Jadwiga for being a great collaborator and source
of advice over the past few years; I always looked forward to our weekly meetings!
Thanks also to Vinod Vaikuntanathan and Jon Kelner for agreeing to be on my thesis
committee. I'm also indebted to all the administrators in the math department who
have helped me over the years, and to my friends in the REFS program.
Thanks also to all the great advisers and mentors I've had over the years outside
of MIT. This includes Dr. Cline, Dr. Laude and Dr. Vick at UT for all their great
advice and interesting conversations. In addition, I owe a debt to Dr. Blanchet-Sadri
at UNCG for the three summers that introduced me to research and lots of interesting
problems. Along the same lines, I want to thank my mentors at NKU and the NSA
for their mentorship during my stays there, and to Dr. Gordon and Dr. Helleloid at
UT for all they did as undergraduate research advisers (albeit at different points in
my education).
I wouldn't have made it through my PhD without all my friends in the CS and
Math departments at MIT, including John, Hans, Ruthi, Padma, Adrian, and many
others. Thanks for all the companionship and advice!
Thanks also to my friends from outside MIT (Jessie, Rachel, Anna, Ariel, Dan, and many others) for helping
me keep my sanity. In particular, thanks to Lupe. I'm not sure how I would have
kept my sanity these past few months without you around!
I would also like to thank my parents for all they have done over the years; I wouldn't be here without their support. I'd even like to thank my sisters: a little
sibling rivalry always helps to motivate!
Finally I would also like to thank the NSF and the MIT mathematics department
for their generous funding.
Contents

1 Introduction
  1.1 Model Based Approaches
  1.2 Model Free Approaches
  1.3 The Place of Differential Privacy in Biomedical Research

2 Background
  2.1 Biology Background
    2.1.1 Basic Genetics
    2.1.2 GWAS
    2.1.3 The Rise of Electronic Health Records
  2.2 Privacy Background
    2.2.1 Differential Privacy
    2.2.2 Other Approaches to Privacy
  2.3 Privacy Concerns and Biomedical Data
  2.4 Previous Applications of Privacy Preserving Approaches to Biomedical Data
    2.4.1 HIPAA and Other Legislative Approaches
    2.4.2 Access Control
    2.4.3 Differential Privacy
  2.5 Other approaches

3 One Size Doesn't Fit All: Measuring Individual Privacy in Aggregate Genomic Data
  3.1 Introduction
    3.1.1 Previous work
    3.1.2 Our Contribution
  3.2 Methods
    3.2.1 The Underlying Model
    3.2.2 Measuring Privacy of MAF
    3.2.3 Measuring Privacy of Truncated Data
    3.2.4 Measuring Privacy of Adding Noise
    3.2.5 Choosing the Size of the Background Population
    3.2.6 Release Mechanism
    3.2.7 Simulated Data
  3.3 Results
    3.3.1 Privacy and MAF
    3.3.2 Privacy and Truncation
    3.3.3 Privacy and Adding Noise
    3.3.4 Worst Case Versus Average
    3.3.5 Comparing β to α
    3.3.6 Reidentification Using PrivMAF
  3.4 Conclusion
  3.5 Derivation of Methods
    3.5.1 Basic Model
    3.5.2 PrivMAF
    3.5.3 PrivMAF for Data with Noise Added
    3.5.4 PrivMAF for Data with Truncation
    3.5.5 Comparison to previous approaches
    3.5.6 A Release Mechanism: Allele Leakage Guarantee Test
    3.5.7 Changing the Assumptions
    3.5.8 Estimating the parameters

4 Improved Privacy Preserving Counting Queries in Medical Databases
  4.1 Introduction
  4.2 Previous Work
  4.3 Exponential Mechanism
  4.4 Derivation of Mechanism
  4.5 Theoretical Comparison
  4.6 Results
  4.7 Choosing ε
  4.8 Conclusion

5 Picking Top SNPs Privately with the Allelic Test Statistic
  5.1 Introduction
    5.1.1 Previous Work
  5.2 Our Contributions
  5.3 Set Up
  5.4 GWAS Data
  5.5 Picking Top SNPs with the Neighbor Mechanism
  5.6 Fast Neighbor Distance Calculation with Private Genotype Data
    5.6.1 Method Description
    5.6.2 Proof Overview
    5.6.3 Proof for Significant SNPs
    5.6.4 Proof for Non-Significant SNPs
  5.7 Results: Applying Neighbor Mechanism to Real-World Data
    5.7.1 Measuring Utility
    5.7.2 Comparison to Other Approaches
    5.7.3 Comparison to Arbitrary Boundary Value
    5.7.4 Runtime
  5.8 Output Perturbation
    5.8.1 Calculating Sensitivity
    5.8.2 Output Perturbation
    5.8.3 Input Perturbation
  5.9 Conclusion

6 Correcting for Population Structure in Differentially Private GWAS
  6.1 Introduction
  6.2 Previous Work
  6.3 Our Contributions
  6.4 GWAS and LMM
  6.5 Achieving Differential Privacy Attempt One: Laplace Mechanism
  6.6 Achieving Differential Privacy Attempt Two: Exponential Mechanism
  6.7 Results: Testing Our Method
  6.8 Picking Top SNPs
    6.8.1 The Approach
    6.8.2 Application to Data
    6.8.3 Picking Top SNPs with the Laplacian Mechanism
  6.9 Estimating σ_e and σ_g in a differentially private manner
  6.10 Conclusion and Future Work

7 Conclusion
List of Figures

1-1 An adversary cannot get access to medical data directly. Instead, by looking at published analyses (or querying the system with seemingly legitimate queries) they are able to gain access to private information. The data under consideration might be part of a large biomedical repository, a hospital's health records, or some other source.

2-1 An example of a snippet of the genome

2-2 An example of a SNP

3-1 PrivMAF applied to the WTCCC dataset. In all plots we take n=1000 research subjects and a background population of size N=100,000. (a) Our privacy measure PrivMAF increases with the number of SNPs. The blue line corresponds to releasing MAFs with no rounding, the green line to releasing MAFs rounded to one decimal digit, and the red line to releasing MAFs rounded to two decimal digits. Rounding to two digits appears to add very little to privacy, whereas rounding to one digit achieves much greater privacy gains. (b) The blue line corresponds to releasing the MAF with no noise, the red line to releasing the MAF with noise corresponding to ε = .5, and the green line to ε = .1. Adding noise corresponding to ε = .5 seems to add very little to privacy, whereas taking ε = .1 achieves much greater privacy gains.

3-2 Truncating simulated data to demonstrate scaling. We plot our privacy measure PrivMAF versus the number of SNPs for simulated data with n=10,000 subjects and a background population of size N=1,000,000. The green line corresponds to releasing MAFs with no rounding, the blue line to releasing MAFs rounded to three decimal digits, and the red line to releasing MAFs rounded to two decimal digits. Rounding to three digits seems to add very little to privacy, whereas rounding to two digits achieves much greater privacy gains.

3-3 Worst Case Versus Average Case PrivMAF. Graph of the number of SNPs, denoted m, versus PrivMAF. The blue curve is the maximum value of PrivMAF(d, MAF(D)) taken over all d ∈ D for a set of n = 1,000 randomly chosen participants in the British Birth Cohort, while the green curve is the average value of PrivMAF(d, MAF(D)) in the same set. The maximum value of PrivMAF far exceeds the average. By the time m = 1000 it is almost five times larger.

3-4 ALGT applied to the WTCCC dataset. A graph of the uncorrected threshold, α, versus the corrected threshold, β = β(α), from ALGT is given in blue. The green line corresponds to an uncorrected threshold. We see that for some choices of α, correction may be desired. For example, for α = .05 the corrected threshold is approximately β = .03. Here we again use the British Birth Cohort with n=1000 study participants, m=1000 SNPs, and a background population of size N=100,000.

3-5 ROC Curves of PrivMAF and Likelihood Ratio. ROC curves obtained using PrivMAF (green triangles) and the likelihood ratio method (red circles) to reidentify individuals in the WTCCC British Birth Cohort with n=1,000 study participants and 1,000 SNPs.

3-6 ROC Curves of PrivMAF with Truncation. ROC curves obtained using PrivMAF for reidentification of unperturbed data (in red, AUC=.686), data truncated after two decimal digits (aka k = 2, in blue, AUC=.682), and data truncated after one decimal digit (aka k = 1, in green, AUC=.605). We see that truncation can greatly decrease the effectiveness of reidentification. Note that the ROC of the unperturbed data here is different from that in the previous figure. This is because we used a different random division of our data in each case.

3-7 ROC Curves of PrivMAF with noisy data. ROC curves obtained using PrivMAF for reidentification of unperturbed data (in red, AUC=.696), with noise corresponding to ε = .5 (in green, AUC=.693), and with ε = .1 (in blue, AUC=.656). We see that adding noise can decrease the effectiveness of reidentification. Note that the ROC of the unperturbed data here is different from that in the previous figures. This is because we used a different random division of our data in each case.

4-1 Here we plot privacy parameter ε versus the risk (where the risk is on a log scale) for the naive exponential mechanism (blue) and our mechanism with search parameters k = 10 (green) and k = 100 (red). We see that in all cases ours performs much better than the naive exponential mechanism.

4-2 We plot r_max, the maximum value returned by our algorithm, versus the runtime (both on a log scale) for the naive exponential mechanism (blue) and our algorithm with search parameters k = 10 (green) and k = 100 (red). We see that, though our algorithm is slower, it still runs in a few minutes in all cases.

4-3 We plot privacy parameter ε versus the parameter µ (which increases with utility) for both the naive exponential mechanism (blue) as well as our analysis with k = 10 (green) and k = 100 (red). Note that µ corresponds to utility (a higher µ means a higher utility), while a higher ε corresponds to less privacy. We see that our analysis shows that a given ε corresponds to a larger µ, which means that in many algorithms that balance utility and privacy when choosing ε, our analysis will result in adding less noise to the answer of the count query.

5-1 Our algorithm for finding the solution, δ, of our relaxed optimization problem relies on the fact that there are only two possible types of solutions: (a) extreme point solutions and (b) tangent point solutions. Our algorithm finds all such extreme points and tangent points, and iterates over them to find the solution to our relaxed optimization problem.

5-2 We measure the performance of our modified neighbor method for picking top SNPs (red) as well as the score based (blue) and Laplacian based (green) methods for m_ret (the number of SNPs being returned) equal to a. 3 b. 5 c. 10 and d. 15 for varying values of ε. For m_ret = 3, 5 we consider ε between 0 and 5, while in the other cases we consider ε between 0 and 30. We see that in all four graphs our method leads to the best performance by far. These results are averaged over 20 iterations.

5-3 We measure the performance of our modified neighbor method for picking top SNPs (in red) as well as the traditional neighbor method with cutoffs corresponding to a Bonferroni corrected p-value of .05 (in green) and .01 (in blue) for m_ret (the number of SNPs being returned) equal to a. 3 b. 5 c. 10 and d. 15 for varying values of ε. For m_ret = 3, 5 we consider ε between 0 and 5, while in the other cases we consider ε between 0 and 30. We see that in the first three cases the traditional method slightly outperforms ours. When m_ret = 15, however, the traditional methods can only get maximum utility around .85, whereas ours can get utility arbitrarily close to 1. This shows how we are able to overcome one of the major concerns about the neighbor method with only minimal cost. These results are averaged over 20 iterations.

5-4 Comparing two forms of output perturbation in scenario 2: the first coming from applying the Laplace mechanism directly to the allelic test statistic (green), the other applying it to the square root of the allelic test statistic then squaring the result (blue), comparing the L1 error on the y axis with ε on the x. We first apply it to 1000 random SNPs (a), then to the top ten highest scoring SNPs (b). We see that in both cases applying the statistic to the square root outperforms the standard approach.

5-5 Comparing the output perturbation of the allelic test statistic for scenarios 1 and 2, comparing the L1 error on the y axis with ε on the x. In scenario 2 (the blue curve) we add the noise to the square root then square the result, whereas for scenario 1 (the green curve) we apply the Laplacian mechanism directly to the test statistic (this choice is motivated by the previous figures). We first apply it to 1000 random SNPs (a), then to the top ten highest scoring SNPs (b). We see that in scenario 2 we require much less noise than scenario 1.

5-6 We measure the performance of the Laplacian method for picking top SNPs in scenarios 1 (in blue) and 2 (in green) with m_ret (the number of SNPs being returned) equal to a. 3 b. 5 c. 10 and d. 15 for varying values of ε. For m_ret = 3, 5 we consider ε between 1 and 10, while in the other cases we consider ε between 10 and 100. We see in all four graphs that scenario 2 leads to the best performance. Scenario 1, which is the one that appeared in previous work, leads to a greater loss of utility. These results are averaged over 100 iterations.

5-7 Comparing the output perturbation of the allelic test statistic in scenario 2 (blue) to the input perturbation method in scenario 1 (green). We see that in this case, as opposed to previous cases, scenario 1 outperforms scenario 2 despite requiring stronger privacy guarantees. This demonstrates that input perturbation is preferable to output perturbation.

5-8 Comparing the input perturbation of the allelic test statistic for scenarios 1 (green) and 2 (blue), comparing the L1 error on the y axis with ε on the x. In both cases we use input perturbation. We see that scenario 1 requires more noise to be added.

6-1 Comparing the output perturbation of the Laplacian based method (green) with our neighbor based method (blue) with σ_g² = σ_e² = .5, with both (a) 1000 random SNPs and (b) the causative SNPs. We see that our method performs better in both cases for the choices of ε considered.

6-2 We measure the performance of the three methods for picking top SNPs using score (blue), neighbor (red) and Laplacian (green) based methods with m_ret (the number of SNPs being returned) equal to a. 3 b. 5 c. 10 and d. 15 for varying values of ε between 10 and 100. We see in all four graphs that the score method leads to the best performance, followed by the neighbor mechanism. These results are averaged over 20 iterations.
List of Tables

5.1 We demonstrate the runtime of our exact method as well as the approximate method for various boundaries (where the boundary at m_ret is the average of the m_ret-th and (m_ret+1)-st highest scoring SNPs), as well as the average L1 error per SNP that comes from using the approximate method. We see that our exact method is much faster than the approximate method. In addition, its runtime is fairly steady for all choices of m_ret. We see the approximate method is faster and more accurate for larger m_ret; this makes sense since the average SNP will be closer to the boundary, so there will be less loss. These results are averaged over 20 trials.
Symbol      Description
MAF         Minor Allele Frequency
SNP         Single Nucleotide Polymorphism
EHR         Electronic health record
PrivMAF     Our privacy measure
σ_e²        Variance due to the environment
σ_g²        Variance due to genetics
σ           The tuple (σ_e, σ_g)
K, K_σ      (1/m)XX^T and σ_g²K + σ_e²I_n, respectively
D           Study cohort data (usually genomic)
X           The normalized genotype matrix
1_n         The n by n matrix of all ones
I_n         The n-dimensional identity matrix
N(µ, Σ)     The multivariate normal distribution with mean µ, covariance Σ
exp(x)      The exponential of x
P(S)        The probability of S
Lap(λ)      The one dimensional Laplacian distribution with mean 0 and scale λ
Lap_n(λ)    An n dimensional random variable whose entries are drawn from Lap(λ)
Δq          The sensitivity of q
R           The number of cases in a case-control study
S           The number of controls in a case-control study
r_i         The number of cases with genotype i at a given SNP, i ∈ {0, 1, 2}
s_i         The number of controls with genotype i at a given SNP, i ∈ {0, 1, 2}
Y           The allelic test statistic
det         The determinant
m           The number of SNPs under consideration
m_ret       The number of SNPs a given algorithm returns
PPV         Positive Predictive Value
Chapter 1
Introduction
In the last few decades, the use of sensitive patient data in biomedical research has been on the rise [55, 24, 103, 28, 87, 72]. Spurred by both the genomics revolution
and the rise of electronic health records (among other new sources of medical data
[71]), this valuable data holds the promise of revolutionizing medical research, giving
rise to the possibilities of personalized medicine and understanding the genetic basis
of various diseases.
This growth, however, has led to privacy concerns among patients and care providers
alike [80, 79, 3, 10, 83]. Medical records, for example, often contain sensitive data (disease status, etc.) that patients might not want revealed. It has been shown
that this concern leads many patients to withhold valuable medical information from
doctors. In order to get patients to participate in research we need to find ways of
balancing privacy with the needs of the medical community.
At first glance it seems like this should be an easy task. Deidentifying data (removing identifiers such as names, social security numbers, etc.) is an obvious solution. Again and again, however, it has been shown that such approaches can be circumvented [18, 91, 59]. Perhaps the most famous example of this occurred when Latanya Sweeney used voter registration records to reidentify supposedly anonymized medical records belonging to the Governor of Massachusetts [92].
More recently,
Gymrek et al. were able to use deidentified genomic data in conjunction with public
databases to reidentify many participants in genetic studies [59].
Figure 1-1: An adversary cannot get access to medical data directly. Instead, by looking at published analyses (or querying the system with seemingly legitimate queries) they are able to gain access to private information. The data under consideration might be part of a large biomedical repository, a hospital's health records, or some other source.
Surprisingly, even aggregate data can be a privacy risk. Korolova showed that count queries can give away private information about Facebook users [70]. In the biomedical realm it has been shown that various genomic statistics (such as MAF [60] and regression coefficients [63]) can lead to privacy concerns as well. This realization has led various groups, including the National Institutes of Health (NIH), to pull aggregate data from public databases and put it in access controlled databases [57].
Though such precautions help preserve privacy, they also take a toll on the scientific
community.
The medical community seems to be at an impasse: how can we allow researchers
access to medical data (data that could possibly save lives), while still ensuring the
privacy of the individuals involved? In this thesis we consider various methods for
allowing researchers access to confidential medical data while still preserving patient privacy.
After introducing some important biological and privacy background in
Chapter 2, we develop possible approaches to this problem.
1.1 Model Based Approaches
The work of Homer et al. [60] demonstrated that even simple aggregate statistics from genomic data can lead to privacy issues. In particular, they showed that minor allele frequencies (MAF, the prevalence of a given allele in the study population) can give away private information about study participants. This realization shocked the biological community, leading many (including the NIH) to move public aggregate data into controlled access repositories.
Since then, many researchers have looked into methods that allow the release of
this aggregate data while still ensuring privacy [19, 108, 88, 107, 95, 89]. Though
many of these methods are powerful, they have multiple drawbacks: either requiring so much noise to be added to the data that it destroys utility, or not ensuring privacy for all study participants.
In Chapter 3 we introduce a new, model based method that aims to overcome
both of these barriers. This method, known as PrivMAF, allows us to measure the
privacy lost by each individual in a study after releasing MAFs. We are able to apply
this measure to real data, showing that the level of privacy loss varies wildly between
individuals. PrivMAF can also be modified to measure the privacy gained by adding
noise to the MAF from a study, allowing us to ensure privacy with less utility loss than required by model free methods such as differential privacy (see below). Finally,
we go on to show how PrivMAF can be used to decide when it is safe to release the
MAF from a study, giving confidence to researchers who want to ensure privacy while
still being able to perform useful research.
1.2 Model Free Approaches
Model based approaches to privacy allow us to use our beliefs about possible adversaries (assumptions about what they know, what they don't, etc.) in order to increase the amount of data that we can safely release. Unfortunately, such approaches do not protect against all possible adversaries. To do that, model free approaches, such as differential privacy [17, 95], are required.
Loosely speaking, differential privacy works by releasing a perturbed statistic that
cannot be used to distinguish between a database containing a certain individual
and one not containing that individual, thus ensuring privacy (see Chapter 2 for
a more formal treatment). It has been suggested [67, 97] that differentially private
statistics could be used as a way to allow researchers access to EHR and other medical
data without violating privacy. In Chapters 4-6 we look at various ways this can be
achieved in practice, thus improving on the state of the art both in terms of accuracy
and computational efficiency.
One of the main areas where differential privacy has been considered is in study
design. Researchers often want to query medical databases to figure out how many patients in a particular hospital qualify as study participants. In order to allow this while still preserving privacy, various systems, such as i2b2 and STRIDE [103, 28], release noisy versions of these count queries. Unfortunately, most of these noisy mechanisms are ad hoc and do not provide formal privacy guarantees. More recently, it has been suggested that one could release differentially private versions of these count queries to help preserve privacy while retaining utility [97]. In Chapter 4, we
show how a modified version of this differentially private mechanism allows for more
utility while still ensuring privacy.
With the projected rise of genotyping in the clinic [103], it has been suggested that
the data resulting from such tests might be used to further genomic research. Again,
however, such uses raise privacy concerns for those involved. As a result, various authors [95, 106, 67] have studied ways of using this genomic data in a privacy preserving way. In Chapters 5 and 6 we further investigate this line of research, improving on
the state of the art.
Most previous work has focused on using differentially private versions of the allelic
test statistic to perform genomic studies in a privacy preserving fashion [67, 95, 66,
105, 102]. In Chapter 5 we improve upon these approaches. Our main contribution is to improve the accuracy and computational efficiency of the neighbor method, first introduced by Johnson and Shmatikov [67], for picking high scoring SNPs. We are
able to use an adaptively selected boundary value to remove problems arising with
the accuracy of this method. Moreover, we present a new algorithm, based loosely off
convex optimization, to show that the neighbor method can be made computationally
feasible (more specifically, we show that, for a given SNP, the neighbor distance can
be calculated in constant time).
We then go on to compare the utility trade-offs for two different privacy scenarios that occur in the literature: one in which we assume that all genomic data is private versus one in which only genomic data from the case cohort is private [95, 105]. We
see that, if we allow some leakage of information about the control cohort, we can
greatly improve the Laplacian based method for picking high scoring SNPs introduced
by Uhler et al. [95]. Moreover, in this case we see that we can modify the traditional
approach for output perturbation in order to better estimate the allelic test statistic
for a chosen SNP.
The final contribution of this chapter is to show that, in both scenarios, one can
better estimate the allelic test statistic using input perturbation instead of output
perturbation. This novelty allows us to get better estimates for the significance of a
given SNP.
The allelic test statistic is a powerful tool for better understanding the genetic underpinnings of disease. In more diverse populations, however, the presence of multiple subpopulations (people of different ethnicities, etc.) can lead to many false positives. Alternative statistics have been suggested for dealing with this problem [21, 27, 23, 94, 43]. Many of these are based on linear mixed models (LMM) [27, 23, 94, 43]. Though there has been some work on fitting LMM in a differentially private way [4], no work has attempted to use differentially private LMM in order to perform association analysis. In Chapter 6, we remedy this issue by introducing a differentially private LMM based statistic. After showing how to calculate the statistic in a differentially private way, we go on to apply it to picking high scoring SNPs.
We also briefly touch on how to fit such models in a differentially private way.
1.3 The Place of Differential Privacy in Biomedical Research
Though the tools of differential privacy have been around for years, the biomedical
community has been slow to adopt them [15]. This delay is partially due to the limited knowledge about such approaches in the biomedical field, but perhaps a bigger
reason is that current privacy techniques greatly reduce the utility of data and their
analysis. In a field whose main concern is human health, there is extra incentive to
give the most accurate analysis possible since lives could be on the line.
Despite this concern, there are a few important areas where differential privacy
might play a role in biomedical research. The most obvious one is when institutional
concerns or legal rules prevent data from being published- for example, consider
NIH's policy of not releasing much of its aggregate genomic data [57]. When such
limitations exist, it might be possible to release differentially private versions of the
data under consideration instead. Though this option is not ideal, if there is a choice
between noisy data and no data, the noisy data is often preferable.
The other application where differential privacy might be useful is when untrusted
users query a database. This situation has motivated many of the previous works on
differential privacy [97, 67], and some of the only applications of data perturbation
that have been implemented in real world systems [28, 80, 79]. In a nutshell, the
idea is that users who might want to use a large medical database to help design a study (for example, to find participants with certain traits, or to come up with hypotheses to test) or validate results can do so by sending queries to the database and getting differentially private answers to those queries.
This ability also allows researchers to better determine if a given dataset would be
useful to them before going through the often arduous task of requesting access [96].
This approach allows researchers access to the database while minimizing privacy
concerns. As an added bonus, since the queries are being used as a preliminary step,
as opposed to being part of a rigorous analysis, there is less concern about the ethical
implications of returning inaccurate results.
Furthermore, systems such as GUPT [44] can be used to allow users to make any query on the database they desire, although using such a general architecture does greatly decrease accuracy. Therefore, it makes sense to come up with specialized methods, such as the ones we introduce here, for dealing with common queries.
Chapter 2
Background
In this thesis we will be using several concepts from basic genetics and the privacy
community.
To help orient the reader we give a brief overview of these concepts
here. After introducing some basic concepts related to genomics and electronic health
records in the first few sections, we go on to introduce some basic concepts from the
privacy community. Finally, we finish off the chapter with an overview of some of the
privacy concerns facing the biomedical community today.
2.1 Biology Background

2.1.1 Basic Genetics
A large portion of human biological information is coded in the genome.
For our
purposes the genome can be thought of as a string consisting of the letters A, C, T
and G. Each of these letters is known as a nucleotide. Note that each location of the
human genome has two copies, one inherited from our father, one from our mother
(ignoring mitochondrial DNA, cancer, etc).
A single nucleotide polymorphism (SNP) is a location in the genome that differs
between different people in the human population. For a given SNP, an allele is one
possible value of the genome at that location. For most common SNPs there are at most two alleles, one more common (the major allele) and one less common (the minor allele).

Figure 2-1: An example of a snippet of the genome:
ACTTTGCCCCATAAAA
ACTTTGCCCCGTAAAA

Figure 2-2: An example of a SNP. The two sequences differ at a single nucleotide (A versus G):
ACTTTGCCCCATAAAA
ACTTTGCCCCGTAAAA

At a given SNP each individual can have either 0, 1 or 2 copies of the minor allele. As such, the genotype of an individual is often thought of as a high dimensional vector with one dimension for each SNP, where each entry in that vector is either 0, 1 or 2.
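Since genotypes are coded this way, the MAF of a SNP in a cohort is just the average number of minor allele copies divided by two. The following sketch makes this concrete (our own illustration in Python, not code from the thesis; the 0/1/2 encoding of G is taken from the text above):

    import numpy as np

    def minor_allele_freqs(G):
        """Per-SNP minor allele frequencies for an n-by-m genotype matrix G,
        where G[i, j] in {0, 1, 2} counts the copies of the tracked allele
        that individual i carries at SNP j."""
        n = G.shape[0]
        freq = G.sum(axis=0) / (2.0 * n)  # frequency of the tracked allele
        # if the tracked allele is actually the major one, flip to its complement
        return np.minimum(freq, 1.0 - freq)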
2.1.2 GWAS
One of the main aims of modern genomics is to link genomic polymorphism to disease [27]. Increasingly, it is becoming possible to achieve this goal with genome-wide
association studies (GWAS). The basic idea behind GWAS is simple: A researcher
collects a large cohort of people, and measures some trait, referred to as the phenotype (either a continuous trait like height or a discrete one like disease status). The
researcher also collects genotype data about each individual. It is then a simple matter to take the data and perform a statistical test (such as linear regression, logistic regression, the χ² test, the allelic test statistic, etc.) to see which SNPs are related to the trait
under consideration.
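To make the allelic test statistic concrete: for a single SNP it is the standard χ² statistic computed on the 2×2 table of allele counts (cases versus controls, minor versus major allele). Below is a sketch using the textbook form of the test and the genotype counts r_i, s_i from the table of symbols (our own illustration, not code from the thesis):

    import numpy as np

    def allelic_test_statistic(r, s):
        """Allelic test (chi-square on the 2x2 allele-count table) for one SNP.
        r[i] (s[i]) is the number of cases (controls) with i copies of the
        minor allele, i in {0, 1, 2}."""
        case_minor = r[1] + 2 * r[2]
        ctrl_minor = s[1] + 2 * s[2]
        case_total = 2 * (r[0] + r[1] + r[2])  # total alleles among cases
        ctrl_total = 2 * (s[0] + s[1] + s[2])
        table = np.array([[case_minor, case_total - case_minor],
                          [ctrl_minor, ctrl_total - ctrl_minor]], dtype=float)
        n = table.sum()
        # chi-square statistic for a 2x2 contingency table
        num = n * (table[0, 0] * table[1, 1] - table[0, 1] * table[1, 0]) ** 2
        den = (table[0].sum() * table[1].sum() *
               table[:, 0].sum() * table[:, 1].sum())
        return num / den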
Unfortunately, there are some complications that arise when performing this analysis naively. The fact that human genetic history is quite messy (due to the many
subpopulations of humans that have existed over time) leads to many associations
that are not relevant to the underlying biology. This problem is known as population
structure.
As an example of how population structure may complicate things, consider performing a GWAS related to height [43]. Assume we have a study cohort containing Northern European and East Asian individuals. Thanks to an accident of history, Northern Europeans tend to be taller than East Asians. Moreover, Northern Europeans are more likely to have a mutated version of the lactase gene that allows them to
drink milk as adults. This would lead a naive analysis to suggest that height is related
to the lactase gene, an association that seems meaningless biologically (discounting,
perhaps, the effects of drinking milk on height).
In order to avoid these false positives, numerous methods have been suggested. These include rescaling the results of the χ² test to account for such associations [16] and adding principal components as covariates in linear or logistic regression [21, 94].
A more recent solution has been to use linear mixed models to correct for this structure [27, 23, 94, 43]. Assume that X is a given n × m genotype matrix with each column corresponding to a SNP and each row to an individual (normalized so each column has mean 0 and variance 1; other normalizations are also used, we use this one for simplicity), y a phenotype, and ỹ the mean centered phenotype. Then a linear mixed model is of the form

    ỹ = Xβ + ε

where β ~ N(0, (σ_g²/m) I_m) and ε ~ N(0, σ_e² I_n) for some parameters σ_g and σ_e. One can fit this model and use it to try to find which SNPs are associated with y (more details are given in later chapters). Furthermore, the estimated parameters σ_e and σ_g are of biological interest, since they help researchers figure out how much of a trait's variance is due to genetics (aka how heritable it is) [30].
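Marginalizing out β makes the connection to the matrix K = (1/m)XX^T from the table of symbols explicit. A short derivation, added here for clarity and assuming the (σ_g²/m) scaling used above:

    % With \beta ~ N(0, (\sigma_g^2/m) I_m) and \epsilon ~ N(0, \sigma_e^2 I_n),
    % the marginal distribution of the mean-centered phenotype is
    \tilde{y} = X\beta + \epsilon
      \sim \mathcal{N}\left(0, \tfrac{\sigma_g^2}{m} X X^T + \sigma_e^2 I_n\right)
      = \mathcal{N}\left(0, \sigma_g^2 K + \sigma_e^2 I_n\right),
    \qquad K = \tfrac{1}{m} X X^T.

In other words, under this model the phenotype covariance splits into a genetic component σ_g²K and an environmental component σ_e²I_n, which is exactly the matrix K_σ from the table of symbols.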
This model has two main upsides.
The first is that it is able to help correct
for population stratification [31]. At the same time, this model gives the user more
statistical power by including information about all the SNPs instead of considering
them one at a time. Note that there are many different variations of this model [27,
23, 94, 43], each of which has different drawbacks and benefits in terms of statistical
power, ability to correct for population stratification, and run time.
As a final note, the above model assumes y is a continuous phenotype. It turns
out, however, that this model can also be applied to discrete traits (such as disease
status) using the liability threshold framework [48].
2.1.3 The Rise of Electronic Health Records
In the past decade electronic health records (EHR) have become standard. Driven both by new technologies and various government programs [1], the hope was that, by replacing paper records, EHR would improve medical care for everyone.
One of the major benefits of EHR is that they make secondary use of clinical data for scientific studies feasible. Programs such as i2b2, eMERGE and STRIDE [103, 28, 72] have been established to help use this patient data to gain insight into the basic biology behind human disease. Unfortunately, this data tends to be noisy and incomplete, leading to many biases. Despite these drawbacks there have been many success stories, such as using these records to identify disease subtypes [24] or using them to better identify victims of domestic abuse [85]. Moreover, in order to make better use of these records there have been attempts, such as the MIMIC database [87], to release deidentified versions of these records that academics can use in their research without going through an involved application process.
2.2 Privacy Background

2.2.1 Differential Privacy
Assume that we have a dataset D = (d_1, ..., d_n) ∈ 𝔻^n and want to calculate f(D) for some f : 𝔻^n → Q, where Q and 𝔻 are both sets. This is simple enough (assuming f is easy to calculate). It may be the case, however, that f(D) releases private information about d_i for some i. For example, if D is a set of patients with a given condition, then f(D) may reveal the fact that d_i is in D, and thus has the condition.

In order to deal with this worry we want to release a perturbed version of f, call it F, that does not have the same privacy concerns. This idea is formalized using differential privacy [17]. We say that D and D' = (d'_1, ..., d'_n) are neighboring databases if both are the same size and differ in exactly one entry (aka there is exactly one i such that d_i ≠ d'_i). We then have the following definition.

Definition 1. A random function F : 𝔻^n → Q is ε-differentially private for some ε > 0 if, for all neighboring databases D and D' and all sets S ⊆ Q, we have that

    P(F(D) ∈ S) ≤ exp(ε) P(F(D') ∈ S)

Intuitively, the above definition says that, if D and D' differ by one entry, then F(D) and F(D') are statistically hard to distinguish. This property ensures that no individual has too large an influence on F(D), so they cannot lose much privacy. The parameter ε is a privacy parameter: the closer to 0 it is the more privacy is ensured, while the larger it is the weaker the privacy guarantee. This means we would like to set ε as small as possible, but unfortunately this comes at the cost of having less useful outputs. The problem of figuring out the correct ε to use is quite tricky and often ill-defined [62].

Our goal is to find a differentially private F that closely approximates f. One of the simplest ways to do this is with what is known as the Laplacian mechanism [17]. Formally, if Q ⊆ R^n, we define the sensitivity of a function f, denoted Δf, to be equal to

    Δf = max_{D, D' neighbors} ||f(D) − f(D')||_1

where the L1 norm for a vector x = (x_1, ..., x_n) ∈ R^n is defined as

    ||x||_1 = Σ_i |x_i|

Moreover, let Lap_n(λ) be a random variable that returns an n dimensional vector with density

    P(Lap_n(λ) = x) ∝ exp(−||x||_1 / λ)

so that λ is the scale parameter of the coordinate functions of Lap_n(λ). The Laplacian mechanism is then achieved by letting

    F(D) = f(D) + Lap_n(Δf / ε)

Theorem 1. If F is defined as above then F is ε-differentially private.
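As a concrete illustration, the Laplacian mechanism is a one-liner once Δf is known. The sketch below is our own (not code from the thesis); for a count query, for example, Δf = 1, since changing one individual changes the count by at most one:

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
        """Release f(D) + Lap_n(sensitivity / epsilon), as in Theorem 1.
        true_answer is f(D) as a scalar or array, sensitivity is the L1
        sensitivity of f, and epsilon > 0 is the privacy parameter."""
        rng = rng if rng is not None else np.random.default_rng()
        noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon,
                            size=np.shape(true_answer))
        return np.asarray(true_answer, dtype=float) + noise

    # e.g. a differentially private count query (sensitivity 1):
    # noisy_count = laplace_mechanism(412, sensitivity=1.0, epsilon=0.1)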
Exponential Mechanism

Though the Laplacian mechanism was the first such mechanism suggested, many others have appeared over the years ([20, 55, 78, 11, 45], just to name a few). In particular, we are interested in coming up with estimates of f when there exists some loss function, q, such that f(D) = argmin_c q(D, c). One common approach is known as the exponential mechanism.

Theorem 2. Let f and q be as above, and define F so that F(D) is chosen according to

    P(F(D) = c) ∝ exp(−ε q(D, c) / (2Δq))

where Δq = max_{D, D' neighbors, c ∈ Q} |q(D, c) − q(D', c)|. Then F is ε-differentially private.

In many cases the exponential mechanism gives us an F that closely approximates f [78]. Unfortunately, it is difficult to sample from F, so the exponential mechanism is not always a practical choice.
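When the output space Q is small enough to enumerate, however, sampling from F is easy; the difficulty arises for large or continuous Q. A minimal sketch for the finite case (our own illustration; q and Δq are the loss function and sensitivity from Theorem 2):

    import numpy as np

    def exponential_mechanism(D, candidates, q, delta_q, epsilon, rng=None):
        """Sample c with probability proportional to
        exp(-epsilon * q(D, c) / (2 * delta_q))."""
        rng = rng if rng is not None else np.random.default_rng()
        scores = np.array([-epsilon * q(D, c) / (2.0 * delta_q)
                           for c in candidates])
        scores -= scores.max()  # shift for numerical stability
        probs = np.exp(scores)
        probs /= probs.sum()
        return candidates[rng.choice(len(candidates), p=probs)]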
2.2.2 Other Approaches to Privacy
For the sake of completeness we will say a few words about some other paradigms for private data release.

One common approach is to deidentify the data. This approach involves removing certain known identifiers from the data before sharing it. These identifiers can include anything from names to zip codes to ages. This approach has the upside of preserving almost all the utility in a dataset, though in many cases it has led to privacy breaches when data fields not thought to be identifying are connected to outside information [92].
To deal with this shortcoming the idea of k-anonymity (as well as many variations thereof) was suggested [92]. In a nutshell, k-anonymity works by taking the data and, through various transformations (removing data entries, generalizing data fields, etc.), generating a version of the dataset such that every record in the dataset matches at least k − 1 other records. This prevents an adversary from determining which of these k records belongs to a given individual. Although this improves privacy, there have been attacks on it in various instances [73]. Moreover, it does come at the cost of having less useful data.
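To make the definition concrete, checking whether a table is k-anonymous over a chosen set of quasi-identifiers amounts to counting how often each combination of those fields occurs. A toy check (our own sketch; real k-anonymizers must also perform the generalization step described above):

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        """True if every combination of quasi-identifier values occurring
        in `records` (a list of dicts) is shared by at least k records."""
        counts = Counter(tuple(r[q] for q in quasi_identifiers)
                         for r in records)
        return all(count >= k for count in counts.values())

    # e.g. is_k_anonymous(rows, quasi_identifiers=["zip", "birth_year"], k=5)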
Another idea that has been proposed is auditing [29]. This framework, also known
as trust but verify, involves sharing data with certain individuals (aka a controlled
access framework), but making sure to go back into the records to check that the
users did not abuse this access. The approach has the advantage of giving users all
the utility of the dataset with increased privacy protections. This framework can help
deter and punish misconduct, but does not actively prevent it. On the downside, this
approach is often cumbersome, since it involves users applying for access, an often
painful process. In addition, it is not always clear what behaviors should be flagged
as misconduct, though there is active research on figuring this out [29].
Finally, we should mention cryptographic approaches to privacy. Often it is the
case that the final output of an analysis does not threaten privacy, but that the
database itself is still private. This situation does not present a problem if one user
holds all of the data and that user has the computational tools needed to analyze
it.
There are times, however, when the user may want to either outsource their
computation, or to combine their data with data from other users.
The simple solution to these problems is for the data owner to share their data.
This solution, however, clearly violates privacy.
Luckily, there are cryptographic
solutions to this problem. If the data owner wants to outsource data analysis they
can use homomorphic or functional encryption [56, 8]. Similarly, if numerous users
want to pool their data they can use multiparty computations [104]. Unfortunately,
both of these approaches are rather computationally intensive at the moment, so
are probably not useful for analysis of large, high-dimensional datasets such as those
present in genomics.
2.3 Privacy Concerns and Biomedical Data
Over the years there have been many different examples of how carelessness in the
way data is shared can lead to privacy concerns. Unfortunately, we do not have the time to go into them all here (though various reviews exist for any interested readers [19, 76]) and instead give a brief overview of some of the most important.
Perhaps the most prevalent source of privacy risk comes from human error. There
have been numerous cases of medical professionals losing laptops with sensitive data
and other mistakes that have led to privacy breaches.
If these were the only privacy issues that existed then biomedical privacy would not be a very interesting area of research. It turns out, however, that other, less obvious privacy concerns exist. Perhaps the first of these to be exposed was what is known as a linkage attack [93, 91]. The idea is simple: an adversary has access
to a deidentified database and some outside database that is not deidentified. The
adversary can then use the outside database to reidentify the individuals who are in
the deidentified database by looking for fields that occur in both the outside database
and the private database.
One of the first demonstrations of this was performed
by Latanya Sweeney [91]. After Massachusetts released deidentified medical health records, Sweeney was able to link these records to voter registration records using zip code and date of birth. To add a bit of flair to this result, she found the record of
the governor of Massachusetts, and sent it to him by mail.
Similar linkage attacks can occur when sharing genetic data. Gymrek et al. [59]
recently showed how it is possible, using online ancestry databases, to reidentify
supposedly deidentified genomic sequences. Using data from the Y chromosome they
were able to determine the surname of a large percentage of participants in some
online genomic databases. Using other information (such as genealogy information,
state of birth, etc), they could go even further and narrow the candidate down to a
few individuals, sometimes even reidentifying the sample completely.
One suggested method for protecting genomic data has been to withhold sensitive
or identifying parts of the genome, and only release the rest. It turns out that such
measures can be defeated [82]. In particular, since mutations at nearby locations in the genome are not independent (due to linkage disequilibrium), one can reconstruct private data contained in the genome.
Even aggregate genomic data, our main concern here, is not free of privacy concerns. Homer et al. [60] showed that MAFs could be used to determine if individuals had participated in a genetic study. Similar work has demonstrated that releasing regression coefficients [63] can lead to privacy loss as well.
Why do medical researchers care about privacy concerns? Beyond the obvious
ethical reasons, there is also the worry that privacy breaches will lead to lack of
trust, resulting in fewer individuals being willing to participate in studies. Previous
research has already shown that privacy concerns lead many individuals not to go
to their doctor with medical concerns [83], and privacy breaches would likely make
this worse. Moreover, privacy concerns have led several agencies (such as the NIH)
to hide medical data in controlled access repositories, something that many believe
hurts the scientific enterprise. It might be hoped that a better understanding of the
boundary between medicine and privacy can lead to a loosening of these restrictions.
2.4 Previous Applications of Privacy Preserving Approaches to Biomedical Data
As one might imagine, the concern about privacy in biomedical data analysis has
led to a slew of suggested approaches for dealing with these issues. Such approaches
range from completely policy based to new cryptographic methods. Below we give a
brief overview of some of these approaches.
2.4.1 HIPAA and Other Legislative Approaches
Various legislative approaches have been suggested to help deal with the issue of
privacy and medical data. In the US, the main such legislation is HIPAA, the Health
Insurance Portability and Accountability Act of 1996 [2]. HIPAA sets up various requirements for releasing medical data publicly. For releasing individual level data, the act requires either that a statistician look at the data and declare it safe to release, or that the data meet the safe harbor requirement, which requires various pieces of identifying information to be removed.
2.4.2 Access Control
Access control is one of the most common approaches in biomedicine to protecting patient data, from EHR to the NIH and beyond [103, 39]. In a nutshell, access control involves taking data and storing it away from prying eyes. In order to get access to the data researchers have to apply and pass some kind of background check. After passing this check they are then given access to the data.

Increasingly there have been suggestions that, instead of allowing researchers to download data, researchers should instead be allowed to submit code to the repositories, which perform the analysis for them [52]. This approach helps prevent the loss of control which occurs after data is downloaded by researchers, and allows for the possibility of auditing to find out if researchers are abusing their privileges; such auditing methods have been suggested in many places in the literature [51, 58].
2.4.3 Differential Privacy
Differential privacy (see above) has also been suggested as a possible solution to the privacy conundrum. Methods have been developed for performing differentially private medical count queries [97, 65], statistical tests on medical databases [99], genomic studies [66, 67, 95, 106, 46, 53], and model fitting [32]. There has even been a competition, known as iDASH, to help come up with better methods for performing differentially private GWAS [66]. Though such methods have been steadily improving they have yet to see many real applications. The closest that we are aware of is in the case of study design. Databases such as i2b2 and STRIDE often return perturbed count queries to researchers as a way of preserving privacy. Note that, although much of this research has been encouraging, there is still a long way to go [37].

In a similar vein there has been work using methods similar to differential privacy to protect patient location data [49, 68].
2.5 Other approaches
k-anonymity is one of the most common privacy methods applied to medical data, and it has been applied to GWAS studies [26] and many other areas. Similarly, there have been model based approaches to releasing DNA sequences without violating relatives' privacy [38].
There has been a lot of interest in being able to combine various private databases to perform a joint analysis without losing any privacy. Such techniques are known as multiparty computations. Based on cryptography [104], such methods allow users to perform joint operations without losing privacy, though they come at a cost to performance. In the medical realm such approaches have been suggested for many applications, including drug monitoring [33], GWAS studies [36, 50], and paternity testing [25], among many others [42, 14, 40].
There have also been attempts to allow users to outsource computation to the
cloud in privacy preserving ways using homomorphic encryption (that is to say encryption that allows someone to compute on the data without decrypting it). Such
approaches have been applied to mapping genomic sequences to a reference sequence
[5] and simple genomic analysis [6]. There has also been work on figuring out novel
schemes to encrypt genomes so that even brute force attacks will fail [54], in order to
allow users to outsource genotype storage to the cloud.
A final area of interest has been in devising new methods for deidentifying patient
records. There have been numerous methods investigated for deidentifying medical
notes, either by removing identifiers to make them HIPAA compliant, or by removing
other identifying information not covered by HIPAA [22, 41]. Though such methods
are an interesting area of research, it is not yet clear how much private information
actually leaks through.
Chapter 3
One Size Doesn't Fit All: Measuring
Individual Privacy in Aggregate
Genomic Data
3.1 Introduction
Note: The work in this chapter was presented at the GenoPri workshop in IEEE
Symposium on Security and Privacy 2015.
Recent research has shown that sharing aggregate genomic data, such as p-values,
regression coefficients, and minor allele frequencies (MAFs) may compromise participant privacy in genomic studies [60, 19, 108, 90, 63]. In particular, Homer et al.
showed that, given an individual's genotype and the MAFs of the study participants,
an interested party can determine with high confidence if the individual participated
in the study (recall that the MAF is the frequency with which the least common
allele occurs at a particular location in the genome). Following the initial realization that aggregate data can be used to reveal information about study participants, subsequent work has led to even more powerful methods for determining if an individual participated in a study based on MAFs [98, 88, 64, 9]. These methods work by comparing an individual's genotype to the MAF in a study and to the MAF in the background population. If their genotype is more similar to the MAF in the
study, then it is likely that the individual was in the study. This raises a fundamental
question: how do researchers know when it is safe to release aggregate genomic data?
To help answer this question we introduce a new model-based measure, PrivMAF,
that provides provable privacy guarantees for MAF data obtained from genomic studies. Unlike many previous privacy measures, PrivMAF gives an individual privacy
measure for each study participant, not just an average measure. These individual
measures can then be combined to measure the worst case privacy loss in the study.
Our measure also allows us to quantify the privacy gains achieved by perturbing the
data, either by adding noise or binning.
3.1.1 Previous work
Several methods have been proposed to help determine when MAFs are safe to release.
The simplest method, one suggested for regression coefficients [75], is to just choose a certain number and release the MAFs for at most that many single nucleotide polymorphisms (SNPs, e.g. locations in the genome with multiple alleles). Sankararaman et al. [88] suggested calculating the sensitivity and specificity of the likelihood ratio test to help decide if the MAFs for a given dataset are safe to release. More recently, Craig et al. [13] advocated a similar approach, using the Positive Predictive Value
(PPV) rather than sensitivity and specificity. These measures provide a powerful set
of tools to help determine the amount of privacy lost after releasing a given dataset.
One limitation of these approaches, however, is that they ignore the fact that a given
piece of aggregate data might reveal different amounts of information about different
individual study participants, and instead look at an average measure of privacy over
all participants. For the unlucky few who lose a lot of privacy in a given study, a privacy guarantee for the average participant is not very comforting. The only sure way
to avoid potentially harmful repercussions is to produce provable privacy guarantees
for all participants when releasing sensitive research data.
Some researchers have recently suggested k-anonymity [92, 108, 74] or differential privacy [17, 95] based approaches, which allow release of a transformed version of
the aggregate data in such a way that privacy is preserved. The idea behind these
methods is that perturbing the data decreases the amount of private information
released. Though such approaches do give improved privacy guarantees, they limit the utility of the results, as the data has often been perturbed beyond the point of usefulness;
thus, there is a need to develop methods that perturb the data as little as possible in
order to maximize its utility.
Identifying individuals whose genomic information has been included in an aggregate result can have real-world repercussions. Consider, for example, studies of the genetics of drug abuse [35]. If the MAFs of the cases (e.g. people who had abused drugs) were released, then knowing someone contributed genetic material would be enough to tell that they had abused drugs. Along the same lines, there have been numerous genome-wide association studies (GWAS) related to susceptibility to numerous STDs, including HIV [77]. Since many patients would want to keep their HIV
status secret, these studies need to use care in deciding what kind of information they
give away. Such privacy concerns have led the NIH and the Wellcome Trust, among
others, to move genomic data from public databases to access-controlled repositories [84, 57, 107].
Such restrictions are clearly not optimal, since ready access to
biomedical databases has been shown to enable a wide range of secondary research
[101, 83].
Many types of biomedical research data may compromise individuals' privacy, not just MAFs [19, 76, 75, 59, 93, 18, 91]. For instance, even if we just limit ourselves to genomic data, there are several broad categories of privacy challenges that depend on the particular data available, e.g. determining from an individual's genotype and aggregated data whether they participated in a GWAS study [90], from an individual's genotype whether they are in a gene-expression database [63], or, alternately, determining an individual's identity from just genotype and public demographic information [59].
3.1.2 Our Contribution
We introduce a privacy statistic, our measure PrivMAF, which provides provable
privacy guarantees for all individuals in a given study when releasing MAFs for unperturbed or minimally perturbed (but still useful) data. The guarantee we give is
straightforward: given only the MAFs and some knowledge about the background
population, PrivMAF measures the probability of a particular individual being in the
study. This guarantee implies that, if d is any individual and PrivMAF(d, MAF) is
the score of our statistic, then, under reasonable assumptions, knowledge of the minor
allele frequencies implies that d participated in the study with probability at most
PrivMAF(d, MAF). Intuitively, this measure bounds how confident an adversary can
be in concluding that a given individual is in our study cohort based off the available
information.
Moreover, the PrivMAF framework can measure privacy gains achieved by perturbing MAF data. Even though it is preferable to release unperturbed MAFs, there may be situations in which releasing perturbed statistics is the only option that ensures the required level of privacy, such as when the number of SNPs whose data we want to release is very large. With this scenario in mind, PrivMAF can be
modified to measure the amount of privacy lost when releasing perturbed MAFs. In
particular, the statistic we obtain allows us to measure the privacy gained by adding
noise to (common in differential privacy) or binning (truncating) the MAFs. To our
knowledge, PrivMAF is the first method for measuring the amount of privacy gained
by binning MAFs. In addition, our method shows that much less noise is necessary
to achieve reasonable differential privacy guarantees, at the cost of adding realistic assumptions about what information potential adversaries have access to, thus
providing more useful data.
In addition to developing PrivMAF, we apply our statistic to genotype data from the Wellcome Trust Case Control Consortium's (WTCCC) British Birth Cohort. This allows us to demonstrate our method on both perturbed and unperturbed data. Moreover, we use PrivMAF to show that, as claimed above, different individuals in a study can experience very different levels of privacy loss after the release of MAFs.
3.2 Methods

3.2.1 The Underlying Model
Our method assumes a model implicitly described by Craig et al. [13], with respect to how data were generated and what knowledge is publicly available.

PrivMAF assumes a large background population. Like previous works, we assume this population is at Hardy-Weinberg (H-W) equilibrium. We choose a subset (B) of this larger population, consisting of all individuals who might reasonably be believed to have participated in the study. Finally, the smallest set, denoted D, consists of all individuals who actually participated in the study. As an example, consider performing a GWAS study at a hospital in Britain. The underlying population might be all people of British ancestry; B, the set of all patients at the hospital; and D, all study participants.

As a technical aside, it should be noted that, breaking with standard conventions, we allow repetitions in D and B. Moreover, we assume that the elements in D and B are ordered.
In our model B is chosen uniformly at random from the underlying population, and D is chosen uniformly at random from B. An individual's genotype, d = (d₁, ..., d_m), can be viewed as a vector in {0, 1, 2}^m, where m is the number of SNPs we are considering releasing. Let p_j be the minor allele frequency of SNP j in the underlying population. We assume that each of the SNPs is chosen independently. By definition of H-W equilibrium, for any d ∈ B, the probability that d_j = i for i ∈ {0, 1, 2} is

$$\binom{2}{i} p_j^i (1 - p_j)^{2-i}$$

Let

$$\mathrm{MAF}_j(D) = \frac{1}{2n} \sum_{d \in D} d_j$$

be the minor allele frequency of SNP j in D, the frequency with which the least common allele occurs at SNP j. Then MAF(D) = (MAF₁(D), ..., MAF_m(D)). We assume the parameters {p_j}, the size of B (denoted N), and the size of D (denoted n) are publicly known. We are trying to determine if
releasing MAF(D) publicly will lead to a breach of privacy.
Note that our model does assume the SNPs are independent, even though this is not always the case due to linkage disequilibrium (LD). This independence assumption is made in most previous approaches. We can, however, extend PrivMAF to take into account LD by using a Markov chain based model (see Section 3.5.7).
The original WTCCC paper [12] looked at the dependency between SNPs in their
dataset and found that there are limited dependencies between close-by SNPs. In
situations where LD is an issue one can often avoid such complications by picking one
representative SNP for each locus in the genome.
3.2.2 Measuring Privacy of MAF
Consider an individual d ∈ B. We want to determine how likely it is that d ∈ D based on publicly released information. We assume that it is publicly known that d ∈ B. This is a realistic assumption, since it corresponds to an attacker believing that d may have participated in the study. This inspires us to use

$$P(d \in \hat{D} \mid \mathrm{MAF}(\hat{D}) = \mathrm{MAF}(D),\; d \in \hat{B}) \tag{3.1}$$

as the measure of privacy for individual d, where D̂ and B̂ are drawn from the same distribution as D and B. Informally, D̂ and B̂ are random variables that represent our adversary's a priori knowledge about D and B.

More precisely, we calculate an upper bound on Equation 3.1, denoted by PrivMAF(d, MAF(D)). In practice we use the approximation:

$$\mathrm{PrivMAF}(d, \mathrm{MAF}(D)) \approx \frac{1}{1 + \frac{(N-n)\,P_n(x(D))}{n\,P_{n-1}(x(D) - d)}}$$

where x(D) = 2n·MAF(D) and

$$P_n(x) = \prod_{j=1}^{m} \binom{2n}{x_j}\, p_j^{x_j} (1 - p_j)^{2n - x_j}$$
It should be noted that, for reasonable parameters, this upper bound is almost tight. We can then let

$$\mathrm{PrivMAF}(D) = \max_{d \in D} \mathrm{PrivMAF}(d, \mathrm{MAF}(D))$$

Informally, for all d ∈ D, PrivMAF(D) bounds the probability that d participated in our study given only publicly-available data and MAF(D). All derivations are given in Section 3.5.
This measure allows a user to choose some privacy parameter, α, and release the data if and only if PrivMAF(D) ≤ α. It is worth noting, however, that deciding whether or not to release the data gives away a little bit of information about D, which can weaken our privacy guarantee. While in practice this seems to be a minor issue, we develop a method to correct for it in Section 3.5.6.
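To make the computation concrete, here is a minimal sketch of how the approximation above could be evaluated. This is our own illustration, not the released PrivMAF implementation; the helper names (log_Pn, privmaf, privmaf_D) are hypothetical, and it assumes numpy and scipy are available. Working in log space avoids underflow when m is large.

    import numpy as np
    from scipy.stats import binom

    def log_Pn(x, p, n):
        # log P_n(x) = sum_j log [ C(2n, x_j) p_j^{x_j} (1 - p_j)^{2n - x_j} ].
        # Under H-W equilibrium x_j is Binomial(2n, p_j), so reuse its log-pmf.
        return binom.logpmf(x, 2 * n, p).sum()

    def privmaf(d, x, p, n, N):
        # Approximate PrivMAF(d, MAF(D)) for one genotype vector d,
        # where x = 2n * MAF(D) is the vector of released minor allele counts.
        log_ratio = log_Pn(x, p, n) - log_Pn(x - d, p, n - 1)
        return 1.0 / (1.0 + (N - n) / n * np.exp(log_ratio))

    def privmaf_D(D, p, N):
        # PrivMAF(D): the maximum over study participants (rows of D).
        n = D.shape[0]
        x = D.sum(axis=0)
        return max(privmaf(d, x, p, n, N) for d in D)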
3.2.3 Measuring Privacy of Truncated Data
In order to deal with privacy concerns it is common to release perturbed versions of the
data. This task can be achieved by adding noise (as in differential privacy), binning
(truncating results), or using similar approaches. Here we show how PrivMAF can
be extended to perturbed data.
We first consider truncated data. Let MAF^{trunc(k)}(D) be obtained by taking the minor allele frequency of each SNP and truncating it to k decimal digits. For example, if k = 1 then .111 would become .1, and if k = 2 it would become .11. We are interested in

$$P(d \in \hat{D} \mid \mathrm{MAF}^{\mathrm{trunc}(k)}(\hat{D}) = \mathrm{MAF}^{\mathrm{trunc}(k)}(D),\; d \in \hat{B})$$

As above, we can calculate an upper bound, denoted by PrivMAF^{trunc(k)}(d, MAF^{trunc(k)}(D)). The approximation we use to calculate this is given in Section 3.5. We then have

$$\mathrm{PrivMAF}^{\mathrm{trunc}(k)}(D) = \max_{d \in D} \mathrm{PrivMAF}^{\mathrm{trunc}(k)}(d, \mathrm{MAF}^{\mathrm{trunc}(k)}(D))$$

For each d ∈ D, this measure upper bounds the probability that individual d participated in our study given only publicly-available data and knowledge of MAF^{trunc(k)}(D).
3.2.4 Measuring Privacy of Adding Noise
Another way to achieve privacy guarantees on released data is by perturbing the data
using random noise (this is a common way of achieving differential privacy). Though
there are many approaches to generate this noise, most famously by drawing it from
the Laplace distribution [17], we investigate one standard approach to adding noise
that is used to achieve differential privacy when releasing integer values [20].
Consider ε > 0. Let η be an integer-valued random variable such that P(η = i) is proportional to e^{−ε|i|}. Let

$$\mathrm{MAF}_j^{\epsilon}(D) = \mathrm{MAF}_j(D) + \frac{\eta_j}{2n}$$

where η₁, ..., η_m are independently and identically distributed (iid) copies of η. It is worth noting that MAF_j^ε(D) is 2ε-differentially private. Recall [17]:

Definition 1. Let n be an integer, Ω and Σ sets, and X a random function that maps n-element subsets of Ω (we call such subsets 'databases of size n') into Σ. We say that X is ε-differentially private if, for all databases D and D' of size n that differ in exactly one element and all S ⊆ Σ, we have that

$$P(X(D) \in S) \le \exp(\epsilon)\, P(X(D') \in S)$$

Using the same framework as above we can define PrivMAF^ε(d, MAF^ε(D)) and PrivMAF^ε(D) to measure the amount of privacy lost by releasing MAF^ε(D). As above, the approximation we use to calculate this is given in Section 3.5.
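As an aside on sampling this noise distribution (our own sketch, not taken from the thesis code; it assumes numpy, and the function names are hypothetical): a variable with P(η = i) ∝ e^{−ε|i|} can be generated as the difference of two iid geometric variables.

    import numpy as np

    def two_sided_geometric(eps, size, rng=None):
        # If G1, G2 are iid geometric on {0, 1, ...} with P(k) = (1 - q) q^k,
        # q = exp(-eps), then G1 - G2 satisfies P(i) proportional to q^|i|.
        rng = np.random.default_rng() if rng is None else rng
        q = np.exp(-eps)
        # numpy's geometric is supported on {1, 2, ...}, so shift down by one
        g1 = rng.geometric(1 - q, size) - 1
        g2 = rng.geometric(1 - q, size) - 1
        return g1 - g2

    def noisy_maf(maf, n, eps, rng=None):
        # Release MAF_j + eta_j / (2n) for each of the m SNPs.
        eta = two_sided_geometric(eps, len(maf), rng)
        return maf + eta / (2.0 * n)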
Figure 3-1: PrivMAF applied to the WTCCC dataset. In all plots we take n=1000
research subjects and a background population of size N=100,000. (a) Our privacy
measure PrivMAF increases with the number of SNPs. The blue line corresponds
to releasing MAFs with no rounding, the green line to releasing MAFs rounded to
one decimal digit, and the red line to releasing MAFs rounded to two decimal digits.
Rounding to two digits appears to add very little to privacy, whereas rounding to one
digit achieves much greater privacy gains. (b) The blue line corresponds to releasing
MAF with no noise, the red line to releasing MAF^{.5}, and the green line to releasing MAF^{.1}. Adding noise corresponding to ε = .5 seems to add very little to privacy, whereas taking ε = .1 achieves much greater privacy gains.
3.2.5 Choosing the Size of the Background Population
One detail we did not go into above is the choice of N, where N is the number of people who could reasonably be assumed to have participated in the study. This parameter depends on the context, and giving a realistic estimate of it is critical. In
parameter depends on the context, and giving a realistic estimate of it is critical. In
most applications the background population from which the study is drawn is fairly
obvious. That being said, one needs to be careful of any other information released
publicly about participants-just listing a few facts about the participants can greatly
reduce N, thus greatly reducing the bounds on privacy guarantees (since the amount
of privacy lost by an individual is roughly inversely proportional to N - n).
Note that N can be considered as one of the main privacy parameters of our
method. The smaller the N, the stronger the adversary we are protected against.
Therefore we want to make N as large as possible, while at the same time ensuring
the privacy we need. In our method, an adversary who has limited his pool of possible
contenders to fewer than N individuals before we publish the MAF can be considered to have already achieved a privacy breach; thus it is a practitioner's job to choose N small enough that such a breach is unlikely.
3.2.6 Release Mechanism
Often one might like to use PrivMAF to decide if it is safe to release a set of MAFs from a study. This can be done by choosing α between 0 and 1 and releasing the MAFs if and only if PrivMAF(D) ≤ α. The action of deciding to release D or not release D, however, gives away a little information. In practice this is unlikely to be an issue, but in theory it can lead to privacy breaches. This issue can be dealt with by releasing the MAFs if and only if PrivMAF(D) ≤ β(α), where β = β(α) is chosen so that:

$$\alpha \ge \frac{1}{1 + \left(\frac{1}{\beta} - 1\right) P_\beta - \max_{d \in \{0,1,2\}^m} P(d \in \hat{B} - \hat{D})}$$

where

$$P_\beta = P\left(\max_{d \in \hat{D}} \mathrm{PrivMAF}(d, \mathrm{MAF}(\hat{D})) \le \beta \;\middle|\; x(\hat{D}) = x\right)$$

We call this release mechanism the Allele Leakage Guarantee Test (ALGT). Unlike the naive release mechanism, ALGT gives us the following privacy guarantee:
Theorem 3. Choose β as above. Then, if PrivMAF(D) ≤ β, for any choice of d ∈ D we get that

$$P\left(d \in \hat{D} \;\middle|\; d \in \hat{B},\; x(\hat{D}) = x,\; \max_{d' \in \hat{D}} \mathrm{PrivMAF}(d', \mathrm{MAF}(\hat{D})) \le \beta\right)$$

is less than or equal to α.
Note that the choice of α determines the level of privacy achieved. Picking this level is left to the practitioner; perhaps an approach similar to that taken by Hsu et al. [62] is appropriate.

A more detailed proof of the privacy result above can be found in Section 3.5.
3.2.7 Simulated Data
In what follows, all simulated genotype data was created by choosing a study size, denoted n, and a number of SNPs, denoted m. For each SNP a random number, p, in the range .05 to .5 was chosen uniformly at random to be the MAF in the background population. Using these MAFs we then generated the genotypes of n individuals independently. Note that all computations were run on a machine with 48GB RAM and a 3.47GHz XEON X5690 CPU (liquid cooled and overclocked to 4.4GHz), using a single core.
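A minimal sketch of this simulation procedure (our own illustration, assuming numpy; the function name is hypothetical):

    import numpy as np

    def simulate_genotypes(n, m, rng=None):
        # Draw each background MAF p_j uniformly from [.05, .5], then generate
        # n genotypes independently; under H-W equilibrium each entry is
        # Binomial(2, p_j), the count of minor alleles in {0, 1, 2}.
        rng = np.random.default_rng() if rng is None else rng
        p = rng.uniform(0.05, 0.5, size=m)
        D = rng.binomial(2, p, size=(n, m))
        return D, p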
3.3 Results

3.3.1 Privacy and MAF
As a case study we tested PrivMAF on data from the Wellcome Trust Case Control Consortium (WTCCC)'s 1958 British Birth Cohort [12]. This dataset consists of genotype data from 1500 British citizens born in 1958.

We first looked at the privacy guarantees given by PrivMAF for the WTCCC data for varying numbers of SNPs (blue curve, Fig. 3-1a), quantifying the relationship between number of SNPs released and privacy lost. The data were divided into two sets: one of size 1,000 used as the study participants, the other of size 500 which was used to estimate our model parameters (the p_j's). We assumed that participants were drawn from a background population of 100,000 individuals (N = 100,000; see Methods for more details). Releasing the MAFs of a small number of SNPs results in very little loss of privacy. If we release 1,000 SNPs, however, we find that there exists a participant in our study who loses most of their privacy: based on only the MAF and public information we can conclude they participated in the study with 90% confidence.
In addition, we considered the behavior of PrivMAF as the size of the population from which our sample was drawn increases. From the formula for our statistic we see that PrivMAF approaches 0 as the background population size, N, increases, since there are more possibilities for who could be in the study, while it goes to 1 as N decreases towards n.

Figure 3-2: Truncating simulated data to demonstrate scaling. We plot our privacy measure PrivMAF versus the number of SNPs for simulated data with n=10,000 subjects and a background population of size N=1,000,000. The green line corresponds to releasing MAFs with no rounding, the blue line to releasing MAFs rounded to three decimal digits, and the red line to releasing MAFs rounded to two decimal digits. Rounding to three digits seems to add very little to privacy, whereas rounding to two digits achieves much greater privacy gains.
3.3.2 Privacy and Truncation
Next we tested PrivMAF on perturbed WTCCC MAF data, showing that both adding noise and binning result in large increases in privacy. First we considered perturbing our data by binning. We bin by truncating the unperturbed MAFs, first to one decimal digit (MAF^{trunc(1)}, k = 1) and then to two decimal digits (MAF^{trunc(2)}, k = 2). As depicted in Fig. 3-1a, we see that truncating to two digits gives us very little in terms of privacy guarantees, while truncating to one digit gives substantial gains.

In practice, releasing the MAF truncated to one digit may render the data useless for most purposes. It seems reasonable to conjecture, however, that as the size of GWAS continues to increase, similar gains can be made with less sacrifice. As a demonstration of how population size affects the privacy gained by truncation, we generated simulated data for 10,000 study participants and 10,000 SNPs, choosing N to be one million. We then ran a similar experiment to the one performed on truncated WTCCC data, except with k = 2 and k = 3; we found the k = 2 case had similar privacy guarantees to those seen in the k = 1 case on the real data (Fig. 3-2). For example, we see that if we consider releasing all 10,000 SNPs then PrivMAF is near 0.35, while when k = 2 it is below 0.2 (almost a factor of two difference).
3.3.3 Privacy and Adding Noise
We also applied our method to data perturbed by adding noise to each SNP's MAF (Fig. 3-1b). We used ε = 0.1 and 0.5 as our noise perturbation parameters (see Methods). We see that when ε = 0.5, adding noise to our data resulted in very small privacy gains. When we change our privacy parameter to ε = 0.1, however, we see that the privacy gains are significant. For example, if we were to release 500 unperturbed SNPs then PrivMAF(D) would be over 0.4, while PrivMAF^{.1}(D) is still under 0.2.

The noise mechanism we use here gives us 2mε-differential privacy (see Methods), where m is the number of SNPs released. For ε = .1, if m = 200 then the result is 40-differentially private, which is a nearly useless privacy guarantee in most cases. Our measure, however, shows that the privacy gains are quite large in practice. This suggests that PrivMAF allows one to use less noise to get reasonable levels of privacy, at the cost of having to make some reasonable assumptions about what information is publicly available.
Figure 3-3: Worst Case Versus Average Case PrivMAF. Graph of the number of SNPs, denoted m, versus PrivMAF. The blue curve is the maximum value of PrivMAF(d, MAF(D)) taken over all d ∈ D for a set of n = 1,000 randomly chosen participants in the British Birth Cohort, while the green curve is the average value of PrivMAF(d, MAF(D)) in the same set. The maximum value of PrivMAF far exceeds the average. By the time m = 1000 it is almost five times larger.
3.3.4 Worst Case Versus Average
As stated earlier, the motivation for PrivMAF is that previous methods do not measure privacy for each individual in a study but instead provide a more aggregate measure of privacy loss. This observation led us to wonder exactly how much the privacy risk differs between individuals in a given study. To test this question, we compared the maximum and mean score of PrivMAF(d, MAF(D)) in the WTCCC example for varying values of m, the number of released SNPs. The result is pictured in Figure 3-3. The difference is stark: the person with the largest loss of privacy (pictured in blue) loses much more privacy than the average participant (pictured in green). By the time m = 1000 the participant with the largest privacy loss is almost five times as likely to be in the study as the average participant. This result clearly illustrates why worst case, and not just average, privacy should be considered.
3.3.5 Comparing β to α
To justify our release test ALGT, we compared the naive threshold, α, to the corrected threshold, β (Figure 3-4). For larger values of α we see that the two thresholds are fairly close. As α decreases, however, the two quantities start to diverge, with the corrected threshold decreasing much faster than the naive one. Moreover, we see that when α is roughly 0.04, β suddenly drops to around 0 and remains at that level for all smaller α; this behavior is due to the negligible probability that a study population would have a PrivMAF less than .04 given this choice of parameters. This suggests that, in most cases, using α instead of β will not reduce privacy by too much.

3.3.6 Reidentification Using PrivMAF
Thus far we have presented PrivMAF as a means of helping ensure participant privacy. As it turns out, PrivMAF can also be used in exactly the opposite way, as a means of compromising subjects' privacy. To do this, choose some threshold γ. For a given genotype d we predict d ∈ D if and only if PrivMAF(d, MAF(D)) > γ. Used in this way, our approach performs comparably to previous approaches; we plot the ROC
Figure 3-4: ALGT applied to the WTCCC dataset. A graph of the uncorrected threshold, α, versus the corrected threshold, β = β(α), from ALGT is given in blue. The green line corresponds to an uncorrected threshold. We see that for some choices of α, correction may be desired. For example, for α = .05 the corrected threshold is approximately β = .03. Here we again use the British Birth Cohort with n=1000 study participants, m=1000 SNPs, and a background population of size N=100,000.
Figure 3-5: ROC Curves of PrivMAF and Likelihood Ratio. ROC curves obtained using PrivMAF (green triangles) and the likelihood ratio method (red circles) to reidentify individuals in the WTCCC British Birth Cohort with n=1,000 study participants and 1,000 SNPs.
curve of the likelihood ratio test [86] as well as the ROC curve obtained by using our test statistic (see Figure 3-5). We see that both methods perform similarly. Since it is known that the likelihood ratio test gives the highest power for a given false positive rate of any test, this curve suggests that our privacy measure is doing a good job in terms of measuring how much privacy is lost in a given dataset by releasing the minor allele frequencies.

Note that we can also use this as a reidentification attack on perturbed data, using the perturbed PrivMAF. The results of this analysis are shown in Figure 3-6 for truncated data and Figure 3-7 for data with noise added.
Figure 3-6: ROC Curves of PrivMAF with Truncation. ROC curves obtained using
PrivMAF for reidentification of unperturbed data (in red, AUC=.686), data truncated
after two decimal digits (aka k = 2, in blue, AUC=.682), and data truncated after one
decimal digit (aka k = 1, in green, AUC=.605). We see that truncation can greatly
decrease the effectiveness of reidentification. Note that the ROC of the unperturbed
data here is different from that in the previous figure. This is because we used a
different random division of our data in each case.
Figure 3-7: ROC Curves of PrivMAF with noisy data. ROC curves obtained using PrivMAF for reidentification of unperturbed data (in red, AUC=.696), with noise corresponding to ε = .5 (in green, AUC=.693), and with ε = .1 (in blue, AUC=.656). We see that adding noise can decrease the effectiveness of reidentification. Note that the ROC of the unperturbed data here is different from that in the previous figures. This is because we used a different random division of our data in each case.
3.4 Conclusion
On the one hand, to facilitate genomic research, many scientists would prefer to release even more data from studies [101, 86]. Though tempting, this approach can sacrifice study participants' privacy. As highlighted in the introduction, several different classes of methods have been previously employed to balance privacy with the utility of data. Sensitivity/PPV based methods are dataset specific, but only give average-case privacy guarantees. Because our method provides worst-case privacy guarantees for all individuals, we are able to ensure improved anonymity for individuals. Thus, PrivMAF can provide stronger privacy guarantees than sensitivity/PPV based methods. Moreover, since our method for deciding which SNPs to release takes into account the genotypes of individuals in our study, it allows us to release more data than any method based solely on MAFs with comparable privacy guarantees.
Our findings demonstrate that differential privacy may not always be the method
of choice for preserving privacy of genomic data. Notably, perturbing the data appears
to provide major gains in privacy, though these gains come at the cost of utility. That
said, our results suggest that, when n is large, truncating minor allele frequencies
may result in privacy guarantees without the loss of too much utility. Moreover, the
method of binning we used here is very simple- it might be worth considering how
other methods of binning may be able to achieve similar privacy guarantees while
resulting in less perturbation on average. We further show that adding noise can
result in improved privacy, even if the amount of noise we add does not provide
reasonable levels of differential privacy.
Note that our method is based off a certain model of how the data is generated,
a model that is similar to those used in previous approaches.
It will not protect
against an adversary that has access to insider information. This caveat, however,
seems to be unavoidable if we do not want to turn to differential privacy or similar
approaches that perturb the data to a greater extent to get privacy guarantees, thus
greatly limiting data utility.
Having presented results on moderate-sized real datasets, we tested the ability of PrivMAF to scale as genomic data sets grow. In particular, we ran our algorithm on larger artificial datasets (with 10,000 individuals and 1000 SNPs) and found that our PrivMAF implementation still runs in a short amount of time (19.14 seconds on the artificial dataset of size 10,000 described above, with a running time of O(mn), where n is the study size and m is the number of SNPs).
Though our work focuses on the technical aspects related to preserving privacy,
a related and equally important aspect comes from the policy side. Methods similar
to those presented here offer the biomedical community the tools it needs to ensure
privacy; however, the community must determine appropriate privacy protections
(ranging from the release of all MAF data to use of controlled access repositories)
and in what contexts (i.e., do studies of certain populations, such as children, require
extra protection?).
It is our hope that our work helps inform this debate. Our tool could, for example, be used in combination with controlled access repositories to release the MAFs of a limited number of SNPs, depending on what privacy protections are deemed reasonable.

Our work addresses the critical need to provide privacy guarantees to study participants and patients by introducing a quantitative measurement of privacy lost by release of aggregate data, and thus may encourage release of genomic data.
A Python implementation of our method is available at http://groups.csail.mit.edu/cb/PrivMAF/.
3.5 Derivation of Methods

3.5.1 Basic Model
Before describing the model underlying our results we want to motivate it. Often the size of an underlying population is so large compared to that of the study population that we might as well consider the underlying population to be infinite (consider, for example, the population of all people of English ancestry versus the participants in the British Birth Cohort). In practice, however, it might be that we know the study participants are drawn from some smaller subpopulation (for example, the British Birth Cohort is drawn from the population of all children born in Britain during a certain week in 1958). This subpopulation is small enough that we cannot consider it infinite. Therefore we can think of our study population as being generated by first generating this smaller subpopulation out of an infinite background population, then choosing the study participants out of this smaller population. This is the point of view our model takes, and it is formally described below.

It is worth noting that, breaking with standard notation, we assume all sets are ordered and can have repetitions.
Assume that the genotypes of study participants are drawn from some theoretical infinite population. We have m SNPs, which we label 1, ..., m, each of which is independent of the others. Let p_i be the minor allele frequency of the ith SNP in our infinite population, and assume our population is in Hardy-Weinberg (H-W) Equilibrium. We first produce a small background population B = {b₁, ..., b_N}, where each b_j ∈ {0, 1, 2}^m (B is the finite set of people who in reality might have participated in the study), and where each member of the population is generated independently of the others. Our study population, denoted D = {d₁, ..., d_n}, is a population of size n produced from the background population by choosing n members of B uniformly at random with no repetitions (note that, since B can have repetitions in it, it is possible to have d_i = d_j even if i ≠ j; this is because it is possible to have k ≠ l with b_k = b_l). It is worth noting that the marginal probability distribution on D is exactly the same as the probability distribution we would get by generating D directly from the infinite population.
3.5.2 PrivMAF
Let MAF_i(D) be the minor allele frequency of the ith SNP in our population D. We want to release MAF(D) = (MAF₁(D), ..., MAF_m(D)), where x_i = Σ_{d∈D} d_i is the number of times the minor allele occurs at SNP i in our study population, so that MAF_i(D) = x_i/(2n). To simplify notation let x(D) = 2n·MAF(D). We want some kind of measure of how much privacy is lost by each study participant after releasing MAF(D). We achieve this goal by measuring the probability that an individual participated in the study given the data released. For a given individual d we want to calculate how likely it is under our model that d is in D given x(D). Note that (in practice) we know that d ∈ B (that is to say, if an adversary is trying to figure out if d ∈ D they already know d ∈ B), so what we want to calculate is the probability that d is in D conditional on d being in B and on x equaling x(D). More formally, we want to consider:

$$P(d \in \hat{D} \mid d \in \hat{B},\; x(\hat{D}) = x)$$

where D̂ and B̂ have the same distribution as D and B. We would like to devise a formula to calculate an upper bound on this probability. First we need to build a few tools.
Let B̂ − D̂ be the set of all people in B̂ who are not in D̂. Note that B̂ − D̂ and D̂ are independent random variables, so

$$P(d \in \hat{B}, d \notin \hat{D} \mid x(\hat{D}) = x) = P(d \in \hat{B} - \hat{D} \mid x(\hat{D}) = x)\, P(d \notin \hat{D} \mid x(\hat{D}) = x) = P(d \in \hat{B} - \hat{D})\,(1 - P(d \in \hat{D} \mid x(\hat{D}) = x))$$

We also see, since d ∈ D̂ implies d ∈ B̂, that

$$P(d \in \hat{D}, d \in \hat{B} \mid x(\hat{D}) = x) = P(d \in \hat{D} \mid x(\hat{D}) = x)$$

Using Bayes' rule and some algebra we see that

$$P(d \in \hat{D} \mid d \in \hat{B}, x(\hat{D}) = x) = \frac{P(d \in \hat{D} \mid x(\hat{D}) = x)}{P(d \in \hat{D} \mid x(\hat{D}) = x) + P(d \in \hat{B} - \hat{D})(1 - P(d \in \hat{D} \mid x(\hat{D}) = x))} = \frac{1}{1 + P(d \in \hat{B} - \hat{D})\left(\frac{1}{P(d \in \hat{D} \mid x(\hat{D}) = x)} - 1\right)} \tag{3.2}$$

The next step is to consider P(d ∈ D̂ | x(D̂) = x). This equals

$$P(d \in \hat{D} \mid x(\hat{D}) = x) = 1 - \prod_{i=1}^{n} P(d \ne \hat{d}_i \mid x(\hat{D}) = x) = 1 - (1 - P(d = \hat{d}_1 \mid x(\hat{D}) = x))^n$$
Then note that

$$P(d = \hat{d}_1 \mid x(\hat{D}) = x) = \frac{P(d = \hat{d}_1,\; x(\hat{D}) = x)}{P(x(\hat{D}) = x)} = \frac{P(d = \hat{d}_1)\, P(x(\hat{d}_2, \ldots, \hat{d}_n) = x - d)}{P(x(\hat{D}) = x)} \tag{3.3}$$

Let P_n(x) = P(x(D̂) = x); then equation 3.3 equals

$$P(d = \hat{d}_1)\, \frac{P_{n-1}(x - d)}{P_n(x)}$$
Substituting this into equation 3.2 we get that

$$P(d \in \hat{D} \mid d \in \hat{B}, x(\hat{D}) = x) = \frac{1}{1 + P(d \in \hat{B} - \hat{D})\left(\frac{1}{1 - \left(1 - P(d = \hat{d}_1)\frac{P_{n-1}(x-d)}{P_n(x)}\right)^n} - 1\right)}$$

Using the fact that (1 − z)ⁿ ≥ 1 − nz when 0 ≤ z ≤ 1 (this follows from the inclusion-exclusion principle) we get that this is

$$\le \frac{1}{1 - P(d \in \hat{B} - \hat{D}) + \frac{P(d \in \hat{B} - \hat{D})\, P_n(x)}{n\, P(d = \hat{d}_1)\, P_{n-1}(x - d)}} = \mathrm{PrivMAF}(d, \mathrm{MAF}(D))$$
It is worth mentioning that the above upper bound is likely to be fairly tight, since z = P(d = d̂₁ | x(D̂) = x) is in practice likely to be very small (especially when the data set is anywhere near being safe to release, since P(d = d̂₁ | x(D̂) = x) ≤ P(d = d̂₁ | d ∈ B̂, x(D̂) = x)).
This quantity, PrivMAF(d, MAF(D)), is our measure of privacy.

Note that for realistic choices of n, N, p and m we get that P(d ∈ B̂ − D̂) is approximately equal to (N − n)P(d = d̂₁), and that P(d ∈ B̂ − D̂) << 1, so 1 − P(d ∈ B̂ − D̂) ≈ 1. Plugging this in we get the measure

$$\mathrm{PrivMAF}(d, \mathrm{MAF}(D)) \approx \frac{1}{1 + \frac{(N-n)\,P_n(x)}{n\,P_{n-1}(x-d)}}$$

which is what we use in practice. Moreover, we see that

$$P_n(x) = \prod_{i=1}^{m} \binom{2n}{x_i}\, p_i^{x_i} (1-p_i)^{2n-x_i}$$

This allows us to calculate PrivMAF(d, MAF(D)) easily.
PrivMAF allows us to determine how much privacy is lost by a particular individual. What we want is the total privacy loss from releasing a study. It makes sense to look at the maximum loss to any individual in our study, which is to say max_{d∈D} PrivMAF(d, MAF(D)). We call this quantity PrivMAF(D). If PrivMAF(D) is bounded above by α then, for any participant d ∈ D, an adversary can be at most α confident that d actually participated in the study, which is the privacy guarantee we want.
Naively it seems like calculating PrivMAF for m SNPs and n individuals has complexity O(mn²). By using the cancellation in P_n(x)/P_{n−1}(x − d) it only ends up taking O(mn) time, which is asymptotically optimal.
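To spell the cancellation out (our own expansion of the step above, using the definitions of P_n and P_{n−1}): for each SNP i the ratio reduces to

$$\frac{\binom{2n}{x_i}\, p_i^{x_i} (1-p_i)^{2n - x_i}}{\binom{2n-2}{x_i - d_i}\, p_i^{x_i - d_i} (1-p_i)^{2n - 2 - x_i + d_i}} = \frac{\binom{2n}{x_i}}{\binom{2n-2}{x_i - d_i}}\, p_i^{d_i} (1-p_i)^{2 - d_i}$$

so each SNP contributes a factor that can be evaluated in constant time, and computing the measure for all n study participants takes O(mn) arithmetic operations in total.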
3.5.3 PrivMAF for Data with Noise Added
Note that the above framework can be generalized to measure the privacy loss present in releasing a noisy version of MAF(D). In particular, let η be some random variable. Then we can let MAF_j^η(D) = MAF_j(D) + η_j/(2n), where η₁, ..., η_m are iid random variables distributed as η. Then we want to measure how well MAF^η(D) preserves
privacy. As above, we are interested in P(d ∈ D̂ | MAF^η(D̂) = MAF^η(D), d ∈ B̂). The same derivation used in the previous section implies that this probability is upper bounded by:

$$\mathrm{PrivMAF}^{\eta}(d, \mathrm{MAF}^{\eta}(D)) = \frac{1}{1 - P(d \in \hat{B} - \hat{D}) + \frac{P(d \in \hat{B} - \hat{D})\, P_n^{\eta}(\mathrm{MAF}^{\eta}(D))}{n\, P(d = \hat{d}_1)\, P_{n-1}^{\eta}(\mathrm{MAF}^{\eta}(D) - d)}}$$

where

$$P_n^{\eta}(v) = \prod_{j=1}^{m} \sum_{i=0}^{2n} \binom{2n}{i}\, p_j^{i} (1-p_j)^{2n-i}\, P(\eta = 2nv_j - i)$$
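As an illustrative sketch (our own, assuming numpy/scipy and the two-sided geometric noise P(η = i) ∝ e^{−ε|i|} introduced just below; the helper name is hypothetical), P_n^η(v) can be computed directly from this formula:

    import numpy as np
    from scipy.stats import binom

    def log_Pn_eta(v, p, n, eps):
        # log P_n^eta(v): for each SNP j, convolve the Binomial(2n, p_j) law
        # of the allele count i with the noise term P(eta = 2n*v_j - i), where
        # P(eta = k) = (1 - q)/(1 + q) * q^|k| and q = exp(-eps).
        q = np.exp(-eps)
        i = np.arange(0, 2 * n + 1)
        total = 0.0
        for j in range(len(v)):
            k = np.rint(2 * n * v[j] - i).astype(int)  # 2n*v_j is an integer
            noise = (1 - q) / (1 + q) * q ** np.abs(k)
            total += np.log((binom.pmf(i, 2 * n, p[j]) * noise).sum())
        return total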
Note that the same approximations used in the previous section apply here. In this paper we will let P(η = i) be chosen proportional to e^{−ε|i|}, where i is an integer and ε is a user-chosen privacy parameter (relating to ε-differential privacy guarantees). We can then let MAF^ε = MAF^η and PrivMAF^ε = PrivMAF^η.

3.5.4 PrivMAF for Data with Truncation
Similarly we can consider the gain in privacy we get by rounding our MAF. More specifically, consider k ≥ 1; then, writing MAF_j(D) = x_j/(2n), we let MAF^{trunc(k)}(D) be the result of truncating each entry in MAF(D) after k decimal digits. More formally,

$$\mathrm{MAF}^{\mathrm{trunc}(k)}(D) = \frac{\lfloor \mathrm{MAF}(D) \cdot 10^k \rfloor}{10^k}$$

In the below we let v = MAF^{trunc(k)}(D) to make the equations more readable. In order to measure privacy we want to calculate P(d ∈ D̂ | MAF^{trunc(k)}(D̂) = v, d ∈ B̂). As above, we can upper bound this by
$$\mathrm{PrivMAF}^{\mathrm{trunc}(k)}(d, \mathrm{MAF}^{\mathrm{trunc}(k)}(D)) = \frac{1}{1 - P(d \in \hat{B} - \hat{D}) + \frac{P(d \in \hat{B} - \hat{D})\, P(\mathrm{MAF}^{\mathrm{trunc}(k)}(\hat{D}) = v)}{n\, P(d = \hat{d}_1)\, P(\mathrm{MAF}^{\mathrm{trunc}(k)}(\hat{D}) = v \mid \hat{d}_1 = d)}}$$
Note

$$P(\mathrm{MAF}^{\mathrm{trunc}(k)}(\hat{D}) = v) = \prod_{j=1}^{m} P(\mathrm{MAF}_j^{\mathrm{trunc}(k)}(\hat{D}) = v_j)$$

and

$$P(\mathrm{MAF}^{\mathrm{trunc}(k)}(\hat{D}) = v \mid \hat{d}_1 = d) = \prod_{j=1}^{m} P(\mathrm{MAF}_j^{\mathrm{trunc}(k)}(\hat{D}) = v_j \mid \hat{d}_1 = d)$$

If S_k(v_j) = {i : i/(2n) truncates to v_j}, then

$$P(\mathrm{MAF}_j^{\mathrm{trunc}(k)}(\hat{D}) = v_j) = \sum_{i \in S_k(v_j)} \binom{2n}{i}\, p_j^{i} (1-p_j)^{2n-i}$$

and

$$P(\mathrm{MAF}_j^{\mathrm{trunc}(k)}(\hat{D}) = v_j \mid \hat{d}_1 = d) = \sum_{i \in S_k(v_j)} \binom{2n-2}{i - d_j}\, p_j^{i - d_j} (1-p_j)^{2n - i + d_j - 2}$$

This allows us to calculate PrivMAF^{trunc(k)}(d, MAF^{trunc(k)}(D)), just as we wanted.
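A small sketch of the per-SNP bin probability above (our own illustration, assuming numpy/scipy; the helper name is hypothetical, and exact decimal arithmetic would be preferable at bin boundaries):

    import numpy as np
    from scipy.stats import binom

    def trunc_bin_prob(v_j, p_j, n, k):
        # P(MAF_j^{trunc(k)}(D) = v_j): sum the Binomial(2n, p_j) pmf over the
        # set S_k(v_j) of counts i whose frequency i/(2n) truncates to v_j.
        i = np.arange(0, 2 * n + 1)
        in_bin = np.floor(i / (2.0 * n) * 10**k).astype(int) == int(round(v_j * 10**k))
        return binom.pmf(i[in_bin], 2 * n, p_j).sum()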
3.5.5 Comparison to previous approaches
Our approach is the first to give privacy guarantees for all individuals in a study. The method endorsed by Sankararaman et al. [86] provides guarantees of a sort: since the log likelihood test gives the best power for a given false positive ratio, it ensures that the power of any test cannot be too big. The problem with their guarantee is that it is an aggregate guarantee, and does not ensure the safety of all participants. Our approach, on the other hand, does ensure privacy for all involved. Our method also takes into account the size of the pool from which our study is drawn, something the likelihood approach does not take into account but which is important in measuring privacy. The PPV approach suggested by Craig et al. [76] does take the background population size into account, but again does not come with any privacy guarantees that hold for all participants. We also believe that our method gives a more intuitive measure of privacy than previous ones (though of course this is subjective). One might argue that the worst case and average case privacy loss are not that different, but our experiments do not seem to support this claim.
Zhou et al. [7] have also presented work with strong privacy guarantees; however, they examined the frequency and likelihood of pairs of alleles rather than MAFs. Moreover, they give guarantees of a combinatorial nature (using k-anonymity), whereas ours are probabilistic in nature.
3.5.6 A Release Mechanism: Allele Leakage Guarantee Test

Motivation
As mentioned in the paper, we want to use the above measure to decide if it is safe to release MAF(D) (note that we could do something similar for perturbed MAFs, but do not do so here). Assume we want to bound the probability of an adversary figuring out if someone took part in the study to be at most α, where 0 < α < 1. A first guess at how to do this might be to look at PrivMAF(D), and release if and only if it is at most α.
Though in practice this approach seems to work well, in theory we can run into trouble. The problem with this approach is that the decision of whether or not to release gives away a little information about D and thus destroys our probability guarantees. (To understand why, note that deciding to release if and only if PrivMAF(D) is less than α means that any data being released gives away two pieces of information, namely the value of MAF(D) and the fact that PrivMAF(D) is less than α. If PrivMAF(D) is greater than α with non-negligible probability (say 50 percent probability or so), this extra bit of information can actually be very informative.)
An obvious fix is to release if and only if max_{d∈{0,1,2}^m} PrivMAF(d, MAF(D)) ≤ α. In this case the decision of whether or not to release gives no more information than outputting MAF(D) by itself. It turns out that this quantity is easy to calculate, and gives us the security guarantee we want. Unfortunately it is also overkill: the worst-case behavior is often much worse than the average case, so this policy is likely to tell us a data set is not safe to release even when it is.

This leads us to propose another solution. For any choice of β we can define

$$P_\beta = P\left(\max_{d \in \hat{D}} \mathrm{PrivMAF}(d, \mathrm{MAF}(\hat{D})) \le \beta \;\middle|\; x(\hat{D}) = x\right)$$

where the probability is taken over the choice of D̂. Choose β so that

$$\alpha \ge \frac{1}{1 + \left(\frac{1}{\beta} - 1\right) P_\beta - \max_{d \in \{0,1,2\}^m} P(d \in \hat{B} - \hat{D})}$$

and release the data if and only if PrivMAF(D) is less than or equal to β. This release test is what is referred to as the Allele Leakage Guarantee Test (ALGT) in the paper. We can show that ALGT gives us the privacy we require without too much overkill. On the other hand, it is much slower than the above methods, since calculating P_β is slow (as described below).
Derivation

ALGT tells us that, given both MAF(D) and the knowledge leaked by the decision to release, from the adversary's view the probability that d ∈ D̂ is at most α for any choice of d ∈ D. More formally:
Theorem 4. Choose β as above. Then, if PrivMAF(D) ≤ β, for any choice of d ∈ D we get that

$$P\left(d \in \hat{D} \;\middle|\; d \in \hat{B},\; x(\hat{D}) = x,\; \max_{d' \in \hat{D}} \mathrm{PrivMAF}(d', \mathrm{MAF}(\hat{D})) \le \beta\right) \le \alpha$$
Proof. The proof basically comes down to repeated applications of the definition of conditional probability, independence, and Bayes' rule. Let R be the event that max_{d'∈D̂} PrivMAF(d', MAF(D̂)) ≤ β. Then, arguing exactly as in Section 3.5.2 (note that B̂ − D̂ is independent of D̂, and hence of R),

$$P(d \in \hat{D} \mid d \in \hat{B}, x(\hat{D}) = x, R) = \frac{P(d \in \hat{D} \mid x(\hat{D}) = x, R)}{P(d \in \hat{D} \mid x(\hat{D}) = x, R) + P(d \in \hat{B} - \hat{D})(1 - P(d \in \hat{D} \mid x(\hat{D}) = x, R))} = \frac{1}{1 + P(d \in \hat{B} - \hat{D})\left(\frac{1}{P(d \in \hat{D} \mid x(\hat{D}) = x, R)} - 1\right)} \tag{3.4}$$
To simplify this, note that

$$P(d \in \hat{D} \mid x(\hat{D}) = x, R) = \frac{P(R,\; d \in \hat{D} \mid x(\hat{D}) = x)}{P_\beta} \le \frac{P(d \in \hat{D} \mid x(\hat{D}) = x)}{P_\beta}$$

so that

$$P(d \in \hat{B} - \hat{D})\left(\frac{1}{P(d \in \hat{D} \mid x(\hat{D}) = x, R)} - 1\right) \ge \frac{P(d \in \hat{B} - \hat{D})\, P_\beta}{P(d \in \hat{D} \mid x(\hat{D}) = x)} - P(d \in \hat{B} - \hat{D})$$

To simplify this we look at P(d ∈ B̂ − D̂)/P(d ∈ D̂ | x(D̂) = x). By the derivation in Section 3.5.2,

$$P(d \in \hat{D} \mid d \in \hat{B}, x(\hat{D}) = x) = \frac{P(d \in \hat{D} \mid x(\hat{D}) = x)}{P(d \in \hat{D} \mid x(\hat{D}) = x) + P(d \in \hat{B} - \hat{D})(1 - P(d \in \hat{D} \mid x(\hat{D}) = x))}$$

and solving for the ratio gives

$$\frac{P(d \in \hat{B} - \hat{D})}{P(d \in \hat{D} \mid x(\hat{D}) = x)} = \frac{\frac{1}{P(d \in \hat{D} \mid d \in \hat{B}, x(\hat{D}) = x)} - 1}{1 - P(d \in \hat{D} \mid x(\hat{D}) = x)} \ge \frac{1}{P(d \in \hat{D} \mid d \in \hat{B}, x(\hat{D}) = x)} - 1 \ge \frac{1}{\mathrm{PrivMAF}(d, \mathrm{MAF}(D))} - 1$$

where the last step uses P(d ∈ D̂ | d ∈ B̂, x(D̂) = x) ≤ PrivMAF(d, MAF(D)).
Substituting this into equation 3.4, we get

$$P(d \in \hat{D} \mid d \in \hat{B}, x(\hat{D}) = x, R) \le \frac{1}{1 + \left(\frac{1}{\mathrm{PrivMAF}(d, \mathrm{MAF}(D))} - 1\right) P_\beta - P(d \in \hat{B} - \hat{D})}$$

Putting it all together, since PrivMAF(d, MAF(D)) ≤ PrivMAF(D) ≤ β, we see that

$$P(d \in \hat{D} \mid d \in \hat{B}, x(\hat{D}) = x, R) \le \frac{1}{1 + \left(\frac{1}{\beta} - 1\right) P_\beta - \max_{d \in \{0,1,2\}^m} P(d \in \hat{B} - \hat{D})} \le \alpha$$

which is what we wanted.
Note that, in practice, since max_{d∈{0,1,2}^m} P(d ∈ B̂ − D̂) << 1, we choose an approximate β such that

$$\alpha \approx \frac{1}{1 + P_\beta\left(\frac{1}{\beta} - 1\right)}$$
Scalability of ALGT

We have presented results on moderate-sized datasets. We have also run our algorithm on larger artificial datasets (with 10,000 individuals and 1000 SNPs) and have found our ALGT implementation still runs in a reasonable amount of time, completing in just over 8 hours on a single core (Methods). Although our current implementation runs on a single core, the PrivMAF framework permits parallelization of Monte Carlo sampling, the major computational bottleneck in our pipeline, i.e. computing β, and thus is able to benefit from any parallel or distributed computing system. As dataset sizes grow, we expect to be able to keep pace by computing the PrivMAF statistic more efficiently.
3.5.7 Changing the Assumptions
The above model makes a few assumptions (assumptions that are present in most previous work that we are aware of). In particular it assumes that there is no linkage disequilibrium (LD) (which is to say that the SNPs are independently sampled), that the genotypes of individuals are independent of one another (that there are no relatives, population stratification, etc. in the population), and that the background population is in Hardy-Weinberg Equilibrium (H-W Equilibrium). The assumption that genotypes of different individuals are independent from one another is difficult to remove, and we do not consider it here. We can, however, remove either the assumption of H-W Equilibrium or of SNPs being independent.
First consider the case of H-W Equilibrium. Let us consider the ith SNP, and let p_i be the minor allele frequency. We also let p_{0,i}, p_{1,i} and p_{2,i} be the probability of us having zero, one, or two copies of the minor allele respectively. Assuming the population is in H-W equilibrium is the same as assuming that p_{0,i} = (1 − p_i)², p_{1,i} = 2p_i(1 − p_i), and p_{2,i} = p_i². Dropping this assumption, we see that all of the calculations above still hold, except we get that

$$\Pr(x_i(\hat{D}) = x_i) = \sum_{c} \binom{n}{c} \binom{n-c}{x_i - 2c}\, p_{2,i}^{c}\, p_{1,i}^{x_i - 2c}\, p_{0,i}^{n + c - x_i}$$

where we use the convention that $\binom{a}{b} = 0$ when b < 0. This allows us to remove
the assumption of H-W Equilibrium. Unfortunately there are two problems with this approach. The first is statistical: instead of having to just estimate one parameter per SNP (p_i), we have to estimate two (p_{0,i} and p_{1,i}, since p_{2,i} can be calculated from the other two). The other problem is that calculating Pr(x_i(D̂) = x_i) suddenly becomes more computationally intensive, so much so that it is prohibitive for large data sets.
In order to allow us to drop the assumption of no LD, we can model the genome as a Markov model (one could also use a hidden Markov model instead, which allows for more complex relationships, but for simplicity's sake we will only talk about Markov models, since the generalization to HMMs is straightforward). In such a model the state of a given SNP only depends on the state of the previous SNP. To specify such a model we need to specify the probability distribution of the first SNP, and for each subsequent SNP we need to specify its distribution conditional on the previous SNP. It is then straightforward to modify our framework to deal with this model. As above, however, this requires us to estimate many parameters and is also much more time consuming; thus it is not likely to be useful in practice.
Estimating P_β

Unfortunately, P_β = P(PrivMAF(D̂) ≤ β | x(D̂) = x) is not so easy to calculate. We use a Monte Carlo type approach to calculate it. More precisely, we sample D̂ conditional on x(D̂) = x, then estimate P_β as the percentage of the sampled D̂ we generated for which max_{d∈D̂} PrivMAF(d, MAF(D̂)) ≤ β.
This approach requires us to be able to sample D̂ such that x(D̂) = x. In order to do this consider t_i = #{j : d_{j,i} = 2}, where d_{j,i} is the genotype of d_j at SNP i. Then the probability that t_i = t is proportional to

$$\binom{n}{t} \binom{n-t}{x_i - 2t}\, p_i^{2t}\, (2p_i(1-p_i))^{x_i - 2t}\, (1-p_i)^{2(n + t - x_i)}$$

where we hold to the convention that $\binom{r}{m} = 0$ if r < 0, m < 0, or r < m. This allows us to sample t_i. Knowing t_i we can then calculate the number of j such that d_{j,i} = 1 (namely x_i − 2t_i) and the number that equal 0 (namely n + t_i − x_i). We can then randomly choose t_i individuals to have d_{j,i} = 2, and similarly for d_{j,i} = 1 and d_{j,i} = 0. Repeating this process for all of the SNPs gives us a random sample of D̂ conditional on x(D̂) = x.
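A sketch of this conditional sampler (our own illustration under the model above, assuming numpy/scipy; the helper names are hypothetical). Note that the factors p_i^{x_i}(1 − p_i)^{2n − x_i} are constant in t and cancel when normalizing, leaving weights proportional to C(n, t)·C(n − t, x_i − 2t)·2^{x_i − 2t}.

    import numpy as np
    from scipy.special import gammaln

    def log_binom(a, b):
        # log of the binomial coefficient C(a, b)
        return gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1)

    def sample_column(x_i, n, rng):
        # Sample one SNP column of D conditional on minor allele count x_i,
        # first drawing t = #{individuals with genotype 2}.
        ts = np.arange(0, x_i // 2 + 1)
        ts = ts[x_i - 2 * ts <= n - ts]        # leave room for the genotype-1s
        logw = (log_binom(n, ts) + log_binom(n - ts, x_i - 2 * ts)
                + (x_i - 2 * ts) * np.log(2))
        w = np.exp(logw - logw.max())
        t = rng.choice(ts, p=w / w.sum())
        col = np.zeros(n, dtype=int)
        col[:t] = 2                            # t individuals with genotype 2
        col[t:t + (x_i - 2 * t)] = 1           # x_i - 2t with genotype 1
        rng.shuffle(col)                       # random assignment to individuals
        return col

Repeating this for each SNP i (with its own x_i) yields one draw of D̂ conditional on x(D̂) = x.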
Of course Monte Carlo estimation is often very slow. What alternatives do we have? One is to note that, if M is the logarithm of the ratio appearing in PrivMAF(d̂₁, MAF(D̂)) (a sum of independent per-SNP terms once we condition on x(D̂) = x), then as m goes to infinity we get that (under reasonable assumptions, such as all MAFs > .05 and n fixed)

$$\frac{M - \mathbb{E}M}{\sqrt{\mathrm{var}(M)}} \to \chi$$

where EM is the expected value of M, var(M) is its variance (both of which we can calculate), and χ is a unit normal centered at 0. This result follows from considering the distribution, Q, on the pairs (x_i, p_i). Using the Central Limit Theorem, it is straightforward to show the result holds when Q has finite support. One can then use a limiting argument to show it holds for more general Q (we do not include the details here). This fact gives us a means of estimating P(PrivMAF(d̂₁, MAF(D̂)) ≤ β | x(D̂) = x), which can then be used to estimate P_β.

Unfortunately this is only an asymptotic result, and experiments show that it often gives poor estimates in practice, so we have chosen not to use it. It can be hoped, however, that more robust approximations are possible to speed up this calculation.
Approximating β

Calculating β can be quite time consuming, so one might be tempted to try to avoid calculating it. One way to do this could be to use α instead of β. Experiments on simulated datasets show that there is some β₀ so that if α is above β₀ then β is about equal to α, while below β₀ we see β quickly decays to 0 (this can be seen, for example, in Figure 3-4). This implies that, if PrivMAF is significantly below α (where we do not attempt to define significantly below here), then we should expect β to be close to α, so PrivMAF should be below β as well. This is a heuristic, but this line of reasoning seems like it could lead to something more reliable; more work is needed to know for sure.
3.5.8 Estimating the parameters
The above model requires estimates of the p_i. How are they estimated? The straightforward method is to take another collection of individuals (our reference population) drawn from the same background population as our study participants. The minor allele frequencies of this population can then serve as an estimate of the minor allele frequencies for the background population. Alternatively, we can estimate the p_i parameters from the union of this collection of individuals with the study participants, a method advocated by some previous papers.
An alternative approach is to use Bayesian methods to place a prior on pi, which
can then be updated based on the data in the outside population. We can then use
this posterior probability on pi to estimate P(xi(D) = xi). In our results we used the
naive approach, though arguments can be made for the other two.
The other parameter that one must consider is N, the size of the background population. This depends a lot on the context, and giving a realistic estimate of it is critical. In most applications the background population from which the study is drawn is fairly obvious. That being said, one needs to be careful of any other information released in the paper about participants: just listing a few facts about the participants can greatly reduce N, greatly reducing the bounds on privacy guarantees (since the probability of a privacy compromise is roughly inversely proportional to N − n).
Chapter 4

Improved Privacy Preserving Counting Queries in Medical Databases

4.1 Introduction
The rise of electronic health records (EHR) has led to increased interest in using clinical data as a source of valuable information for biomedical research [72, 28, 87, 103, 24]. In particular, there is interest in using this data to improve study design by, for example, helping identify cohorts of patients that can be included in medical studies. A first step in this selection is figuring out how many patients in a given database might be eligible to participate. The answer to such count queries can be used in budgeting and study planning.
Unfortunately, even this simple application of EHRs raises concerns over patient
privacy. It seems hard to believe that releasing a few count queries can lead to a
major loss of privacy. Previous work has shown, however, that similar count queries
can undermine user privacy on social networking websites such as Facebook [70], a
result that can be extended to medical databases. One could, for example, ask for
the number of 25 year old males on a certain medication who do not have HIV. If
the answer returned is zero you know that any 25 year old male patient on that
medication is HIV positive, a very private piece of information.
In order to help deal with this problem, it has been suggested that instead of releasing raw counts, institutions should release perturbed versions [97]. Our contribution is to present an improved mechanism for releasing differentially private answers to count queries.
4.2 Previous Work
One approach for ensuring privacy in this situation is to use a trust-but-verify framework, where medical professionals are trusted to make queries, but the query records are kept and checked to ensure that no abuses have taken place. The process of checking these records is known as auditing. There have been numerous different approaches suggested to produce such an auditable system [29]. Recently, some have started to work on generating auditing frameworks specific to the biomedical setting [51, 58].
A trust but verify approach allows medical professionals to get access to the data
that they need while decreasing abuse. This does not prevent all abuses, however,
instead relying on fear of punishment to prevent bad behavior. Moreover, it is not
always clear (if it is ever clear) which queries should raise red flags in an audit and
which should not.
Because of these drawbacks it has been suggested that, instead of releasing raw counts, one could release counts perturbed by a small amount of noise in the hopes that this noise will thwart such privacy threats. These ideas have been included in i2b2 and STRIDE [28, 80, 79, 97]. Both of these methods work by adding truncated Gaussian noise to the query results.
Unfortunately, the addition of Gaussian noise is ad-hoc and not based off privacy guarantees. In order to remedy this issue, Vinterbo et al. [97] suggested using the exponential mechanism as a means to produce differentially private answers to count queries. Although this is not the first attempt to use differential privacy in a medical research context, it is the first we are aware of to do so as a way to improve cohort selection [100, 15, 55, 65]. In this work, we modify their method, which is described below, to get a mechanism for releasing count data in a privacy preserving manner while simultaneously preserving higher levels of utility.
4.3 Exponential Mechanism
Our approach builds on the one of Vinterbo et al. [97]. In a nutshell, they assume that there is a database consisting of $n$ patients, and they want to return a perturbed version of the number of people in that database with a certain condition. To accomplish this goal, they introduce a loss function $q$ defined by
$$ q(y, c) = \begin{cases} \beta_+(c - y)^{\alpha_+} & \text{if } c > y \\ \beta_-(y - c)^{\alpha_-} & \text{otherwise,} \end{cases} $$
where $\alpha_+$, $\alpha_-$, $\beta_+$, $\beta_-$ are parameters given by the user, $c$ is an integer in the range $[r_{\min}, r_{\max}]$, and $y$ is the actual count we want to release. As opposed to the Laplacian mechanism, which uses the $L_1$ error as a loss function, this loss function allows users to weigh over-estimates and under-estimates differently. In what follows we assume $\alpha_+ \le 1$ and $\alpha_- \le 1$. Note that no such assumption was made in previous works, but in practice it seems realistic, since if $\alpha_+$ or $\alpha_-$ is larger than 1 it follows from Vinterbo et al. that $q$ is very sensitive, which results in very inaccurate query results.

Note that $q$ has sensitivity $\Delta q = \max(\beta_+, \beta_-)$. Therefore, if we define $X_w$ to be a random function defined so that $P(X_w(y) = c)$ is proportional to $\exp(-w q(y, c))$, then $X_w$ is $2\Delta q\, w$-differentially private [78].
This is the mechanism introduced by Vinterbo et al. [97]. Their analysis, however, is based off of the general analysis of the exponential mechanism. By providing a more refined analysis, we are able to modify the above algorithm to give better utility for a given choice of $\varepsilon$.
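To make the mechanism concrete, the following Python sketch (our own illustration, not the released DPCount code; the default loss parameters simply mirror those used in the experiments below) samples a private count from the distribution $P(X_w(y) = c) \propto \exp(-wq(y, c))$ with the naive setting $w = \varepsilon/(2\Delta q)$:

```python
import numpy as np

def exponential_count(y, eps, r_min, r_max,
                      a_plus=1.0, a_minus=1.0, b_plus=2.0, b_minus=1.0):
    """Naive exponential mechanism for a count query.

    Loss: q(y, c) = b_plus*(c - y)**a_plus if c > y, else b_minus*(y - c)**a_minus.
    With w = eps / (2 * max(b_plus, b_minus)), sampling c with probability
    proportional to exp(-w * q(y, c)) is eps-differentially private.
    """
    cs = np.arange(r_min, r_max + 1)
    diff = cs - y
    q = np.where(diff > 0,
                 b_plus * np.abs(diff) ** a_plus,
                 b_minus * np.abs(diff) ** a_minus)
    w = eps / (2.0 * max(b_plus, b_minus))
    weights = np.exp(-w * (q - q.min()))  # shift exponent for numerical stability
    return int(np.random.choice(cs, p=weights / weights.sum()))
```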
4.4 Derivation of Mechanism
In order to improve upon previous algorithms, we give a new analysis of the privacy preserving properties of the exponential mechanism. This result is summed up in Theorem 5.
Theorem 5. Assume that $\alpha_+, \alpha_- \in [0, 1]$, $\beta_+ > 0$, $\beta_- > 0$, and that
$$ q(y, c) = \begin{cases} \beta_+(c - y)^{\alpha_+} & \text{if } c > y \\ \beta_-(y - c)^{\alpha_-} & \text{otherwise.} \end{cases} $$
Let $X(y)$ be a random variable defined so that $P(X(y) = c)$ is proportional to $\exp(-w q(y, c))$ for all integers $c \in [r_{\min}, r_{\max}]$ and $y \in [0, n]$. Then we have that, if
$$ w \le \min\left(\frac{\varepsilon}{\beta_+}, \frac{\varepsilon}{\beta_-}\right) = \frac{\varepsilon}{\Delta q} $$
and
$$ P(X(r_{\max}) = r_{\max}) \le \exp(\varepsilon)\, P(X(r_{\max} - 1) = r_{\max}) $$
and
$$ P(X(r_{\min}) = r_{\min}) \le \exp(\varepsilon)\, P(X(r_{\min} + 1) = r_{\min}), $$
then $X(y)$ is $\varepsilon$-differentially private.
Proof. Let
$$ Z_y = \sum_{c = r_{\min}}^{r_{\max}} \exp(-w q(y, c)); $$
then by definition
$$ P(X(y) = c) = \frac{\exp(-w q(y, c))}{Z_y}. $$
For a given $c$ we are interested in $\frac{P(X(y) = c)}{P(X(y') = c)}$ and $\frac{P(X(y') = c)}{P(X(y) = c)}$, where $y = y' - 1$. There are three cases: the first is when $y \ge r_{\max}$, the second when $y < r_{\min}$, and the third when $y \in [r_{\min}, r_{\max} - 1]$.
Consider the case when $y < r_{\min}$. Then $Z_y \le Z_{y+1}$, so
$$ \frac{P(X(y+1) = c)}{P(X(y) = c)} = \exp(w\beta_+((c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+}))\,\frac{Z_y}{Z_{y+1}} \le \exp(w\beta_+((c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+})). $$
Since $\alpha_+ \le 1$ and $c - y > c - y - 1 \ge 0$, we see that, by definition, $(c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+} \le 1$, so the above is
$$ \le \exp(w\beta_+) \le \exp(\varepsilon), $$
as desired. On the other hand, consider
$$ \frac{P(X(y) = c)}{P(X(y+1) = c)} = \exp(-w\beta_+((c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+}))\,\frac{Z_{y+1}}{Z_y}. $$
Note that $\exp(-w\beta_+((c - y)^{\alpha_+} - (c - y - 1)^{\alpha_+})) \le 1$, so the above is $\le \frac{Z_{y+1}}{Z_y}$. Since $q$ has sensitivity bounded by $\max(\beta_-, \beta_+)$ we see that $\frac{Z_{y+1}}{Z_y} \le \exp(w \Delta q) \le \exp(\varepsilon)$, so putting the above together
$$ \frac{P(X(y) = c)}{P(X(y+1) = c)} \le \exp(\varepsilon). $$
By a similar argument, when $y \ge r_{\max}$ we have that
$$ \frac{P(X(y) = c)}{P(X(y+1) = c)} \le \exp(\varepsilon) \quad \text{and} \quad \frac{P(X(y+1) = c)}{P(X(y) = c)} \le \exp(\varepsilon). $$
Finally, consider the case when $r_{\min} \le y < r_{\max}$. Then note that
$$ \begin{aligned} Z_{y+1} &= Z_y - \exp(-w q(y, r_{\max})) + \exp(-w q(y+1, r_{\min})) \\ &= Z_y - \exp(-w\beta_+(r_{\max} - y)^{\alpha_+}) + \exp(-w\beta_-(y - r_{\min} + 1)^{\alpha_-}). \end{aligned} $$

We will consider $\frac{P(X(y) = c)}{P(X(y+1) = c)}$. If $c > y$ then $q(y, c) > q(y+1, c)$; it follows, since $w \le \varepsilon/\Delta q$, that
$$ \frac{P(X(y) = c)}{P(X(y+1) = c)} = \exp(w(q(y+1, c) - q(y, c)))\,\frac{Z_{y+1}}{Z_y} \le \frac{Z_{y+1}}{Z_y} \le \exp(w\Delta q) \le \exp(\varepsilon). $$

Therefore assume that $c \le y$. Then $\exp(w(q(y+1, c) - q(y, c))) \le \exp(w\beta_-)$, where equality is achieved when $c = y$. We next consider $\frac{Z_{y+1}}{Z_y}$. Note that $\exp(-w\beta_+(r_{\max} - y)^{\alpha_+})$ is increasing in $y$ while $\exp(-w\beta_-(y - r_{\min} + 1)^{\alpha_-})$ is decreasing in $y$. Thus it is easy to see that there exists a $y_0 \in (r_{\min}, r_{\max})$ such that
$$ -\exp(-w\beta_+(r_{\max} - y)^{\alpha_+}) + \exp(-w\beta_-(y - r_{\min} + 1)^{\alpha_-}) \le 0 \quad \text{if } y \ge y_0, $$
while the same quantity is positive if $y < y_0$.

Thus if $y \ge y_0$ we see that $\frac{Z_{y+1}}{Z_y} \le 1$, so
$$ \frac{P(X(y) = c)}{P(X(y+1) = c)} \le \exp(w\beta_-)\,\frac{Z_{y+1}}{Z_y} \le \exp(w\beta_-) \le \exp(\varepsilon). $$

If, on the other hand, $y < y_0$, this implies that $Z_y \le Z_{y+1}$; so by induction we see that $Z_y \ge Z_{r_{\min}}$, while
$$ -\exp(-w\beta_+(r_{\max} - y)^{\alpha_+}) + \exp(-w\beta_-(y - r_{\min} + 1)^{\alpha_-}) \le -\exp(-w\beta_+(r_{\max} - r_{\min})^{\alpha_+}) + \exp(-w\beta_-(r_{\min} - r_{\min} + 1)^{\alpha_-}), $$
and thus
$$ \frac{Z_{y+1}}{Z_y} \le 1 + \frac{-\exp(-w\beta_+(r_{\max} - r_{\min})^{\alpha_+}) + \exp(-w\beta_-(r_{\min} - r_{\min} + 1)^{\alpha_-})}{Z_{r_{\min}}} = \frac{Z_{r_{\min}+1}}{Z_{r_{\min}}}. $$
It follows that
$$ \frac{P(X(y) = c)}{P(X(y+1) = c)} = \exp(w(q(y+1, c) - q(y, c)))\,\frac{Z_{y+1}}{Z_y} \le \exp(w\beta_-)\,\frac{Z_{r_{\min}+1}}{Z_{r_{\min}}} = \frac{P(X(r_{\min}) = r_{\min})}{P(X(r_{\min}+1) = r_{\min})} \le \exp(\varepsilon). $$

Thus
$$ P(X(y) = c) \le \exp(\varepsilon)\, P(X(y+1) = c) $$
for all $y$ and $c$. A symmetric argument, using the condition at $r_{\max}$, shows that
$$ P(X(y+1) = c) \le \exp(\varepsilon)\, P(X(y) = c), $$
so $X$ is $\varepsilon$-differentially private, as desired. □
This allows us to define Algorithm 1. The basic idea of the algorithm is that it does a search for the largest $w$ so that $X_w$ is $\varepsilon$-differentially private. The parameter $k$ controls how long we continue this search: it determines how precise an approximation of the optimal $w$ we get (the algorithm's result differs from the true result by at most a factor of $\frac{1}{k}$).
Corollary 1. Algorithm 1 is $\varepsilon$-differentially private.

Proof. By the proof of correctness of the exponential mechanism we know that $c_0 \le \exp(\varepsilon)$, so the algorithm always returns a value. Moreover, by Theorem 5 we see that this value is $\varepsilon$-differentially private. □
4.5 Theoretical Comparison
As $k$ (the precision of our algorithm) increases the running time increases, but we get a better and better estimate of the optimal value of $w$ for our given $\varepsilon$. Note that, if one wanted to improve performance, instead of searching all $w_i$ one could use a binary search to find the largest $w_i \in S$ so that $c_i \le \exp(\varepsilon)$. We would like to compare this algorithm to the one given in [97], to see how our results compare.

Algorithm 1 An $\varepsilon$-differentially private estimate of a count query
Require: $y$, $\varepsilon$, $\alpha_+$, $\alpha_-$, $\beta_+$, $\beta_-$, $r_{\max}$, $r_{\min}$, $k$
Ensure: $\varepsilon$-differential privacy
  Let $S = \{w_0, \ldots, w_k\}$, where $w_i = \left(1 + \frac{i}{k}\right)\frac{\varepsilon}{2\Delta q}$.
  for $i = 0, \ldots, k$ do
    Let $X_i = X_{w_i}$.
  end for
  for $i = 0, \ldots, k$ do
    Let $a_i = \frac{P(X_i(r_{\max}) = r_{\max})}{P(X_i(r_{\max} - 1) = r_{\max})}$, $b_i = \frac{P(X_i(r_{\min}) = r_{\min})}{P(X_i(r_{\min} + 1) = r_{\min})}$, and $c_i = \max(a_i, b_i)$.
  end for
  Let $i_0$ be the largest $i$ so that $c_i \le \exp(\varepsilon)$.
  return $Z_\varepsilon(y) = X_{i_0}(y)$
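A Python sketch of Algorithm 1 follows (our own illustration; the search grid $w_i = (1 + i/k)\,\varepsilon/(2\Delta q)$, running from the naive choice up to the $\varepsilon/\Delta q$ bound of Theorem 5, is an assumption of this sketch, and the boundary ratios are computed exactly from the mechanism's distribution):

```python
import numpy as np

def dp_count(y, eps, r_min, r_max, k=100,
             a_plus=1.0, a_minus=1.0, b_plus=2.0, b_minus=1.0):
    """Algorithm 1 sketch: pick the largest w on the grid for which the
    boundary-ratio conditions of Theorem 5 hold, then sample from X_w."""
    cs = np.arange(r_min, r_max + 1)

    def probs(w, y_val):
        diff = cs - y_val
        q = np.where(diff > 0, b_plus * np.abs(diff) ** a_plus,
                     b_minus * np.abs(diff) ** a_minus)
        weights = np.exp(-w * (q - q.min()))
        return weights / weights.sum()

    dq = max(b_plus, b_minus)
    best_w = eps / (2.0 * dq)  # the naive choice is always eps-DP
    for i in range(k + 1):     # a binary search over the grid would also work
        w = (1.0 + i / k) * eps / (2.0 * dq)
        a = probs(w, r_max)[-1] / probs(w, r_max - 1)[-1]  # P(X(rmax)=rmax)/P(X(rmax-1)=rmax)
        b = probs(w, r_min)[0] / probs(w, r_min + 1)[0]    # P(X(rmin)=rmin)/P(X(rmin+1)=rmin)
        if max(a, b) <= np.exp(eps):
            best_w = w
    return int(np.random.choice(cs, p=probs(best_w, y)))
```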
To see how our algorithm compares to the naive exponential mechanism, we first need to prove the following theorem:
Theorem 6. If $w_1 \ge w_2$ then, for any $c \ge 0$, we get that
$$ P(q(y, X_{w_1}(y)) \le c) \ge P(q(y, X_{w_2}(y)) \le c), $$
with equality if and only if $P(q(y, X_{w_2}(y)) \le c) = 1$.
Proof. If $q(y, x) \le c$ then
$$ \exp(-w_1 q(y, x)) \ge \exp((w_2 - w_1)c)\exp(-w_2 q(y, x)), $$
and if $q(y, x) > c$ then
$$ \exp(-w_1 q(y, x)) < \exp((w_2 - w_1)c)\exp(-w_2 q(y, x)). $$
For a given $w$ let
$$ a_w = \sum_{x :\, q(y, x) \le c} \exp(-w q(y, x)) \quad \text{and} \quad b_w = \sum_{x :\, q(y, x) > c} \exp(-w q(y, x)). $$
If $b_{w_2} = 0$ then $P(q(y, X_{w_2}(y)) \le c) = 1$. Otherwise, by the above, $a_{w_1} \ge \exp((w_2 - w_1)c)\, a_{w_2}$ and $b_{w_1} < \exp((w_2 - w_1)c)\, b_{w_2}$, so
$$ P(q(y, X_{w_1}(y)) \le c) = \frac{a_{w_1}}{a_{w_1} + b_{w_1}} > \frac{\exp((w_2 - w_1)c)\, a_{w_2}}{\exp((w_2 - w_1)c)\, a_{w_2} + \exp((w_2 - w_1)c)\, b_{w_2}} = P(q(y, X_{w_2}(y)) \le c), $$
just as we wanted. □
This theorem implies that Algorithm 1 always outperforms the naive exponential mechanism in terms of $q$. More formally:

Corollary 2. For a given $\varepsilon$ and $c$ we get that
$$ P(q(y, Z_\varepsilon(y)) \le c) \ge P(q(y, X_{\frac{\varepsilon}{2\Delta q}}(y)) \le c). $$

Proof. There exists a $w$ that depends only on $\varepsilon$, $k$ and $q$ such that $w \ge \frac{\varepsilon}{2\Delta q}$ and $Z_\varepsilon(y) = X_w(y)$. Thus by Theorem 6 the corollary follows. □

4.6 Results
In order to test our algorithm we implemented it in Python (the code is available at http://groups.csail.mit.edu/cb/DPCount/). We also implemented the naive exponential mechanism in Python, so that we could compare the performance of the two methods.
In order to measure the performance of our mechanism we look at the associated risk. The risk of a random function $X$ at $y$ is defined as the expected value of $q(X(y), y)$ for $y$ fixed. Note that by using the mechanisms described above we are implicitly trying to minimize the risk, so it makes sense to use the risk as a measure of how well the mechanisms perform.

Figure 4-1: Here we plot privacy parameter $\varepsilon$ versus the risk (where the risk is on a log scale) for the naive exponential mechanism (blue) and our mechanism with search parameters $k = 10$ (green) and $k = 100$ (red). We see that in all cases ours performs much better than the naive exponential mechanism.
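Because the output distribution of $X_w$ is finite and explicit, the risk at a fixed $y$ can be evaluated exactly rather than by Monte Carlo. A minimal sketch (our own, reusing the loss and the assumed parameter defaults from above):

```python
import numpy as np

def mechanism_risk(y, w, r_min, r_max,
                   a_plus=1.0, a_minus=1.0, b_plus=2.0, b_minus=1.0):
    """Exact risk E[q(X_w(y), y)] at a fixed y: sum the loss against the
    mechanism's output distribution over c in [r_min, r_max]."""
    cs = np.arange(r_min, r_max + 1)
    diff = cs - y
    q = np.where(diff > 0, b_plus * np.abs(diff) ** a_plus,
                 b_minus * np.abs(diff) ** a_minus)
    weights = np.exp(-w * (q - q.min()))
    return float((q * weights).sum() / weights.sum())
```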
As a first test of how our method stacks up, we compared the risk of our mechanism at a given $\varepsilon$ with the risk of the naive exponential mechanism. To do this we chose $\beta_+ = 2$, $\beta_- = 1$, $\alpha_+ = \alpha_- = 1$, $r_{\min} = 3$, $r_{\max} = 10^6$, and $y = 100$ (these parameters are based off [97]). We then plot $\varepsilon$ versus risk for the naive mechanism as well as our mechanism with $k = 10$ and $k = 100$ (see Figure 4-1). This figure demonstrates that for all choices of $\varepsilon$ our method results in much less risk than the naive approach.
Though our method preserves more utility than the naive exponential mechanism for a given level of privacy, it does come at a slight cost, namely that it takes longer to run. However, we can make the runtime realistic (Figure 4-2). This is partly due to the fact that, instead of using the linear search to find $i_0$ given in the algorithm, we implemented a binary search procedure to cut down on running time. Figure 4-2 was generated by measuring the running time of both our mechanism (with $k$ varying) and the naive exponential mechanism for varying choices of $r_{\max}$. We used the same parameters as in Figure 4-1, and set $\varepsilon = 1.0$. Here we see that our mechanism does take longer to run, and the running time increases as $k$ increases. On the other hand, even for the worst case scenario considered ($k = 100$, $r_{\max} = 10^7$) our method only takes a few minutes. Since we are using these queries for cohort identification, waiting a few extra minutes for greatly increased utility seems like a worthwhile trade-off. Moreover, it seems likely that the search step can be sped up by using estimates of $P(X_w(a) = b)$, based off the fact that $P(|X_w(a) - a| > c)$ dies off quickly as $c$ increases for reasonable parameters.
4.7 Choosing $\varepsilon$
The above algorithm assumed we chose some privacy parameter $\varepsilon$ and then tried to optimize the utility as much as possible [61, 34]. Often we might instead want to balance utility and privacy. Roughly speaking, the utility of $X_w$ increases as $w$ increases, while privacy increases as $\varepsilon$ decreases, so balancing utility with privacy can be done by comparing $w$ to its corresponding $\varepsilon$. Using the analysis given by McSherry et al. [78] one would take $w = \frac{\varepsilon}{2\Delta q}$. Theorem 5, however, shows that this is not optimal. Therefore we can use our result to show that one can choose a better $w$ if they want a certain privacy utility trade-off. This can be seen in Figure 4-3, where the $\varepsilon$ guaranteed by our analysis for a given $w$ is compared to the $\varepsilon$ given by a naive analysis (with the parameters used in Figure 4-1). We see that our method shows that more privacy is guaranteed for less loss in utility, a fact that is useful when setting $\varepsilon$.
It is worth noting that many people may query the system again and again.
Although differential privacy ensures that no one query reveals too much information
it may be the case that, taken together, a set of differentially private queries can still
lead to privacy breaches. Therefore it is necessary to limit the number of queries every user makes to the system.

Figure 4-2: We plot $r_{\max}$, the maximum value returned by our algorithm, versus the runtime (both on a log scale) for the naive exponential mechanism (blue) and our algorithm with search parameters $k = 10$ (green) and $k = 100$ (red). We see that, though our algorithm is slower, it still runs in a few minutes in all cases.

Figure 4-3: We plot privacy parameter $\varepsilon$ versus the parameter $w$ (which increases with utility) for both the naive exponential mechanism (blue) as well as our analysis with $k = 10$ (green) and $k = 100$ (red). Note that $w$ corresponds to utility (a higher $w$ means a higher utility), while a higher $\varepsilon$ corresponds to less privacy. We see that our analysis shows that a given $\varepsilon$ corresponds to a larger $w$, which means that in many algorithms that balance utility and privacy when choosing $\varepsilon$, our analysis will result in adding less noise to the answer of the count query.

After a user uses up all of their queries (aka uses up
their privacy budget) then, in order to gain access to more queries, they have to have
previous queries audited. Note that this system runs into some of the same problems
as a purely audit based system, but it greatly increases the effort needed to violate
someone's privacy. Moreover, it can be hoped that the added noise makes attempts
to violate privacy more obvious, since the behavior required to violate privacy will be
more abnormal than without the noise.
It is worth noting that, in this case, we might be able to stretch the privacy budget
further than one would naively expect. For a given query there may be no sensitive
information given away about a person- for example, if we asked the system how
many individuals had HIV, then this query would not reveal sensitive information
about individuals without HIV. This insight suggests that it might be possible to base the number of allowed queries off the number of sensitive queries made about each patient instead of the total number of queries, though it is not obvious how to do this in a privacy preserving manner.
4.8 Conclusion
The emergence of EHRs offers both the promise of great utility and the threat of
privacy loss. Here we provide one possible method to help balance the needs of the
medical community with the rights of patients.
Though we improve upon previous results, there are still many questions left to be answered. Even though perturbing data increases privacy, it does not eliminate the possibility of something being revealed. To deal with this potential privacy breach we need methods for auditing query records to see if any malfeasance is being committed. We also ignored the question of what level of privacy (aka what choice of $\varepsilon$) is appropriate; there are various methods to choose this parameter (such as cost benefit analysis [34]), and it is up to the medical community to decide which makes the most sense in our context.
Finally, it should be noted that we addressed only one aspect of the secondary use of EHRs. Electronic records have the possibility of providing a cornucopia of useful information, but there is a need to consider how to balance these gains with patient privacy, whether through new privacy preserving methods like the above or through new laws and regulations.
Chapter 5
Picking Top SNPs Privately with the
Allelic Test Statistic
5.1 Introduction
Genome-wide association studies (GWAS) are a cornerstone of genotype-phenotype
association in humans. These studies use various statistical tests to measure which
polymorphisms in the genome are important for a given phenotype and which are not.
With the increasing collection of genomic data in the clinic there has been a push
towards using this information to validate classical GWAS findings and generate new
ones [103]. Unfortunately, there is growing concern that the results of these studies
might lead to loss of privacy for those who participate in them [60, 19, 75].
These privacy concerns have led some to suggest using statistical tests that are
differentially private [66, 105, 102, 67, 95]. On the bright side, such methods, properly
used, can help ensure a high degree of privacy. These privacy gains, however, have
traditionally come at a high cost in utility and efficiency. Moreover, since the genome
is extremely high dimensional, this cost is especially pronounced, as was noted in
previous works [95]. In order to help balance utility and privacy, new methods are
needed that provide greater utility than current methods while achieving equal or
greater privacy.
Here we improve upon the state of the art in differentially private GWAS. We
93
build on the work of Johnson and Shmatikov [67], which applied the ideas of differential privacy to common analysis approaches in case-control GWAS. In particular,
we show how to use nonconvex optimization to overcome many of the limitations of
their method for picking high scoring SNPs in a differentially private way, making the
approach computationally tractable [67]. Secondly, we demonstrate how to give improved significance estimates for the chosen SNPs using input, as opposed to output,
perturbation based methods. Taken together, these results substantially advance our
ability to perform differentially private GWAS.
5.1.1 Previous Work
Previous works have looked at using differentially private versions of the Pearson $\chi^2$ and allelic test statistics (defined below) to find high scoring SNPs, beginning with the work of Uhler et al. [95]. Since then numerous others have worked on this problem [66, 105, 102, 67], and there has even been a competition where teams attempted to improve on the state of the art [66].

These works focused on using three different approaches for picking high scoring SNPs, namely a neighbor distance based one, a Laplacian mechanism based one, and a score based one [95]. These studies have suggested the score based method is an improvement on the Laplacian based method. The relation between the neighbor based method and the other two is more complicated, however. Though it often outperforms them, it turns out that the ranking of SNPs favored by the neighbor method is not always the same as that favored by the other methods. Moreover, the neighbor method is more computationally demanding, leading others to use approximate versions of it [95]. Some of these works have also resorted to weaker privacy definitions, assuming the control group's genomes are publicly available [105].

Beyond just choosing high scoring SNPs, others have also looked at ways of estimating significance after choosing the SNPs of interest. This goal has been achieved by calculating the sensitivity of the allelic test statistic and applying the Laplace mechanism directly to it, or by performing similar procedures for p-values [95].
5.2 Our Contributions
We significantly improve upon the promising neighbor distance based mechanism
for releasing top SNPs [67].
We introduce an adaptive threshold approach which
overcomes issues arising from the fact that the neighbor mechanism might favor a
different ordering than that given by the allelic test statistic. We then introduce a
faster algorithm for calculating the neighbor distance used in this method, making
it tractable for large datasets. This algorithm works in three steps: (i) stating the
problem as an optimization problem; (ii) solving a relaxation of this problem in
constant time; and (iii) rounding the relaxed solution to a solution to the original
problem.
We also show how to obtain accurate estimates of the allelic test statistic, by
focusing on the input, rather than the output, of the pipeline. In particular, we show
that the input perturbation based method greatly improves accuracy over traditional
output perturbation based techniques. We demonstrate this capability in two different
set ups, one corresponding to protecting all genomic information (scenario 1) and one
corresponding to protecting only the information from the case cohort (scenario 2).
In addition, we look at methods for improving output perturbation in scenario 2.
Finally we apply our methods to real GWAS data, demonstrating both our greatly
improved computational performance and accuracy compared to the state of the art.
5.3 Set Up
Assume we have a case-control cohort. For a given SNP let $s_0$, $s_1$ and $s_2$ be the number of individuals in the control population with 0, 1 or 2 copies of the minor allele, respectively. Similarly let $r_0$, $r_1$ and $r_2$ be the corresponding quantities for the case cohort, and $n_0$, $n_1$ and $n_2$ be the same quantities over the entire study population. Let $R$ be the number of cases, $S$ the number of controls, and $N$ the total number of participants. We assume that $R$, $S$ and $N$ are known.
The allelic test statistic is given by
$$ Y = \frac{2N((2r_0 + r_1)S - (2s_0 + s_1)R)^2}{RS(2n_0 + n_1)(n_1 + 2n_2)}. $$
Note that $Y$ only depends on $x = 2r_0 + r_1$ and $y = 2s_0 + s_1$, so we can overload notation and let
$$ Y(x, y) = \frac{2N(xS - yR)^2}{RS(x + y)(2N - x - y)}. $$
This statistic is commonly used in GWAS studies to help determine which SNPs
are associated with a given condition [95]. (Note that the allelic test statistic ignores
the effects of population stratification and similar issues, a problem we will address
in the next chapter.)
We consider two different scenarios in this chapter. In scenario 1 we assume that
all genomic data and phenotypic data is private- that is to say that only R, S and N
are known. In scenario 2 we don't try to hide information about the control cohort
and instead only try to hide the case cohort- that is to say that so, si and S2 are
also known. This scenario has been investigated in previous work [105, 95]. Unless
otherwise stated assume we are in scenario 1.
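In code, the overloaded form of the statistic is a one-line function of the allele counts. A sketch (the names are our own):

```python
def allelic_test(x, y, R, S, N):
    """Allelic test statistic Y(x, y), with x = 2*r0 + r1 and y = 2*s0 + s1
    the case and control quantities defined above (assumes 0 < x + y < 2N)."""
    return 2.0 * N * (x * S - y * R) ** 2 / (R * S * (x + y) * (2.0 * N - x - y))
```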
5.4 GWAS Data
In this work we test our methods and others on a rheumatoid arthritis (RA) dataset, NARAC-1, from Plenge et al. [47], which contains 893 cases and 1244 controls after quality control. In the quality control step we removed all SNPs with minor allele frequency (MAF) less than .05 and considered only SNPs that were successfully called for all individuals. This process resulted in a total of 62441 SNPs to be considered.
5.5 Picking Top SNPs with the Neighbor Mechanism
In practice we will not know ahead of time which SNPs are related to a given phenotype. Thus we would like to pick the top SNPs (aka those with the largest scores) in a differentially private way. Previous works have used three different methods, namely a Laplacian based method, the score method, and the neighbor method. Here we will focus on the neighbor method, which is given in Algorithm 2.
Let $D$ and $D'$ be databases of size $n$. The neighbor distance between $D$ and $D'$ is the smallest value of $k$ so that there exists a sequence of databases $D_0, \ldots, D_k$ such that, for all $i$, $D_i$ and $D_{i+1}$ differ in exactly one entry, and that $D_0 = D$ and $D_k = D'$. It is worth noting that this defines a metric (a distance metric is a measure of distance that obeys several mathematical properties, including the triangle inequality) on the space of all databases of size $n$. The neighbor distance equals $|D - D'|$ for both scenarios. Sometimes, to simplify notation, we will let $\rho(D, D')$ denote the neighbor distance.
The neighbor method for picking SNPs works by picking a threshold $w$. All SNPs with an allelic score higher than $w$ are considered significant, while all others are considered not significant. The neighbor distance of a given SNP to the threshold $w$ is the minimum number of changes needed to flip the SNP from significant to not significant or vice versa; that is to say, the minimum neighbor distance to a significant database if the SNP is not significant, or vice versa. We can then use this distance measure to pick our SNPs in a differentially private manner, as shown in Algorithm 2.
This algorithm relies on two pieces of information: namely, a score function $q$ and a definition of datasets. Here we will assume that $q$ is the allelic test statistic. In our set up all datasets with $R$ cases and $S$ controls are allowed.

Note that this definition relies on an arbitrary choice of $w$. Because of this it is often the case that the SNPs favored by the neighbor method are not the same as those with the highest allelic score [105, 95]. To deal with this problem we instead pick the threshold in a differentially private way. This method is codified in Algorithm 3, and it ensures that as $\varepsilon_1$ and $\varepsilon_2$ increase the probability of Algorithm 3 returning the correct SNPs goes to one. This behaviour is because the probability that $w_{dp}$ is in a small neighborhood of $w$ goes to 1 as $\varepsilon_1$ goes to $\infty$. Since $w$ separates the $m_{ret}$ highest scoring SNPs from the rest, the probability that $d_i \ge 1$ for the $m_{ret}$ highest scoring SNPs and $d_i \le 0$ for the rest also goes to 1 as $\varepsilon_1$ goes to $\infty$. On the other hand, note that if $d_i \ge 1$ for the $m_{ret}$ highest scoring SNPs and $d_i \le 0$ for the rest, then as $\varepsilon_2$ goes to $\infty$ the probability of picking the top $m_{ret}$ SNPs goes to 1. Putting the above together gives us our result.
Algorithm 2 The neighbor method for picking top $m_{ret}$ SNPs
Require: Data set $D$, number of SNPs to return $m_{ret}$, privacy value $\varepsilon$, score function $q$, and boundary $w$.
Ensure: A list of $m_{ret}$ SNPs that is $\varepsilon$-differentially private.
  for $i = 0, \ldots, m$ do
    if $q(i, D) > w$ then
      $d_i = \min(\{|D - D'| : q(i, D') \le w, |D'| = |D|\})$
    else
      $d_i = 1 - \min(\{|D - D'| : q(i, D') > w, |D'| = |D|\})$
    end if
  end for
  Let $w_i = \exp(\frac{\varepsilon}{2} d_i)$ for all $i$.
  Choose $m_{ret}$ SNPs without replacement, where $\Pr(\text{Choose SNP } i) \propto w_i$.
  return Chosen SNPs
Algorithm 3 Our modified neighbor method for picking top $m_{ret}$ SNPs
Require: Data set $D$, number of SNPs to return $m_{ret}$, privacy values $\varepsilon_1$ and $\varepsilon_2$, and score function $q$.
Ensure: A list of $m_{ret}$ SNPs that is $(\varepsilon_1 + \varepsilon_2)$-differentially private.
  Let $w$ be the mean score of the $m_{ret}$th and $(m_{ret}+1)$st highest scoring SNPs.
  Let $w_{dp}$ be an $\varepsilon_1$-differentially private estimate of $w$.
  return The SNPs chosen by Algorithm 2 with $\varepsilon = \varepsilon_2$ and boundary value $w_{dp}$.
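The selection step of Algorithm 2 is a weighted sampling without replacement over SNPs. The following sketch (ours; the signed distances $d_i$ are assumed precomputed, e.g., by Algorithm 4) illustrates it:

```python
import numpy as np

def pick_top_snps(distances, m_ret, eps):
    """Algorithm 2's selection step: sample m_ret SNPs without replacement
    with Pr(choose SNP i) proportional to exp(eps/2 * d_i)."""
    d = np.asarray(distances, dtype=float)
    weights = np.exp(0.5 * eps * (d - d.max()))  # shift exponent for stability
    remaining = list(range(len(d)))
    chosen = []
    for _ in range(m_ret):
        w = weights[remaining]
        idx = int(np.random.choice(len(remaining), p=w / w.sum()))
        chosen.append(remaining.pop(idx))
    return chosen
```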
Theorem 7. For any choice of $q$ we get that Algorithm 3 is $(\varepsilon_1 + \varepsilon_2)$-differentially private.

Proof. To prove that Algorithm 3 is $(\varepsilon_1 + \varepsilon_2)$-differentially private it suffices to prove that $w_{dp}$ is $\varepsilon_1$-differentially private, since the second step of the algorithm is simply the exponential mechanism with privacy parameter $\varepsilon_2$. This fact, however, follows trivially. □
It should be noted that, in practice, we choose $\varepsilon$, then let $\varepsilon_1 = .1\varepsilon$ and $\varepsilon_2 = .9\varepsilon$. There is no real motivation for this choice, and it would be interesting to investigate the trade offs involved between the two parameters.
In previous works two main issues have been raised concerning the application of
the neighbor method. The first, which we addressed above, is that the order of the
SNPs might differ from the order given by the allelic test statistic. The other main
issue is runtime. In the following sections we show that the runtime is much less of
an issue than some have argued. We then implement these algorithms, using them
to compare the utility of our modified neighbor method to previous methods.
5.6 Fast Neighbor Distance Calculation with Private Genotype Data

5.6.1 Method Description
The major computational bottleneck of the neighbor method for picking high scoring SNPs has been the calculation of neighbor distance. This bottleneck has led some to calculate approximate neighbor distances ([95]) while others have calculated neighbor distance under stronger assumptions, leading to weaker privacy guarantees ([105], aka scenario 2). We are able to overcome this bottleneck using Algorithm 4.

To help remedy the situation we introduce a new method for calculating the neighbor distance. Our method involves only a constant number of arithmetic operations per SNP. To understand our approach, fix a given SNP. Assume we have $p = (r_0, r_1, r_2, s_0, s_1, s_2)$ and a threshold $w$ which we want to use to calculate the neighbor distance. Note that the neighbor distance can be expressed as the solution to the following optimization problem:
$$ \begin{aligned} \underset{p' \in \mathbb{Z}^6}{\text{minimize}} \quad & \tfrac{1}{2}|p - p'|_1 \\ \text{subject to} \quad & p'_i \ge 0, \quad i = 1, \ldots, 6 \\ & p'_1 + p'_2 + p'_3 = R, \quad p'_4 + p'_5 + p'_6 = S \\ & x' = 2p'_1 + p'_2, \quad y' = 2p'_4 + p'_5 \\ & u_w(p)(Y(x', y') - w) \le 0, \end{aligned} $$
where $u_w(p)$ denotes the sign of $Y(p) - w$. By removing the integrality constraints and projecting down onto two dimensions we get the following relaxation:
$$ \begin{aligned} \text{minimize} \quad & g(x, y) = g_1(x) + g_2(y) \\ \text{subject to} \quad & 0 \le x \le 2R; \quad 0 \le y \le 2S \\ & u_w(p)(Y(x, y) - w) \le 0, \end{aligned} $$
where
$$ g_1(x) = \begin{cases} \frac{x - 2r_0 - r_1}{2} & \text{if } 2(r_0 + r_2) + r_1 \ge x > 2r_0 + r_1 \\ \frac{2r_0 + r_1 - x}{2} & \text{if } r_1 \le x < 2r_0 + r_1 \\ r_2 + x - 2(r_0 + r_2) - r_1 & \text{if } 2R \ge x > 2(r_0 + r_2) + r_1 \\ r_0 + r_1 - x & \text{otherwise} \end{cases} $$
and
$$ g_2(y) = \begin{cases} \frac{y - 2s_0 - s_1}{2} & \text{if } 2(s_0 + s_2) + s_1 \ge y > 2s_0 + s_1 \\ \frac{2s_0 + s_1 - y}{2} & \text{if } s_1 \le y < 2s_0 + s_1 \\ s_2 + y - 2(s_0 + s_2) - s_1 & \text{if } 2S \ge y > 2(s_0 + s_2) + s_1 \\ s_0 + s_1 - y & \text{otherwise.} \end{cases} $$
We say that $(x, y)$ is feasible if it satisfies the constraints of this relaxed problem.
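For concreteness, $g_1$ translates directly into code (a sketch of our own; $g_2$ is identical with the control counts):

```python
def g1(x, r0, r1, r2):
    """Minimum (fractional) number of single-entry changes to the case counts
    needed to reach case allele count x, following the piecewise definition."""
    lo, hi = 2 * r0 + r1, 2 * (r0 + r2) + r1
    if lo < x <= hi:
        return (x - lo) / 2.0
    if r1 <= x <= lo:
        return (lo - x) / 2.0
    if x > hi:
        return r2 + x - hi
    return r0 + r1 - x   # the remaining case, x < r1
```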
Algorithm 4 first solves this relaxed problem by iterating over a small set of possible solutions (each of which can be found in constant time using the quadratic equation and some basic facts about convex optimization), then rounding to find a solution to the original problem. A proof of correctness as well as a few other details are given in the rest of this section.
Note that our algorithm assumes that $w \ge \frac{2N-1}{2N}$. This requirement, however, is not a problem, since in practice this corresponds to a rather large p-value (greater than .05 for $N \ge 3$). To accommodate this requirement, the only change we need to make in our neighbor picking algorithm is to round $w_{dp}$ up to $\frac{2N-1}{2N}$ if this condition is not met. It is also worth noting that this algorithm relies on being able to check, for a given $\delta$, if there exists a feasible $x, y \in \mathbb{Z}$ with $\beta_1(x) + \beta_2(y) = \delta$ in constant time, where
$$ \beta_1(x) = \begin{cases} \lceil g_1(x) \rceil + 1 & \text{if } r_1 = 0 \text{ and } x - 2r_0 - r_1 \text{ odd} \\ \lceil g_1(x) \rceil & \text{else} \end{cases} $$
and
$$ \beta_2(y) = \begin{cases} \lceil g_2(y) \rceil + 1 & \text{if } s_1 = 0 \text{ and } y - 2s_0 - s_1 \text{ odd} \\ \lceil g_2(y) \rceil & \text{else.} \end{cases} $$
We show how to check this condition below.
Theorem 8. Algorithm 4 is correct and can be made to run in constant time.

Proof. This is proven in the rest of the section. □

5.6.2 Proof Overview
In this subsection and those that follow we prove that Algorithm 4 works. To do this, we will fix a particular SNP. This allows us to specify our database with a tuple $p = (r_0, r_1, r_2, s_0, s_1, s_2)$. Let $w$ be the threshold we want to calculate the distance to.

Algorithm 4 Calculates the neighbor distance for SNPs in constant time
Require: $p = (r_0, r_1, r_2, s_0, s_1, s_2)$ with $p_i \ge 0$ for $i = 0, \ldots, 5$; $N$, $R$ and $S$ defined as usual; and threshold $w \ge \frac{2N-1}{2N}$.
  Let $g(x, y) = g_1(x) + g_2(y)$ be defined as in the text.
  Let $C$ denote the curve defined by
  $$ 2N(xS - yR)^2 = RSw(x + y)(2N - x - y). $$
  Find the set $P$ of all points $p \in [0, 2R] \times [0, 2S]$ on the curve $C$ whose tangent line has slope in $\{1, 2, \frac{1}{2}\}$.
  Using the quadratic equation find $Q$, the set of all $p = (p_0, p_1) \in ([0, 2R] \times [0, 2S]) \cap C$ with either
  $$ p_0 \in \{2(r_0 + r_2) + r_1, \ 2r_0 + r_1, \ r_1, \ 0, \ 2R\} $$
  or
  $$ p_1 \in \{2(s_0 + s_2) + s_1, \ 2s_0 + s_1, \ s_1, \ 0, \ 2S\}. $$
  Let $\gamma = \min_{p \in P \cup Q} \lceil g(p) \rceil$.
  if $Y(p) < w$ then
    return $\gamma$
  end if
  for $\delta \in \{\gamma, \ldots, \gamma + 6\}$ do
    if there exists a feasible $x, y \in \mathbb{Z}$ with $\beta_1(x) + \beta_2(y) = \delta$ then
      return $\delta$
    end if
  end for

The first thing to notice is that the neighbor distance from $p$ to a database $p' = (r'_0, r'_1, r'_2, s'_0, s'_1, s'_2)$ is equal to $\frac{1}{2}|p - p'|_1$. This allows us to state the neighbor distance problem as the following optimization problem:
$$ \begin{aligned} \underset{p' \in \mathbb{Z}^6}{\text{minimize}} \quad & \tfrac{1}{2}|p - p'|_1 \\ \text{subject to} \quad & p'_i \ge 0, \quad i = 1, \ldots, 6 \\ & p'_1 + p'_2 + p'_3 = R, \quad p'_4 + p'_5 + p'_6 = S \\ & x' = 2p'_1 + p'_2, \quad y' = 2p'_4 + p'_5 \\ & u_w(p)(Y(x', y') - w) \le 0, \end{aligned} \tag{5.1} $$
where $u_w(p)$ denotes the sign of $Y(p) - w$. We begin by removing the integrality constraints, giving us:
$$ \begin{aligned} \underset{p' \in \mathbb{R}^6}{\text{minimize}} \quad & \tfrac{1}{2}|p - p'|_1 \\ \text{subject to} \quad & p'_i \ge 0, \quad i = 1, \ldots, 6 \\ & p'_1 + p'_2 + p'_3 = R, \quad p'_4 + p'_5 + p'_6 = S \\ & x' = 2p'_1 + p'_2, \quad y' = 2p'_4 + p'_5 \\ & u_w(p)(Y(x', y') - w) \le 0. \end{aligned} \tag{5.2} $$
Since $Y(p')$ only depends on $x = 2r'_0 + r'_1$ and $y = 2s'_0 + s'_1$, we would like to reduce this to a two dimensional optimization problem. To do this reduction, we need to consider $g(x, y) = g_1(x) + g_2(y)$, where
$$ g_1(x) = \begin{cases} \frac{x - 2r_0 - r_1}{2} & \text{if } 2(r_0 + r_2) + r_1 \ge x > 2r_0 + r_1 \\ \frac{2r_0 + r_1 - x}{2} & \text{if } r_1 \le x < 2r_0 + r_1 \\ r_2 + x - 2(r_0 + r_2) - r_1 & \text{if } 2R \ge x > 2(r_0 + r_2) + r_1 \\ r_0 + r_1 - x & \text{otherwise} \end{cases} $$
and
$$ g_2(y) = \begin{cases} \frac{y - 2s_0 - s_1}{2} & \text{if } 2(s_0 + s_2) + s_1 \ge y > 2s_0 + s_1 \\ \frac{2s_0 + s_1 - y}{2} & \text{if } s_1 \le y < 2s_0 + s_1 \\ s_2 + y - 2(s_0 + s_2) - s_1 & \text{if } 2S \ge y > 2(s_0 + s_2) + s_1 \\ s_0 + s_1 - y & \text{otherwise.} \end{cases} $$
The importance of $g$ is demonstrated by the following theorem:

Theorem 9. Consider $(x, y) \in [0, 2R] \times [0, 2S]$. Then
$$ 2g(x, y) = \min_{\substack{p' \text{ feasible} \\ x = 2p'_1 + p'_2,\; y = 2p'_4 + p'_5}} |p - p'|_1. $$
Informally, $g(x, y)$ is the minimum neighbor distance from $D$ to a database with $x = 2p'_1 + p'_2$ and $y = 2p'_4 + p'_5$.
Proof. It suffices to show that $g_1(x)$ is the minimum number of steps needed to reach a database with $x = 2r'_0 + r'_1$ and that $g_2(y)$ is the minimum number of steps to reach a database with $y = 2s'_0 + s'_1$. We will prove it for $g_1(x)$, the other case being similar.

Consider the case that $x > 2r_0 + r_1$, the other case being similar. Since changing $s$ does not change $2r'_0 + r'_1$, we can assume that the $s_i$'s stay fixed. First note that
$$ 2g_1(x) \ge \min_{p' \text{ feasible},\; x = 2p'_1 + p'_2} |p - p'|_1, $$
since the right hand side can be achieved by increasing $r'_0$ and decreasing $r'_2$ until either $x = 2r'_0 + r'_1$ or until $r'_2 = 0$, in which case we decrease $r'_1$ and increase $r'_0$ until $x = 2r'_0 + r'_1$.

Moreover, this number of steps is optimal. To see this fact, consider $p'$ that achieves the above minimum, and note we can write
$$ (r'_0, r'_1, r'_2) - (r_0, r_1, r_2) = u_1(1, 0, -1) + u_2(1, -1, 0). $$
Then
$$ 2r_0 + r_1 + 2u_1 + u_2 = 2r'_0 + r'_1 = x, $$
while
$$ |(r_0, r_1, r_2) - (r'_0, r'_1, r'_2)|_1 = |u_1 + u_2| + |u_2| + |u_1| = |u_1| + |x - 2r_0 - r_1 - 2u_1| + |x - 2r_0 - r_1 - u_1|. $$
The fact that $p'$ is nonnegative is equivalent to $u_1 \le r_2$, $u_2 \le r_1$ and $u_1 + u_2 \ge -r_0$.

Consider the case that $2u_1 < x - 2r_0 - r_1$; this implies $u_2 > 0$. If $u_1 = r_2$ this gives us that $|(r_0, r_1, r_2) - (r'_0, r'_1, r'_2)|_1 = 2g_1(x)$. If, on the other hand, $u_1 < r_2$, choose $\epsilon > 0$ so that $u_1 + \epsilon \le r_2$. Let $v_1 = u_1 + \epsilon$ and $v_2 = u_2 - 2\epsilon$. Then if
$$ (r''_0, r''_1, r''_2) = (r_0, r_1, r_2) + v_1(1, 0, -1) + v_2(1, -1, 0), $$
we see
$$ |(r_0, r_1, r_2) - (r''_0, r''_1, r''_2)|_1 < |(r_0, r_1, r_2) - (r'_0, r'_1, r'_2)|_1, $$
while $2r''_0 + r''_1 = x$ and $r''_i \ge 0$ for all $i$. This contradicts the minimality of $p'$.

Applying similar arguments to the other cases gives us our result. □
It is worth noting for later that this theorem has the following corollary:

Corollary 3. Consider $(x, y) \in [0, 2R] \times [0, 2S]$ integral. Then
$$ 2\beta_1(x) + 2\beta_2(y) = \min_{\substack{p' \in \mathbb{Z}^6 \text{ feasible} \\ x = 2p'_1 + p'_2,\; y = 2p'_4 + p'_5}} |p - p'|_1. $$
This result implies that the optimal solution to our relaxed problem is equal to the optimal solution of
$$ \begin{aligned} \underset{x, y}{\text{minimize}} \quad & g(x, y) = g_1(x) + g_2(y) \\ \text{subject to} \quad & 0 \le x \le 2R; \quad 0 \le y \le 2S \\ & u_w(p)(Y(x, y) - w) \le 0, \end{aligned} \tag{5.3} $$
while the solution to our initial integral problem is equal to the solution of
$$ \begin{aligned} \underset{x, y \in \mathbb{Z}}{\text{minimize}} \quad & \beta_1(x) + \beta_2(y) \\ \text{subject to} \quad & 0 \le x \le 2R; \quad 0 \le y \le 2S \\ & u_w(p)(Y(x, y) - w) \le 0. \end{aligned} \tag{5.4} $$
It is important to note that $\{p \mid \frac{2N}{RS}(p_0 S - p_1 R)^2 \le w(p_0 + p_1)(2N - p_0 - p_1)\}$ is a convex set; more specifically, it is the convex hull of an ellipse. To see this, note that if $w < 0$ it is the empty set, and if $w = 0$ it is a line. If, on the other hand, $w > 0$, it is easy to see that $2N(xS - yR)^2 \le wRS(x + y)(2N - x - y)$ can be rewritten as $(x, y)Q(x, y)^T + q_1 x + q_2 y + q_3 \le 0$, where $Q$ is a positive semidefinite matrix. This implies that the constraint set is a (filled in) ellipse, and thus is convex.

In order to solve this relaxed problem we iterate through two sets of possible solutions, one corresponding to extreme points and one corresponding to tangent points (Figure 5-1). Using this solution, we are able to find the exact neighbor distance.
5.6.3 Proof for Significant SNPs
We first prove that Algorithm 4 works as advertised on significant SNPs (that is to say, when $Y(p) \ge w$). We will first prove correctness, then prove it can be made to run in constant time.

Theorem 10. Algorithm 4 returns the neighbor distance for significant SNPs.
Proof. Assume we are looking at a significant SNP. Note that $|g(x, y) - \beta_1(x) - \beta_2(y)| \le 4$.

Assume that $\delta$ is the solution to the optimization problem in Equation 5.3, and that $(x', y')$ is the associated argmin. Then $(x', y')$ must lie on the level set
$$ D_\delta = \{(a, b) \mid 2R \ge a \ge 0, \ 2S \ge b \ge 0, \ g(a, b) = \delta\}. $$
Note, however, that this set is the union of line segments, where each line segment has slope in $\{1, -1, 2, -2, .5, -.5\}$. Either $(x', y')$ is an endpoint of one of these segments or is in the interior of one of them.

Figure 5-1: Our algorithm for finding the solution, $\delta$, of our relaxed optimization problem relies on the fact that there are only two possible types of solutions: (a) tangent point solutions and (b) extreme point solutions. Our algorithm finds all such extreme points and tangent points, and iterates over them to find the solution to our relaxed optimization problem.

At the same time it must also be the case that
$$ \frac{2N}{RS}(x'S - y'R)^2 = w(x' + y')(2N - x' - y'). $$
Let $C$ be the curve defined by this equation. It is important to note that $C$ is a smooth curve, so we can find its derivatives. If $(x', y')$ is in the interior of a line segment $l \subseteq D_\delta$, we see that $l$ must lie tangent to $C$, so $(x', y')$ must be one of the points of $C$ whose tangent line has slope in $\{1, -1, 2, -2, .5, -.5\}$. Moreover, we know that since $0 \le x' \le 2R$ and $0 \le y' \le 2S$ the slope of $C$ at $(x', y')$ is positive, so $(x', y')$ must be one of the points of $C$ whose tangent line has slope in $\{1, .5, 2\}$, the set we denoted by $P$ in Algorithm 4.

Therefore either $(x', y') \in P$ or $(x', y')$ is the endpoint of some segment in $D_\delta$. If it is an endpoint, however, it must be that $(x', y') \in Q$. Therefore it must be that $(x', y') \in P \cup Q$, so the quantity $\gamma$ calculated by Algorithm 4 is equal to $\lceil g(x', y') \rceil$.
We next want to show that there exists an integer pair $(x, y)$ with $0 \le x \le 2R$, $0 \le y \le 2S$, $|x - x'| \le 1$, $|y - y'| \le 1$, and
$$ \frac{2N}{RS}(xS - yR)^2 \le w(x + y)(2N - x - y). $$
We will prove this fact in the case that $x'S - y'R \ge 0$ and $R \le S$, the other cases being similar. Form $(x, y)$ by rounding $x'$ down and $y'$ up. If $xS - yR \ge 0$ it must be that $(x, y)$ is contained in the triangle whose vertices are $(x', y')$, $(0, 0)$ and $(2R, 2S)$. Note that if $p$ is one of these vertices then
$$ \frac{2N}{RS}(p_0 S - p_1 R)^2 \le w(p_0 + p_1)(2N - p_0 - p_1), $$
so it follows by convexity that
$$ \frac{2N}{RS}(xS - yR)^2 \le w(x + y)(2N - x - y). $$
Therefore consider the case that $xS - yR < 0$. Let $E$ be the set of all $p \in [0, 2R] \times [0, 2S]$ with
$$ \frac{2N}{RS}(p_0 S - p_1 R)^2 \le w(p_0 + p_1)(2N - p_0 - p_1). $$
Note that, since $w \ge \frac{2N-1}{2N}$, $(0, 2)$ and $(2R, 2S - 2)$ must be in the feasible region $E$. Therefore, since $(0, 2)$, $(0, 0)$, $(2R, 2S - 2)$ and $(2R, 2S)$ are all in $E$, by convexity the parallelogram formed by them, denoted by $W$, must also be in $E$. Let $F$ be the closure of $[0, 2R] \times [0, 2S] - W$. Note $(x', y') \in F$, since $(x', y')$ is on the boundary of $E$. Similarly, if $(x, y) \notin E$ then $(x, y) \in [0, 2R] \times [0, 2S] - W \subseteq F$. Note, however, that $F$ has two components, $E_0$ and $E_1$, where $p_0 S - p_1 R > 0$ if $(p_0, p_1) \in E_0$ and $p_0 S - p_1 R < 0$ if $p \in E_1$. This fact implies $(x', y') \in E_0$ and $(x, y) \in E_1$. Note, however, that the $L_2$ distance between any point in $E_1$ and any point in $E_0$ is at least $\sqrt{2}$. By definition, however, $|(x, y) - (x', y')|_2 < \sqrt{2}$. This is a contradiction, so it must be that $(x, y) \in E$, as desired.

Note $g(x', y') \le g(x, y) \le g(x', y') + 2 \le \beta_1(x) + \beta_2(y) + 2$. Furthermore,
$$ \beta_1(x) + \beta_2(y) \le g(x, y) + 4 \le g(x', y') + 6. $$
Finally, note that if $(a, b) \in E$ with $(a, b) \in \mathbb{Z}^2$ then
$$ g(x', y') \le g(a, b) \le \beta_1(a) + \beta_2(b). $$
This fact implies that the neighbor distance is somewhere between $\gamma$ and $\gamma + 6$. Therefore Algorithm 4 returns the neighbor distance. □
Theorem 11. Algorithm 4 requires a constant number of arithmetic operations.

Proof. First, note that $P$ can be generated in a constant number of operations using ideas from basic calculus.

We can also generate $Q$ in constant time. This can be done by first iterating over all $p_0 \in \{2(r_0 + r_2) + r_1, 2r_0 + r_1, r_1, 0, 2R\}$ and, for each of them, using the quadratic equation to find $p_1$ so that $Y(p_0, p_1) = w$, then doing something similar for each $p_1 \in \{2(s_0 + s_2) + s_1, 2s_0 + s_1, s_1, 0, 2S\}$.

Since $|P \cup Q| \le 26$ and $g$ takes constant time to calculate, it follows that calculating $\gamma$ takes a constant number of arithmetic operations.

All that remains to prove is that, for a given integer $\delta$, deciding if there is an integer $(x, y) \in [0, 2R] \times [0, 2S]$ with $\frac{2N}{RS}(xS - yR)^2 \le w(x + y)(2N - x - y)$ and $\beta_1(x) + \beta_2(y) = \delta$ can be done in a constant number of arithmetic operations. To see this fact, consider
the case when $r_1 > 0$ and $s_1 > 0$, the other cases being similar. Note we can write
$$ \beta_1(x) = \begin{cases} \frac{x - 2r_0 - r_1}{2} & \text{if } 2(r_0 + r_2) + r_1 \ge x > 2r_0 + r_1 \text{ and } x - 2r_0 - r_1 \text{ even} \\ \frac{x - 2r_0 - r_1 + 1}{2} & \text{if } 2(r_0 + r_2) + r_1 \ge x > 2r_0 + r_1 \text{ and } x - 2r_0 - r_1 \text{ odd} \\ \frac{2r_0 + r_1 - x}{2} & \text{if } r_1 \le x < 2r_0 + r_1 \text{ and } x - 2r_0 - r_1 \text{ even} \\ \frac{2r_0 + r_1 - x + 1}{2} & \text{if } r_1 \le x < 2r_0 + r_1 \text{ and } x - 2r_0 - r_1 \text{ odd} \\ r_2 + x - 2(r_0 + r_2) - r_1 & \text{if } 2R \ge x > 2(r_0 + r_2) + r_1 \\ r_0 + r_1 - x & \text{otherwise} \end{cases} $$
and
$$ \beta_2(y) = \begin{cases} \frac{y - 2s_0 - s_1}{2} & \text{if } 2(s_0 + s_2) + s_1 \ge y > 2s_0 + s_1 \text{ and } y - 2s_0 - s_1 \text{ even} \\ \frac{y - 2s_0 - s_1 + 1}{2} & \text{if } 2(s_0 + s_2) + s_1 \ge y > 2s_0 + s_1 \text{ and } y - 2s_0 - s_1 \text{ odd} \\ \frac{2s_0 + s_1 - y}{2} & \text{if } s_1 \le y < 2s_0 + s_1 \text{ and } y - 2s_0 - s_1 \text{ even} \\ \frac{2s_0 + s_1 - y + 1}{2} & \text{if } s_1 \le y < 2s_0 + s_1 \text{ and } y - 2s_0 - s_1 \text{ odd} \\ s_2 + y - 2(s_0 + s_2) - s_1 & \text{if } 2S \ge y > 2(s_0 + s_2) + s_1 \\ s_0 + s_1 - y & \text{otherwise.} \end{cases} $$

The above tells us we can define $c_0, \ldots, c_7$ with $c_i \in \{0, 1\}$, intervals $U_0, \ldots, U_7$ in $\mathbb{R}$, and rationals $v_0, \ldots, v_7, d_0, \ldots, d_7 \in \mathbb{Q}$ so that
$$ \beta_1(x) = v_i x + d_i \quad \text{iff} \quad x \in U_i \text{ and } x \equiv c_i \pmod 2. $$
Similarly we can define $c'_0, \ldots, c'_7$ with $c'_j \in \{0, 1\}$, intervals $U'_0, \ldots, U'_7$ in $\mathbb{R}$, and rationals $v'_0, \ldots, v'_7, d'_0, \ldots, d'_7 \in \mathbb{Q}$ so that
$$ \beta_2(y) = v'_j y + d'_j \quad \text{iff} \quad y \in U'_j \text{ and } y \equiv c'_j \pmod 2. $$
Thus $\beta_1(x) + \beta_2(y) = \delta$ if and only if, for some $i, j \in \{0, \ldots, 7\}$, we have that
$$ v_i x + v'_j y + d_i + d'_j = \delta, $$
subject to $x \in U_i$, $y \in U'_j$, $x \equiv c_i \pmod 2$ and $y \equiv c'_j \pmod 2$. For a given $i$ and $j$ we can easily check if such an integral $x$ and $y$ exist.

More explicitly, let $\alpha = (1, -v_i/v'_j)$ and $\beta = (0, (\delta - d_i - d'_j)/v'_j)$; then $\alpha t + \beta$ is a parameterization of
$$ \{(x, y) \mid v_i x + v'_j y + d_i + d'_j = \delta\}. $$
Since $v_i, v'_j \in \{1, -1, \frac{1}{2}, -\frac{1}{2}\}$, either $\alpha$ has integer entries, or it has some entry that is not an integer but is an integer divided by two. Since $\alpha_0 = 1$ we see that if $\alpha t + \beta$ is integral then $t$ must be an integer. Moreover, we see that $\alpha t + \beta$ is integral if and only if $\alpha(t + 4) + \beta$ is, and if both are integral they equal one another mod 2. We can easily find the interval $[\gamma_1, \gamma_2]$ so that $\alpha t + \beta \in U_i \times U'_j$ if and only if $t \in [\gamma_1, \gamma_2]$. Similarly, using the quadratic equation we can find the interval $[f_1, f_2]$ so that $Y(\alpha t + \beta) \le w$ if and only if $t \in [f_1, f_2]$. If we let $[r, s]$ be the intersection of these two intervals, it follows that, if $(x, y) = \alpha t + \beta$, then $\frac{2N}{RS}(xS - yR)^2 \le w(x + y)(2N - x - y)$ and $\beta_1(x) + \beta_2(y) = \delta$ if and only if $t \in [r, s]$. It is easy to check in constant time if there is a $t_0 \in [r, s]$ such that
$$ \alpha t_0 + \beta \equiv (c_i, c'_j) \pmod 2. $$
This follows from the fact that if $t$ is a solution so is $t + 4$; hence there is such a $t_0$ iff there exists such a $t_0 \in \{\lceil r \rceil, \ldots, \lceil r \rceil + 4\}$ that also satisfies the conditions required.

In conclusion, we can check in constant time if, given an integral $\delta$, there is an integer $(x, y) \in [0, 2R] \times [0, 2S]$ with $\frac{2N}{RS}(xS - yR)^2 \le w(x + y)(2N - x - y)$ and $\beta_1(x) + \beta_2(y) = \delta$, by iterating over all choices of $i$ and $j$. Therefore this algorithm runs in constant time. □
5.6.4 Proof for Non-Significant SNPs
Note that an almost identical argument to the above can be used for non-significant SNPs if we are willing to do a similar rounding procedure. In order to make the algorithm more efficient, however, we show that a much simpler rounding procedure works.

Assume we are given a threshold $w$ and a database with $p = (r_0, r_1, r_2, s_0, s_1, s_2)$, where $p_i \ge 0$ for all $i$ and $Y(p) < w$. Then we can calculate $\rho(p, \{v \mid Y(v) \ge w\})$ in constant time. In order to perform this calculation, consider Equation 5.3. We note the following:

Definition 2. A real valued function $f$ is quasi-convex if, for every $a \in \mathbb{R}$, $f^{-1}((-\infty, a))$ is a convex set.

Lemma 1. $Y : \mathbb{R}^2 \to \mathbb{R}$ is continuous and quasi-convex on $[0, 2R] \times [0, 2S]$.
Proof. We know that $Y$ is continuous from the previous section (assuming we let $Y(0, 0) = Y(2R, 2S) = 0$).

To see that it is quasi-convex, note $Y(x, y) \le a$ if and only if $2N(xS - yR)^2 \le aRS(x + y)(2N - x - y)$. If $a < 0$ this is an empty set, and if $a = 0$ it is a line. If, on the other hand, $a > 0$, it is easy to see that $2N(xS - yR)^2 \le aRS(x + y)(2N - x - y)$ can be rewritten as $(x, y)Q(x, y)^T + q_1 x + q_2 y + q_3 \le 0$, where $Q$ is positive semidefinite. This fact implies $2N(xS - yR)^2 \le aRS(x + y)(2N - x - y)$ defines a (filled in) ellipse, and thus is convex. □
Theorem 12. Algorithm 4 is correct on non-significant SNPs and runs in constant time.
Proof. The proof that it runs in constant time is the same as for significant SNPs, so we need only prove correctness.

To see this, assume that $\delta$ is the solution to Equation 5.3. Let
$$ C_\delta = \{(x, y) \mid g(x, y) \le \delta\}. $$
Then $C_\delta$ is a convex polygon. Furthermore, by the definition of $\delta$ we see that
$$ w = \max_{(x, y) \in C_\delta} Y(x, y). $$
Since $Y$ is quasi-convex and $C_\delta$ is a convex polygon, these facts imply that there is some extreme point of $C_\delta$, let us call it $(x, y)$, so that $Y(x, y) = w$. Since $(x, y)$ is an extreme point of $C_\delta$ it must be that
$$ x \in \{2(r_0 + r_2) + r_1, \ 2r_0 + r_1, \ r_1, \ 0, \ 2R\} $$
or
$$ y \in \{2(s_0 + s_2) + s_1, \ 2s_0 + s_1, \ s_1, \ 0, \ 2S\}, $$
so $(x, y) \in Q$ by definition. Thus
$$ \delta = \min_{p \in Q} g(p) = \min_{p \in P \cup Q} g(p). $$
Assume
$$ x \in \{2(r_0 + r_2) + r_1, \ 2r_0 + r_1, \ r_1, \ 0, \ 2R\}, $$
the other case being similar. Then note that this assumption implies $g_1(x) = \beta_1(x)$ by definition.

We therefore consider $y$. By the proof of Theorem 9 we know that there exist $s'_0, s'_1, s'_2 \ge 0$ so that $s'_0 + s'_1 + s'_2 = S$, $2s'_0 + s'_1 = y$ and
$$ 2g_2(y) = |(s_0, s_1, s_2) - (s'_0, s'_1, s'_2)|_1. $$
Moreover, the proof implies that either $s'_0 = 0$, $s'_2 = 0$ or $s'_1 = s_1$. Assume that $s'_1 = s_1$, the other cases being similar. Then let us define
$$ u = (u_0, u_1, u_2) = (\lceil s'_0 \rceil, s'_1, \lfloor s'_2 \rfloor) $$
and
$$ v = (v_0, v_1, v_2) = (\lfloor s'_0 \rfloor, s'_1, \lceil s'_2 \rceil), $$
and note that $2u_0 + u_1 \ge 2s'_0 + s'_1 \ge 2v_0 + v_1$, where $y = 2s'_0 + s'_1$, so by quasi-convexity either $w = Y(x, y) \le Y(x, 2u_0 + u_1)$ or $w = Y(x, y) \le Y(x, 2v_0 + v_1)$. Assume $w \le Y(x, 2u_0 + u_1)$, the other case being identical. Note that by construction $|u - (s'_0, s'_1, s'_2)|_1 < 2$, so
$$ \tfrac{1}{2}|u - (s_0, s_1, s_2)|_1 \le \tfrac{1}{2}\left(|u - (s'_0, s'_1, s'_2)|_1 + |(s'_0, s'_1, s'_2) - (s_0, s_1, s_2)|_1\right) < 1 + g_2(y). $$
Note, however, that $\tfrac{1}{2}|u - (s_0, s_1, s_2)|_1$ is an integer, so it must be that
$$ \tfrac{1}{2}|u - (s_0, s_1, s_2)|_1 \le \lceil g_2(y) \rceil. $$
Note also that, taking $(r'_0, r'_1, r'_2)$ as in the proof of Theorem 9 (so that $2r'_0 + r'_1 = x$ and $2g_1(x) = |(r_0, r_1, r_2) - (r'_0, r'_1, r'_2)|_1$), the database $(r'_0, r'_1, r'_2, u_0, u_1, u_2)$ is feasible given the constraints in Equation 5.1, with associated score bounded above by
$$ g_1(x) + \lceil g_2(y) \rceil = \lceil g_1(x) + g_2(y) \rceil = \lceil \delta \rceil, $$
so the optimal value for Equation 5.1 is bounded above by $\lceil \delta \rceil$.

At the same time, since $\delta$ is the solution to Equation 5.3, we see that the solution to Equation 5.1 must be greater than or equal to $\delta$. Since it must also be an integer, it follows that the optimal solution must be greater than or equal to $\lceil \delta \rceil$. Putting this all together proves that the optimal solution to Equation 5.1 equals
$$ \lceil \delta \rceil = \left\lceil \min_{p \in P \cup Q} g(p) \right\rceil = \gamma, $$
and thus Algorithm 4 is correct for non-significant SNPs. □
5.7 Results: Applying the Neighbor Mechanism to Real-world Data

5.7.1 Measuring Utility
We apply our improved neighbor mechanisms to the rheumatoid arthritis GWAS dataset described earlier in this chapter. In order to do so we use the following standard measure of performance [105]. Let $A$ be the top $m_{ret}$ scoring SNPs, and let $B$ be the $m_{ret}$ SNPs returned by some differentially private algorithm. We then measure the utility of the mechanism by considering $\frac{|A \cap B|}{|A|}$, that is to say, the percentage of SNPs that are correct. The closer to one this quantity is, the better.

Note that this is only one measure of utility. Others might also look at other measures; after all, the difference between the $m_{ret}$th highest scoring SNP and the next highest scoring SNP may be small, and this measure does not consider that. We use this measure both because of its simplicity and because it has been used in previous works [105].
5.7.2 Comparison to Other Approaches
We first want to see how our method compares utility-wise with other methods that have been developed, namely the score and Laplacian methods [95]. In order to test these methods we ran our algorithm and both of the other algorithms for various $m_{ret}$ and $\varepsilon$ to compare utility. The results can be seen in Figure 5-2. We see that in all cases our modified neighbor method (red) outperforms the Laplacian (green) and score (blue) based methods by a large margin.
Figure 5-2: We measure the performance of our modified neighbor method for picking top SNPs (red) as well as the score based (blue) and Laplacian based (green) methods for $m_{ret}$ (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15 for varying values of $\varepsilon$. For $m_{ret} = 3, 5$ we consider $\varepsilon$ between 0 and 5, while in the other cases we consider $\varepsilon$ between 0 and 30. We see that in all four graphs our method leads to the best performance by far. These results are averaged over 20 iterations.
5.7.3 Comparison to Arbitrary Boundary Value
We also compared our modified neighbor method to the traditional neighbor method with a predefined cutoff $w$. In particular we consider using a cutoff corresponding to a Bonferroni corrected p-value of .05 and .01. The results are pictured in Figure 5-3. When $m_{ret} = 15$ we see that as $\varepsilon$ increases the utility of our method (red) increases towards one, while the utility of the other methods (green for .05, blue for .01) seems to plateau around .85. This result demonstrates the advantages of using adaptively chosen boundaries, even if in some cases ($m_{ret} \in \{3, 5, 10\}$) doing so leads to slightly decreased utility. Moreover, by changing the balance between $\varepsilon_1$ and $\varepsilon_2$ it seems plausible that even this slight decrease can be overcome.
5.7.4 Runtime
We test the runtime of our method on real data. In particular, we look at how long it takes to calculate the neighbor distance for all SNPs, since this is the time consuming step. In the past others have had to implement approximate versions of the neighbor distance to make it run in a reasonable time [95]. We implemented a simple hill climbing algorithm similar to Uhler et al. [95]. We then tested it for various boundaries, based off the number of SNPs we want to return (see Table 5.1). We see that our method is much faster than the approximate method, taking only about 3 seconds total to estimate the neighbor distances for all SNPs, regardless of the choice of $m_{ret}$. Moreover, we see that the approximate method gives results that can greatly differ from the exact one, as demonstrated by the average error per SNP (that is to say, how far off the approximate result is from the actual result). This is in contrast to our result, which does not have any error.
5.8 Output Perturbation
Aside from picking high scoring SNPs, we are also interested in estimating their scores.
In the past these estimates have been achieved by applying the Laplacian mechanism
to the output of the allelic test statistic. It turns out, however, that in practice this choice is not optimal. Here we show how to achieve improved output perturbation in scenario 2 by perturbing the square root of the allelic test statistic, and we apply this to the Laplacian based mechanism for picking high scoring SNPs (Algorithm 5). In the next section we go even further, showing how to improve performance using input perturbation.

Figure 5-3: We measure the performance of our modified neighbor method for picking top SNPs (in red) as well as the traditional neighbor method with cutoffs corresponding to a Bonferroni corrected p-value of .05 (in green) and .01 (in blue) for $m_{ret}$ (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15 for varying values of $\varepsilon$. For $m_{ret} = 3, 5$ we consider $\varepsilon$ between 0 and 5, while in the other cases we consider $\varepsilon$ between 0 and 30. We see that in the first three cases the traditional method slightly outperforms ours. When $m_{ret} = 15$, however, the traditional methods can only get maximum utility around .85, whereas ours can get utility arbitrarily close to 1. This shows how we are able to overcome one of the major concerns about the neighbor method with only minimal cost. These results are averaged over 20 iterations.

Table 5.1: We demonstrate the runtime of our exact method as well as the approximate method for various boundaries (where the boundary at $m_{ret}$ is the average of the $m_{ret}$th and $(m_{ret}+1)$st highest scoring SNPs), as well as the average $L_1$ error per SNP that comes from using the approximate method. We see that our exact method is much faster than the approximate method. In addition, its runtime is fairly steady for all choices of $m_{ret}$. We see the approximate method is faster and more accurate for larger $m_{ret}$; this makes sense since the average SNP will be closer to the boundary, so there will be less loss. These results are averaged over 20 trials.

m_ret | Our Runtime  | Approx Method Runtime | Approx Method Error
3     | 3.0 seconds  | 71.15 seconds         | 22.15
5     | 3.0 seconds  | 53.4 seconds          | 13.77
10    | 3.05 seconds | 38.2 seconds          | 7.62
15    | 3.05 seconds | 31.85 seconds         | 5.76
Algorithm 5 The Laplacian method for picking top $m_{ret}$ SNPs
Require: Data set $D$, number of SNPs to return $m_{ret}$, privacy value $\varepsilon$, and score function $q$ that takes in a SNP and a dataset and returns a score.
Ensure: A list of $m_{ret}$ SNPs
  Let $\Delta q = \max_{i = 1, \ldots, m,\; D \sim D'} |q(i, D) - q(i, D')|$.
  Let $s_i = q(i, D) + \mathrm{Lap}\left(0, \frac{2 m_{ret} \Delta q}{\varepsilon}\right)$ for all $i$.
  return The $m_{ret}$ SNPs with highest $s_i$
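A sketch of Algorithm 5 (our own; the noise scale follows the $2 m_{ret} \Delta q / \varepsilon$ calibration above):

```python
import numpy as np

def laplacian_top_snps(scores, m_ret, eps, sensitivity):
    """Algorithm 5 sketch: perturb every SNP's score with Laplace noise of
    scale 2*m_ret*sensitivity/eps, then return the m_ret highest noisy scores."""
    scale = 2.0 * m_ret * sensitivity / eps
    noisy = np.asarray(scores, dtype=float) + np.random.laplace(0.0, scale, len(scores))
    return list(np.argsort(noisy)[::-1][:m_ret])
```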
5.8.1 Calculating Sensitivity
In order to apply output perturbation to a function q we need to calculate q's sensitivity. We discuss this below for both the allelic test statistic and its square root.
In scenario 1, if we choose $q$ to be the allelic test statistic, then we can use the sensitivity calculated by Uhler et al. [95].

In scenario 2, if we choose $q$ to be $\sqrt{Y}$, we can let
$$ f(x) = \sqrt{\frac{2N}{RS(2s_0 + s_1 + x)(2N - s_1 - 2s_0 - x)}}\;|xS - (2s_0 + s_1)R|. $$
Then $f(2r_0 + r_1) = \sqrt{Y}$. This implies
$$ \Delta\sqrt{Y} = \max_{x, y \in \{0, \ldots, 2R\},\; |x - y| \le 2} |f(x) - f(y)|. $$
We do not, however, have to iterate over all such $x$ and $y$. Instead, let $Q$ be the set of all solutions to $f''(x) = 0$ with $0 \le x \le 2R$, unioned with $\{0, 2R, \frac{(2s_0 + s_1)R}{S}\}$. Let $Q' = \{\lfloor w \rfloor \mid w \in Q\} \cup \{\lceil w \rceil \mid w \in Q\}$. Then it follows from basic calculus that
$$ \Delta\sqrt{Y} = \max_{x \in Q',\; y \in \{0, \ldots, 2R\},\; |x - y| \le 2} |f(x) - f(y)|. $$
Note that $Q$ contains at most six elements and can be found easily (it comes down to finding the roots of a degree four polynomial), and as such it is easy to calculate this maximum in constant time.

In the remaining cases we estimate the sensitivity by brute force over all possible values of $2r_0 + r_1$ and $2s_0 + s_1$ under the corresponding scenarios. We did not bother deriving the sensitivity analytically because these methods are shown to be less than optimal in the following sections.
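As a sanity check on the constant time formula, the scenario-2 sensitivity of $\sqrt{Y}$ can also be estimated by the brute force scan the text mentions. A sketch (ours, assuming the denominator of $f$ stays positive over the scanned range):

```python
import numpy as np

def sqrt_y_sensitivity(s0, s1, R, S, N):
    """Brute-force scenario-2 sensitivity of sqrt(Y): scan every case allele
    count x in {0, ..., 2R} and compare f(x) with its neighbors at distance
    one and two (the constant-time method only inspects a few critical points)."""
    y = 2 * s0 + s1
    xs = np.arange(0, 2 * R + 1, dtype=float)
    f = np.sqrt(2.0 * N / (R * S * (y + xs) * (2.0 * N - y - xs))) * np.abs(xs * S - y * R)
    return max(np.abs(f[1:] - f[:-1]).max(), np.abs(f[2:] - f[:-2]).max())
```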
5.8.2 Output Perturbation
We first consider how Laplacian based output perturbation affects accuracy. Previous methods have worked by estimating the sensitivity of the allelic test statistic and applying the Laplacian mechanism to it directly [95]. Since we are now able to calculate the sensitivity of all of the above methods, we want to test how well each of them performs. In order to do this comparison we ran each method on our GWAS data and compared the errors, where the error is measured by $L_1$ error. We shall see that, in scenario 2, our square root based approach outperforms previous approaches.

Figure 5-4: Comparing two forms of output perturbation in scenario 2: the first coming from applying the Laplace mechanism directly to the allelic test statistic (green), the other applying it to the square root of the allelic test statistic and then squaring the result (blue), comparing the $L_1$ error on the y axis with $\varepsilon$ on the x axis. We first apply it to 1000 random SNPs (a), then to the top ten highest scoring SNPs (b). We see that in both cases applying the mechanism to the square root outperforms the standard approach.
Scenario 2

We apply the methods to scenario 2. The results are pictured in Figure 5-4. In Figure 5-4(a) we choose 1000 SNPs at random and apply the Laplacian mechanism both directly to the allelic test statistic (the green curve) and to the square root of the allelic test statistic, squaring the result (the blue curve). We see that using the square root is the preferable method.

We also consider how the two approaches compare on high scoring SNPs (since we are most interested in these SNPs). In order to test this we measured the error on the 10 highest scoring SNPs, the result being pictured in Figure 5-4(b). We see that in this case applying the Laplace mechanism to the square root still outperforms the standard approach by a large amount.
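A sketch of the square root trick (ours; `sqrt_sensitivity` stands for the $\Delta\sqrt{Y}$ computed in Section 5.8.1):

```python
import numpy as np

def sqrt_output_perturbation(y_stat, eps, sqrt_sensitivity):
    """Scenario-2 output perturbation: add Laplace noise to sqrt(Y) and
    square the result, instead of perturbing Y directly."""
    noisy_root = np.sqrt(y_stat) + np.random.laplace(0.0, sqrt_sensitivity / eps)
    return noisy_root ** 2
```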
Figure 5-5: Comparing the output perturbation of the allelic test statistic for scenarios 1 and 2, comparing the $L_1$ error on the y axis with $\varepsilon$ on the x axis. In scenario 2 (the blue curve) we add the noise to the square root and then square the result, whereas for scenario 1 (the green curve) we apply the Laplacian mechanism directly to the test statistic (this choice is motivated by the previous figures). We first apply it to 1000 random SNPs (a), then to the top ten highest scoring SNPs (b). We see that scenario 2 requires much less noise than scenario 1.
Across Scenario Comparison

We now want to compare across scenarios. In particular, we wonder if choosing not to hide genomic information can lead to utility gains. To test this we took the best approach under each of the above regimes and compared the results (e.g., for scenario 2 we used the square root approach, while for scenario 1 we added the noise directly to the statistic). The results are plotted in Figure 5-5. In Figure 5-5(a) we plot the result of applying output perturbation to 1000 random SNPs under scenario 1 (green curve) and scenario 2 (blue curve), while in Figure 5-5(b) we do the same with the top ten highest scoring SNPs. In both cases we see that scenario 2 gives us huge gains in utility.
5.8.3 Picking Top SNPs with the Laplacian Mechanism
We can apply the above output perturbation method to picking high scoring SNPs using Algorithm 5. The first setup uses q equal to the allelic test statistic in scenario 1; the second uses the square root of the allelic test statistic in scenario 2. The motivation for considering these is that, as we saw above, these are the two output perturbation approaches that perform best on GWAS data.
The results of this test are pictured in Figure 5-6. We apply both methods to the RA dataset, using $m_{ret} \in \{3, 5, 10, 15\}$ for Figures 5-6(a), 5-6(b), 5-6(c), and 5-6(d) respectively. In each of the figures the blue curve plots utility against $\epsilon$ for scenario 1 and the green curve plots utility against $\epsilon$ for scenario 2. We see that, in all four cases, scenario 2 performs best, greatly outperforming scenario 1, which is the setup previously considered in the literature [95].
5.9 Input Perturbation
We already saw what happens when, instead of perturbing the allelic test statistic with the Laplace mechanism, we perturb the square root of the allelic test statistic and square the result. In this section we consider using the Laplace mechanism to perturb the input instead of the output and releasing the result. Our results indicate that input perturbation yields vast improvements over output perturbation.
The method works as follows. Let $x = 2r_0 + r_1$ and $y = 2s_0 + s_1$. If $x'$ and $y'$ are the corresponding quantities for a neighboring database, then $|x - x'| + |y - y'| \le 2$. Therefore we can let
$$x_{dp} = x + \mathrm{Lap}\left(0, \frac{2}{\epsilon}\right) \quad \text{and} \quad y_{dp} = y + \mathrm{Lap}\left(0, \frac{2}{\epsilon}\right),$$
so that $(x_{dp}, y_{dp})$ is an $\epsilon$-differentially private estimate of $(x, y)$. We can then estimate $Y$ in a differentially private form using the equation
$$\frac{2N(x_{dp}S - y_{dp}R)^2}{RS(x_{dp} + y_{dp})(2N - x_{dp} - y_{dp})}.$$
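A sketch of this input perturbation scheme in Python; reading $r_0$ and $r_1$ as the homozygous and heterozygous minor allele counts among cases ($s_0$, $s_1$ the same for controls) and $R$, $S$ as the case and control cohort sizes is our assumption, matching the standard allelic test.

```python
import numpy as np

def input_perturbed_allelic(r0, r1, s0, s1, R, S, epsilon, rng):
    # Perturb the aggregate allele counts x and y once, then plug the
    # noisy counts into the allelic test statistic.
    N = R + S
    x = 2 * r0 + r1
    y = 2 * s0 + s1
    x_dp = x + rng.laplace(0.0, 2.0 / epsilon)
    y_dp = y + rng.laplace(0.0, 2.0 / epsilon)
    num = 2 * N * (x_dp * S - y_dp * R) ** 2
    den = R * S * (x_dp + y_dp) * (2 * N - x_dp - y_dp)
    return num / den
```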
Figure 5-6: We measure the performance of the Laplacian method for picking top SNPs in scenarios 1 (in blue) and 2 (in green) with $m_{ret}$ (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15 for varying values of $\epsilon$. For $m_{ret} = 3, 5$ we consider $\epsilon$ between 1 and 10, while in the other cases we consider $\epsilon$ between 10 and 100. We see in all four graphs that scenario 2 leads to the best performance. Scenario 1, which is the one that appeared in previous work, leads to a greater loss of utility. These results are averaged over 100 iterations.
Figure 5-7: Comparing the output perturbation of the allelic test statistic in scenario 2 (blue) to the input perturbation method in scenario 1 (green). We see that in this case, as opposed to previous cases, scenario 1 outperforms scenario 2 despite requiring stronger privacy guarantees. This demonstrates that input perturbation is preferable to output perturbation.
We compare the error in this estimate with the methods obtained above (we do not include the output perturbation method under scenario 1, since the output method under scenario 2 outperforms it). The results are shown in Figure 5-7. We see that input perturbation under scenario 1 actually performs much better than the output perturbation method in scenario 2 (and thus also better than output perturbation in scenario 1).
How about input perturbation in scenario 2? There we can use
$$\frac{2N(x_{dp}S - yR)^2}{RS(x_{dp} + y)(2N - x_{dp} - y)}.$$
We compare both input perturbation methods in Figure 5-8. In this figure we see that scenario 1 (the green line) performs worse than scenario 2 (blue), both when choosing 1000 random SNPs and when only considering the highest scoring SNPs. Taken together, we see that input perturbation seems preferable in practice to output perturbation.
Figure 5-8: Comparing the input perturbation of the allelic test statistic for scenarios 1 (green) and 2 (blue), comparing the L1 error on the y axis with $\epsilon$ on the x, for (a) all SNPs and (b) the 10 highest scoring SNPs. In both cases we use input perturbation. We see that scenario 1 requires more noise to be added.
Chapter 6
Correcting for Population Structure in Differentially Private GWAS
6.1 Introduction
The motivation behind genome-wide association studies (GWAS) is not only to find associations between alleles and phenotypes, but to find associations that are biologically meaningful. This task, however, can be difficult thanks to systematic differences between human populations. It is often the case that biologically meaningful mutations are inherited jointly with mutations that have no such meaning, leading to false GWAS hits.
A classic example of this phenomenon is given by the lactase gene. A variant of this gene is responsible for the ability to digest lactose (such as in milk) in adulthood, and is more common in Northern Europeans than in East Asians. Northern Europeans are also, on average, taller than East Asians. Thus a naive statistical method implies that this variant is related to height, though in practice it is not believed to be.
In order to avoid this so-called population stratification, various methods have
been suggested (see Chapter 2 for more details). One increasingly popular approach
is to use linear mixed models (LMMs) to perform association studies. It has been
shown that LMMs help correct for population structure in practice [31].
In this chapter we extend the idea of privacy preserving GWAS studies to LMMs.
We focus on preventing a particular type of attack, namely phenotype disclosure, that is, disclosure of private phenotype data. This means we ignore any leaked information revealing whether an individual participated in the study, while also ignoring any leakage of genotype information that might occur. We feel this is justified for numerous reasons. First, in many cases study participation is not private information (for example, it does not hurt to know someone is in the control population), so there is no reason to spend energy trying to hide this information. We choose to ignore leaked genotype information both for practical reasons (it makes the analysis approachable, and all known attacks against GWAS test statistics result in disclosure of phenotype or participation) and because genotype information is easy to get access to (by acquiring a physical sample and genotyping it). It is up to the particular user to decide if such limitations are acceptable.
6.2 Previous Work
In the previous chapter we discussed the previous work that has been done on applying differential privacy to GWAS [66, 67, 95, 106, 46]. These works have focused on the Pearson and allelic test statistics. Though these statistics can be useful in many situations, they are not considered to be the correct statistics to use in the presence of population structure (see Chapter 2).
In order to deal with population structure numerous techniques, including genomic control [16], EigenStrat [21], and linear mixed models (LMMs) [31], have been suggested. In recent years there has been growing interest in using LMMs for this task. This interest has largely been spurred by increasingly efficient algorithms that allow LMMs to be applied to larger and larger data sets [27, 23, 94, 43].
Previous work on differentially private LMMs has focused on estimating either variance components (using somewhat heuristic methods) [4] or regression coefficients [11]. Here we instead focus on using LMMs to determine which SNPs are significant, touching only briefly on the estimation of variance components.
More specifically, we consider the approach taken by EMMA [27]. Here we assume our phenotype $y$ is generated by
$$y = Z\delta + X\beta + \epsilon$$
where $X$ is a normalized genotype matrix, $Z$ is the covariate matrix, $\delta$ is a vector of fixed effects, $\epsilon$ is drawn from $N(0, \sigma_e^2 I)$, and $\beta$ is drawn from $N(0, \frac{\sigma_g^2}{m} I)$. Using either maximum likelihood (ML) or restricted maximum likelihood (REML) we estimate $\sigma_e^2$ and $\sigma_g^2$. In this chapter we will ignore covariates (that is, assume $Z$ is the null matrix), so REML and ML are the same. Let $\hat{\sigma}_g^2$ and $\hat{\sigma}_e^2$ be these estimates.
We want to test the null hypothesis that the $i$th SNP has no effect. In practice this test should be done by considering the model
$$y = x_i\delta_i + X_{-i}\beta + \epsilon$$
where $x_i$ is the $i$th column of $X$ and $X_{-i}$ is $X$ with the $i$th column removed. We can then test the null hypothesis that $\delta_i = 0$ by fitting this model and calculating
$$\frac{(x_i^T K_i^{-1} y)^2}{x_i^T K_i^{-1} x_i}$$
where $K_i = \sigma_e^2 I + \frac{\sigma_g^2}{m} X_{-i}X_{-i}^T$. This statistic is approximately $\chi^2$ distributed.
In practice this approach is time consuming. Moreover, it is much more difficult to come up with a privacy preserving version of this method. Therefore, instead of re-estimating the ML estimates of $\sigma_g^2$ and $\sigma_e^2$ for each $i$ using
$$y = x_i\delta_i + X_{-i}\beta + \epsilon$$
we estimate them once for
$$y = X\beta + \epsilon$$
and then use the statistic
$$\frac{(x_i^T K^{-1} y)^2}{x_i^T K^{-1} x_i}$$
where $K = \sigma_e^2 I_n + \frac{\sigma_g^2}{m} X X^T$. This is similar to the approach taken by EMMA [27], except ours uses a different statistic. It has been shown to be a reasonable statistic when no one SNP has too large an effect.
6.3 Our Contributions
In this chapter we present, to our knowledge, the first attempt to use mixed linear models to perform association studies in a privacy preserving manner. More specifically, we focus on estimating the $\chi^2$ statistic introduced in the previous section.

In order to estimate this we make a few assumptions. We assume that $\sigma_e^2$ and $\sigma_g^2$ have already been estimated, either in a privacy preserving manner or not (we touch briefly on how to do this later in the chapter). We also assume that we have some bound on the values that the phenotype, $y$, can take; this bound can be achieved either by using known information about the phenotype (its range in the background population, or, for disease status, the fact that it can take on only a small number of values), or by releasing information about the range of the phenotype in the study population.
As mentioned above, we do not attempt to prevent the leakage of genotype information, only phenotype information (this goal can be seen as similar to releasing genotype information for the control cohort in the previous chapter). How can we justify this assumption? There are three motivations. The first is that genomic information can be collected easily by gathering biological samples from an individual, whereas phenotype data might be more difficult to get hold of. More than that, it seems that attacks against GWAS statistics result in either knowledge about phenotype [63] or knowledge about participation in a study [60]. In many studies, however, knowledge about participation is not particularly damaging; for example, if the study consists of individuals in an EHR database then all participation tells us is that the given individual has their genotype on record with that particular hospital. The final reason for not trying to protect genomic data is that it makes the analysis much easier, and seems to enable us to get away with adding less noise than we would have to otherwise. Depending on the application this may or may not be a safe assumption, but it is a first step.
6.4 GWAS and LMM
As mentioned above, we assume that we know bounds on $y$; that is, we know numbers $a$ and $b$ such that $a \le y_i \le b$ for all $i$. There are numerous ways to choose $a$ and $b$. If the trait $y$ is bounded for some reason (such as $y$ being a 0-1 trait like disease status) this is easy. One could also use prior knowledge about the trait to choose $a$ and $b$ so that all $y_i$ are between $a$ and $b$ with high probability (those that are not can be either ignored or rounded to the interval $[a, b]$). The approach we take below is to set $a = \min_i(y_i)$ and $b = \max_i(y_i)$. These choices release a little bit of information about our cohort, but in practice this probably affects privacy only negligibly (again this should be considered on a case by case basis; in practice one should actually use the induced neighborhood mechanism [69], although we ignore that here, which can be achieved by adjusting $\epsilon$ in all the following mechanisms).
With the above setup we would like to be able to compute the $\chi^2$ statistic for the $i$th SNP. More formally, if $K$ is as above we want to calculate
$$\frac{(x_i^T K^{-1} R y)^2}{x_i^T K^{-1} x_i}$$
where $R = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ centers $y$. In order to calculate this in an $\epsilon$-differentially private way we note that this statistic equals $(p_i y)^2$ where
$$p_i = \frac{x_i^T K^{-1} R}{\sqrt{x_i^T K^{-1} x_i}}.$$
Therefore, in order to give a differentially private estimate of the $\chi^2$ statistic it suffices to get a differentially private estimate of $p_i y$ and square it.

This seems like a difficult task, since $K$ depends on the participants' genomes, and since we are inverting $K$ things get messy. Note, however, that since we are not trying to hide leakage of genomic information we can assume that $p_i$ is fixed and it is only $y$ that is changing.
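Because $p_i$ depends only on genotypes and the variance components, it can be computed once in the clear and treated as a public constant. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def snp_direction(X, i, sigma_e2, sigma_g2):
    # p_i = x_i^T K^{-1} R / sqrt(x_i^T K^{-1} x_i), treated as public
    # since it depends only on genotypes and variance components.
    n, m = X.shape
    x_i = X[:, i]
    K = sigma_e2 * np.eye(n) + (sigma_g2 / m) * (X @ X.T)
    R = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kinv_x = np.linalg.solve(K, x_i)      # K^{-1} x_i (K is symmetric)
    return (Kinv_x @ R) / np.sqrt(x_i @ Kinv_x)
```

The $\chi^2$ estimate for SNP $i$ is then (p_i @ y) ** 2, with only y private.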
Therefore, what we are trying to come up with is a random function $X : [a, b]^n \to \mathbb{R}$ such that, if $y$ and $y'$ differ in exactly one coordinate, then for all $S \subseteq \mathbb{R}$ we have
$$P(X(y) \in S) \le \exp(\epsilon)P(X(y') \in S).$$
Below we will introduce two methods to achieve differentially private estimates of $p_i y$. We will see that each method works well in a certain domain, that is, for different choices of $\epsilon$ and different amounts of population stratification.
6.5 Achieving Differential Privacy Attempt One: Laplace Mechanism
We begin by applying the Laplace mechanism. If $f_i(y) = p_i y$, then $f_i$ has sensitivity
$$\Delta f_i = \max_j |p_{ij}|(b - a),$$
where $p_{ij}$ is the $j$th coordinate of $p_i$. We can then apply the Laplace mechanism to define $F_{\epsilon,i}(y)$ as a random variable drawn from a Laplace distribution with mean $p_i y$ and scale $\Delta f_i / \epsilon$.
Theorem 13. $F_{\epsilon,i}$ is $\epsilon$-differentially private.
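As a sketch, $F_{\epsilon,i}$ is a one-liner on top of the $p_i$ computed above; the scale $\Delta f_i/\epsilon$ is the standard Laplace mechanism calibration.

```python
import numpy as np

def laplace_release(p_i, y, a, b, epsilon, rng):
    # F_{eps,i}: Laplace noise scaled to the sensitivity of
    # f_i(y) = p_i . y, namely max_j |p_ij| * (b - a).
    sensitivity = np.max(np.abs(p_i)) * (b - a)
    return p_i @ y + rng.laplace(0.0, sensitivity / epsilon)
```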
6.6 Achieving Differential Privacy Attempt Two: Exponential Mechanism
We also use the exponential mechanism [78]. We will consider a function $q_i : [a, b]^n \times \mathbb{R} \to \mathbb{Z}^+ \cup \{\infty\}$ which will serve as a loss function. Informally, $q_i(y, c)$ will be equal to the number of coordinates of $y$ that need to be changed to reach a $y'$ so that $p_i y' = c$ (this is similar to the method proposed in [67]). This intuition can be formalized as
$$q_i(y, c) = \min_{y' \in [a, b]^n,\ p_i y' = c} \|y - y'\|_0.$$
Note that $q_i(y, c) = \infty$ if there is no such $y'$, and that $q_i$ has sensitivity $\Delta q = 1$.
Assuming we are given $\epsilon$, we can define a random function $G_{\epsilon,i}$ so that $G_{\epsilon,i}(y)$ has a density function $P(G_{\epsilon,i}(y) = c)$ proportional to $\exp(-\frac{\epsilon}{2} q_i(y, c))$.

Theorem 14. $G_{\epsilon,i}(y)$ is $\epsilon$-differentially private.

Proof. This follows directly from McSherry and Talwar [78]. $\square$
Having defined $G_{\epsilon,i}(y)$ we want to be able to sample from it. This task can be achieved with Algorithm 6 by setting $w = \frac{\epsilon}{2}$.
Algorithm 6 Sampling from a distribution with density proportional to $\exp(-w q_i(y, c))$
Require: $y$, $w$, $p_i$, $a$, $b$
Ensure: A sample from the density proportional to $\exp(-w q_i(y, c))$
Let $u_j = \max(p_{ij}(b - y_j), p_{ij}(a - y_j))$ for all $j$.
Let $l_j = \min(p_{ij}(b - y_j), p_{ij}(a - y_j))$ for all $j$.
Let $i_1, \ldots, i_n$ be a permutation of $1, \ldots, n$ such that $u_{i_1} \ge \cdots \ge u_{i_n}$.
Let $j_1, \ldots, j_n$ be a permutation of $1, \ldots, n$ such that $l_{j_1} \le \cdots \le l_{j_n}$.
Let $U_0 = L_0 = p_i y$, $U_k = p_i y + \sum_{t=1}^k u_{i_t}$ and $L_k = p_i y + \sum_{t=1}^k l_{j_t}$ for $k = 1, \ldots, n$.
Choose $k \in \{1, \ldots, n\}$ with probability proportional to $\exp(-wk)(U_k - U_{k-1} + L_{k-1} - L_k)$.
Choose $x$ uniformly at random from $[L_k, L_{k-1}) \cup (U_{k-1}, U_k]$.
Return $x$
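The sampler translates directly into numpy. The sketch below follows the reconstruction of Algorithm 6 above; a production version would compute the weights in log space, since $\exp(-wk)$ underflows for large $wk$.

```python
import numpy as np

def sample_algorithm6(y, w, p, a, b, rng):
    # Sample c with density proportional to exp(-w * q_i(y, c)).
    n = len(y)
    base = float(p @ y)
    u = np.maximum(p * (b - y), p * (a - y))   # largest increase per coordinate
    l = np.minimum(p * (b - y), p * (a - y))   # largest decrease per coordinate
    U = base + np.cumsum(np.sort(u)[::-1])     # U_1, ..., U_n
    L = base + np.cumsum(np.sort(l))           # L_1, ..., L_n
    U_prev = np.concatenate(([base], U[:-1]))  # U_{k-1}
    L_prev = np.concatenate(([base], L[:-1]))  # L_{k-1}
    widths = (U - U_prev) + (L_prev - L)       # measure of {c : q_i(y, c) = k}
    weights = np.exp(-w * np.arange(1, n + 1)) * widths
    k = rng.choice(n, p=weights / weights.sum())
    # Uniform draw from [L_k, L_{k-1}) union (U_{k-1}, U_k]
    t = rng.uniform(0.0, widths[k])
    left = L_prev[k] - L[k]
    return L[k] + t if t < left else U_prev[k] + (t - left)
```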
Theorem 15. If we let $w = \frac{\epsilon}{2}$ then Algorithm 6 returns a sample from $G_{\epsilon,i}(y)$.
Proof. It suffices to show that, for any $w > 0$, Algorithm 6 returns a sample from a distribution $W$ with density proportional to $\exp(-w q_i(y, c))$. Let $U_k$, $L_k$, $l_k$ and $u_k$ be as in Algorithm 6.

Assume that $y$ and $y'$ differ in at most $k$ coordinates. Then
$$p_i y - p_i y' = \sum_j p_{ij}(y_j - y'_j) \le -(l_{j_1} + \cdots + l_{j_k})$$
so
$$p_i y' \ge p_i y + \sum_{t=1}^k l_{j_t} = L_k.$$
Similarly
$$p_i y' \le p_i y + \sum_{t=1}^k u_{i_t} = U_k,$$
so if $q_i(y, c) \le k$ then $L_k \le c \le U_k$. It is easy to see, however, that if $L_k \le c \le U_k$ then $q_i(y, c) \le k$, so $q_i(y, c) = k$ if and only if $c \in [L_k, L_{k-1}) \cup (U_{k-1}, U_k]$. Therefore the probability that $q_i(y, W) = k$ is proportional to
$$\exp(-wk)(L_{k-1} - L_k + U_k - U_{k-1}),$$
while the density of $W$ at $c$ conditional on $q_i(y, c) = k$ (that is, $P(W = c \mid q_i(y, c) = k)$) is proportional to 1 if $c \in [L_k, L_{k-1}) \cup (U_{k-1}, U_k]$ and 0 otherwise. Therefore we can sample from $W$ by first picking $k \in \{1, \ldots, n\}$ proportional to $\exp(-wk)(L_{k-1} - L_k + U_k - U_{k-1})$. Moreover, the density function for $W$ conditional on $q_i(y, c) = k$ must, by definition, be uniform on $\{c \mid q_i(y, c) = k\}$ and 0 elsewhere. Therefore we can sample from $W$ by picking $k$ and then choosing $c$ uniformly at random from $[L_k, L_{k-1}) \cup (U_{k-1}, U_k]$. This is exactly what Algorithm 6 does. $\square$
Algorithm 7 An $\epsilon$-differentially private estimate of $p_i y$
Require: $y$, $\epsilon$, $p_i$, $a$, $b$, $k$
Let $w_0 < w_1 < \cdots < w_k$ be a grid of candidate noise parameters.
Let $v_1, \ldots, v_n$ be a permutation of $|p_{i1}|, \ldots, |p_{in}|$ in decreasing order.
Let $d_x = \exp(w_x) + \frac{(\exp(-w_x) - 1)v_1}{\sum_{j=1}^n v_j \exp(-w_x j)}$ for $x = 0, \ldots, k$.
Let $e_x = w_x + \log(d_x)$ for $x = 0, \ldots, k$.
Let $w = \max\{w_x : e_x \le \epsilon\}$.
Run Algorithm 6 with $y$, $w$, $p_i$, $a$, $b$ as above and return the answer.
Note that the above result is based on the general analysis of the exponential mechanism [78]. By giving an analysis tailored to our situation, however, we are able to improve upon this algorithm to produce Algorithm 7. To do that we need the following theorem.
Theorem 16. Let $W$ be a random function so that $W(y)$ has density proportional to $\exp(-w q_i(y, c))$. If we assume that $|p_{ij}|$ decreases as $j$ increases and let
$$d_w = \exp(w) + \frac{(\exp(-w) - 1)|p_{i1}|}{\sum_{j=1}^n |p_{ij}|\exp(-wj)}$$
then $W$ is $(w + \log(d_w))$-differentially private.
More than that, this is tight in the sense that, if $\epsilon' < w + \log(d_w)$, then $W$ is not $\epsilon'$-differentially private.
Proof. Let $Z(y)$ be the normalization constant, so that
$$P(W = c) = \frac{\exp(-w q_i(y, c))}{Z(y)}$$
is the density of $W$ at $c$. The above theorem follows if we can show that $\frac{Z(y')}{Z(y)} \le d_w$ for all neighboring $y$ and $y'$, by using the same proof method used to justify the exponential mechanism [78]. To simplify notation let $v = p_i$ and $q = q_i$.
Let
$$Z_j = |\{c : q(y, c) \le j\}|$$
(the measure of this set); then we want to show that
$$Z_j \ge \sum_{k=1}^j |v_k|(b - a).$$
To see this, let $U_k$, $L_k$, $u_k$, $l_k$, $i_k$ and $j_k$ be as in Algorithm 6. Then by the proof of Theorem 15 we see that
$$Z_j = \sum_{k=1}^j (u_{i_k} - l_{j_k}).$$
By the definition of $u_{i_k}$ and $l_{j_k}$ this must be at least
$$\sum_{k=1}^j \bigl(\max(v_k(b - y_k), v_k(a - y_k)) - \min(v_k(b - y_k), v_k(a - y_k))\bigr) = \sum_{k=1}^j \max(v_k(b - a), v_k(a - b)) = \sum_{k=1}^j |v_k|(b - a),$$
which is just what we wanted. Note, however, that since
$$Z(y) = \exp(-w)Z_1 + \sum_{j=2}^{n} \exp(-wj)(Z_j - Z_{j-1}) = \exp(-wn)Z_n + \sum_{j=1}^{n-1}\bigl(\exp(-wj) - \exp(-w(j+1))\bigr)Z_j,$$
we can apply the above result and some basic algebra to show that
$$Z(y) \ge \sum_{j=1}^n |v_j|(b - a)\exp(-wj).$$
Assume that $y$ and $y'$ differ in the $k$th coordinate. Choose $r_1$, $r_2$, $r'_1$, $r'_2$ such that
$$k = i_{r_1} = j_{r_2} = i'_{r'_1} = j'_{r'_2},$$
where $u'$, $l'$, $i'_x$ and $j'_x$ are defined for $y'$ as the $u$, $l$, $i_x$ and $j_x$ are for $y$. Note that either $r_1 \ge r'_1$ and $r'_2 \ge r_2$, or $r_1 \le r'_1$ and $r'_2 \le r_2$. We assume without loss of generality that $r_1 \ge r'_1$ and $r'_2 \ge r_2$. We can then write
$$Z(y) = \sum_{j=1}^n (u_{i_j} - l_{j_j})\exp(-wj).$$
To simplify notation let
$$A_1 = \sum_{j=1}^{r'_1 - 1} u_{i_j}\exp(-wj) + \sum_{j=r_1+1}^{n} u_{i_j}\exp(-wj) - \sum_{j=1}^{r_2 - 1} l_{j_j}\exp(-wj) - \sum_{j=r'_2+1}^{n} l_{j_j}\exp(-wj),$$
$$A_2 = \sum_{j=r'_1}^{r_1 - 1} u_{i_j}\exp(-wj), \qquad A_3 = -\sum_{j=r_2+1}^{r'_2} l_{j_j}\exp(-wj).$$
We see that
$$Z(y) = A_1 + A_2 + A_3 + u_{i_{r_1}}\exp(-wr_1) - l_{j_{r_2}}\exp(-wr_2).$$
Note that $i_x = i'_x$ if $x > r_1$ or $x < r'_1$, while $j_x = j'_x$ if $x > r'_2$ or $x < r_2$. Similarly $u_x = u'_x$ and $l_x = l'_x$ for $x \ne k$ (since $y_x = y'_x$). Putting this together we see that
$$Z(y') = A_1 + \exp(-w)A_2 + \exp(w)A_3 + u'_{i'_{r'_1}}\exp(-wr'_1) - l'_{j'_{r'_2}}\exp(-wr'_2)$$
$$\le A_1 + A_2 + \exp(w)A_3 + u'_{i'_{r'_1}}\exp(-wr'_1) - l'_{j'_{r'_2}}\exp(-wr'_2)$$
$$= Z(y) - A_3 - u_{i_{r_1}}\exp(-wr_1) + l_{j_{r_2}}\exp(-wr_2) + \exp(w)A_3 + u'_{i'_{r'_1}}\exp(-wr'_1) - l'_{j'_{r'_2}}\exp(-wr'_2)$$
$$= Z(y) - u_{i_{r_1}}\exp(-wr_1) + l_{j_{r_2}}\exp(-wr_2) + (\exp(w) - 1)A_3 + u'_{i'_{r'_1}}\exp(-wr'_1) - l'_{j'_{r'_2}}\exp(-wr'_2).$$
Since we are trying to maximize $Z(y')$ we want to make this last expression as large as possible. Note, however, that
$$A_3 \le Z(y) - u_{i_1}\exp(-w) + l_{j_1}\exp(-w) \le Z(y) - |v_1|(b - a)\exp(-w),$$
where the second inequality follows from the fact that $Z_j \ge \sum_{k=1}^j |v_k|(b - a)$ (proved above). Plugging this into the above inequality we get that
$$Z(y') \le Z(y) - u_{i_{r_1}}\exp(-wr_1) + l_{j_{r_2}}\exp(-wr_2) + (\exp(w) - 1)\bigl(Z(y) - |v_1|(b - a)\exp(-w)\bigr) + u'_{i'_{r'_1}}\exp(-wr'_1) - l'_{j'_{r'_2}}\exp(-wr'_2).$$
Note that
$$u'_{i'_{r'_1}}\exp(-wr'_1) - l'_{j'_{r'_2}}\exp(-wr'_2) \le (u'_{i'_{r'_1}} - l'_{j'_{r'_2}})\exp(-w) = |v_k|(b - a)\exp(-w) \le |v_1|(b - a)\exp(-w).$$
Plugging this into the above and doing some algebra we see that
$$Z(y') \le Z(y)\exp(w) + |v_1|(b - a)(\exp(-w) - 1).$$
Dividing through by $Z(y)$ and using the fact proved above that $Z(y) \ge \sum_{j=1}^n |v_j|(b - a)\exp(-wj)$ gives us
$$\frac{Z(y')}{Z(y)} \le \exp(w) + \frac{|v_1|(b - a)(\exp(-w) - 1)}{\sum_{j=1}^n |v_j|(b - a)\exp(-wj)} = \exp(w) + \frac{(\exp(-w) - 1)|p_{i1}|}{\sum_{j=1}^n |p_{ij}|\exp(-wj)} = d_w.$$
To see that this bound is tight, simply consider $y$ defined so that $y_j = a$ if $p_{ij} < 0$ and $y_j = b$ otherwise, and let $y'$ be defined so that $y'_j = y_j$ for $j > 1$, with $y'_1 = a$ if $y_1 = b$ and $y'_1 = b$ otherwise. $\square$
Corollary 4. Algorithm 7 is $\epsilon$-differentially private.
Note that Algorithm 7 can be sped up by doing a binary search for $w$ instead of a linear one; our Python implementation takes advantage of this fact.
6.7 Results: Testing Our Method
We test our method on real genotype data with simulated phenotype data. The genotype data consists of all the genotypes in HapMap. We generate the phenotype by choosing $\sigma_e^2$ and $\sigma_g^2$, calculating the normalized genotype matrix $X$, and generating $\beta$ from $N(0, \frac{\sigma_g^2}{m} I_m)$ and $\epsilon$ from $N(0, \sigma_e^2 I_n)$, where $m$ is the number of SNPs (in this case we use $m = 10{,}000$).
Figure 6-1: Comparing the Laplacian based method (green) with our neighbor based method (blue) with $\sigma_e^2 = \sigma_g^2 = .5$, on both (a) 1000 random SNPs and (b) the causative SNPs. We see that our method performs better in both cases for the choices of $\epsilon$ considered.
Our phenotype is then
$$y = X\beta + \epsilon.$$
We then apply our method to the resulting phenotype, and see how well both our method and the standard Laplace mechanism do at approximating the $\chi^2$ statistic on 1000 randomly sampled SNPs. We use $(a, b) = (\min_i(y_i), \max_i(y_i))$ for simplicity.
We first consider what happens when $\sigma_e^2 = \sigma_g^2 = .5$. We see that, for $.1 \le \epsilon \le 1.0$, our method performs better than the Laplacian approach (Figure 6-1), both on random SNPs and on causative SNPs.
6.8 Picking Top SNPs

6.8.1 The Approach
Note that we can use the same methods as in the previous chapter to pick the top SNPs, applied to the score function $q_i(y) = |p_i y|$. In particular, note that we can use the method introduced in Section 6.6 to calculate the distance needed for the neighbor mechanism.
We can, however, improve upon the Laplacian and score methods by using a batch mechanism. These improved approaches are described in Algorithms 8 and 9. Note that $\Delta q$ is easy to calculate: for each $i = 1, \ldots, n$, pick the $m_{ret}$ SNPs with the largest values of $|p_{ji}|$, sum those values up to get $\Delta q_i$, and return the maximum over all $i$.
Algorithm 8 The Laplacian method for picking top $m_{ret}$ SNPs with mixed linear models
Require: Data set $y$, $X$; number of SNPs to return $m_{ret}$; privacy value $\epsilon$; and parameters $\sigma_e^2$ and $\sigma_g^2$.
Ensure: A list of $m_{ret}$ SNPs
Calculate the $p_i$, $i = 1, \ldots, m$.
Let $\Delta q = \Delta q_{m_{ret}} = \max_{i=1,\ldots,n}\ \max_{D \subseteq \{1,\ldots,m\},\ |D| = m_{ret}} \sum_{j \in D} |p_{ji}|(b - a)$
Let $s_i = |p_i y| + \mathrm{Lap}(0, \frac{2\Delta q}{\epsilon})$ for all $i$
return The $m_{ret}$ SNPs with highest $s_i$
Algorithm 9 The score method for picking top $m_{ret}$ SNPs with mixed linear models
Require: Data set $y$, $X$; number of SNPs to return $m_{ret}$; privacy value $\epsilon$; and parameters $\sigma_e^2$ and $\sigma_g^2$.
Ensure: A list of $m_{ret}$ SNPs
Calculate the $p_i$, $i = 1, \ldots, m$.
Let $\Delta q = \Delta q_{m_{ret}} = \max_{i=1,\ldots,n}\ \max_{D \subseteq \{1,\ldots,m\},\ |D| = m_{ret}} \sum_{j \in D} |p_{ji}|(b - a)$
Let $s_i = \exp\left(\frac{\epsilon |p_i y|}{2\Delta q}\right)$
Pick $m_{ret}$ SNPs without replacement, where the probability of picking SNP $i$ is proportional to $s_i$
return The $m_{ret}$ SNPs chosen above.
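A compact sketch of both selection mechanisms, with scores[i] standing for $|p_i y|$ and delta_q for $\Delta q_{m_{ret}}$ as defined above; the noise and exponent scales mirror the pseudocode as reconstructed here.

```python
import numpy as np

def top_snps_laplace(scores, delta_q, epsilon, m_ret, rng):
    # Algorithm 8: perturb each score, return the m_ret largest.
    noisy = scores + rng.laplace(0.0, 2.0 * delta_q / epsilon, size=len(scores))
    return np.argsort(noisy)[-m_ret:][::-1]

def top_snps_score(scores, delta_q, epsilon, m_ret, rng):
    # Algorithm 9: sample m_ret SNPs without replacement with
    # weight proportional to exp(eps * |p_i y| / (2 * delta_q)).
    logits = epsilon * scores / (2.0 * delta_q)
    weights = np.exp(logits - logits.max())   # subtract max for stability
    return rng.choice(len(scores), size=m_ret, replace=False,
                      p=weights / weights.sum())
```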
Theorem 17. Algorithms 8 and 9 are $\epsilon$-differentially private.

Proof. The proofs are almost the same as those used by Uhler et al. [95]. $\square$

6.8.2 Application to Data
We use the same data as in Section 6.7, with $\sigma_e^2 = \sigma_g^2 = .5$. We compare all three methods for picking the top $m_{ret}$ SNPs for $m_{ret} \in \{3, 5, 10, 15\}$ (Figure 6-2a-d). We see that in all four cases the score method performs best, followed by the neighbor mechanism and then the Laplacian based method.
Figure 6-2: We measure the performance of the three methods for picking top SNPs using score (blue), neighbor (red) and Laplacian (green) based methods with $m_{ret}$ (the number of SNPs being returned) equal to (a) 3, (b) 5, (c) 10, and (d) 15 for varying values of $\epsilon$ between 10 and 100. We see in all four graphs that the score method leads to the best performance, followed by the neighbor mechanism. These results are averaged over 20 iterations.
Moreover, the dominance of the score method grows with $m_{ret}$; we believe this is due to the batch approach we take with the score mechanism, which allows us to add less noise than would otherwise be required. Strangely enough, this batch approach does not have as large an effect on the Laplacian method.
6.9 Estimating $\sigma_e^2$ and $\sigma_g^2$ in a Differentially Private Manner
In practice, it seems unlikely that releasing $\sigma_e^2$ and $\sigma_g^2$ will lead to any privacy concerns, so it seems reasonable to release them in most cases. In some cases, however, we would like to release differentially private versions of them. This is of particular interest since it has been shown that the statistical power of LMMs is increased when, instead of releasing a single estimate of $\sigma_e^2$ and $\sigma_g^2$, we re-estimate them for each SNP (or each chromosome), with the corresponding SNP (or chromosome) removed.
Again we assume that we know some $a$ and $b$ so that each coordinate of $y$ is between $a$ and $b$. We want to find the $\sigma = (\sigma_e^2, \sigma_g^2)$ that minimizes
$$q_X(\sigma, y) = y^T K_\sigma^{-1} y + \log(\det(K_\sigma))$$
where $K_\sigma = \sigma_e^2 I_n + \frac{\sigma_g^2}{m} X X^T$. Note that this can also be modified to include covariates, but we do not do so here. We then apply a standard approach known as sample-and-aggregate, given in Algorithm 10. Details about this sample-and-aggregate approach can be found in previous work on differential privacy [81, 4].
Algorithm 10 An $\epsilon$-differentially private estimate of $\sigma$
Require: $y$, $\epsilon$, $a$, $b$, $k$
Partition $\{1, \ldots, n\}$ into $k$ roughly equally sized pieces, $S_1, \ldots, S_k$.
Let $v$ be an $\frac{\epsilon}{2}$-differentially private estimate of $\mathrm{var}(y)$ (this can easily be done with the Laplace mechanism; if the result is negative set it to 0).
Let $\sigma_i$ be the $\sigma \in [0, v]^2$ that minimizes $q_i$, where
$$q_i(\sigma, y) = y_i^T K_{i,\sigma}^{-1} y_i + \log(\det(K_{i,\sigma}))$$
and $y_i$ is $y$ restricted to the individuals in $S_i$, and similarly for $K_{i,\sigma}$.
return $\frac{1}{k}\sum_{i=1}^k \sigma_i + (\mathrm{Lap}(0, \frac{2v}{k\epsilon}), \mathrm{Lap}(0, \frac{2v}{k\epsilon}))$
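A sketch of Algorithm 10; the even budget split, the crude sensitivity bound used for var(y), and the aggregation noise scale are assumptions of this sketch rather than constants taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

def dp_variance_components(y, X, a, b, epsilon, k, rng):
    n, m = len(y), X.shape[1]
    # epsilon/2 budget: noisy estimate of var(y); (b - a)^2 / n is a
    # crude (assumed) bound on the sensitivity of the sample variance.
    v = max(np.var(y) + rng.laplace(0.0, 2 * (b - a) ** 2 / (n * epsilon)), 0.0)
    estimates = []
    for S in np.array_split(rng.permutation(n), k):
        yS, XS = y[S], X[S]
        def neg_loglik(s):   # y^T K^{-1} y + log det K on this block
            K = s[0] * np.eye(len(S)) + (s[1] / m) * (XS @ XS.T)
            return yS @ np.linalg.solve(K, yS) + np.linalg.slogdet(K)[1]
        res = minimize(neg_loglik, x0=[v / 2, v / 2],
                       bounds=[(1e-6, v), (1e-6, v)])
        estimates.append(res.x)
    # epsilon/2 budget: each block estimate lies in [0, v]^2, so the
    # mean moves by at most v/k per coordinate when one record changes.
    return np.mean(estimates, axis=0) + rng.laplace(0.0, 2 * v / (k * epsilon), size=2)
```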
6.10 Conclusion and Future Work
In this chapter we propose the first differentially private mechanism for releasing the results of statistical tests using LMMs in GWAS. Though our method effectively deals with the case of releasing a small number of SNPs, it runs into the same trouble that occurs so often in differential privacy, namely that the noise soon overtakes the utility as the number of queries grows. Hopefully future work can help improve this trade-off.
There are numerous directions in which one could try to extend this work. For example, our method for estimating the variance components is based on a general mechanism, and it might be hoped that more specialized approaches could improve on it. In particular, it might be of interest to look at more robust estimators of $\sigma_e^2$ and $\sigma_g^2$ (as opposed to the REML) and try to use them to estimate the variance components. This would allow us to move away from the approximate EMMA-like framework employed here to a framework more similar to the state of the art in LMM testing.
Chapter 7
Conclusion
In this thesis we have explored various methods for preserving patient privacy in
biomedical research. We have focused on two different overarching approaches. One
approach involves modeling possible adversaries to measure privacy risks. The other
approach focuses on preserving privacy in the absence of such a model.
Under the first approach we introduced a measure, known as PrivMAF, that quantifies the probability of re-identification after publishing the MAF from a study. This probability was calculated using realistic models of what knowledge the adversary has and how the patient data was generated.
The second approach was realized using the idea of differential privacy (see earlier
chapters for a definition).
In particular, we showed how to improve the utility of
differentially private medical count queries, providing a possible means for better
designing studies while still preserving patient privacy.
We also applied the ideas
of differential privacy to GWAS. In particular, we showed how to improve both the
runtime and accuracy of previous methods for differentially private GWAS. Moreover,
we extended these methods to new, more powerful GWAS statistics.
It is not yet clear which approach is the most appropriate for ensuring biomedical privacy. On the one hand, model free approaches (such as differential privacy) don't rely on assumptions about what the adversary knows, and only make limited assumptions about the data. This gives us stronger privacy guarantees that don't break down when we overlook new sources of information an attacker might use. At the same time, this increased privacy leads to a loss of accuracy that might be avoided by model based approaches, a loss that is especially pronounced in high dimensional data, such as the kind we find in genomics.
In the end, new methods are needed to help overcome the limitations of both
approaches. This thesis is one small step in that direction.
Bibliography
[1] http://www.healthit.gov/policy-researchers-implementers/meaningful-use-regulations.
[2] http://www.hhs.gov/ocr/privacy/.
[3] http://www.wisn.com/nurses-fired-over-cell-phone-photos-of-patient/8076340.
[4] J Abowd, M Schneider, and L Vilhuber. Differential privacy applications to bayesian and linear mixed model estimation. Journal of Privacy and Confidentiality, 5(1):73-105, 2013.
[5] M Atallah, F Kerschbaum, and W Du. Secure and private sequence comparisons. ACM Workshop Privacy in Electron Soc, pages 39-44, 2003.
[6] E Ayday, J Raisaro, P McLaren, J Fellay, and J Hubaux. Privacy-preserving
computation of disease risk by using genomic, clinical, and environmental data.
USENIX, 2013.
[7] R Bell, P Franks, P Duberstein, R Epstein, M Feldman, E Fernandez y Garcia,
and R Kravi. Suffering in silence: reasons for not disclosing depression in
primary care. Ann Fam Med, (9):439-446, 2011.
[8] D Boneh, A Sahai, and B Waters. Functional encryption: Definitions and
challenges. Proceedings of Theory of Cryptography Conference (TCC), 2011.
[9] R Braun, W Rowe, C Schaefer, J Zhan, and K Buetow. Needles in the haystack:
identifying individuals present in pooled genomic data. PLoS Genet., (10),
2009.
[10] S Brenner. Be prepared for the big genome leak. Nature, 498:139, 2013.
[11] K Chaudhuri, C Monteleoni, and A Sarwate. Differentially private empirical
risk minimization. The Journal of Machine Learning Research, 12:1069-1109,
2011.
[12] The Wellcome Trust Case Control Consortium. Genome-wide association study
of 14,000 cases of seven common diseases and 3000 shared controls. Nature,
447:661-683, 2007.
[13] D Craig, R Goor, Z Wang, J Paschall, J Ostell, M Feolo, S Sherry, and T Manolio. Assessing and mitigating risk when sharing aggregate genetic variant data.
Nat Rev Genet, 12(10):730-736, 2011.
[14] G Danezis and E De Cristofaro. Simpler protocols for privacy-preserving disease
susceptibility testing. GenoPri, 2014.
[15] F Dankar and K El Emam. Practicing differential privacy in health care: A
review. Transactions on Data Privacy, 5:35-67, 2014.
[16] B Devlin and K Roeder. Genomic control for association studies. Biometrics,
55(4):997-1004, 1999.
[17] C Dwork and R Pottenger. Towards practicing privacy. J Am Med Inform Assoc, 20(1):102-108, 2013.
[18] K El Emam, E Jonker, L Arbuckle, and B Malin. A systematic review of
re-identification attacks on health data. PLoS ONE, 6(12), 2011.
[19] Y Erlich and A Narayanan. Routes for breaching and protecting genetic privacy.
Nature Reviews Genetics, 15:409-421, 2014.
[20] A Ghosh et al. Universally utility-maximizing privacy mechanisms. SIAM J Comput, 41(6):1673-1693, 2012.
[21] A Price et al. Principal components analysis corrects for stratification in
genome-wide association studies. Nature Genet, 38:904-909, 2006.
[22] B Anandan et al. t-plausibility: Generalizing words to desensitize text. Transactions on Data Privacy, 5(3):505-534, 2012.
[23] C Lippert et al. Fast linear mixed models for genome-wide association studies.
Nature Methods, 8:833-835, 2011.
[24] F Doshi-Velez et al. Comorbidity clusters in autism spectrum disorder: An
electronic health records time-series analysis. PEDIATRICS, 133:e54-e63, 2014.
[25] F Hormozdiari et al. Privacy preserving protocol for detecting genetic relatives
using rare variants. Bioinformatics, 30(12):204-211, 2014.
[26] G Loukides et al. Anonymization of electronic medical records for validating
genome-wide association studies. PNAS, (17):7898-7903, 2010.
[27] H Kang et al. Variance component model to account for sample structure in
genome-wide association studies. Nat. Genet., 42:348-54, 2010.
[28] H Lowe et al. Stride - an integrated standards-based translational research
informatics platform. AMIA Annu Symp Proc., pages 391-395, 2009.
[29] J Feigenbaum et al. Towards a formal model of accountability. NSPW, 2011.
[30] J Yang et al. Common snps explain a large proportion of the heritability for
human height. Nat. Genet., 42:565-569, 2010.
[31] J Yang et al. Advantages and pitfalls in the application of mixed-model associ-
ation methods. Nat Genet, 46(2):100-6, 2014.
[32] J Zhang et al. Privgene: Differentially private model fitting using genetic algorithms. SIGMOD, 2013.
[33] K El Emam et al. A secure distributed logistic regression protocol for the
detection of rare adverse drug events. JAMIA, 7(7), 2012.
[34] Khokhar et al. Quantifying the costs and benefits of privacy-preserving health
data publishing. JBI, 50:107-121, 2014.
[35] L Bierut et al. Adh1b is associated with alcohol dependence and alcohol consumption in populations of european and african ancestry. Mol Psychiatry,
17(4):445-450, 2012.
[36] L Kamm et al. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics, 29(7), 2013.
[37] M Fredrikson et al. Privacy in pharmacogenetics: An end-to-end case study of
personalized warfarin dosing. USENIX, 2014.
[38] M Humbert et al. Reconciling utility with privacy in genomics. WPES, pages
11-20, 2014.
[39] M Mailman et al.
The ncbi dbgap database of genotypes and phenotypes.
Nature Genet., pages 1181-1186, 2007.
[40] M Wolfson et al. Datashield: resolving a conflict in contemporary bioscience - performing a pooled analysis of individual-level data without sharing the data. Int J Epidemiol., 39(5):1372-1382, 2010.
[41] Meystre et al. Automatic de-identification of textual documents in the electronic
health record: a review of recent research. BMC Med Res Methodol., 10(70),
2010.
[42] P Baldi et al. Countering gattaca: efficient and secure testing of fully-sequenced
human genomes. Proc. 18th ACM Conf. Comput. Commun. Security, pages
691-702, 2011.
[43] P Loh et al. Efficient bayesian mixed model analysis increases association power
in large cohorts. Nat. Genet., pages 284-290, 2015.
[44] P Mohan et al. Gupt: privacy preserving data analysis made easy. SIGMOD,
pages 349-360, 2012.
[45] R Bhaskar et al. Discovering frequent patterns in sensitive data. ACM SIGKDD,
2010.
[46] R Chen et al. A private dna motif finding algorithm. JBI, 50:122-132, 2014.
[47] R Plenge et al. TRAF1-C5 as a risk locus for rheumatoid arthritis - a genomewide study. New England Journal of Medicine, pages 1199-1209, 2007.
[48] S Lee et al. Estimating missing heritability for disease from genome-wide asso-
ciation studies. Am. J. Hum. Genet., 88:294-305, 2011.
[49] S Wieland et al. Revealing the spatial distribution of a disease while preserving
privacy. PNAS, 105(46):17608-17613, 2008.
[50] W Xie et al. Securema: Protecting participant privacy in genetic association
meta-analysis. Bioinformatics, 30(23):3334-3341, 2014.
[51] Y Chen et al. Auditing medical record accesses via healthcare interaction networks. Proceedings of the AMIA Symposium, pages 93-102, 2012.
[52] Y Erlich et al. Redefining genomic privacy: trust and empowerment.
PLOS Biology, 12(11):e1001983, 2014.
[53] Y Zhao et al. Choosing blindly but wisely: differentially private solicitation of
dna datasets for disease marker discovery. JAMIA, 22:100-108, 2015.
[54] Z Huang et al. Genoguard: Protecting genomic data against brute-force attacks.
36th IEEE Symposium on Security and Privacy, 2015.
[55] J Gardner et al. Share: System design and case studies for statistical health information release. JAMIA, 20:109-116, 2013.
[56] C Gentry. Fully homomorphic encryption using ideal lattices. STOC, 2009.
[57] N Gilbert. Researchers criticize genetic data restrictions. Nature, 2008.
[58] S Gupta et al. Modeling and detecting anomalous topic access. Proceedings of the 11th IEEE International Conference on Intelligence and Security
Informatics, pages 100-105, 2013.
[59] M Gymrek, A McGuire, D Golan, E Halperin, and Y Erlich. Identifying personal
genomes by surname inference. Science, 339(6117):321-324, 2013.
[60] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. Pearson, D. Stephan, S. Nelson, and D. Craig. Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS Genet, 4(8), 2008.
[61] J Hsu et al. Differential privacy: an economic method for choosing epsilon. CoRR, abs/1402.3329, 2014.
[62] J Hsu, M Gaboardi, A Haeberlen, S Khanna, A Narayan, B Pierce, and A Roth.
Differential privacy: an economic method for choosing epsilon. Proceedings of
27th IEEE Computer Security Foundations Symposium, 2014.
[63] H Im, E Gamazon, D Nicolae, and N Cox. On sharing quantitative trait gwas
results in an era of multiple-omics data and the limits of genomic privacy. Am
J Hum Genet, 90(4):591-598, 2012.
[64] K Jacobs, M Yeager, S Wacholder, D Craig, P Kraft, D Hunter, J Paschal,
T Manolio, M Tucker, R Hoover, G Thomas, S Chanock, and N Chatterjee. A
new statistic and its power to infer membership in a genome-wide association
study using genotype frequencies. Nat Genet, 41(11):1253-1257, 2009.
[65] X Jiang et al. Privacy technology to support data sharing for comparative
effectiveness research. Medical Care, 51:58-64, 2013.
[66] Y Jiang et al. A community assessment of privacy preserving techniques
for human genomes. BMC Medical Informatics and Decision Making, 14(S1),
2014.
[67] A Johnson and V. Shmatikov. Privacy-preserving data exploration in genomewide association studies. KDD, pages 1079-1087, 2013.
[68] H-W Jung and K El Emam. A linear programming model for preserving privacy
when disclosing patient spatial information for secondary purposes. International Journal of Health Geographics, 13(16), 2014.
[69] D Kifer and A Machanavajjhala. No free lunch in data privacy. SIGMOD, pages
193-204, 2011.
[70] A Korolova. Privacy violations using microtargeted ads: A case study. JPC, 3,
2011.
[71] V Lampos, T De Bie, and N Cristianini. Flu detector - tracking epidemics on twitter. Machine Learning and Knowledge Discovery in Databases, 6323:599-602, 2010.
[72] A Lemke et al. Community engagement in biobanking: Experiences from
the emerge network. Genomics Soc Policy, 6:35-52, 2010.
[73] N Li, T Li, and S Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. ICDE, 2007.
[74] G Loukides, A Gkoulalas-Divanis, and B Malin. Anonymization of electronic medical records for validating genome-wide association studies. PNAS,
107(17):7898-7903, 2010.
[75] T Lumley and K Rice. Potential for revealing individual-level information in
genome-wide association studies. J Am Med Assoc, 303(7):659-660, 2010.
[76] B Malin, K El Emam, and C O'Keefe. Biomedical data privacy: problems,
perspectives and recent advances. J Am Med Inform Assoc, 1:2-6, 2013.
[77] D Manen, A Wout, and H Schuitemaker. Genome-wide association studies on
hiv susceptibility, pathogenesis and pharmacogenomics. Retrovirology, 9(70):1-
8, 2012.
[78] F McSherry and K Talwar. Mechanism design via differential privacy. Proceedings of the 48th Annual Symposium of Foundations of Computer Science,
2007.
[79] S Murphy and H Chueh. A security architecture for query tools used to access
large biomedical databases. JAMIA, 9:552-556, 2002.
[80] S Murphy et al. Strategies for maintaining patient privacy in i2b2. JAMIA,
18:103-108, 2011.
[81] K Nissim, S Raskhodnikova, and A Smith. Smooth sensitivity and sampling in
private data analysis. STOC, pages 75-84, 2007.
[82] D Nyholt, C Yu, and P Visscher. On jim watson's apoe status: genetic infor-
mation is hard to hide. Eur. J. Hum. Genet., 17:147-149, 2009.
[83] J Oliver, M Slashinski, T Wang, P Kelly, S Hilsenbeck, and A McGuire. Balancing the risks and benefits of genomic data sharing: genome research participants' perspectives. Public Health Genom, 15(2):106-114, 2012.
[84] E Ramos, C Din-Lovinescu, E Bookman, L McNeil, C Baker, G Godynskiy,
E Harris, T Lehner, C McKeon, J Moss, V Starks, S Sherry, T Manolio, and
L Rodriguez. A mechanism for controlled access to gwas data: experience of
the gain data access committee. Am J Hum Genet, 92(4):479-488, 2013.
[85] B Reis, I Kohane, and K Madl. Longitudinal histories as predictors of future
diagnoses of domestic abuse. BMJ, 339:b3677, 2009.
[86] L Rodriguez, L Brooks, J Greenberg, and E Green. The complexities of genomic
identifiability. Science, (339):275-276, 2013.
[87] M Saeed et al. Multiparameter intelligent monitoring in intensive care
ii (mimic-ii): A public-access intensive care unit database. Crit Care Med,
39:952-960, 2011.
[88] S Sankararaman, G Obozinski, M Jordan, and E Halperin. Genomic privacy
and the limits of individual detection in a pool. Nat Genet, 41:965-967, 2009.
[89] A Sarwate et al. Sharing privacy-sensitive access to neuroimaging and genetics data: a review and preliminary validation. Frontiers in Neuroinformatics, 8(35):doi:10.3389/fninf.2014.00035, 2014.
[90] E Schadt, S Woo, and K Hao. Bayesian method to predict individual snp
genotypes from gene expression data. Nat Genet, 44(5):603-608, 2012.
[91] L Sweeney. Simple demographics often identify people uniquely. http://dataprivacylab.org/projects/identifiability/, 2010.
[92] L Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10:557-570, 2002.
[93] L Sweeney, A Abu, and J Winn. Identifying participants in the personal genome
project by name. SSRN Electronic Journal, pages 1-4, 2013.
[94] G Tucker, A Price, and B Berger. Improving power in gwas while addressing confounding from population stratification with pc-select. Genetics,
197(3):1044-1049, 2014.
[95] C Uhler, S Fienberg, and A Slavkovic. Privacy-preserving data sharing for
genome-wide association studies.
Journal of Privacy and Confidentiality,
5(1):137-166, 2013.
[96] T van Schaik et al. The need to redefine genomic data sharing: A focus on data accessibility. Applied and Translational Genomics, 3:100-104, 2014.
[97] S Vinterbo et al. Protecting count queries in study design. JAMIA, 19:750-757, 2012.
[98] P Visscher and W Hill. The limits of individual identification from sample allele
frequencies: theory and statistical analysis. PLoS Genet, 5(10), 2009.
[99] D Vu and A Slavkovic. Differential privacy for clinical trial data: Preliminary
evaluations. 2009.
[100] D Vu and A Slavkovic. Differential privacy for clinical trial data: Preliminary
evaluations. Data Mining Workshop, 2009.
[101] L Walker, H Starks, K West, and S Fullerton. dbgap data access requests: a
call for greater transparency. Sci Transl Med, 3(113):1-4, 2011.
[102] S Wang, N Mohammed, and R Chen. Differentially private genome data dissemination through top-down specialization. BMC Medical Informatics and
Decision Making, 14(S1), 2014.
[103] G Weber et al. The shared health research information network (shrine): A prototype federated query tool for clinical data repositories. JAMIA, 16:624-630, 2009.
[104] A Yao. Protocols for secure computations (extended abstract). FOCS, pages
160-164, 1982.
[105] F Yu and Z Ji. Scalable privacy-preserving data sharing methodology for
genome-wide association studies: an application to idash healthcare privacy
protection challenge. BMC Medical Informatics and Decision Making, 14(S1),
2014.
[106] F Yu, M Rybar, C Uhler, and S Fienberg. Differentially private logistic regression for detecting multiple-snp association in gwas databases. Privacy in Statistical Databases,
8744:170-184, 2014.
[107] E Zerhouni and E Nabel. Protecting aggregate genomic data. Science, 321(5898):1278, 2008.
[108] X Zhou, B Peng, Y Li, Y Chen, H Tang, and X Wang. To release or not to release: evaluating information leaks in aggregate human-genome data. ESORICS, pages 607-627, 2011.