Differential Privacy

Differential Privacy:
Theoretical & Practical Challenges
Salil Vadhan
Center for Research on Computation & Society
John A. Paulson School of Engineering & Applied Sciences
Harvard University
on sabbatical at
Shing-Tung Yau Center
Department of Applied Mathematics
National Chiao-Tung University
Lecture at Institute of Information Science, Academia Sinica
November 9, 2015
Data Privacy: The Problem
Given a dataset with sensitive information, such as:
• Census data
• Health records
• Social network activity
• Telecommunications data
• …
and many “desirable uses” for it, such as:
• Academic research
• Informing policy
• Identifying subjects for drug trials
• Searching for terrorists
• Market analysis
• …
How can we:
• enable “desirable uses” of the data
• while protecting the “privacy” of the data subjects?
Approach 1: Encrypt the Data

Name   Sex  Blood  ⋯  HIV?
Chen   F    B      ⋯  Y
Jones  M    A      ⋯  N
Smith  M    O      ⋯  N
Ross   M    O      ⋯  Y
Lu     F    A      ⋯  N
Shah   M    B      ⋯  Y

→ each record is replaced by its encryption, e.g.

100101  001001  110101 ⋯ 110111
101010  111010  111111 ⋯ 001001
001010  100100  011001 ⋯ 110101
001110  010010  110101 ⋯ 100001
110101  000000  111001 ⋯ 010010
111110  110010  000101 ⋯ 110101

Problems?
Approach 2: Anonymize the Data

Sex  Blood  ⋯  HIV?        (Name column removed)
F    B      ⋯  Y
M    A      ⋯  N
M    O      ⋯  N
M    O      ⋯  Y
F    A      ⋯  N
M    B      ⋯  Y

Problems? “Re-identification” is often easy [Sweeney `97].
Approach 3: Mediate Access

[Figure: the table of records is held by a trusted “curator” C, who answers queries q1 → a1, q2 → a2, q3 → a3 from data analysts]

Problems?
Even simple “aggregate” statistics can reveal individual info.
[Dinur-Nissim `03, Homer et al. `08, Mukatran et al. `11, Dwork et al. `15]
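To see why, here is a rough sketch (in Python; not from the lecture) of a reconstruction attack in the spirit of Dinur-Nissim: given many noisy subset-sum answers about a secret 0/1 column, a least-squares solve followed by rounding typically recovers most of the column when the per-answer noise is well below √n. The least-squares shortcut and all names are illustrative simplifications of the actual linear-programming attack.

```python
import numpy as np

def reconstruction_attack(subsets, noisy_sums, n):
    """Given noisy answers to 'how many people in subset S have the attribute?',
    recover (most of) the secret 0/1 attribute column."""
    A = np.array([[1.0 if i in S else 0.0 for i in range(n)] for S in subsets])
    x_ls, *_ = np.linalg.lstsq(A, np.array(noisy_sums), rcond=None)  # relaxed solve
    return (x_ls > 0.5).astype(int)                                  # round to {0,1}

# Toy demonstration on hypothetical data with small per-answer noise
rng = np.random.default_rng(0)
n = 50
secret = rng.integers(0, 2, size=n)                  # e.g. the hidden HIV? column
subsets = [set(np.flatnonzero(rng.integers(0, 2, size=n))) for _ in range(200)]
answers = [sum(secret[i] for i in S) + rng.normal(0, 1) for S in subsets]
guess = reconstruction_attack(subsets, answers, n)
print("fraction of entries recovered:", (guess == secret).mean())
```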
Privacy Models from Theoretical CS

Model                                        | Utility                          | Privacy                                       | Who Holds Data?
Differential Privacy                         | statistical analysis of dataset  | individual-specific info                      | trusted curator
Secure Function Evaluation                   | any query desired                | everything other than result of query         | original users (or semi-trusted delegates)
Fully Homomorphic (or Functional) Encryption | any query desired                | everything (except possibly result of query)  | untrusted server
Differential privacy
[Dinur-Nissim ’03 + Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]

[Figure: the table of records (names removed) is held by a curator C, who answers queries q1 → a1, q2 → a2, q3 → a3 from data analysts]

Requirement: effect of each individual should be “hidden”
Differential privacy
[Dinur-Nissim ’03 + Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]

[Figure: the same table and curator C, but the data analysts are now an adversary]

Requirement: an adversary shouldn’t be able to
tell if any one person’s data were changed arbitrarily
[Animation: one person’s row is removed or replaced arbitrarily, and the requirement must still hold]
Simple approach: random noise

[Figure: dataset of n records; curator C]

Query: “What fraction of people are type B and HIV positive?”
Answer: true fraction + Noise(O(1/n));  error → 0 as n → ∞
• Very little noise is needed to hide each person as n → ∞.
• Note: this is just for one query.
Differential privacy
[Dinur-Nissim ’03 + Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]

[Figure: datasets D and D’ differing on one row; a randomized curator C answers queries q1, …, qt posed by the adversary]

Requirement: for all D, D’ differing on one row, and all queries q1, …, qt:
  Distribution of C(D, q1, …, qt) ≈ε Distribution of C(D’, q1, …, qt)
Differential privacy
[Dinur-Nissim ’03 + Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]

[Figure: as above]

Requirement: for all D, D’ differing on one row, all queries q1, …, qt, and all sets T of outputs:
  Pr[C(D, q1, …, qt) ∈ T] ≲ (1 + ε) ⋅ Pr[C(D’, q1, …, qt) ∈ T]
Simple approach: random noise

[Figure: dataset of n records; curator C]

Query: “What fraction of people are type B and HIV positive?”
Answer: true fraction + Laplace(1/(εn))   (density ∝ exp(−εn ⋅ |x|))
• Very little noise is needed to hide each person as n → ∞.
• Note: this is just for one query.
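As a concrete sketch (not from the talk), the Laplace mechanism above might look as follows in Python; the record layout, epsilon value, and helper names are assumptions made for illustration.

```python
import numpy as np

def dp_fraction(records, predicate, epsilon):
    """epsilon-DP estimate of the fraction of records satisfying `predicate`.
    Changing one record changes the true fraction by at most 1/n, so
    Laplace noise with scale 1/(epsilon*n) hides any single individual."""
    n = len(records)
    true_fraction = sum(1 for r in records if predicate(r)) / n
    return true_fraction + np.random.laplace(loc=0.0, scale=1.0 / (epsilon * n))

# Hypothetical toy records: (sex, blood type, HIV-positive?)
data = [("F", "B", True), ("M", "A", False), ("M", "O", False),
        ("M", "O", True), ("F", "A", False), ("M", "B", True)]

# "What fraction of people are type B and HIV positive?"
print(dp_fraction(data, lambda r: r[1] == "B" and r[2], epsilon=0.5))
```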
Answering multiple queries

[Figure: dataset of n records; curator C answering k queries like “What fraction of people are type B and HIV positive?”]

Answer to each query + Laplace(O(√k)/(εn));  error → 0 if k ≪ n²

Composition Thm [Dwork-Rothblum-V. `10]:
k independent ε₀-DP algorithms ⇒ O(√k) ⋅ ε₀-DP
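A rough sketch of how the per-query noise might be calibrated when k queries share a total budget ε, in the spirit of the composition theorem above; the constants and the small δ term of the actual theorem are glossed over, and all names are illustrative.

```python
import numpy as np

def dp_answer_queries(records, predicates, epsilon_total):
    """Answer k fraction queries under a total budget epsilon_total.
    If each answer is eps0-DP, composition makes the whole interaction
    roughly O(sqrt(k))*eps0-DP, so spend eps0 = epsilon_total/sqrt(k)
    per query, i.e. noise scale ~ sqrt(k)/(epsilon_total*n)."""
    n, k = len(records), len(predicates)
    eps0 = epsilon_total / np.sqrt(k)
    scale = 1.0 / (eps0 * n)          # = sqrt(k) / (epsilon_total * n)
    return [sum(1 for r in records if p(r)) / n + np.random.laplace(0.0, scale)
            for p in predicates]
```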
Answering multiple queries

[Figure: as above]

Answer to each query + Laplace(O(√k)/(εn));  error → 0 if k ≪ n²

Theorems: Cannot answer more than ~n² queries if:
• answers are computed independently [Dwork-Naor-V. ’12], OR
• data dimensionality is large (e.g. column means)
  [Bun-Ullman-V. `14, Dwork-Smith-Steinke-Ullman-V. `15].
Some Differentially Private Algorithms
• histograms [DMNS06]
• contingency tables [BCDKMT07, GHRU11, TUV12, DNT14]
• machine learning [BDMN05, KLNRS08]
• regression & statistical estimation [CMS11, S11, KST11, ST12, JT13]
• clustering [BDMN05, NRS07]
• social network analysis [HLMJ09, GRU11, KRSY11, KNRS13, BBDS13]
• approximation algorithms [GLMRT10]
• singular value decomposition [HR12, HR13, KT13, DTTZ14]
• streaming algorithms [DNRY10, DNPR10, MMNW11]
• mechanism design [MT07, NST10, X11, NOS12, CCKMV12, HK12, KPRU12]
• …
See Simons Institute Workshop on Big Data & Differential Privacy 12/13
Differential Privacy: Interpretations
Distribution of C(D,q1,…,qt) ≈𝜀 Distribution of C(D’,q1,…,qt)
• Whatever an adversary learns about me, it could have learned
from everyone else’s data.
• Mechanism cannot leak “individual-specific” information.
• Above interpretations hold regardless of adversary’s auxiliary
information.
• Composes gracefully (k repetitions of ε-DP mechanisms remain differentially private, with the parameter degrading gracefully in k)
But
• No protection for information that is not localized to a few rows.
• No guarantee that subjects won’t be “harmed” by results of
analysis.
Amazing possibility: synthetic data
[Blum-Ligett-Roth ’08, Hardt-Rothblum `10]

[Figure: the curator C converts the dataset of n records into a synthetic dataset of “fake” people with the same attributes (d ≪ n²)]

Utility: preserves the fraction of people with every set of attributes!
Problem: uses computation time exponential in 𝑑.
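For intuition about where the exponential-in-d running time comes from, here is a much-simplified Python sketch in the spirit of the multiplicative-weights approach of [Hardt-Rothblum `10]: the algorithm maintains a distribution over all 2^d possible rows. The round-robin query schedule (the real algorithm selects badly-approximated queries privately via the exponential mechanism) and all names are illustrative assumptions.

```python
import itertools
import numpy as np

def synthetic_distribution(records, queries, epsilon):
    """Much-simplified multiplicative-weights sketch for DP synthetic data.
    records: list of d-bit tuples; queries: predicates on a row.
    Returns a distribution over the full domain {0,1}^d (size 2^d, hence
    the exponential running time); sample "fake people" from it."""
    d, n = len(records[0]), len(records)
    domain = list(itertools.product([0, 1], repeat=d))        # all 2^d possible rows
    weights = np.ones(len(domain)) / len(domain)              # start uniform
    eps_per_query = epsilon / len(queries)                    # naive budget split
    for q in queries:
        q_vec = np.array([1.0 if q(x) else 0.0 for x in domain])
        true_ans = sum(1 for r in records if q(r)) / n
        noisy_ans = true_ans + np.random.laplace(0.0, 1.0 / (eps_per_query * n))
        estimate = float(q_vec @ weights)
        weights = weights * np.exp(0.5 * q_vec * (noisy_ans - estimate))
        weights /= weights.sum()                              # renormalize
    return domain, weights
```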
Our result: synthetic data is hard
[Dwork-Naor-Reingold-Rothblum-V. `09, Ullman-V. `11]

[Figure: as above — curator C produces a synthetic dataset of “fake” people (d ≪ n²)]

Theorem: Any such C requires exponential computing time
(else we could break all of cryptography).
Our result: alternative summaries
[Thaler-Ullman-V. `12, Chandrasekaran-Thaler-Ullman-Wan `13]

[Figure: curator C produces a rich summary of the dataset (d ≪ n²)]

• Contains the same statistical information as synthetic data
• But can be computed in sub-exponential time!
Open: is there a polynomial-time algorithm?
Amazing Possibility II:
Statistical Inference & Machine Learning

[Figure: from the dataset of n records, the curator C outputs a hypothesis h or model θ about the world, e.g. a rule for predicting disease from attributes]
Theorem [KLNRS08,S11]: Differential privacy for vast array of
machine learning and statistical estimation problems with little
loss in convergence rate as 𝑛 → ∞.
• Optimizations & practical implementations for logistic regression, ERM,
LASSO, SVMs in [RBHT09,CMS11,ST13,JT14].
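As one hedged illustration (not the specific algorithms of the cited papers), a common recipe for differentially private learning is noisy gradient descent: clip each example's contribution and add noise calibrated to a per-iteration budget. All parameter choices below are placeholders.

```python
import numpy as np

def dp_logistic_regression(X, y, epsilon, iters=50, lr=0.5, clip=1.0):
    """Noisy gradient descent for logistic regression (sketch).
    X: n x d array, y: 0/1 labels.  Each iteration is (epsilon/iters)-DP
    by the Laplace mechanism; basic composition gives epsilon-DP overall."""
    n, d = X.shape
    w = np.zeros(d)
    eps_iter = epsilon / iters
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))                        # predicted probabilities
        g = (p - y)[:, None] * X                                 # per-example log-loss gradients
        l1 = np.abs(g).sum(axis=1, keepdims=True)
        g = g * np.minimum(1.0, clip / np.maximum(l1, 1e-12))    # clip each to L1 norm <= clip
        mean_grad = g.mean(axis=0)       # one row changes this by <= 2*clip/n in L1 norm
        noise = np.random.laplace(0.0, 2.0 * clip / (n * eps_iter), size=d)
        w = w - lr * (mean_grad + noise)
    return w
```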
DP Theory & Practice
Theory: differential privacy research has
• many intriguing theoretical challenges
• rich connections w/other parts of CS theory & mathematics
e.g. cryptography, learning theory, game theory & mechanism design,
convex geometry, pseudorandomness, optimization, approximability,
communication complexity, statistics, …
Practice: interest from many communities in seeing whether
DP can be brought to practice
e.g. statistics, databases, medical informatics, privacy law, social
science, computer security, programming languages, …
Challenges for DP in Practice
• Accuracy for “small data” (moderate values of 𝑛)
• Modelling & managing privacy loss over time
– Especially over many different analysts & datasets
• Analysts used to working with raw data
– One approach: “Tiered access”
– DP for wide access, raw data only by approval with strict
terms of use (cf. Census PUMS vs. RDCs)
• Cases where privacy concerns are not “local” (e.g. privacy for
large groups) or utility is not “global” (e.g. targeting)
• Matching guarantees with privacy law & regulation
• …
Some Efforts to Bring DP to Practice
• CMU-Cornell-PennState “Integrating Statistical and Computational
Approaches to Privacy”
(See http://onthemap.ces.census.gov/)
• Google “RAPPOR"
• UCSD “Integrating Data for Analysis, Anonymization, and Sharing”
(iDash)
• UT Austin “Airavat: Security & Privacy for MapReduce”
• UPenn “Putting Differential Privacy to Work”
• Stanford-Berkeley-Microsoft “Towards Practicing Privacy”
• Duke-NISS “Triangle Census Research Network”
• Harvard “Privacy Tools for Sharing Research Data”
• MIT/CSAIL/ALFA "MoocDB Privacy tools for Sharing MOOC data"
• …
Privacy tools for sharing research data
http://privacytools.seas.harvard.edu/
Computer Science, Law, Social Science, Statistics
Any opinions, findings, and conclusions or recommendations expressed here are
those of the author(s) and do not necessarily reflect the views of the funders of the work.
A SaTC Frontier project
Target: Data Repositories
Datasets are restricted due to privacy concerns.
Goal: enable wider sharing while protecting privacy
Challenges for Sharing Sensitive Data

Complexity of Law
• Thousands of privacy laws in the US alone, at federal, state and local level, usually context-specific:
  HIPAA, FERPA, CIPSEA, Privacy Act, PPRA, ESRA, ….

Difficulty of De-identification
• Stripping “PII” usually provides weak protections and/or poor utility [Sweeney `97]

Inefficient Process for Obtaining Restricted Data
• Can involve months of negotiation between institutions, original researchers

Goal: make sharing easier for researchers without expertise in privacy law/cs/stats
Vision: Integrated Privacy Tools

[Diagram of tools to be developed during the project: a Data Set is deposited in the repository after IRB proposal & review and consent from subjects; a Data Tag Generator, backed by a database of privacy laws & regulations, produces customized & machine-actionable terms of use; risk assessment and de-identification support open access to a sanitized data set; differential privacy supports query access; restricted access remains available; policy proposals and best practices accompany the tools.]
Already can run many statistical analyses (“Zelig methods”)
through the Dataverse interface, without downloading data.
A new interactive data exploration &
analysis tool is coming in Dataverse 4.0.
Plan: use differential privacy to enable access
to currently restricted datasets
Goals for our DP Tools
• General-purpose: applicable to most datasets uploaded to
Dataverse.
• Automated: no differential privacy expert optimizing
algorithms for a particular dataset or application
• Tiered access: DP interface for wide access to rough
statistical information, helping users decide whether to apply
for access to raw data (cf. Census PUMS vs RDCs)
(Limited) prototype on project website:
http://privacytools.seas.harvard.edu/
Differential Privacy: Summary
Differential Privacy offers
• Strong, scalable privacy guarantees
• Compatibility with many types of “big data” analyses
• Amazing possibilities for what can be achieved in principle
There are some challenges, but reasons for optimism
• Intensive research effort from many communities
• Some successful uses in practice already
• Differential privacy easier as 𝑛 → ∞
Privacy Models from Theoretical CS

[Same table as earlier: Differential Privacy (trusted curator), Secure Function Evaluation (original users or semi-trusted delegates), Fully Homomorphic (or Functional) Encryption (untrusted server)]
See Shafi Goldwasser’s talk
at White House-MIT Big Data Privacy Workshop 3/3/14