Differential Privacy: Theoretical & Practical Challenges
Salil Vadhan
Center for Research on Computation & Society, John A. Paulson School of Engineering & Applied Sciences, Harvard University
(on sabbatical at the Shing-Tung Yau Center, Department of Applied Mathematics, National Chiao-Tung University)
Lecture at the Institute of Information Science, Academia Sinica, November 9, 2015

Data Privacy: The Problem
Given a dataset with sensitive information, such as:
• Census data
• Health records
• Social network activity
• Telecommunications data
• …
and many "desirable uses" for it, such as:
• Academic research
• Informing policy
• Identifying subjects for drug trial
• Searching for terrorists
• Market analysis
• …
How can we:
• enable "desirable uses" of the data
• while protecting the "privacy" of the data subjects?

Approach 1: Encrypt the Data
[Figure: the table of records (Chen, Jones, Smith, Ross, Lu, Shah; sex, blood type, …, HIV status) alongside the same table with every field encrypted into bit-strings]
Problems?

Approach 2: Anonymize the Data
[Figure: the same table of records with the identifying names stripped]
[Sweeney `97]: "re-identification" often easy
Problems?

Approach 3: Mediate Access
[Figure: the dataset held by a trusted "curator" C; data analysts send queries q1, q2, q3 and receive answers a1, a2, a3]
Problems? Even simple "aggregate" statistics can reveal individual info.
[Dinur-Nissim `03, Homer et al. `08, Mukatran et al. `11, Dwork et al. `15]

Privacy Models from Theoretical CS
Model | Utility | Privacy | Who Holds Data?
Differential Privacy | statistical analysis of dataset | individual-specific info | trusted curator
Secure Function Evaluation | any query desired | everything other than result of query | original users (or semi-trusted delegates)
Fully Homomorphic (or Functional) Encryption | any query desired | everything (except possibly result of query) | untrusted server

Differential privacy
[Dinur-Nissim ’03 + Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]
[Figure: the dataset (names removed) held by curator C; an analyst/adversary sends queries q1, q2, q3 and receives answers a1, a2, a3; successive slides change or delete one person's row]
Requirement: the effect of each individual should be "hidden".
Requirement: an adversary shouldn't be able to tell if any one person's data were changed arbitrarily.

Simple approach: random noise
[Figure: dataset of 𝑛 rows; query to C: "What fraction of people are type B and HIV positive?"]
C returns: Answer + Noise(𝑂(1/𝑛)), so Error → 0 as 𝑛 → ∞.
• Very little noise is needed to hide each person as 𝑛 → ∞.
• Note: this is just for one query.
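Why noise of magnitude 𝑂(1/𝑛) suffices: a "what fraction of people…" query has sensitivity 1/𝑛, i.e. for datasets D, D′ differing in one of the 𝑛 rows,
|q(D) − q(D′)| ≤ 1/𝑛,
so noise on that same 1/𝑛 scale already masks any single person's contribution while perturbing the answer by a vanishing amount.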
Differential privacy, formally
[Dinur-Nissim ’03 + Dwork, Dwork-Nissim ’04, Blum-Dwork-McSherry-Nissim ’05, Dwork-McSherry-Nissim-Smith ’06]
[Figure: the dataset held by a randomized curator C; the adversary sends queries q1, q2, q3 and receives answers a1, a2, a3]
Requirement: for all D, D′ differing on one row, and all q1,…,qt:
Distribution of C(D,q1,…,qt) ≈𝜀 Distribution of C(D′,q1,…,qt)
i.e. for all sets T:
Pr[C(D,q1,…,qt) ∈ T] ≲ (1+𝜀) ⋅ Pr[C(D′,q1,…,qt) ∈ T]

Simple approach: random noise
For "What fraction of people are type B and HIV positive?", C returns: Answer + Laplace(1/𝜀𝑛), with density ∝ exp(−𝜀𝑛 ⋅ |𝑥|).
• Very little noise is needed to hide each person as 𝑛 → ∞.
• Note: this is just for one query.

Answering multiple queries
For 𝑘 queries, C returns each Answer + Laplace(𝑂(√𝑘)/𝜀𝑛); Error → 0 if 𝑘 ≪ 𝑛².
Composition Thm [Dwork-Rothblum-V. `10]: 𝑘 independent 𝜀₀-DP algorithms ⇒ 𝑂(√𝑘) ⋅ 𝜀₀-DP.

Answering multiple queries
Theorems: Cannot answer more than 𝑛² queries if:
• answers are computed independently [Dwork-Naor-V. ’12], OR
• the data dimensionality is large (e.g. column means) [Bun-Ullman-V. `14, Dwork-Smith-Steinke-Ullman-V. `15].
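A minimal sketch of the Laplace mechanism described above, in Python (the dataset, function name, and parameters are made up for illustration; this is not the curator software discussed later in the talk):

import numpy as np

def dp_fraction(rows, predicate, epsilon, rng=None):
    """epsilon-DP estimate of the fraction of rows satisfying `predicate`.

    Changing one of the n rows moves the true fraction by at most 1/n, so
    Laplace noise with scale 1/(epsilon*n) -- density proportional to
    exp(-epsilon*n*|x|), as on the slide -- gives epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    n = len(rows)
    true_answer = sum(bool(predicate(r)) for r in rows) / n
    return true_answer + rng.laplace(scale=1.0 / (epsilon * n))

# Hypothetical toy data: (sex, blood type, HIV-positive)
rows = [("F", "B", True), ("M", "A", False), ("M", "O", False),
        ("M", "O", True), ("F", "A", False), ("M", "B", True)]
print(dp_fraction(rows, lambda r: r[1] == "B" and r[2], epsilon=0.5))

# For k queries, splitting the budget as epsilon/k per query is basic composition;
# the composition theorem cited above shows the total loss actually grows only
# like sqrt(k)*epsilon_0, which is what allows k << n^2 accurate answers.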
Some Differentially Private Algorithms
• histograms [DMNS06]
• contingency tables [BCDKMT07, GHRU11, TUV12, DNT14]
• machine learning [BDMN05, KLNRS08]
• regression & statistical estimation [CMS11, S11, KST11, ST12, JT13]
• clustering [BDMN05, NRS07]
• social network analysis [HLMJ09, GRU11, KRSY11, KNRS13, BBDS13]
• approximation algorithms [GLMRT10]
• singular value decomposition [HR12, HR13, KT13, DTTZ14]
• streaming algorithms [DNRY10, DNPR10, MMNW11]
• mechanism design [MT07, NST10, X11, NOS12, CCKMV12, HK12, KPRU12]
• …
See the Simons Institute Workshop on Big Data & Differential Privacy, 12/13.

Differential Privacy: Interpretations
Distribution of C(D,q1,…,qt) ≈𝜀 Distribution of C(D′,q1,…,qt)
• Whatever an adversary learns about me, it could have learned from everyone else's data.
• The mechanism cannot leak "individual-specific" information.
• The above interpretations hold regardless of the adversary's auxiliary information.
• Composes gracefully (𝑘 repetitions ⇒ 𝑘𝜀 differentially private).
But:
• No protection for information that is not localized to a few rows.
• No guarantee that subjects won't be "harmed" by the results of the analysis.

Amazing possibility: synthetic data [Blum-Ligett-Roth ’08, Hardt-Rothblum `10]
[Figure: the dataset of 𝑛 rows → C → a synthetic dataset of "fake" people]
• Utility: preserves the fraction of people with every set of attributes!
• Problem: uses computation time exponential in 𝑑, the number of attributes.

Our result: synthetic data is hard [Dwork-Naor-Reingold-Rothblum-V. `09, Ullman-V. `11]
Theorem: Any such C requires exponential computing time (else we could break all of cryptography).

Our result: alternative summaries [Thaler-Ullman-V. `12, Chandrasekaran-Thaler-Ullman-Wan `13]
[Figure: the dataset of 𝑛 rows → C → a "rich summary"]
• Contains the same statistical information as synthetic data
• But can be computed in sub-exponential time!
Open: is there a polynomial-time algorithm?
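To make the "exponential in 𝑑" bottleneck concrete, here is one naive way to produce differentially private synthetic data (a toy sketch with made-up function name and data, not the [Blum-Ligett-Roth ’08] or [Hardt-Rothblum `10] algorithms, and with far worse accuracy): perturb the full histogram over all 2^𝑑 attribute combinations with Laplace noise, then sample "fake" rows from the noisy histogram.

import itertools
import numpy as np

def dp_synthetic_data(rows, d, epsilon, n_fake, rng=None):
    """Toy epsilon-DP synthetic data for rows of d binary (0/1) attributes.

    Builds the histogram over all 2^d attribute combinations, adds Laplace
    noise of scale 2/epsilon to each cell (changing one row moves two cells
    by 1 each, so the L1-sensitivity is 2), clips and renormalizes, and
    samples "fake" rows from the noisy distribution.
    """
    rng = rng or np.random.default_rng()
    domain = list(itertools.product([0, 1], repeat=d))   # all 2^d cells
    index = {cell: i for i, cell in enumerate(domain)}
    hist = np.zeros(len(domain))
    for r in rows:
        hist[index[tuple(r)]] += 1
    noisy = np.clip(hist + rng.laplace(scale=2.0 / epsilon, size=len(domain)), 0, None)
    total = noisy.sum()
    probs = noisy / total if total > 0 else np.full(len(domain), 1.0 / len(domain))
    picks = rng.choice(len(domain), size=n_fake, p=probs)
    return [domain[i] for i in picks]

# Hypothetical toy data: 3 binary attributes per person
data = [(1, 0, 1), (0, 1, 0), (0, 0, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
print(dp_synthetic_data(data, d=3, epsilon=1.0, n_fake=6))

The noisy histogram is 𝜀-DP and everything after it is post-processing, but the 2^𝑑 cells are exactly why this style of approach needs time exponential in 𝑑.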
Amazing Possibility II: Statistical Inference & Machine Learning
[Figure: the dataset of 𝑛 rows → C → a hypothesis ℎ or model 𝜃 about the world, e.g. a rule for predicting disease from attributes]
Theorem [KLNRS08, S11]: Differential privacy is achievable for a vast array of machine learning and statistical estimation problems with little loss in convergence rate as 𝑛 → ∞.
• Optimizations & practical implementations for logistic regression, ERM, LASSO, SVMs in [RBHT09, CMS11, ST13, JT14].

DP Theory & Practice
Theory: differential privacy research has
• many intriguing theoretical challenges
• rich connections w/ other parts of CS theory & mathematics, e.g. cryptography, learning theory, game theory & mechanism design, convex geometry, pseudorandomness, optimization, approximability, communication complexity, statistics, …
Practice: interest from many communities in seeing whether DP can be brought to practice, e.g. statistics, databases, medical informatics, privacy law, social science, computer security, programming languages, …

Challenges for DP in Practice
• Accuracy for "small data" (moderate values of 𝑛)
• Modelling & managing privacy loss over time
  – Especially over many different analysts & datasets
• Analysts used to working with raw data
  – One approach: "tiered access" – DP for wide access, raw data only by approval with strict terms of use (cf. Census PUMS vs. RDCs)
• Cases where privacy concerns are not "local" (e.g. privacy for large groups) or utility is not "global" (e.g. targeting)
• Matching guarantees with privacy law & regulation
• …

Some Efforts to Bring DP to Practice
• CMU-Cornell-PennState "Integrating Statistical and Computational Approaches to Privacy" (see http://onthemap.ces.census.gov/)
• Google "RAPPOR"
• UCSD "Integrating Data for Analysis, Anonymization, and Sharing" (iDash)
• UT Austin "Airavat: Security & Privacy for MapReduce"
• UPenn "Putting Differential Privacy to Work"
• Stanford-Berkeley-Microsoft "Towards Practicing Privacy"
• Duke-NISS "Triangle Census Research Network"
• Harvard "Privacy Tools for Sharing Research Data"
• MIT/CSAIL/ALFA "MoocDB: Privacy Tools for Sharing MOOC Data"
• …

Privacy tools for sharing research data
http://privacytools.seas.harvard.edu/
A SaTC Frontier project spanning Computer Science, Law, Social Science, and Statistics.
Any opinions, findings, and conclusions or recommendations expressed here are those of the author(s) and do not necessarily reflect the views of the funders of the work.

Target: Data Repositories
Datasets are restricted due to privacy concerns.
Goal: enable wider sharing while protecting privacy.

Challenges for Sharing Sensitive Data
• Complexity of law: thousands of privacy laws in the US alone, at the federal, state, and local level, usually context-specific: HIPAA, FERPA, CIPSEA, Privacy Act, PPRA, ESRA, ….
• Difficulty of de-identification: stripping "PII" usually provides weak protection and/or poor utility [Sweeney `97].
• Inefficient process for obtaining restricted data: can involve months of negotiation between institutions and original researchers.
Goal: make sharing easier for researchers without expertise in privacy law/cs/stats.

Vision: Integrated Privacy Tools
[Diagram of the envisioned workflow; components include: consent from subjects; IRB proposal & review; deposit in repository; Data Set; Data Tag Generator; Database of Privacy Laws & Regulations; Risk Assessment and De-Identification; Open Access to Sanitized Data Set; Differential Privacy for Query Access; Customized & Machine-Actionable Terms of Use for Restricted Access; Policy Proposals and Best Practices. Tools to be developed during the project.]
Already can run many statistical analyses ("Zelig methods") through the Dataverse interface, without downloading data. A new interactive data exploration & analysis tool to come in Dataverse 4.0.
Plan: use differential privacy to enable access to currently restricted datasets.

Goals for our DP Tools
• General-purpose: applicable to most datasets uploaded to Dataverse.
• Automated: no differential-privacy expert optimizing algorithms for a particular dataset or application.
• Tiered access: a DP interface for wide access to rough statistical information, helping users decide whether to apply for access to the raw data (cf. Census PUMS vs. RDCs).
(Limited) prototype on the project website: http://privacytools.seas.harvard.edu/

Differential Privacy: Summary
Differential privacy offers:
• Strong, scalable privacy guarantees
• Compatibility with many types of "big data" analyses
• Amazing possibilities for what can be achieved in principle
There are some challenges, but reasons for optimism:
• Intensive research effort from many communities
• Some successful uses in practice already
• Differential privacy gets easier as 𝑛 → ∞
Privacy Models from Theoretical CS
Model | Utility | Privacy | Who Holds Data?
Differential Privacy | statistical analysis of dataset | individual-specific info | trusted curator
Secure Function Evaluation | any query desired | everything other than result of query | original users (or semi-trusted delegates)
Fully Homomorphic (or Functional) Encryption | any query desired | everything (except possibly result of query) | untrusted server

See Shafi Goldwasser’s talk at the White House-MIT Big Data Privacy Workshop, 3/3/14.