From Differential Privacy to Machine Learning, and Back
Abhradeep Guha Thakurta, Yahoo Labs, Sunnyvale

Thesis: Differential privacy ⇒ generalizability. Stable learning ⇒ differential privacy.

Part I of this Talk
1. Towards a rigorous notion of statistical data privacy
2. Differential privacy: An overview
3. Generalization guarantee via differential privacy
4. Application: Follow-the-perturbed-leader

Need for a rigorous notion of privacy

Learning from Private Data
• Individuals contribute records d_1, d_2, …, d_{n−1}, d_n to a trusted learning algorithm 𝒜.
• The algorithm releases summary statistics to users, e.g., 1. classifiers, 2. clusters, 3. regression coefficients.
• An attacker can see whatever the users see, so the released statistics themselves must protect the individuals.

Learning from Private Data: whose data, analyzed by whom, to what end, and with what typical output?
• Government agency: census data, analyzed by the general public, e.g., for apportionment; typical output is summary statistics.
• Non-profit agency: MOOC data, analyzed by researchers to improve class participation; typical output is summary statistics and trends.
• For-profit agency: Yahoo! data, analyzed by researchers to improve sales; typical output is recommendations.

Learning from Private Data: two conflicting goals
1. Utility: release accurate information.
2. Privacy: protect the privacy of individual entries.
Balancing the tradeoff is a difficult problem:
1. Netflix prize database attack [NS08]
2. Facebook advertisement system attack [Korolova11]
3. Amazon recommendation system attack [CKNFS11]
Data privacy is an active area of research across computer science, economics, statistics, biology, the social sciences, and more.

Reconstruction attacks: a case of blatant non-privacy

Reconstruction Attacks: General Principle
• An analyst issues queries q_1, …, q_m against the data set and receives responses a_1, …, a_m.
• Show: answering a set of m queries fairly accurately allows recovering 99% of the data set.
• This violates any reasonable notion of privacy.

Linear Reconstruction Attack [DN03]
• Data set D of n records, viewed as a vector D ∈ {0,1}^n; each query is a vector q ∈ {0,1}^n with true response ⟨q, D⟩.
• Objective: output answers a_1, …, a_m such that for all i ∈ [m], |a_i − ⟨q_i, D⟩| ≤ α.
• Theorem: if α = o(√n) and m = Ω(n), then answering queries q_1, …, q_m drawn uniformly from {0,1}^n recovers 99% of the records in D with probability ≥ 1 − negl(n).

Proof sketch (recovery algorithm):
• Do an exhaustive search over candidates c ∈ {0,1}^n until no query disqualifies the candidate, i.e., for all i ∈ [m], |⟨q_i, c⟩ − a_i| ≤ α.
• There is also an efficient version of the recovery via linear programming.

Disqualifying lemma: if a candidate c disagrees with D on a constant fraction of coordinates, then a query q ∼_u {0,1}^n disqualifies c with probability ≥ 2/3.
• |⟨q, D⟩ − ⟨q, c⟩| is (symmetric) binomially distributed over the coordinates where c and D differ.
• Anti-concentration: with probability ≥ 2/3 this gap is Ω(√n), while by definition the released answers are within α = o(√n) of the true answers, so such a c fails the criterion.

Finishing the proof:
• Pr[no query disqualifies a fixed bad candidate c] ≤ 1/3^m, so Pr[∃ a bad candidate that survives all queries] ≤ 2^n/3^m ≤ negl(n) for m = Ω(n).
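To make the attack concrete, here is a minimal numpy/scipy sketch of the LP-based recovery described above. The code is not from the talk; the database size, query set, and error level α are illustrative choices.

```python
# Sketch of the Dinur-Nissim linear reconstruction attack via the LP
# relaxation: find a fractional candidate c in [0,1]^n consistent with all
# noisy answers, then round to {0,1}. Parameters are illustrative only.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n = 100                               # number of records (bits)
m = 10 * n                            # m = Omega(n) random subset-sum queries
alpha = 0.2 * np.sqrt(n)              # per-query error, o(sqrt(n))

D = rng.integers(0, 2, size=n)        # secret database in {0,1}^n
Q = rng.integers(0, 2, size=(m, n))   # random queries q_i in {0,1}^n
answers = Q @ D + rng.uniform(-alpha, alpha, size=m)   # noisy responses

# Feasibility LP:  answers - alpha <= Q c <= answers + alpha,  0 <= c <= 1
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([answers + alpha, -(answers - alpha)])
res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, 1)] * n, method="highs")

recovered = (res.x > 0.5).astype(int)          # round the fractional solution
print("fraction of records recovered:",
      np.mean(recovered == D))                 # usually close to 1 here
```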
Other Reconstruction Attacks [DMT07, DY08, KS13]
• [DMT07] allows reconstruction even when the answers to around 20% of the queries are arbitrary.
• [DY08] improves [DMT07] from tolerating arbitrary answers on 20% of the queries to tolerating them on almost half of the queries.
• [KS13] extends reconstruction attacks beyond linear queries, e.g., to machine learning applications.

Part I of this Talk — next up: 2. Differential privacy: An overview

Differential privacy: an overview

What Does It Mean to Be Private?
What we cannot hope to achieve [DN10]: "the adversary learns very little about an individual from the output."
• Example: a Martian scientist wants to know whether I have two left feet.
• Prior belief: I have two left feet. A survey then shows that every human being has one left and one right foot.
• Posterior belief: I have one left foot and one right foot.
• Notice: this inference does not depend on whether I participated in the survey at all, so "learning nothing about an individual" is not an achievable privacy goal.

What we can hope to achieve [DMNS06, DKMMN06]: the adversary learns essentially the same thing irrespective of your presence or absence in the data set.
• Run the randomized algorithm A (with its own random coins) on a data set D containing your record to get A(D), and on a data set D′ with your record replaced to get A(D′).
• D and D′ are called neighboring data sets.
• Require: neighboring data sets induce close distributions on the outputs.

Differential Privacy [DMNS06, DKMMN06]
Definition: a randomized algorithm A is (ε, δ)-differentially private if
• for all data sets D and D′ that differ in one element, and
• for all sets of answers S,
Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ.

Semantics of Differential Privacy
• Differential privacy is a condition on the algorithm, not on a particular output.
• The guarantee is meaningful in the presence of any auxiliary information.
• Typical privacy parameters: ε ≈ 0.1 and δ = 1/n^{log n}, where n = # of data samples.
• Composition: the ε's and δ's add up over multiple executions.

Few tools to design differentially private algorithms

Laplace Mechanism [DMNS06]
• Data set D = {d_1, …, d_n} and a function f : U* → ℝ^p on D.
• Global ℓ₁ sensitivity: GS(f, 1) = max_{D, D′ : d_H(D, D′) = 1} ||f(D) − f(D′)||₁.
1. b: random vector with each coordinate sampled i.i.d. from Lap(GS(f, 1)/ε).
2. Output f(D) + b.
Theorem (privacy): the algorithm is ε-differentially private.

Gaussian Mechanism [DKMMN06]
• Global ℓ₂ sensitivity: GS(f, 2) = max_{D, D′ : d_H(D, D′) = 1} ||f(D) − f(D′)||₂.
1. b: random vector with each coordinate sampled i.i.d. from N(0, σ²), with σ = O(GS(f, 2) · √(log(1/δ))/ε).
2. Output f(D) + b.
Theorem (privacy): the algorithm is (ε, δ)-differentially private.
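For concreteness, here is a minimal numpy sketch of these two mechanisms (my own illustration, not code from the talk). The sensitivities gs1 and gs2 must be supplied by the analyst, and the Gaussian noise scale uses the standard √(2 ln(1.25/δ))/ε calibration in place of the slide's O(·) expression.

```python
# Laplace and Gaussian mechanisms for a generic vector-valued query f(D).
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(f_of_D, gs1, eps):
    """eps-DP: add Lap(gs1/eps) noise independently to each coordinate."""
    f_of_D = np.asarray(f_of_D, dtype=float)
    return f_of_D + rng.laplace(scale=gs1 / eps, size=f_of_D.shape)

def gaussian_mechanism(f_of_D, gs2, eps, delta):
    """(eps, delta)-DP with sigma = gs2 * sqrt(2 ln(1.25/delta)) / eps."""
    f_of_D = np.asarray(f_of_D, dtype=float)
    sigma = gs2 * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return f_of_D + rng.normal(scale=sigma, size=f_of_D.shape)

# Example: the histogram query discussed next has L1 sensitivity 2 when
# neighboring data sets are defined by replacing one record.
counts = np.array([40, 25, 20, 15])               # red, yellow, green, blue
noisy_counts = laplace_mechanism(counts, gs1=2.0, eps=0.1)
```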
Laplace Mechanism in Action: Computing a Histogram
• Data domain U = {u_1, …, u_k} (e.g., {red, yellow, green, blue}).
• Data set D ∈ U^n: d_1, …, d_n, i.e., n samples from the domain U.
• Histogram representation: H(D) is the vector of counts for the k bins.
• For all neighbors D and D′ (one record replaced), ||H(D) − H(D′)||₁ ≤ 2: one bin loses a count and another gains one, so the ℓ₁ sensitivity of H is 2.
Algorithm:
1. b: random vector with each coordinate sampled i.i.d. from Lap(2/ε).
2. Output H(D) + b.

Report-Noisy-Max (a.k.a. the Exponential Mechanism) [MT07, BLST10]
• Set of candidate outputs S, and a score function q : S × U^n → ℝ, where U^n is the domain of data sets.
• Objective: output an s ∈ S (approximately) maximizing q(s, D).
• Global sensitivity: GS(q) = max_{s ∈ S, neighboring D, D′} |q(s, D) − q(s, D′)|.
Algorithm:
1. For every s ∈ S, compute ẑ_s ← q(s, D) + Lap(GS(q)/ε).
2. Output the s with the highest value of ẑ_s.
Theorem (privacy): the algorithm is 2ε-differentially private.
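Below is a minimal sketch of report-noisy-max (my own illustration, not the talk's code). The score function and its sensitivity are supplied by the analyst, and the toy data set is made up.

```python
# Report-noisy-max: add Laplace noise to each candidate's score and
# release the arg max (2*eps-DP per the slide above).
import numpy as np

rng = np.random.default_rng()

def report_noisy_max(candidates, score, data, gs_q, eps):
    """Return the candidate with the largest Laplace-noised score."""
    noisy = [score(s, data) + rng.laplace(scale=gs_q / eps) for s in candidates]
    return candidates[int(np.argmax(noisy))]

# Toy usage: privately pick the most frequent color in a data set.
data = ["red"] * 40 + ["yellow"] * 25 + ["green"] * 20 + ["blue"] * 15
colors = ["red", "yellow", "green", "blue"]
freq = lambda s, D: sum(1 for d in D if d == s)   # sensitivity 1 under replacement
winner = report_noisy_max(colors, freq, data, gs_q=1.0, eps=0.1)
```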
Local Sensitivity [NRS07]
• Data set D = {d_1, …, d_n} and a function f : U* → ℝ^p on D.
• Local sensitivity: LS(f, D, 1) = max_{D′ : d_H(D, D′) = 1} ||f(D) − f(D′)||₁.
• Tempting algorithm: 1. sample b from Lap(LS(f, D, 1)/ε)^p; 2. output f(D) + b.
• This is not differentially private: the noise magnitude itself depends on D and can reveal information.
• Part II (of the full talk): we show that local sensitivity is nevertheless a useful tool.

Composition Theorems [DMNS06, DL09, DRV10]
• Let A_1, …, A_k each be an (ε, δ)-differentially private algorithm run on the same data set.
• Weak composition: A_1 ∘ ⋯ ∘ A_k is (kε, kδ)-differentially private.
• Strong composition: A_1 ∘ ⋯ ∘ A_k is approximately (√k · ε, kδ)-differentially private.

Part I of this Talk — next up: 3. Generalization guarantee via differential privacy

Convex risk minimization

Convex Empirical Risk Minimization (ERM): An Example
• Linear classifiers in ℝ^p. Domain: feature vector x ∈ ℝ^p with label y ∈ {yellow, red}, encoded as {+1, −1}.
• Data set D = {(x_1, y_1), …, (x_n, y_n)} drawn i.i.d. from a distribution τ over (x, y).
• Goal: find θ in a convex set C of constant diameter that classifies points (x, y) ∼ τ.
• Minimize the risk: θ* = argmin_{θ ∈ C} E_{(x,y)∼τ}[−y⟨x, θ⟩].
• ERM: use the empirical loss ℒ(θ; D) = (1/n) Σ_{i=1}^n −y_i⟨x_i, θ⟩ to approximate θ*.

Empirical Risk Minimization (ERM) Setup
• Convex loss function ℓ : C × U → ℝ and data set D = {d_1, …, d_n}; empirical loss ℒ(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i).
• Regularized ERM: θ̂ = argmin_{θ ∈ C} (1/n) Σ_{i=1}^n ℓ(θ; d_i) + r(θ), where the regularizer r(θ) is used to stop overfitting.
• Objective: minimize the excess risk E_{d∼τ}[ℓ(θ̂; d)] − min_{θ ∈ C} E_{d∼τ}[ℓ(θ; d)].

Differential privacy yields generalizability

Privacy Implies Generalizability: A Bird's-Eye View
Differential privacy ⇒ stability (robustness to outliers) ⇒ generalization (low excess risk).

Prediction Stability and Generalizability [SSSS09]
• α-prediction stability: for any pair of neighboring data sets D, D′ and every d ∈ U, |E_𝒜[ℓ(θ(D); d)] − E_𝒜[ℓ(θ(D′); d)]| ≤ α, where the expectation is over the coins of the algorithm 𝒜.
• Theorem (stability ⇒ generalization): if the excess empirical risk of 𝒜 is bounded, E_𝒜[ℒ(θ(D); D)] − min_θ ℒ(θ; D) ≤ A, and the prediction stability of 𝒜 is α, then E_{d∼τ}[ℓ(θ(D); d)] − min_{θ ∈ C} E_{d∼τ}[ℓ(θ; d)] ≤ A + α.
• Regularization implies stability [SSSS09]: if ||∇ℓ(θ; d)||₂ ≤ L, the Δ-regularized ERM θ(D) = argmin_{θ ∈ C} (1/n) Σ_{i=1}^n ℓ(θ; d_i) + (Δ/2)||θ||₂² has prediction stability α = 2L²/(Δn).

Prediction Stability and Generalizability [Bassily, Smith, T. 2014]
• Theorem: if 𝒜 is (ε, δ)-differentially private and ℓ is L-Lipschitz over the set C of diameter ||C||₂, then α = εL||C||₂ + δL||C||₂.
• Advantage over ℓ₂-regularization: this route to stability does not rely on convexity of ℓ.
• Theorem (Part II of the talk): one can achieve excess empirical risk A = O(L√p/(εn)) with an (ε, δ)-differentially private algorithm, i.e., E_𝒜[ℒ(θ(D); D)] − min_θ ℒ(θ; D) = O(L√p/(εn)), while the prediction stability is roughly εL (recall C has constant diameter).
• Combining the two and setting ε = (√p/n)^{1/2} gives excess risk E_{d∼τ}[ℓ(θ; d)] − min_{θ ∈ C} E_{d∼τ}[ℓ(θ; d)] = O(L·p^{0.25}/√n).
• Uniform convergence would have resulted in a √p dependence.
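The following is a minimal sketch (assumptions mine) of one standard way to obtain an (ε, δ)-differentially private ERM algorithm: noisy projected gradient descent on the empirical loss. It is meant only to make the "private ⇒ stable ⇒ generalizing" pipeline concrete; it is not necessarily the Part II algorithm of the talk, and the noise calibration is a rough Gaussian-mechanism-plus-strong-composition heuristic with constants omitted.

```python
# Noisy projected gradient descent for (roughly) (eps, delta)-DP ERM.
import numpy as np

def noisy_gradient_descent(grad_loss, d_points, dim, L, eps, delta,
                           steps=100, radius=1.0, eta=0.1, seed=0):
    """grad_loss(theta, d) must return a gradient with L2 norm <= L."""
    rng = np.random.default_rng(seed)
    n = len(d_points)
    # Per-step Gaussian noise so the whole run is roughly (eps, delta)-DP
    # under strong composition (constants omitted for readability).
    sigma = (L / n) * np.sqrt(steps * np.log(1.0 / delta)) / eps
    theta = np.zeros(dim)
    for _ in range(steps):
        g = np.mean([grad_loss(theta, d) for d in d_points], axis=0)
        theta = theta - eta * (g + rng.normal(scale=sigma, size=dim))
        norm = np.linalg.norm(theta)
        if norm > radius:                      # project back onto C
            theta = theta * (radius / norm)
    return theta

# Toy usage on the linear-loss classifier from the ERM example:
# loss(theta; (x, y)) = -y * <x, theta>, whose gradient is -y * x.
rng = np.random.default_rng(1)
p = 20
X = rng.normal(size=(500, p))
w_true = rng.normal(size=p); w_true /= np.linalg.norm(w_true)
y = np.sign(X @ w_true)
data = list(zip(X, y))
L_lip = float(np.max(np.linalg.norm(X, axis=1)))   # gradient norm bound
theta_priv = noisy_gradient_descent(lambda t, d: -d[1] * d[0],
                                    data, dim=p, L=L_lip,
                                    eps=1.0, delta=1e-5)
```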
Proof sketches…

Stability Implies Generalizability: Proof Sketch
Theorem (restated): if the excess empirical risk of 𝒜 is at most A and its prediction stability is α, then E_{d∼τ}[ℓ(θ(D); d)] − min_{θ ∈ C} E_{d∼τ}[ℓ(θ; d)] ≤ A + α.
• Define the true risk R(θ) = E_{d∼τ}[ℓ(θ; d)]. Draw the data set D ∼ τ^n and let D^i be D with its i-th sample replaced by a fresh draw from τ; the true risk does not change on resampling a data point.
• Claim: for all i ∈ [n], E[R(θ(D))] = E[R(θ(D^i))], since D and D^i are identically distributed.
• Claim: E[R(θ(D))] = (1/n) Σ_{i=1}^n E[ℓ(θ(D^i); d_i)]; this follows from the fact that d_i is independent of D^i.
• Therefore E[R(θ(D)) − ℒ(θ(D); D)] = (1/n) Σ_{i=1}^n E[ℓ(θ(D^i); d_i) − ℓ(θ(D); d_i)] ≤ α, where the last step follows from prediction stability (D and D^i are neighbors).
• Let θ* be the true risk minimizer. Then E[R(θ*)] = E_{D∼τ^n}[ℒ(θ*; D)] ≥ E[ℒ(θ(D); D)] − A, using the excess empirical risk bound.
• Combining the two displays gives E[R(θ(D))] ≤ E[R(θ*)] + A + α, which is the theorem.

α-Prediction Stability via Differential Privacy: Proof Sketch
Theorem (restated): if 𝒜 is (ε, δ)-differentially private, then α = εL||C||₂ + δL||C||₂.
• Fix some element θ* ∈ C and consider the shifted loss ℓ(·; d) − ℓ(θ*; d), which is bounded in magnitude by L||C||₂.
• By the definition of differential privacy (using e^ε ≈ 1 + ε for small ε), for every d ∈ U and neighboring D, D′:
  (E[ℓ(θ(D); d)] − E[ℓ(θ*; d)]) − (E[ℓ(θ(D′); d)] − E[ℓ(θ*; d)]) ≤ ε·(E[ℓ(θ(D′); d)] − E[ℓ(θ*; d)]) + δL||C||₂ ≤ εL||C||₂ + δL||C||₂.

Part I of this Talk — next up: 4. Application: Follow-the-perturbed-leader

Online learning with linear costs

Online Learning Setup
• In round t, Player 1 picks x_t in the ℓ₁-ball and Player 2 picks a cost vector c_t in the ℓ∞-ball; Player 1 pays ⟨x_t, c_t⟩.
• Player 1 minimizes the regret R(c_1, …, c_T) = Σ_{t=1}^T ⟨x_t, c_t⟩ − min_{x ∈ L₁} Σ_{t=1}^T ⟨x, c_t⟩.
• Oblivious adversary: c_1, …, c_T are fixed ahead of time.

Follow-the-Leader and Follow-the-Perturbed-Leader
Follow-the-leader (FTL):
• x_{t+1} ← argmin_{x ∈ L₁} Σ_{τ=1}^t ⟨c_τ, x⟩.
• Theorem: FTL has regret Ω(T). Proof sketch: generate the costs c_1, …, c_T i.i.d. uniformly at random from {0,1}^p.

Follow-the-perturbed-leader (FTPL) [KV05]:
• x_{t+1} ← argmin_{x ∈ L₁} Σ_{τ=1}^t ⟨c_τ, x⟩ + ⟨b_t, x⟩, where b_t ∼ Lap(1/ε)^p.
• Theorem: FTPL with ε ≈ log(T)/√T has regret O(log(T)·√T).
• Theorem: FTPL is (ε, δ)-differentially private with respect to the cost sequence; this follows from 1. the report-noisy-max / exponential-mechanism argument and 2. the strong composition theorem.
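Here is a minimal sketch of FTPL with Laplace perturbations on this ℓ₁-ball versus ℓ∞-ball game, alongside plain FTL for comparison. It is my own illustration: the dimension, horizon, and cost distribution are made up, and only the Lap(1/ε) perturbation scale comes from the slide.

```python
# Follow-the-(perturbed-)leader on the l1-ball with l_inf-bounded costs.
import numpy as np

rng = np.random.default_rng(0)
p, T = 5, 2000
eps = np.log(T) / np.sqrt(T)
costs = rng.uniform(-1.0, 1.0, size=(T, p))      # oblivious cost sequence

def best_in_l1_ball(v):
    """argmin over the l1 ball of <x, v>: all mass on the largest |v_j|."""
    j = int(np.argmax(np.abs(v)))
    x = np.zeros_like(v)
    x[j] = -np.sign(v[j])
    return x

def play(perturbed):
    total, cum = 0.0, np.zeros(p)
    for t in range(T):
        noise = rng.laplace(scale=1.0 / eps, size=p) if perturbed else 0.0
        x = best_in_l1_ball(cum + noise)          # (perturbed) leader so far
        total += x @ costs[t]
        cum += costs[t]
    return total

comparator = best_in_l1_ball(costs.sum(axis=0)) @ costs.sum(axis=0)
print("FTL  regret:", play(perturbed=False) - comparator)
print("FTPL regret:", play(perturbed=True) - comparator)
```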
Regret Analysis of FTPL via Differential Privacy
Theorem (restated): FTPL with ε ≈ log(T)/√T has regret O(log(T)·√T).
Proof sketch:
• Be-the-leader (BTL), which is allowed to peek one step ahead: x_{t+1} ← argmin_{x ∈ L₁} Σ_{τ=1}^{t+1} ⟨c_τ, x⟩. Claim: BTL has zero regret.
• Be-the-perturbed-leader (BTPL): x_{t+1} ← argmin_{x ∈ L₁} Σ_{τ=1}^{t+1} ⟨c_τ, x⟩ + ⟨b_t, x⟩.
• Claim: by (ε, δ)-differential privacy, E[cost of FTPL] ≤ e^ε · E[cost of BTPL] + δ; this is the prediction-stability consequence of differential privacy from the previous section.
• Claim: E[cost of BTPL] ≤ E[cost of BTL] + O(log(T)/ε).
⇒ Regret(FTPL) = O(log(T)/ε + εT), ignoring the dependence on p and δ for simplicity; optimizing over ε gives the stated regret bound.

Part I of this Talk (recap)
1. Towards a rigorous notion of statistical data privacy
2. Differential privacy: An overview
3. Generalization guarantee via differential privacy
4. Application: Follow-the-perturbed-leader

Thesis: Differential privacy ⇒ generalizability. Stable learning ⇒ differential privacy.

Concluding Remarks
• Differential privacy is a cryptographically strong notion of data privacy.
• Differential privacy provides a very strong form of stability, i.e., stability of the measure induced on the output space.
• Stability obtained from differential privacy is composable, i.e., any post-processing of the output is also stable.

References
[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. 2006.
[DKMMN06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Kobbi Nissim. Our data, ourselves: Privacy via distributed noise generation. 2006.
[MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. 2007.
[BLST10] Raghav Bhaskar, Srivatsan Laxman, Adam Smith, and Abhradeep Thakurta. Discovering frequent patterns in sensitive data. 2010.
[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. 2014.
[DN10] Cynthia Dwork and Moni Naor. On the difficulties of disclosure prevention in statistical databases, or the case for differential privacy. 2010.
[DN03] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. 2003.