From Differential Privacy to Machine Learning, and Back
Abhradeep Guha Thakurta, Yahoo Labs, Sunnyvale

Thesis: Differential privacy ⇒ generalizability
        Stable learning ⇒ differential privacy

Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low dimensions
3. Private Frank-Wolfe and risk minimization in high dimensions
4. Private feature selection in high dimensions and the LASSO

Recap: Differential privacy

Differential Privacy [DMNS06, DKMMN06]
• The adversary learns essentially the same thing irrespective of your presence or absence in the data set
[Figure: a data set D and a neighboring data set D′ each fed to a randomized algorithm A (with its random coins), producing outputs A(D) and A(D′)]
• D and D′ are called neighboring data sets
• Require: neighboring data sets induce close distributions on outputs

Differential Privacy [DMNS06, DKMMN06]
Definition: A randomized algorithm A is (ε, δ)-differentially private if
• for all data sets D and D′ that differ in one element, and
• for all sets of answers S:
  Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ

Semantics of Differential Privacy
• Differential privacy is a condition on the algorithm
• The guarantee is meaningful in the presence of any auxiliary information
• Typically, think of privacy parameters ε ≈ 0.1 and δ = 1/n^{log n}, where n = # of data samples
• Composition: the ε's and δ's add up over multiple executions

Few tools to design differentially private algorithms

Laplace Mechanism [DMNS06]
Data set D = {d_1, ⋯, d_n} and f: U* → ℝ^p a function on D
Global sensitivity: GS(f, 1) = max over neighboring D, D′ (d_H(D, D′) = 1) of ‖f(D) − f(D′)‖_1
1. E: random vector with coordinates sampled i.i.d. from Lap(GS(f, 1)/ε)
2. Output f(D) + E
Theorem (Privacy): the algorithm is ε-differentially private

Gaussian Mechanism [DKMMN06]
Data set D = {d_1, ⋯, d_n} and f: U* → ℝ^p a function on D
Global sensitivity: GS(f, 2) = max over neighboring D, D′ (d_H(D, D′) = 1) of ‖f(D) − f(D′)‖_2
1. E: random vector sampled from N(0, σ² I_p), with σ = O(GS(f, 2) · √(log(1/δ)) / ε)
2. Output f(D) + E
Theorem (Privacy): the algorithm is (ε, δ)-differentially private

Report-Noisy-Max (a.k.a. Exponential Mechanism) [MT07, BLST10]
Set of candidate outputs S; score function q: S × U^n → ℝ, where U^n is the domain of data sets
Objective: output s ∈ S maximizing q(s, D)
Global sensitivity: GS(q) = max over s ∈ S and neighboring D, D′ of |q(s, D) − q(s, D′)|
1. q̂_s ← q(s, D) + Lap(GS(q)/ε) for each s ∈ S
2. Output the s with the highest value of q̂_s
Theorem (Privacy): the algorithm is 2ε-differentially private

Composition Theorems [DMNS06, DL09, DRV10]
(ε, δ)-diff. private algorithms A_1, ⋯, A_k run on the same data set:
• Weak composition: A_1 ∘ ⋯ ∘ A_k is (kε, kδ)-diff. private
• Strong composition: A_1 ∘ ⋯ ∘ A_k is ≈ (√k·ε, kδ)-diff. private
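The basic tools above translate directly into code. Below is a minimal Python sketch (not from the talk) of the Laplace mechanism and Report-Noisy-Max; the function f, the score q, the candidate set, and the sensitivity values are assumed to be supplied by the caller.

```python
import numpy as np

def laplace_mechanism(D, f, gs_l1, eps):
    """Release f(D) with i.i.d. Laplace noise scaled to the L1 global sensitivity gs_l1."""
    value = np.asarray(f(D), dtype=float)
    noise = np.random.laplace(scale=gs_l1 / eps, size=value.shape)
    return value + noise  # eps-differentially private

def report_noisy_max(D, candidates, q, gs_q, eps):
    """Add Laplace noise to each score q(s, D) and return the candidate with the highest noisy score."""
    noisy_scores = [q(s, D) + np.random.laplace(scale=gs_q / eps) for s in candidates]
    return candidates[int(np.argmax(noisy_scores))]  # 2*eps-differentially private
```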
Recap: (Private) convex risk minimization

Convex Empirical Risk Minimization (ERM): An Example
Linear classifiers in ℝ^p
• Domain: feature vector x ∈ ℝ^p, label y ∈ {yellow, red}, i.e., y ∈ {+1, −1}
• Data set D = {(x_1, y_1), ⋯, (x_n, y_n)} drawn i.i.d. from a distribution τ over (x, y)
• Goal: find θ ∈ C (a convex set C of constant diameter) that classifies (x, y) ∼ τ
• Minimize the risk: θ* = argmin_{θ∈C} E_{(x,y)∼τ}[y⟨x, θ⟩]
• ERM: use ℓ(θ; D) = Σ_{i=1}^n y_i⟨x_i, θ⟩ to approximate θ*

Empirical Risk Minimization (ERM) Setup
• Convex loss function ℓ: C × U → ℝ; data set D = {d_1, ⋯, d_n}; loss ℓ(θ; D)
• Regularized ERM: θ̂ = argmin_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i) + r(θ), where the regularizer r(θ) is used to stop overfitting
• Objective: minimize the excess risk E_{d∼τ}[ℓ(θ̂; d)] − min_{θ∈C} E_{d∼τ}[ℓ(θ; d)]

Empirical Risk Minimization (ERM) Setup
• Empirical risk minimizer: θ̂ = argmin_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i)
• Recall: differential privacy + low excess empirical risk ⇒ low excess true risk
• Today: a differentially private algorithm that outputs θ_priv minimizing the excess empirical risk
  ℓ(θ_priv; D) − min_{θ∈C} ℓ(θ; D)

Empirical Risk Minimization (ERM) Setup
Today: a differentially private algorithm that outputs θ_priv minimizing ℓ(θ_priv; D) − min_{θ∈C} ℓ(θ; D)
• L2/L2-setting: ‖∇ℓ(θ; d)‖_2 ≤ 1 and ‖θ‖_2 ≤ 1
  • Algorithm: private gradient descent
• L1/L∞-setting: ‖∇ℓ(θ; d)‖_∞ ≤ 1 and ‖θ‖_1 ≤ 1
  • Algorithms: Frank-Wolfe and LASSO

Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low dimensions
3. Private Frank-Wolfe and risk minimization in high dimensions
4. Private feature selection in high dimensions and the LASSO

Private ERM and noisy gradient descent (for the L2/L2-setting) [Bassily, Smith, T.’14]

Convex empirical risk minimization: L2/L2-setting
• Lipschitz continuity: ∀θ_1, θ_2 ∈ C, |ℓ(θ_1; d) − ℓ(θ_2; d)| ≤ ‖θ_1 − θ_2‖_2
• Δ-strong convexity: ∀θ_1, θ_2 ∈ C and t ∈ [0, 1],
  t·ℓ(θ_1; d) + (1 − t)·ℓ(θ_2; d) − ℓ(tθ_1 + (1 − t)θ_2; d) ≥ (Δ/2)·t(1 − t)·‖θ_1 − θ_2‖_2²
• Bounded set C: assume ‖θ‖_2 is bounded by a constant for all θ ∈ C
[Figure: a Lipschitz loss ℓ(θ; d), and a strongly convex loss lying below the chord between θ_1 and θ_2]

Why privacy is a concern in ERM? Computing the median
• Data set D = {d_1, ⋯, d_n}, where each d_i ∈ ℝ
• Median: θ* = argmin_{θ∈ℝ} Σ_{i=1}^n |θ − d_i|
• The median is itself a data point in D
[Figure: d_1, d_2, ⋯, d_n on the real line, with θ* coinciding with one of them]
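A tiny numpy illustration of this point (the data values are made up): the exact minimizer of the median loss is itself one of the records in D, so releasing it exactly reveals that record.

```python
import numpy as np

# Toy data set D of real numbers
D = np.array([2.1, 4.7, 9.3, 0.5, 6.2])

# ERM for the absolute loss: theta* = argmin_theta sum_i |theta - d_i|, i.e., the median
theta_star = np.median(D)

# The minimizer is literally one of the records in D, so releasing it exactly leaks that record
print(theta_star, theta_star in D)  # 4.7 True
```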
Why privacy is a concern in ERM? Support vector machines
• Support vector machine: dual formulation
• The separating hyperplane is determined by the support vectors
• Support vectors are essentially data points in D
[Figure: separating hyperplane with the support vectors highlighted]

Our results (data set size = n, dimension of the set C is p < n)

Loss class | Privacy | Excess empirical risk | Technique
Lipschitz | ε-DP | O(p/(nε)) | Exponential sampling (based on [MT07])
Lipschitz | (ε, δ)-DP | O(√p · log²(n/δ)/(nε)) | Stochastic gradient descent (formal analysis and improvements over [WM10] and [CSS13])
Lipschitz, Δ-strongly convex | ε-DP | O(p²/(n²ε²Δ)) | "Localization" + exponential sampling
Lipschitz, Δ-strongly convex | (ε, δ)-DP | O(p · log^{2.5}(n/δ)/(n²ε²Δ)) | Stochastic gradient descent

Private stochastic gradient descent

Private stochastic gradient descent (Priv-SGD)
Loss functions for D: ℓ(θ; d_1), ℓ(θ; d_2), ⋯, ℓ(θ; d_n)
1. Choose an arbitrary θ_1 ∈ C
2. For each time step t ∈ [T]:
   a. Sample d uniformly at random from {d_1, ⋯, d_n}
   b. θ̃ ← θ_t − η(t) · (n·∇ℓ(θ_t; d) + b_t), where η(t) is a decaying learning rate and b_t ∼ N(0, σ²_{ε,δ} I_p) is Gaussian noise whose scale σ_{ε,δ} is calibrated so that the overall algorithm is (ε, δ)-differentially private
   c. Project back onto the convex set: θ_{t+1} ← Π_C(θ̃)
3. Output θ_T
[Figure: iterates θ_1, θ_t, θ_{t+1} inside the convex set C, with the noisy step projected back onto C]
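A minimal sketch of the Priv-SGD loop above (not the authors' code), assuming the caller supplies a per-example gradient oracle grad_loss(theta, d) and a Euclidean projection project_C onto C; the noise scale sigma and the 1/√t learning-rate schedule are placeholders that must be calibrated to the desired (ε, δ) as in the privacy analysis that follows.

```python
import numpy as np

def priv_sgd(data, grad_loss, project_C, theta1, sigma, T=None):
    """Noisy projected SGD: sample one record, take a noisy gradient step, project back onto C."""
    n = len(data)
    T = T if T is not None else n * n                  # the analysis below uses T = n^2 iterations
    theta = np.asarray(theta1, dtype=float)
    for t in range(1, T + 1):
        d = data[np.random.randint(n)]                 # sample a uniformly random record
        eta = 1.0 / np.sqrt(t)                         # decaying learning rate (schedule assumed)
        b = np.random.normal(scale=sigma, size=theta.shape)    # Gaussian noise for privacy
        theta = theta - eta * (n * grad_loss(theta, d) + b)    # noisy stochastic gradient step
        theta = project_C(theta)                       # project back onto the convex set C
    return theta                                       # output theta_T
```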
Private stochastic gradient descent (Priv-SGD)
Privacy guarantee: for T = n², Priv-SGD is (ε, δ)-differentially private
Key insights (with the number of iterations T = n², and ignoring δ):
1. Providing ≈ ε/n differential privacy per iteration is sufficient (strong composition over T = n² iterations then gives ≈ √T · ε/n = ε overall)
2. [KRSU10] Sampling a single data point per iteration amplifies privacy, so the update θ̃ ← θ_t − η(t)·(n·∇ℓ(θ_t; d) + b_t) is ≈ ε/n-differentially private

Private stochastic gradient descent (Priv-SGD)
Utility guarantee (for T = n²):
ℓ(θ_priv; D) − min_{θ∈C} ℓ(θ; D) = O(√p · log²(n/δ)/(nε))
Key insights:
1. Unbiased estimator of the gradient: E_{b_t, d}[n·∇ℓ(θ_t; d) + b_t] = n·∇ℓ(θ_t; D)
2. Bounded variance: E_{b_t, d}[‖n·∇ℓ(θ_t; d) + b_t‖_2²] = O(p·n²·log(n/δ)/ε²)
The utility guarantee now follows from [SZ’13]

Private stochastic gradient descent (Priv-SGD)
Running time: assuming that computing one gradient ∇ℓ(θ; d) takes time O(p), the running time of Priv-SGD is O(n²·p)
Note: if we computed the true gradient of ℓ(θ; D) at every step, instead of sampling, the running time would be O(n³·p)

The utility guarantee of Priv-SGD is optimal
• Proof via fingerprinting codes [BS96, Tardos03]

Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low dimensions
3. Private Frank-Wolfe and risk minimization in high dimensions
4. Private feature selection in high dimensions and the LASSO

Private ERM and the Frank-Wolfe algorithm (for the L1/L∞-setting) [Talwar, T., Zhang]

Empirical Risk Minimization (ERM) Setup
Today: a differentially private algorithm that outputs θ_priv minimizing ℓ(θ_priv; D) − min_{θ∈C} ℓ(θ; D)
• L2/L2-setting: ‖∇ℓ(θ; d)‖_2 ≤ 1 and ‖θ‖_2 ≤ 1
  • Algorithm: private gradient descent
• L1/L∞-setting: ‖∇ℓ(θ; d)‖_∞ ≤ 1 and ‖θ‖_1 ≤ 1 (commonly referred to as the high-dimensional setting)
  • Algorithms: Frank-Wolfe and LASSO

Frank-Wolfe on the L1-ball (a stylized exposition)
1. Pick a corner of the L1-ball (θ_1 ∈ ℝ^p)
2. θ̃ ← argmin_{θ ∈ L1-ball} ⟨∇ℓ(θ_t; D), θ⟩ (minimize the linearized loss)
3. θ_{t+1} ← (1 − α)·θ_t + α·θ̃, typically with step size α ≈ 1/t
[Figure: the L1-ball with the iterate θ_1 moving toward the corner θ̃]

Frank-Wolfe on the L1-ball (a stylized exposition)
• θ̃ is always a corner of the L1-ball
• The final output θ_T is always a convex combination of the corners
• For smooth losses, the convergence rate is ≈ O(1/T) [FW56, Jaggi’13]

Differentially private Frank-Wolfe

Recap: Report-Noisy-Max (a.k.a. Exponential Mechanism)
Objective: output s ∈ S maximizing q(s, D), where GS(q) = max over s ∈ S and neighboring D, D′ of |q(s, D) − q(s, D′)|
1. q̂_s ← q(s, D) + Lap(GS(q)/ε) for each s ∈ S
2. Output the s with the highest value of q̂_s
Theorem (Privacy): the algorithm is 2ε-differentially private

Private Frank-Wolfe (hiding terms in δ)
1. Pick a corner of the L1-ball (θ_1 ∈ ℝ^p)
2. θ̃ ← argmin_{θ ∈ corners of the L1-ball} ⟨∇ℓ(θ_t; D) + b_t, θ⟩, with b_t ∼ Lap(√T/(nε))^p
3. θ_{t+1} ← (1 − 1/T)·θ_t + (1/T)·θ̃
(Privacy follows from report-noisy-max + strong composition)
Theorem (privacy): the algorithm is (ε, δ)-differentially private

Private Frank-Wolfe (hiding terms in δ)
Theorem (utility): for T ≈ n^{2/3},
E[ℓ(θ_T; D)] − min_{θ ∈ L1-ball} ℓ(θ; D) = O((log(n·p)/(nε))^{2/3})
• The guarantee is tight: optimality via fingerprinting codes [BS96, Tardos03]
• The guarantee is meaningful even when p ≫ n
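A minimal sketch of the private Frank-Wolfe steps above (not the authors' code), assuming a full-gradient oracle grad_loss(theta, D) whose coordinates have sensitivity O(1/n); the Laplace scale √T/(nε) and the fixed 1/T step size follow the stylized description above, with constants and δ-dependent factors omitted.

```python
import numpy as np

def private_frank_wolfe(D, grad_loss, p, eps, T):
    """Private Frank-Wolfe over the L1-ball: noisy linear minimization over corners, then averaging."""
    n = len(D)
    theta = np.zeros(p)
    theta[0] = 1.0                                    # start at a corner of the L1-ball
    lap_scale = np.sqrt(T) / (n * eps)                # per-step Laplace noise (constants omitted)
    for t in range(1, T + 1):
        g = grad_loss(theta, D)                       # gradient of the empirical loss at theta_t
        noisy = g + np.random.laplace(scale=lap_scale, size=p)
        # Linear minimization over the corners {+/- e_j}: pick the coordinate with largest |noisy gradient|
        j = int(np.argmax(np.abs(noisy)))
        corner = np.zeros(p)
        corner[j] = -np.sign(noisy[j])                # move opposite to the (noisy) gradient sign
        mu = 1.0 / T                                  # fixed step size, as in the stylized exposition above
        theta = (1.0 - mu) * theta + mu * corner
    return theta
```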
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low dimensions
3. Private Frank-Wolfe and risk minimization in high dimensions
4. Private feature selection in high dimensions and the LASSO

Feature selection in high-dimensional regression

Sparse Linear Regression in High Dimensions (p ≫ n)
• Data set: D = {(x_1, y_1), ⋯, (x_n, y_n)}, where x_i ∈ ℝ^p and y_i ∈ ℝ
• Assumption: the data are generated by a noisy linear system, y_i = ⟨x_i, θ*⟩ + w_i, with feature vector x_i, parameter vector θ* ∈ ℝ^p, and field noise w_i
• Data normalization: ∀i ∈ [n], ‖x_i‖_∞ ≤ 1, and each w_i is sub-Gaussian

Sparse Linear Regression in High Dimensions (p ≫ n)
• In matrix form: y_{n×1} = X_{n×p}·θ*_{p×1} + w_{n×1}, with design matrix X, response vector y, and field noise w
• Sparsity: θ* has s < n non-zero entries
• Bounded norm: every non-zero entry satisfies |θ*_j| ∈ (1 − Φ, 1 + Φ) for an arbitrarily small constant Φ
• This talk: model selection with differential privacy
Model selection problem: find the non-zero coordinates of θ*

Sparse Linear Regression in High Dimensions (p ≫ n)
Model selection: recover the non-zero coordinates (the support) of θ*
Solution: the LASSO estimator [Tibshirani94, EFJT03, Wainwright06, CT07, ZY07, …]
θ̂ ∈ argmin_{θ∈ℝ^p} (1/(2n))·‖y − Xθ‖_2² + Λ·‖θ‖_1

Consistency of the LASSO Estimator
Consistency conditions* [Wainwright06, ZY07], where Γ is the support of the underlying parameter vector θ*:
• Incoherence: |||X_{Γᶜ}^T X_Γ (X_Γ^T X_Γ)^{−1}|||_∞ < 1/4
• Restricted strong convexity: λ_min(X_Γ^T X_Γ) = Ω(n)
Theorem*: under a proper choice of Λ and n = Ω(s·log p), the support of the LASSO estimator θ̂ equals the support of θ*

Stochastic Consistency of the LASSO
Theorem [Wainwright06, ZY07]: if each entry of the design matrix X is drawn i.i.d. from N(0, 1/4), then the consistency conditions above are satisfied w.h.p.
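A small non-private sketch of the support-recovery pipeline above, using scikit-learn's Lasso; the sizes n, p, s, the noise level, and the regularization alpha (playing the role of Λ) are illustrative choices, not the ones from the theorem.

```python
import numpy as np
from sklearn.linear_model import Lasso

n, p, s = 200, 1000, 5
rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.5, size=(n, p))           # entries ~ N(0, 1/4), as in the stochastic setting above
theta_star = np.zeros(p)
theta_star[:s] = 1.0                            # s-sparse parameter vector
y = X @ theta_star + 0.1 * rng.normal(size=n)   # noisy linear responses

lasso = Lasso(alpha=0.05).fit(X, y)             # alpha plays the role of Lambda
support = np.flatnonzero(lasso.coef_)           # model selection: recovered support of theta*
print(support)                                  # ideally [0 1 2 3 4]
```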
Notion of Neighboring Data Sets
[Figure: design matrix X and response vector y; the data set D consists of the n rows (x_i, y_i)]
• The data set D′ replaces one row (x_i, y_i) with (x_i′, y_i′)
• D and D′ are neighboring data sets

Perturbation stability (a.k.a. zero local sensitivity)

Aside: Local Sensitivity [NRS07]
Data set D = {d_1, ⋯, d_n} and f: U* → ℝ^p a function on D
Local sensitivity: LS(f, D, 1) = max over D′ with d_H(D, D′) = 1 of ‖f(D) − f(D′)‖_1
1. E: random vector sampled from Lap(LS(f, D, 1)/ε)^p
2. Output f(D) + E
This is not differentially private: the noise scale LS(f, D, 1) itself depends on D and can leak information
Part II: we show that local sensitivity is nevertheless a useful tool

Perturbation Stability
[Figure: a data set D = (d_1, d_2, ⋯, d_n) and a neighbor D′ = (d_1, d_2′, ⋯, d_n), both mapped by a function f to the same output]
• Stability of f at D: the output does not change when any one entry of D is changed
• Equivalently, the local sensitivity of f at D is zero

Distance to Instability Property
• Definition: a function f: U* → ℜ is k-stable at a data set D if for every data set D′ ∈ U* with |D Δ D′| ≤ k, f(D) = f(D′)
• Distance to instability: max{k : f is k-stable at D}
• Objective: output f(D) while preserving differential privacy
[Figure: the space of all data sets, split into stable and unstable data sets, with D at distance > k from the unstable region]

Propose-Test-Release (PTR) framework [DL09, KRSY11, Smith T.’13]

A Meta-algorithm: Propose-Test-Release (PTR)
Basic tool: the Laplace mechanism, applied to the distance to instability (a query with global sensitivity one)
1. d_test ← max{k : f is k-stable at D}
2. d̂_test ← d_test + Lap(1/ε)
3. If d̂_test > log(1/δ)/ε, then return f(D); else return ⊥
Theorem: the algorithm is (ε, δ)-differentially private
Theorem: if f is (2·log(1/δ)/ε)-stable at D, then with probability ≥ 1 − δ the algorithm outputs f(D)
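A minimal sketch of the PTR meta-algorithm above, assuming the caller supplies the function f and a routine dist_to_instability(D) that computes (or lower-bounds, with global sensitivity one) the distance to instability; ⊥ is represented by None.

```python
import numpy as np

def propose_test_release(D, f, dist_to_instability, eps, delta):
    """Release f(D) only if the (noisy) distance to instability clears the threshold."""
    d_test = dist_to_instability(D)                       # max k such that f is k-stable at D
    d_hat = d_test + np.random.laplace(scale=1.0 / eps)   # the distance query has sensitivity 1
    if d_hat > np.log(1.0 / delta) / eps:
        return f(D)
    return None                                           # bottom: decline to answer
```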
Instantiation of PTR for the LASSO
LASSO: θ̂ ∈ argmin_{θ∈ℝ^p} (1/(2n))·‖y − Xθ‖_2² + Λ·‖θ‖_1
• Set the function f = support of θ̂
• Issue: for this f, the distance to instability might not be efficiently computable

From [Smith, T.’13]: consistency conditions ⇒ perturbation stability, argued via proxy conditions
This talk: proxy conditions that are efficiently testable with privacy

Perturbation Stability of the LASSO
LASSO objective: J(θ) = (1/(2n))·‖y − Xθ‖_2² + Λ·‖θ‖_1, with minimizer θ̂
Theorem: the consistency conditions for the LASSO are sufficient for perturbation stability
Proof sketch:
1. Analyze the Karush-Kuhn-Tucker (KKT) optimality conditions at θ̂
2. Show that support(θ̂) is stable on stable instances, using a "dual certificate"

Perturbation Stability of the LASSO
Proof sketch:
• Subgradient of the LASSO objective on D: ∂J_D(θ) = −(1/n)·X^T(y − Xθ) + Λ·∂‖θ‖_1
• θ̂ minimizes the LASSO objective on D, so 0 ∈ ∂J_D(θ̂); likewise θ̂′ minimizes the LASSO objective on D′, so 0 ∈ ∂J_{D′}(θ̂′)
• Argue using the optimality conditions of ∂J_D(θ̂) and ∂J_{D′}(θ̂′):
  1. No zero coordinate of θ̂ becomes non-zero in θ̂′ (uses the mutual incoherence condition)
  2. No non-zero coordinate of θ̂ becomes zero in θ̂′ (uses the restricted strong convexity condition)

Perturbation Stability Test for the LASSO
Γ: support of θ̂; Γᶜ: complement of the support of θ̂
Test for the following (the real test is more complex):
• Restricted strong convexity (RSC): the minimum eigenvalue of X_Γ^T X_Γ is Ω(n)
• Strong stability: the absolute values of the coordinates of the gradient of the least-squares loss on Γᶜ are ≪ Λ

Geometry of the Stability of the LASSO
Intuition: strong convexity ensures supp(θ̂) ⊆ supp(θ̂′)
1. Strong convexity ensures that ‖θ̂_Γ − θ̂′_Γ‖_∞ is small
2. If |θ̂_Γ(j)| is large for every j, then |θ̂′_Γ(j)| > 0 for every j
3. The consistency conditions imply that |θ̂_Γ(j)| is large for every j
[Figure: the LASSO objective along the coordinates in Γ]

Geometry of the Stability of the LASSO
Intuition: strong stability ensures that no zero coordinate of θ̂ becomes non-zero in θ̂′
• Along Γᶜ, the ℓ_1 term contributes slopes +Λ and −Λ to the LASSO objective
• For the minimizer to move along Γᶜ, the perturbation to the gradient of the least-squares loss has to be large
[Figure: the LASSO objective along the coordinates in Γᶜ, with kinks of slope ±Λ at zero]

Geometry of the Stability of the LASSO
• Gradient of the least-squares loss: g = −(1/n)·X^T(y − Xθ̂), with coordinates g_j
• Strong stability: |g_j| ≪ Λ for all j ∈ Γᶜ ⇒ every j ∈ Γᶜ still admits a zero subgradient for the LASSO objective on D′, so those coordinates stay zero
[Figure: the LASSO objective along Γᶜ with slopes ±Λ, and the gradient coordinates g_j well inside (−Λ, Λ)]

Making the Stability Test Private (Simplified)
• Test for restricted strong convexity: q_1(D) > t_1
• Test for strong stability: q_2(D) > t_2
• Issue: the queries q_1 and q_2 have sensitivities Δ_1 and Δ_2, so testing the two thresholds directly is awkward to do privately
• Our solution: a proxy distance d = max{ min_i (q_i(D) − t_i)/Δ_i , 0 }
  • d has global sensitivity one (each q_i changes by at most Δ_i between neighboring data sets)
  • d is large exactly when q_1 and q_2 are both well above their thresholds, i.e., large and insensitive

Private Model Selection with Optimal Sample Complexity
1. Compute the proxy distance d as a function of q_1(D) and q_2(D)
2. d̂_test ← d + Lap(1/ε)
3. If d̂_test > log(1/δ)/ε, then return supp(θ̂); else return ⊥
(Nearly optimal sample complexity)
Theorem: the algorithm is (ε, δ)-differentially private
Theorem: under the consistency conditions, if log p > α²·s³ and n = Ω(s·log p), then w.h.p. the support of θ* is output; here α = log(1/δ)/ε

Thesis: Differential privacy ⇒ generalizability
        Stable learning ⇒ differential privacy

Part I of this Talk
1. Towards a rigorous notion of statistical data privacy
2. Differential privacy: An overview
3. Generalization guarantee via differential privacy
4. Application: Follow-the-perturbed-leader

Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low dimensions
3. Private Frank-Wolfe and risk minimization in high dimensions
4. Private feature selection in high dimensions and the LASSO

Concluding Remarks
• Differential privacy and robust (stable) machine learning are closely related
• Not in this talk:
  • Private machine learning via bootstrapping [Smith T.’13]
  • Private non-convex learning via bootstrapping [BDMRTW]
  • False discovery rate control via differential privacy [DFHPR14]

Open Questions
• Develop the theory behind private non-convex learning
  • Analyze algorithms like expectation maximization and alternating minimization
• Private learning with time-series data (e.g., auto-regressive models)
• Private matrix completion (the Netflix problem) using Frank-Wolfe