From Differential Privacy to
Machine Learning, and Back
Abhradeep Guha Thakurta
Yahoo Labs, Sunnyvale
Thesis: Differential privacy ⇒ generalizability
Stable learning ⇒ differential privacy
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Recap: Differential privacy
Differential Privacy [DMNS06, DKMMN06]
• Adversary learns essentially the same thing irrespective of your presence or absence in the data set
[Figure: a randomized algorithm A (with its random coins) is run on a data set D containing record d_1 and on a data set D′ in which d_1 is replaced, producing outputs A(D) and A(D′)]
• D and D′ are called neighboring data sets
• Require: neighboring data sets induce close distributions on outputs
Differential Privacy [DMNS06, DKMMN06]
Definition:
A randomized algorithm A is (ε, δ)-differentially private if
• for all data sets D and D′ that differ in one element
• for all sets of answers S
    Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ
Semantics of Differential Privacy
• Differential privacy is a condition on the algorithm
• Guarantee is meaningful in the presence of any auxiliary information
• Typically, think of privacy parameters ε ≈ 0.1 and δ = 1/n^{log n}, where n = # of data samples
• Composition: the ε's and δ's add up over multiple executions
A few tools to design differentially private algorithms
Laplace Mechanism [DMNS06]
Data set D = {d_1, ..., d_n} and f: U* → ℝ^p a function on D
Global sensitivity: GS(f, 1) = max over neighboring D, D′ (d_H(D, D′) = 1) of ||f(D) − f(D′)||_1
1. E: random vector sampled from Lap(GS(f, 1)/ε)^p
2. Output f(D) + E
Theorem (Privacy): The algorithm is ε-differentially private
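A minimal NumPy sketch of the Laplace mechanism (illustrative names only; the value f(D) and its L1 global sensitivity are assumed to be supplied by the caller):

```python
import numpy as np

def laplace_mechanism(f_of_D, gs_l1, epsilon, rng=None):
    """Release f(D) + Lap(GS(f,1)/epsilon)^p, an epsilon-DP estimate of f(D).

    f_of_D: the exact value f(D) as a NumPy array of shape (p,)
    gs_l1:  the L1 global sensitivity GS(f, 1)
    """
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=gs_l1 / epsilon, size=np.shape(f_of_D))
    return f_of_D + noise

# Example: privately release the mean of n records lying in [0, 1]^p.
# Replacing one record changes the mean by at most p/n in L1 norm.
D = np.random.default_rng(0).random((1000, 5))
n, p = D.shape
print(laplace_mechanism(D.mean(axis=0), gs_l1=p / n, epsilon=0.1))
```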
Gaussian Mechanism [DKMMN06]
Data set D = {d_1, ..., d_n} and f: U* → ℝ^p a function on D
Global sensitivity: GS(f, 2) = max over neighboring D, D′ (d_H(D, D′) = 1) of ||f(D) − f(D′)||_2
1. E: random vector sampled from N(0, I_p · (GS(f, 2) · √(log(1/δ)) / ε)²)
2. Output f(D) + E
Theorem (Privacy): The algorithm is (ε, δ)-differentially private
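An analogous sketch of the Gaussian mechanism; the concrete noise scale GS(f,2)·√(2 ln(1.25/δ))/ε is the standard calibration (valid for ε ≤ 1) and stands in for the O-notation on the slide:

```python
import numpy as np

def gaussian_mechanism(f_of_D, gs_l2, epsilon, delta, rng=None):
    """Release f(D) + N(0, sigma^2 I_p), an (epsilon, delta)-DP estimate of f(D)."""
    rng = rng or np.random.default_rng()
    sigma = gs_l2 * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return f_of_D + rng.normal(loc=0.0, scale=sigma, size=np.shape(f_of_D))
```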
Report-Noisy-Max (a.k.a. Exponential Mechanism)
[MT07,BLST10]
Set of candidate outputs: S
Score function: f: S × D^n → ℝ  (D^n is the domain of data sets)
Objective: Output s ∈ S maximizing f(s, D)
Global sensitivity: GS(f) = max over s ∈ S and all neighbors D, D′ of |f(s, D) − f(s, D′)|
1. a_s ← f(s, D) + Lap(GS(f)/ε) for each s ∈ S
2. Output the s with the highest value of a_s
Theorem (Privacy): The algorithm is 2ε-differentially private
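A sketch of report-noisy-max over a finite candidate set; the score function and its global sensitivity are assumed inputs, and only the winning candidate is released:

```python
import numpy as np

def report_noisy_max(candidates, score, D, gs_score, epsilon, rng=None):
    """Return the candidate s maximizing score(s, D) + Lap(GS(score)/epsilon).

    As in the theorem on the slide, this selection is 2*epsilon-differentially
    private (the noisy scores themselves are not released).
    """
    rng = rng or np.random.default_rng()
    noisy = [score(s, D) + rng.laplace(scale=gs_score / epsilon) for s in candidates]
    return candidates[int(np.argmax(noisy))]
```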
Composition Theorems [DMNS06,DL09,DRV10]
Run (ε, δ)-diff. private algorithms A_1, ..., A_k on the same data set.
Weak composition: A_1 ∘ ⋯ ∘ A_k is (kε, kδ)-diff. private
Strong composition (≈): A_1 ∘ ⋯ ∘ A_k is (√k·ε, kδ)-diff. private
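A toy calculation of the privacy budget after k executions, using the approximate √k form from the slide (constants and the extra δ-slack term of the exact strong-composition theorem are suppressed):

```python
import math

def weak_composition(epsilon, delta, k):
    """(k*eps, k*delta): the epsilons and deltas simply add up."""
    return k * epsilon, k * delta

def strong_composition_approx(epsilon, delta, k):
    """The slide's approximate form (sqrt(k)*eps, k*delta)."""
    return math.sqrt(k) * epsilon, k * delta

print(weak_composition(0.1, 1e-6, k=100))           # roughly (10.0, 1e-4)
print(strong_composition_approx(0.1, 1e-6, k=100))  # roughly (1.0, 1e-4)
```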
Recap: (Private) convex risk minimization
Convex Empirical Risk Minimization (ERM): An Example
Linear classifiers in ℝ^p
Domain: feature vector x ∈ ℝ^p, label y ∈ {yellow, red} ≡ {+1, −1}
Data set D = {(x_1, y_1), ..., (x_n, y_n)}, drawn i.i.d. from a distribution τ over (x, y)
Find θ ∈ C (a convex set of constant diameter) that classifies (x, y) ∼ τ
Minimize risk: θ_R = arg min_{θ∈C} E_{(x,y)∼τ}[ y⟨x, θ⟩ ]
ERM: Use L(θ; D) = (1/n) Σ_{i=1}^n y_i⟨x_i, θ⟩ to approximate θ_R
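A small sketch of the empirical loss and its gradient for the linear-classifier example above (names and data are illustrative); the rows are normalized so that per-example gradient norms are bounded by 1, matching the L2/L2-setting introduced below:

```python
import numpy as np

def empirical_loss(theta, X, y):
    # L(theta; D) = (1/n) * sum_i y_i * <x_i, theta>, the linear loss of the example
    return float(np.mean(y * (X @ theta)))

def loss_gradient(theta, X, y):
    # gradient of ell(theta; (x_i, y_i)) = y_i * x_i, averaged over the data set
    return (y[:, None] * X).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i||_2 <= 1, so ||grad ell||_2 <= 1
y = np.sign(rng.normal(size=100))
theta = np.zeros(10)
print(empirical_loss(theta, X, y), loss_gradient(theta, X, y).shape)
```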
Empirical Risk Minimization (ERM) Setup
Convex loss function: ℓ: C × D → ℝ
Data set D = {d_1, ..., d_n}
Loss function: L(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i)
Regularized ERM: θ̂ = arg min_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i) + r(θ), where the regularizer r is used to stop overfitting
Objective: Minimize the excess risk
    E_{d∼τ}[ℓ(θ̂; d)] − min_{θ∈C} E_{d∼τ}[ℓ(θ; d)]
Empirical Risk Minimization (ERM) Setup
Convex loss function: ℓ: C × D → ℝ
Data set D = {d_1, ..., d_n}; loss function L(θ; D)
Empirical risk minimizer: θ̂ = arg min_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i)
Recall: Differential privacy + low excess empirical risk ⇒ low excess true risk
Today: Differentially private algorithms that output θ_priv minimizing the excess empirical risk
    L(θ_priv; D) − min_{θ∈C} L(θ; D)
Empirical Risk Minimization (ERM) Setup
Today: Differentially private algorithms that output θ_priv minimizing the excess empirical risk
    L(θ_priv; D) − min_{θ∈C} L(θ; D)
• L2/L2-setting: ||∇ℓ(θ; d)||_2 ≤ 1 and ||C||_2 ≤ 1
  • Algorithm: Private gradient descent
• L1/L∞-setting: ||∇ℓ(θ; d)||_∞ ≤ 1 and ||C||_1 ≤ 1
  • Algorithms: Frank-Wolfe and LASSO
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Private ERM and noisy gradient descent
(for the L2/L2-setting) [Bassily, Smith, T.'14]
Convex empirical risk minimization: L2/L2-setting
Lipschitz continuity:
    ∀ θ_1, θ_2 ∈ C:  ℓ(θ_1; d) − ℓ(θ_2; d) ≤ ||θ_1 − θ_2||_2
Δ-strong convexity (for all t ∈ [0, 1]):
    t·ℓ(θ_1; d) + (1 − t)·ℓ(θ_2; d) − ℓ(t·θ_1 + (1 − t)·θ_2; d) ≥ (Δ/2)·t(1 − t)·||θ_1 − θ_2||_2²
Bounded set C:
• Assume ||C||_2 is bounded by a constant
Why is privacy a concern in ERM?
Computing the median
Data set D = {d_1, ..., d_n}, where each d_i ∈ ℝ
Median: θ* = arg min_{θ∈ℝ} Σ_{i=1}^n |θ − d_i|
The median is itself a data point in D.
Why is privacy a concern in ERM?
Support vector machine: dual formulation
[Figure: separating hyperplane with its support vectors]
The support vectors are essentially data points in D.
Our results (data set size = n, dimension of the set C is p < n)

Lipschitz losses:
• ε-DP: excess empirical risk O(p/(nε)), via exponential sampling (based on [MT07])
• (ε, δ)-DP: O(√(p log²(n/δ))/(nε)), via stochastic gradient descent (formal analysis and improvements over [WM10] and [CSS13])

Lipschitz and Δ-strongly convex losses:
• ε-DP: O(p²/(n²ε)), via "localization" + exponential sampling
• (ε, δ)-DP: O(p log^2.5(n/δ)/(n²Δε)), via stochastic gradient descent
The rest of this part focuses on the Lipschitz, (ε, δ)-DP row of the table: excess empirical risk O(√(p log²(n/δ))/(nε)) via stochastic gradient descent (formal analysis and improvements over [WM10] and [CSS13]).
Private stochastic gradient descent
Private stochastic gradient descent (Priv-SGD)
Loss functions for D: ℓ(θ; d_1), ℓ(θ; d_2), ..., ℓ(θ; d_n); convex set C
1. Choose an arbitrary θ_1 ∈ C
2. For each time step t ∈ [T]:
   a. Sample d uniformly at random from {d_1, ..., d_n}
   b. θ̃ ← θ_t − η_t·(n·∂ℓ(θ_t; d) + b_t), where b_t ∼ N(0, O(T·Δ_{ε,δ})·I_p) and the learning rate is η_t = O(1/√(t·T·p))
   c. Project back onto C: θ_{t+1} ← Π_C(θ̃)
3. Finally, output θ_T
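A sketch of Priv-SGD for the L2/L2-setting, assuming per-example gradients with ||∂ℓ||_2 ≤ 1 and C equal to the unit L2 ball; the noise parameter sigma stands in for the O(√T·Δ_{ε,δ}) calibration on the slide, and all names are illustrative:

```python
import numpy as np

def project_l2_ball(theta, radius=1.0):
    """Euclidean projection onto C = {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def priv_sgd(grad_fn, data, p, T, sigma, eta_fn, rng=None):
    """Noisy projected SGD (sketch).

    grad_fn(theta, d): gradient of ell(theta; d), assumed to satisfy ||.||_2 <= 1
    sigma:             per-step Gaussian noise scale (stands in for the
                       O(sqrt(T) * Delta_{eps,delta}) calibration on the slide)
    eta_fn(t):         learning-rate schedule, e.g. lambda t: 1 / np.sqrt(t * T * p)
    """
    rng = rng or np.random.default_rng()
    n = len(data)
    theta = np.zeros(p)                          # arbitrary theta_1 in C
    for t in range(1, T + 1):
        d = data[rng.integers(n)]                # sample one record uniformly at random
        b = rng.normal(scale=sigma, size=p)      # Gaussian noise b_t
        theta = theta - eta_fn(t) * (n * grad_fn(theta, d) + b)
        theta = project_l2_ball(theta)           # project back onto C
    return theta                                 # theta_T; the analysis uses T = n**2
```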
Private stochastic gradient descent (Priv-SGD)
Privacy guarantee: For T = n², Priv-SGD is (ε, δ)-differentially private.
Key insights [denote the number of iterations by T = n², and ignore δ]:
1. Providing ≈ 1/√T differential privacy per iteration is sufficient.
2. [KRSU10] Sampling ensures ≈ 1/√T differential privacy for the update θ̃ ← θ_t − η_t·(n·∂ℓ(θ_t; d) + b_t).
Private stochastic gradient descent (Priv-SGD)
Utility guarantee (for T = n²):
    L(θ_priv; D) − min_{θ∈C} L(θ; D) = O(√(p log²(n/δ))/(nε))
Key insights:
1. Unbiased estimator of the gradient: sampling d uniformly from {d_1, ..., d_n} gives
       E_{b_t, d}[n·∂ℓ(θ_t; d) + b_t] = n·∂L(θ_t; D)
2. Bounded variance:
       E_{b_t, d}[||n·∂ℓ(θ_t; d) + b_t||_2²] = O(p·n²·log(n/δ)/ε²)
The utility guarantee now follows from [SZ'13].
Private stochastic gradient descent (Priv-SGD)
Running time: Assuming that computing a single gradient ∂ℓ(θ; d) takes time O(p), the running time of Priv-SGD is O(n²·p).
Note: If we computed the true gradient of L instead of sampling, the running time would be O(n³·p).
The utility guarantee of Priv-SGD is optimal.
• Proof via fingerprinting codes [BS96, Tardos03]
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Private ERM and the Frank-Wolfe
(for the L1/L∞-setting) [Talwar, T., Zhang]
Empirical Risk Minimization (ERM) Setup
Today: Differentially private algorithms that output θ_priv minimizing the excess empirical risk
    L(θ_priv; D) − min_{θ∈C} L(θ; D)
• L2/L2-setting: ||∇ℓ(θ; d)||_2 ≤ 1 and ||C||_2 ≤ 1
  • Algorithm: Private gradient descent
• L1/L∞-setting: ||∇ℓ(θ; d)||_∞ ≤ 1 and ||C||_1 ≤ 1 (commonly referred to as the high-dimensional setting)
  • Algorithms: Frank-Wolfe and LASSO
Frank-Wolfe on the L1-ball (a stylized exposition)
1. Pick a corner of the L1-ball (θ_1 ∈ ℝ^p)
2. s ← arg min_{θ ∈ L1-ball} ⟨∇L(θ_t; D), θ⟩   (minimize the linearized loss)
3. θ_{t+1} ← (1 − α)·θ_t + α·s, typically with α ≈ 1/T
Frank-Wolfe on the L1-ball (a stylized exposition)
• s is always a corner of the L1-ball
• The final output θ_T is therefore a convex combination of the corners
• For smooth losses, the convergence rate is ≈ O(1/T) [FW56, Jaggi'13]
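The linear-minimization step over the L1 ball has a closed form: the minimizer of ⟨g, θ⟩ over ||θ||_1 ≤ 1 is a signed standard basis vector. A small sketch (illustrative names):

```python
import numpy as np

def l1_ball_corner(g):
    """argmin over the unit L1 ball of <g, theta>.

    Pick the coordinate of g with the largest magnitude and move to the
    corresponding corner, in the direction opposite to the sign of g.
    """
    i = int(np.argmax(np.abs(g)))
    s = np.zeros_like(g, dtype=float)
    s[i] = -np.sign(g[i]) if g[i] != 0 else 1.0
    return s
```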
Differentially private Frank-Wolfe
Recap: Report-Noisy-Max (a.k.a. Exponential Mechanism)
Objective: Output s ∈ S maximizing f(s, D)
Global sensitivity: GS(f) = max over s ∈ S and all neighbors D, D′ of |f(s, D) − f(s, D′)|
1. a_s ← f(s, D) + Lap(GS(f)/ε) for each s ∈ S
2. Output the s with the highest value of a_s
Theorem (Privacy): The algorithm is 2ε-differentially private
Private Frank-Wolfe (hiding terms in ๐›ฟ)
1. Pick a corner of the L1-ball (θ_1 ∈ ℝ^p)
2. s ← arg min_{θ ∈ L1-ball} ⟨∇L(θ_t; D) + b_t, θ⟩, where each coordinate of b_t ∼ Lap(√T/(nε))
3. θ_{t+1} ← (1 − 1/T)·θ_t + (1/T)·s
Privacy follows from report-noisy-max + strong composition.
Theorem (privacy): The algorithm is (ε, δ)-differentially private
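A sketch of the private Frank-Wolfe loop, assuming ||∇ℓ||_∞ ≤ 1 so that each coordinate of ∇L(θ; D) changes by at most 1/n between neighboring data sets; the Lap(√T/(nε)) scale follows the slide (terms in δ hidden), and the names are illustrative:

```python
import numpy as np

def private_frank_wolfe(grad_L, p, n, T, epsilon, rng=None):
    """Private Frank-Wolfe over the unit L1 ball (sketch, hiding terms in delta).

    grad_L(theta): gradient of the empirical loss L(theta; D) in R^p, assumed to
                   satisfy ||grad ell||_inf <= 1 per example.
    """
    rng = rng or np.random.default_rng()
    theta = np.zeros(p)
    theta[0] = 1.0                                   # start at a corner of the L1 ball
    scale = np.sqrt(T) / (n * epsilon)               # Lap(sqrt(T)/(n*eps)) per coordinate
    for _ in range(T):
        noisy_grad = grad_L(theta) + rng.laplace(scale=scale, size=p)
        i = int(np.argmax(np.abs(noisy_grad)))       # report-noisy-max over the 2p corners
        s = np.zeros(p)
        s[i] = -np.sign(noisy_grad[i])
        theta = (1.0 - 1.0 / T) * theta + (1.0 / T) * s
    return theta                                     # the analysis uses T ~ n**(2/3)
```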
Private Frank-Wolfe (hiding terms in ๐›ฟ)
Theorem (utility): For T ≈ n^{2/3},
    E[L(θ_T; D)] − min_{θ ∈ L1-ball} L(θ; D) = O(log(n·p)/(nε)^{2/3})
The guarantee is tight; optimality is shown via fingerprinting codes [BS96, Tardos03].
The guarantee is meaningful even when p ≫ n.
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Feature selection in high-dimensional regression
Sparse Linear Regression in High-dimensions (p ≫ n)
• Data set: D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ ℝ^p and y_i ∈ ℝ
• Assumption: the data are generated by a noisy linear system
      y_i = ⟨x_i, θ*⟩ + w_i
  (feature vector x_i, parameter vector θ* ∈ ℝ^p, field noise w_i)
• Data normalization: ∀ i ∈ [n], ||x_i||_∞ ≤ 1, and each w_i is sub-Gaussian
Sparse Linear Regression in High-dimensions (p ≫ n)
• Data set: D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ ℝ^p and y_i ∈ ℝ
• Assumption: the data are generated by a noisy linear system
      y_{n×1} = X_{n×p} · θ*_{p×1} + w_{n×1}
  (response vector y, design matrix X, parameter vector θ*, field noise w)
• Sparsity: θ* has s < n non-zero entries
• Bounded norm: for every i in the support, |θ*_i| ∈ (Φ, 1 − Φ) for an arbitrarily small constant Φ
Sparse Linear Regression in High-dimensions (p ≫ n)
Model selection problem: Find the non-zero coordinates (the support) of θ*. This talk: with differential privacy.
Solution: the LASSO estimator [Tibshirani94, EFJT03, Wainwright06, CT07, ZY07, ...]
    θ̂ ∈ arg min_{θ ∈ ℝ^p} (1/(2n))·||y − Xθ||_2² + Λ·||θ||_1
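A sketch of computing the (non-private) LASSO estimator and reading off its support with scikit-learn; the 1/(2n) least-squares scaling matches sklearn.linear_model.Lasso, whose alpha plays the role of Λ (the toy instance and parameter values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_support(X, y, Lam):
    """Support of argmin_theta (1/(2n))*||y - X theta||_2^2 + Lam*||theta||_1."""
    model = Lasso(alpha=Lam, fit_intercept=False).fit(X, y)
    return np.flatnonzero(model.coef_)

# Toy instance: p >> n, s-sparse theta*, Gaussian design as in the stochastic setting below.
rng = np.random.default_rng(0)
n, p, s = 200, 1000, 5
X = rng.normal(scale=0.5, size=(n, p))     # entries ~ N(0, 1/4)
theta_star = np.zeros(p)
theta_star[:s] = 0.5
y = X @ theta_star + 0.1 * rng.normal(size=n)
print(lasso_support(X, y, Lam=0.05))       # ideally recovers {0, ..., s-1}
```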
Consistency of the LASSO Estimator
Consistency conditions* [Wainwright06, ZY07]:
• Γ: support of the underlying parameter vector θ*
• Incoherence: |||X_{Γ^c}^T · X_Γ · (X_Γ^T X_Γ)^{-1}|||_∞ < 1/4
• Restricted strong convexity: λ_min(X_Γ^T X_Γ) = Ω(n)
Theorem*: Under a proper choice of Λ and n = Ω(s log p), the support of the LASSO estimator θ̂ equals the support of θ*.
Stochastic Consistency of the LASSO
Theorem [Wainwright06, ZY07]: If each entry of X is drawn i.i.d. from N(0, 1/4), then the consistency conditions above (incoherence and restricted strong convexity) are satisfied w.h.p.
Notion of Neighboring Data sets
Data set D: design matrix with rows x_1, ..., x_n and response vector with entries y_1, ..., y_n.
Data set D′: obtained from D by replacing one row (x_i, y_i) with (x_i′, y_i′).
D and D′ are neighboring data sets.
Perturbation stability
(a.k.a. zero local sensitivity)
Aside: Local Sensitivity [NRS07]
Data set D = {d_1, ..., d_n} and f: U* → ℝ^p a function on D
Local sensitivity: LS(f, D, 1) = max over D′ with d_H(D, D′) = 1 of ||f(D) − f(D′)||_1
1. E: random vector sampled from Lap(LS(f, D, 1)/ε)^p
2. Output f(D) + E
This is not differentially private.
Part II: we show that local sensitivity is nonetheless a useful tool.
Perturbation Stability
[Figure: a data set D = (d_1, d_2, ..., d_n) and a neighboring data set D′ = (d_1, d_2′, ..., d_n) are both passed through the function f and produce the same output]
Stability of f at D: the output does not change when any one entry of D is changed.
Equivalently, the local sensitivity of f at D is zero.
Distance to Instability Property
• Definition: A function f: U* → ℜ is k-stable at a data set D if for every data set D′ ∈ U* with |D Δ D′| ≤ k, f(D) = f(D′).
• Distance to instability: max{k : f(D) is k-stable}
• Objective: Output f(D) while preserving differential privacy
[Figure: among all data sets, D lies in the stable region, at distance > k from the unstable data sets]
Propose-Test-Release (PTR) framework
[DL09, KRSY11, Smith T.’13]
A Meta-algorithm: Propose-Test-Release (PTR)
1. dist ← max{k : f(D) is k-stable}
2. d̂ist ← dist + Lap(1/ε)   (basic tool: the Laplace mechanism)
3. If d̂ist > log(1/δ)/ε, then return f(D), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private.
Theorem: If f is (2·log(1/δ)/ε)-stable at D, then with probability ≥ 1 − δ the algorithm outputs f(D).
Recap: Propose-Test-Release (PTR) Framework
1. dist ← max{k : f(D) is k-stable}   (to be instantiated with some global-sensitivity-one query)
2. d̂ist ← dist + Lap(1/ε)
3. If d̂ist > log(1/δ)/ε, then return f(D), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private.
Theorem: If f is (2·log(1/δ)/ε)-stable at D, then with probability ≥ 1 − δ the algorithm outputs f(D).
Instantiation of PTR for the LASSO
LASSO: θ̂ ∈ arg min_{θ ∈ ℝ^p} (1/(2n))·||y − Xθ||_2² + Λ·||θ||_1
• Set the function f = support of θ̂
• Issue: for this f, the distance to instability might not be efficiently computable
From [Smith, T.'13]: consistency conditions ⇒ proxy conditions ⇒ perturbation stability
This talk: the proxy conditions are efficiently testable with privacy
Perturbation Stability of the LASSO
LASSO: θ̂ ∈ arg min_{θ ∈ ℝ^p} J(θ; D), where J(θ; D) = (1/(2n))·||y − Xθ||_2² + Λ·||θ||_1
Theorem: The consistency conditions for the LASSO are sufficient for perturbation stability.
Proof sketch:
1. Analyze the Karush-Kuhn-Tucker (KKT) optimality conditions at θ̂.
2. Show that support(θ̂) is stable by exhibiting a ''dual certificate'' on stable instances.
Perturbation Stability of the LASSO
Proof sketch:
(Sub)gradient of the LASSO objective: ∂J_D(θ) = −(1/n)·X^T(y − Xθ) + Λ·∂||θ||_1
• On D: the minimizer θ̂ satisfies 0 ∈ ∂J_D(θ̂)
• On the neighboring D′: the minimizer θ̂′ satisfies 0 ∈ ∂J_{D′}(θ̂′)
Perturbation Stability of the LASSO
Proof sketch (continued):
Argue using the optimality conditions 0 ∈ ∂J_D(θ̂) and 0 ∈ ∂J_{D′}(θ̂′):
1. No zero coordinate of θ̂ becomes non-zero in θ̂′ (uses the mutual incoherence condition)
2. No non-zero coordinate of θ̂ becomes zero in θ̂′ (uses the restricted strong convexity condition)
Perturbation Stability Test for the LASSO
Γ: support of θ̂;  Γ^c: complement of the support of θ̂
Test for the following (the real test is more complex):
• Restricted strong convexity (RSC): the minimum eigenvalue of X_Γ^T X_Γ is Ω(n)
• Strong stability: the absolute values of the coordinates of the gradient of the least-squares loss on Γ^c are ≪ Λ
Geometry of the Stability of LASSO
Intuition: strong convexity ensures supp(θ̂) ⊆ supp(θ̂′)
1. Strong convexity ensures that ||θ̂_Γ − θ̂′_Γ||_∞ is small
2. If |θ̂_Γ(i)| is large for every i, then |θ̂′_Γ(i)| > 0 for every i
3. The consistency conditions imply that |θ̂_Γ(i)| is large for every i
[Figure: LASSO objective along a dimension in Γ and a dimension in Γ^c, with the minimizer θ̂]
Geometry of the Stability of LASSO
Intuition: strong stability ensures that no zero coordinate of θ̂ becomes non-zero in θ̂′
[Figure: LASSO objective along Γ^c, with the ±Λ slopes of the L1 penalty around the minimizer θ̂]
• For the minimizer to move along Γ^c, the perturbation to the gradient of the least-squares loss has to be large
Geometry of the Stability of LASSO
Gradient of the least-squares loss: −X^T(y − Xθ̂), with coordinates a_i split across Γ and Γ^c
• Strong stability: |a_i| ≪ Λ for all i ∈ Γ^c ⇒ every i ∈ Γ^c has a subgradient of zero for LASSO(D′)
Making the Stability Test Private (Simplified)
• Test for restricted strong convexity: g_1(D) > t_1
• Test for strong stability: g_2(D) > t_2
• Issue: the test statistics g_1 and g_2 have sensitivities Δ_1 and Δ_2, so the tests cannot be thresholded directly under differential privacy
• Our solution: the proxy distance
      d = max( min_i (g_i(D) − t_i)/Δ_i + 1, 0 )
  • d has global sensitivity one
  • d is large only when g_1 and g_2 both exceed their thresholds by many multiples of their sensitivities
Private Model Selection with Optimal Sample
Complexity
1. Compute d, the proxy distance built from g_1(D) and g_2(D)
2. d̂ist ← d + Lap(1/ε)
3. If d̂ist > log(1/δ)/ε, then return supp(θ̂), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private.
Theorem: Under the consistency conditions, log p > α²·s³, and n = Ω(s log p), w.h.p. the support of θ* is output. Here α = log(1/δ)/ε.
This is a nearly optimal sample complexity.
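Putting the pieces together, a sketch of the private model-selection procedure; the statistics g1 (restricted strong convexity) and g2 (strong stability), their thresholds, and their sensitivities are treated as caller-supplied ingredients (the real test on the slide is more complex), and the proxy-distance formula follows the reconstruction above:

```python
import numpy as np

def proxy_distance(stats, thresholds, sensitivities):
    """d = max( min_i (g_i(D) - t_i)/Delta_i + 1, 0 ); global sensitivity one."""
    margins = [(g - t) / s for g, t, s in zip(stats, thresholds, sensitivities)]
    return max(min(margins) + 1.0, 0.0)

def private_lasso_support(lasso_support_fn, D, g1, g2, t1, t2, Delta1, Delta2,
                          epsilon, delta, rng=None):
    """Release supp(theta_hat) via PTR on the proxy distance (sketch)."""
    rng = rng or np.random.default_rng()
    d = proxy_distance([g1(D), g2(D)], [t1, t2], [Delta1, Delta2])
    noisy_d = d + rng.laplace(scale=1.0 / epsilon)
    if noisy_d > np.log(1.0 / delta) / epsilon:
        return lasso_support_fn(D)
    return None   # bottom: refuse to answer
```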
Thesis: Differential privacy ⇒ generalizability
Stable learning ⇒ differential privacy
Part I of this Talk
1. Towards a rigorous notion of statistical data privacy
2. Differential privacy: An overview
3. Generalization guarantee via differential privacy
4. Application: Follow-the-perturbed-leader
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Concluding Remarks
• Diff. privacy and robust (stable) machine learning are closely related
• Not in this talk:
• Private machine learning via bootstrapping [Smith T.’13]
• Private non-convex learning via bootstrapping [BDMRTW]
• False discovery rate control via differential privacy [DFHPR14]
Open Questions
• Develop the theory behind private non-convex learning
• Analyze algorithms like expectation maximization and alternating
minimization
• Private learning with time-series data (e.g., auto-regressive models)
• Private matrix completion (the Netflix problem) using Frank-Wolfe