From Differential Privacy to
Machine Learning, and Back
Abhradeep Guha Thakurta
Yahoo Labs, Sunnyvale
Thesis: Differential privacy ⇒ generalizability
Stable learning ⇒ differential privacy
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Recap: Differential privacy
Differential Privacy [DMNS06, DKMMN06]
• Adversary learns essentially the same thing irrespective of your presence or absence in the data set
[Figure: a randomized algorithm A (with its random coins) is run on a data set D containing record d_1 and on a data set D′ in which d_1 is replaced, producing outputs A(D) and A(D′)]
• D and D′ are called neighboring data sets
• Require: neighboring data sets induce close distributions on outputs
Differential Privacy [DMNS06, DKMMN06]
Definition:
A randomized algorithm A is (ε, δ)-differentially private if
• for all data sets D and D′ that differ in one element
• for all sets of answers S
    Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ
Semantics of Differential Privacy
• Differential privacy is a condition on the algorithm
• Guarantee is meaningful in the presence of any auxiliary information
• Typically, think of privacy parameters ε ≈ 0.1 and δ = 1/n^{log n}, where n = # of data samples
• Composition: the ε's and δ's add up over multiple executions
A few tools to design differentially private algorithms
Laplace Mechanism [DMNS06]
Data set D = {d_1, ..., d_n} and f: U* → ℝ^p a function on D
Global sensitivity: GS(f, 1) = max over neighboring D, D′ (d_H(D, D′) = 1) of ||f(D) − f(D′)||_1
1. E: random vector sampled from Lap(GS(f, 1)/ε)^p
2. Output f(D) + E
Theorem (Privacy): The algorithm is ε-differentially private
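A minimal NumPy sketch of the Laplace mechanism (illustrative names only; the value f(D) and its L1 global sensitivity are assumed to be supplied by the caller):

```python
import numpy as np

def laplace_mechanism(f_of_D, gs_l1, epsilon, rng=None):
    """Release f(D) + Lap(GS(f,1)/epsilon)^p, an epsilon-DP estimate of f(D).

    f_of_D: the exact value f(D) as a NumPy array of shape (p,)
    gs_l1:  the L1 global sensitivity GS(f, 1)
    """
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=gs_l1 / epsilon, size=np.shape(f_of_D))
    return f_of_D + noise

# Example: privately release the mean of n records lying in [0, 1]^p.
# Replacing one record changes the mean by at most p/n in L1 norm.
D = np.random.default_rng(0).random((1000, 5))
n, p = D.shape
print(laplace_mechanism(D.mean(axis=0), gs_l1=p / n, epsilon=0.1))
```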
Gaussian Mechanism [DKMMN06]
Data set D = {d_1, ..., d_n} and f: U* → ℝ^p a function on D
Global sensitivity: GS(f, 2) = max over neighboring D, D′ (d_H(D, D′) = 1) of ||f(D) − f(D′)||_2
1. E: random vector sampled from N(0, I_p · (GS(f, 2) · √(log(1/δ)) / ε)²)
2. Output f(D) + E
Theorem (Privacy): The algorithm is (ε, δ)-differentially private
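An analogous sketch of the Gaussian mechanism; the concrete noise scale GS(f,2)·√(2 ln(1.25/δ))/ε is the standard calibration (valid for ε ≤ 1) and stands in for the O-notation on the slide:

```python
import numpy as np

def gaussian_mechanism(f_of_D, gs_l2, epsilon, delta, rng=None):
    """Release f(D) + N(0, sigma^2 I_p), an (epsilon, delta)-DP estimate of f(D)."""
    rng = rng or np.random.default_rng()
    sigma = gs_l2 * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return f_of_D + rng.normal(loc=0.0, scale=sigma, size=np.shape(f_of_D))
```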
Report-Noisy-Max (a.k.a. Exponential Mechanism)
[MT07,BLST10]
Set of candidate outputs: S
Score function: f: S × D^n → ℝ  (D^n is the domain of data sets)
Objective: Output s ∈ S maximizing f(s, D)
Global sensitivity: GS(f) = max over s ∈ S and all neighbors D, D′ of |f(s, D) − f(s, D′)|
1. a_s ← f(s, D) + Lap(GS(f)/ε) for each s ∈ S
2. Output the s with the highest value of a_s
Theorem (Privacy): The algorithm is 2ε-differentially private
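A sketch of report-noisy-max over a finite candidate set; the score function and its global sensitivity are assumed inputs, and only the winning candidate is released:

```python
import numpy as np

def report_noisy_max(candidates, score, D, gs_score, epsilon, rng=None):
    """Return the candidate s maximizing score(s, D) + Lap(GS(score)/epsilon).

    As in the theorem on the slide, this selection is 2*epsilon-differentially
    private (the noisy scores themselves are not released).
    """
    rng = rng or np.random.default_rng()
    noisy = [score(s, D) + rng.laplace(scale=gs_score / epsilon) for s in candidates]
    return candidates[int(np.argmax(noisy))]
```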
Composition Theorems [DMNS06,DL09,DRV10]
Run (ε, δ)-diff. private algorithms A_1, ..., A_k on the same data set.
Weak composition: A_1 ∘ ⋯ ∘ A_k is (kε, kδ)-diff. private
Strong composition (≈): A_1 ∘ ⋯ ∘ A_k is (√k·ε, kδ)-diff. private
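A toy calculation of the privacy budget after k executions, using the approximate √k form from the slide (constants and the extra δ-slack term of the exact strong-composition theorem are suppressed):

```python
import math

def weak_composition(epsilon, delta, k):
    """(k*eps, k*delta): the epsilons and deltas simply add up."""
    return k * epsilon, k * delta

def strong_composition_approx(epsilon, delta, k):
    """The slide's approximate form (sqrt(k)*eps, k*delta)."""
    return math.sqrt(k) * epsilon, k * delta

print(weak_composition(0.1, 1e-6, k=100))           # roughly (10.0, 1e-4)
print(strong_composition_approx(0.1, 1e-6, k=100))  # roughly (1.0, 1e-4)
```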
Recap: (Private) convex risk minimization
Convex Empirical Risk Minimization (ERM): An Example
Linear classifiers in ℝ^p
Domain: feature vector x ∈ ℝ^p, label y ∈ {yellow, red} ≡ {+1, −1}
Data set D = {(x_1, y_1), ..., (x_n, y_n)}, drawn i.i.d. from a distribution τ over (x, y)
Find θ ∈ C (a convex set of constant diameter) that classifies (x, y) ∼ τ
Minimize risk: θ_R = arg min_{θ∈C} E_{(x,y)∼τ}[ y⟨x, θ⟩ ]
ERM: Use L(θ; D) = (1/n) Σ_{i=1}^n y_i⟨x_i, θ⟩ to approximate θ_R
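A small sketch of the empirical loss and its gradient for the linear-classifier example above (names and data are illustrative); the rows are normalized so that per-example gradient norms are bounded by 1, matching the L2/L2-setting introduced below:

```python
import numpy as np

def empirical_loss(theta, X, y):
    # L(theta; D) = (1/n) * sum_i y_i * <x_i, theta>, the linear loss of the example
    return float(np.mean(y * (X @ theta)))

def loss_gradient(theta, X, y):
    # gradient of ell(theta; (x_i, y_i)) = y_i * x_i, averaged over the data set
    return (y[:, None] * X).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i||_2 <= 1, so ||grad ell||_2 <= 1
y = np.sign(rng.normal(size=100))
theta = np.zeros(10)
print(empirical_loss(theta, X, y), loss_gradient(theta, X, y).shape)
```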
Empirical Risk Minimization (ERM) Setup
Convex loss function: ℓ: C × D → ℝ
Data set D = {d_1, ..., d_n}
Loss function: L(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i)
Regularized ERM: θ̂ = arg min_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i) + r(θ), where the regularizer r is used to stop overfitting
Objective: Minimize the excess risk
    E_{d∼τ}[ℓ(θ̂; d)] − min_{θ∈C} E_{d∼τ}[ℓ(θ; d)]
Empirical Risk Minimization (ERM) Setup
Convex loss function: ℓ: C × D → ℝ
Data set D = {d_1, ..., d_n}; loss function L(θ; D)
Empirical risk minimizer: θ̂ = arg min_{θ∈C} (1/n) Σ_{i=1}^n ℓ(θ; d_i)
Recall: Differential privacy + low excess empirical risk ⇒ low excess true risk
Today: Differentially private algorithms that output θ_priv minimizing the excess empirical risk
    L(θ_priv; D) − min_{θ∈C} L(θ; D)
Empirical Risk Minimization (ERM) Setup
Today: Differentially private algorithms that output θ_priv minimizing the excess empirical risk
    L(θ_priv; D) − min_{θ∈C} L(θ; D)
• L2/L2-setting: ||∇ℓ(θ; d)||_2 ≤ 1 and ||C||_2 ≤ 1
  • Algorithm: Private gradient descent
• L1/L∞-setting: ||∇ℓ(θ; d)||_∞ ≤ 1 and ||C||_1 ≤ 1
  • Algorithms: Frank-Wolfe and LASSO
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Private ERM and noisy gradient descent
(for the L2/L2-setting) [Bassily, Smith, T.'14]
Convex empirical risk minimization: L2/L2-setting
Lipschitz continuity:
    ∀ θ_1, θ_2 ∈ C:  ℓ(θ_1; d) − ℓ(θ_2; d) ≤ ||θ_1 − θ_2||_2
Δ-strong convexity (for all t ∈ [0, 1]):
    t·ℓ(θ_1; d) + (1 − t)·ℓ(θ_2; d) − ℓ(t·θ_1 + (1 − t)·θ_2; d) ≥ (Δ/2)·t(1 − t)·||θ_1 − θ_2||_2²
Bounded set C:
• Assume ||C||_2 is bounded by a constant
Why is privacy a concern in ERM?
Computing the median
Data set D = {d_1, ..., d_n}, where each d_i ∈ ℝ
Median: θ* = arg min_{θ∈ℝ} Σ_{i=1}^n |θ − d_i|
The median is itself a data point in D.
Why is privacy a concern in ERM?
Support vector machine: dual formulation
[Figure: separating hyperplane with its support vectors]
The support vectors are essentially data points in D.
Our results (data set size = n, dimension of the set C is p < n)

Lipschitz losses:
• ε-DP: excess empirical risk O(p/(nε)), via exponential sampling (based on [MT07])
• (ε, δ)-DP: O(√(p log²(n/δ))/(nε)), via stochastic gradient descent (formal analysis and improvements over [WM10] and [CSS13])

Lipschitz and Δ-strongly convex losses:
• ε-DP: O(p²/(n²ε)), via "localization" + exponential sampling
• (ε, δ)-DP: O(p log^2.5(n/δ)/(n²Δε)), via stochastic gradient descent
The rest of this part focuses on the Lipschitz, (ε, δ)-DP row of the table: excess empirical risk O(√(p log²(n/δ))/(nε)) via stochastic gradient descent (formal analysis and improvements over [WM10] and [CSS13]).
Private stochastic gradient descent
Private stochastic gradient descent (Priv-SGD)
Loss functions for D: ℓ(θ; d_1), ℓ(θ; d_2), ..., ℓ(θ; d_n); convex set C
1. Choose an arbitrary θ_1 ∈ C
2. For each time step t ∈ [T]:
   a. Sample d uniformly at random from {d_1, ..., d_n}
   b. θ̃ ← θ_t − η_t·(n·∂ℓ(θ_t; d) + b_t), where b_t ∼ N(0, O(T·Δ_{ε,δ})·I_p) and the learning rate is η_t = O(1/√(t·T·p))
   c. Project back onto C: θ_{t+1} ← Π_C(θ̃)
3. Finally, output θ_T
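A sketch of Priv-SGD for the L2/L2-setting, assuming per-example gradients with ||∂ℓ||_2 ≤ 1 and C equal to the unit L2 ball; the noise parameter sigma stands in for the O(√T·Δ_{ε,δ}) calibration on the slide, and all names are illustrative:

```python
import numpy as np

def project_l2_ball(theta, radius=1.0):
    """Euclidean projection onto C = {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def priv_sgd(grad_fn, data, p, T, sigma, eta_fn, rng=None):
    """Noisy projected SGD (sketch).

    grad_fn(theta, d): gradient of ell(theta; d), assumed to satisfy ||.||_2 <= 1
    sigma:             per-step Gaussian noise scale (stands in for the
                       O(sqrt(T) * Delta_{eps,delta}) calibration on the slide)
    eta_fn(t):         learning-rate schedule, e.g. lambda t: 1 / np.sqrt(t * T * p)
    """
    rng = rng or np.random.default_rng()
    n = len(data)
    theta = np.zeros(p)                          # arbitrary theta_1 in C
    for t in range(1, T + 1):
        d = data[rng.integers(n)]                # sample one record uniformly at random
        b = rng.normal(scale=sigma, size=p)      # Gaussian noise b_t
        theta = theta - eta_fn(t) * (n * grad_fn(theta, d) + b)
        theta = project_l2_ball(theta)           # project back onto C
    return theta                                 # theta_T; the analysis uses T = n**2
```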
Private stochastic gradient descent (Priv-SGD)
Privacy guarantee: For T = n², Priv-SGD is (ε, δ)-differentially private.
Key insights [denote the number of iterations by T = n², and ignore δ]:
1. Providing ≈ 1/√T differential privacy per iteration is sufficient.
2. [KRSU10] Sampling ensures ≈ 1/√T differential privacy for the update θ̃ ← θ_t − η_t·(n·∂ℓ(θ_t; d) + b_t).
Private stochastic gradient descent (Priv-SGD)
Utility guarantee (for T = n²):
    L(θ_priv; D) − min_{θ∈C} L(θ; D) = O(√(p log²(n/δ))/(nε))
Key insights:
1. Unbiased estimator of the gradient: sampling d uniformly from {d_1, ..., d_n} gives
       E_{b_t, d}[n·∂ℓ(θ_t; d) + b_t] = n·∂L(θ_t; D)
2. Bounded variance:
       E_{b_t, d}[||n·∂ℓ(θ_t; d) + b_t||_2²] = O(p·n²·log(n/δ)/ε²)
The utility guarantee now follows from [SZ'13].
Private stochastic gradient descent (Priv-SGD)
Running time: Assuming that computing a single gradient ∂ℓ(θ; d) takes time O(p), the running time of Priv-SGD is O(n²·p).
Note: If we computed the true gradient of L instead of sampling, the running time would be O(n³·p).
The utility guarantee of Priv-SGD is optimal.
• Proof via fingerprinting codes [BS96, Tardos03]
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Private ERM and the Frank-Wolfe
(for the L1/L∞-setting) [Talwar, T., Zhang]
Empirical Risk Minimization (ERM) Setup
Today: Differentially private algorithms that output θ_priv minimizing the excess empirical risk
    L(θ_priv; D) − min_{θ∈C} L(θ; D)
• L2/L2-setting: ||∇ℓ(θ; d)||_2 ≤ 1 and ||C||_2 ≤ 1
  • Algorithm: Private gradient descent
• L1/L∞-setting: ||∇ℓ(θ; d)||_∞ ≤ 1 and ||C||_1 ≤ 1 (commonly referred to as the high-dimensional setting)
  • Algorithms: Frank-Wolfe and LASSO
Frank-Wolfe on the L1-ball (a stylized exposition)
1. Pick a corner of the L1-ball (θ_1 ∈ ℝ^p)
2. s ← arg min_{θ ∈ L1-ball} ⟨∇L(θ_t; D), θ⟩   (minimize the linearized loss)
3. θ_{t+1} ← (1 − α)·θ_t + α·s, typically with α ≈ 1/T
Frank-Wolfe on the L1-ball (a stylized exposition)
• s is always a corner of the L1-ball
• The final output θ_T is therefore a convex combination of the corners
• For smooth losses, the convergence rate is ≈ O(1/T) [FW56, Jaggi'13]
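The linear-minimization step over the L1 ball has a closed form: the minimizer of ⟨g, θ⟩ over ||θ||_1 ≤ 1 is a signed standard basis vector. A small sketch (illustrative names):

```python
import numpy as np

def l1_ball_corner(g):
    """argmin over the unit L1 ball of <g, theta>.

    Pick the coordinate of g with the largest magnitude and move to the
    corresponding corner, in the direction opposite to the sign of g.
    """
    i = int(np.argmax(np.abs(g)))
    s = np.zeros_like(g, dtype=float)
    s[i] = -np.sign(g[i]) if g[i] != 0 else 1.0
    return s
```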
Differentially private Frank-Wolfe
Recap: Report-Noisy-Max (a.k.a. Exponential Mechanism)
Objective: Output s ∈ S maximizing f(s, D)
Global sensitivity: GS(f) = max over s ∈ S and all neighbors D, D′ of |f(s, D) − f(s, D′)|
1. a_s ← f(s, D) + Lap(GS(f)/ε) for each s ∈ S
2. Output the s with the highest value of a_s
Theorem (Privacy): The algorithm is 2ε-differentially private
Private Frank-Wolfe (hiding terms in ๐›ฟ)
1. Pick a corner of the L1-ball (θ_1 ∈ ℝ^p)
2. s ← arg min_{θ ∈ L1-ball} ⟨∇L(θ_t; D) + b_t, θ⟩, where each coordinate of b_t ∼ Lap(√T/(nε))
3. θ_{t+1} ← (1 − 1/T)·θ_t + (1/T)·s
Privacy follows from report-noisy-max + strong composition.
Theorem (privacy): The algorithm is (ε, δ)-differentially private
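A sketch of the private Frank-Wolfe loop, assuming ||∇ℓ||_∞ ≤ 1 so that each coordinate of ∇L(θ; D) changes by at most 1/n between neighboring data sets; the Lap(√T/(nε)) scale follows the slide (terms in δ hidden), and the names are illustrative:

```python
import numpy as np

def private_frank_wolfe(grad_L, p, n, T, epsilon, rng=None):
    """Private Frank-Wolfe over the unit L1 ball (sketch, hiding terms in delta).

    grad_L(theta): gradient of the empirical loss L(theta; D) in R^p, assumed to
                   satisfy ||grad ell||_inf <= 1 per example.
    """
    rng = rng or np.random.default_rng()
    theta = np.zeros(p)
    theta[0] = 1.0                                   # start at a corner of the L1 ball
    scale = np.sqrt(T) / (n * epsilon)               # Lap(sqrt(T)/(n*eps)) per coordinate
    for _ in range(T):
        noisy_grad = grad_L(theta) + rng.laplace(scale=scale, size=p)
        i = int(np.argmax(np.abs(noisy_grad)))       # report-noisy-max over the 2p corners
        s = np.zeros(p)
        s[i] = -np.sign(noisy_grad[i])
        theta = (1.0 - 1.0 / T) * theta + (1.0 / T) * s
    return theta                                     # the analysis uses T ~ n**(2/3)
```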
Private Frank-Wolfe (hiding terms in ๐›ฟ)
Theorem (utility): For T ≈ n^{2/3},
    E[L(θ_T; D)] − min_{θ ∈ L1-ball} L(θ; D) = O(log(n·p)/(nε)^{2/3})
The guarantee is tight; optimality is shown via fingerprinting codes [BS96, Tardos03].
The guarantee is meaningful even when p ≫ n.
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Feature selection in high-dimensional regression
Sparse Linear Regression in High-dimensions (p ≫ n)
• Data set: D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ ℝ^p and y_i ∈ ℝ
• Assumption: the data are generated by a noisy linear system
      y_i = ⟨x_i, θ*⟩ + w_i
  (feature vector x_i, parameter vector θ* ∈ ℝ^p, field noise w_i)
• Data normalization: ∀ i ∈ [n], ||x_i||_∞ ≤ 1, and each w_i is sub-Gaussian
Sparse Linear Regression in High-dimensions (p ≫ n)
• Data set: D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ ℝ^p and y_i ∈ ℝ
• Assumption: the data are generated by a noisy linear system
      y_{n×1} = X_{n×p} · θ*_{p×1} + w_{n×1}
  (response vector y, design matrix X, parameter vector θ*, field noise w)
• Sparsity: θ* has s < n non-zero entries
• Bounded norm: for every i in the support, |θ*_i| ∈ (Φ, 1 − Φ) for an arbitrarily small constant Φ
Sparse Linear Regression in High-dimensions (p ≫ n)
Model selection problem: Find the non-zero coordinates (the support) of θ*. This talk: with differential privacy.
Solution: the LASSO estimator [Tibshirani94, EFJT03, Wainwright06, CT07, ZY07, ...]
    θ̂ ∈ arg min_{θ ∈ ℝ^p} (1/(2n))·||y − Xθ||_2² + Λ·||θ||_1
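A sketch of computing the (non-private) LASSO estimator and reading off its support with scikit-learn; the 1/(2n) least-squares scaling matches sklearn.linear_model.Lasso, whose alpha plays the role of Λ (the toy instance and parameter values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_support(X, y, Lam):
    """Support of argmin_theta (1/(2n))*||y - X theta||_2^2 + Lam*||theta||_1."""
    model = Lasso(alpha=Lam, fit_intercept=False).fit(X, y)
    return np.flatnonzero(model.coef_)

# Toy instance: p >> n, s-sparse theta*, Gaussian design as in the stochastic setting below.
rng = np.random.default_rng(0)
n, p, s = 200, 1000, 5
X = rng.normal(scale=0.5, size=(n, p))     # entries ~ N(0, 1/4)
theta_star = np.zeros(p)
theta_star[:s] = 0.5
y = X @ theta_star + 0.1 * rng.normal(size=n)
print(lasso_support(X, y, Lam=0.05))       # ideally recovers {0, ..., s-1}
```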
Consistency of the LASSO Estimator
Consistency conditions* [Wainwright06, ZY07]:
• Γ: support of the underlying parameter vector θ*
• Incoherence: |||X_{Γ^c}^T · X_Γ · (X_Γ^T X_Γ)^{-1}|||_∞ < 1/4
• Restricted strong convexity: λ_min(X_Γ^T X_Γ) = Ω(n)
Theorem*: Under a proper choice of Λ and n = Ω(s log p), the support of the LASSO estimator θ̂ equals the support of θ*.
Stochastic Consistency of the LASSO
Theorem [Wainwright06, ZY07]: If each entry of X is drawn i.i.d. from N(0, 1/4), then the consistency conditions above (incoherence and restricted strong convexity) are satisfied w.h.p.
Notion of Neighboring Data sets
Data set D: design matrix with rows x_1, ..., x_n and response vector with entries y_1, ..., y_n.
Data set D′: obtained from D by replacing one row (x_i, y_i) with (x_i′, y_i′).
D and D′ are neighboring data sets.
Perturbation stability
(a.k.a. zero local sensitivity)
Aside: Local Sensitivity [NRS07]
Data set D = {d_1, ..., d_n} and f: U* → ℝ^p a function on D
Local sensitivity: LS(f, D, 1) = max over D′ with d_H(D, D′) = 1 of ||f(D) − f(D′)||_1
1. E: random vector sampled from Lap(LS(f, D, 1)/ε)^p
2. Output f(D) + E
This is not differentially private.
Part II: we show that local sensitivity is nonetheless a useful tool.
Perturbation Stability
[Figure: a data set D = (d_1, d_2, ..., d_n) and a neighboring data set D′ = (d_1, d_2′, ..., d_n) are both passed through the function f and produce the same output]
Stability of f at D: the output does not change when any one entry of D is changed.
Equivalently, the local sensitivity of f at D is zero.
Distance to Instability Property
• Definition: A function f: U* → ℜ is k-stable at a data set D if for every data set D′ ∈ U* with |D Δ D′| ≤ k, f(D) = f(D′).
• Distance to instability: max{k : f(D) is k-stable}
• Objective: Output f(D) while preserving differential privacy
[Figure: among all data sets, D lies in the stable region, at distance > k from the unstable data sets]
Propose-Test-Release (PTR) framework
[DL09, KRSY11, Smith T.’13]
A Meta-algorithm: Propose-Test-Release (PTR)
1. dist ← max{k : f(D) is k-stable}
2. d̂ist ← dist + Lap(1/ε)   (basic tool: the Laplace mechanism)
3. If d̂ist > log(1/δ)/ε, then return f(D), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private.
Theorem: If f is (2·log(1/δ)/ε)-stable at D, then with probability ≥ 1 − δ the algorithm outputs f(D).
Recap: Propose-Test-Release (PTR) Framework
1. dist ← max{k : f(D) is k-stable}   (to be instantiated with some global-sensitivity-one query)
2. d̂ist ← dist + Lap(1/ε)
3. If d̂ist > log(1/δ)/ε, then return f(D), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private.
Theorem: If f is (2·log(1/δ)/ε)-stable at D, then with probability ≥ 1 − δ the algorithm outputs f(D).
Instantiation of PTR for the LASSO
LASSO: θ̂ ∈ arg min_{θ ∈ ℝ^p} (1/(2n))·||y − Xθ||_2² + Λ·||θ||_1
• Set the function f = support of θ̂
• Issue: for this f, the distance to instability might not be efficiently computable
From [Smith, T.'13]: consistency conditions ⇒ proxy conditions ⇒ perturbation stability
This talk: the proxy conditions are efficiently testable with privacy
Perturbation Stability of the LASSO
LASSO: θ̂ ∈ arg min_{θ ∈ ℝ^p} J(θ; D), where J(θ; D) = (1/(2n))·||y − Xθ||_2² + Λ·||θ||_1
Theorem: The consistency conditions for the LASSO are sufficient for perturbation stability.
Proof sketch:
1. Analyze the Karush-Kuhn-Tucker (KKT) optimality conditions at θ̂.
2. Show that support(θ̂) is stable by exhibiting a ''dual certificate'' on stable instances.
Perturbation Stability of the LASSO
Proof sketch:
(Sub)gradient of the LASSO objective: ∂J_D(θ) = −(1/n)·X^T(y − Xθ) + Λ·∂||θ||_1
• On D: the minimizer θ̂ satisfies 0 ∈ ∂J_D(θ̂)
• On the neighboring D′: the minimizer θ̂′ satisfies 0 ∈ ∂J_{D′}(θ̂′)
Perturbation Stability of the LASSO
Proof sketch (continued):
Argue using the optimality conditions 0 ∈ ∂J_D(θ̂) and 0 ∈ ∂J_{D′}(θ̂′):
1. No zero coordinate of θ̂ becomes non-zero in θ̂′ (uses the mutual incoherence condition)
2. No non-zero coordinate of θ̂ becomes zero in θ̂′ (uses the restricted strong convexity condition)
Perturbation Stability Test for the LASSO
Γ: support of θ̂;  Γ^c: complement of the support of θ̂
Test for the following (the real test is more complex):
• Restricted strong convexity (RSC): the minimum eigenvalue of X_Γ^T X_Γ is Ω(n)
• Strong stability: the absolute values of the coordinates of the gradient of the least-squares loss on Γ^c are ≪ Λ
Geometry of the Stability of LASSO
Intuition: strong convexity ensures supp(θ̂) ⊆ supp(θ̂′)
1. Strong convexity ensures that ||θ̂_Γ − θ̂′_Γ||_∞ is small
2. If |θ̂_Γ(i)| is large for every i, then |θ̂′_Γ(i)| > 0 for every i
3. The consistency conditions imply that |θ̂_Γ(i)| is large for every i
[Figure: LASSO objective along a dimension in Γ and a dimension in Γ^c, with the minimizer θ̂]
Geometry of the Stability of LASSO
Intuition: strong stability ensures that no zero coordinate of θ̂ becomes non-zero in θ̂′
[Figure: LASSO objective along Γ^c, with the ±Λ slopes of the L1 penalty around the minimizer θ̂]
• For the minimizer to move along Γ^c, the perturbation to the gradient of the least-squares loss has to be large
Geometry of the Stability of LASSO
Gradient of the least-squares loss: −X^T(y − Xθ̂), with coordinates a_i split across Γ and Γ^c
• Strong stability: |a_i| ≪ Λ for all i ∈ Γ^c ⇒ every i ∈ Γ^c has a subgradient of zero for LASSO(D′)
Making the Stability Test Private (Simplified)
• Test for restricted strong convexity: g_1(D) > t_1
• Test for strong stability: g_2(D) > t_2
• Issue: the test statistics g_1 and g_2 have sensitivities Δ_1 and Δ_2, so the tests cannot be thresholded directly under differential privacy
• Our solution: the proxy distance
      d = max( min_i (g_i(D) − t_i)/Δ_i + 1, 0 )
  • d has global sensitivity one
  • d is large only when g_1 and g_2 both exceed their thresholds by many multiples of their sensitivities
Private Model Selection with Optimal Sample
Complexity
1. Compute d, the proxy distance built from g_1(D) and g_2(D)
2. d̂ist ← d + Lap(1/ε)
3. If d̂ist > log(1/δ)/ε, then return supp(θ̂), else return ⊥
Theorem: The algorithm is (ε, δ)-differentially private.
Theorem: Under the consistency conditions, log p > α²·s³, and n = Ω(s log p), w.h.p. the support of θ* is output. Here α = log(1/δ)/ε.
This is a nearly optimal sample complexity.
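Putting the pieces together, a sketch of the private model-selection procedure; the statistics g1 (restricted strong convexity) and g2 (strong stability), their thresholds, and their sensitivities are treated as caller-supplied ingredients (the real test on the slide is more complex), and the proxy-distance formula follows the reconstruction above:

```python
import numpy as np

def proxy_distance(stats, thresholds, sensitivities):
    """d = max( min_i (g_i(D) - t_i)/Delta_i + 1, 0 ); global sensitivity one."""
    margins = [(g - t) / s for g, t, s in zip(stats, thresholds, sensitivities)]
    return max(min(margins) + 1.0, 0.0)

def private_lasso_support(lasso_support_fn, D, g1, g2, t1, t2, Delta1, Delta2,
                          epsilon, delta, rng=None):
    """Release supp(theta_hat) via PTR on the proxy distance (sketch)."""
    rng = rng or np.random.default_rng()
    d = proxy_distance([g1(D), g2(D)], [t1, t2], [Delta1, Delta2])
    noisy_d = d + rng.laplace(scale=1.0 / epsilon)
    if noisy_d > np.log(1.0 / delta) / epsilon:
        return lasso_support_fn(D)
    return None   # bottom: refuse to answer
```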
Thesis: Differential privacy ⇒ generalizability
Stable learning ⇒ differential privacy
Part I of this Talk
1. Towards a rigorous notion of statistical data privacy
2. Differential privacy: An overview
3. Generalization guarantee via differential privacy
4. Application: Follow-the-perturbed-leader
Part II of this Talk
1. Recap: Differential privacy and convex risk minimization
2. Private gradient descent and risk minimization in low-dimensions
3. Private Frank-Wolfe and risk minimization in high-dimensions
4. Private feature selection in high-dimensions and the LASSO
Concluding Remarks
• Diff. privacy and robust (stable) machine learning are closely related
• Not in this talk:
• Private machine learning via bootstrapping [Smith T.’13]
• Private non-convex learning via bootstrapping [BDMRTW]
• False discovery rate control via differential privacy [DFHPR14]
Open Questions
• Develop the theory behind private non-convex learning
• Analyze algorithms like expectation maximization and alternating
minimization
• Private learning with time-series data (e.g., auto-regressive models)
• Private matrix completion (the Netflix problem) using Frank-Wolfe