Integrating Differential Privacy with Statistical Theory

Adam Smith
Computer Science & Engineering Department, Penn State
http://www.cse.psu.edu/~asmith
Eagl II: October 4, 2009
Wednesday, October 14, 2009

Privacy in Statistical Databases
[Figure: individuals x1, x2, ..., xn send data to a server/agency A (with local random coins); users (government, researchers, businesses, or a malicious adversary) send queries and receive answers]
Large collections of personal information
• census data
• medical/public health data
Recently:
• social networks
• recommendation systems
• trace data: search records, etc.
• intrusion-detection systems
• larger data sets
• more types of data

Privacy in Statistical Databases
• Two conflicting goals
  Utility: users can extract "global" statistics
  Confidentiality: individual information stays hidden
• How can we define these precisely?
  Variations on this model are studied in
  • Statistics ("statistical disclosure control")
  • Data mining / databases ("privacy-preserving data mining" *)
  • Recently: theoretical CS

Differential Privacy
[Figure: same setting as above: individuals, server/agency A with local random coins, users or a malicious adversary]
• Definition of privacy in statistical databases
  Imposes restrictions on the algorithm A generating the output
• If A satisfies the restrictions, then the output provides privacy no matter what the user/intruder knows ahead of time
• Question: how useful are algorithms that satisfy differential privacy?

Statistics in a Nutshell
[Figure: probability model, data, inference / learning]
• This talk: differentially private algorithms for recovering the model
  Performance = convergence to the "true" model
• Goals: interesting problems + bridging the linguistic gap

This talk: Valid Statistical Inference
• Construct differentially private algorithms with the same asymptotic error as the best non-private algorithm
  Parametric: for any* parametric model, there exists a private, efficient estimator (i.e. minimal variance)
  Nonparametric: for any* distribution on [0,1], there is a private histogram estimator with the same convergence rate as the best (non-private) histogram estimator
• Other recent work along these lines:
  • Dwork, Lei 2009
  • Wasserman, Zhou 2008
  • Chaudhuri, Monteleoni 2008
  • McSherry, Williams 2009
  • ...

Main Idea for both cases
• Add noise to a carefully modified estimator
  There are several ways to design differentially private algorithms
  Adding noise is the simplest
• Prove that the required noise is less than the inherent variability due to sampling
[Figure: original (non-private) estimate and perturbed (private) estimate, both inside the sampling noise (confidence interval)]

Reminder: differential privacy
• Intuition: changes to my data are not noticeable by users
  Output is "independent" of my data

Defining Privacy [DiNi, DwNi, BDMN, DMNS]
[Figure: data set x = (x1, x2, ..., xn) is fed to A (local random coins), which outputs A(x)]
• Data set $x = (x_1, \dots, x_n) \in D^n$
  Domain D can be numbers, categories, tax forms
  Think of x as fixed (not random)
• A = randomized procedure run by the agency
  A(x) is a random variable distributed over possible outputs
  Randomness might come from adding noise, resampling, etc.
• x' is a neighbor of x if they differ in one data point
  [Figure: x = (x1, x2, ..., xn) and x' = (x1, x2', ..., xn) are each fed to A, giving A(x) and A(x')]
  Neighboring databases induce close distributions on outputs
• Definition: A is ε-differentially private if, for all neighbors x, x' and for all subsets S of outputs,
  $\Pr(A(x) \in S) \le e^{\varepsilon} \cdot \Pr(A(x') \in S)$
• This is a condition on the algorithm (process) A
  Saying "this output is safe" doesn't take into account how it was computed
• Semantics: no matter what a user knows ahead of time, the user learns the same things about me whether or not my data is present

Example: Perturbing the Average
[Figure: x1, x2, ..., xn are fed to A (local random coins), which outputs A(x) = x̄ + noise]
• Data points are binary responses $x_i \in \{0, 1\}$
• Server wants to release the average $\bar{x} = \frac{1}{n} \sum_i x_i$
• Claim: can obtain ε-differential privacy with noise $\approx \frac{1}{\varepsilon n}$, i.e. $A(x) \approx \bar{x} \pm \frac{1}{\varepsilon n}$
• Is this a lot?
  If x is a random sample from a large underlying population, then sampling noise $\approx \frac{1}{\sqrt{n}}$
  Statistical error swamps the noise required for privacy
• Why the claim holds:
  Laplace distribution Lap(λ) has density $h(y) \propto e^{-|y|/\lambda}$
  Sliding property: $\frac{h(y)}{h(y+\delta)} \le e^{\delta/\lambda}$
  [Figure: density h(y) and its shift h(y + δ); A(x) = blue curve, A(x') = red curve]
  $\delta = |\bar{x} - \bar{x}'| \le \frac{1}{n} \implies$ blue curve $\le e^{\varepsilon} \cdot$ red curve (taking $\lambda = \frac{1}{\varepsilon n}$)

When Does Noise Not Matter?
• Average: $A(x) = \bar{x} + \mathrm{Lap}\left(\frac{1}{\varepsilon n}\right)$
  Suppose $X_1, X_2, X_3, \dots, X_n$ are i.i.d. random variables
  $\bar{X}$ is a random variable, and $\sqrt{n} \cdot (\bar{X} - \mu) \xrightarrow{D} \text{Normal}$
  $\frac{A(X) - \bar{X}}{\mathrm{StdDev}(\bar{X})} \xrightarrow{P} 0$ if $\varepsilon \gg \frac{1}{\sqrt{n}}$
• No "cost" to privacy:
  A(X) is "as good as" $\bar{X}$ for statistical inference*
[Figure: distributions of X̄ and A(X) plotted on the same axis]

When Does Noise Not Matter?
• Mean example generalizes to other statistics
• Theorem: For any* exponential family, can release "approximately sufficient" statistics
  Sufficient statistics T(X) are sums; add noise $\frac{d}{\varepsilon n}$ for dimension d
  $\frac{A(X) - T(X)}{\mathrm{StdDev}(T(X))} \xrightarrow{P} 0$
• Asymptotic result: indicates that useful analysis is possible
  Requires more sophisticated processing for small n
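
Returning to the average example, the following is a minimal Python sketch of releasing x̄ + Lap(1/(εn)) for binary data. The names `laplace_noise` and `private_average`, and the choice to sample Laplace noise as a difference of two exponentials, are illustrative assumptions and are not taken from the talk.

```python
import random
from typing import Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Lap(scale), density proportional to exp(-|y|/scale).

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_average(bits: Sequence[int], epsilon: float, rng: random.Random) -> float:
    """Release the mean of 0/1 responses plus Lap(1/(epsilon * n)) noise.

    Changing one response moves the mean by at most 1/n, so noise of
    scale 1/(epsilon * n) matches the claim in the slides.
    """
    n = len(bits)
    if n == 0 or epsilon <= 0:
        raise ValueError("need at least one data point and epsilon > 0")
    return sum(bits) / n + laplace_noise(1.0 / (epsilon * n), rng)


if __name__ == "__main__":
    rng = random.Random(0)
    n, epsilon = 10_000, 0.5
    # Hypothetical survey data: each respondent answers 1 with probability 0.3.
    data = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    print("true mean     :", sum(data) / n)
    print("private answer:", private_average(data, epsilon, rng))
    print("noise scale   :", 1.0 / (epsilon * n))
```

With n = 10,000 and ε = 0.5, the Laplace noise has standard deviation √2/(εn) ≈ 0.00028, while the sampling standard deviation √(0.3 · 0.7 / n) is about 0.0046, which illustrates the "statistical error swamps privacy noise" point.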

When Does Noise Not Matter?
• Noise degrades with dimension
  More information ⇒ less privacy
  Research question: Is this necessary?

• Histogram Density Estimation
  Calibrating noise to sensitivity
• Maximum Likelihood Estimator
  Sub-sample and aggregate

Global Sensitivity [DMNS06]
[Figure: individuals x1, x2, ..., xn send data to server/agency A (local random coins), which releases f(x) + noise to the user]
• Intuition: f(x) can be released accurately when f is insensitive to individual entries $x_1, x_2, \dots, x_n$
• Global Sensitivity: $GS_f = \max_{\text{neighbors } x, x'} \|f(x) - f(x')\|_1$
• Example: $GS_{\text{average}} = \frac{1}{n}$
• Theorem: If $A(x) = f(x) + \mathrm{Lap}\left(\frac{GS_f}{\varepsilon}\right)$, then A is ε-differentially private.

Example: Histograms
• Say $x_1, x_2, \dots, x_n$ in [0,1] (more generally, an arbitrary domain D)
  Partition [0,1] into d intervals of equal size (more generally, partition D into d disjoint "bins")
  $f(x) = (n_1, n_2, \dots, n_d)$ where $n_j = \#\{i : x_i \text{ in the } j\text{-th interval (bin)}\}$
  $GS_f = 2$
  Sufficient to add noise Lap(1/ε) to each count (up to the constant factor $GS_f = 2$)
• Independent of the dimension
[Figure: histogram on [0,1] with bins of width 1/d and Lap(1/ε) noise on each bar]
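
The global-sensitivity theorem applied to histograms can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the noise scale is set to GS_f/ε = 2/ε as the theorem prescribes, which differs from the Lap(1/ε) stated on the slide by a constant factor.

```python
import random
from typing import List, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def histogram_counts(points: Sequence[float], d: int) -> List[int]:
    """Count how many points of [0, 1] fall into each of d equal-width bins."""
    counts = [0] * d
    for x in points:
        if not 0.0 <= x <= 1.0:
            raise ValueError("points must lie in [0, 1]")
        j = min(int(x * d), d - 1)  # x == 1.0 is placed in the last bin
        counts[j] += 1
    return counts


def private_histogram(points: Sequence[float], d: int, epsilon: float,
                      rng: random.Random, sensitivity: float = 2.0) -> List[float]:
    """Add independent Lap(sensitivity / epsilon) noise to every bin count.

    Replacing one point lowers one count by 1 and raises another by 1,
    so the L1 global sensitivity is 2 regardless of the number of bins d.
    """
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in histogram_counts(points, d)]


def density_estimate(counts: Sequence[float], n: int) -> List[float]:
    """Turn (possibly noisy) counts into a piecewise-constant density: count / (n * t)."""
    t = 1.0 / len(counts)  # bin width
    return [c / (n * t) for c in counts]
```

The noise scale does not depend on d, which is the "independent of the dimension" point: only the total accumulated noise grows with the number of bins.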

Example: Histograms
• For any smooth density h, if $X_i$ i.i.d. ~ h, the noisy histogram converges to h
  Expected L2 error $O\left(1/\sqrt[3]{n}\right)$ if $n \gg \frac{1}{\varepsilon^3}$
  Same as "best" non-private estimator

Proof
• L2 error: $IMSE = E_{\hat h}\left[\|h - \hat h\|_2^2\right] = E_{\hat h}\left[\int_0^1 (h - \hat h)^2\,dx\right]$
• If bins have fixed width t:
  Theorem [Scott]: $IMSE = \frac{1}{nt} + t^2 R(h') + \frac{1}{n}$
  Minimized for $t = 1/\sqrt[3]{n}$
• Additional squared error due to noise
  $\hat h(x) = \frac{\#\text{pts} \pm (1/\varepsilon)}{nt}$
  Additional error per bin: $t \times \frac{1}{t^2 n^2 \varepsilon^2} = \frac{1}{t n^2 \varepsilon^2}$
  Total additional error: $\underbrace{1/t}_{\#\text{ bins}} \times \underbrace{1/(t n^2 \varepsilon^2)}_{\text{error per bin}} = \frac{1}{t^2 n^2 \varepsilon^2}$
• Total error (by additivity of variance):
  $\frac{1}{nt} + t^2 + \frac{1}{t^2 n^2 \varepsilon^2} = (\text{original error}) \times (1 + o(1))$

Frequency Polygon
• Frequency polygon: linear interpolation through the midpoints of the histogram
• For any smooth density h, if $X_i$ i.i.d. ~ h, the frequency polygon of the noisy histogram converges to h
  Expected L2 error $O(n^{-4/5})$ if $n \gg \frac{1}{\varepsilon^{5/2}}$
  Same as non-private estimator
• Similar, more complicated proof

More detail
• This actually shows that for any given bin width, one can find a noisy estimator that is close to the non-noisy estimator
• Does not address how to choose the exact bin width
  Subject to extensive research
  Common "bandwidth selection" criteria can be approximated privately:
  • compute many different candidates
  • use the exponential mechanism to select the best candidate(s)

• Histogram Density Estimation
  Calibrating noise to sensitivity
• Maximum Likelihood Estimator
  Sub-sample and aggregate

High Global Sensitivity: Learning Mixtures
Example 3: cluster centers
Database entries: points in a metric space.
[Figure: two neighboring point sets x and x' in the plane with their cluster centers marked]
• Global sensitivity of cluster centers is roughly the diameter of the space.
• But intuitively, if the clustering is "good", cluster centers should be insensitive.

What about maximum likelihood estimates?
• Sometimes the MLE is well-behaved, e.g. observed proportion for binomial
• Sometimes we have no idea, e.g. no closed-form expression for most loglinear models
  (loglinear = popular class of statistical models for categorical data)
  Can have arbitrarily bad sensitivity

Sample-and-Aggregate Methodology [NRS]
Intuition: replace f with a less sensitive function $\tilde f$.
$\tilde f(x) = g(f(\text{sample}_1), f(\text{sample}_2), \dots, f(\text{sample}_k))$
[Figure: x is split into blocks $(x_{i_1}, \dots, x_{i_t})$, $(x_{j_1}, \dots, x_{j_t})$, ..., $(x_{k_1}, \dots, x_{k_t})$; f is applied to each block; an aggregation function g combines the results; noise calibrated to the sensitivity of $\tilde f$ is added to give the output]

Example: Efficient Point Estimates
• Given a parametric model $\{f_\theta : \theta \in \Theta\}$
• MLE = $\arg\max_\theta f_\theta(x)$
• Converges to Normal
  $\mathrm{Var}(\text{MLE}) = (I_F(\theta)\, n)^{-1}$
  Can be corrected so that $\mathrm{bias}(\hat\theta) = O(n^{-3/2})$
[Figure 1: Sample-and-aggregate: $\hat\theta$ is applied to each block to give $z_1, z_2, \dots, z_k$; aggregation function g; noise calibrated to the sensitivity of g; output]
• Theorem: If the model is well-behaved, then sample-and-aggregate using $\hat\theta$ gives an efficient estimator if $n = \omega(1/\varepsilon^6)$
• Basic version: average the MLE from different samples
  If the parameter space is bounded, the sensitivity of the average is O(1/k)
  $T^*(x) = \frac{1}{k} \sum_{i=1}^{k} \hat\theta\left(x_{(i-1)t+1}, \dots, x_{it}\right) + \mathrm{Lap}\left(\frac{\Lambda}{k\varepsilon}\right)$
  where Λ is the diameter of the parameter's range, each of the k blocks has t = n/k points, and Lap(λ) is a Laplace random variable with scale parameter λ

Proof idea
• Each MLE ≈ $N\left((k/n)^{3/2},\, k/(I n)\right)$
• Output ≈ Gaussian since each MLE ≈ Gaussian
• Variance of average = (k/n) / k = 1/n
• Bias of average = bias of each term = $(k/n)^{3/2}$
• Total squared error = variance + bias² + E(noise²)
  $= \frac{1}{n} + \left(\frac{k}{n}\right)^3 + \left(\frac{1}{k\varepsilon}\right)^2 = \frac{1}{n}(1 + o(1))$ when $n = \omega(1/\varepsilon^6)$ (set $k = o(n^{2/3})$)
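
The basic sample-and-aggregate estimator (average of per-block estimates plus Laplace noise) can be sketched in Python as below. This is a sketch under stated assumptions rather than the construction from the talk: the function names are hypothetical, each block estimate is clamped to a known range [lo, hi] so that the sensitivity bound Λ/k with Λ = hi - lo actually holds, and leftover points that do not fill a block are dropped.

```python
import random
from typing import Callable, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sample_and_aggregate(data: Sequence[float],
                         estimator: Callable[[Sequence[float]], float],
                         k: int, epsilon: float,
                         lo: float, hi: float,
                         rng: random.Random) -> float:
    """Average `estimator` over k disjoint blocks and add Lap((hi - lo) / (k * epsilon)).

    One changed data point affects exactly one block, so after clamping
    the average moves by at most (hi - lo) / k.
    """
    if k <= 0 or epsilon <= 0 or hi <= lo:
        raise ValueError("need k > 0, epsilon > 0 and hi > lo")
    t = len(data) // k  # block size; assumption: leftover points are ignored
    if t == 0:
        raise ValueError("need at least k data points")
    total = 0.0
    for i in range(k):
        z = estimator(data[i * t:(i + 1) * t])
        total += min(max(z, lo), hi)  # assumption: clamp to the parameter range
    return total / k + laplace_noise((hi - lo) / (k * epsilon), rng)


def bernoulli_mle(block: Sequence[float]) -> float:
    """MLE of a Bernoulli parameter: the observed proportion."""
    return sum(block) / len(block)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(10_000)]
    print(sample_and_aggregate(data, bernoulli_mle, k=100, epsilon=1.0,
                               lo=0.0, hi=1.0, rng=rng))
```

The choice of k trades off the two error terms from the proof idea: more blocks reduce the noise scale Λ/(kε) but increase the bias of each per-block estimate.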

Higher Dimension
• Theorem holds in any fixed dimension
  Actually up to poly(n)
• Higher dimension requires significant extra work
  Need to find an ellipse-shaped "envelope" for the data set that captures a 1 - o(1) fraction of the points in the set...
  ... differentially privately
  Use Dunagan-Vempala '01
• Open: understanding the exact dependence on dimension

Conclusions
• Define privacy in terms of my effect on the output
  Meaningful despite arbitrary external information
  I should participate if I get benefit
• What can we compute with rigorous guarantees?
  This talk: statistical estimators that are "as good" as optimal non-private estimators
  • First step: basic asymptotic theory
  • New aspect to the "curse" of dimensionality
• Future / ongoing work:
  • More sophisticated analyses: linear/logistic regression, kernel density estimates, etc.
  • Multiple analyses (see Blum-Ligett-Roth)