PrivBayes: Private Data Release via Bayesian Networks
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao

Outline
The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges
The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes
Function F: Linear vs. Logarithmic
Experiments

The Problem: Private Data Release

A company or institute holds a sensitive database D. Publishing D directly would expose individuals to adversaries. Instead, the company releases a synthetic database D* with similar statistical properties, so that analysts can still draw accurate inferences while adversaries learn little about any individual. How can we design such a private data release algorithm?

Definition of ε-Differential Privacy
◦ A randomized data release algorithm A satisfies ε-differential privacy if, for any two neighboring datasets D, D′ (differing in one tuple) and for any possible synthetic output D*,
  Pr[A(D) = D*] ≤ exp(ε) · Pr[A(D′) = D*]

Example of neighboring datasets (they differ only in Frank's record):

D                      D′
Name    Has cancer?    Name    Has cancer?
Alice   Yes            Alice   Yes
Bob     No             Bob     No
Chris   Yes            Chris   Yes
Denise  Yes            Denise  Yes
Eric    No             Eric    No
Frank   Yes            Frank   No

A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any single individual. (More details in the Preliminaries part of the paper.)

Goal: design a data release algorithm with a differential privacy guarantee.

Challenges

To build synthetic data, we need to understand the tuple distribution Pr[*] of the sensitive data. The naive pipeline:

sensitive database D → convert → full-dimensional tuple distribution → + noise → noisy distribution → sample → synthetic database D*

Example: a database with 10M tuples, 10 attributes (dimensions), and 20 values per attribute.
◦ Scalability: the full distribution Pr[*] has 20^10 ≈ 10T cells; most of them have non-zero counts after noise injection, so privacy is expensive in computation and storage.
◦ Signal-to-noise: the average information per cell is 10M / 10T = 10^-6, while the average noise magnitude is 10 (for ε = 0.1).

Previous solutions suffer from either the scalability or the signal-to-noise problem.

PrivBayes replaces the full-dimensional distribution with low-dimensional ones:

sensitive database D → approximate → a set of low-dimensional distributions → convert + noise → noisy low-dimensional distributions → sample → synthetic database D*

Advantages of using low-dimensional distributions (illustrated in the sketch below):
◦ easy to compute
◦ small domain → high signal density → robust against noise

But how can we find a set of low-dimensional distributions that approximates the full distribution well?
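Before turning to that question, here is a minimal Python sketch (ours, not from the paper; numpy, using the slide's illustrative sizes) to make the Laplace-mechanism and signal-to-noise points concrete: every histogram cell receives independent Laplace(1/ε) noise, which is negligible next to the large counts of a 400-cell 2-way marginal but swamps the near-empty cells of a 20^10-cell full-dimensional histogram.

```python
import numpy as np

def laplace_mechanism(counts, epsilon, sensitivity=1.0):
    """Add independent Laplace(sensitivity / epsilon) noise to each histogram cell.

    Assumes add/remove-one-tuple neighbors, under which a histogram has
    L1 sensitivity 1 (one tuple changes one cell count by 1).
    """
    return counts + np.random.laplace(0.0, sensitivity / epsilon, size=counts.shape)

rng = np.random.default_rng(0)
n, epsilon = 10_000_000, 0.1

# Low-dimensional marginal: 2 attributes x 20 values = 400 cells, so an
# average cell holds ~25,000 tuples -- far above the noise scale of 10.
marginal = rng.multinomial(n, np.ones(400) / 400).astype(float)
noisy_marginal = laplace_mechanism(marginal, epsilon)
print("marginal: avg relative error per cell =",
      np.abs(noisy_marginal - marginal).mean() / marginal.mean())

# Full-dimensional histogram: 20**10 ~= 10^13 cells leaves ~1e-6 tuples per
# cell on average, versus noise of magnitude ~10 per cell. We simulate a small
# slice of (almost surely empty) cells, since the real table would not fit in memory.
empty_slice = np.zeros(1_000_000)
noisy_slice = laplace_mechanism(empty_slice, epsilon)
print("full-dim slice: avg |noise| per empty cell =", np.abs(noisy_slice).mean())
```

The marginal's relative error lands around 4 × 10^-4 per cell, while every empty full-dimensional cell acquires noise of magnitude ~10, which is the signal-to-noise gap the talk describes.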
The Algorithm: PrivBayes — Bayesian Network

Consider a 5-dimensional database with attributes age, workclass, income, education, title. A Bayesian network approximates the joint distribution by a product of low-dimensional conditionals, e.g. with one parent per attribute:

Pr[*] ≈ Pr[age] · Pr[work | age] · Pr[edu | age] · Pr[title | work] · Pr[income | work]

or, with up to two parents per attribute:

Pr[*] ≈ Pr[age] · Pr[edu | age] · Pr[work | age, edu] · Pr[title | edu, work] · Pr[income | work, title]

The quality of the Bayesian network decides the quality of the approximation.

Details of PrivBayes

STEP 1: Choose a suitable Bayesian network N.
◦ must be done in a differentially private way
STEP 2: Compute the conditional distributions implied by N.
◦ straightforward to do under differential privacy
◦ inject noise via the Laplace mechanism
STEP 3: Generate synthetic data by sampling from N.
◦ post-processing: no further privacy issues

Finding the optimal 1-degree Bayesian network was solved by [Chow-Liu'68]: it is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information I over its edges,

Σ_{(X,Y): edge} I(X, Y),   where I(X, Y) = Σ_{x∈X} Σ_{y∈Y} Pr[x, y] · log( Pr[x, y] / (Pr[x] · Pr[y]) ).

This is equivalent to finding the maximum spanning tree in which the weight of edge (X, Y) is the mutual information I(X, Y).

Example: build a 1-degree BN for the database

Name    A  B  C  D
Alan    0  0  0  0
Bob     0  0  0  0
Cykie   1  1  1  0
David   0  0  0  0
Eric    1  1  0  0
Frank   1  1  0  0
George  0  0  0  0
Helen   1  1  1  0
Ivan    0  0  0  0
Jack    1  1  0  0

◦ Start from a random attribute, say A.
◦ Select the next tree edge by its mutual information: among the candidates A→B (I = 1), A→C (I = 0.4), A→D (I = 0), pick A→B.
◦ Repeat with candidates A→C, A→D, B→C, B→D, again picking the edge with maximum mutual information.
◦ Continue until every attribute joins the tree. DONE!

It is NP-hard to train the optimal k-degree Bayesian network when k > 1 [JMLR'04], and most approximation algorithms are too complicated to be converted into private algorithms. In our paper, we find a way to extend the Chow-Liu (1-degree) solution to higher-degree cases; in this talk we focus on the 1-degree case for simplicity.

Do it under Differential Privacy!
◦ (Non-private) select the edge with maximum I.
◦ (Private) I is data-sensitive, so the best edge is also data-sensitive.

Solution: randomized edge selection with the exponential mechanism. Define a quality function q: Databases × Edges → R that measures how good edge e is as the result of the selection, given database D, and return edge e with probability

Pr[e] ∝ exp( ε · q(D, e) / (2 · Δq) ),   where Δq = max_{D, D′, e} |q(D, e) − q(D′, e)|

(the numerator carries the information, the denominator the noise scale); a sketch of this mechanism follows this section.

Problem solved? NO.
◦ The natural choice is q(edge) = I(edge).
◦ We prove ΔI = Θ(log n / n), where n = |D| (Lemma 1).
◦ With noise scale log n / n in Pr[e] ∝ exp( ε · I(e) / (2 · log n / n) ), the information I(e) is too small for the noise.
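Here is a minimal sketch of the randomized edge selection above (our illustration, not the paper's code): the exponential mechanism samples each candidate edge with probability proportional to exp(ε · q / (2Δq)). The quality scores are the toy mutual-information values from the Chow-Liu example, and we take the sensitivity scale to be log n / n per Lemma 1 (dropping the constant in the Θ).

```python
import numpy as np

def exponential_mechanism(candidates, quality, epsilon, sensitivity, rng):
    """Sample a candidate e with Pr[e] proportional to exp(eps * q(e) / (2 * Dq))."""
    scores = np.array([quality(e) for e in candidates], dtype=float)
    # Shift by the max score before exponentiating, for numerical stability;
    # the shift cancels in the normalization and leaves the distribution intact.
    probs = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

rng = np.random.default_rng(0)
edges = ["A->B", "A->C", "A->D"]                       # first-round candidates
mutual_info = {"A->B": 1.0, "A->C": 0.4, "A->D": 0.0}  # toy scores from the example

n = 10                          # tuples in the toy database
sensitivity_I = np.log(n) / n   # Delta(I) = Theta(log n / n) by Lemma 1
print(exponential_mechanism(edges, mutual_info.get, 1.0, sensitivity_I, rng))
```

With these numbers the exponent ε · I(e) / (2ΔI) stays small, so the selection is noticeably random even for the clearly best edge; swapping in a quality function with sensitivity Θ(1/n) sharpens the exponent by a factor of log n, which is the next section's point.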
Function F: Linear vs. Logarithmic

I and F have a strong positive correlation:

Function   Range (scale of info)   Sensitivity (scale of noise)
I          Θ(1)                    Θ(log n / n)
F          Θ(1)                    Θ(1/n)

IDEA: define a score function F that agrees with I at I's maximum values and interpolates linearly in between, by measuring how far Pr[x, y] is from the nearest maximizer:

F(X, Y) = −(1/2) · min_{Π: optimal} ‖ Pr[X, Y] − Π ‖_1

where Π ranges over the "optimal" distributions over X, Y — those that maximize I(X, Y). Range of F: Θ(1); sensitivity of F: Θ(1/n).

Worked example (2×2 distributions; a numeric check appears in the sketch at the end of this section):
◦ The observed distribution Pr[x, y] with cells (0.5, 0; 0.2, 0.3) has I = 0.4.
◦ The two optimal distributions, (0.5, 0; 0, 0.5) and (0, 0.5; 0.5, 0), both have I = 1 and lie at L1 distances 0.4 and 1.6 from the observed one.
◦ Hence F = −(1/2) · 0.4 = −0.2.

[Scatter plot: F vs. I over random distributions; correlation coefficient r = 0.9472.]

Experiments

We use four datasets in our experiments: Adult, NLTCS, TPC-E, BR2000.

Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about 2^52

[Plots: accuracy on all 2-way marginals and all 3-way marginals; error of 4 classifiers built on the synthetic data, e.g., Adult with Y = gender and Y = education.]

Conclusions

Differential privacy can be applied effectively to data release. Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality function for the exponential mechanism

Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternative (workable) privacy definitions

Thanks!

Related Work
◦ Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS'07] — incurs an exponential running time; only optimized for low-dimensional marginals.
◦ Differentially private publication of sparse data [ICDT'12] — achieves scalability, but does not help with the signal-to-noise problem.
◦ Differentially private spatial decompositions [ICDE'12] — coarsens the histogram to control the number of cells; has some limits, e.g., range queries, ordinal domains.

Characterizing the Maximizers of I
Assume |dom(X)| ≤ |dom(Y)|. A distribution Pr[x, y] maximizes the mutual information between X and Y if and only if
◦ Pr[x] = 1/|dom(X)| for every x ∈ dom(X); and
◦ for each y ∈ dom(Y), there is at most one x ∈ dom(X) with Pr[x, y] > 0.

Why Linear Beats Logarithmic: An Analogy
◦ two score functions of a real x ⟹ log(1 + x) and x
◦ neighboring databases ⟹ x and x + Δx
◦ sensitivity (noise) ⟹ the maximum of the derivative: unbounded for the logarithmic score, 1 for the linear one

Interactive Queries vs. Private Data Release
◦ Interactive setting: for each query f, a differentially private algorithm touches the database and returns a noisy answer O under privacy budget ε. Drawbacks: (1) the risk of a privacy breach accumulates after answering multiple queries; (2) it requires a specific DP algorithm for every particular query.
◦ Private data release: spend the budget ε once to publish synthetic data; users then run their queries f on it directly. Reusability: the sensitive data is accessed only once. Generality: it supports most queries.
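As a numeric check of the F definition and the maximizer characterization above, here is a minimal sketch (ours, not the paper's code; numpy) that evaluates F for a 2×2 joint distribution. By the characterization, an optimal 2×2 distribution must give each row mass 1/2 and place at most one nonzero entry per column, which leaves exactly two candidates, so the minimization is a comparison of two L1 distances.

```python
import numpy as np

def f_score_2x2(p):
    """F for a 2x2 joint distribution p (rows: X, cols: Y).

    The only 2x2 maximizers of I are the diagonal and anti-diagonal
    distributions with mass 0.5 in each nonzero cell, per the
    characterization on the backup slide.
    """
    optimal = [np.array([[0.5, 0.0], [0.0, 0.5]]),
               np.array([[0.0, 0.5], [0.5, 0.0]])]
    return -0.5 * min(float(np.abs(p - pi).sum()) for pi in optimal)

# Worked example from the talk: the L1 distances to the two optimal
# distributions are 0.4 and 1.6, so F = -0.5 * 0.4 = -0.2.
p = np.array([[0.5, 0.0], [0.2, 0.3]])
print(f_score_2x2(p))  # -0.2
```

For higher-degree networks and larger domains the set of maximizers grows, which is where the paper's full treatment of F (and its Θ(1/n) sensitivity bound) takes over.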