PrivBayes: Private Data Release via Bayesian Networks
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao

Outline
The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges
The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes
Function F: Linear vs. Logarithmic
Experiments

The Problem: Private Data Release

A company or institute holds a sensitive database D. Publishing D directly would expose individuals to adversaries. Instead, the company releases a synthetic database D* with similar statistical properties, so that analysts can still draw accurate inferences while adversaries learn little about any individual. How can we design such a private data release algorithm?

Definition of ε-Differential Privacy
◦ A randomized data release algorithm A satisfies ε-differential privacy if, for any two neighboring datasets D, D′ (differing in one tuple) and for any possible synthetic output D*,
  Pr[A(D) = D*] ≤ exp(ε) · Pr[A(D′) = D*]

Example of neighboring datasets (they differ only in Frank's record):

D                      D′
Name    Has cancer?    Name    Has cancer?
Alice   Yes            Alice   Yes
Bob     No             Bob     No
Chris   Yes            Chris   Yes
Denise  Yes            Denise  Yes
Eric    No             Eric    No
Frank   Yes            Frank   No

A general approach to achieving differential privacy is to inject Laplace noise into the output, in order to mask the impact of any single individual. (More details in the Preliminaries part of the paper.)

Goal: design a data release algorithm with a differential privacy guarantee.

Challenges

To build synthetic data, we need to understand the tuple distribution Pr[*] of the sensitive data. The naive pipeline:

sensitive database D → convert → full-dimensional tuple distribution → + noise → noisy distribution → sample → synthetic database D*

Example: a database with 10M tuples, 10 attributes (dimensions), and 20 values per attribute.
◦ Scalability: the full distribution Pr[*] has 20^10 ≈ 10T cells; most of them have non-zero counts after noise injection, so privacy is expensive in computation and storage.
◦ Signal-to-noise: the average information per cell is 10M / 10T = 10^-6, while the average noise magnitude is 10 (for ε = 0.1).

Previous solutions suffer from either the scalability or the signal-to-noise problem.

PrivBayes replaces the full-dimensional distribution with low-dimensional ones:

sensitive database D → approximate → a set of low-dimensional distributions → convert + noise → noisy low-dimensional distributions → sample → synthetic database D*

Advantages of using low-dimensional distributions (illustrated in the sketch below):
◦ easy to compute
◦ small domain → high signal density → robust against noise

But how can we find a set of low-dimensional distributions that approximates the full distribution well?
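Before turning to that question, here is a minimal Python sketch (ours, not from the paper; numpy, using the slide's illustrative sizes) to make the Laplace-mechanism and signal-to-noise points concrete: every histogram cell receives independent Laplace(1/ε) noise, which is negligible next to the large counts of a 400-cell 2-way marginal but swamps the near-empty cells of a 20^10-cell full-dimensional histogram.

```python
import numpy as np

def laplace_mechanism(counts, epsilon, sensitivity=1.0):
    """Add independent Laplace(sensitivity / epsilon) noise to each histogram cell.

    Assumes add/remove-one-tuple neighbors, under which a histogram has
    L1 sensitivity 1 (one tuple changes one cell count by 1).
    """
    return counts + np.random.laplace(0.0, sensitivity / epsilon, size=counts.shape)

rng = np.random.default_rng(0)
n, epsilon = 10_000_000, 0.1

# Low-dimensional marginal: 2 attributes x 20 values = 400 cells, so an
# average cell holds ~25,000 tuples -- far above the noise scale of 10.
marginal = rng.multinomial(n, np.ones(400) / 400).astype(float)
noisy_marginal = laplace_mechanism(marginal, epsilon)
print("marginal: avg relative error per cell =",
      np.abs(noisy_marginal - marginal).mean() / marginal.mean())

# Full-dimensional histogram: 20**10 ~= 10^13 cells leaves ~1e-6 tuples per
# cell on average, versus noise of magnitude ~10 per cell. We simulate a small
# slice of (almost surely empty) cells, since the real table would not fit in memory.
empty_slice = np.zeros(1_000_000)
noisy_slice = laplace_mechanism(empty_slice, epsilon)
print("full-dim slice: avg |noise| per empty cell =", np.abs(noisy_slice).mean())
```

The marginal's relative error lands around 4 × 10^-4 per cell, while every empty full-dimensional cell acquires noise of magnitude ~10, which is the signal-to-noise gap the talk describes.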
The Algorithm: PrivBayes — Bayesian Network

Consider a 5-dimensional database with attributes age, workclass, income, education, title. A Bayesian network approximates the joint distribution by a product of low-dimensional conditionals, e.g. with one parent per attribute:

Pr[*] ≈ Pr[age] · Pr[work | age] · Pr[edu | age] · Pr[title | work] · Pr[income | work]

or, with up to two parents per attribute:

Pr[*] ≈ Pr[age] · Pr[edu | age] · Pr[work | age, edu] · Pr[title | edu, work] · Pr[income | work, title]

The quality of the Bayesian network decides the quality of the approximation.

Details of PrivBayes

STEP 1: Choose a suitable Bayesian network N.
◦ must be done in a differentially private way
STEP 2: Compute the conditional distributions implied by N.
◦ straightforward to do under differential privacy
◦ inject noise via the Laplace mechanism
STEP 3: Generate synthetic data by sampling from N.
◦ post-processing: no further privacy issues

Finding the optimal 1-degree Bayesian network was solved by [Chow-Liu'68]: it is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information I over its edges,

Σ_{(X,Y): edge} I(X, Y),   where I(X, Y) = Σ_{x∈X} Σ_{y∈Y} Pr[x, y] · log( Pr[x, y] / (Pr[x] · Pr[y]) ).

This is equivalent to finding the maximum spanning tree in which the weight of edge (X, Y) is the mutual information I(X, Y).

Example: build a 1-degree BN for the database

Name    A  B  C  D
Alan    0  0  0  0
Bob     0  0  0  0
Cykie   1  1  1  0
David   0  0  0  0
Eric    1  1  0  0
Frank   1  1  0  0
George  0  0  0  0
Helen   1  1  1  0
Ivan    0  0  0  0
Jack    1  1  0  0

◦ Start from a random attribute, say A.
◦ Select the next tree edge by its mutual information: among the candidates A→B (I = 1), A→C (I = 0.4), A→D (I = 0), pick A→B.
◦ Repeat with candidates A→C, A→D, B→C, B→D, again picking the edge with maximum mutual information.
◦ Continue until every attribute joins the tree. DONE!

It is NP-hard to train the optimal k-degree Bayesian network when k > 1 [JMLR'04], and most approximation algorithms are too complicated to be converted into private algorithms. In our paper, we find a way to extend the Chow-Liu (1-degree) solution to higher-degree cases; in this talk we focus on the 1-degree case for simplicity.

Do it under Differential Privacy!
◦ (Non-private) select the edge with maximum I.
◦ (Private) I is data-sensitive, so the best edge is also data-sensitive.

Solution: randomized edge selection with the exponential mechanism. Define a quality function q: Databases × Edges → R that measures how good edge e is as the result of the selection, given database D, and return edge e with probability

Pr[e] ∝ exp( ε · q(D, e) / (2 · Δq) ),   where Δq = max_{D, D′, e} |q(D, e) − q(D′, e)|

(the numerator carries the information, the denominator the noise scale); a sketch of this mechanism follows this section.

Problem solved? NO.
◦ The natural choice is q(edge) = I(edge).
◦ We prove ΔI = Θ(log n / n), where n = |D| (Lemma 1).
◦ With noise scale log n / n in Pr[e] ∝ exp( ε · I(e) / (2 · log n / n) ), the information I(e) is too small for the noise.
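Here is a minimal sketch of the randomized edge selection above (our illustration, not the paper's code): the exponential mechanism samples each candidate edge with probability proportional to exp(ε · q / (2Δq)). The quality scores are the toy mutual-information values from the Chow-Liu example, and we take the sensitivity scale to be log n / n per Lemma 1 (dropping the constant in the Θ).

```python
import numpy as np

def exponential_mechanism(candidates, quality, epsilon, sensitivity, rng):
    """Sample a candidate e with Pr[e] proportional to exp(eps * q(e) / (2 * Dq))."""
    scores = np.array([quality(e) for e in candidates], dtype=float)
    # Shift by the max score before exponentiating, for numerical stability;
    # the shift cancels in the normalization and leaves the distribution intact.
    probs = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

rng = np.random.default_rng(0)
edges = ["A->B", "A->C", "A->D"]                       # first-round candidates
mutual_info = {"A->B": 1.0, "A->C": 0.4, "A->D": 0.0}  # toy scores from the example

n = 10                          # tuples in the toy database
sensitivity_I = np.log(n) / n   # Delta(I) = Theta(log n / n) by Lemma 1
print(exponential_mechanism(edges, mutual_info.get, 1.0, sensitivity_I, rng))
```

With these numbers the exponent ε · I(e) / (2ΔI) stays small, so the selection is noticeably random even for the clearly best edge; swapping in a quality function with sensitivity Θ(1/n) sharpens the exponent by a factor of log n, which is the next section's point.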
Function F: Linear vs. Logarithmic

I and F have a strong positive correlation:

Function   Range (scale of info)   Sensitivity (scale of noise)
I          Θ(1)                    Θ(log n / n)
F          Θ(1)                    Θ(1/n)

IDEA: define a score function F that agrees with I at I's maximum values and interpolates linearly in between, by measuring how far Pr[x, y] is from the nearest maximizer:

F(X, Y) = −(1/2) · min_{Π: optimal} ‖ Pr[X, Y] − Π ‖_1

where Π ranges over the "optimal" distributions over X, Y — those that maximize I(X, Y). Range of F: Θ(1); sensitivity of F: Θ(1/n).

Worked example (2×2 distributions; a numeric check appears in the sketch at the end of this section):
◦ The observed distribution Pr[x, y] with cells (0.5, 0; 0.2, 0.3) has I = 0.4.
◦ The two optimal distributions, (0.5, 0; 0, 0.5) and (0, 0.5; 0.5, 0), both have I = 1 and lie at L1 distances 0.4 and 1.6 from the observed one.
◦ Hence F = −(1/2) · 0.4 = −0.2.

[Scatter plot: F vs. I over random distributions; correlation coefficient r = 0.9472.]

Experiments

We use four datasets in our experiments: Adult, NLTCS, TPC-E, BR2000.

Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about 2^52

[Plots: accuracy on all 2-way marginals and all 3-way marginals; error of 4 classifiers built on the synthetic data, e.g., Adult with Y = gender and Y = education.]

Conclusions

Differential privacy can be applied effectively to data release. Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ a carefully designed linear quality function for the exponential mechanism

Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternative (workable) privacy definitions

Thanks!

Related Work
◦ Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS'07] — incurs an exponential running time; only optimized for low-dimensional marginals.
◦ Differentially private publication of sparse data [ICDT'12] — achieves scalability, but does not help with the signal-to-noise problem.
◦ Differentially private spatial decompositions [ICDE'12] — coarsens the histogram to control the number of cells; has some limits, e.g., range queries, ordinal domains.

Characterizing the Maximizers of I
Assume |dom(X)| ≤ |dom(Y)|. A distribution Pr[x, y] maximizes the mutual information between X and Y if and only if
◦ Pr[x] = 1/|dom(X)| for every x ∈ dom(X); and
◦ for each y ∈ dom(Y), there is at most one x ∈ dom(X) with Pr[x, y] > 0.

Why Linear Beats Logarithmic: An Analogy
◦ two score functions of a real x ⟹ log(1 + x) and x
◦ neighboring databases ⟹ x and x + Δx
◦ sensitivity (noise) ⟹ the maximum of the derivative: unbounded for the logarithmic score, 1 for the linear one

Interactive Queries vs. Private Data Release
◦ Interactive setting: for each query f, a differentially private algorithm touches the database and returns a noisy answer O under privacy budget ε. Drawbacks: (1) the risk of a privacy breach accumulates after answering multiple queries; (2) it requires a specific DP algorithm for every particular query.
◦ Private data release: spend the budget ε once to publish synthetic data; users then run their queries f on it directly. Reusability: the sensitive data is accessed only once. Generality: it supports most queries.
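As a numeric check of the F definition and the maximizer characterization above, here is a minimal sketch (ours, not the paper's code; numpy) that evaluates F for a 2×2 joint distribution. By the characterization, an optimal 2×2 distribution must give each row mass 1/2 and place at most one nonzero entry per column, which leaves exactly two candidates, so the minimization is a comparison of two L1 distances.

```python
import numpy as np

def f_score_2x2(p):
    """F for a 2x2 joint distribution p (rows: X, cols: Y).

    The only 2x2 maximizers of I are the diagonal and anti-diagonal
    distributions with mass 0.5 in each nonzero cell, per the
    characterization on the backup slide.
    """
    optimal = [np.array([[0.5, 0.0], [0.0, 0.5]]),
               np.array([[0.0, 0.5], [0.5, 0.0]])]
    return -0.5 * min(float(np.abs(p - pi).sum()) for pi in optimal)

# Worked example from the talk: the L1 distances to the two optimal
# distributions are 0.4 and 1.6, so F = -0.5 * 0.4 = -0.2.
p = np.array([[0.5, 0.0], [0.2, 0.3]])
print(f_score_2x2(p))  # -0.2
```

For higher-degree networks and larger domains the set of maximizers grows, which is where the paper's full treatment of F (and its Θ(1/n) sensitivity bound) takes over.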