Tuning Privacy-Utility Tradeoffs in Statistical Databases using Policies
Ashwin Machanavajjhala (ashwin@cs.duke.edu)
Collaborators: Daniel Kifer (PSU), Bolin Ding (MSR), Xi He (Duke)
Summer @ Census, 8/15/2013

Overview of the talk
• There is an inherent trade-off between the privacy (confidentiality) of individuals and the utility of statistical analyses over data collected from those individuals.
• Differential privacy has revolutionized how we reason about privacy
  – A convenient tuning knob ε for trading off privacy and utility

Overview of the talk
• However, differential privacy captures only a small part of the privacy-utility trade-off space
  – No Free Lunch Theorem
  – Differentially private mechanisms may not ensure sufficient utility
  – Differentially private mechanisms may not ensure sufficient privacy

Overview of the talk
• I will present a new privacy framework that allows data publishers to trade off privacy for utility more effectively
  – Better control over what to keep secret and who the adversaries are
  – Can ensure more utility than differential privacy in many cases
  – Can ensure privacy where differential privacy fails

Outline
• Background
  – Differential privacy
• No Free Lunch [Kifer-M SIGMOD '11]
  – No "one privacy notion to rule them all"
• Pufferfish Privacy Framework [Kifer-M PODS '12]
  – Navigating the space of privacy definitions
• Blowfish: Practical privacy using policies [ongoing work]

Data Privacy Problem
• A server collects a database DB of records r1, r2, ..., rN from individuals 1 through N.
• Utility: publish useful statistics computed over DB.
• Privacy: no breach about any individual.

Data Privacy in the real world
Application | Data Collector | Third Party (adversary) | Private Information | Function (utility)
Medical | Hospital | Epidemiologist | Disease | Correlation between disease and geography
Genome analysis | Hospital | Statistician / Researcher | Genome | Correlation between genome and disease
Advertising |
Google/FB/Y! | Advertiser | Clicks / Browsing | Number of clicks on an ad by age/region/gender
Social Recommendations | Facebook | Another user | Friend links / profile | Recommend other users or ads to users based on the social network

Many definitions & several attacks
• Definitions: K-Anonymity [Sweeney et al., IJUFKS '02], L-Diversity [Machanavajjhala et al., TKDD '07], T-Closeness [Li et al., ICDE '07], E-Privacy [Machanavajjhala et al., VLDB '09], Differential Privacy [Dwork et al., ICALP '06]
• Attacks: linkage attack, background knowledge attack, minimality/reconstruction attack, de Finetti attack, composition attack

Differential Privacy
• For every pair of inputs D1, D2 that differ in one value, and for every output O, the adversary should not be able to distinguish between D1 and D2 based on O:
  | log ( Pr[A(D1) = O] / Pr[A(D2) = O] ) | ≤ ε   (ε > 0)

Algorithms
• No deterministic algorithm guarantees differential privacy.
• Random sampling does not guarantee differential privacy.
• Randomized response satisfies differential privacy.

Laplace Mechanism
• A researcher poses a query q; the server returns q(D) + η, where the noise η is drawn from the Laplace distribution Lap(λ): h(η) ∝ exp(−|η| / λ), with mean 0 and variance 2λ².
• Privacy depends on the scale parameter λ.

Laplace Mechanism [Dwork et al., TCC 2006]
• Thm: If the sensitivity of the query is S, then the Laplace mechanism with λ = S/ε guarantees ε-differential privacy.
• Sensitivity: the smallest number S(q) such that
for any D,D’ differing in one entry, || q(D) – q(D’) ||1 ≤ S(q) Summer @ Census, 8/15/2013 12 Contingency tables Each tuple takes k=4 different values 2 2 2 8 D Count( Summer @ Census, 8/15/2013 , ) 13 Laplace Mechanism for Contingency Tables 2 + Lap(2/ε) 2 + Lap(2/ε) 2 + Lap(2/ε) 8 + Lap(2/ε) Mean : 8 Variance : 8/ε2 D Sensitivity = 2 Summer @ Census, 8/15/2013 14 Composition Property If algorithms A1, A2, …, Ak use independent randomness and each Ai satisfies εi-differential privacy, resp. Then, outputting all the answers together satisfies differential privacy with ε = ε1 + ε2 + … + ε k Privacy Budget Summer @ Census, 8/15/2013 15 Differential Privacy • Privacy definition that is independent of the attacker’s prior knowledge. • Tolerates many attacks that other definitions are susceptible to. – Avoids composition attacks – Claimed to be tolerant against adversaries with arbitrary background knowledge. • Allows simple, efficient and useful privacy mechanisms – Used in LEHD’s OnTheMap Summer @ Census, 8/15/2013 [M et al ICDE ‘08] 16 Outline • Background – Differential privacy • No Free Lunch [Kifer-M SIGMOD ’11] – No `one privacy notion to rule them all’ • Pufferfish Privacy Framework [Kifer-M PODS’12] – Navigating the space of privacy definitions • Blowfish: Practical privacy using policies Summer @ Census, 8/15/2013 [ongoing work] 17 Differential Privacy & Utility • Differentially private mechanisms may not ensure sufficient utility for many applications. 
• Sparse Data: the integrated mean squared error of the Laplace mechanism can be worse than that of returning a random contingency table, for typical values of ε (around 1). [M et al., PVLDB 2011]
• Social Networks

Differential Privacy & Privacy
• Differentially private algorithms may not limit the ability of an adversary to learn sensitive information about individuals when records in the data are correlated.
• Correlations across individuals occur in many ways:
  – Social networks
  – Data with pre-released constraints
  – Functional dependencies

Laplace Mechanism and Correlations
• Release the noisy counts 2 + Lap(2/ε), 2 + Lap(2/ε), 2 + Lap(2/ε), 8 + Lap(2/ε) when the marginals (4, 10, 4, 10) of D are published as auxiliary information.
• Auxiliary marginals are published for reasons such as:
  1. Legal: the 2002 Supreme Court case Utah v. Evans
  2. Contractual: advertisers must know exact demographics at coarse granularities
• Does the Laplace mechanism still guarantee privacy?

Laplace Mechanism and Correlations
• Using the known marginals, each noisy cell yields an estimate of the same cell count, e.g.: 8 + Lap(2/ε), 8 − Lap(2/ε), 8 − Lap(2/ε), 8 + Lap(2/ε).
• Averaging these k independent estimates gives mean 8 and variance 8/(kε²), so the adversary can reconstruct the table with high precision for large k.

No Free Lunch Theorem
• It is not possible to guarantee any utility in addition to privacy without making assumptions about
  – the data generating distribution [Kifer-M SIGMOD '11]
  – the background knowledge available to an adversary [Dwork-Naor JPC '10]

To sum up ...
• Differential privacy only captures a small part of the privacy-utility trade-off space
  – No Free Lunch Theorem
  – Differentially private mechanisms may not ensure sufficient privacy
  – Differentially private mechanisms may not ensure
sufficient utility

Outline
• Background
  – Differential privacy
• No Free Lunch [Kifer-M SIGMOD '11]
  – No "one privacy notion to rule them all"
• Pufferfish Privacy Framework [Kifer-M PODS '12]
  – Navigating the space of privacy definitions
• Blowfish: Practical privacy using policies [ongoing work]

Pufferfish Framework

Pufferfish Semantics
• What is being kept secret?
• Who are the adversaries?
• How is information disclosure bounded? (similar to ε in differential privacy)

Sensitive Information
• Secrets: a set S of potentially sensitive statements
  – "individual j's record is in the data, and j has cancer"
  – "individual j's record is not in the data"
• Discriminative Pairs Spairs: mutually exclusive pairs of secrets
  – ("Bob is in the table", "Bob is not in the table")
  – ("Bob has cancer", "Bob has diabetes")

Adversaries
• We assume a Bayesian adversary who can be completely characterized by his/her prior information about the data.
  – We do not assume computational limits.
• Data Evolution Scenarios D: the set of all probability distributions that could have generated the data (... think of the adversary's prior).
  – No assumptions: all probability distributions over data instances are possible.
  – I.I.D.: the set of all f such that P(data = {r1, r2, ..., rk}) = f(r1) × f(r2) × ... × f(rk)

Information Disclosure
• A mechanism M satisfies ε-Pufferfish(S, Spairs, D) if, for every output ω, every discriminative pair (s, s') ∈ Spairs, and every prior θ ∈ D under which both s and s' have nonzero probability:
  e^(−ε) ≤ P(M(data) = ω | s, θ) / P(M(data) = ω | s', θ) ≤ e^ε

Pufferfish Semantic Guarantee
• Equivalently, the posterior odds of s vs. s' after seeing the output are within a factor e^ε of the prior odds:
  P(s | ω, θ) / P(s' | ω, θ) ≤ e^ε · P(s | θ) / P(s' | θ)

Pufferfish & Differential Privacy
• Spairs: pairs (six, siy) for all records i and values x ≠ y, where six is the statement "record i takes the value x".
• Attackers should not be able to significantly distinguish between any two values from the domain for any individual record.
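To make the disclosure bound concrete, here is a minimal sketch using randomized response (the mechanism noted earlier as satisfying differential privacy). It checks that, for the discriminative pair ("record i has value 0", "record i has value 1"), the likelihood ratio of every output — and hence the ratio of posterior to prior odds — stays within [e^−ε, e^ε]. The function names and the value ε = ln 3 are illustrative, not from the talk.

```python
import math
import random

def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_true = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_true else 1 - bit

def likelihood(output, secret_bit, epsilon):
    """P(mechanism outputs `output` | the individual's true bit is `secret_bit`)."""
    p_true = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return p_true if output == secret_bit else 1.0 - p_true

# Pufferfish-style check: for every possible output, the likelihood
# ratio between the two secrets is bounded by e^eps.
eps = math.log(3)
ratios = [likelihood(o, 0, eps) / likelihood(o, 1, eps) for o in (0, 1)]
```

With ε = ln 3 the true bit is reported with probability 3/4, and the two ratios are exactly e^ε = 3 and e^−ε = 1/3, i.e., the bound is tight.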
Pufferfish & Differential Privacy
• Data evolution: all θ = [f1, f2, f3, ..., fk], i.e., P(data | θ) = f1(r1) × f2(r2) × ... × fk(rk).
• The adversary's prior may be any distribution that makes records independent.

Pufferfish & Differential Privacy
• Spairs: pairs (six, siy), where six is the statement "record i takes the value x".
• Data evolution: all θ = [f1, f2, f3, ..., fk].
• A mechanism M satisfies differential privacy if and only if it satisfies Pufferfish instantiated using this Spairs and this set of priors {θ}.

Summary of Pufferfish
• A semantic approach to defining privacy
  – Enumerates the information that is secret and the set of adversaries.
  – Bounds the odds ratio of pairs of mutually exclusive secrets.
• Helps understand the assumptions under which privacy is guaranteed
  – Differential privacy is one specific choice of secret pairs and adversaries.
• How should a data publisher use this framework? Algorithms?

Outline
• Background
  – Differential privacy
• No Free Lunch [Kifer-M SIGMOD '11]
  – No "one privacy notion to rule them all"
• Pufferfish Privacy Framework [Kifer-M PODS '12]
  – Navigating the space of privacy definitions
• Blowfish: Practical privacy using policies [ongoing work]

Blowfish Privacy
• A special class of Pufferfish instantiations.
• (Both pufferfish and blowfish are marine fish of the Tetraodontidae family.)

Blowfish Privacy
• A special class of Pufferfish instantiations
• Extends differential privacy using policies:
  – specification of sensitive information, which allows more utility
  – specification of publicly known constraints in the data, which ensures
privacy in correlated data
• Satisfies the composition property

Sensitive Information
• Secrets: a set S of potentially sensitive statements
  – "individual j's record is in the data, and j has cancer"
  – "individual j's record is not in the data"
• Discriminative Pairs: mutually exclusive pairs of secrets
  – ("Bob is in the table", "Bob is not in the table")
  – ("Bob has cancer", "Bob has diabetes")

Sensitive information in Differential Privacy
• Spairs: pairs (six, siy), where six is the statement "record i takes the value x".
• Attackers should not be able to significantly distinguish between any two values from the domain for any individual record.

Other notions of Sensitive Information
• Medical Data
  – OK to infer whether an individual is healthy or not.
  – E.g., ("Bob is healthy", "Bob has diabetes") is not a discriminative pair of secrets for any individual.
• Partitioned sensitive information.

Other notions of Sensitive Information
• Geospatial Data
  – Do not want the attacker to distinguish between "close-by" points in space.
  – May distinguish between "far-away" points.
• Distance-based sensitive information.

Generalization as a graph
• Consider a graph G = (V, E), where V is the set of values that an individual's record can take.
• E encodes the set of discriminative pairs
  – Same for all records.

Blowfish Privacy + "Policy of Secrets"
• A mechanism M satisfies blowfish privacy w.r.t. policy G if
  – for every set of outputs S of the mechanism, and
  – for every pair of datasets D1, D2 that differ in one record, with values x and y s.t.
(x,y) ∈ E:
  Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S]
• For any x and y in the domain, the adversary's ability to distinguish them is bounded in terms of the shortest distance between x and y in G.
• The adversary is allowed to distinguish between x and y that appear in different disconnected components of G.

Algorithms for Blowfish
• Consider an ordered 1-D attribute with domain {x1, x2, x3, ..., xd} — e.g., ranges of age, salary, etc.
• Suppose our policy is: the adversary should not distinguish whether an individual's value is xj or xj+1 (a line graph over the domain).

Algorithms for Blowfish
• Suppose we want to release a histogram privately: the number of individuals C(xi) in each range xi.
• Any differentially private algorithm also satisfies blowfish
  – Can use the Laplace mechanism (with sensitivity 2).

Ordered Mechanism
• We can answer a different set of queries to get a different private estimator for the histogram: the cumulative counts S1, S2, ..., Sd, where Si = C(x1) + ... + C(xi).
• We can answer each Si using the Laplace mechanism ...
• ... but the sensitivity of the whole set of queries is only 1 under this policy: changing one tuple from x2 to x3 changes C(x2) by −1 and C(x3) by +1, and hence changes only S2.
• This is a factor of 2 improvement.

Ordered Mechanism
• In addition, we have the constraint S1 ≤ S2 ≤ ... ≤ Sd.
• However, the noisy counts may not satisfy this constraint.
• We can post-process the noisy counts to ensure this constraint, e.g., by replacing the noisy cumulative counts with the closest non-decreasing sequence (isotonic regression) and then taking differences.

Ordered Mechanism
• This post-processing yields an order-of-magnitude improvement in accuracy for large d.
• By leveraging the weaker sensitive information in the policy, we can provide significantly better utility.
• Extends to more general policy specifications.
• Ordered mechanisms and other blowfish algorithms are being tested on the synthetic data generator for the LODES data product.

Blowfish Privacy & Correlations
• Differentially private mechanisms may not ensure privacy when correlations exist in the data.
• Blowfish can handle correlations in the form of publicly known constraints:
  – known marginal counts in the data
  – other dependencies
• The privacy definition is similar to differential privacy, with a modified notion of neighboring tables.

Other instantiations of Pufferfish
• All blowfish instantiations are extensions of differential privacy using
  – weaker notions of sensitive information
  – knowledge of constraints about the data
  – All blowfish mechanisms satisfy the composition property.
• We can instantiate Pufferfish with other "realistic" adversary notions
  – e.g., only prior distributions that are similar to the expected data distribution
  – Open question: which definitions satisfy the composition property?
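The ordered mechanism described above can be sketched end to end: compute cumulative counts (sensitivity 1 under the line-graph policy), add Laplace(1/ε) noise, enforce monotonicity, and difference back into per-bin counts. This is a minimal illustration, assuming least-squares isotonic regression (pool adjacent violators) as the post-processing step; the histogram values and function names are illustrative, not from the talk.

```python
import numpy as np

def isotonic_pava(values):
    """Closest non-decreasing sequence in least squares (Pool Adjacent Violators)."""
    blocks = []  # each block: [mean, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge adjacent blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return [m for m, n in blocks for _ in range(n)]

def ordered_mechanism(hist, epsilon, rng):
    """Noisy cumulative counts S_i, made consistent, then differenced."""
    prefix = np.cumsum(hist)                    # S_i = C(x1) + ... + C(xi)
    # Sensitivity is 1: moving one tuple between adjacent bins changes one S_i by 1.
    noisy = prefix + rng.laplace(scale=1.0 / epsilon, size=len(prefix))
    mono = isotonic_pava(noisy)                 # enforce S1 <= S2 <= ... <= Sd
    return np.diff(np.concatenate(([0.0], mono)))

rng = np.random.default_rng(7)
est = ordered_mechanism([3, 5, 2, 4], epsilon=1.0, rng=rng)
```

By construction the returned estimates have non-decreasing cumulative sums, which is exactly the constraint the raw noisy counts may violate.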
Summary
• Differential privacy (and the tuning knob ε) is insufficient for trading off privacy for utility in many applications
  – sparse data, social networks, ...
• The Pufferfish framework allows more expressive privacy definitions
  – Can vary the sensitive information, the adversary priors, and ε
• Blowfish shows one way to create more expressive definitions
  – Can provide useful, composable mechanisms
• There is an opportunity to correctly tune privacy by using these expressive privacy frameworks

Thank you

References
[M et al., PVLDB '11] A. Machanavajjhala, A. Korolova, A. Das Sarma, "Personalized Social Recommendations - Accurate or Private?", PVLDB 4(7), 2011
[Kifer-M SIGMOD '11] D. Kifer, A. Machanavajjhala, "No Free Lunch in Data Privacy", SIGMOD 2011
[Kifer-M PODS '12] D. Kifer, A. Machanavajjhala, "A Rigorous and Customizable Framework for Privacy", PODS 2012
[ongoing work] A. Machanavajjhala, B. Ding, X. He, "Blowfish Privacy: Tuning Privacy-Utility Trade-offs using Policies", in preparation