Foundations of Privacy
Lecture 4
Lecturer: Moni Naor

Recap of last week's lecture
• Differential Privacy
• Sensitivity:
  – Global sensitivity of query $q: U^n \to \mathbb{R}^d$: $GS_q = \max_{D,D'} \|q(D) - q(D')\|_1$, over adjacent $D, D'$
  – Local sensitivity of query $q$ at point $D$: $LS_q(D) = \max_{D'} |q(D) - q(D')|$, over $D'$ adjacent to $D$
  – Smooth sensitivity: $S_f^*(X) = \max_Y \{ LS_f(Y) \, e^{-\varepsilon \cdot \mathrm{dist}(X,Y)} \}$
• Histograms
• Differential privacy of median

Histograms
• Inputs $x_1, x_2, \ldots, x_n$ in domain $U$
• Domain $U$ partitioned into $d$ disjoint bins $S_1, \ldots, S_d$
• $q(x_1, x_2, \ldots, x_n) = (n_1, n_2, \ldots, n_d)$ where $n_j = \#\{i : x_i \text{ in the } j\text{-th bin}\}$
• Can view as $d$ queries: $q_i$ counts the number of points in set $S_i$
• For adjacent $D, D'$, only one answer can change, and it can change by 1
• Global sensitivity of the answer vector is 1
• Sufficient to add $\mathrm{Lap}(1/\varepsilon)$ noise to each query, still get $\varepsilon$-privacy

The Exponential Mechanism [McSherry Talwar]
A general mechanism that yields
• Differential privacy
• May yield utility/approximation
• Is defined and evaluated by considering all possible answers
The definition does not yield an efficient way of evaluating it
Application/original motivation: approximate truthfulness of auctions
• Collusion resistance
• Compatibility

Side bar: Digital Goods Auction
• Some product with 0 cost of production
• $n$ individuals with valuations $v_1, v_2, \ldots, v_n$
• Auctioneer wants to maximize profit

Example of the Exponential Mechanism
• Data: $x_i$ = website visited by student $i$ today
• Range: $Y$ = {website names}
• For each name $y$, let $q(y, X) = \#\{i : x_i = y\}$ (size of subset)
Goal: output the most frequently visited site
• Procedure: given $X$, output website $y$ with probability proportional to $e^{\varepsilon q(y,X)}$
• Popular sites are exponentially more likely than rare ones
• Website scores don't change too quickly

Setting
• For input $D \in U^n$ want to find $r \in R$
• Base measure $\mu$ on $R$, usually uniform
• Score function $q': U^n \times R \to \mathbb{R}$ assigns any pair $(D, r)$ a real value
  – Want to maximize it (approximately)
The exponential mechanism
  – Assign output $r \in R$ with probability proportional to $e^{\varepsilon q'(D,r)} \mu(r)$
  – Normalizing factor: $\int_r e^{\varepsilon q'(D,r)} \mu(r)$

The exponential mechanism is private
• Let $\Delta = \max_{D,D',r} |q'(D,r) - q'(D',r)|$, where $D, D'$ are adjacent
Claim: The exponential mechanism yields a $2\varepsilon\Delta$-differentially private solution
• Prob[output = $r$ on input $D$] = $e^{\varepsilon q'(D,r)} \mu(r) / \int_r e^{\varepsilon q'(D,r)} \mu(r)$
• Prob[output = $r$ on input $D'$] = $e^{\varepsilon q'(D',r)} \mu(r) / \int_r e^{\varepsilon q'(D',r)} \mu(r)$
Ratio is bounded by $e^{\varepsilon\Delta} \cdot e^{\varepsilon\Delta}$: the numerators differ by at most a factor $e^{\varepsilon\Delta}$, and so do the normalizing factors

Laplace Noise as Exponential Mechanism
• On query $q: U^n \to \mathbb{R}$ let $q'(D, r) = -|q(D) - r|$
• Prob[noise = $y$] = $e^{-\varepsilon|y|} / \left(2 \int_{y \ge 0} e^{-\varepsilon y}\,dy\right) = \frac{\varepsilon}{2} e^{-\varepsilon|y|}$
Laplace distribution $Y = \mathrm{Lap}(b)$ has density function $\Pr[Y = y] = \frac{1}{2b} e^{-|y|/b}$
[Figure: plot of the Laplace density]

Any Differentially Private Mechanism is an instance of the Exponential Mechanism
• Let $M$ be a differentially private mechanism
• Take $q'(D, r)$ to be $\log \Pr[M(D) = r]$
Remaining issue: accuracy

Private Ranking
• Each element $i \in \{1, \ldots, n\}$ has a real-valued score $S_D(i)$ based on a data set $D$.
• Goal: output $k$ elements with highest scores.
• Privacy
• Data set $D$ consists of $n$ entries in domain $\mathcal{D}$.
  – Differential privacy: protects privacy of entries in $D$.
• Condition: insensitive scores
  – for any element $i$, for any data sets $D, D'$ that differ in one entry: $|S_D(i) - S_{D'}(i)| \le 1$

Approximate ranking
• Let $S_k$ be the $k$-th highest score based on data set $D$.
• An output list is $\alpha$-useful if:
  – Soundness: no element in the output has score less than $S_k - \alpha$
  – Completeness: every element with score greater than $S_k + \alpha$ is in the output.
[Figure: score axis split into three regions: Score $\le S_k - \alpha$; $S_k - \alpha \le$ Score $\le S_k + \alpha$; $S_k + \alpha \le$ Score]

Two Approaches
Each input affects all scores
• Score perturbation
  – Perturb the scores of the elements with noise
  – Pick the top $k$ elements in terms of noisy scores.
  – Fast and simple implementation
  Question: what sort of noise should be added? What sort of guarantees?
• Exponential sampling
  – Run the exponential mechanism $k$ times.
  – More complicated and slower implementation
  What sort of guarantees?
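The exponential mechanism over a finite range, and the "exponential sampling" approach to ranking that runs it k times, can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function names, the uniform base measure, the choice to remove each winner before the next round, and the website counts are all assumptions made for the example. The privacy cost of the k rounds is not analysed here.

```python
import math
import random
from collections import Counter


def exponential_mechanism(scores, eps, rng):
    """Sample a key of `scores` with probability proportional to exp(eps * score).

    Assumes a finite range and a uniform base measure (illustrative choice).
    """
    keys = list(scores)
    top = max(scores.values())
    # Subtracting the maximum score avoids overflow; the common factor
    # cancels in the normalization, so the distribution is unchanged.
    weights = [math.exp(eps * (scores[key] - top)) for key in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def top_k_by_exponential_sampling(scores, k, eps_per_round, rng):
    """Run the exponential mechanism k times, dropping each winner (assumed variant)."""
    remaining = dict(scores)
    chosen = []
    for _ in range(k):
        winner = exponential_mechanism(remaining, eps_per_round, rng)
        chosen.append(winner)
        del remaining[winner]
    return chosen


# Hypothetical data: the website visited by each of 100 students.
visits = ["a.org"] * 60 + ["b.org"] * 25 + ["c.org"] * 10 + ["d.org"] * 5
scores = Counter(visits)  # q(y, X) = #{i : x_i = y}

rng = random.Random(0)
site = exponential_mechanism(scores, eps=0.5, rng=rng)
top2 = top_k_by_exponential_sampling(scores, k=2, eps_per_round=0.5, rng=rng)
print(site, top2)
```

Each counting score changes by at most 1 between adjacent data sets, so Δ = 1 here and a single call matches the 2εΔ claim above with Δ = 1.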
Homework

Exponential Mechanism: Simple Example
(Almost free) private lunch
• Database of $n$ individuals, lunch options $\{1, \ldots, k\}$; each individual likes or dislikes each option (1 or 0)
• Goal: output a lunch option that many like
• For each lunch option $j \in [k]$, $\ell(j)$ is the number of individuals who like $j$
• Exponential mechanism: output $j$ with probability proportional to $e^{\varepsilon \ell(j)}$
• Actual probability: $e^{\varepsilon \ell(j)} / \left(\sum_i e^{\varepsilon \ell(i)}\right)$, where the denominator is the normalizer

Synthetic DB: Output is a DB
[Figure: a sanitizer sits between the database and the user; the user sends query 1, query 2, ... and receives answer 1, answer 2, answer 3]
• Synthetic DB: the output is also a DB (of entries from the same universe $X$); the user reconstructs answers by evaluating the query on the output DB
• Software and people compatible
• Consistent answers

Answering More Queries
Using the exponential mechanism: differential privacy for every set $C$ of counting queries
• Error is $\tilde{O}(n^{2/3} \log|C|)$
Remarkable: hope for rich private analysis of small DBs!
• Quantitative: #queries >> DB size
• Qualitative: the output of the sanitizer is a synthetic DB, i.e. the output is a DB itself

Counting Queries
• Database $D$ of size $n$
• Queries with low sensitivity
• Counting queries: $C$ is a set of predicates $c: U \to \{0,1\}$
• Query: how many participants in $D$ satisfy $c$?
• Relaxed accuracy: answer each query within $\alpha$ additive error w.h.p.
  – Not so bad: error is inherent in statistical analysis anyway
• Assume all queries are given in advance (non-interactive)

Utility and Privacy Can't Always Be Achieved Simultaneously
• Impossibility results for counting queries: a DB with $n$ participants can't have $o(\sqrt{n})$ error, $O(n)$ queries [DiNi, DwMcTa07, DwYe08]
• In all these cases, strong privacy violation: almost the entire DB is compromised
What can we do?

Huge DBs [Dwork Nissim]
DB of size $n$ >> #queries $|C|$:
• Add independent noise to the answer on every query
• Noise per query ~ #queries
• For accuracy, need #queries ≤ $n$
• May be reasonable for huge internet-scale DBs: privacy "for free"
What about smaller DBs?
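The independent-noise approach for huge DBs can be sketched as below. This is an illustrative sketch under stated assumptions: the function names and the toy database are hypothetical, and the noise is taken to be Laplace with scale |C|/ε, which is one standard way to make the noise per query grow with the number of queries. The slides themselves only say "noise per query ~ #queries".

```python
import random


def laplace_noise(scale, rng):
    """Draw from Lap(scale) as the difference of two exponential variables."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def answer_counting_queries(db, predicates, eps, rng):
    """Answer every counting query with independent Laplace noise.

    Assumption: each count has sensitivity 1, and the budget eps is split
    evenly over the queries, so each answer gets Lap(|C| / eps) noise.
    """
    scale = len(predicates) / eps
    answers = []
    for c in predicates:
        true_count = sum(1 for x in db if c(x))
        answers.append(true_count + laplace_noise(scale, rng))
    return answers


# Hypothetical database of ages and three counting predicates.
db = [23, 35, 41, 29, 52, 67, 18, 44]
predicates = [
    lambda age: age >= 40,
    lambda age: age < 30,
    lambda age: age % 2 == 1,
]

noisy = answer_counting_queries(db, predicates, eps=1.0, rng=random.Random(0))
print(noisy)
```

With only 8 entries and 3 queries the noise (scale 3) is comparable to the counts themselves, which illustrates why this approach needs the database to be much larger than the number of queries.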
• DB of size $n$ < #queries $|C|$; impossibility results: can't have $o(\sqrt{n})$ error
• Error must be $\Omega(\sqrt{n})$

The BLR Algorithm [Blum Ligett Roth 08]
Algorithm on input DB $D$:
• Sample from a distribution on DBs of size $m$ ($m < n$)
• DB $F$ gets picked w.p. $\propto e^{-\varepsilon \cdot \mathrm{dist}(F,D)}$
• For DBs $F$ and $D$: $\mathrm{dist}(F,D) = \max_{q \in C} |q(F) - q(D)|$
Intuition: far away DBs get smaller probability

The BLR Algorithm: Idea
• In general: do not use a large DB
  – Sample and answer accordingly
• A DB of size $m$ guarantees hitting each query with sufficient accuracy

The BLR Algorithm: 2ε-Privacy
• For adjacent $D, D'$ and for every $F$: $|\mathrm{dist}(F,D) - \mathrm{dist}(F,D')| \le 1$
• Probability of $F$ by $D$: $e^{-\varepsilon \cdot \mathrm{dist}(F,D)} / \sum_{G \text{ of size } m} e^{-\varepsilon \cdot \mathrm{dist}(G,D)}$
• Probability of $F$ by $D'$: numerator and denominator can each change by an $e^{\varepsilon}$ factor
• Hence $2\varepsilon$-privacy

The BLR Algorithm: Error $\tilde{O}(n^{2/3} \log|C|)$
• There exists $F_{good}$ of size $m = \tilde{O}((n/\alpha)^2 \cdot \log|C|)$ s.t. $\mathrm{dist}(F_{good}, D) \le \alpha$
• $\Pr[F_{good}] \sim e^{-\varepsilon\alpha}$
• For any $F_{bad}$ with dist $\ge 2\alpha$: $\Pr[F_{bad}] \sim e^{-2\varepsilon\alpha}$
• Union bound: $\sum_{\text{bad DB } F_{bad}} \Pr[F_{bad}] \sim |U|^m e^{-2\varepsilon\alpha}$
• For $\alpha = \tilde{O}(n^{2/3} \log|C|)$: $\Pr[F_{good}] \gg \sum \Pr[F_{bad}]$

The BLR Algorithm: Running Time
Generating the distribution by enumeration:
• Need to enumerate every size-$m$ database, where $m = \tilde{O}((n/\alpha)^2 \cdot \log|C|)$
• Running time ≈ $|U|^{\tilde{O}((n/\alpha)^2 \cdot \log|C|)}$

Conclusion
Offline algorithm, $2\varepsilon$-differential privacy for any set $C$ of counting queries
• Error $\alpha$ is $\tilde{O}(n^{2/3} \log|C| / \varepsilon)$
• Super-poly running time: $|U|^{\tilde{O}((n/\alpha)^2 \cdot \log|C|)}$

Can we Efficiently Sanitize?
The good news
• If the universe is small, can sanitize EFFICIENTLY: time poly($|C|$, $|U|$)
The bad news
• Cannot do much better, namely sanitize in time sub-poly($|C|$) AND sub-poly($|U|$)

How Efficiently Can We Sanitize?

                 |C| subpoly    |C| poly
  |U| subpoly        ?              ?
  |U| poly           ?              ? (Good news!)

The Good News: Can Sanitize When Universe is Small
Efficient sanitizer for query set $C$
• DB size $n \ge \tilde{O}(|C|^{o(1)} \log|U|)$
• Error is ~ $n^{2/3}$
• Runtime poly($|C|$, $|U|$)
• Output is a synthetic database
Compare to [Blum Ligett Roth]: $n \ge \tilde{O}(\log|C| \log|U|)$, runtime super-poly($|C|$, $|U|$)

Recursive Algorithm
[Figure: nested query sets $C_0 = C \supseteq C_1 \supseteq C_2 \supseteq \ldots \supseteq C_b$]
• Start with DB $D$ and large query set $C$
• Repeatedly choose a random subset $C_{i+1}$ of $C_i$: shrink the query set by a (small) factor
• End recursion: sanitize $D$ w.r.t. the small query set $C_b$
• Output is good for all queries in the small set $C_{i+1}$
• Extract utility on almost all queries in the large set $C_i$
• Fix the remaining "underprivileged" queries in the large set $C_i$
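The BLR sampling step from earlier in this lecture can be sketched by brute-force enumeration on a tiny universe. This is only an illustration under assumptions that the slides do not spell out: query answers on the small DB F are rescaled by n/m so that they are comparable with counts on D, the database size n is treated as public, and all names and the toy data are hypothetical.

```python
import itertools
import math
import random


def count(predicate, db):
    """Number of entries of db that satisfy the predicate."""
    return sum(1 for x in db if predicate(x))


def blr_distance(f, db, predicates):
    """dist(F, D) = max over queries of |q(F) - q(D)|.

    Assumption: counts on F are rescaled by n/m to the scale of D.
    """
    n, m = len(db), len(f)
    return max(abs(n * count(c, f) / m - count(c, db)) for c in predicates)


def blr_sample(db, universe, predicates, m, eps, rng):
    """Pick a size-m synthetic DB F with probability proportional to exp(-eps * dist(F, D))."""
    # Enumerate every size-m multiset over the universe: this is the
    # super-polynomial step discussed under "Running Time".
    candidates = list(itertools.combinations_with_replacement(universe, m))
    dists = [blr_distance(f, db, predicates) for f in candidates]
    best = min(dists)
    # Shifting by the best distance avoids underflow and cancels on normalization.
    weights = [math.exp(-eps * (d - best)) for d in dists]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical toy instance: universe of 4 values, DB of size n = 12.
universe = [0, 1, 2, 3]
db = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3]
predicates = [
    lambda x: x >= 2,
    lambda x: x % 2 == 0,
    lambda x: x == 0,
]
m = 3

synthetic = blr_sample(db, universe, predicates, m, eps=1.0, rng=random.Random(0))
print(synthetic, blr_distance(synthetic, db, predicates))
```

Even here the candidate list has 20 entries for |U| = 4 and m = 3; the number of size-m multisets grows roughly like |U|^m, which is the running-time obstacle the efficient sanitizer is meant to avoid.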