CS 380S
0x1A Great Papers in Computer Security
Vitaly Shmatikov
http://www.cs.utexas.edu/~shmat/courses/cs380s/

C. Dwork
Differential Privacy
(ICALP 2006 and many other papers)

Basic Setting
[Diagram: users (government, researchers, marketers, …) send query 1, …, query T to a sanitizer San, which holds the database DB = x1, x2, x3, …, xn-1, xn and its own random coins, and returns answer 1, …, answer T.]
slide 3

Examples of Sanitization Methods
Input perturbation
• Add random noise to database, release
Summary statistics
• Means, variances
• Marginal totals
• Regression coefficients
Output perturbation
• Summary statistics with noise
Interactive versions of the above methods
• Auditor decides which queries are OK, type of noise
slide 4

Strawman Definition
Assume x1, …, xn are drawn i.i.d. from an unknown distribution
Candidate definition: sanitization is safe if it only reveals the distribution
Implied approach:
• Learn the distribution
• Release a description of the distribution, or re-sample points from it
This definition is tautological
• Estimate of the distribution depends on the data… why is it safe?
slide 5

Clustering-Based Definitions
Given sanitization S, look at all databases consistent with S
Safe if no predicate is true for all consistent databases
k-anonymity
• Partition D into bins
• Safe if each bin is either empty, or contains at least k elements (see the sketch after slide 10)
Cell bound methods
• Release marginal sums
[Example: a 2×2 table of eye color (brown, blue) vs. hair color (blond, brown) with cell counts 2, 12 / 10, 6, row sums 14, 16 and column sums 12, 18; releasing only the marginal sums constrains the cells to the ranges [0,12], [0,14] / [0,12], [0,16].]
slide 6

Issues with Clustering
Purely syntactic definition of privacy
What adversary does this apply to?
• Does not consider adversaries with side information
• Does not consider probability
• Does not consider adversarial algorithm for making decisions (inference)
slide 7

Classical Intuition for Privacy
"If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius 1977]
• Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database
Similar to semantic security of encryption
• Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext
slide 8

Problems with Classic Intuition
Popular interpretation: prior and posterior views about an individual shouldn't change "too much"
• What if my (incorrect) prior is that every UTCS graduate student has three arms?
How much is "too much"?
• Can't achieve cryptographically small levels of disclosure and keep the data useful
• Adversarial user is supposed to learn unpredictable things about the database
slide 9

Absolute Guarantee Unachievable [Dwork]
Privacy: for some definition of "privacy breach," ∀ distribution on databases, ∀ adversaries A, ∃ A' such that
Pr( A(San) = breach ) – Pr( A'() = breach ) ≤ ε
• For a reasonable "breach", if San(DB) contains information about DB, then some adversary breaks this definition
Example
• Vitaly knows that Chad is 2 inches taller than the average Russian
• DB allows computing the average height of a Russian
• This DB breaks Chad's privacy according to this definition… even if his record is not in the database!
slide 10
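Below is a minimal sketch (Python, not from the slides) of the k-anonymity condition described on slide 6: partition the records into bins by a quasi-identifier and check that every non-empty bin contains at least k records. The record layout and the quasi_id projection are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def is_k_anonymous(records, quasi_id, k):
    """Check the k-anonymity condition from slide 6: every non-empty bin
    (group of records sharing a quasi-identifier) has at least k records."""
    bins = defaultdict(list)
    for r in records:
        bins[quasi_id(r)].append(r)
    return all(len(group) >= k for group in bins.values())

# Hypothetical data: bin by (ZIP prefix, age decade).
records = [
    {"zip": "78712", "age": 24, "diagnosis": "flu"},
    {"zip": "78712", "age": 27, "diagnosis": "cold"},
    {"zip": "78705", "age": 31, "diagnosis": "flu"},
]
quasi_id = lambda r: (r["zip"][:3], r["age"] // 10)
print(is_k_anonymous(records, quasi_id, k=2))  # False: the ("787", 3) bin holds only one record
```

Note this check is purely syntactic, which is exactly the issue raised on slide 7: it says nothing about what an adversary with side information can infer.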
Differential Privacy
[Diagram: as in the basic setting, but the querying user is an adversary A, sending query 1, …, query T to San over DB = x1, x2, x3, …, xn-1, xn (with random coins) and receiving answer 1, …, answer T.]
Absolute guarantees are problematic
• Your privacy can be "breached" (per absolute definition of privacy) even if your data is not in the database
Relative guarantee: "Whatever is learned would be learned regardless of whether or not you participate"
• Dual: Whatever is already known, situation won't get worse
slide 11

Indistinguishability
[Diagram: DB = x1, x2, x3, …, xn-1, xn and DB' = x1, x2, y3, …, xn-1, xn differ in 1 row; each is run through San (queries 1..T, answers 1..T, random coins), producing transcripts S and S'.]
Distance between the two transcript distributions is at most ε
slide 12

Which Distance to Use?
Problem: ε must be large
• Any two databases induce transcripts at distance ≤ nε
• To get utility, need ε > 1/n
Statistical difference 1/n is not meaningful!
• Example: release a random point from the database
  – San(x1, …, xn) = ( j, xj ) for random j
• For every i, changing xi induces statistical difference 1/n
• But some xi is revealed with probability 1
  – Definition is satisfied, but privacy is broken!
slide 13

Formalizing Indistinguishability
[Diagram: the adversary A sees a transcript (query 1, answer 1, …) and must guess whether it came from DB (transcript S) or from DB' (transcript S').]
Definition: San is ε-indistinguishable if ∀ A, ∀ DB, DB' which differ in 1 row, ∀ sets of transcripts S:
p( San(DB) ∈ S ) ≤ (1 ± ε) p( San(DB') ∈ S )
Equivalently, ∀ S:
p( San(DB) = S ) / p( San(DB') = S ) ∈ 1 ± ε
slide 14

Laplacian Mechanism
[Diagram: the user asks the database (holding x1, …, xn) "Tell me f(x)" and receives f(x) + noise.]
Intuition: f(x) can be released accurately when f is insensitive to individual entries x1, …, xn
Global sensitivity GSf = max over neighbors x, x' of ||f(x) – f(x')||1 (the Lipschitz constant of f)
• Example: GSaverage = 1/n for sets of bits
Theorem: f(x) + Lap(GSf/ε) is ε-indistinguishable
• Noise generated from the Laplace distribution
slide 15

Sensitivity with Laplace Noise
slide 16

Differential Privacy: Summary
San gives ε-differential privacy if for all values of DB and Me and all transcripts t:
Pr[ San(DB – Me) = t ] / Pr[ San(DB + Me) = t ] ≤ e^ε ≈ 1 + ε
slide 17
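A minimal sketch (Python, assuming NumPy; the function name laplace_mechanism is mine, not from the slides) of the Laplacian mechanism on slide 15: release f(x) + Lap(GSf/ε). The example uses the average of n bits, whose global sensitivity is 1/n as noted on the slide.

```python
import numpy as np

def laplace_mechanism(x, f, global_sensitivity, eps, rng=None):
    """Release f(x) + Lap(GS_f / eps), as in the theorem on slide 15."""
    rng = rng or np.random.default_rng()
    scale = global_sensitivity / eps  # noise scale grows with sensitivity, shrinks with eps
    return f(x) + rng.laplace(loc=0.0, scale=scale)

# Example from slide 15: the average of n bits has GS_average = 1/n,
# since changing a single bit changes the average by at most 1/n.
bits = np.array([0, 1, 1, 0, 1, 1, 0, 1, 1, 1])
eps = 0.5
noisy_avg = laplace_mechanism(bits, np.mean, global_sensitivity=1.0 / len(bits), eps=eps)
print(f"true average = {bits.mean():.2f}, noisy average = {noisy_avg:.2f}")
```

Smaller ε (a stronger privacy guarantee) or a more sensitive query means a larger noise scale GSf/ε, so the released answer is correspondingly less accurate.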