CS 380S: 0x1A Great Papers in Computer Security
Vitaly Shmatikov
http://www.cs.utexas.edu/~shmat/courses/cs380s/
C. Dwork
Differential Privacy
(ICALP 2006 and many other papers)
Basic Setting
DB=
x1
x2
x3

xn-1
xn
San
¢ ¢¢
query 1
answer 1 Users

(government,
query T researchers,
answer T marketers, …)
random coins
slide 3
Examples of Sanitization Methods
Input perturbation
• Add random noise to the database, then release it
Summary statistics
• Means, variances
• Marginal totals
• Regression coefficients
Output perturbation
• Summary statistics with noise
Interactive versions of the above methods
• Auditor decides which queries are OK and what type of noise to add
(A toy sketch contrasting input and output perturbation follows.)
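A minimal sketch contrasting input perturbation and output perturbation, assuming a toy database of bits; the Gaussian noise scales here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=100).astype(float)  # toy database of bits

# Input perturbation: add noise to the database itself, then release it;
# any statistic is computed from the noisy copy.
noisy_db = data + rng.normal(0.0, 0.5, size=data.shape)
mean_from_noisy_db = noisy_db.mean()

# Output perturbation: compute the summary statistic on the true data,
# then add noise only to the released answer.
noisy_mean = data.mean() + rng.normal(0.0, 0.05)

print(mean_from_noisy_db, noisy_mean)
```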
Strawman Definition
Assume x1, …, xn are drawn i.i.d. from an unknown distribution
Candidate definition: sanitization is safe if it only reveals the distribution
Implied approach (sketched below):
• Learn the distribution
• Release a description of the distribution, or re-sample points from it
This definition is tautological
• The estimate of the distribution depends on the data… why is releasing it safe?
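A sketch of the implied approach for one-dimensional data, assuming numpy; the toy data and the choice of "description" (empirical mean and variance) are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=170.0, scale=10.0, size=1000)   # toy data, "unknown" dist.

# "Learn the distribution": here, just its empirical mean and std. dev.
mu_hat, sigma_hat = x.mean(), x.std()

# "Release description of distribution or re-sample points":
released_description = (mu_hat, sigma_hat)
resampled_points = rng.choice(x, size=100, replace=True)

# The tautology: mu_hat, sigma_hat, and the resampled points are all
# functions of the underlying data -- nothing here argues they are safe.
print(released_description)
```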
Clustering-Based Definitions
Given sanitization S, look at all databases consistent with S
Safe if no "sensitive" predicate is true for all consistent databases
k-anonymity
• Partition D into bins
• Safe if each bin is either empty, or contains at least k elements (a sketch of this check follows)
Cell bound methods
• Release marginal sums instead of the cells themselves
• Example: hair color (rows) vs. eye color (columns)

  Actual counts:
            brown   blue  | total
    blond     2      10   |  12
    brown    12       6   |  18
    total    14      16   |

  Released marginals (implied cell bounds in brackets):
            brown    blue   | total
    blond   [0,12]  [0,12]  |  12
    brown   [0,14]  [0,16]  |  18
    total     14      16    |
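A minimal sketch of the k-anonymity bin condition above; the `is_k_anonymous` helper, the attribute names, and the toy table are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # Bin records by their quasi-identifier values; safe iff every
    # non-empty bin contains at least k records.
    bins = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in bins.values())

# Toy table keyed on the hair/eye colors from the example above.
table = [
    {"hair": "blond", "eyes": "brown"},
    {"hair": "blond", "eyes": "brown"},
    {"hair": "brown", "eyes": "blue"},
]
print(is_k_anonymous(table, ["hair", "eyes"], k=2))  # False: one bin of size 1
```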
Issues with Clustering
Purely syntactic definition of privacy
What adversary does this apply to?
• Does not consider adversaries with side information
• Does not consider probability
• Does not consider the adversary's algorithm for making decisions (inference)
Classical Intuition for Privacy
"If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius 1977]
• Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database
Similar to semantic security of encryption
• Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext
Problems with Classic Intuition
Popular interpretation: prior and posterior views about an individual shouldn't change "too much"
• What if my (incorrect) prior is that every UTCS graduate student has three arms?
How much is "too much"?
• Can't achieve cryptographically small levels of disclosure and keep the data useful
• The adversarial user is supposed to learn unpredictable things about the database
Absolute Guarantee Unachievable [Dwork]
Privacy: for some definition of "privacy breach,"
∀ distribution on databases, ∀ adversaries A, ∃ A′
such that Pr(A(San) = breach) − Pr(A′() = breach) ≤ ε
• For any reasonable "breach", if San(DB) contains information about DB, then some adversary breaks this definition
Example (worked numbers below):
• Vitaly knows that Chad is 2 inches taller than the average Russian
• The DB allows computing the average height of a Russian
• This DB breaks Chad's privacy according to this definition… even if his record is not in the database!
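The arithmetic behind the example, as a sketch; the height values are made up:

```python
# Released statistic: the database reveals the average height of a Russian.
avg_height = 69.0            # hypothetical released average, in inches

# Vitaly's side information, independent of the database:
# Chad is 2 inches taller than the average Russian.
chads_height = avg_height + 2.0

print(chads_height)          # 71.0 -- learned even though Chad is not in DB
```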
Differential Privacy
[Figure: the same pipeline as before: DB = (x1, …, xn) behind San with random coins, answering query 1 … query T, but now the user is an adversary A]
Absolute guarantees are problematic
• Your privacy can be "breached" (per the absolute definition of privacy) even if your data is not in the database
Relative guarantee: "Whatever is learned would be learned regardless of whether or not you participate"
• Dual: whatever is already known, the situation won't get worse
Indistinguishability
[Figure: two databases DB = (x1, x2, x3, …, xn−1, xn) and DB′ = (x1, x2, y3, …, xn−1, xn) differ in 1 row; each is run through San (with random coins), answering query 1 … query T; the distance between the distributions of the resulting transcripts S and S′ is at most ε]
Which Distance to Use?
Problem: ε must be large
• Any two databases induce transcripts at distance ≤ n
• To get utility, need ε > 1/n
Statistical difference 1/n is not meaningful!
• Example: release a random point from the database (sketched below)
  – San(x1, …, xn) = (j, xj) for random j
• For every i, changing xi induces statistical difference 1/n
• But some xi is revealed with probability 1
  – The definition is satisfied, but privacy is broken!
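A sketch of the counterexample mechanism, assuming a toy list of records:

```python
import random

def san(db, rng=random):
    # San(x1, ..., xn) = (j, xj) for a uniformly random index j.
    j = rng.randrange(len(db))
    return (j, db[j])

# Changing a single entry x_i changes the output distribution only on
# the event j = i, i.e., by statistical difference 1/n -- yet every
# invocation reveals some record x_j exactly.
db = ["record_1", "record_2", "record_3", "record_4"]
print(san(db))
```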
Formalizing Indistinguishability
[Figure: adversary A sees query 1 and answer 1 from transcript S (on DB) or from transcript S′ (on DB′); can A tell which database produced them?]
Definition: San is ε-indistinguishable if
∀ A, ∀ DB, DB′ which differ in 1 row, ∀ sets of transcripts S:
  p( San(DB) ∈ S ) ≤ (1 ± ε) p( San(DB′) ∈ S )
Equivalently, ∀ S:
  p( San(DB) = S ) / p( San(DB′) = S ) ∈ 1 ± ε
Laplacian Mechanism
[Figure: user asks the database (x1 … xn) "Tell me f(x)" and receives f(x) + noise]
Intuition: f(x) can be released accurately when f is insensitive to individual entries x1, …, xn
Global sensitivity GSf = max over neighbors x, x′ of ||f(x) − f(x′)||1 (the Lipschitz constant of f)
• Example: GSaverage = 1/n for sets of bits
Theorem: f(x) + Lap(GSf / ε) is ε-indistinguishable
• Noise generated from the Laplace distribution (a sketch follows)
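A minimal sketch of the theorem's mechanism, assuming numpy; the `laplace_mechanism` name and the example query are mine:

```python
import numpy as np

def laplace_mechanism(x, f, global_sensitivity, epsilon, rng=None):
    # Release f(x) + Lap(GS_f / epsilon), per the theorem above.
    rng = rng or np.random.default_rng()
    return f(x) + rng.laplace(loc=0.0, scale=global_sensitivity / epsilon)

# Example from the slide: the average of n bits has GS = 1/n,
# since changing one bit moves the average by at most 1/n.
bits = np.array([0, 1, 1, 0, 1, 1, 0, 1], dtype=float)
n = len(bits)
print(laplace_mechanism(bits, np.mean, global_sensitivity=1.0 / n, epsilon=0.1))
```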
Sensitivity with Laplace Noise
[Figure illustrating Laplace noise scaled to the sensitivity of f]
Differential Privacy: Summary
San gives ε-differential privacy if for all values of DB and Me and all transcripts t:
  Pr[ San(DB − Me) = t ] / Pr[ San(DB + Me) = t ] ≤ e^ε ≈ 1 + ε
(A numeric check of this bound for the Laplacian mechanism follows.)
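A numeric sanity check of the e^ε bound for the Laplacian mechanism on a counting query (sensitivity 1); the specific counts and epsilon are illustrative:

```python
import numpy as np

epsilon = 0.5
scale = 1.0 / epsilon              # Lap(GS_f / epsilon) with GS_f = 1

def laplace_pdf(t, mean):
    return np.exp(-np.abs(t - mean) / scale) / (2 * scale)

f_minus_me = 42.0                  # f(DB - Me): count without my record
f_plus_me = 43.0                   # f(DB + Me): adding me changes f by 1

t = np.linspace(30.0, 55.0, 501)   # sweep over possible transcripts
ratio = laplace_pdf(t, f_minus_me) / laplace_pdf(t, f_plus_me)
print(ratio.max(), np.exp(epsilon))  # max ratio never exceeds e^epsilon
```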