From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Shuchi Chawla, Cynthia Dwork,
Frank McSherry, Adam Smith,
Larry Stockmeyer, Hoeteck Wee
Database Privacy
• Census data – a prototypical example
  – Individuals provide information
  – Census bureau publishes sanitized records
• Privacy is legally mandated; what utility can we achieve?
• Our goal:
  – Define what we mean by preservation of privacy: disguise individual identifying information while preserving macroscopic properties
  – Characterize the trade-off between privacy and utility
  – Develop a “good” sanitizing procedure with theoretical guarantees
Shuchi Chawla
An outline of this talk
• A mathematical formalism
  – What do we mean by privacy?
  – Prior work
  – An abstract model of datasets
  – Isolation; good sanitizations
• A candidate sanitization
  – A brief overview of results
  – A general argument for privacy of n-point datasets
• Open issues and concluding remarks
Privacy… a philosophical viewpoint
• [Ruth Gavison] “… includes protection from being brought to the attention of others …”
  – Matches intuition; inherently desirable
  – Attention invites further loss of privacy
  – Privacy is assured to the extent that one blends in with the crowd
• An appealing definition; it can be converted into a precise mathematical statement!
Database Privacy
• Statistical approaches
  – Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
  – Additionally, erase values that reveal too much
• Query-based approaches
  – Involve a permanent trusted third party
  – Query monitoring: disallow queries that breach privacy
  – Perturbation: add noise to the query output [Dinur Nissim ’03, Dwork Nissim ’04]
• Statistical perturbation + adversarial analysis
  – [Evfimievski et al. ’03] combine statistical techniques with analysis similar to query-based approaches
Everybody’s First Suggestion
• Learn the distribution, then output:
  – a description of the distribution, or
  – samples from the learned distribution
• But we want to reflect facts on the ground
  – Statistically insignificant facts can be important for allocating resources
A geometric view
• Abstraction:
  – Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
  – Points are unlabeled; you are your collection of attributes
  – Distance is everything
• Real Database (RDB) – private: n unlabeled points in d-dimensional space
• Sanitized Database (SDB) – public: n’ new points, possibly in a different space
The adversary or Isolator
• Using the SDB and auxiliary information (AUX), the adversary outputs a point q
• q “isolates” a real point x if it is much closer to x than to x’s neighbors, i.e., if B(q, cδ) contains fewer than T RDB points, where δ = |q − x|
• T-radius of x – the distance to its T-th nearest neighbor
• x is “safe” if δ_x > (T-radius of x)/(c − 1): then B(q, cδ_x) contains x’s entire T-neighborhood
• c – privacy parameter, e.g. c = 4; large T and small c is good
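As a concrete reading of the definitions above, here is a small illustrative check (my own sketch, assuming NumPy and Euclidean distance; `t_radius` and `isolates` are hypothetical helper names, not from the talk):

```python
import numpy as np

def t_radius(rdb, i, T):
    """T-radius of rdb[i]: distance to its T-th nearest neighbor."""
    dists = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
    return dists[T]  # dists[0] == 0 is the point itself

def isolates(q, rdb, T, c):
    """q (c, T)-isolates its nearest RDB point x if the ball
    B(q, c*delta), delta = |q - x|, holds fewer than T RDB points."""
    dists = np.linalg.norm(rdb - q, axis=1)
    delta = dists.min()              # distance to the nearest real point
    return int(np.sum(dists <= c * delta)) < T
```

Per the slide, a point x is then safe from any query q with |q − x| > t_radius(x)/(c − 1).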
A good sanitization
• A sanitizing algorithm compromises privacy if the adversary can considerably increase his probability of isolating a point by looking at its output
• A rigorous (and too ideal) definition:
  ∀ I ∃ I’ such that, w.o.p. over RDB ∈_R D^n, ∀ aux z, ∀ x ∈ RDB:
  | Pr[I(SDB, z) isolates x] − Pr[I’(z) isolates x] | ≤ ε/n
• The definition of ε can be forgiving, say 2^(−Ω(d)), or 1 in a 1000
• Quantification over x: if aux reveals info about some x, the privacy of any other y should still be preserved
• Provides a framework for describing the power of a sanitization method, and hence for comparisons
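Typeset, the definition on this slide reads (my transcription of the extraction-garbled quantifier symbols):

```latex
\forall\, \mathcal{I}\ \exists\, \mathcal{I}' \text{ such that, w.o.p. over }
\mathrm{RDB} \in_R \mathcal{D}^n,\ \forall\, \mathrm{aux}\ z,\ \forall\, x \in \mathrm{RDB}:
\quad
\bigl|\, \Pr[\mathcal{I}(\mathrm{SDB}, z)\ \text{isolates}\ x]
       - \Pr[\mathcal{I}'(z)\ \text{isolates}\ x] \,\bigr| \;\le\; \epsilon/n .
```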
The Sanitizer
• The privacy of x is linked to its T-radius
• Randomly perturb x in proportion to its T-radius: x’ = San(x) ∈_R S(x, T-rad(x))
• Intuition:
  – We are blending x in with its crowd
  – If the number of dimensions d is large, there are “many” pre-images for x’; the adversary cannot conclusively pick any one
  – We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
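A minimal sketch of this perturbation (my own illustration, assuming NumPy; I read S(x, T-rad(x)) as the sphere of radius T-rad(x) around x, and draw a uniform direction by normalizing a Gaussian vector):

```python
import numpy as np

def sanitize(x, rdb, T, rng):
    """Perturb x to a uniformly random point at distance T-rad(x) from x."""
    dists = np.sort(np.linalg.norm(rdb - x, axis=1))
    t_rad = dists[T]                      # dists[0] == 0 is x itself
    g = rng.standard_normal(x.shape)      # isotropic Gaussian direction
    return x + t_rad * g / np.linalg.norm(g)
```

The returned point x’ always lies exactly T-rad(x) away from x, so denser neighborhoods receive smaller perturbations, which is the “blend in with the crowd” intuition above.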
Results on privacy… An overview

Distribution                                     | Num. of points | Revealed to adversary                      | Auxiliary information
Uniform on surface of sphere                     | 2              | Both sanitized points                      | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n              | One sanitized point, all other real points | Distribution, all real points
Gaussian                                         | 2^o(d)         | n sanitized points                         | Distribution
Gaussian                                         | 2^Ω(d)         | n sanitized points                         | Distribution

(Gaussian cases: work in progress)
Results on utility… An overview

Distributional/Worst-case | Objective                                   | Assumptions            | Result
Worst-case                | Find k clusters minimizing largest diameter | –                      | Optimal diameter as well as approximations increase by at most a factor of 3
Distributional            | Find k maximum-likelihood clusters          | Mixture of k Gaussians | Correct clustering with high probability as long as means are pairwise sufficiently far
A special case – one sanitized point
• RDB = {x1, …, xn}
• The adversary is given n−1 real points x2, …, xn and one sanitized point x’1; T = 1; c = 4; “flat” prior
• Recall: x’1 ∈_R S(x1, |x1 − y|), where y is the nearest neighbor of x1
• Main idea:
  – Consider the posterior distribution on x1
  – Show that the adversary cannot isolate a large probability mass under this distribution
A special case – one sanitized point
• Let Z = { p ∈ R^d | p is a legal pre-image for x’1 }
• Let Q = { p | if x1 = p then x1 is isolated by q } = { p : |p − q| ≤ (1/3)|p − x’1| }
• We show that Pr[Q∩Z | x’1] ≤ 2^(−Ω(d)) Pr[Z | x’1]:
  Pr[x1 ∈ Q∩Z | x’1] = (probability mass contributed by Q∩Z)/(contribution from Z) = 2^(1−d)/(1/4)

[Figure: the pre-image region Z around x’1, the isolating region Q, and their overlap Q∩Z, among real points x2, …, x6]
Contribution from Z
• Pr[x1 = p | x’1] ∝ Pr[x’1 | x1 = p] ∝ 1/r^d, where r = |x’1 − p|
  – As r increases, x’1 gets randomized over a larger area, proportional to r^d; hence the inverse dependence
• Pr[x’1 | x1 ∈ S] ∝ ∫_S 1/r^d ∝ solid angle subtended by S at x’1
• Z subtends a solid angle at x’1 equal to at least half a sphere

[Figure: region Z and a patch S at distance r from x’1, among real points x2, …, x6]
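Written out, the flat-prior Bayes step on this slide is:

```latex
\begin{align*}
\Pr[x_1 = p \mid x'_1] &\propto \Pr[x'_1 \mid x_1 = p] \propto r^{-d},
   \qquad r = |x'_1 - p|,\\
\Pr[x_1 \in S \mid x'_1] &\propto \int_S r^{-d}\,\mathrm{d}p
   \;\propto\; \text{solid angle subtended by } S \text{ at } x'_1 .
\end{align*}
```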
Contribution from Q∩Z
• The ellipsoid Q∩Z is roughly as far from x’1 as its longest radius
• The contribution from the ellipsoid is therefore ≤ 2^(−d) × the total solid angle
• Therefore, Pr[x1 ∈ Q∩Z] / Pr[x1 ∈ Z] ≤ 2^(−Ω(d))
The general case… n sanitized points
• The initial intuition is wrong:
  – Privacy of x1 given x’1 and all the other points in the clear does not imply privacy of x1 given x’1 and sanitizations of the others!
  – Sanitization is non-oblivious: other sanitized points reveal information about x if x is their nearest neighbor
• Where we are now:
  – Consider some example of a safe sanitization (not necessarily using perturbations): density regions? Histograms?
  – Relate perturbations to the safe sanitization
  – Uniform distribution with a histogram over fixed-size cells gives an exponentially low probability of isolation
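The “histogram over fixed-size cells” idea mentioned above might look like the following sketch (hypothetical, assuming NumPy; not the construction analyzed in the talk): publish only cell centers and counts over a fixed grid, never individual points.

```python
import numpy as np

def histogram_sanitize(rdb, width):
    """Replace the point set by (cell center, count) pairs on a fixed grid."""
    cells = np.floor(rdb / width).astype(int)          # grid cell of each point
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    centers = (uniq + 0.5) * width                     # publish centers only
    return centers, counts
```

Because every point in a cell maps to the same center, any query point is equidistant from all pre-images in that cell, which is the intuition behind the low isolation probability.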
Future directions
• Extend the privacy argument to other “nice” distributions
  – For what distributions is there no meaningful privacy–utility trade-off?
• Characterize acceptable auxiliary information
  – Think of auxiliary information as an a priori distribution
• The low-dimensional case – is it inherently impossible?
• Discrete-valued attributes
  – Our proofs require a “spread” in all attributes
• Extend the utility argument to other interesting macroscopic properties, e.g. correlations
Conclusions
• A first step towards understanding the privacy–utility trade-off
• A general and rigorous definition of privacy
• A work in progress!
Questions?