Privacy without Noise
Yitao Duan
NetEase Youdao R&D
Beijing China
duan@rd.netease.com
CIKM 2009
The Problem
• Given a database d consisting of records about individual users, we wish to release some statistical information f(d) without compromising any individual’s privacy
Our Results
• The mainstream approach relies on additive noise. We show that noise alone is neither sufficient nor, for some types of queries, necessary for privacy
• The inherent uncertainty associated with
unknown quantities is enough to provide the
same privacy without external noise
• Provide the first mathematical proof, and
conditions, for the widely accepted heuristic that
aggregates are private
Preliminaries
• A database is d ∈ D^n, where D is an arbitrary domain
• Each record d_i is drawn i.i.d. from a public distribution
• Hamming distance H(d, d') between two databases d, d' = the number of entries on which they differ
• Query:
  f(d) = Σ_{i=1}^{n} g(d_i)
  where g(d_i) = [g_1(d_i), …, g_m(d_i)]^T and each g_j(d_i): D → [0, 1]
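A minimal sketch of this query model in Python (the record encoding g and the toy database are illustrative, not from the paper):

    import numpy as np

    # Illustrative per-record encoding: map a record to a vector in [0, 1]^m.
    def g(record, m=3):
        return np.clip(np.asarray(record, dtype=float)[:m], 0.0, 1.0)

    # The sum query f(d) = sum_{i=1}^{n} g(d_i).
    def f(d, m=3):
        return sum(g(di, m) for di in d)

    d = np.random.rand(1000, 3)   # toy database with n = 1000 records
    print(f(d))                   # an m-dimensional aggregate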
The Power of Addition
• A large number of popular algorithms can be run with
addition-only steps
– Linear algorithms: voting and summation; nonlinear algorithms: regression, classification, SVD, PCA, k-means, ID3, EM, etc.
– All algorithms in the statistical query model
– Many other gradient-based numerical algorithms
• Addition-only framework has very efficient private
implementation in cryptography and admits efficient
zero-knowledge proofs (ZKPs)
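As one illustration of the addition-only framework (a sketch under assumed notation, not the paper's protocol), a least-squares gradient step can be written as a sum of per-record terms, so each iteration only needs the aggregate Σ_i g(d_i):

    import numpy as np

    # Per-record contribution g(d_i) for one gradient query; d_i = (x_i, y_i).
    def per_record_gradient(x_i, y_i, w):
        return (x_i @ w - y_i) * x_i

    # One gradient-descent step; only the sum over records is ever used.
    def gradient_step(X, y, w, lr=0.5):
        grad = sum(per_record_gradient(xi, yi, w) for xi, yi in zip(X, y))
        return w - lr * grad / len(y)

    X = np.random.rand(500, 4)
    y = X @ np.array([1.0, -2.0, 0.5, 3.0])
    w = np.zeros(4)
    for _ in range(1000):
        w = gradient_step(X, y, w)
    print(w)   # approaches [1, -2, 0.5, 3]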
Notions of Privacy
• But what do we mean by privacy?
• I don’t know how much you weigh, but I can find out that its leading digit is 2
• Or, I don’t know whether you drink or not, but I can find out that people who drink are happier
• The definition must meet people’s expectations
• And allow for rigorous mathematical reasoning
Differential Privacy
The risk to my privacy should not substantially
increase as a result of participating in a
statistical database:
DEFINITION (DIFFERENTIAL PRIVACY [11, 10]). ∀ε, δ ≥ 0, an algorithm A_f gives (ε, δ)-differential privacy with respect to a query function f if for all S ⊆ Range(A_f) and for all d, d' that differ by one record,
  Pr[A_f(d) ∈ S] ≤ exp(ε) · Pr[A_f(d') ∈ S] + δ
Differential Privacy
A gives (ε, δ)-differential privacy if for all values of DB and Me and all transcripts t:
  Pr_t[ Pr[A(DB + Me) = t] / Pr[A(DB − Me) = t] ≤ e^ε ] ≥ 1 − δ
Differential Privacy
• No perceptible risk is incurred by joining the DB.
• Any info the adversary can obtain, it could obtain without Me (my data).
Differential Privacy w/ Additive Noise
[Diagram: the exact answer f(d) and independently generated noise are fed into a summation Σ to produce the released response.]
The noise must be (1) independently generated for each query and (2) have sufficiently large variance. It can be Laplace, Gaussian, or Binomial.
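A minimal sketch of this pipeline, using the standard Laplace mechanism purely for illustration: since each coordinate g_j(d_i) lies in [0, 1], changing one record moves the sum by at most 1 per coordinate (L1-sensitivity at most m), so per-query Laplace noise with scale m/ε gives ε-differential privacy.

    import numpy as np

    # Release the sum query with fresh additive noise, as in the diagram above.
    def noisy_sum(d, g, m, eps):
        exact = sum(g(di) for di in d)                      # f(d)
        noise = np.random.laplace(scale=m / eps, size=m)    # fresh per query
        return exact + noise

    # Example with the illustrative encoding from the earlier sketch:
    # noisy_sum(d, lambda di: np.clip(di, 0.0, 1.0), m=3, eps=0.5)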
But …
The variance of independent noise can be reduced via averaging.
Fix:
Restrict the total number of queries, i.e., the dimensionality of f (to m)
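To see why the averaging attack above forces such a restriction, a quick numerical illustration (toy numbers, not from the paper):

    import numpy as np

    # Ask the same scalar query 10,000 times; each answer gets fresh Laplace
    # noise with scale 1/eps. Individual answers are noisy, but their average
    # concentrates around the true value.
    true_answer = 42.0
    eps = 0.1
    answers = true_answer + np.random.laplace(scale=1.0 / eps, size=10_000)
    print(answers[:3])      # individual answers are far from 42
    print(answers.mean())   # the average is very close to 42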
But It Is Not Effective
[Diagram: a user record d_j is shared by two databases; each database answers m queries, giving the adversary 2m queries involving d_j.]
If a user profile is shared among multiple databases, one could get
more queries about the user than differential privacy allows
And It Is Not Necessary Either
• There is another source of randomness that can provide protection similar to that of external noise – the data itself
• Some functions are insensitive to small perturbations of the input
Aggregates of n Random Variables
• Probability theory has many established
results on the asymptotic behavior of
aggregates of n random variables
• Under certain conditions, when n is
sufficiently large, the aggregates converge
in some way to a distribution independent
of the individual samples except for a few
distributional parameters.
Central Limit Theorem
THEOREM (MULTIDIMENSIONAL CLT [19]). Let X_1, …, X_n be i.i.d. random vectors in R^m with E[X_i] = μ and finite covariance matrix V. If Y = Σ_{i=1}^{n} X_i, then (Y − nμ)/√n converges in distribution to an m-dimensional Gaussian distribution with zero mean and covariance V.
Differential Privacy: An Individual's Perspective
LEMMA 1. A mechanism is (ε, δ)-private if ∀d ∈ D^n, ∀k ∈ {1, …, n}, each element of g_k is independently perturbed with additive random noise following a Gaussian distribution with variance
  σ² ≥ 2m log(2m/δ) / ε².
2
2
• Privacy is defined in terms of perturbation to an individual's data record
• Existing solutions achieve this via external noise
• Each element is independently perturbed
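For concreteness, the noise level Lemma 1 asks for can be computed directly (a small helper, assuming the bound σ² ≥ 2m·log(2m/δ)/ε² stated above):

    import numpy as np

    # Standard deviation of the per-element Gaussian noise required by Lemma 1.
    def lemma1_sigma(m, eps, delta):
        return np.sqrt(2 * m * np.log(2 * m / delta)) / eps

    print(lemma1_sigma(m=10, eps=0.1, delta=1e-6))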
Sum Queries
• With sum queries, when n is large, for each k the quantity
  Δ_k = Σ_{i=1, i≠k}^{n} g(d_i)
  converges in distribution to a Gaussian (CLT)
• Since f(d) = g(d_k) + Δ_k for every k, can Δ_k provide similar protection?
• Compared against Lemma 1, the difference is that the perturbations to each element of g(d_k) are not independent
Privacy without Noise
[Figure: (a) independent and (b) non-independent Gaussian perturbations of g(d_k) in the 2-dimensional case (axes x_1, x_2). (b) has variance σ² along its minor axis. Note how the perturbation in (b) “envelops” that in (a).]
Main Result
THEOREM (MAIN). Let a_i = g(d_i) ∈ R^m. Assume a_1, …, a_n are i.i.d. with E[a_i] = μ and finite covariance matrix V. The summation is (ε, δ)-private if n is sufficiently large and
  λ_min(V) ≥ 2m log(2m/δ) / ((n − 1)ε²)
where λ_min(V) is the smallest eigenvalue of V.
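A sketch of checking this condition on concrete data: estimate the covariance of the per-record vectors a_i = g(d_i) and compare its smallest eigenvalue with the bound. (The theorem states the condition for the underlying distribution; the sample covariance is only a stand-in here, and the data are toy values.)

    import numpy as np

    # Does the no-noise sum satisfy the main theorem's condition?
    def sum_is_private(A, eps, delta):
        n, m = A.shape                                  # A[i] = g(d_i)
        V_hat = np.atleast_2d(np.cov(A, rowvar=False))
        lam_min = np.linalg.eigvalsh(V_hat).min()
        bound = 2 * m * np.log(2 * m / delta) / ((n - 1) * eps ** 2)
        return lam_min >= bound

    A = np.random.rand(100_000, 5)    # toy per-record vectors in [0, 1]^5
    print(sum_is_private(A, eps=0.5, delta=1e-6))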
A Simple Necessary Condition
• Suppose we have answered k queries
which are all deemed safe
• For the (k+1)-th query to be safe, the
condition is
• Adding a new row is
A Simple Necessary Condition
• We know σ_{k+1} of the existing k-row query matrix is 0
• The new row x_{k+1} must be “large” enough to perturb this singular value away from 0 by a sufficient amount. Using matrix perturbation theory (Weyl’s theorem), we have
Query Auditing
• Instead of perturbing the responses, query
auditing restricts the queries that can
cause privacy breach
[Diagram: the user sends a query q to the auditor, which returns either q(d) or DENY.]
• Must be careful with denials
Simulatability
• Key idea: if the adversary can
simulate the output of the auditor
using only public information, then
nothing more is leaked
• Denials: if the decision to deny or
grant query answers is based on
information that can be
approximated by the adversary,
then the decision itself does not
reveal more info
Simulatable Query Auditing
• Previous schemes achieve simulatability by not using the data
• Using our condition to verify privacy in
online query auditing is simulatable
• Even though the data is used in the
decision making process, the information
is still simulatable
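A sketch of how the adversary could reproduce the relevant quantity using only the public distribution of d_i, with no access to the real data (the sampler and encoding here are illustrative, not from the paper):

    import numpy as np

    # Simulator's estimate of the smallest eigenvalue used in the condition,
    # built only from draws of the public record distribution.
    def simulated_lambda_min(sample_record, g, n_sim=50_000):
        A = np.array([g(sample_record()) for _ in range(n_sim)])
        return np.linalg.eigvalsh(np.cov(A, rowvar=False)).min()

    # Example with an illustrative public distribution and encoding:
    est = simulated_lambda_min(lambda: np.random.rand(5),
                               lambda r: np.clip(r, 0.0, 1.0))
    print(est)   # by the LLN this concentrates around the true value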
Simulatable Query Auditing
The auditor:
The simulator:
Simulatable Query Auditing
• Using the law of large numbers and Weyl's theorem (again!), we can prove that when n is large,
  Pr[ |λ̂ − λ| < ε′ ] ≥ 1 − δ′
  for any ε′ > 0, δ′ > 0
Issue of Shared Records
• We are not totally immune to this
vulnerability, but our privacy condition is
actually stronger than simply restricting the
number of queries, even though we do not
add noise
• An adversary gets less information about
individual records from the same number
of queries
More info: duan@rd.netease.com
Full version of the paper: http://bid.berkeley.edu/projects/p4p/papers/pwn-full.pdf