Foundations of Privacy
Lecture 4
Lecturer: Moni Naor
Recap of last week’s lecture
• Differential Privacy
• Sensitivity:
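– Global sensitivity of query $q: U^n \to \mathbb{R}^d$
$GS_q = \max_{D,D'} \|q(D) - q(D')\|_1$, over adjacent $D, D'$
– Local sensitivity of query $q$ at point $D$
$LS_q(D) = \max_{D'} |q(D) - q(D')|$, over $D'$ adjacent to $D$
– Smooth sensitivity
$S^*_f(X) = \max_Y \{ LS_f(Y) \, e^{-\epsilon \cdot \mathrm{dist}(X,Y)} \}$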
• Histograms
• Differential privacy of median
Histograms
• Inputs x1, x2, ..., xn in domain U
Domain U partitioned into d disjoint bins S1,…,Sd
q(x1, x2, ..., xn) = (n1, n2, ..., nd) where
nj = #{i : xi in j-th bin}
Can view as d queries: $q_i$ counts the number of points in set $S_i$
For adjacent D, D', only one answer can change, and it can change by at most 1
Global sensitivity of answer vector is 1
Sufficient to add Lap(1/ε) noise to each
query, still get ε-privacy
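The histogram release above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the helper names `laplace` and `private_histogram`, the age data, and the bin labels are all hypothetical, and Laplace noise is drawn as the difference of two exponential samples.

```python
import random

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_histogram(data, bins, epsilon, rng=None):
    """Return noisy bin counts.

    `bins` maps a bin label to a membership predicate; the bins are
    assumed to be disjoint, as on the slide.
    """
    if rng is None:
        rng = random.Random()
    counts = {label: sum(1 for x in data if member(x))
              for label, member in bins.items()}
    # Add independent Lap(1/epsilon) noise to each bin count.
    return {label: n + laplace(1.0 / epsilon, rng)
            for label, n in counts.items()}

# Hypothetical example data: ages split into three disjoint bins.
ages = [23, 35, 41, 19, 52, 67, 30, 28]
bins = {
    "under 30": lambda a: a < 30,
    "30 to 49": lambda a: 30 <= a < 50,
    "50 plus": lambda a: a >= 50,
}
noisy = private_histogram(ages, bins, epsilon=0.5, rng=random.Random(0))
```

The noise scale does not grow with the number of bins d, which is the point of the slide: one individual only touches one bin.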
The Exponential Mechanism
[McSherry Talwar]
A general mechanism that yields
• Differential privacy
• May yield utility/approximation
• Is defined and evaluated by considering all possible answers
The definition does not yield an efficient way of evaluating it
Application/original motivation:
Approximate truthfulness of auctions
• Collusion resistance
• Compatibility
Side bar: Digital Goods Auction
• Some product with 0 cost of production
• n individuals with valuation v1, v2, … vn
• Auctioneer wants to maximize profit
Example of the Exponential Mechanism
• Data: xi = website visited by student i today
• Range: Y = {website names}
• For each name y, let q(y, X) = #{i : xi = y}, the size of the subset of students who visited y
Goal: output the most frequently visited site
• Procedure: Given X, output website y with probability proportional to $e^{\epsilon q(y,X)}$
• Popular sites are exponentially more likely than rare ones
• Website scores don't change too quickly: changing one student's entry changes each q(y, X) by at most 1
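As a rough sketch of this procedure, the following Python samples a site with probability proportional to exp(ε · count). The function name `most_visited_private` and the site names are made up for illustration. Subtracting the largest score before exponentiating is only a numerical-stability choice and leaves the output distribution unchanged.

```python
import math
import random
from collections import Counter

def most_visited_private(visits, sites, epsilon, rng=None):
    """Pick a site with probability proportional to exp(epsilon * #visits)."""
    if rng is None:
        rng = random.Random()
    counts = Counter(visits)                 # q(y, X) for every visited site y
    scores = [epsilon * counts[s] for s in sites]
    top = max(scores)
    # Shift by the max score so the exponentials cannot overflow.
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(sites, weights=weights, k=1)[0]

# Hypothetical data: 50 students visited site "a", 5 visited "b", nobody "c".
sites = ["a", "b", "c"]
visits = ["a"] * 50 + ["b"] * 5
picked = most_visited_private(visits, sites, epsilon=1.0, rng=random.Random(0))
```

With these counts the weight of "a" exceeds that of "b" by a factor of e^45, so the sampled site is "a" in all but a vanishing fraction of runs.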
Setting
• For input $D \in U^n$, want to find $r \in R$
• Base measure $\mu$ on $R$, usually uniform
• Score function $q': U^n \times R \to \mathbb{R}$
assigns any pair $(D,r)$ a real value
– Want to maximize it (approximately)
The exponential mechanism
– Assign output $r \in R$ with probability proportional to
$e^{\epsilon q'(D,r)} \mu(r)$
Normalizing factor: $\sum_{r} e^{\epsilon q'(D,r)} \mu(r)$
The exponential mechanism is private
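• Let $\Delta = \max_{D,D',r} |q'(D,r) - q'(D',r)|$, where the maximum is over adjacent $D, D'$
Claim: The exponential mechanism yields a
$2 \epsilon \Delta$-differentially private solution
• Prob[output = r on input D]
= $e^{\epsilon q'(D,r)} \mu(r) / \sum_{r'} e^{\epsilon q'(D,r')} \mu(r')$
• Prob[output = r on input D']
= $e^{\epsilon q'(D',r)} \mu(r) / \sum_{r'} e^{\epsilon q'(D',r')} \mu(r')$
The ratio is bounded by $e^{\epsilon \Delta} \cdot e^{\epsilon \Delta} = e^{2 \epsilon \Delta}$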
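The bound can be spelled out in one chain of inequalities. Here ε is the privacy parameter in the exponent, Δ is the largest change in score between adjacent inputs, μ is the base measure, and r' is a summation index introduced only for this derivation. The first factor compares the two numerators and the second compares the two normalizing sums term by term.

```latex
\frac{\Pr[\text{output} = r \mid D]}{\Pr[\text{output} = r \mid D']}
  = \frac{e^{\epsilon q'(D,r)}}{e^{\epsilon q'(D',r)}}
    \cdot
    \frac{\sum_{r'} e^{\epsilon q'(D',r')} \mu(r')}{\sum_{r'} e^{\epsilon q'(D,r')} \mu(r')}
  \le e^{\epsilon \Delta} \cdot e^{\epsilon \Delta}
  = e^{2 \epsilon \Delta}
```

The second factor is at most $e^{\epsilon\Delta}$ because every term satisfies $e^{\epsilon q'(D',r')} \le e^{\epsilon\Delta} e^{\epsilon q'(D,r')}$.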
Laplace Noise as Exponential Mechanism
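• On query $q: U^n \to \mathbb{R}$ let $q'(D,r) = -|q(D) - r|$
• Prob[noise = y] = $e^{-\epsilon |y|} / (2 \int_0^\infty e^{-\epsilon t} \, dt) = (\epsilon/2) \, e^{-\epsilon |y|}$
Laplace distribution $Y = \mathrm{Lap}(b)$ has density function
$\Pr[Y = y] = \frac{1}{2b} e^{-|y|/b}$, so the noise here is distributed as $\mathrm{Lap}(1/\epsilon)$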
[Figure: plot of the Laplace density, horizontal axis from -4 to 5]
Any Differentially Private Mechanism is an
instance of the Exponential Mechanism
• Let M be a differentially private mechanism
Take q’(D,r) to be log Prob[M(D) =r]
Remaining issue: Accuracy
Private Ranking
• Each element $i \in \{1, \ldots, n\}$ has a real-valued score
$S_D(i)$ based on a data set D.
• Goal: Output k elements with highest scores.
• Privacy
• Data set D consists of n entries in domain $\mathcal{D}$.
– Differential privacy: Protects privacy of entries in D.
• Condition: Insensitive Scores
– for any element i, for any data sets D, D' that differ in
one entry: $|S_D(i) - S_{D'}(i)| \le 1$
Approximate ranking
• Let $S_k$ be the k-th highest score based on data set D.
• An output list is $\gamma$-useful if:
Soundness: No element in the output has score less than $S_k - \gamma$
Completeness: Every element with score greater than $S_k + \gamma$ is in the output.
[Figure: score ranges. Score ≤ $S_k - \gamma$: excluded. $S_k + \gamma$ ≤ Score: included. $S_k - \gamma$ ≤ Score ≤ $S_k + \gamma$: may go either way.]
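The two conditions translate directly into a checker. The sketch below is only illustrative: the function name `is_useful`, the argument name `slack` (standing in for the usefulness parameter), and the sample scores are all hypothetical.

```python
def is_useful(scores, output, k, slack):
    """Check soundness and completeness of `output` against true scores.

    scores: dict mapping element -> true score on the data set
    output: list of elements released by some ranking procedure
    slack:  the usefulness parameter from the definition
    """
    s_k = sorted(scores.values(), reverse=True)[k - 1]  # k-th highest score
    # Soundness: nothing in the output scores below S_k - slack.
    sound = all(scores[e] >= s_k - slack for e in output)
    # Completeness: everything scoring above S_k + slack is in the output.
    chosen = set(output)
    complete = all(e in chosen for e, s in scores.items() if s > s_k + slack)
    return sound and complete

# Hypothetical scores; with k = 2 the k-th highest score is 8.
example_scores = {"a": 10, "b": 8, "c": 7, "d": 3}
ok = is_useful(example_scores, ["a", "c"], k=2, slack=1)         # True
unsound = is_useful(example_scores, ["a", "d"], k=2, slack=1)    # False
incomplete = is_useful(example_scores, ["b", "c"], k=2, slack=1) # False
```

Note that `["a", "c"]` passes even though "c" is not a true top-2 element: its score lies within the slack of the k-th highest score.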
Two Approaches
• Score perturbation
– Perturb the scores of the elements with noise
– Pick the top k elements in terms of noisy scores.
– Fast and simple implementation
Note: each input affects all scores
Question: what sort of noise should be added? What sort of guarantees?
• Exponential sampling
– Run the exponential mechanism k times.
– more complicated and slower implementation
What sort of guarantees?
Homework
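To make the two approaches concrete, here is a hedged Python sketch of both. It makes no claim about the right noise distribution or the resulting guarantees, which the slide leaves open. Laplace noise with a caller-supplied `noise_scale`, and sampling without replacement in the second routine, are assumptions of this sketch; the function names are hypothetical.

```python
import math
import random

def top_k_by_perturbation(scores, k, noise_scale, rng):
    """Add noise to every score, then return the k best noisy elements."""
    # Assumption: Laplace noise, drawn as a difference of two exponentials.
    noisy = {
        e: s + rng.expovariate(1.0 / noise_scale) - rng.expovariate(1.0 / noise_scale)
        for e, s in scores.items()
    }
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

def top_k_by_exponential_sampling(scores, k, epsilon, rng):
    """Run the exponential mechanism k times, removing each chosen element."""
    remaining = dict(scores)
    picked = []
    for _ in range(k):
        elems = list(remaining)
        top = max(remaining.values())
        # Weights proportional to exp(epsilon * score), shifted for stability.
        weights = [math.exp(epsilon * (remaining[e] - top)) for e in elems]
        choice = rng.choices(elems, weights=weights, k=1)[0]
        picked.append(choice)
        del remaining[choice]
    return picked

# Hypothetical scores with a large gap between the top two and the rest.
demo_scores = {"a": 1000, "b": 900, "c": 10, "d": 5}
rng = random.Random(7)
by_noise = top_k_by_perturbation(demo_scores, 2, noise_scale=1.0, rng=rng)
by_sampling = top_k_by_exponential_sampling(demo_scores, 2, epsilon=1.0, rng=rng)
```

The first routine makes one pass over the scores, while the second recomputes the sampling weights k times, which matches the slide's remark about speed and complexity.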
Exponential Mechanism: Simple Example
(almost free) private lunch
Database of n individuals, lunch options {1…k},
each individual likes or dislikes each option (1 or 0)
Goal: output a lunch option that many like
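For each lunch option $j \in [k]$, $\ell(j)$ is the number of individuals who like j
Exponential Mechanism:
Output j with probability proportional to $e^{\epsilon \ell(j)}$
Actual probability: $e^{\epsilon \ell(j)} / \sum_i e^{\epsilon \ell(i)}$, where the denominator is the normalizer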
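Because the lunch example is so small, the output probabilities can be computed exactly and compared across two adjacent databases. The sketch below does this for a hypothetical database of four individuals and three options. The names `lunch_probabilities`, `db`, and `neighbor` are illustrative only.

```python
import math

def lunch_probabilities(likes, k, epsilon):
    """Exact output distribution of the exponential mechanism.

    likes: one row per individual, each row a list of k entries in {0, 1}.
    """
    ell = [sum(row[j] for row in likes) for j in range(k)]  # likes per option
    top = max(ell)
    weights = [math.exp(epsilon * (c - top)) for c in ell]  # shifted for stability
    total = sum(weights)                                    # the normalizer
    return [w / total for w in weights]

# Two adjacent databases: only the last individual's row differs.
db = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
neighbor = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 1]]
eps = 0.5

p = lunch_probabilities(db, 3, eps)
p_neighbor = lunch_probabilities(neighbor, 3, eps)
# Largest multiplicative change in any option's probability.
worst_ratio = max(max(a / b, b / a) for a, b in zip(p, p_neighbor))
bound = math.exp(2 * eps)
```

Here each count ℓ(j) changes by exactly 1 between the two databases, and `worst_ratio` stays below `bound`, consistent with the 2ε guarantee for scores of sensitivity 1.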
Synthetic DB: Output is a DB
[Figure: a sanitizer sits between the database and the user; query 1, query 2, ... go in and answer 1, answer 2, answer 3 come out]
Synthetic DB: output also a DB (of entries from same
universe X), user reconstructs answers by evaluating query
on output DB
Software and people compatible
Consistent answers
Answering More Queries
Using exponential mechanism
Differential privacy for every set C of counting queries
• Error is $\tilde{O}(n^{2/3} \log|C|)$
Remarkable
Hope for rich private analysis of small DBs!
• Quantitative: #queries >> DB size
• Qualitative: the output of the sanitizer is a synthetic DB, i.e. the output is a DB itself
Counting Queries
Database D of size n, with entries from universe U
• Queries with low sensitivity
Counting queries: C is a set of predicates $c: U \to \{0,1\}$
Query c: how many participants in D satisfy c?
Relaxed accuracy: answer each query within additive error α w.h.p.
Not so bad: error is inherent in statistical analysis anyway
Assume all queries are given in advance (non-interactive)
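A counting query is just a predicate summed over the database. The short sketch below shows this, plus a check of the relaxed accuracy requirement. The record layout (age, smoker flag), the query names, and the helpers `count` and `within_alpha` are hypothetical.

```python
def count(db, predicate):
    """Answer a counting query: how many entries of db satisfy the predicate."""
    return sum(1 for x in db if predicate(x))

def within_alpha(true_answers, released_answers, alpha):
    """Relaxed accuracy: every released answer is within additive error alpha."""
    return all(abs(t - r) <= alpha
               for t, r in zip(true_answers, released_answers))

# Hypothetical database of (age, is_smoker) records and a fixed query set C.
records = [(34, True), (51, False), (29, True), (62, True)]
C = {
    "smoker": lambda x: x[1],
    "over 50": lambda x: x[0] > 50,
    "smoker under 40": lambda x: x[1] and x[0] < 40,
}
answers = {name: count(records, c) for name, c in C.items()}

# Replacing one record moves each count by at most 1 (low sensitivity).
changed = records[:-1] + [(45, False)]
shifts = [abs(count(records, c) - count(changed, c)) for c in C.values()]
```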
Utility and Privacy Can’t Always Be
Achieved Simultaneously
Impossibility results for counting queries:
DB with n participants
can’t have o(√n) error, O(n) queries
[DiNi, DwMcTa07,DwYe08]
In all these cases, strong privacy violation: almost entire DB compromised
What can we do?
Huge DBs [Dwork Nissim]
DB of size n >> # queries |C|:
Add independent noise to answer on every query
Noise per query ~ #queries
For accuracy, need #queries ≤ n
May be reasonable for huge internet-scale DBs,
Privacy “for free”
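The "noise per query ~ #queries" rule can be illustrated as follows. This sketch assumes Laplace noise of scale |C|/ε on each answer, which is one standard way to split a privacy budget ε across |C| sensitivity-1 queries. The function name `answer_all`, the synthetic database, and the threshold queries are hypothetical.

```python
import random

def answer_all(db, queries, epsilon, rng):
    """Answer every counting query with independent Laplace noise."""
    # Assumption of this sketch: noise scale grows linearly with #queries.
    scale = len(queries) / epsilon
    released = []
    for c in queries:
        true_count = sum(1 for x in db if c(x))
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        released.append(true_count + noise)
    return released

# A "huge" hypothetical DB (n = 100000) and only 5 threshold queries.
big_db = list(range(100_000))
thresholds = (10_000, 20_000, 30_000, 40_000, 50_000)
threshold_queries = [lambda x, t=t: x < t for t in thresholds]
released = answer_all(big_db, threshold_queries, epsilon=1.0, rng=random.Random(0))
errors = [abs(r - t) for r, t in zip(released, thresholds)]
```

With n far larger than the number of queries, noise of scale 5 is negligible next to counts in the tens of thousands, which is the sense in which privacy comes "for free".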
What about smaller DBs?
DB of size n < #queries |C|,
impossibility results:
can’t have o(√n) error
Error must be Ω(√n)
The BLR Algorithm
[Blum Ligett Roth 08]
Algorithm on input DB D:
Sample from a distribution on DBs of size m (m < n)
DB F gets picked w.p. ∝ $e^{-\epsilon \cdot \mathrm{dist}(F,D)}$
For DBs F and D,
$\mathrm{dist}(F,D) = \max_{q \in C} |q(F) - q(D)|$
Intuition: far away DBs get smaller probability
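For a toy universe the sampling step can be carried out by brute-force enumeration. The sketch below rests on two assumptions the slide does not spell out: candidate DBs are treated as multisets of size m, and counts on a candidate F are rescaled by n/m so that they are comparable with counts on D. The function name `blr_sample` and the toy data are hypothetical.

```python
import itertools
import math
import random

def blr_sample(db, universe, queries, m, epsilon, rng):
    """Sample a size-m synthetic DB F with probability ∝ exp(-epsilon * dist(F, D))."""
    n = len(db)

    def answers(d, scale):
        return [scale * sum(1 for x in d if c(x)) for c in queries]

    target = answers(db, 1.0)
    # Enumerate every multiset of size m over the universe (exponential in m).
    candidates = list(itertools.combinations_with_replacement(universe, m))
    # dist(F, D) = max over queries of |q(F) - q(D)|, with F's counts rescaled.
    dists = [max(abs(a - t) for a, t in zip(answers(f, n / m), target))
             for f in candidates]
    best = min(dists)
    # Shifting by the smallest distance avoids underflow, same distribution.
    weights = [math.exp(-epsilon * (d - best)) for d in dists]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy instance: universe {0,1,2,3}, n = 12, synthetic DBs of size m = 4.
toy_universe = [0, 1, 2, 3]
toy_db = [0] * 6 + [1] * 3 + [3] * 3
toy_queries = [lambda x: x == 0, lambda x: x <= 1, lambda x: x >= 2]
synthetic = blr_sample(toy_db, toy_universe, toy_queries, m=4,
                       epsilon=5.0, rng=random.Random(1))
```

Even in this toy case there are 35 candidate multisets. The enumeration grows exponentially with m, so the sketch is only usable for very small universes.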
The BLR Algorithm
Idea:
• In general: Do not use large DB
– Sample and answer accordingly
• DB of size m guaranteeing hitting each query with
sufficient accuracy
The BLR Algorithm: 2ε-Privacy
Algorithm on input DB D:
Sample from a distribution on DBs of size m: (m < n)
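DB F gets picked w.p. ∝ $e^{-\epsilon \cdot \mathrm{dist}(F,D)}$
For adjacent D, D' and for every F:
$|\mathrm{dist}(F,D) - \mathrm{dist}(F,D')| \le 1$
Probability of F by D:
$e^{-\epsilon \cdot \mathrm{dist}(F,D)} / \sum_{G \text{ of size } m} e^{-\epsilon \cdot \mathrm{dist}(G,D)}$
Probability of F by D':
numerator and denominator can each change by an $e^{\epsilon}$ factor
⇒ 2ε-privacy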
The BLR Algorithm: Error $\tilde{O}(n^{2/3} \log|C|)$
Algorithm on input DB D:
Sample from a distribution on DBs of size m: (m < n)
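DB F gets picked w.p. ∝ $e^{-\epsilon \cdot \mathrm{dist}(F,D)}$
There exists $F_{good}$ of size $m = \tilde{O}((n/\alpha)^2 \cdot \log|C|)$ s.t.
$\mathrm{dist}(F_{good},D) \le \alpha$
$\Pr[F_{good}] \sim e^{-\epsilon\alpha}$
For any $F_{bad}$ with dist $2\alpha$, $\Pr[F_{bad}] \sim e^{-2\epsilon\alpha}$
Union bound: $\sum_{\text{bad DB } F_{bad}} \Pr[F_{bad}] \sim |U|^m e^{-2\epsilon\alpha}$
For $\alpha = \tilde{O}(n^{2/3} \log|C|)$, $\Pr[F_{good}] \gg \sum \Pr[F_{bad}]$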
The BLR Algorithm: Running Time
Algorithm on input DB D:
Sample from a distribution on DBs of size m: (m < n)
DB F gets picked w.p. ∝ $e^{-\epsilon \cdot \mathrm{dist}(F,D)}$
Generating the distribution by enumeration:
Need to enumerate every size-m database,
where $m = \tilde{O}((n/\alpha)^2 \cdot \log|C|)$
Running time ≈ $|U|^{\tilde{O}((n/\alpha)^2 \cdot \log|C|)}$
Conclusion
Offline algorithm, 2ε-Differential Privacy for any
set C of counting queries
• Error α is $\tilde{O}(n^{2/3} \log|C| / \epsilon)$
• Super-poly running time: $|U|^{\tilde{O}((n/\alpha)^2 \cdot \log|C|)}$
Can we Efficiently Sanitize?
The good news
If the universe is small, can sanitize EFFICIENTLY
Time poly(|C|,|U|)
The bad news
Cannot do much better: cannot sanitize in time sub-poly(|C|) AND sub-poly(|U|)
How Efficiently Can We Sanitize?
Running time as a function of |C| and |U|:
                 |C| subpoly    |C| poly
|U| subpoly          ?              ?
|U| poly             ?              ? (Good news!)
The Good News: Can Sanitize When
Universe is Small
Efficient Sanitizer for query set C
• DB size $n \ge \tilde{O}(|C|^{o(1)} \log|U|)$
• Error is ~ $n^{2/3}$
• Runtime poly(|C|,|U|)
Output is a synthetic database
Compare to [Blum Ligett Roth]:
$n \ge \tilde{O}(\log|C| \log|U|)$, runtime super-poly(|C|,|U|)
Recursive Algorithm
[Figure: nested query sets $C_0 = C \supseteq C_1 \supseteq C_2 \supseteq \dots \supseteq C_b$]
Start with DB D and large query set C
Repeatedly choose random subset $C_{i+1}$ of $C_i$:
shrink query set by (small) factor
End recursion: sanitize D w.r.t. small query set $C_b$
Output is good for all queries in small set $C_{i+1}$
Extract utility on almost-all queries in large set $C_i$
Fix remaining “underprivileged” queries in large set $C_i$