Private Analysis of Data Sets

Benny Pinkas
HP Labs, Princeton
A story
“We’re experiencing a lot of fraud lately…”
“Here too..”
“I can’t find a pattern to recognize fraud in advance..”
“Neither can I..”
“Maybe we should share information..”
“But, what about
• Patients’ privacy
• Business secrets”
“Have you heard of “Secure function evaluation”?”
“This is all “theory”. It can’t be efficient.”
New Opportunities for Interaction
Between
– Enterprises, and government agencies holding
sensitive data.
– P2P users
– Mobile wireless crowds (PDAs, cell phones)
• What about privacy?
• A bidirectional approach:
– Finding what is actually needed
– Designing useful and efficient cryptographic
tools
Cryptographic Protocols for Privacy
Preserving Computation
Input: x (first party), y (second party)
Output: F(x,y) and nothing else
As if… a trusted party received x and y and returned F(x,y) to each party.
Does the trusted party
scenario make sense?
[Figure: the parties send x and y to a trusted party; each receives F(x,y).]
• We cannot hope for more privacy
• Does the trusted party scenario make sense?
• Are the parties motivated to submit their
true inputs?
• Can they tolerate the disclosure of F(x,y)?
• If so, we can implement the scenario without a
trusted party.
Secure Function Evaluation [Yao,GMW,BGW]
• F(x,y) – A public function.
• Represented as a Boolean circuit C(x,y).
First party: input x, output C(x,y) and nothing else
Second party: input y, output nothing
Implementation:
• O(|X|) “oblivious transfers”. O(|C|) communication.
• Pretty efficient for small circuits! (but what about
larger circuits?)
An equality circuit
Output: 1 if x=y, 0 otherwise
[Figure: each bit pair (x1,y1), (x2,y2), …, (xn,yn) feeds an “=” gate; a single AND gate combines the n gate outputs.]
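As a rough illustration of the circuit in the figure, the sketch below evaluates it in the clear: one "=" gate per bit pair, followed by an AND. It offers no privacy by itself; in secure function evaluation the same gates would be evaluated under a protocol such as Yao's. The helper names `to_bits` and `equality_circuit` are hypothetical.

```python
def to_bits(value, n):
    """Return the n low-order bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def equality_circuit(x_bits, y_bits):
    """Evaluate the equality circuit: per-bit '=' gates feeding one AND."""
    if len(x_bits) != len(y_bits):
        raise ValueError("inputs must have the same number of bits")
    # Each '=' gate outputs 1 exactly when the two input bits match (XNOR).
    eq_gates = [1 - (xb ^ yb) for xb, yb in zip(x_bits, y_bits)]
    result = 1
    for gate_output in eq_gates:
        result &= gate_output  # the AND gate over all '=' outputs
    return result
```

The circuit has n "=" gates and an n-input AND, so its size grows linearly with the bit length of the inputs.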
Cryptographic methods vs.
randomization methods
[Figure: trade-off diagram with the labels “overhead”, “inaccuracy”, and “lack of privacy”; marked points: “Cryptographic methods”, “Randomization methods [statistical disclosure, AS]”, and “Our goal…”.]
Examples of Simple Privacy Preserving
Primitives (with reasonable solutions)
• Is X = Y? Is X > Y?
• What is X ∩ Y? What is median of X ∪ Y?
• Auctions (negotiations). Many parties, private
bids. Compute the winning bidder and the sale
price, but nothing else. [NPS]
• Voting
• Add privacy to data mining algs (ID3 – [LP])
Private Set Intersection
with
Mike Freedman, NYU
Kobbi Nissim, MSR
Applications of Set Intersection
[Figure: government agency A and government agency B each hold a list (people on welfare; expensive car buyers).]
Compute intersection and nothing else
Computing the Intersection
• Private Equality Test (PET)
– Alice: x. Bob: y.
– Output: 1 iff x=y
– Privacy preserving solutions:
• Cannot use hash functions alone
• Yao, [FNW], [NP]
• Generalization: list intersection
– X = x1, …, xn; Y = y1, …, yn
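The slide does not say why hash functions alone are insufficient. One standard reason, assumed here, is that inputs drawn from a small domain can be recovered by hashing every candidate. The sketch below shows that attack on a hypothetical hash-only equality test; `naive_equality_message` and `dictionary_attack` are illustrative names, not part of any protocol from the talk.

```python
import hashlib


def h(value):
    """Public hash that both parties can compute."""
    return hashlib.sha256(str(value).encode()).hexdigest()


def naive_equality_message(x):
    """What Alice would send in a (flawed) hash-only equality test."""
    return h(x)


def dictionary_attack(message, domain):
    """Bob recovers Alice's input by hashing every candidate in the domain."""
    for candidate in domain:
        if h(candidate) == message:
            return candidate
    return None
```

For example, if x is a four-digit number, Bob needs at most 10,000 hash evaluations to learn x even when x is not equal to y.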
The basic tool: Homomorphic
Encryption
• Semantically secure public key encryption
• Given Enc(M1), Enc(M2), can compute
(without knowing the decryption key)
– Enc(M1+M2)
– Enc(c·M1) for any constant c.
– I.e. Enc(a0) + Enc(a1)·x + … + Enc(an)·x^n = Enc(P(x))
• Examples: El Gamal, Paillier, DJ.
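To make the two homomorphic operations concrete, here is a minimal toy sketch of Paillier encryption, one of the schemes listed above. The primes are fixed, public, and far too small for real use, and the function names (`keygen`, `encrypt`, `decrypt`, `add`, `scalar_mul`) are hypothetical. It uses the common simplification g = n + 1 and assumes Python 3.8+ for `pow(x, -1, n)`.

```python
import math
import random


def keygen(p=2147483647, q=2305843009213693951):
    """Toy key from two fixed primes (insecure; for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # inverse of lam mod n, valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Enc(m) = (n+1)^m * r^n mod n^2 for a fresh random r coprime to n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    n2 = n * n
    return pow(n + 1, m % n, n2) * pow(r, n, n2) % n2


def decrypt(n, secret_key, c):
    lam, mu = secret_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n


def add(n, c1, c2):
    """Enc(M1), Enc(M2) -> Enc(M1 + M2): multiply the ciphertexts."""
    return c1 * c2 % (n * n)


def scalar_mul(n, c, k):
    """Enc(M), k -> Enc(k * M): raise the ciphertext to the k-th power."""
    return pow(c, k, n * n)
```

Neither `add` nor `scalar_mul` touches the secret key, which is the property the protocol on the following slides relies on.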
The Scenario
• Client: X = x1, …, xn
• Server: Y = y1, …, yn
• Output:
– Client learns X ∩ Y.
– Server learns nothing.
The Protocol
• Client defines a polynomial of degree
n whose roots are x1,…,xn
– P(y) = (x1-y)·(x2-y)·…·(xn-y)
       = an·y^n + … + a1·y + a0
• Sends to server homomorphic
encryptions of coefficients
– Enc(an),…, Enc(a0)
• (only the client can decrypt)
…The Protocol
• Server uses homomorphic properties
to compute, for every y ∈ Y,
Enc(r·P(y) + y)
(r is random)
• If y ∈ X ∩ Y, the result is Enc(r·0+y) = Enc(y),
otherwise the result is Enc(random).
• Server sends (permuted) results to C.
• C decrypts, compares to its list.
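The protocol steps can be sketched end to end as follows. This is a toy, single-process simulation under several assumptions: Paillier with g = n + 1 as the homomorphic scheme, fixed public primes that are far too small for real use, set items that are non-negative integers below N, and hypothetical function names. It is meant to show the data flow, not to serve as a secure implementation.

```python
import math
import random

P, Q = 2147483647, 2305843009213693951  # toy primes (insecure, illustration only)
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # client's secret key
MU = pow(LAM, -1, N)


def enc(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2


def dec(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N


def client_coefficients(xs):
    """Coefficients a0..an of P(y) = (x1 - y)...(xn - y), reduced mod N."""
    coeffs = [1]
    for x in xs:
        nxt = [0] * (len(coeffs) + 1)
        for i, a in enumerate(coeffs):
            nxt[i] = (nxt[i] + x * a) % N      # x * a_i contributes to y^i
            nxt[i + 1] = (nxt[i + 1] - a) % N  # -a_i contributes to y^(i+1)
        coeffs = nxt
    return coeffs


def server_response(enc_coeffs, ys):
    """For every y, compute Enc(r*P(y) + y) from the encrypted coefficients."""
    out = []
    for y in ys:
        acc = enc_coeffs[-1]
        for c in reversed(enc_coeffs[:-1]):  # Horner's rule on ciphertexts
            acc = pow(acc, y, N2) * c % N2
        r = random.randrange(1, N)           # fresh random r per item
        out.append(pow(acc, r, N2) * enc(y) % N2)
    random.shuffle(out)                      # server permutes the results
    return out


def private_intersection(xs, ys):
    enc_coeffs = [enc(a) for a in client_coefficients(xs)]  # client -> server
    replies = server_response(enc_coeffs, ys)               # server -> client
    own_items = set(xs)
    return {m for m in map(dec, replies) if m in own_items}
```

A call such as `private_intersection([3, 17, 42, 99], [5, 17, 99, 1000])` should yield the common items; a non-matching y decrypts to a value that is essentially random modulo N, so it coincides with one of the client's items only with negligible probability.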
Security
• Bad server? The server only sees
semantically secure encryptions. Learning
about C’s input = breaking the encryption.
• Bad client? The client can, given only the
output X ∩ Y, simulate her “view” in the
protocol. (I.e. she generates encryptions of
items in X ∩ Y, and of random items.)
Efficiency
• Client encrypts and decrypts n values
• Communication is O(n)
• Server:
– For each input computes Enc(r·P(y)+y),
i.e. n exponentiations.
– Total O(n^2) exponentiations
– Can use hashing to reduce overhead to
O(n ln ln n).
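The effect of hashing on the server's work can be estimated with a small cost count. The sketch below is an assumption-laden simplification: it uses one public hash to split the client's items into bins and pads every bin polynomial to the maximum load, whereas the O(n ln ln n) bound on the slide presumably relies on a more careful hashing scheme. The names `bin_of` and `server_work` are hypothetical, and no encryption is performed.

```python
import hashlib


def bin_of(item, num_bins):
    """Public hash mapping an item to a bin index."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def server_work(client_items, server_items, num_bins):
    """Count server exponentiations without and with hashing into bins."""
    bins = [[] for _ in range(num_bins)]
    for x in client_items:
        bins[bin_of(x, num_bins)].append(x)
    max_load = max(len(b) for b in bins)
    # Without bins: each y is evaluated on one degree-n polynomial.
    plain = len(server_items) * len(client_items)
    # With bins: each y is evaluated only on the polynomial of its own bin,
    # and every bin polynomial is padded to degree max_load.
    bucketed = len(server_items) * max_load
    return plain, bucketed, max_load
```

With n client items and n bins, the maximum load is typically a small number, so the bucketed count is far below the n^2 exponentiations of the basic protocol.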
Is Approximation easier?
• Can we approximate size of intersection (i.e.
scalar product) with sublinear overhead?
• Lower bound:
– Approximating |X ∩ Y| within a 1 ± ε factor requires
Ω(n) communication (constant ε).
– True even for randomized algorithms.
– Proof: reduction to Razborov’s lower bound for
Disjointness.
• Upper bound: protocols with matching
overhead.
Secure Computation of the
Kth-ranked element
with
Gagan Aggarwal, Stanford
Nina Mishra, HPL
Secure Computation of the
Kth-ranked element
• Inputs:
– A: SA, B: SB
– Large sets of unique items (from a domain D).
– There’s also the multi-party scenario
• Output: x ∈ SA ∪ SB
s.t. |{y | y < x, y ∈ SA ∪ SB}| = k-1
• Median: k = (|SA| + |SB|) / 2
Motivation
• Basic statistical analysis of
distributed data
• E.g. histogram of salaries in
competing businesses in the same area
• Sometimes the parties might want to
hide the size of their inputs
Some information is always
revealed
• The Kth-ranked element reveals some
information
• Suppose SA = x1,…,x1000
– Median of SA ∪ SB = x400
• Party A now learns that SB contains
at least 200 elements smaller than
x400
• But she shouldn’t learn more
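The figure of 200 in the example above can be checked with a short calculation. It assumes that x1 < … < x1000 are sorted in increasing order, that |SB| is even, and that the median is the element of rank k = (|SA| + |SB|)/2 as defined earlier; the symbols m and b are introduced here only for the derivation.

```latex
\begin{align*}
m &= |S_B|, \qquad b = |\{\, y \in S_B : y < x_{400} \,\}| \\
k &= \frac{|S_A| + |S_B|}{2} = \frac{1000 + m}{2} \\
399 + b &= k - 1 \;\Longrightarrow\; b = 100 + \frac{m}{2} \\
b \le m &\;\Longrightarrow\; m \ge 200 \;\Longrightarrow\; b \ge 200
\end{align*}
```

So A learns a lower bound on both the size of SB and the number of its elements below x400, which is exactly what the output itself implies.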
Results, and previous work
• Previous work: generic constructions –
overhead at least linear in k.
• New results:
– Two-party: log k secure comparisons of log
D bit numbers.
– Multi-party: log D simple computations with
log D bit numbers.
An (insecure) two-party median protocol
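[Figure: SA is split around its median mA into a lower part LA and an upper part RA; SB is split around its median mB into LB and RB. Case shown: mA < mB.]
LA lies below the median, RB lies above the median.
New median is same as original median.
Recursion ⇒ need log n rounds
(suppose each set contains 2^i items)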
Secure two-party median protocol
• A finds median of SA, call it mA. B finds median of SB, call it mB.
• Secure comparison (e.g. a small circuit): mA < mB?
– YES: A deletes x ∈ SA s.t. x < mA. B deletes x ∈ SB s.t. x > mB.
– NO: A deletes x ∈ SA s.t. x > mA. B deletes x ∈ SB s.t. x < mB.
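The round structure can be sketched as a cleartext simulation. Several details are assumptions rather than the authors' exact method: both sets have the same power-of-two size, each party uses its lower median, the party whose median is smaller drops that median together with everything below it (so both sets halve exactly), and a final minimum picks the answer. `secure_less_than` is a hypothetical stand-in that compares in the clear where the real protocol would run a secure comparison.

```python
def secure_less_than(a, b):
    """Stand-in for the secure comparison; computed in the clear here."""
    return a < b


def two_party_median(sa, sb):
    """Return (median of the union, number of comparisons used).

    The median is the element of rank len(sa) in the union of the two sets.
    """
    a, b = sorted(sa), sorted(sb)
    size = len(a)
    if size == 0 or size != len(b) or size & (size - 1) != 0:
        raise ValueError("both sets must have the same power-of-two size")
    comparisons = 0
    while len(a) > 1:
        half = len(a) // 2
        m_a, m_b = a[half - 1], b[half - 1]  # each party's local median
        comparisons += 1
        if secure_less_than(m_a, m_b):
            a, b = a[half:], b[:half]  # A keeps its upper half, B its lower half
        else:
            a, b = a[:half], b[half:]  # A keeps its lower half, B its upper half
    # One item is left on each side; the smaller one is the median.
    return min(a[0], b[0]), comparisons
```

Each round halves both sets, so sets of size n need log2(n) comparisons before the final minimum; in a real protocol that last step would also have to be computed securely.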
Proof of security
• Simulation: Given the protocol’s output, each party
can simulate the execution of the protocol
[Figure: SA drawn as a sorted line with the median marked. First comparison: mA < mB. Second comparison: mA > mB.]
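The simulation argument can be illustrated for party A with a small cleartext sketch. It assumes distinct items, equal power-of-two set sizes, lower medians, and the halving rule in which the party with the smaller median drops that median and everything below it; under those assumptions, "mA < mB" holds exactly when mA is below the final median, so A can recreate every comparison outcome from her own set and the output. The function names are hypothetical.

```python
def run_protocol(sa, sb):
    """Cleartext run: returns (median, list of outcomes of 'mA < mB')."""
    a, b = sorted(sa), sorted(sb)
    outcomes = []
    while len(a) > 1:
        half = len(a) // 2
        less = a[half - 1] < b[half - 1]
        outcomes.append(less)
        a, b = (a[half:], b[:half]) if less else (a[:half], b[half:])
    return min(a[0], b[0]), outcomes


def simulate_a_view(sa, median):
    """Recreate the comparison outcomes from A's input and the output only."""
    a = sorted(sa)
    outcomes = []
    while len(a) > 1:
        half = len(a) // 2
        less = a[half - 1] < median  # mA < mB exactly when mA is below the median
        outcomes.append(less)
        a = a[half:] if less else a[:half]
    return outcomes
```

If the simulated outcomes always match the real ones, the comparisons reveal nothing to A beyond what the median already implies.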
Arbitrary inputs, arbitrary k
[Figure: SA and SB are brought to size k and then padded, with “−” and “+” marking the padding, up to size 2^i.]
Now, compute the median of two sets of size k
Size should be a power of 2
median of new inputs = kth element of original inputs
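One way to realise this reduction is sketched below in the clear. The padding details are assumptions read from the figure: each party keeps its k smallest items and pads with +∞ up to size k, then A prepends −∞ values and B appends +∞ values until both sets have a power-of-two size. `median_of_equal_sets` is a hypothetical cleartext stand-in for the two-party median protocol.

```python
NEG_INF, POS_INF = float("-inf"), float("inf")


def median_of_equal_sets(a, b):
    """Cleartext stand-in for the two-party median (rank len(a) of the union)."""
    a, b = sorted(a), sorted(b)
    while len(a) > 1:
        half = len(a) // 2
        if a[half - 1] < b[half - 1]:
            a, b = a[half:], b[:half]
        else:
            a, b = a[:half], b[half:]
    return min(a[0], b[0])


def kth_ranked(sa, sb, k):
    """Reduce 'k-th ranked element of the union' to a median computation."""
    if not 1 <= k <= len(sa) + len(sb):
        raise ValueError("k out of range")
    # Each party keeps its k smallest items and pads with +inf up to size k.
    a = sorted(sa)[:k]
    a += [POS_INF] * (k - len(a))
    b = sorted(sb)[:k]
    b += [POS_INF] * (k - len(b))
    # Pad both sets to the next power of two: -inf for A, +inf for B.
    size = 1 << (k - 1).bit_length()
    pad = size - k
    a = [NEG_INF] * pad + a
    b = b + [POS_INF] * pad
    # The pad -inf values shift the rank, so the median of the padded sets
    # is the k-th smallest item of the original inputs.
    return median_of_equal_sets(a, b)
```

The padded union has 2·size items with size − k copies of −∞ at the bottom, so its element of rank size is the element of rank k among the original items.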
Conclusions
• Efficient privacy preserving primitives for
basic tasks
• Open problems
– Intersection: approximate matching?
– Median: clustering?
• Theory and applications can and should
interact
– Tools from the theory of cryptography (e.g. SFE)
can be used in applications
– Applications can benefit from rigorous analysis
• There’s a lot more to be done…