Private Analysis of Data Sets
Benny Pinkas
HP Labs, Princeton

A story
“We’re experiencing a lot of fraud lately…” “Here too..”
“I can’t find a pattern to recognize fraud in advance..” “Neither can I..”
“Maybe we should share information..”
“But, what about
• Patients’ privacy
• Business secrets”
“Have you heard of “Secure function evaluation”?”
“This is all “theory”. It can’t be efficient.”

New Opportunities for Interaction
Between
– Enterprises, and government agencies holding sensitive data
– P2P users
– Mobile wireless crowds (PDAs, cell phones)
• What about privacy?
• A bidirectional approach:
– Finding what is actually needed
– Designing useful and efficient cryptographic tools

Cryptographic Protocols for Privacy Preserving Computation
• Input: one party holds x, the other holds y.
• Output: F(x,y) and nothing else.
• As if… a trusted party received x and y and returned F(x,y) to both parties.

Does the trusted party scenario make sense?
• We cannot hope for more privacy than the trusted party scenario provides.
• Does the scenario itself make sense?
– Are the parties motivated to submit their true inputs?
– Can they tolerate the disclosure of F(x,y)?
• If so, we can implement the scenario without a trusted party.

Secure Function Evaluation [Yao,GMW,BGW]
• F(x,y) – A public function.
• Represented as a Boolean circuit C(x,y).
• Input → Output:
– x → C(x,y) and nothing else
– y → nothing
Implementation:
• O(|X|) “oblivious transfers”. O(|C|) communication.
• Pretty efficient for small circuits! (But what about larger circuits?)

An equality circuit
[Figure: each bit pair (x1,y1), (x2,y2), …, (xn,yn) feeds an “=” gate; an AND gate combines the results. The output is 1 if x=y and 0 otherwise.]

Cryptographic methods vs. randomization methods
[Figure: plot with the labels “overhead” (cryptographic methods), “inaccuracy” and “lack of privacy” (randomization methods [statistical disclosure, AS]), and “Our goal…”.]

Examples of Simple Privacy Preserving Primitives (with reasonable solutions)
• Is X = Y? Is X > Y?
• What is X ∩ Y? What is the median of X ∪ Y?
• Auctions (negotiations). Many parties, private bids. Compute the winning bidder and the sale price, but nothing else.
  [NPS]
• Voting
• Add privacy to data mining algs (ID3 – [LP])

Private Set Intersection
with Mike Freedman, NYU
Kobbi Nissim, MSR

Applications of Set Intersection
• Two government agencies, A and B: one holds a list of people on welfare, the other a list of expensive car buyers.
• Compute the intersection and nothing else.

Computing the Intersection
• Private Equality Test (PET)
– Alice: x. Bob: y.
– Output: 1 iff x=y
– Privacy preserving solutions:
  • Cannot use hash functions alone
  • Yao, [FNW], [NP]
• Generalization: list intersection
– X = x_1, …, x_n  Y = y_1, …, y_n

The basic tool: Homomorphic Encryption
• Semantically secure public key encryption
• Given Enc(M1), Enc(M2), can compute (without knowing the decryption key)
– Enc(M1+M2)
– Enc(c·M1) for any constant c.
– I.e. Enc(a_0) + Enc(a_1)·x + … + Enc(a_n)·x^n = Enc(P(x))
• Examples: El Gamal, Paillier, DJ.

The Scenario
• Client: X = x_1, …, x_n
• Server: Y = y_1, …, y_n
• Output:
– Client learns X ∩ Y.
– Server learns nothing.

The Protocol
• Client defines a polynomial of degree n whose roots are x_1, …, x_n
– P(y) = (x_1 − y)·(x_2 − y)·…·(x_n − y) = a_n·y^n + … + a_1·y + a_0
• Sends to server homomorphic encryptions of the coefficients
– Enc(a_n), …, Enc(a_0)
• (only the client can decrypt)

…The Protocol
• Server uses homomorphic properties to compute, for every y ∈ Y, Enc(r·P(y) + y) (r is random)
• If y ∈ X ∩ Y the result is Enc(r·0 + y) = Enc(y), otherwise the result is Enc(random).
• Server sends (permuted) results to C.
• C decrypts, compares to its list.

Security
• Bad server? The server only sees semantically secure encryptions. Learning about C’s input = breaking the encryption.
• Bad client? The client can, given only the output X ∩ Y, simulate her “view” in the protocol. (I.e. she generates encryptions of items in X ∩ Y, and of random items.)

Efficiency
• Client encrypts and decrypts n values
• Communication is O(n)
• Server:
– For each input computes Enc(r·P(y) + y), i.e. n exponentiations.
– Total O(n^2) exponentiations
– Can use hashing to reduce overhead to O(n ln ln n).

Is Approximation easier?
• Can we approximate the size of the intersection (i.e. scalar product) with sublinear overhead?
• Lower bound:
– Approximating |X ∩ Y| within a 1 ± ε factor requires Ω(n) communication (constant ε).
– True even for randomized algorithms.
– Proof: reduction to Razborov’s lower bound for Disjointness.
• Upper bound: protocols with matching overhead.

Secure Computation of the Kth-ranked element
with Gagan Aggarwal, Stanford
Nina Mishra, HPL

Secure Computation of the Kth-ranked element
• Inputs:
– A: S_A  B: S_B
– Large sets of unique items (∈ D).
– There’s also the multi-party scenario
• Output: x ∈ S_A ∪ S_B s.t. |{y | y < x, y ∈ S_A ∪ S_B}| = k − 1
• Median: k = (|S_A| + |S_B|) / 2

Motivation
• Basic statistical analysis of distributed data
• E.g. histogram of salaries in competing businesses in the same area
• Sometimes the parties might want to hide the size of their inputs

Some information is always revealed
• The Kth-ranked element reveals some information
• Suppose S_A = x_1, …, x_1000
– Median of S_A ∪ S_B = x_400
• Party A now learns that S_B contains at least 200 elements smaller than x_400. (x_400 has rank (1000 + |S_B|)/2 in the union, so about 100 + |S_B|/2 elements of S_B lie below it; that count cannot exceed |S_B|, so |S_B| ≥ 200.)
• But she shouldn’t learn more

Results, and previous work
• Previous work: generic constructions – overhead at least linear in k.
• New results:
– Two-party: log k secure comparisons of log D bit numbers.
– Multi-party: log D simple computations with log D bit numbers.

An (insecure) two-party median protocol
• Split around the local medians: S_A = L_A, m_A, R_A and S_B = L_B, m_B, R_B.
• If m_A < m_B: L_A lies below the median, R_B lies above the median.
• After removing them, the new median is the same as the original median.
• Recursion: need log n rounds (suppose each set contains 2^i items)

Secure two-party median protocol
• A finds the median of S_A, call it m_A. B finds the median of S_B, call it m_B.
• Is m_A < m_B? Decided by a secure comparison (e.g. a small circuit).
– YES: A deletes x ∈ S_A s.t. x < m_A. B deletes x ∈ S_B s.t. x > m_B.
– NO: A deletes x ∈ S_A s.t. x > m_A. B deletes x ∈ S_B s.t. x < m_B.
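The halving recursion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both sets have the same power-of-two size, all items are distinct, the "median" of a set of size k is taken to be its lower median (rank k/2), and the party whose median is smaller also drops that median so that each round removes exactly half of each set. The function names `secure_less_than` and `two_party_median` are hypothetical, and the comparison is computed in the clear as a stand-in for a real secure comparison.

```python
from typing import List


def secure_less_than(a: int, b: int) -> bool:
    """Stand-in for a secure comparison (e.g. a small circuit).

    ASSUMPTION: computed in the clear here; a real protocol would reveal
    only this one bit to the two parties.
    """
    return a < b


def two_party_median(sa: List[int], sb: List[int]) -> int:
    """Return the element of rank k in the union of two size-k sets.

    Assumes len(sa) == len(sb) == k, k a power of two, all items distinct.
    """
    a, b = sorted(sa), sorted(sb)
    k = len(a)
    if k == 0 or k != len(b) or k & (k - 1):
        raise ValueError("both sets must have the same power-of-two size")
    while k > 1:
        h = k // 2
        m_a, m_b = a[h - 1], b[h - 1]      # local (lower) medians
        if secure_less_than(m_a, m_b):
            # A's lower half is below the target, B's upper half is above it.
            a, b = a[h:], b[:h]
        else:
            # Symmetric case: A drops its upper half, B its lower half.
            a, b = a[:h], b[h:]
        k = h                              # target rank halves with the sets
    # One item left on each side: the target is the smaller of the two.
    return a[0] if secure_less_than(a[0], b[0]) else b[0]
```

For example, `two_party_median([1, 5, 9, 13], [2, 3, 20, 30])` walks through two halving rounds and one final comparison and returns 5, the 4th smallest of the eight items. Each round discards the same number of items below and above the target, which is why the target's rank halves along with the set sizes.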
Proof of security
• Simulation: Given the protocol’s output, each party can simulate the execution of the protocol
[Figure: S_A with the median marked. First comparison: m_A < m_B. Second comparison: m_A > m_B.]

Arbitrary inputs, arbitrary k
[Figure: S_A and S_B are each brought to size k using −∞ and +∞ padding items, then padded to size 2^i.]
• Now, compute the median of two sets of size k
• Size should be a power of 2
• Median of new inputs = kth element of original inputs

Conclusions
• Efficient privacy preserving primitives for basic tasks
• Open problems
– Intersection: approximate matching?
– Median: clustering?
• Theory and applications can and should interact
– Tools from the theory of cryptography (e.g. SFE) can be used in applications
– Applications can benefit from rigorous analysis
• There’s a lot more to be done…
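As a worked illustration of the polynomial-based set intersection protocol from the Private Set Intersection slides, here is a minimal Python sketch. It is not the authors' implementation. The assumptions are: a toy Paillier instance built from two small Mersenne primes (far too small to be secure), items encoded as non-negative integers below the modulus, and both parties simulated in one process. All function names are hypothetical.

```python
import math
import random

# ASSUMPTION: toy Paillier key from two Mersenne primes; insecure, demo only.
P, Q = (1 << 31) - 1, (1 << 61) - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1), private
MU = pow(LAM, -1, N)                               # needs Python 3.8+
RNG = random.SystemRandom()


def enc(m: int) -> int:
    """Enc(m) = (1 + N)^m * r^N mod N^2, using (1 + N)^m = 1 + m*N mod N^2."""
    r = RNG.randrange(1, N)
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def dec(c: int) -> int:
    """Paillier decryption with the private values LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N) * MU % N


def client_coefficients(xs):
    """Coefficients a_0..a_n of P(y) = (x_1 - y)...(x_n - y), reduced mod N."""
    coeffs = [1]
    for x in xs:
        nxt = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] = (nxt[i] + x * c) % N          # x * (c * y^i)
            nxt[i + 1] = (nxt[i + 1] - c) % N      # -y * (c * y^i)
        coeffs = nxt
    return coeffs


def server_evaluate(enc_coeffs, ys):
    """For each y, compute Enc(r*P(y) + y) from encrypted coefficients only."""
    replies = []
    for y in ys:
        acc = enc_coeffs[-1]                       # Enc(a_n)
        for c in reversed(enc_coeffs[:-1]):        # Horner's rule on ciphertexts
            acc = pow(acc, y, N2) * c % N2         # Enc(v) -> Enc(v*y + a_i)
        r = RNG.randrange(1, N)
        replies.append(pow(acc, r, N2) * enc(y) % N2)
    RNG.shuffle(replies)                           # send results permuted
    return replies


def private_intersection(xs, ys):
    """Run both roles locally; returns what the client learns."""
    enc_coeffs = [enc(a) for a in client_coefficients(xs)]   # client -> server
    replies = server_evaluate(enc_coeffs, ys)                # server -> client
    mine = set(xs)
    return {m for m in map(dec, replies) if m in mine}
```

Calling `private_intersection([3, 7, 15, 42], [7, 8, 42, 100])` yields the set containing 7 and 42: for those two server items P(y) = 0, so the client decrypts y itself, while the other replies decrypt to values that match the client's list only with negligible probability. Horner's rule costs n exponentiations per server item, matching the O(n^2) total on the Efficiency slide.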