Foundations of Privacy
Lecture 1
Lecturer: Moni Naor

What is Privacy?
Extremely overloaded term; hard to define.
"Privacy is a value so complex, so entangled in competing and contradictory dimensions, so engorged with various and distinct meanings, that I sometimes despair whether it can be usefully addressed at all."
Robert C. Post, Three Concepts of Privacy, 89 Geo. L.J. 2087 (2001).
Privacy is like oxygen – you only feel it when it is gone.

What is Privacy?
Extremely overloaded term
• "the right to be let alone"
  – Samuel D. Warren and Louis D. Brandeis, The Right to Privacy, Harv. L. Rev. (1890)
• "our concern over our accessibility to others: the extent to which we are known to others, the extent to which others have physical access to us, and the extent to which we are the subject of others' attention"
  – Ruth Gavison, "Privacy and the Limits of the Law," Yale Law Journal (1980)

What is Privacy?
Extremely overloaded term; the concern has arisen in many contexts:
• Photojournalism
  – The trigger for Louis Brandeis and Samuel Warren, The Right to Privacy, Harvard Law Rev. 1890
• Census data
  – Mandatory participation; must not reveal individual data
• Huge databases collected by companies
  – Data deluge; example: "Ravkav"
• Public surveillance
  – Cameras
  – RFIDs
• Social networks

Official Description
The availability of fast and cheap computers coupled with massive storage devices has enabled the collection and mining of data on a scale previously unimaginable. This opens the door to potential abuse of individuals' information. There has been considerable research exploring the tension between utility and privacy in this context. The goal is to explore techniques and issues related to data privacy. In particular:
• Definitions of data privacy
• Techniques for achieving privacy
• Limitations on privacy in various settings
• Privacy issues in specific settings

Planned Topics
Privacy of data analysis
• Differential privacy
  – Definition and properties
  – Statistical databases
  – Dynamic data
• Privacy of learning algorithms
• Privacy of genomic data
Interaction with cryptography
• Secure function evaluation (SFE)
• Voting
• Entropic security
• Data structures
• Everlasting security
• Privacy enhancing technologies
  – Mix nets

Course Information
Foundations of Privacy - Spring 2010
Instructor: Moni Naor (Office: Ziskind 248, Phone: 3701, E-mail: moni.naor@)
When: Mondays, 11:00--13:00 (2 points)
Where: Ziskind 1
• Course web page: www.wisdom.weizmann.ac.il/~naor/COURSE/foundations_of_privacy.html
• Prerequisites: familiarity with algorithms, data structures, probability theory, and linear algebra, at an undergraduate level; a basic course in computability is assumed.
• Requirements:
  – Participation in discussion in class
    • Best: read the papers ahead of time
  – Homework: there will be several homework assignments
    • Homework assignments should be turned in on time (usually two weeks after they are given)!
  – Class project and presentation
  – Exam: none planned

Projects
• Report on a paper
• Apply a notion studied to some known domain
• Check the state of privacy in some setting

Cryptography and Privacy
Extremely relevant - but does not solve the privacy problem.

Secure Function Evaluation
• How to distributively compute a function f(X1, X2, …, Xn),
  – where Xj is known to party j
• E.g., Σ = sum(a, b, c, …)
  – Parties should only learn the final output (Σ)
• Many results, depending on
  – Number of players
  – Means of communication
  – The power and model of the adversary
  – How the function is represented
• Here we are more worried about what to compute than how to compute it.

Example: Securely Computing Sums
Inputs: 0 ≤ Xi ≤ P-1 for each party i; want to compute Σi Xi mod P.
• Party 1 selects r ∈R [0..P-1] and sends Y1 = X1 + r mod P to party 2.
• Party i receives Yi-1 and sends Yi = Yi-1 + Xi mod P to the next party; party n sends Yn back to party 1.
• Party 1 receives Yn and announces Σi Xi = Yn - r mod P.
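A minimal Python sketch of this ring protocol (not part of the original slides; the message passing is simulated inside a single process, and names such as secure_sum are made up for illustration):

```python
import random

def secure_sum(inputs, P):
    """Toy simulation of the ring protocol for computing sum(X_i) mod P.

    Party 1 masks its input with a random r; each subsequent party adds
    its own input to the running value; party 1 removes the mask at the
    end and announces the sum.
    """
    assert all(0 <= x < P for x in inputs)
    r = random.randrange(P)            # party 1's secret mask
    y = (inputs[0] + r) % P            # Y_1 = X_1 + r
    for x in inputs[1:]:               # party i sends Y_i = Y_{i-1} + X_i
        y = (y + x) % P
    return (y - r) % P                 # party 1 announces Y_n - r

if __name__ == "__main__":
    P = 101
    xs = [13, 7, 42, 5, 30]
    print(secure_sum(xs, P), sum(xs) % P)   # both print 97
```

Because of the mask r, each intermediate value Yi is uniformly distributed on its own, which is why a single curious party learns nothing; the next slide asks whether this is actually enough.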
Is this Protocol Secure?
To talk rigorously about cryptographic security:
• Specify the power of the adversary
  – Access to the data/system
  – Computational power? (can be all-powerful here)
  – "Auxiliary" information?
• Define a break of the system
  – What is a compromise?
  – What is a "win" for the adversary?
Note: if the adversary controls two players, the protocol is insecure.

The Simulation Paradigm
A protocol is considered secure if:
• For every adversary (of a certain type)
• There exists a simulator that outputs an indistinguishable "transcript".
Examples:
• Encryption
• Zero-knowledge
• Secure function evaluation
The power of analogy.

SFE: Simulating the Ideal Model
A protocol is considered secure if:
• For every adversary there exists a simulator operating in the "ideal" (trusted party) model that outputs an indistinguishable transcript.
Breaking = distinguishing!
Major result: "Any function f that can be evaluated using polynomial resources can be securely evaluated using polynomial resources."

The Problem with SFE
SFE does not imply privacy:
• The problem is with the ideal model
  – E.g., Σ = sum(a, b)
  – Each player learns only what can be deduced from Σ and her own input to f
  – If Σ and a yield b, so be it.
Need ways of talking about leakage even in the ideal model.

Statistical Data Analysis
Huge social benefits from analyzing large collections of data:
• Finding correlations
  – E.g. medical: genotype/phenotype correlations
• Providing better services
  – Improve web search results, fit ads to queries
• Publishing official statistics
  – Census, contingency tables
• Data mining
  – Clustering, learning association rules, decision trees, separators, principal component analysis
However: the data contains confidential information. What about privacy? (Better privacy can also mean better data.)

Example of Utility
[Figure: John Snow's map of cholera cases in the 1854 London epidemic, with the suspected pump marked.]

Modern Privacy of Data Analysis
Is public analysis of private data a meaningful/achievable goal?
The holy grail: get the utility of statistical analysis while protecting the privacy of every individual participant.
Ideally: "privacy-preserving" sanitization allows reasonably accurate answers to meaningful questions.

Sanitization: Traditional View
[Diagram: Data → Curator/Sanitizer → Output, which goes to the analyst/adversary A.]
A trusted curator can access the DB of sensitive information and should publish a privacy-preserving sanitized version.

Traditional View: Interactive Model
[Diagram: Data → Sanitizer, which receives query 1, query 2, … and returns answers.]
Multiple queries, chosen adaptively.

Sanitization: Traditional View
How to sanitize? Anonymization?
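As a point of reference for the attacks that follow, here is a toy sketch of what "anonymization" is often taken to mean in practice: drop the explicit identifiers and replace each user id with a random pseudonym. This is illustrative only; the field names are invented and this is not a recommended sanitizer.

```python
import secrets

def naive_anonymize(records, identifying_fields=("name", "email")):
    """Toy 'anonymization': strip identifying fields and replace each
    user id with a random pseudonym. The remaining quasi-identifiers
    (ratings, dates, locations, ...) are left untouched."""
    pseudonyms = {}
    out = []
    for rec in records:
        rec = {k: v for k, v in rec.items() if k not in identifying_fields}
        uid = rec.pop("user_id")
        if uid not in pseudonyms:
            pseudonyms[uid] = secrets.token_hex(4)
        rec["user_id"] = pseudonyms[uid]
        out.append(rec)
    return out

records = [{"user_id": 17, "name": "Alice", "movie": "Brazil",
            "rating": 5, "date": "2005-07-01"}]
print(naive_anonymize(records))
```

The quasi-identifiers left behind (ratings, dates, locations, search queries) are exactly what the linkage attacks below exploit.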
Auxiliary Information
• Information from any source other than the statistical database
  – Other databases, including old releases of this one
  – Newspapers
  – General comments from insiders
  – Government reports, census website
  – Inside information from a different organization
    • E.g., Google's view, if the attacker/user is a Google employee

Linkage Attacks: Malicious Use of Auxiliary Information
The Netflix Prize
• Netflix recommends movies to its subscribers
  – Seeks an improved recommendation system
  – Offered $1,000,000 for a 10% improvement
    • Not concerned here with how this is measured
  – Published training data
• The prize was won in September 2009 by the "BellKor's Pragmatic Chaos" team.

From the Netflix Prize Rules Page…
• "The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles."
• "The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided."

Netflix Data Release [Narayanan-Shmatikov 2008]
• Ratings for a subset of movies and users
• Usernames replaced with random IDs
• Some additional perturbation
(Credit: Arvind Narayanan via Adam Smith)

A Source of Auxiliary Information
• Internet Movie Database (IMDb)
  – Individuals may register for an account and rate movies
  – Need not be anonymous
    • Probably want to create some web presence
  – Visible material includes ratings, dates, comments

Use Public Reviews from IMDb.com
[Diagram: anonymized Netflix data + public, incomplete IMDb data (Alice, Bob, Charlie, Danielle, Erica, Frank) = identified Netflix data.]
(Credit: Arvind Narayanan via Adam Smith)

De-anonymizing the Netflix Dataset
Results:
• "With 8 movie ratings and dates that may have a 3-day error, 96% of Netflix subscribers whose records have been released can be uniquely identified in the dataset." (Of the 8 ratings, 2 may be completely wrong.)
• "For 89%, 2 ratings and dates are enough to reduce the set of plausible records to 8 out of almost 500,000, which can then be inspected by a human for further deanonymization."
Consequences?
  – Learn about movies that IMDb users didn't want to tell the world about: sexual orientation, religious beliefs, …
  – Video Privacy Protection Act 1988
  – Subject of lawsuits at the time (settled, March 2010)
(Credit: Arvind Narayanan via Adam Smith)

AOL Search History Release (2006)
• 650,000 users, 20 million queries, 3 months
• AOL's goal: provide real query logs from real users
• Privacy?
  – "Identifying information" replaced with random identifiers
  – But: different searches by the same user are still linked
• One user was re-identified from her queries: Thelma Arnold, age 62, a widow residing in Lilburn, GA.

Other Successful Attacks
• Against anonymized HMO records [Sweeney 98]
  – Proposed k-anonymity
• Against k-anonymity [MGK06]
  – Proposed l-diversity
• Against l-diversity [XT07]
  – Proposed m-invariance
• Against all of the above [GKS08]

"Composition" Attacks [Ganta-Kasiviswanathan-Smith, KDD 2008]
• Example: two hospitals serve overlapping populations. What if they independently release "anonymized" statistics?
• Composition attack: combine the independent releases
  – From hospital A's release: "Adam has either diabetes or high blood pressure"
  – From hospital B's release: "Adam has either diabetes or emphysema"
  – Together: Adam has diabetes.
• "IPUMS" census data set: 70,000 people, randomly split into 2 pieces with an overlap of 5,000. With a popular technique (k-anonymity, k=30) applied to each database, one can learn the "sensitive" variable for 40% of the individuals.
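A toy sketch of the composition (intersection) attack, assuming each hospital's release lets the attacker narrow Adam's condition down to a small candidate set (the sets and names below are made up):

```python
def composition_attack(candidate_sets):
    """Intersect the candidate sets an attacker derives about the same
    person from independently 'anonymized' releases. Each release may
    be individually uninformative, yet their intersection can pin down
    the sensitive value."""
    result = set(candidate_sets[0])
    for s in candidate_sets[1:]:
        result &= set(s)
    return result

# What hospital A's release reveals about Adam, and what hospital B's does:
from_hospital_a = {"diabetes", "high blood pressure"}
from_hospital_b = {"diabetes", "emphysema"}
print(composition_attack([from_hospital_a, from_hospital_b]))  # {'diabetes'}
```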
Analysis of Social Network Graphs
• "Friendship" graph
  – Nodes correspond to users
  – Users may list others as "friend," creating an edge
    • Edges are annotated with directional information
• Hypothetical research question
  – How frequently is the "friend" designation reciprocated?

Attack
• Replace node names/labels with random identifiers
• Permits analysis of the structure of the graph
• Privacy hope: randomized identifiers make it hard/impossible to identify nodes with specific individuals,
  – thereby hiding who is connected to whom
• Disastrous! [Backstrom-Dwork-Kleinberg 07]
  – Vulnerable to active and passive attacks

Flavor of Active Attack
[Diagram: attack contacts A and B are connected to the targets "Steve" (S) and "Jerry" (J).]
• Finding A and B in the released graph allows finding Steve and Jerry.
• Magic step: isolate lightly linked-in subgraphs from the rest of the graph; the special structure of the planted subgraph permits finding A and B.

Why Settle for Ad Hoc Notions of Privacy?
Dalenius, 1977:
• Anything that can be learned about a respondent from the statistical database can be learned without access to the database
  – Captures the possibility that "I" may be an extrovert
  – The database doesn't leak personal information
  – Adversary is a user
Goldwasser-Micali 1982:
• Analogous to semantic security for crypto
  – Anything that can be learned from the ciphertext can be learned without the ciphertext
  – Adversary is an eavesdropper

Computational Security of Encryption: Semantic Security
Whatever adversary A can compute on an encrypted string X ∈ {0,1}^n, so can an A' that does not see the encryption of X, yet simulates A's knowledge with respect to X.
A selects:
• A distribution Dn on {0,1}^n
• A relation R(X,Y) computable in probabilistic polynomial time
For every ppt A there is a ppt A' so that for every such relation R and X ∈R Dn:
  |Pr[R(X, A(E(X)))] - Pr[R(X, A'())]| is negligible.
The outputs of A and A' are indistinguishable even for a tester who knows X.
[Diagram: A receives Dn and E(X); A' receives only Dn; X ∈R Dn; each outputs Y; the tester checks R(X,Y).]

Making it Slightly Less Vague: Cryptographic Rigor Applied to Privacy
• Define a break of the system
  – What is a compromise?
  – What is a "win" for the adversary?
• Specify the power of the adversary
  – Access to the data
  – Computational power?
  – "Auxiliary" information?
• Conservative/paranoid by nature
  – Protect against all feasible attacks

In Full Generality: Dalenius' Goal Is Impossible
  – The database teaches that smoking causes cancer
  – I smoke in public
  – Access to the DB teaches that I am at increased risk for cancer
• But what about cases where there is significant knowledge about the database distribution?

Outline
• The framework
• A general impossibility result
  – Dalenius' goal cannot be achieved in a very general sense
• The proof
  – Simplified
  – General case

Two Models
• Non-interactive: the data are sanitized and released
  [Diagram: Database → San → sanitized database]
• Interactive: multiple queries, adaptively chosen
  [Diagram: Database → San ↔ queries and answers]
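A bare-bones sketch of the interactive model (illustrative only: the query interface is invented, and this curator answers exactly, so by itself it offers no privacy protection):

```python
class InteractiveCurator:
    """Trusted curator holding the sensitive database and answering
    adaptively chosen queries. Answers here are exact, which provides
    utility but no privacy; what the curator should do instead is the
    subject of later lectures."""

    def __init__(self, database):
        self._db = list(database)   # list of records (dicts)

    def count(self, predicate):
        """Answer a counting query: how many records satisfy the predicate?"""
        return sum(1 for row in self._db if predicate(row))

# An analyst (or adversary) issues queries adaptively, choosing the
# next query based on previous answers.
curator = InteractiveCurator([{"age": 34, "smoker": True},
                              {"age": 51, "smoker": False}])
n_smokers = curator.count(lambda row: row["smoker"])
n_young_smokers = curator.count(lambda row: row["smoker"] and row["age"] < 40)
print(n_smokers, n_young_smokers)
```

How the curator should distort or limit its answers is exactly the sanitization question raised above.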
Auxiliary Information
A common theme in many privacy horror stories:
• Not taking into account side information
  – Netflix challenge: not taking into account IMDb [Narayanan-Shmatikov]
  – Here the database is the Netflix data, SAN(DB) = "remove names", and IMDb is the auxiliary information.

Not Learning from the DB
[Diagram: with access to the database, DB → San → A, which also receives the auxiliary information; without access, A' receives only the auxiliary information.]
• There is some utility of the DB that legitimate users should learn
• Possible breach of privacy
• Goal: users learn the utility without the breach
Want: anything that can be learned about an individual from the database can be learned without access to the database.
• More formally: ∀D ∀A ∃A' such that, with high probability over DB ∈R D, for all auxiliary information z:
  |Pr[A(z) ↔ DB wins] - Pr[A'(z) wins]| is small.

Illustrative Example for the Difficulty
Example: suppose the height of an individual is sensitive information
  – The average height in the DB is not known a priori
• Aux z = "Adam is 5 cm shorter than the average in the DB"
  – A learns the average height in the DB, hence also Adam's height
  – A' does not

Defining "Win": The Compromise Function
[Diagram: the adversary outputs a string y; a decider, given DB and y, outputs 0/1: compromise or not.]
A notion of privacy compromise should satisfy:
• The compromise should be non-trivial:
  – it should not be possible to find a privacy breach from the auxiliary information alone.
• A privacy breach should exist:
  – given DB there should be a y that is a privacy breach
  – it should be possible to find y efficiently.

Basic Concepts
• Distribution on (finite) databases D
  – Something about the database must be unknown
  – Captures knowledge about the domain
    • E.g., rows of the database correspond to owners of 2 pets
• Privacy mechanism San(D, DB)
  – Can be interactive or non-interactive
  – May have access to the distribution D
• Auxiliary information generator AuxGen(D, DB)
  – Has access to the distribution and to DB
  – Formalizes partial knowledge about DB
• Utility vector w
  – Answers to k questions about the DB
  – (Most of) the utility vector can be learned by the user
  – Utility: must inherit sufficient min-entropy from the source D

Impossibility Theorem: Informal
• For any* distribution D on databases DB
• For any* reasonable privacy compromise decider C
• Fix any useful* privacy mechanism San ("useful": tells us information we did not know)
Then
• There is an auxiliary info generator AuxGen and an adversary A (with z = AuxGen(DB))
such that
• For all adversary simulators A':
  [A(z) ↔ San(DB)] finds a compromise (wins), but [A'(z)] does not win.

Impossibility Theorem
Fix any useful* privacy mechanism San and any reasonable privacy compromise decider C. Then there is an auxiliary info generator AuxGen and an adversary A such that for "all" distributions D and all adversary simulators A':
  Pr[A(D, San(D,DB), AuxGen(D, DB)) wins] - Pr[A'(D, AuxGen(D, DB)) wins] ≥ Δ
for a suitably large gap Δ.
The probability spaces are over the choice of DB ∈R D and the coin flips of San, AuxGen, A, and A'.
To completely specify the theorem, one needs an assumption on the entropy of the utility vector W and on how well San(W) behaves.
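Before turning to the proof strategy, here is a toy numerical version of the earlier "Adam is 5 cm shorter than average" example (all values invented): the auxiliary information is useless on its own, but combined with a statistic the sanitizer legitimately releases it pins down Adam's sensitive value.

```python
import random

# Invented database of heights (cm); Adam is participant 0.
random.seed(1)
heights = [round(random.gauss(175, 8), 1) for _ in range(1000)]
adam = heights[0]

# Utility the sanitizer legitimately releases: the average height.
released_average = sum(heights) / len(heights)

# Auxiliary information z: Adam's offset from the DB average
# (in the lecture's story, exactly "5 cm shorter than average").
z_offset = adam - released_average

# Adversary A sees both z and the release, so it recovers Adam's height.
adam_according_to_A = released_average + z_offset
print(adam_according_to_A, adam)   # identical

# Simulator A' sees only z: without the released average it cannot
# pin down Adam's height, so it does not "win".
```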
Strategy
• The auxiliary info generator will provide a hint that, together with the utility vector w, yields the privacy breach.
• Want AuxGen to work without knowing D, just DB
  – Find a privacy breach y and encode it in z
  – Make sure z alone does not give y; only together with w
• Complication: is the utility vector w
  – completely learned by the user?
  – or just an approximation?

Entropy of Random Sources
• Source:
  – A probability distribution X on {0,1}^n
  – Contains some "randomness"
• Measures of "randomness":
  – Shannon entropy: H(X) = - Σ_x P_X(x) log P_X(x)
    • Represents how much we can compress X on average
    • But even a high-entropy source may have a point with probability 0.9
  – Min-entropy: H∞(X) = - log max_x P_X(x)
    • Determined by the most likely value of X
• Definition: X is a k-source if H∞(X) ≥ k, i.e. Pr[X=x] ≤ 2^-k for all x.

Min-entropy
• Definition: X is a k-source if H∞(X) ≥ k, i.e. Pr[X=x] ≤ 2^-k for all x
• Examples:
  – Bit-fixing: some k coordinates of X uniform, the rest fixed
    • or even depending arbitrarily on the others
  – Unpredictable source: ∀ i ∈ [n] and b1, …, bi-1 ∈ {0,1}:
      k/n ≤ Pr[Xi = 1 | X1 = b1, …, Xi-1 = bi-1] ≤ 1 - k/n
  – Flat k-source: uniform over a set S ⊆ {0,1}^n with |S| = 2^k
• Fact: every k-source is a convex combination of flat ones.

Min-Entropy and Statistical Distance
For a probability distribution X over {0,1}^n:
  H∞(X) = - log max_x Pr[X = x]
(determined by the probability of the most likely value of X); X is a k-source if H∞(X) ≥ k.
Statistical distance: Δ(X,Y) = ½ Σ_a |Pr[X=a] - Pr[Y=a]|.
We want the extracted output to be close to the uniform distribution.

Extractors
A universal procedure for "purifying" an imperfect source.
Definition: Ext: {0,1}^n × {0,1}^d → {0,1}^ℓ is a (k,ε)-extractor if for any k-source X:
  Δ(Ext(X, Ud), Uℓ) ≤ ε.
[Diagram: a k-source x of length n (at least 2^k strings in {0,1}^n) and a d-bit random seed s enter Ext, which outputs ℓ almost-uniform bits.]

Strong Extractors
The output looks random even after seeing the seed.
Definition: Ext is a (k,ε) strong extractor if Ext'(x,s) = s ∘ Ext(x,s) is a (k,ε)-extractor,
• i.e. for every k-source X, for a 1-ε' fraction of the seeds s ∈ {0,1}^d, Ext(X,s) is ε-close to Uℓ.

Extractors from Hash Functions
• Leftover Hash Lemma [ILL89]: universal (pairwise independent) hash functions yield strong extractors
  – output length: ℓ = k - 2 log(1/ε)
  – seed length: d = O(n)
  – Example: Ext(x,(a,b)) = first ℓ bits of a·x + b in GF[2^n]
• Almost pairwise independence:
  – seed length: d = O(log n + k)

Suppose w Is Learned Completely
AuxGen and A share a secret: the utility vector w.
AuxGen(DB):
• Find a privacy breach y of DB, of length ℓ
• Find w from DB (by simulating A)
• Choose s ∈R {0,1}^d and compute Ext(w,s)
• Set z = (s, Ext(w,s) ⊕ y)
[Diagram: DB → San → A, which also receives z from AuxGen; the decider C outputs 0/1. A', given only z, faces the same decider.]
• A, who learns w from San(DB), computes Ext(w,s) and recovers y = Ext(w,s) ⊕ (Ext(w,s) ⊕ y).
Technical conditions: H∞(W|y) ≥ |y| and |y| "safe".

Why is it a compromise?
Why doesn't A' learn y:
• For each possible value of y, (s, Ext(w,s)) is ε-close to uniform
• Hence (s, Ext(w,s) ⊕ y) is ε-close to uniform.
Need H∞(W) ≥ 3ℓ + O(1).
Technical conditions: H∞(W|y) ≥ |y| and |y| "safe".

To Complete the Proof
• Handle the case where not all of w is retrieved.
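A self-contained toy sketch of the construction above. It is illustrative only: the extractor uses a pairwise-independent hash a·w + b over a prime field instead of the GF(2^n) construction on the slide, the parameters are tiny, and all names are made up. AuxGen hides the breach y under Ext(w, s); an adversary who learns the utility vector w strips the mask, while to anyone without w the hint z is close to uniform.

```python
import secrets

# Toy parameters: w is an n-bit utility vector, y an l-bit privacy breach.
N_BITS = 64
L_BITS = 16
P = (1 << 89) - 1        # a Mersenne prime, comfortably larger than 2**N_BITS

def ext(w, seed, l_bits=L_BITS):
    """Pairwise-independent hash used as an extractor (Leftover Hash
    Lemma style): Ext(w, (a, b)) = low l bits of a*w + b mod P.
    A stand-in for the GF(2^n) construction on the slide."""
    a, b = seed
    return ((a * w + b) % P) & ((1 << l_bits) - 1)

def aux_gen(w, y):
    """AuxGen: given the utility vector w (obtained by simulating A on DB)
    and a privacy breach y of DB, output z = (s, Ext(w, s) XOR y)."""
    seed = (secrets.randbelow(P), secrets.randbelow(P))
    return seed, ext(w, seed) ^ y

def adversary_A(z, w):
    """A learns w from San(DB), so it can strip the mask and output y."""
    seed, masked = z
    return masked ^ ext(w, seed)

# Demo with made-up values: w would come from the sanitizer's answers,
# y from the compromise function.
w = secrets.randbits(N_BITS)
y = secrets.randbits(L_BITS)
z = aux_gen(w, y)
assert adversary_A(z, w) == y     # A recovers the breach
# A', who never learns w, sees only z; when w has enough min-entropy,
# (s, Ext(w, s)) is close to uniform, so z is close to uniform and
# reveals (almost) nothing about y.
```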