Privacy in Data Management Sharad Mehrotra 1 Privacy - definitions Generic - Privacy is the interest that individuals have in sustaining a 'personal space', free from interference by other people and organizations. Information Privacy - The degree to which an individual can determine which personal information is to be shared with whom and for what purpose. - The evolving relationship between technology and the legal right to, or public expectation of, privacy in the collection and sharing of data Identity privacy (anonymity) - Anonymity of an element (belonging to a set) refers to the property of that element of not being identifiable within the set, i.e., being indistinguishable from the other elements of the set 2 Means of achieving privacy Information Security is the process of protecting data from unauthorized access, use, disclosure, destruction, modification, or disruption. Enforcing security in information processing applications: 1. Law 2. Access control 3. Data encryption 4. Data transformation – statistical disclosure control Techniques used depend on: - Application semantics/functionality requirements - Nature of data - Privacy requirements/metrics Privacy is contextual 3 Overview Study the nature of privacy in the context of data-centric applications 1. Privacy-preserving data publishing for data mining applications 2. Secure outsourcing of data: “Database as a Service (DAS)” 3. Privacy-preserving implementation of pervasive spaces 4. Secure data exchange and sharing between multiple parties 4 Privacy-Preserving / Anonymized Data Publishing 5 Why Anonymize? For Data Sharing Give real(istic) data to others to study without compromising privacy of individuals in the data Allows third parties to try new analysis and mining techniques not thought of by the data owner For Data Retention and Usage Various requirements prevent companies from retaining customer information indefinitely E.g.
Google progressively anonymizes IP addresses in search logs Internal sharing across departments (e.g. billing, marketing) 6 Why Privacy? Data subjects have inherent right and expectation of privacy “Privacy” is a complex concept (beyond the scope of this tutorial) What exactly does “privacy” mean? When does it apply? Could there exist societies without a concept of privacy? Concretely: at collection, “small print” outlines privacy rules Most companies have adopted a privacy policy E.g. AT&T privacy policy att.com/gen/privacy-policy?pid=2506 Significant legal framework relating to privacy UN Declaration of Human Rights, US Constitution HIPAA, Video Privacy Protection Act, Data Protection Acts 7 Case Study: US Census Raw data: information about every US household Who, where; age, gender, racial, income and educational data Why released: determine representation, planning How anonymized: aggregated to geographic areas (ZIP code) Broken down by various combinations of dimensions Released in full after 72 years Attacks: no reports of successful deanonymization Recent attempts by FBI to access raw data rebuffed Consequences: greater understanding of US population Affects representation, funding of civil projects Rich source of data for future historians and genealogists 8 Case Study: Netflix Prize Raw data: 100M dated ratings from 480K users of 18K movies Why released: improve predicting ratings of unlabeled examples How anonymized: exact details not described by Netflix All direct customer information removed Only subset of full data; dates modified; some ratings deleted; movie title and year published in full Attacks: dataset is claimed vulnerable [Narayanan Shmatikov 08] Attack links data to IMDB where same users also rated movies Find matches based on similar ratings or dates in both Consequences: rich source of user data for researchers unclear if attacks are a threat—no lawsuits or apologies yet 9 Case Study: AOL Search Data Raw data: 20M search queries for 650K users
from 2006 Why released: allow researchers to understand search patterns How anonymized: user identifiers removed All searches from same user linked by an arbitrary identifier Attacks: many successful attacks identified individual users Ego-surfers: people typed in their own names Zip codes and town names identify an area NY Times identified 4417749 as 62yr old GA widow [Barbaro Zeller 06] Consequences: CTO resigned, two researchers fired Well-intentioned effort failed due to inadequate anonymization 10 Three Abstract Examples “Census” data recording incomes and demographics Schema: (SSN, DOB, Sex, ZIP, Salary) Tabular data—best represented as a table “Video” data recording movies viewed Schema: (Uid, DOB, Sex, ZIP), (Vid, title, genre), (Uid, Vid) Graph data—graph properties should be retained “Search” data recording web searches Schema: (Uid, Kw1, Kw2, …) Set data—each user has different set of keywords Each example has different anonymization needs 11 Models of Anonymization Interactive Model (akin to statistical databases) Data owner acts as “gatekeeper” to data Researchers pose queries in some agreed language Gatekeeper gives an (anonymized) answer, or refuses to answer “Send me your code” model Data owner executes code on their system and reports result Cannot be sure that the code is not malicious Offline, aka “publish and be damned” model Data owner somehow anonymizes data set Publishes the results to the world, and retires Our focus in this tutorial – seems to model most real releases 12 Objectives for Anonymization Prevent (high confidence) inference of associations Prevent inference of salary for an individual in “census” Prevent inference of individual’s viewing history in “video” Prevent inference of individual’s search history in “search” All aim to prevent linking sensitive information to an individual Prevent inference of presence of an individual in the data set Satisfying “presence” also satisfies “association” (not vice-versa) Presence in a data 
set can violate privacy (e.g. STD clinic patients) Have to model what knowledge might be known to attacker Background knowledge: facts about the data set (X has salary Y) Domain knowledge: broad properties of data (illness Z rare in men) 13 Utility Anonymization is meaningless if utility of data not considered The empty data set has perfect privacy, but no utility The original data has full utility, but no privacy What is “utility”? Depends what the application is… For fixed query set, can look at max, average distortion Problem for publishing: want to support unknown applications! Need some way to quantify utility of alternate anonymizations 14 Measures of Utility Define a surrogate measure and try to optimize Often based on the “information loss” of the anonymization Simple example: number of rows suppressed in a table Give a guarantee for all queries in some fixed class Hope the class is representative, so other uses have low distortion Costly: some methods enumerate all queries, or all anonymizations Empirical Evaluation Perform experiments with a reasonable workload on the result Compare to results on original data (e.g. Netflix prize problems) Combinations of multiple methods Optimize for some surrogate, but also evaluate on real queries 15 Definitions of Technical Terms Identifiers–uniquely identify, e.g.
Social Security Number (SSN) Step 0: remove all identifiers Was not enough for AOL search data Quasi-Identifiers (QI)—such as DOB, Sex, ZIP Code Enough to partially identify an individual in a dataset DOB+Sex+ZIP unique for 87% of US Residents [Sweeney 02] Sensitive attributes (SA)—the associations we want to hide Salary in the “census” example is considered sensitive Not always well-defined: only some “search” queries sensitive In “video”, association between user and video is sensitive SA can be identifying: bonus may identify salary… 16 Summary of Anonymization Motivation Anonymization needed for safe data sharing and retention Many legal requirements apply Various privacy definitions possible Primarily, prevent inference of sensitive information Under some assumptions of background knowledge Utility of the anonymized data needs to be carefully studied Different data types imply different classes of query 17 Privacy issues in data outsourcing (DAS) and cloud computing applications 18 Motivation 22 Example: DAS - Secure outsourcing of data management [Figure: Data owner/Client connects over the Internet to the Server at the Service Provider] Issues: Confidentiality – confidential information in data needs to be protected Features – support queries on data: SQL, keyword-based search queries, XPath queries etc. Performance – bulk of work to be done on server; reduce communication overhead, client-side storage and postprocessing of solutions.
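The DAS workflow above can be sketched end to end; the bucket-based index it relies on is developed on the next slides. A minimal sketch in Python, assuming toy base64 "encryption" as a stand-in for a real cipher such as AES, and hypothetical salary buckets mirroring the slides' example:

```python
import base64
import json

# Toy stand-in for real encryption (e.g. AES); illustrates the protocol only.
def encrypt(row):
    return base64.b64encode(json.dumps(row).encode()).decode()

def decrypt(etuple):
    return json.loads(base64.b64decode(etuple))

# Client-side metadata: bucket boundaries over the sensitive attribute.
# These labels/ranges follow the slides' example: B0=[0,20K), B1=[20K,30K), ...
BUCKETS = {"B0": (0, 20_000), "B1": (20_000, 30_000),
           "B2": (30_000, 40_000), "B3": (40_000, 50_000)}

def bucket_of(salary):
    for label, (lo, hi) in BUCKETS.items():
        if lo <= salary < hi:
            return label

def outsource(rows):
    """Server-side table RS: (etuple, bucket label) pairs; values stay hidden."""
    return [(encrypt(r), bucket_of(r["salary"])) for r in rows]

def buckets_for_range(lo, hi):
    """Client rewrites a range predicate into the overlapping bucket labels."""
    return {lbl for lbl, (b_lo, b_hi) in BUCKETS.items() if b_lo < hi and lo < b_hi}

def server_select(table, labels):
    """Runs at the untrusted server: filter on bucket labels only."""
    return [et for et, lbl in table if lbl in labels]

def client_filter(etuples, lo, hi):
    """Client decrypts candidates and drops false positives."""
    return [r for r in map(decrypt, etuples) if lo <= r["salary"] <= hi]
```

For the query salary ∈ [25K, 35K], the client sends labels {B1, B2}; the server returns every tuple in those buckets (including false positives such as a 39K salary in B2), and the client filters after decryption.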
23 Security model for DAS applications Adversaries (A): Inside attackers: authorized users with malicious intent Outside attackers: hackers, snoopers Attack models: Passive attacks: A wants to learn confidential information Active attacks: A wants to learn confidential information + actively modifies data and/or queries Trust on server: Untrusted: normal hardware, data & computation visible Semi-trusted: trusted co-processors + limited storage Trusted: All hardware is trusted & tamper-proof 24 Secure data storage & querying in DAS [Figure: Data owner/Client connects over the Internet to the Server at the Service Provider] R: Original Table (plain text)

ssn  name  credit rating  salary  age
780  John  bad            34K     32
876  Mary  good           29K     40
:    :     :              :       :

Security concern: “ssn”, “salary” & “credit rating” are confidential How to execute queries on encrypted data? Encrypt the sensitive column values e.g. Select * from R where salary ∈ [25K, 35K] Trivial solution: retrieve all rows to client, decrypt them and check for predicate We can do better: use secure indices for query evaluation on server 25 Data storage • Encrypt the rows • Partition salary values into buckets • Index the etuples by their bucket-labels Client side meta-data (buckets): B0 = [0, 20K), B1 = [20K, 30K), B2 = [30K, 40K), B3 = [40K, 50K) R: Original Table (plain text)

ssn  name   sex     credit rating  sal  age
345  Tom    Male    Bad            34k  32
876  Mary   Female  Good           29k  40
234  Jerry  Male    Good           45k  34
780  John   Male    Bad            39k  33

Encrypt → RS: Server side Table (encrypted + indexed) 26 Querying encrypted data Client-side query: Select * from R where sal ∈ [25K, 35K] Using the client-side bucket meta-data, this is rewritten into the server-side query: Select etuple from RS where bucket = B1 ∨ B2 The result contains a false positive (John, sal 39k, falls in B2 but outside the range); the client decrypts and filters. RS: Server side Table (encrypted + indexed)

etuple bucket
(^#&*%T%&4&7ERGTty^Q!%^&*  B2
&^$^G@UG^g&@^&&#G@@#(GW  B1
&*#($T%#$@$R@@$#@^FG$%&  B3
&*#($T%#$@$R@@$#@^FG$%&  B2

27 Problems to address Security analysis Goal: to hide the confidential information in data from server-side adversaries (DB admins etc.) Quantitative measures of disclosure-risk Quality of partitioning (bucketization) Data partitioning schemes Cost measures Tradeoff: balancing the two competing goals of security & performance Continued later … 28 Privacy in Cloud Computing What is cloud computing? Many definitions exist Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. [NIST] Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically re-configured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized service-level agreements. [Luis M.
Vaquero et al., Madrid, Spain] 29 Privacy in Cloud Computing Actors: Service Providers – provide software services (Ex: Google, Yahoo, Microsoft, IBM, etc.) Service Users – personal, business, government Infrastructure Providers – provide the computing infrastructure required to host services Three cloud services: Cloud Software as a Service (SaaS) – use provider’s applications over a network Cloud Platform as a Service (PaaS) – deploy customer-created applications to a cloud Cloud Infrastructure as a Service (IaaS) – rent processing, storage, network capacity, and other fundamental computing resources 30 Privacy in Cloud Computing Examples of cloud computing services Web-based email Photo storing Spreadsheet applications File transfer Online medical record storage Social network applications 31 Privacy in Cloud Computing Privacy issues in cloud computing Cloud increases security and privacy risks Data: create enormous risks for data privacy Creation, storage, communication – exponential rate Data replicated across large geographic distances Data contain personally identifiable information Data stored at untrusted hosts → loss of control of sensitive data Risk of sharing sensitive data with marketing Other problem: technology ahead of law Does the user or the hosting company own the data? Can the host deny a user access to their own data? If the host company goes out of business, what happens to the users' data it holds? How does the host protect the user's data?
32 Privacy in Cloud Computing Solutions The cloud does not offer any privacy Awareness: some effort – ACM Cloud Computing Security Workshop, November 2009; ACM Symposium on Cloud Computing, June 2010 Privacy in cloud computing at UCI Recently launched a project on privacy-preservation in cloud computing General approach: personal privacy middleware 33 Privacy preservation in Pervasive Spaces 34 Privacy in data sharing and exchange 40 Extra material 41 Example: Detecting a pre-specified set of events No ordinary coffee room, one that is monitored! There are rules that apply If a rule is violated, penalties may be imposed But all is not unfair: individuals have a right to privacy! Just like a coffee room!! “Until an individual has had more than his quota of coffee, his identity will not be revealed” 42 Issues to be addressed Modeling pervasive spaces: How to capture events of interest E.g., “Tom had his 4th cup of coffee for the day” Privacy goal: Guarantee anonymity to individuals What are the necessary and sufficient conditions? Solution Design should satisfy the necessary and sufficient conditions Practical/scalable 43 Basic events, Composite events & Rules Model of pervasive space: Pervasive Space with sensors Composite event: a sequence of one or more basic events Rule: (Composite event, Action) Rules apply to groups of individuals, e.g.: Coffee room rules apply to everyone Server room rule applies to everyone except administrators etc.
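The coffee-room rule above can be sketched as a tiny per-user automaton fed by the basic-event stream. This is a minimal illustration only; the class name, event encoding, and threshold are assumptions, not the actual system:

```python
class CoffeeRule:
    """Automaton for 'a student drinks more than 3 cups of coffee'.
    Basic events are (user, location, object, action) tuples, mirroring
    the slides' <u, coffee_room, coffee_cup, dispense> notation."""

    def __init__(self, students, threshold=3):
        self.students = students      # the group the rule applies to
        self.threshold = threshold    # quota of cups
        self.count = {}               # per-user automaton state (cups so far)

    def feed(self, event):
        """Advance on a basic event; return the user if the rule fires."""
        user, loc, obj, action = event
        # e1 = <u in STUDENT, coffee_room, coffee_cup, dispense>
        if (user in self.students and loc == "coffee_room"
                and obj == "coffee_cup" and action == "dispense"):
            self.count[user] = self.count.get(user, 0) + 1
            if self.count[user] > self.threshold:
                return user           # quota exceeded: identity may be revealed
        return None                   # ¬e1, or quota not yet exceeded
```

Feeding Tom four dispense events makes the rule fire on the fourth cup, while events by non-students never advance the automaton.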
A stream of basic events:
e1: <Tom, coffee-room, *, enter>
e2: <Tom, coffee-room, coffee-cup, dispense>
: :
ek: <Bill, coffee-room, coffee-maker, exit>
: :

44 Composite-events & automaton templates Composite-event templates: “A student drinks more than 3 cups of coffee” e1 ≡ <u ∈ STUDENT, coffee_room, coffee_cup, dispense> [Automaton: S0 →e1→ 1 →e1→ 2 →e1→ 3 →e1→ SF, with a ¬e1 self-loop on each non-final state] “A student tries to access the IBM machine in the server room” e1 ≡ <u ∈ STUDENT, server_room, *, entry> e2 ≡ <ū, server_room, *, exit> e3 ≡ <ū, server_room, IBM-mc, login-attempt> [Automaton: S0 →e1→ 1 →e3→ SF; state 1 loops on ¬(e3 ∨ e2) and returns to S0 on e2] 45 System architecture & adversary [Figure: Secure Sensor Nodes (SSNs) send events to a Server holding the Rules DB and the encrypted automaton state information] Basic Assumptions about SSNs: Thin trusted middleware to obfuscate origin of events Trusted hardware (sensors are tamper-proof) Secure data capture & generation of basic events by SSN Limited computation + storage capacity: can carry out encryption/decryption with a secret key common to all SSNs, and automaton transitions 46 Privacy goal & Adversary’s knowledge Ensure k-anonymity for each individual (k-anonymity is achieved when each individual is indistinguishable from at least k-1 other individuals associated with the space) Passive adversary (A): Server-side snooper who wants to deduce the identity of the individual associated with a basic-event A knows all rules of the space & automaton structures A can observe all server-side activities A has unlimited computation power Minimum requirement to ensure anonymity: State information (automatons) is always kept encrypted on server 47 Basic protocol SECURE SENSOR NODE: Generate basic event e; send an encrypted query for automatons that make a transition on e SERVER: Return automatons that (possibly) match e (encrypted match); store updated automatons on write-back SENSOR NODE: Decrypt automatons, advance the state of automatons if necessary, associate encrypted label
with new state. Write-back encrypted automatons Question: Does encryption ensure anonymity? NO! The pattern of automaton access may reveal identity 48 Example Rules: R1 (U enters kitchen, U takes coffee), R2 (U enters kitchen, U opens fridge), R3 (U enters kitchen, U opens microwave) R1, R2 and R3 apply to Tom; only R1 and R2 apply to Bill When Tom enters the kitchen, 3 automatons fire; when Bill enters the kitchen, only 2 fire On an event, the # rows retrieved from the state table can disclose the identity of the individual 49 Characteristic access patterns of automatons The set of rules applicable to an individual may be unique → potentially identifies the individual Rules applicable to Tom: x (Tom enters kitchen, Tom takes coffee), y (Tom leaves coffee pot empty, Tom enters kitchen, Tom leaves fridge open), z (Tom enters kitchen, Tom opens fridge) Characteristic patterns of x – P1: {x,y,z} {x,y} Characteristic patterns of y – P2: {x,y,z} {x,y} {y}; P3: {x,y,z} {y,z} {y} Characteristic patterns of z – P4: {x,y,z} {y,z} The characteristic access patterns of rows can potentially reveal the identity of the automaton in spite of encryption 50 Solution scheme Formalized the notion of indistinguishability of automatons in terms of their access patterns Identified “event clustering” as a mechanism for inducing indistinguishability for achieving k-anonymity Proved the difficulty of checking for k-anonymity Characterized the class of event-clustering schemes that achieve k-anonymity Proposed an efficient clustering algorithm to minimize average execution overhead for the protocol Implemented a prototype system Challenges: Designing a truly secure sensing infrastructure is challenging Key management issues Are there other interesting notions of privacy in pervasive spaces? 51
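The access-pattern leak and the event-clustering remedy can be illustrated with a toy sketch, assuming the slide-48 rule sets (Tom matches R1-R3, Bill only R1-R2); the actual clustering algorithm in the work is more refined than fetching everything:

```python
# Rules applicable to each user, as in the slide-48 example.
RULES = {"R1": {"Tom", "Bill"}, "R2": {"Tom", "Bill"}, "R3": {"Tom"}}

def automatons_fetched(user):
    """Unprotected protocol: on the user's kitchen-entry event, the server
    sees exactly the user's automatons being retrieved."""
    return {r for r, users in RULES.items() if user in users}

def anonymity_set(user):
    """Users whose server-visible access pattern is identical to `user`'s;
    k-anonymity requires this set to have size >= k."""
    pattern = automatons_fetched(user)
    all_users = set().union(*RULES.values())
    return {u for u in all_users if automatons_fetched(u) == pattern}

def automatons_fetched_clustered(user):
    """With event clustering (here, the coarsest cluster): every event fetches
    the union of all rule automatons, so all users look alike to the server."""
    return set(RULES)
```

Without clustering, the retrieval counts (3 vs. 2) single Tom out, so his anonymity set is just himself; with the clustered fetch, Tom's and Bill's patterns coincide, at the cost of extra retrievals per event.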