Security and Privacy in Cloud Computing
Ragib Hasan
Johns Hopkins University
en.600.412 Spring 2011
Lecture 8
04/04/2011
Goal: Examine techniques for ensuring data privacy in computations outsourced to a cloud
Review Assignment #7: (Due 4/11)
Roy et al., Airavat: Security and Privacy for MapReduce, NSDI 2010
Recap: Cloud Forensics (Bread & Butter paper from ASIACCS 2010)
• Strengths?
• Weaknesses?
• Ideas?
• Information Privacy is the interest an individual has in controlling, or at least significantly influencing, the handling of data about themselves.
• Confidentiality is the legal duty of individuals who come into the possession of information about others, especially in the course of particular kinds of relationships with them.
Problem of making large datasets public
Model:
– One party owns the dataset
– Another party wants to run some computations on it
– A third party may take data from the first party, run functions (from the second party) on the data, and provide the results to the second party
Problem:
– How can the data provider ensure the confidentiality and privacy of their sensitive data?
Problem of making large datasets public (cont.)
• Massachusetts Insurance Database
– DB was anonymized, with only birthdate, sex, and ZIP code made available to the public
– Latanya Sweeney of CMU linked the DB with voter records and pinpointed the MA Governor's record
• Netflix Prize Database
– DB was anonymized, with user names replaced by random IDs
– Narayanan et al. used the Netflix DB and IMDb data to de-anonymize users
Differential Privacy schemes can ensure privacy of statistical queries
• Differential privacy aims to maximize the accuracy of queries on statistical databases while minimizing the chance of identifying individual records.
• Informally, given the output of a computation or a query, an attacker cannot tell whether any particular value was in the input data set.
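To make this informal guarantee concrete, here is a minimal, stdlib-only Python sketch (not from the lecture; all names are illustrative) of the Laplace mechanism for a counting query: on two neighboring datasets that differ in a single record, the noisy answers are statistically close, so an attacker cannot reliably tell which dataset produced a given answer.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(dataset, predicate, epsilon):
    """Counting query released with epsilon-differential privacy.
    A count changes by at most 1 when one record is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in dataset if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Neighboring datasets D and D': they differ in exactly one record.
d1 = [{"disease": "flu"}] * 99 + [{"disease": "hiv"}]
d2 = [{"disease": "flu"}] * 100

# The two answer distributions overlap heavily, so a single noisy
# answer does not reveal whether the sensitive record is present.
for d in (d1, d2):
    print([round(noisy_count(d, lambda r: r["disease"] == "hiv", 1.0), 2)
           for _ in range(5)])
```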
Securing MapReduce for Privacy and Confidentiality
• Paper:
– Roy et al., Airavat: Security and Privacy for MapReduce
– Goal: Secure MapReduce to provide confidentiality and privacy assurances for sensitive data
• Data providers: own data sets
• Computation provider: provides MapReduce code
• Airavat Framework: Cloud provider where the MapReduce code is run on uploaded data
• Assets: Sensitive data or outputs
• Attacker model:
– Cloud provider (where Airavat is run) is trustworthy
– Computation provider (user who queries, provides Mapper and Reducer functions) can be malicious
• Functions provided by the computation provider can be malicious.
• Cloud provider does not perform code analysis on user-generated functions
– Data provider is trustworthy
• MapReduce is a widely used and deployed distributed computation model
• Input data is divided into chunks
• Mapper nodes run a mapping function on a chunk and output a set of <key, value> pairs
• Reducer nodes combine values related to a particular key based on a function, and output to a file
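As a tiny illustration of the model described above, here is a self-contained Python sketch of MapReduce-style word counting; the chunking, in-memory shuffle, and function names are simplifications for exposition, not Hadoop's actual API.

```python
from collections import defaultdict

def mapper(chunk):
    """Map: emit a <key, value> pair for each word in the chunk."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reducer(key, values):
    """Reduce: combine all values for one key into a single result."""
    return (key, sum(values))

# Input data divided into chunks, as a MapReduce runtime would do.
chunks = ["the cloud is big", "the cloud is elastic"]

# Shuffle phase: group mapper outputs by key before reducing.
groups = defaultdict(list)
for chunk in chunks:
    for key, value in mapper(chunk):
        groups[key].append(value)

print([reducer(k, v) for k, v in sorted(groups.items())])
# [('big', 1), ('cloud', 2), ('elastic', 1), ('is', 2), ('the', 2)]
```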
• Goal: Ensure privacy of source data
• Concept used: Differential privacy – bound how much any single record can influence (and thus leak through) the output
• Method used: Adds random Laplacian noise to outputs
• Goal: Prevent malicious computation providers from writing mapper functions that leak sensitive data.
• Concept used: Functional sensitivity – how much the output changes when a single element is added to or removed from the input
– More sensitivity: more information is leaked
• How it is used (see the sketch below):
– Airavat requires CPs to declare the range of possible output values.
– This range is used to bound the sensitivity of CP-written mapper functions.
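Below is a minimal sketch of how a declared output range could bound sensitivity and calibrate noise. It mirrors the idea just described, but the sum-query setting and all function names are illustrative assumptions, not Airavat's actual code.

```python
import math
import random

def laplace_noise(scale):
    # Same Laplace sampler as in the earlier sketch.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noise_scale_for_sum(declared_min, declared_max, epsilon):
    """For a SUM-style reducer, one record can shift the output by at
    most max(|min|, |max|) of the declared range; that bound serves as
    the sensitivity and fixes the Laplace noise scale."""
    sensitivity = max(abs(declared_min), abs(declared_max))
    return sensitivity / epsilon

# The CP declares that every mapper output lies in [0, 10].
scale = noise_scale_for_sum(0, 10, epsilon=1.0)   # sensitivity = 10
true_sum = 4200
print(true_sum + laplace_noise(scale))  # differentially private sum
```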
• Goal: Prevent users from issuing many queries in an attempt to gradually reconstruct the input data.
• Concept used: Privacy budget (defined by the data provider)
• How used (sketched below):
– Data providers set a privacy budget for their data.
– Each time a query is run, the budget is decreased.
– Once the budget is used up, the user cannot run any more queries.
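A toy sketch of budget accounting follows; the PrivacyBudget class and its method names are made up for illustration and are not Airavat's API.

```python
class PrivacyBudget:
    """Tracks a data provider's privacy budget: each query spends some
    epsilon, and queries are refused once the budget is exhausted."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def charge(self, epsilon):
        if epsilon > self.remaining:
            raise PermissionError("privacy budget exhausted")
        self.remaining -= epsilon

budget = PrivacyBudget(total_epsilon=1.0)
for i in range(4):
    try:
        budget.charge(0.4)           # each query costs epsilon = 0.4
        print(f"query {i}: allowed, {budget.remaining:.1f} left")
    except PermissionError as e:
        print(f"query {i}: refused ({e})")
```

Charging per query corresponds to sequential composition of differentially private queries: the total leakage is bounded by the sum of the epsilons spent.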
• Mappers are provided by computation provider, and hence are not trusted
• Reducers are provided by Airavat. They are trusted
– Airavat only supports a small set of reducers.
• Keys must be pre-declared by CP (why?)
• Airavat generates enough noise to assure differential privacy of values
• Range enforcers ensure that output values from mappers lie within declared range
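Here is a minimal sketch of range enforcement: clamping out-of-range mapper outputs keeps the declared sensitivity bound honest. The specific policy shown (truncating to the nearest bound) is an assumption; an enforcer could instead remap or drop offending values.

```python
def enforce_range(pairs, declared_min, declared_max):
    """Clamp each mapper-output value into the declared range so a
    malicious mapper cannot exceed its stated sensitivity."""
    for key, value in pairs:
        yield key, min(max(value, declared_min), declared_max)

# A malicious mapper tries to smuggle a huge value past the noise.
mapper_output = [("salary", 5), ("salary", 9), ("salary", 10**9)]
print(list(enforce_range(mapper_output, 0, 10)))
# [('salary', 5), ('salary', 9), ('salary', 10)]
```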
Security via Mandatory Access Control
• In MAC, the operating system enforces access control on every access
• Access control rights cannot be overridden by users
• Airavat uses SELinux – a mandatory access control mechanism for the Linux kernel, originally developed by the NSA
• Each data object and process is tagged with a label showing its trust level
• Data providers can set a declassify bit for their data; the result is then released only if it does not violate differential privacy (see the toy model below)
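As a toy model (all names hypothetical) of how the tags and the declassify bit might interact, the sketch below releases a sensitive result only when the provider opted in and the differential privacy check passed.

```python
class Labeled:
    """A value tagged with its trust level and declassify preference."""
    def __init__(self, value, sensitive=True, declassify=False):
        self.value = value
        self.sensitive = sensitive
        self.declassify = declassify

def release(result, dp_check_passed):
    """MAC-style gate: a sensitive result leaves the system only if the
    data provider set the declassify bit AND the differential privacy
    check passed; otherwise access control blocks the release."""
    if not result.sensitive:
        return result.value
    if result.declassify and dp_check_passed:
        return result.value
    raise PermissionError("release blocked by access control")

r = Labeled(value=42, sensitive=True, declassify=True)
print(release(r, dp_check_passed=True))   # 42 is released
```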
• Airavat was implemented on top of Hadoop and the Hadoop Distributed File System (HDFS).
Further reading: "Cynthia Dwork defines Differential Privacy," a blog post giving a high-level view of differential privacy: http://www.ethanzuckerman.com/blog/2010/09/29/cynthia-dwork-definesdifferential-privacy/