Database Laboratory Regular Seminar
2013-08-05
TaeHoon Kim

Contents
  1. Introduction
  2. Related Work
  3. Problem Statement
  4. Distributed Anonymization
  5. R-Tree Generalization
  6. Performance Analysis
  7. Conclusion

1. Introduction

Cloud computing is a long-dreamed vision of computing
  - Cloud consumers can remotely store their data in the cloud
  - They can enjoy on-demand, high-quality applications and services from a shared pool of configurable computing resources

Successful third-party cases
  - Success cases on EC2 include:
    - Nimbus Health [2], which manages patient medical records
    - ShareThis [3], a social content-sharing network that has shared 340 million items across 30,000 web sites

Vulnerable data privacy
  - Unfortunately, such data sharing is subject to constraints imposed by the privacy of individuals
  - Researchers have shown that attackers could effectively target and observe information
    - Consistent with related work on cloud security [4][6][7][8]
    - Third-party clouds [9]
  - To protect data privacy, the sensitive information of individuals should be preserved
  - Partition-based privacy-preserving data publishing techniques
    - k-anonymity, (α,k)-anonymity, l-diversity, t-closeness, m-invariance, etc.

Privacy-preserving data publishing for a single dataset has been extensively studied
  - Generalization, suppression, perturbation

Xiong et al. [5]
  - Data anonymization for horizontally partitioned datasets
  - A distributed anonymization protocol
  - Only gave a uniform approach that exerts the same level of protection for all data providers

How can a new distributed anonymization protocol be designed over cloud servers?
  - We propose a new distributed anonymization protocol
  - We design an algorithm that inserts data objects into an R-tree for anonymization, on top of the k-anonymity and l-diversity principles

2. Related Work

Privacy-preserving data publishing
  - k-anonymity [11], (α,k)-anonymity [12], l-diversity [13], t-closeness [30], and m-invariance [14] each define a criterion for judging whether a published dataset provides a certain level of privacy preservation
  - In this study, our distributed anonymization protocol is built on top of the k-anonymity and l-diversity principles
  - We propose a new anonymization algorithm that inserts all data objects into an R-tree to achieve high-quality generalization

Distributed anonymization solutions
  - Naïve solution
    - Each data provider implements data anonymization independently
    - Since the data is anonymized before integration, the main drawback of this solution is low data utility
  - Trusted third-party solution
    - Assumes the existence of a third party that can be trusted by all data providers
    - A trusted third party is not always feasible
    - Compromise of the server by attackers could lead to a complete privacy loss for all participating parties and data subjects

Other distributed approaches
  - Jiang et al. [26] presented a two-party framework along with an application
  - Zhong et al. [27] proposed provably private solutions that do not disclose data from one site to the other
  - Xiong et al. [25] presented a distributed anonymization protocol
  - In contrast to the above work, our work is aimed at outsourcing a data provider's private dataset to cloud servers for data sharing

3. Problem Statement

  - The union of all local databases is denoted as the microdata set D, as given in Definition 1
  - Each site produces a local anonymized database di*
    - di* meets the site's own privacy principle ki, since data providers have different privacy requirements for publishing
  - The union of all local anonymized databases forms a virtual database D* = (d1 ∪ d2 ∪ d3)*

[Figure: Node1, Node2, and Node3, each producing a local anonymized database di*]

Problem Statement (Goal)

Privacy for data objects based on anonymity
  - k-anonymity [11][19]: a set of k records is indistinguishable from each other based on a quasi-identifier group (sensitive attribute group)
  - l-diversity [13]: each equivalence class contains at least l diverse sensitive values

Privacy between data providers
  - Our second privacy goal is to avoid attacks between data providers: an individual dataset reveals nothing about its data to the other data providers apart from the virtual anonymized database
  - We use a distributed anonymization algorithm to build a virtual k-anonymous database and ensure that each locally anonymized table di* is ki-anonymous
    - Use an R-tree

4. Distributed Anonymization

Protocol
  - The main idea of the distributed anonymization protocol is to use secure multi-server computation protocols to realize the R-tree generalization method in the cloud setting

Notation
  - I : d-dimensional rectangle that is the bounding box of the QI group's QI values
  - Num : the total number of data objects in the equivalence class

Example of generalization
  - Equivalence class (QI group) of Node0: from [11-13][5200-5300] to [11-30][5200-5300]
  - Equivalence class (QI group) of Node1: from [73-80][5200-5300] to [65-80][5400-5500]
  - Equivalence class (QI group) of Node2: from [65-76][5200-5300] to [65-80][5400-5500]

Example of the split process
  - When e3 is inserted, the R-tree node splits into two groups, with e1 and e3 in one group
  - When e4 arrives, e1 and e3 form one group, and e2 and e4 form the other
  - Finally, when e5 arrives, e2 and e4 form one group and e5 the other

5. R-Tree Generalization

Index structure
  - Leaf node (I, SI)
    - I : d-dimensional rectangle that is the bounding box of the QI group's QI values
    - SI : sensitive information for a tuple
  - Non-leaf node (I, childPointer)
    - I : covers all rectangles in the lower node's entries
    - childPointer : the address of a lower node in the R-tree

[Figure: R-tree node layout, with non-leaf entries (I, childPointer) and leaf entries (I, SI)]

Insertion
  - At the root level, the algorithm chooses the entry whose rectangle needs the least area enlargement to cover a
  - R1 is selected because its rectangle does not need to be enlarged, while the rectangle of R2 would need to expand considerably

Node splitting (when a leaf node overflows)
  - Pick as seeds the two entries that would cause the largest area enlargement if covered by a single rectangle
  - The remaining entries are then chosen one at a time and put into one of the two groups

6. Performance Analysis

Experimental environment
  - Amazon's EC2 platform
  - Implemented in Java 1.6.0.13 and run on a set of EC2 computing units
  - Each computing unit is a small EC2 instance with a 1.7 GHz Xeon processor, 1.7 GB of memory, and a 160 GB hard disk
  - Computing units are connected via 250 Mbps network links
  - We use three different datasets, with Uniform, Gaussian, and Zipf distributions, to evaluate our distributed anonymization scheme

Dataset and setup
  - Centralized baseline: all 100K tuples are located in one centralized database
  - Distributed setting: the data are distributed among 10 nodes, and we use the distributed anonymization approach presented in Section 4
  - The R-tree generalization algorithm was used to generalize the database to be k-anonymous
  - DM (discernibility metric) assigns each tuple ri* in D* a penalty that is determined by the size of the equivalence class containing it

Query accuracy
  - Absolute error = | actual - estimate |
  - Actual is the correct number of answers to the range query
  - Estimate is the size of the candidate set computed from the anonymized table

7. Conclusion

Two contributions have been presented
  - A distributed anonymization protocol for privacy-preserving data publishing from multiple data providers in a cloud system
  - A new anonymization algorithm using the R-tree index structure

Future work
  - Developing a protocol toolkit that incorporates more privacy principles, such as differential privacy
  - Building indexes based on anonymized cloud data to offer more efficient and reliable data analysis

Q/A

Thank you for listening to my presentation
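
Backup: a sketch of the R-tree heuristics and the DM penalty

The insertion and node-splitting rules from Section 5 and the discernibility metric from Section 6 can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the experiments (which was written in Java). All names here (area, cover, enlargement, choose_entry, pick_seeds, discernibility) are hypothetical. It assumes that a rectangle is a tuple of (low, high) intervals, one per QI dimension, that seed picking follows the classic quadratic "most wasted area" rule, and that DM charges each tuple the size of its equivalence class, so a class of size n contributes n * n.

```python
from itertools import combinations

# Hypothetical representation: a rectangle is a tuple of (low, high)
# intervals, one interval per quasi-identifier dimension.

def area(rect):
    """Product of the side lengths of the rectangle."""
    result = 1
    for low, high in rect:
        result *= high - low
    return result

def cover(a, b):
    """Smallest rectangle that contains both a and b."""
    return tuple((min(al, bl), max(ah, bh))
                 for (al, ah), (bl, bh) in zip(a, b))

def enlargement(rect, new):
    """Extra area rect needs in order to cover new."""
    return area(cover(rect, new)) - area(rect)

def choose_entry(entries, new):
    """Index of the entry needing the least enlargement to cover new.
    Ties are broken by the smaller current area (an assumption)."""
    return min(range(len(entries)),
               key=lambda i: (enlargement(entries[i], new), area(entries[i])))

def pick_seeds(entries):
    """Pair of indices whose joint bounding box wastes the most area."""
    def waste(pair):
        i, j = pair
        joint = area(cover(entries[i], entries[j]))
        return joint - area(entries[i]) - area(entries[j])
    return max(combinations(range(len(entries)), 2), key=waste)

def discernibility(class_sizes):
    """DM: every tuple is penalized by the size of its equivalence class."""
    return sum(n * n for n in class_sizes)

# Usage with made-up rectangles: R1 already covers a, R2 does not.
r1 = ((0, 10), (0, 10))
r2 = ((20, 30), (20, 30))
a = ((2, 3), (4, 5))
print(choose_entry([r1, r2], a))       # 0, i.e. R1 is selected
print(pick_seeds([((0, 1), (0, 1)),
                  ((1, 2), (1, 2)),
                  ((9, 10), (9, 10))]))  # (0, 2), the two farthest entries
print(discernibility([3, 3, 4]))       # 34
```

In the usage example R1 needs no enlargement, which mirrors the insertion example in Section 5. A lower DM value means smaller equivalence classes and therefore less information loss for the same k.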