Distributed Anonymization for Multiple Data Providers

Database Laboratory
Regular Seminar
2013-08-05
TaeHoon Kim
Contents
1. Introduction
2. Related Work
3. Problem Statement
4. Distributed Anonymization
5. R-Tree Generalization
6. Performance Analysis
7. Conclusion
1. Introduction

Cloud computing is a long-dreamed vision of computing
– Cloud consumers can remotely store their data in the cloud to enjoy on-demand, high-quality applications and services from a shared pool of configurable computing resources
Successful third-party cases
– Nimbus Health[2], a success case on EC2, manages patient medical records
– ShareThis[3], a social content-sharing network, has shared 340 million items across 30,000 web sites
1. Introduction

Vulnerable data privacy
– Unfortunately, such data sharing is subject to constraints imposed by the privacy of individuals
– Researchers have shown that attackers could effectively target and observe information
– This is consistent with related work on cloud security[4][6][7][8] and third-party clouds[9]
To protect data privacy, the sensitive information of individuals should be preserved
– Partition-based privacy-preserving data publishing techniques: k-anonymity, (a,k)-anonymity, l-diversity, t-closeness, m-invariance, etc.
1. Introduction

Privacy-preserving data publishing for a single dataset has been extensively studied
– Generalization, suppression, perturbation
Xiong et al.[5]
– Data anonymization for horizontally partitioned datasets
– A distributed anonymization protocol
– Only gives a uniform approach that exerts the same level of protection for all data providers
How to design a new distributed anonymization protocol over cloud servers
– We propose a new distributed anonymization protocol
– We design an algorithm which inserts data objects into an R-tree for anonymization on top of the k-anonymity and l-diversity principles
2. Related Work

Privacy-preserving data publishing
– k-anonymity[11], (a,k)-anonymity[12], l-diversity[13], t-closeness[30], m-invariance[14]
– These define criteria for judging whether a published dataset provides a certain level of privacy preservation
– In this study, our distributed anonymization protocol is built on top of the k-anonymity and l-diversity principles
– We propose a new anonymization algorithm that inserts all data objects into an R-tree to achieve high-quality generalization
2. Related Work

Distributed anonymization solutions
– Naïve solution: each data provider implements data anonymization independently
– Since the data is anonymized before integration, the main drawback of this solution is low data utility
– A second solution assumes the existence of a third party that can be trusted by all data providers
– A trusted third party is not always feasible: compromise of the server by attackers could lead to a complete privacy loss for all participating parties and data subjects
2. Related Work




Jiang et al.[26] presented a two-party framework along with an application
Zhong et al.[27] proposed provably private solutions without disclosing data from one site to the other
Xiong et al.[25] presented a distributed anonymization protocol
In contrast to the above work, our work is aimed at outsourcing each data provider's private dataset to cloud servers for data sharing
3. Problem Statement


The union of all local databases is denoted as the microdata set D, as given in Definition 1
Each site produces a local anonymized database di*
– It meets its own privacy principle ki, since data providers have different privacy requirements for publishing
3. Problem Statement
(Figure: three sites, Node1, Node2, and Node3)
Each site produces a local anonymized database di*
The union of all local anonymized databases forms a virtual database D* = (d1 ∪ d2 ∪ d3)*
3. Problem Statement (Goal)

Privacy for Data Objects Based on Anonymity
– k-anonymity[11][19]: a set of at least k records is indistinguishable from each other based on a quasi-identifier group (QI group)
– l-diversity[13]: each equivalence class contains at least l diverse sensitive values (both checks are sketched below)
Privacy between Data Providers
– Our second privacy goal is to avoid attacks between data providers: an individual dataset should reveal nothing about its data to the other data providers apart from the virtual anonymized database
– We use the distributed anonymization algorithm to build a virtual k-anonymous database and ensure that each locally anonymized table di* is ki-anonymous
– Uses an R-tree
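
The following is a minimal sketch of how these two checks could look in code. It is not from the paper; the class, method, and parameter names are assumptions, and tuples are simplified to string arrays with one designated sensitive column.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: checks whether one equivalence class (QI group)
// satisfies the k-anonymity and l-diversity conditions described above.
public class PrivacyCheck {

    // k-anonymity: the class must contain at least k indistinguishable records.
    static boolean isKAnonymous(List<String[]> equivalenceClass, int k) {
        return equivalenceClass.size() >= k;
    }

    // l-diversity: the class must contain at least l distinct sensitive values.
    // sensitiveIndex marks the column holding the sensitive attribute.
    static boolean isLDiverse(List<String[]> equivalenceClass, int sensitiveIndex, int l) {
        Set<String> distinct = new HashSet<String>();
        for (String[] tuple : equivalenceClass) {
            distinct.add(tuple[sensitiveIndex]);
        }
        return distinct.size() >= l;
    }
}
```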
4. Distributed Anonymization

Protocol
– The main idea of the distributed anonymization protocol is to use secure multi-server computation protocols to realize the R-tree generalization method in the cloud setting
Notation
– I : the d-dimensional rectangle which is the bounding box of the QI group's QI values
– Num : the total number of data objects in the equivalence class (a minimal representation of (I, Num) is sketched below)
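
A minimal sketch of how an equivalence class could be represented under this (I, Num) notation; the class and field names are assumptions rather than the authors' code, with I stored as per-dimension lower and upper bounds.

```java
// Hypothetical sketch of one equivalence class under the (I, Num) notation above.
// I   : d-dimensional bounding rectangle over the QI group's QI values
// Num : total number of data objects in the equivalence class
public class EquivalenceClassEntry {
    final double[] low;   // lower corner of I, one value per QI dimension
    final double[] high;  // upper corner of I, one value per QI dimension
    int num;              // Num

    public EquivalenceClassEntry(double[] low, double[] high, int num) {
        this.low = low.clone();
        this.high = high.clone();
        this.num = num;
    }
}
```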
4. Distributed Anonymization

Example of generalization
– Equivalence class (QI group) of Node0: from [11-13][5200-5300] to [11-30][5200-5300]
– Equivalence class (QI group) of Node1: from [73-80][5200-5300] to [65-80][5400-5500]
– Equivalence class (QI group) of Node2: from [65-76][5200-5300] to [65-80][5400-5500]
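
The generalization step above amounts to enlarging each QI group's bounding rectangle to a coarser shared rectangle. A hedged 2-D sketch (rectangles assumed as {low0, high0, low1, high1}; not the authors' code):

```java
// Hypothetical 2-D sketch of the generalization step shown above: a QI group's
// bounding rectangle is enlarged until it covers the coarser target rectangle.
// Rectangles are encoded as {low0, high0, low1, high1}.
public class GeneralizationExample {

    // Smallest rectangle covering both a and b.
    static double[] enlarge(double[] a, double[] b) {
        return new double[] {
            Math.min(a[0], b[0]), Math.max(a[1], b[1]),
            Math.min(a[2], b[2]), Math.max(a[3], b[3])
        };
    }

    public static void main(String[] args) {
        double[] node0 = {11, 13, 5200, 5300};   // Node0's QI group [11-13][5200-5300]
        double[] target = {11, 30, 5200, 5300};  // coarser shared rectangle [11-30][5200-5300]
        System.out.println(java.util.Arrays.toString(enlarge(node0, target)));
        // prints [11.0, 30.0, 5200.0, 5300.0]
    }
}
```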
4. Distributed Anonymization

Example of Split Process
– When e3 is inserted, the R-tree node splits into two groups, with e1 and e3 in one group
– When e4 comes, e1 and e3 stay in one group and e2 and e4 form the other
– Finally, when e5 comes, e2 and e4 are in one group and e5 is in the other
5. R-Tree Generalization

Index structure
– Leaf node: (I, SI)
– I : the d-dimensional rectangle which is the bounding box of the QI group's QI values
– SI : sensitive information for a tuple
– Non-leaf node: (I, childPointer)
– I : covers all rectangles in the lower node's entries
– childPointer : the address of a lower node in the R-tree (both entry types are sketched in code below)
(Figure: a non-leaf entry (I, childPointer) pointing to leaf entries (I, SI))
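
A minimal Java sketch of the two entry types just described; the names and field types are assumptions (the paper reports a Java implementation, but this is not its code):

```java
// Hypothetical sketch of the R-tree node entries described above.
public class RTreeEntries {

    // I: d-dimensional bounding rectangle over QI values.
    static class Rect {
        final double[] low;
        final double[] high;
        Rect(double[] low, double[] high) { this.low = low; this.high = high; }
    }

    // Leaf entry (I, SI): bounding box of a tuple's QI values plus its sensitive information.
    static class LeafEntry {
        final Rect i;
        final String si;
        LeafEntry(Rect i, String si) { this.i = i; this.si = si; }
    }

    // Non-leaf entry (I, childPointer): I covers all rectangles in the child node's
    // entries; childPointer references that lower node.
    static class NonLeafEntry {
        final Rect i;
        final Object childPointer;
        NonLeafEntry(Rect i, Object childPointer) { this.i = i; this.childPointer = childPointer; }
    }
}
```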
5. R-Tree Generalization

Insertion
– At the root level, the algorithm chooses the entry whose rectangle needs the least area enlargement to cover a; R1 is selected because its rectangle does not need to be enlarged, while the rectangle of R2 would need to expand considerably
Node Splitting (when a leaf node overflows)
– Pick two seeds from the entries that would get the largest area enlargement when covered by a single rectangle
– The remaining entries are chosen one at a time to be put into one of the two groups (both heuristics are sketched below)
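
These two rules match the classic R-tree heuristics: least area enlargement for choosing the insertion subtree, and largest wasted area for picking the split seeds. A hedged 2-D sketch, with rectangles assumed as {lowX, highX, lowY, highY}; this is not the authors' implementation:

```java
// Hypothetical 2-D sketch of the two R-tree heuristics described above.
// Rectangles are encoded as {lowX, highX, lowY, highY}.
public class RTreeHeuristics {

    static double area(double[] r) {
        return (r[1] - r[0]) * (r[3] - r[2]);
    }

    // Area of the smallest rectangle covering both a and b.
    static double combinedArea(double[] a, double[] b) {
        double w = Math.max(a[1], b[1]) - Math.min(a[0], b[0]);
        double h = Math.max(a[3], b[3]) - Math.min(a[2], b[2]);
        return w * h;
    }

    // Insertion: pick the entry whose rectangle needs the least area
    // enlargement to cover the new rectangle a.
    static int chooseSubtree(double[][] entries, double[] a) {
        int best = 0;
        double bestEnlargement = Double.MAX_VALUE;
        for (int i = 0; i < entries.length; i++) {
            double enlargement = combinedArea(entries[i], a) - area(entries[i]);
            if (enlargement < bestEnlargement) {
                bestEnlargement = enlargement;
                best = i;
            }
        }
        return best;
    }

    // Node splitting: pick as seeds the two entries that waste the most area
    // (largest enlargement) when covered by a single rectangle.
    static int[] pickSeeds(double[][] entries) {
        int[] seeds = {0, 1};
        double worst = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < entries.length; i++) {
            for (int j = i + 1; j < entries.length; j++) {
                double waste = combinedArea(entries[i], entries[j])
                        - area(entries[i]) - area(entries[j]);
                if (waste > worst) {
                    worst = waste;
                    seeds = new int[] {i, j};
                }
            }
        }
        return seeds;
    }
}
```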
6. Performance Analysis

Experimental environment
– Amazon's EC2 platform
– Implemented in Java 1.6.0.13 and run on a set of EC2 computing units
– Each computing unit is a small EC2 instance with a 1.7 GHz Xeon processor, 1.7 GB of memory, and a 160 GB hard disk
– Computing units are connected via 250 Mbps network links
– We use three different datasets with Uniform, Gaussian, and Zipf distributions to evaluate our distributed anonymization scheme
6. Performance Analysis

Dataset and Setup
– All 100K tuples are located in one centralized database
– Data are distributed among the 10 nodes, and we use the distributed anonymization approach presented in Section 4
– The R-tree generalization algorithm was used to generalize the database to be k-anonymous
– DM (discernibility metric) assigns each tuple ri* in D* a penalty determined by the size of the equivalence class containing it (sketched below)
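
As a hedged illustration (not the paper's code), summing that per-tuple penalty over each equivalence class gives the usual formulation of DM, where a class of size s contributes s × s:

```java
import java.util.List;

// Hypothetical sketch of the discernibility metric (DM) described above:
// every tuple is penalized by the size of its equivalence class, so a class
// of size s contributes s * s to the total.
public class DiscernibilityMetric {
    static long dm(List<Integer> equivalenceClassSizes) {
        long total = 0;
        for (int size : equivalenceClassSizes) {
            total += (long) size * size;
        }
        return total;
    }
}
```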
6. Performance Analysis

Absolute error = | actual – estimate |
– Actual is the correct number of range query answers
– Estimate is the size of the candidate set computed from the anonymous table (a small sketch follows below)
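
A tiny illustrative sketch of this measure (hypothetical names, not the authors' code):

```java
// Hypothetical sketch of the absolute-error measure defined above.
public class RangeQueryError {
    // actual: correct range-query answer count; estimate: candidate-set size
    // computed from the anonymized table.
    static long absoluteError(long actual, long estimate) {
        return Math.abs(actual - estimate);
    }
}
```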
Conclusion

Two directions have been presented
– A distributed anonymization protocol for privacy-preserving data publishing from multiple data providers in a cloud system
– A new anonymization algorithm using an R-tree index structure
Future work
– Developing a protocol toolkit incorporating more privacy principles such as differential privacy
– Building indexes based on anonymized cloud data to offer more efficient and reliable data analysis
Q/A

Thank you for listening to my presentation