Journal of Computational Information Systems 7:3 (2011) 924-931
Available at http://www.Jofcis.com
Personalized Anonymity Algorithm Using Clustering Techniques
Pingshui WANG†
College of Management Science and Engineering, Anhui University of Finance & Economics, Bengbu 233030, China
† Corresponding author. Email address: pshwang@163.com (Pingshui WANG).
Abstract
Anonymity is an essential technique for preserving individual privacy when releasing data. However, most existing anonymization methods follow a universal approach that exerts the same amount of preservation for all individuals. The result is that the released data may offer insufficient protection to one subset of people while applying excessive privacy control to another. In addition, methods based on generalization and suppression suffer from high information loss and low usability, mainly because they rely on pre-defined generalization hierarchies or a total order imposed on each attribute domain. To address these issues, we develop a new personalized anonymity algorithm using clustering techniques. Experimental results show that the proposed method can reduce information loss and improve the usability of the released data while satisfying every individual's privacy preservation requirements.
Keywords: Privacy Preservation; Personalized Anonymity; Clustering
1. Introduction
With the rapid development of the Internet and of data storage and processing technologies, individual privacy information is being collected by various organizations for research purposes. However, if individuals can be uniquely identified in the released data, then their private information would be disclosed. Releasing data about individuals without revealing their sensitive information is therefore an important problem. To avoid the identification of records in released data, uniquely identifying information such as names and social security numbers is removed from the table. However, this first sanitization alone does not ensure the privacy of individuals in the data. Privacy preservation is the prevention of the cross-references and inferences that could reveal confidential information in released data. Anonymity provides a relative guarantee that the identity of individuals cannot be discovered. In recent years, a definition of privacy called k-anonymity has gained popularity. The k-anonymity model, proposed by Sweeney [1], is a simple and practical privacy-preserving approach; it has drawn considerable interest from the research community, and a number of effective algorithms have been proposed [2-14]. The k-anonymity model ensures that each record in the table is identical to at least (k-1) other records with respect to the quasi-identifier attributes. Therefore, no privacy-related information can be inferred from a k-anonymized table with high confidence.
Most existing anonymity methods focus on a universal approach that exerts the same amount of
preservation for all individuals. The consequence is that the released data may offer insufficient protection
to a subset of people, while applying excessive privacy control to another subset. In addition, these methods
based on generalization and suppression techniques suffer from high information loss and low usability
mainly due to reliance on pre-defined generalization hierarchies or total order imposed on each attribute
domain. To address this issue, we develop a new personalized anonymity algorithm using clustering
techniques. Experimental results show that our method can reduce the information loss and improve the
usability of the released data while satisfying personal privacy preserving requirements.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents our method of personalized anonymity using clustering techniques. Section 4 analyzes the performance of our method through extensive experiments. Section 5 concludes the paper and discusses future work.
2. Related Work
2.1. k-Anonymity
In order to preserve privacy, Sweeney [1] proposed the k-anonymity model, which achieves k-anonymity using generalization and suppression techniques so that any individual is indistinguishable from at least (k-1) other individuals with respect to the quasi-identifier attributes in the released dataset. For example, Table 2 is a 2-anonymous version of Table 1. Generalization involves replacing a value with a less specific but semantically consistent value; for example, a date of birth could be generalized to the year of birth, so as to reduce the risk of identification. Suppression involves not releasing a value at all. In recent years, numerous algorithms have been proposed for implementing k-anonymity by generalization and suppression.
Table 1 Original Table

Name    Race    Birth       Sex   Zip     Disease
Alice   Black   1965-3-18   F     02141   gastric ulcer
Helen   Black   1965-5-1    F     02142   dyspepsia
David   Black   1966-6-10   M     02135   pneumonia
Bob     Black   1966-7-15   M     02137   bronchitis
Jane    White   1968-3-20   F     02139   flu
Paul    White   1968-4-1    F     02138   cancer
Table 2 2-Anonymous Table of Table 1

Race    Birth   Sex   Zip     Disease
Black   1965    F     0214*   gastric ulcer
Black   1965    F     0214*   dyspepsia
Black   1966    M     0213*   pneumonia
Black   1966    M     0213*   bronchitis
White   1968    F     0213*   flu
White   1968    F     0213*   cancer
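
As a concrete illustration (ours, not part of the original paper), the following Python sketch applies the two operations that turn Table 1 into Table 2: the Birth attribute is generalized to its year, and the last digit of the Zip code is suppressed.

def generalize_birth(birth: str) -> str:
    """Generalize a 'YYYY-M-D' date string to its year (e.g. '1965-3-18' -> '1965')."""
    return birth.split("-")[0]

def suppress_zip_digit(zip_code: str) -> str:
    """Suppress the last digit of a Zip code with '*' (e.g. '02141' -> '0214*')."""
    return zip_code[:-1] + "*"

record = {"Race": "Black", "Birth": "1965-3-18", "Sex": "F", "Zip": "02141"}
released = dict(record, Birth=generalize_birth(record["Birth"]),
                Zip=suppress_zip_digit(record["Zip"]))
print(released)  # {'Race': 'Black', 'Birth': '1965', 'Sex': 'F', 'Zip': '0214*'}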
2.2. Personalized Anonymity
The k-anonymity methods mainly focus on a universal approach that exerts the same amount of
preservation for all individuals, without catering for their concrete needs. The consequence is that the
released data may offer insufficient protection to a subset of people, while applying excessive privacy
control to another subset. Motivated by this, Xiao and Tao [8] presented a new generalization framework
based on the concept of personalized anonymity, which overcomes the above problems. In the following, we briefly introduce the principle of personalized anonymity.
The basic idea of the personalized anonymity is that a person can specify the degree of privacy
preservation for her/his sensitive values. For example, figure 1 demonstrates a simple taxonomy of attribute
Disease. The taxonomy is accessible by the public, and organizes all diseases as leaves of a tree. An
intermediate node carries a name summarizing the diseases in its subtree. Each person can decide the degree to which her/his sensitive value appears in the released data. Their technique performs the minimum generalization needed to satisfy everybody's requirements and thus retains the largest amount of information from the original data.
Table 3 An Example of Guarding Node

Record No.   Name    Race    Birth       Sex   Zip     Disease         Guarding Node
1            Alice   Black   1965-3-18   F     02141   gastric ulcer   stomach disease
2            Helen   Black   1965-5-1    F     02142   dyspepsia       dyspepsia
3            David   Black   1966-6-10   M     02135   pneumonia       respiratory infection
4            Bob     Black   1966-7-15   M     02137   bronchitis      bronchitis
5            Jane    White   1968-3-20   F     02139   flu             φ
6            Paul    White   1968-4-1    F     02138   cancer          other
A personal preference can be easily solicited from an individual when s/he is supplying her/his data. In
the personalized anonymity approach, a preference is formulated through a “guarding node” in the
taxonomy. As an example, for record 1 in Table 3, Alice may specify node stomach-disease (the “guarding
node” for her privacy). Thus, nobody should be able to infer that she suffered from any disease (i.e., gastric
ulcer, dyspepsia, or gastritis) in the subtree of the node with significant confidence. In other words, in
Alice’s opinion, allowing the public to associate her with dyspepsia or gastritis is as serious as revealing
her true disease. On the other hand, for record 5 in Table 3, Jane may specify φ, which is an implicit node underneath all the leaves of the taxonomy. This empty preference implies that she is willing to release her actual diagnosis, flu; therefore, record 5 can be released directly. In general, flu may not be "sensitive" for many people, so it is often not necessary to apply any privacy protection to this value.
any illness
    respiratory infection: pneumonia, bronchitis, flu
    stomach disease: dyspepsia, gastritis, gastric ulcer
    other: aids, ..., cancer

Fig. 1 The Taxonomy of Attribute Disease
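
To make the later definitions easier to follow, the taxonomy of Fig. 1 (as recovered above) can be encoded as a simple child-to-parent map. This Python sketch is our own illustration, not part of the paper, and is reused by the examples in the following sections.

# Our encoding of the Disease taxonomy of Fig. 1 as a child -> parent map;
# the leaves are the actual diagnosis values and "any illness" is the root.
DISEASE_PARENT = {
    "respiratory infection": "any illness",
    "stomach disease": "any illness",
    "other": "any illness",
    "pneumonia": "respiratory infection",
    "bronchitis": "respiratory infection",
    "flu": "respiratory infection",
    "dyspepsia": "stomach disease",
    "gastritis": "stomach disease",
    "gastric ulcer": "stomach disease",
    "aids": "other",
    "cancer": "other",
}

def path_to_root(node):
    """Path from a taxonomy node up to the root, both endpoints included."""
    path = [node]
    while path[-1] in DISEASE_PARENT:
        path.append(DISEASE_PARENT[path[-1]])
    return path

# path_to_root("gastric ulcer") -> ['gastric ulcer', 'stomach disease', 'any illness']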
Let S be a table storing private information about a set of individuals. The attributes in S are classified into four categories: (i) an identifier attribute A^i, which uniquely identifies a person and must be removed when S is released to the public; (ii) a sensitive attribute A^s (e.g., Disease in Table 3), whose values may be confidential for an individual (subject to her/his preferences); (iii) d quasi-identifier (QI) attributes A^qi_1, A^qi_2, ..., A^qi_d, whose values can be released but may reveal a personal identity with the aid of external information (Race, Birth, Sex, Zip in Table 3); and (iv) other attributes that are not relevant to our discussion. All the attributes have finite domains, and each categorical attribute is accompanied by a taxonomy, which indicates the publicly known hierarchy among the possible values of the attribute.
The objective is to compute a releasable table S* such that (i) it contains all the attributes of S except A^i, (ii) it has a generalized record for every record in S, (iii) it preserves as much information of S as possible, and (iv) its release does not cause any privacy breach. To describe the personal privacy requirements, the following concepts are defined [8]:
Definition 1 (A^s subtree). For any node x in the taxonomy of A^s, ST(x) represents its subtree, which includes x itself and the part of the taxonomy under it. A record r ∈ S defines an association between an individual o (identified by r.A^i) and a sensitive value v = r.A^s, which is denoted as {o, v}. To formulate her/his privacy preference, o specifies a guarding node as follows:
Definition 2 (Guarding node). For a record r ∈ S, its guarding node r.GN is a node on the path from the root to r.A^s in the taxonomy of A^s.
Through r.GN, o indicates that she/he does not want the public to associate her/him with any leaf A^s value in ST(r.GN). Specifically, assume that ST(r.GN) contains x leaf values v_1, v_2, ..., v_x. The privacy requirement of r.GN is breached if an adversary thinks that any of the associations {o, v_1}, ..., {o, v_x} exists in S.
Definition 3 (Breach probability). For a record r ∈ S, its breach probability P_breach(r) equals the probability that an adversary can infer from S* that any of the associations {o, v_1}, ..., {o, v_x} exists in S, where v_1, v_2, ..., v_x are the leaf values in ST(r.GN).
The released table S* should guarantee that, for all r ∈ S, P_breach(r) is at most p_breach, which is a system parameter specifying the amount of confidentiality control. Table 3 demonstrates the guarding nodes selected by the individuals. Guarding nodes depend entirely on personal preferences and are not determined by the sensitive values.
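
As an illustration of Definitions 2 and 3 (ours, reusing the DISEASE_PARENT map and path_to_root helper from the sketch in Section 2.2), the leaf values v_1, ..., v_x covered by a guarding node can be enumerated as follows:

def leaves_under(guard):
    """Leaf values v1, ..., vx in ST(guard); Definition 3 requires that an
    adversary cannot associate the record owner with any of them. A guarding
    node of None models the empty preference (phi), which covers no leaves."""
    if guard is None:
        return set()
    nodes = set(DISEASE_PARENT) | set(DISEASE_PARENT.values())
    leaves = {n for n in nodes if n not in DISEASE_PARENT.values()}
    return {v for v in leaves if guard in path_to_root(v)}

# Alice's guarding node (record 1 in Table 3):
#   leaves_under("stomach disease") -> {'gastric ulcer', 'dyspepsia', 'gastritis'}
# Jane's empty preference (record 5): leaves_under(None) -> set()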
3. Personalized Anonymity Algorithm Using Clustering Techniques
The key idea underlying our method is that the personalized anonymity problem can be viewed as a
clustering problem [10]. Clustering is the problem of partitioning a set of objects into groups such that
objects in the same group are more similar to each other than to objects in other groups, with respect to some defined similarity criteria.
3.1. Personalized k-Member Clustering Problem
Typical clustering problems require that a specific number of clusters be found. However, the personalized anonymity problem does not constrain the number of clusters; instead, it requires that each cluster contain at least k records and that each sensitive attribute value satisfy its owner's personal privacy requirement. Thus, we pose the personalized anonymity problem as a clustering problem, referred to as the personalized k-member clustering problem.
Definition 4 (Personalized k-member clustering problem). The personalized k-member clustering problem is to find a set of clusters from a given set of n records such that each cluster contains at least k (k ≤ n) data points and the sum of all intra-cluster distances is minimized. Formally, let S be a set of n records, k the specified anonymization parameter, and p_breach a system parameter specifying the amount of confidentiality control. Then the optimal solution of the personalized k-member clustering problem is a set of clusters E = {e_1, ..., e_m} such that:
(1) ∀ i ≠ j ∈ {1, ..., m}: e_i ∩ e_j = ∅;
(2) ∪_{i=1,...,m} e_i = S;
(3) ∀ e_i ∈ E: |e_i| ≥ k;
(4) Σ_{l=1,...,m} |e_l| · max_{i,j=1,...,|e_l|} Δ(p(l,i), p(l,j)) is minimized;
(5) ∀ r ∈ S: P_breach(r) ≤ p_breach;
(6) each sensitive attribute value satisfies its owner's personal privacy requirement;
where |e| is the size of cluster e, p(l,i) denotes the i-th data point in cluster e_l, and Δ(x, y) is the distance between two data points x and y.
3.2. Distance and Cost Functions
Distance functions are used to measure the dissimilarities among data points, which are usually determined
by the type of data being clustered. The cost function that the clustering problem tries to minimize is
defined by the specific objective of the clustering problem. In this section, we describe our distance and cost functions, which have been specifically tailored for the k-anonymity problem [10].
Definition 5 (Distance between two numeric values). Let D be a finite numeric domain. Then the normalized distance between two values v_i, v_j ∈ D is defined as:
δ_N(v_i, v_j) = |v_i − v_j| / |D|,    (1)
where |D| is the domain size, measured as the difference between the maximum and minimum values in D.
For categorical attributes, such a difference is no longer applicable, as most categorical domains cannot be enumerated in any specific order. However, some domains may have semantic relationships among their values. In such domains, it is desirable to define the distance functions based on these existing relationships, which can be easily captured in a taxonomy tree. For example, Fig. 1 illustrates a natural taxonomy tree for the Disease attribute. Furthermore, we can assign a specific distance to each pair of adjacent levels, with the distance between lower levels smaller than that between higher levels. We now define the distance function for categorical values as follows:
Definition 6 (Distance between two categorical values). Let D be a categorical domain and T_D be a taxonomy tree defined for D. We assume that the total distance from the bottom level to the top level of the taxonomy tree is 1. Then the normalized distance between two values v_i, v_j ∈ D is defined as:
δ_C(v_i, v_j) = W(Λ(v_i, v_j)),    (2)
where Λ(v_i, v_j) is the subtree rooted at the lowest common ancestor of v_i and v_j, and W(T) is the sum of the distances between the levels of tree T.
Example 1. Consider attribute Disease and its taxonomy tree in Fig. 1. We assume the distance between level 1 and level 2 equals 0.7, and the distance between level 2 and level 3 is 0.3. Then the distance between dyspepsia and gastritis is 0.3, while the distance between dyspepsia and flu is 0.3 + 0.7 = 1.
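
The two distance functions can be transcribed almost directly. The sketch below is ours; it reuses DISEASE_PARENT and path_to_root from Section 2.2, assumes the three-level Disease taxonomy of Fig. 1, and uses the level distances of Example 1, reproducing the two values computed above.

LEVEL_WEIGHT = {1: 0.7, 2: 0.3}   # distance between level l and level l+1
LEAF_LEVEL = 3                    # Fig. 1 has three levels; the root is level 1

def numeric_distance(v1, v2, domain_min, domain_max):
    """Definition 5: |v1 - v2| / |D|, with |D| = max(D) - min(D)."""
    return abs(v1 - v2) / (domain_max - domain_min)

def subtree_weight(node):
    """W(ST(node)): summed level distances of the subtree rooted at node,
    assuming all leaves sit at LEAF_LEVEL."""
    level = len(path_to_root(node))                 # level of the node itself
    return sum(LEVEL_WEIGHT[l] for l in range(level, LEAF_LEVEL))

def categorical_distance(v1, v2):
    """Definition 6: W(Lambda(v1, v2)), where Lambda(v1, v2) is the subtree
    rooted at the lowest common ancestor of v1 and v2."""
    ancestors_of_v2 = set(path_to_root(v2))
    lca = next(n for n in path_to_root(v1) if n in ancestors_of_v2)
    return subtree_weight(lca)

# Reproduces Example 1:
#   categorical_distance("dyspepsia", "gastritis") -> 0.3  (LCA: stomach disease)
#   categorical_distance("dyspepsia", "flu")       -> 1.0  (LCA: any illness)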
Definition 7 (Distance between two records). Let QT = {N_1, ..., N_m, C_1, ..., C_n} be the quasi-identifier of table S, where N_i (i = 1, ..., m) is an attribute with a numeric domain and C_j (j = 1, ..., n) is an attribute with a categorical domain, and let A^s be the sensitive attribute, also with a categorical domain. The distance between two records r_1, r_2 ∈ S is defined as:
Δ(r_1, r_2) = Σ_{i=1,...,m} w_i · δ_N(r_1[N_i], r_2[N_i]) + Σ_{j=1,...,n} w_j · δ_C(r_1[C_j], r_2[C_j]) + w_s · δ_C(r_1[A^s], r_2[A^s]),    (3)
where r_i[A] denotes the value of attribute A in record r_i, w_i (respectively w_j, w_s) is the weight of attribute N_i (respectively C_j, A^s), and Σ_{i=1}^{m} w_i + Σ_{j=1}^{n} w_j + w_s = 1.
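
Equation (3) then reduces to a weighted sum of the per-attribute distances. The following sketch is ours; the attribute descriptions and weights in the usage example are illustrative only and are assumed to sum to 1.

def record_distance(r1, r2, numeric_attrs, categorical_attrs, weights):
    """Equation (3). numeric_attrs: {name: (min, max)} domain bounds;
    categorical_attrs: {name: distance_fn}, where distance_fn implements
    Definition 6 for that attribute's taxonomy; weights: {name: w}, summing to 1
    over all listed attributes (QI attributes plus the sensitive attribute)."""
    d = 0.0
    for a, (lo, hi) in numeric_attrs.items():
        d += weights[a] * numeric_distance(r1[a], r2[a], lo, hi)
    for a, dist in categorical_attrs.items():
        d += weights[a] * dist(r1[a], r2[a])
    return d

# Illustrative call with Birth (as a year) and Disease only, equal weights:
#   record_distance({"Birth": 1965, "Disease": "dyspepsia"},
#                   {"Birth": 1968, "Disease": "flu"},
#                   numeric_attrs={"Birth": (1965, 1968)},
#                   categorical_attrs={"Disease": categorical_distance},
#                   weights={"Birth": 0.5, "Disease": 0.5})   # -> 1.0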
Definition 8 (Information loss). Let e = {r_1, ..., r_c} be a cluster whose quasi-identifier consists of numeric attributes N_1, ..., N_m and categorical attributes C_1, ..., C_n. Let T_{C_i} be the taxonomy tree defined for the domain of categorical attribute C_i. Let MIN_{N_i} and MAX_{N_i} be the minimum and maximum values in e with respect to attribute N_i, and let ∪C_i be the union of the values in e with respect to attribute C_i. Let ST(t_k.GN) be the subtree of the guarding node of record t_k. Then the amount of information loss incurred by generalizing e, denoted by IL(e), is defined as:
IL(e) = |e| · ( Σ_{i=1,...,m} (MAX_{N_i} − MIN_{N_i}) / |N_i| + Σ_{j=1,...,n} W(Λ(∪C_j)) ) + Σ_{k=1,...,|e|} W(ST(t_k.GN)),    (4)
where |e| is the number of records in e, |N_i| is the size of the numeric domain N_i, Λ(∪C_j) is the subtree rooted at the lowest common ancestor of every value in ∪C_j, and W(T) is the sum of the distances between the levels of tree T.
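
Equation (4) can be sketched as follows. This is our own illustration under simplifying assumptions: every categorical attribute is handled through a balanced taxonomy (here the Disease helpers from the earlier sketches), and a guarding node of None stands for the empty preference φ.

def disease_lca(values):
    """Lowest common ancestor of a set of Disease taxonomy values."""
    common = set(path_to_root(values[0]))
    for v in values[1:]:
        common &= set(path_to_root(v))
    return max(common, key=lambda n: len(path_to_root(n)))   # deepest common node

def information_loss(cluster, numeric_attrs, categorical_attrs):
    """Equation (4). cluster: list of record dicts, each carrying a 'GN' guarding
    node (None for phi); numeric_attrs: {name: |N_i|} domain sizes;
    categorical_attrs: {name: (lca_fn, subtree_weight_fn)} per-attribute taxonomy
    helpers -- here the Disease helpers are reused for illustration."""
    numeric_term = sum(
        (max(r[a] for r in cluster) - min(r[a] for r in cluster)) / size
        for a, size in numeric_attrs.items())
    categorical_term = sum(
        w_fn(lca_fn([r[a] for r in cluster]))
        for a, (lca_fn, w_fn) in categorical_attrs.items())
    guard_term = sum(
        subtree_weight(r["GN"]) if r["GN"] is not None else 0.0
        for r in cluster)
    return len(cluster) * (numeric_term + categorical_term) + guard_term

# Example: information_loss(cluster, numeric_attrs={"Birth": 3},
#                           categorical_attrs={"Disease": (disease_lca, subtree_weight)})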
Definition 9 (Total information loss). Let E be the set of all equivalence classes in the anonymized table S*. Then the total information loss of S* is defined as:
Total-IL(S*) = Σ_{e∈E} IL(e).    (5)
3.3. Personalized Anonymity Algorithm Using Clustering Techniques
Based on the above concepts, we propose a new algorithm for computing a personalized anonymous table S* with small Total-IL(S*) which guarantees P_breach(r) ≤ p_breach for all r ∈ S.
Our anonymization algorithm consists of three steps, as shown in Fig. 2. The first step produces a set of QI-groups, denoted temp, by applying clustering on S. The second step produces anonymous QI-groups by applying QI-generalization on temp, using a set of generalization functions G = {f_1, f_2, ..., f_d} on the d QI-attributes, respectively. Finally, the third step produces the final releasable table S* by performing SA-generalization on the resulting QI-groups, employing a specialized generalization function for each QI-group. Hence, the quality of S* depends on (i) the method of clustering, (ii) the choice of generalization functions G, and (iii) the effectiveness of SA-generalization. Please refer to [13] for the details of the SA-generalization algorithm.
Algorithm Clustering_Personalized_Anonymity
Input: a set of records S, a threshold value k, and the guarding nodes of all records.
Output: the releasable dataset S*.
1. temp = ∅; r = a randomly picked record from S;
2. while ( |S| ≥ k )
   a) r = the furthest record from r;
   b) S = S − {r};
   c) e = {r};
   d) while ( |e| < k )
      i.   r = find_best_record(S, e);
      ii.  S = S − {r};
      iii. e = e ∪ {r};
   e) temp = temp ∪ {e};
3. while ( |S| ≠ 0 )
   a) r = a randomly picked record from S;
   b) S = S − {r};
   c) e = find_best_cluster(temp, r);
   d) e = e ∪ {r};
4. temp' = QI-generalization(temp);
5. S* = SA-generalization(temp');
6. return S*.

Fig. 2 Algorithm for Personalized Anonymity Using Clustering Techniques
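
Under the same assumptions as the earlier sketches, the clustering phase (steps 1-3 of Fig. 2) can be written roughly as follows. This is our illustration, not the authors' implementation; find_best_record and find_best_cluster are realized greedily as the choices that increase the information loss of Definition 8 the least, mirroring the heuristic of [10], and the generalization steps 4-5 are omitted.

import random

def cluster_records(S, k, distance, cost):
    """Steps 1-3 of Fig. 2: partition the record list S into clusters of size >= k.
    distance(r1, r2) is the record distance of Definition 7 and cost(cluster)
    is IL(cluster) of Definition 8."""
    S = list(S)
    clusters = []
    r = random.choice(S)                                   # step 1
    while len(S) >= k:                                     # step 2
        r = max(S, key=lambda s: distance(r, s))           # 2a: furthest record from r
        S.remove(r)                                        # 2b
        e = [r]                                            # 2c
        while len(e) < k:                                  # 2d
            # find_best_record: the record whose addition increases IL(e) least
            best = min(S, key=lambda s: cost(e + [s]) - cost(e))
            S.remove(best)
            e.append(best)
        clusters.append(e)                                 # 2e
    while S:                                               # step 3: leftover records
        r = S.pop(random.randrange(len(S)))                # 3a, 3b
        # find_best_cluster: the cluster whose IL grows least when r is added
        best = min(clusters, key=lambda c: cost(c + [r]) - cost(c))
        best.append(r)                                     # 3c, 3d
    return clusters

Any record distance satisfying Definition 7 and any cost function implementing Equation (4) can be plugged in through the distance and cost parameters; steps 4 and 5 (QI- and SA-generalization) would then be applied to the returned clusters.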
4. Experimental Results
The main goal of the experiments was to investigate the performance of our algorithm in terms of data
quality and execution time. To accurately evaluate our approach, we compared our implementation with
another algorithm, namely the generalization algorithm proposed in [8].
4.1. Experimental Setup
The experiments were performed on a machine with an Intel(R) Core(TM)2 Duo T5450 CPU at 1.67 GHz, 2.0 GB of RAM, Windows XP, MATLAB 7.0, and Visual C++ 6.0.
We ran our experiments on the IPUMS database. The dataset contains a relation with 100k records, each
storing the information of an American adult. Before the experiments, we removed records with missing
values and chose only six original attributes. For k-anonymity, we considered {age, education, gender,
marital status, and occupation} as the quasi-identifier. Among these, age and education were treated as
numeric attributes while the other three attributes were treated as categorical attributes. In addition to that,
we also considered income as the sensitive attribute. Taxonomy trees for each attribute are described in
Table 4. Records are randomly divided into 3 groups which account for 10%, 30%, and 60% of the dataset,
respectively. For each record in the first (or second) group, its guarding node is the parent of its sensitive
value (or is φ ). The guarding nodes of the records in the last group are their sensitive values.
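
For reference, the 10%/30%/60% assignment of guarding nodes described above can be sketched as follows (our illustration; the attribute name 'income' and the parent map are placeholders for the actual income taxonomy used in the experiments).

import random

def assign_guarding_nodes(records, parent, sensitive="income", seed=0):
    """Randomly split records into 10% / 30% / 60% groups and assign the guarding
    node described in the text: the parent of the sensitive value, the empty
    preference (None, i.e. phi), or the sensitive value itself, respectively."""
    rng = random.Random(seed)
    order = rng.sample(range(len(records)), len(records))
    n = len(records)
    for rank, idx in enumerate(order):
        r = records[idx]
        if rank < 0.1 * n:
            r["GN"] = parent[r[sensitive]]   # first group: parent of the sensitive value
        elif rank < 0.4 * n:
            r["GN"] = None                   # second group: phi, released directly
        else:
            r["GN"] = r[sensitive]           # last group: the sensitive value itself
    return records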
Table 4 Experimental Data Information

Attribute        Distinct values   Tree level   Sensitive?
age              8                 3            No
education        15                3            No
gender           2                 2            No
marital status   6                 3            No
occupation       8                 3            No
income           10                4            Yes
4.2. Data Quality
In this section, we report experimental results on data quality for our algorithm and the generalization algorithm. Figure 3 shows the Total-IL costs of the two algorithms for increasing values of k. As the figure illustrates, our algorithm yields a lower Total-IL cost for k values ranging from 10 to 100. The superiority of our algorithm over the generalization algorithm stems from the fact that the generalization algorithm anonymizes the dataset by generalization techniques only.
Fig. 3 k-Value and Information Loss Metric
4.3. Execution Time
The execution time of the two algorithms for different k values is shown in Figure 4. The execution time of our algorithm is higher than that of the generalization algorithm for all k values, because our algorithm searches the whole dataset space when generating QI-groups. Even though the execution time of our algorithm is higher than that of the generalization algorithm, we believe it is still acceptable in practice, as anonymization is usually an off-line procedure.
Fig. 4 k-Value and Execution Time
5. Conclusions
The existing anonymization methods are inadequate because they cannot guarantee privacy preservation in all cases and often suffer from unnecessary information loss due to excessive generalization. In this paper, we studied the essential principle of personalized anonymity and proposed a new personalized anonymity algorithm by transforming the anonymity problem into a clustering problem. We also defined two key elements of clustering, namely the distance and cost functions, which are specifically tailored to the k-anonymity problem. The proposed personalized anonymity algorithm allows individuals to specify their own privacy preservation preferences while protecting against sensitive attribute disclosure. Extensive experimental results show that the proposed algorithm is effective and may cause less information loss during the generalization process. However, the proposed algorithm is not optimal, in the sense that it does not necessarily achieve the lowest information loss. Finding an approximately optimal solution is our future work [15].
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China under Grant No. 71071001, and by the Natural Science Foundation of Anhui Province of China under Grant No. 11040606M140.
References
[1] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, 10(5): 557-570, 2002.
[2] R. Bayardo, R. Agrawal. Data Privacy Through Optimal k-Anonymization. In Proceedings of the 21st International Conference on Data Engineering, pages 217-228, 2005.
[3] K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Incognito: Efficient Full-Domain k-Anonymity. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 49-60, 2005.
[4] X. K. Xiao, Y. F. Tao. Anatomy: Simple and Effective Privacy Preservation. In Proceedings of the Very Large Data Bases (VLDB) Conference, pages 139-150, 2006.
[5] A. Machanavajjhala, J. Gehrke, D. Kifer. l-Diversity: Privacy Beyond k-Anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1): 24-35, 2007.
[6] N. H. Li, T. C. Li. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In Proceedings of the 23rd International Conference on Data Engineering, pages 106-115, 2007.
[7] X. K. Xiao, Y. F. Tao. m-Invariance: Towards Privacy Preserving Re-Publication of Dynamic Datasets. In Proceedings of the ACM Conference on Management of Data (SIGMOD), pages 689-700, 2007.
[8] X. K. Xiao, Y. F. Tao. Personalized Privacy Preservation. In Proceedings of the ACM Conference on Management of Data (SIGMOD), pages 229-240, 2006.
[9] X. J. Ye, Y. W. Zhang, M. Liu. A Personalized (a,k)-Anonymity Model. In Proceedings of the 9th International Conference on Web-Age Information Management, pages 341-348, 2008.
[10] J. W. Byun, A. Kamra, E. Bertino, et al. Efficient k-Anonymization Using Clustering Techniques. In Proceedings of DASFAA 2007, pages 188-200, 2007.
[11] G. Loukides, J. H. Shao. An Efficient Clustering Algorithm for k-Anonymisation. International Journal of Computer Science and Technology, 23(2): 188-202, 2008.
[12] J. L. Lin, M. C. Wei. Genetic Algorithm-Based Clustering Approach for k-Anonymization. International Journal of Expert Systems with Applications, 36(6): 9784-9792, 2009.
[13] L. J. Lu, X. J. Ye. An Improved Weighted-Feature Clustering Algorithm for k-Anonymity. In Proceedings of the 5th International Conference on Information Assurance and Security, pages 415-419, 2009.
[14] Z. H. Wang, J. Xu, W. Wang, et al. Clustering-Based Approach for Data Anonymization. Chinese Journal of Software, 21(4): 680-693, 2010.
[15] Y. H. Yu, W. Y. Bai. Integrated Privacy Protection and Access Control over Outsourced Database Services. Journal of Computational Information Systems, 6(8): 2767-2777, 2010.