Journal of Computational Information Systems 7:3 (2011) 924-931
Available at http://www.Jofcis.com

Personalized Anonymity Algorithm Using Clustering Techniques

Pingshui WANG†
College of Management Science and Engineering, Anhui University of Finance & Economics, Bengbu 233030, China
† Corresponding author. Email address: pshwang@163.com (Pingshui WANG).

Abstract

Anonymity is an essential technique for preserving individual privacy in the data releasing setting. However, most existing anonymization methods adopt a universal approach that applies the same amount of protection to all individuals. As a result, the released data may offer insufficient protection to a subset of people, while applying excessive privacy control to another subset. In addition, methods based on generalization and suppression suffer from high information loss and low usability, mainly due to their reliance on pre-defined generalization hierarchies or a total order imposed on each attribute domain. To address these issues, we develop a new personalized anonymity algorithm using clustering techniques. Experimental results show that the proposed method can reduce information loss and improve the usability of the released data while satisfying every individual's privacy preservation requirements.

Keywords: Privacy Preservation; Personalized Anonymity; Clustering

1. Introduction

With the rapid development of the Internet and of data storage and processing technologies, personal information is being collected by various organizations for research purposes. However, if individuals can be uniquely identified in the released data, their private information is disclosed. Releasing data about individuals without revealing their sensitive information is therefore an important problem. To avoid the identification of records in released data, uniquely identifying information such as names and social security numbers is removed from the table. However, this first sanitization alone does not ensure the privacy of individuals in the data. Privacy preservation is the prevention of cross-references and inferences that could reveal confidential information in released data. Anonymity provides a relative guarantee that the identity of individuals cannot be discovered.

In recent years, a new definition of privacy called k-anonymity has gained popularity. The k-anonymity model, proposed by Sweeney [1], is a simple and practical privacy-preserving approach; it has drawn considerable interest from the research community, and a number of effective algorithms have been proposed [2-14]. The k-anonymity model ensures that each record in the table is identical to at least (k-1) other records with respect to the quasi-identifier attributes. Therefore, no privacy-related information can be inferred from a k-anonymized table with high confidence.

Most existing anonymity methods focus on a universal approach that exerts the same amount of preservation for all individuals. The consequence is that the released data may offer insufficient protection to a subset of people, while applying excessive privacy control to another subset. In addition, these methods, based on generalization and suppression techniques, suffer from high information loss and low usability, mainly due to their reliance on pre-defined generalization hierarchies or a total order imposed on each attribute domain. To address this issue, we develop a new personalized anonymity algorithm using clustering techniques.
Experimental results show that our method can reduce information loss and improve the usability of the released data while satisfying personal privacy preservation requirements.

The rest of this paper is organized as follows. In Section 2, we introduce the related work. In Section 3, we present our method of personalized anonymity using clustering techniques. In Section 4, we analyze the performance of our method through extensive experiments. Section 5 contains the conclusions and future work.

2. Related Work

2.1. k-Anonymity

In order to preserve privacy, Sweeney [1] proposed the k-anonymity model, which achieves k-anonymity using generalization and suppression techniques so that any individual is indistinguishable from at least (k-1) others with respect to the quasi-identifier attributes in the released dataset. For example, Table 2 is a 2-anonymous version of Table 1. Generalization involves replacing a value with a less specific but semantically consistent value. For example, a date of birth could be generalized to a range such as the year of birth, so as to reduce the risk of identification. Suppression involves not releasing a value at all. In recent years, numerous algorithms have been proposed for implementing k-anonymity by generalization and suppression.

Table 1 Original Table
Name   Race   Birth      Sex  Zip    Disease
Alice  Blank  1965-3-18  F    02141  gastric ulcer
Helen  Blank  1965-5-1   F    02142  dyspepsia
David  Blank  1966-6-10  M    02135  pneumonia
Bob    Blank  1966-7-15  M    02137  bronchitis
Jane   White  1968-3-20  F    02139  flu
Paul   White  1968-4-1   F    02138  cancer

Table 2 2-Anonymous Table of Table 1
Race   Birth  Sex  Zip    Disease
Blank  1965   F    0214*  gastric ulcer
Blank  1965   F    0214*  dyspepsia
Blank  1966   M    0213*  pneumonia
Blank  1966   M    0213*  bronchitis
White  1968   F    0213*  flu
White  1968   F    0213*  cancer

2.2. Personalized Anonymity

The k-anonymity methods mainly focus on a universal approach that exerts the same amount of preservation for all individuals, without catering for their concrete needs. The consequence is that the released data may offer insufficient protection to a subset of people, while applying excessive privacy control to another subset. Motivated by this, Xiao and Tao [8] presented a new generalization framework based on the concept of personalized anonymity, which overcomes the above problems. In the following, we briefly introduce the principle of personalized anonymity.

The basic idea of personalized anonymity is that a person can specify the degree of privacy preservation for her/his sensitive values. For example, Fig. 1 demonstrates a simple taxonomy of the attribute Disease. The taxonomy is accessible to the public and organizes all diseases as the leaves of a tree. An intermediate node carries a name summarizing the diseases in its subtree. Each person can decide how her/his sensitive value appears in the released data. Their technique performs the minimum generalization required to satisfy everybody's requirements, and thus retains the largest amount of information from the original data.
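As an informal illustration (not taken from the paper), the sketch below shows one way such a taxonomy could be represented and how to enumerate the leaf diseases covered by a given node, i.e. the subtree that a guarding node shields; the exact tree shape and node names are assumptions read off Fig. 1.

```python
# A toy version of the Disease taxonomy of Fig. 1 (structure assumed; only some
# leaves of the "other" branch are shown): parent node -> list of children.
TAXONOMY = {
    "any illness": ["respiratory infection", "stomach disease", "other"],
    "respiratory infection": ["flu", "pneumonia", "bronchitis"],
    "stomach disease": ["dyspepsia", "gastritis", "gastric ulcer"],
    "other": ["aids", "cancer"],
}

def leaves_under(node):
    """Return every leaf disease in the subtree rooted at `node`."""
    children = TAXONOMY.get(node)
    if not children:                      # a leaf represents an actual diagnosis
        return [node]
    leaves = []
    for child in children:
        leaves.extend(leaves_under(child))
    return leaves

# A guarding node of "stomach disease" shields all of its leaf diseases:
print(leaves_under("stomach disease"))   # ['dyspepsia', 'gastritis', 'gastric ulcer']
```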
Table 3 An Example of Guarding Nodes
Record No.  Name   Race   Birth      Sex  Zip    Disease        Guarding Node
1           Alice  Blank  1965-3-18  F    02141  gastric ulcer  stomach disease
2           Helen  Blank  1965-5-1   F    02142  dyspepsia      dyspepsia
3           David  Blank  1966-6-10  M    02135  pneumonia      respiratory infection
4           Bob    Blank  1966-7-15  M    02137  bronchitis     bronchitis
5           Jane   White  1968-3-20  F    02139  flu            φ
6           Paul   White  1968-4-1   F    02138  cancer         other

A personal preference can easily be solicited from an individual when s/he supplies her/his data. In the personalized anonymity approach, a preference is formulated through a "guarding node" in the taxonomy. As an example, for record 1 in Table 3, Alice may specify the node stomach disease (the "guarding node" for her privacy). Thus, nobody should be able to infer with significant confidence that she suffered from any disease (i.e., gastric ulcer, dyspepsia, or gastritis) in the subtree of that node. In other words, in Alice's opinion, allowing the public to associate her with dyspepsia or gastritis is as serious as revealing her true disease. On the other hand, for record 5 in Table 3, Jane may specify φ, which is an implicit node underneath all the leaves of the taxonomy. The empty preference implies that she is willing to release her actual diagnosis, flu; therefore, record 5 can be released directly. In general, flu may not be "sensitive" for many people, so it is often unnecessary to apply any privacy protection to this value.

Fig. 1 The Taxonomy of Attribute Disease (a tree whose root is "any illness", with intermediate nodes respiratory infection, stomach disease and other, and leaf diseases such as flu, pneumonia, bronchitis, dyspepsia, gastritis, gastric ulcer, aids, ..., cancer)

Let S be a table storing private information about a set of individuals. The attributes in S are classified into four categories: (i) an identifier attribute A^i, which uniquely identifies a person and must be removed when S is released to the public; (ii) a sensitive attribute A^s (e.g., Disease in Table 3), whose value may be confidential for an individual (subject to her/his preferences); (iii) d quasi-identifier (QI) attributes A^qi_1, A^qi_2, ..., A^qi_d, whose values can be released but may reveal a personal identity with the aid of external information (Race, Birth, Sex, Zip in Table 3); and (iv) other attributes that are not relevant to our discussion. All the attributes have finite domains, and each categorical attribute is accompanied by a taxonomy, which indicates the publicly known hierarchy among the possible values of the attribute. The objective is to compute a releasable table S* such that (i) it contains all the attributes of S except A^i, (ii) it has a generalized record for every record in S, (iii) it preserves as much information of S as possible, and (iv) its release does not cause any privacy breach.

To describe the personal privacy requirements, the following concepts are defined [8]:

Definition 1 (A^s subtree). For any node x in the taxonomy of A^s, ST(x) denotes its subtree, which includes x itself and the part of the taxonomy below it.

A record r ∈ S defines an association between an individual o (identified by r.A^i) and a sensitive value v = r.A^s, denoted {o, v}. To formulate her/his privacy preference, o specifies a guarding node as follows:

Definition 2 (Guarding node). For a record r ∈ S, its guarding node r.GN is a node on the path from the root to r.A^s in the taxonomy of A^s.
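A guarding node is thus constrained to lie on the root-to-leaf path of the record's own sensitive value. A minimal check of this property (again only a sketch, reusing the toy taxonomy assumed above; the function names are ours, not the paper's) might look as follows.

```python
def parent_of(node, taxonomy):
    """Return the parent of `node` in a {parent: [children]} taxonomy, or None for the root."""
    for parent, children in taxonomy.items():
        if node in children:
            return parent
    return None

def is_valid_guarding_node(guard, sensitive_value, taxonomy):
    """Definition 2: the guarding node must lie on the path from the root to the
    sensitive value. `None` stands for the empty preference φ (the implicit node
    below the leaves), under which the record can be released as is."""
    if guard is None:
        return True
    node = sensitive_value
    while node is not None:
        if node == guard:
            return True
        node = parent_of(node, taxonomy)
    return False

# Record 1 of Table 3: is_valid_guarding_node("stomach disease", "gastric ulcer", TAXONOMY) -> True
# Record 5 of Table 3: is_valid_guarding_node(None, "flu", TAXONOMY)                        -> True
```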
Through r.GN, o indicates that she/he does not want the public to associate her/him with any leaf A^s value in ST(r.GN). Specifically, assume that ST(r.GN) contains x leaf values v_1, v_2, ..., v_x. The privacy requirement of r.GN is breached if an adversary believes that any of the associations {o, v_1}, ..., {o, v_x} exists in S.

Definition 3 (Breach probability). For a record r ∈ S, its breach probability P_breach(r) equals the probability that an adversary can infer from S* that any of the associations {o, v_1}, ..., {o, v_x} exists in S, where v_1, v_2, ..., v_x are the leaf values in ST(r.GN).

The released table S* should guarantee that, for all r ∈ S, P_breach(r) is at most p_breach, a system parameter specifying the amount of confidentiality control. Table 3 shows the guarding nodes selected by the individuals. Guarding nodes depend entirely on personal preferences and are not determined by the sensitive values.

3. Personalized Anonymity Algorithm Using Clustering Techniques

The key idea underlying our method is that the personalized anonymity problem can be viewed as a clustering problem [10]. Clustering is the problem of partitioning a set of objects into groups such that objects in the same group are more similar to each other than to objects in other groups with respect to some defined similarity criteria.

3.1. Personalized k-Member Clustering Problem

Typical clustering problems require that a specific number of clusters be found. However, the personalized anonymity problem does not constrain the number of clusters; instead, it requires that each cluster contain at least k records and that each sensitive attribute value satisfy the personal privacy requirements. Thus, we pose the personalized anonymity problem as a clustering problem, referred to as the personalized k-member clustering problem.

Definition 4 (Personalized k-member clustering problem). The personalized k-member clustering problem is to find a set of clusters from a given set of n records such that each cluster contains at least k (k ≤ n) data points and the sum of all intra-cluster distances is minimized. Formally, let S be a set of n records, k the specified anonymization parameter, and p_breach a system parameter specifying the amount of confidentiality control. Then the optimal solution of the personalized k-member clustering problem is a set of clusters E = {e_1, ..., e_m} such that:
(1) ∀ i ≠ j ∈ {1, ..., m}, e_i ∩ e_j = ∅;
(2) ∪_{i=1,...,m} e_i = S;
(3) ∀ e_i ∈ E, |e_i| ≥ k;
(4) Σ_{l=1,...,m} |e_l| · MAX_{i,j=1,...,|e_l|} Δ(p(l,i), p(l,j)) is minimized;
(5) ∀ r ∈ S, P_breach(r) ≤ p_breach;
(6) each sensitive attribute value satisfies its personal privacy requirement;
where |e| is the size of cluster e, p(l,i) denotes the i-th data point in cluster e_l, and Δ(x, y) is the distance between two data points x and y.
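Conditions (1)-(3) and (5) of Definition 4 can be checked mechanically for a candidate clustering; condition (4) is the optimization objective, and condition (6) is enforced later by the SA-generalization step. The following sketch is only illustrative (it is not part of the paper's algorithm) and assumes each record carries a hypothetical, pre-computed breach-probability estimate.

```python
def satisfies_constraints(clusters, records, k, p_breach):
    """Check conditions (1)-(3) and (5) of Definition 4 for a candidate clustering.
    `clusters` is a list of lists of record ids and `records` maps id -> record dict;
    the field 'breach_prob' is a hypothetical placeholder for an estimate of
    P_breach(r) under the candidate release."""
    seen = set()
    for cluster in clusters:
        if len(cluster) < k:                 # (3) every cluster holds at least k records
            return False
        if seen & set(cluster):              # (1) clusters are pairwise disjoint
            return False
        seen |= set(cluster)
    if seen != set(records):                 # (2) the clusters together cover all of S
        return False
    # (5) every record's breach probability stays within the system threshold
    return all(records[rid]["breach_prob"] <= p_breach for rid in records)
```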
3.2. Distance and Cost Functions

Distance functions are used to measure the dissimilarity between data points and are usually determined by the type of data being clustered. The cost function that the clustering problem tries to minimize is defined by the specific objective of the clustering problem. In this section, we describe our distance and cost functions, which have been specifically tailored to the k-anonymity problem [10].

Definition 5 (Distance between two numeric values). Let D be a finite numeric domain. Then the normalized distance between two values v_i, v_j ∈ D is defined as:
δ_N(v_i, v_j) = |v_i − v_j| / |D|,   (1)
where |D| is the domain size, measured as the difference between the maximum and minimum values in D.

For categorical attributes, such a difference is no longer applicable, as most categorical domains cannot be enumerated in any specific order. However, some domains may have semantic relationships among their values, and in such domains it is desirable to define the distance functions based on these relationships. Such relationships can easily be captured in a taxonomy tree. For example, Fig. 1 illustrates a natural taxonomy tree for the Disease attribute. Furthermore, we can specify the distance between adjacent levels according to the specific instance; the distance between lower levels is smaller than that between higher levels. We now define the distance function for categorical values as follows:

Definition 6 (Distance between two categorical values). Let D be a categorical domain and T_D a taxonomy tree defined for D. We assume that the total distance from the bottom level to the top level of the taxonomy tree is 1. Then the normalized distance between two values v_i, v_j ∈ D is defined as:
δ_C(v_i, v_j) = W(Λ(v_i, v_j)),   (2)
where Λ(v_i, v_j) is the subtree rooted at the lowest common ancestor of v_i and v_j, and W(T) is the sum of the distances between the levels spanned by tree T.

Example 1. Consider the attribute Disease and its taxonomy tree in Fig. 1. Assume the distance between level 1 and level 2 is 0.7, and the distance between level 2 and level 3 is 0.3. Then the distance between dyspepsia and gastritis is 0.3 (their lowest common ancestor, stomach disease, lies at level 2), while the distance between dyspepsia and flu is 0.3 + 0.7 = 1 (their lowest common ancestor is the root).

Definition 7 (Distance between two records). Let QT = {N_1, ..., N_m, C_1, ..., C_n} be the quasi-identifier of table S, where N_i (i = 1, ..., m) is an attribute with a numeric domain and C_j (j = 1, ..., n) is an attribute with a categorical domain, and let A^s be a sensitive attribute with a categorical domain. The distance between two records r_1, r_2 ∈ S is defined as:
Δ(r_1, r_2) = Σ_{i=1,...,m} w_i · δ_N(r_1[N_i], r_2[N_i]) + Σ_{j=1,...,n} w_j · δ_C(r_1[C_j], r_2[C_j]) + w_s · δ_C(r_1[A^s], r_2[A^s]),   (3)
where r_i[A] denotes the value of attribute A in record r_i, w_i (respectively w_j, w_s) is the weight of attribute N_i (respectively C_j, A^s), and Σ_{i=1}^{m} w_i + Σ_{j=1}^{n} w_j + w_s = 1.

Definition 8 (Information loss). Let e = {r_1, ..., r_c} be a cluster whose quasi-identifier consists of numeric attributes N_1, ..., N_m and categorical attributes C_1, ..., C_n. Let T_{C_j} be the taxonomy tree defined for the domain of categorical attribute C_j. Let MIN_{N_i} and MAX_{N_i} be the minimum and maximum values in e with respect to attribute N_i, and let ∪C_j be the union of the values in e with respect to attribute C_j. Let ST_{t_k.GN} be the subtree of the guarding node of record t_k. Then the amount of information loss incurred by generalizing e, denoted IL(e), is defined as:
IL(e) = |e| · ( Σ_{i=1,...,m} (MAX_{N_i} − MIN_{N_i}) / |N_i| + Σ_{j=1,...,n} W(Λ(∪C_j)) ) + Σ_{k=1,...,|e|} W(ST_{t_k.GN}),   (4)
where |e| is the number of records in e, |N_i| is the size of the numeric domain N_i, Λ(∪C_j) is the subtree rooted at the lowest common ancestor of all values in ∪C_j, and W(T) is the sum of the distances between the levels spanned by tree T.
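To make these definitions concrete, the sketch below implements the numeric and categorical distances of Definitions 5 and 6 and the record distance of Definition 7, reusing the toy Disease taxonomy assumed in Section 2.2 and the level distances of Example 1. It is an illustrative reading of the formulas, not the authors' code.

```python
# Level distances as in Example 1: level 1 (root) to level 2 is 0.7, level 2 to
# level 3 (the leaves) is 0.3, so a whole-tree span has weight 1. A single table is
# shared here for simplicity; each taxonomy could carry its own level distances.
LEVEL_DIST = {1: 0.7, 2: 0.3}        # LEVEL_DIST[d] = distance between level d and d+1

def parent_of(node, taxonomy):
    """Parent of `node` in a {parent: [children]} taxonomy, or None for the root."""
    return next((p for p, cs in taxonomy.items() if node in cs), None)

def ancestors(node, taxonomy):
    """Path from `node` up to the root, inclusive, ordered bottom-up."""
    path = [node]
    while parent_of(path[-1], taxonomy) is not None:
        path.append(parent_of(path[-1], taxonomy))
    return path

def subtree_weight(node, taxonomy):
    """W(T) for the subtree rooted at `node`: the sum of level distances it spans."""
    level = len(ancestors(node, taxonomy))          # the root sits at level 1
    leaf_level = max(LEVEL_DIST) + 1
    return sum(LEVEL_DIST[d] for d in range(level, leaf_level))

def delta_numeric(v1, v2, domain_size):
    """Definition 5: normalized distance between two numeric values."""
    return abs(v1 - v2) / domain_size

def delta_categorical(v1, v2, taxonomy):
    """Definition 6: W of the subtree rooted at the lowest common ancestor of v1, v2."""
    if v1 == v2:
        return 0.0
    anc_v2 = set(ancestors(v2, taxonomy))
    lowest_common = next(a for a in ancestors(v1, taxonomy) if a in anc_v2)
    return subtree_weight(lowest_common, taxonomy)

def record_distance(r1, r2, num_domains, taxonomies, weights):
    """Definition 7: weighted sum of attribute-wise distances. `weights` holds one
    weight per attribute (numeric QI, categorical QI and the sensitive attribute)
    and is assumed to sum to 1; records are plain dicts keyed by attribute name."""
    total = 0.0
    for attr, size in num_domains.items():
        total += weights[attr] * delta_numeric(r1[attr], r2[attr], size)
    for attr, tax in taxonomies.items():
        total += weights[attr] * delta_categorical(r1[attr], r2[attr], tax)
    return total

# Example 1 revisited, with TAXONOMY from the Section 2.2 sketch:
#   delta_categorical("dyspepsia", "gastritis", TAXONOMY)  -> 0.3
#   delta_categorical("dyspepsia", "flu", TAXONOMY)        -> 1.0 (up to float rounding)
```

Equation (4) then follows the same building blocks: normalized range widths for numeric attributes, W(Λ(∪C_j)) for categorical ones, plus the subtree weights of the records' guarding nodes.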
Definition 9 (Total information loss). Let E be the set of all equivalence classes in the anonymized table S*. Then the total information loss of S* is defined as:
Total-IL(S*) = Σ_{e∈E} IL(e).   (5)

3.3. Personalized Anonymity Algorithm Using Clustering Techniques

Based on the above concepts, we propose a new algorithm for computing a personalized anonymous table S* with small Total-IL(S*) that guarantees P_breach(r) ≤ p_breach for all r ∈ S. Our anonymization algorithm consists of three steps, as shown in Fig. 2. The first step produces a set of QI-groups, denoted temp, by applying clustering to S. The second step produces anonymous QI-groups by applying QI-generalization to temp, using a set of generalization functions G = {f_1, f_2, ..., f_d} on the d QI-attributes, respectively. Finally, the third step produces the final releasable table S* by performing SA-generalization on the resulting QI-groups, employing a specialized generalization function for each QI-group. Hence, the quality of S* depends on (i) the method of clustering, (ii) the choice of the generalization functions G, and (iii) the effectiveness of SA-generalization. Please refer to [13] for the details of the SA-generalization algorithm.

Algorithm Clustering_Personalized_Anonymity
Input: a set of records S, a threshold value k, and the guarding nodes of all records.
Output: the releasable dataset S*.
1. temp = ∅; r = a randomly picked record from S;
2. while ( |S| ≥ k )
   a) r = the furthest record from r;
   b) S = S − {r};
   c) e = {r};
   d) while ( |e| < k )
      i.   r = find_best_record(S, e);
      ii.  S = S − {r};
      iii. e = e ∪ {r};
   e) temp = temp ∪ {e};
3. while ( |S| ≠ 0 )
   a) r = a randomly picked record from S;
   b) S = S − {r};
   c) e = find_best_cluster(temp, r);
   d) e = e ∪ {r};
4. temp' = QI-generalization(temp);
5. S* = SA-generalization(temp');
6. return S*.

Fig. 2 Algorithm for Personalized Anonymity Using Clustering Techniques
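For readers who prefer running code, the clustering phase (steps 1-3 of Fig. 2) can be prototyped directly from the pseudocode. The following Python sketch is ours, not the authors' implementation: it takes any pairwise distance function (for example the record distance of Definition 7) together with a stand-in cost for find_best_record / find_best_cluster, and leaves QI-generalization and SA-generalization as separate steps.

```python
import random

def added_cost(cluster, record, dist):
    """A simple stand-in for the cost behind find_best_record / find_best_cluster:
    the largest distance from `record` to any current member. The paper's choice is
    the information-loss increase of Definition 8; any comparable cost can be
    plugged in here."""
    return max(dist(record, member) for member in cluster)

def cluster_records(records, k, dist):
    """Greedy clustering phase, following steps 1-3 of Fig. 2. `records` is a list
    of records and `dist(r1, r2)` a pairwise distance. Returns a list of clusters,
    each with at least k members (assuming len(records) >= k)."""
    S = list(records)
    clusters = []
    r = random.choice(S)                           # step 1: random seed record
    while len(S) >= k:                             # step 2: form clusters of size k
        r = max(S, key=lambda x: dist(x, r))       # 2a: record furthest from r
        S.remove(r)
        e = [r]                                    # 2b-2c: start a new cluster
        while len(e) < k:                          # 2d: greedily fill the cluster
            r = min(S, key=lambda x: added_cost(e, x, dist))
            S.remove(r)
            e.append(r)
        clusters.append(e)                         # 2e: keep the finished cluster
    while S:                                       # step 3: distribute the leftovers
        r = S.pop(random.randrange(len(S)))
        best = min(clusters, key=lambda c: added_cost(c, r, dist))
        best.append(r)
    return clusters
```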
4. Experimental Results

The main goal of the experiments was to investigate the performance of our algorithm in terms of data quality and execution time. To evaluate our approach accurately, we compared our implementation with the generalization algorithm proposed in [8].

4.1. Experimental Setup

The experiments were performed on a machine with an Intel(R) Core(TM)2 Duo T5450 1.67 GHz CPU and 2.0 GB of RAM, running Windows XP, MATLAB 7.0, and Visual C++ 6.0. We ran our experiments on the IPUMS database. The dataset contains a relation with 100k records, each storing the information of an American adult. Before the experiments, we removed records with missing values and kept only six of the original attributes. For k-anonymity, we considered {age, education, gender, marital status, occupation} as the quasi-identifier. Among these, age and education were treated as numeric attributes, while the other three were treated as categorical attributes. In addition, we considered income as the sensitive attribute. The taxonomy trees for the attributes are described in Table 4. Records were randomly divided into three groups accounting for 10%, 30%, and 60% of the dataset, respectively. For each record in the first group, the guarding node is the parent of its sensitive value; for each record in the second group, it is φ; the guarding nodes of the records in the last group are their sensitive values.

Table 4 Experimental Data Information
Attribute        Distinct values  Tree level  Sensitive?
age              8                3           No
education        15               3           No
gender           2                2           No
marital status   6                3           No
occupation       8                3           No
income           10               4           Yes

4.2. Data Quality

In this section, we report the data quality results for our algorithm and the generalization algorithm. Figure 3 shows the Total-IL costs of the two algorithms for increasing values of k. As the figure illustrates, our algorithm yields a lower Total-IL cost for k values from 10 to 100. The superiority of our algorithm over the generalization algorithm stems from the fact that the generalization algorithm anonymizes the dataset by generalization alone.

Fig. 3 k-Value and Information Loss Metric

4.3. Execution Time

The execution time of the two algorithms for different k values is shown in Figure 4. The execution time of our algorithm is higher than that of the generalization algorithm for all k values because our algorithm searches the whole dataset when generating QI-groups. Even so, we believe the running time is still acceptable in practice, as anonymization is usually performed off-line.

Fig. 4 k-Value and Execution Time

5. Conclusions

Existing anonymization methods are inadequate because they cannot guarantee privacy preservation in all cases and often suffer from unnecessary information loss due to excessive generalization. In this paper, we studied the essential principle of personalized anonymity and proposed a new personalized anonymity algorithm by transforming the anonymity problem into a clustering problem. We also defined two important elements of clustering, namely the distance and cost functions, which are specifically tailored to the k-anonymity problem. The proposed personalized anonymity algorithm allows an individual to specify her/his privacy preservation preference while protecting against sensitive attribute disclosure. Extensive experimental results show that the proposed algorithm is effective and may cause less information loss during the generalization process. However, the proposed algorithm is not optimal, in the sense that it does not necessarily achieve the lowest information loss. Finding an approximately optimal solution is our future work [15].

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under Grant No. 71071001 and by the Natural Science Foundation of Anhui Province of China under Grant No. 11040606M140.

References

[1] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5): 557-570, 2002.
[2] R. Bayardo, R. Agrawal. Data Privacy Through Optimal k-Anonymization. In Proceedings of the 21st International Conference on Data Engineering, pages 217-228, 2005.
[3] K. LeFevre, D. DeWitt, R. Ramakrishnan. Incognito: Efficient Full-Domain k-Anonymity. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 49-60, 2005.
[4] X.K. Xiao, Y.F. Tao. Anatomy: Simple and Effective Privacy Preservation. In Proceedings of the Very Large Data Bases (VLDB) Conference, pages 139-150, 2006.
[5] A. Machanavajjhala, J. Gehrke, D. Kifer. l-Diversity: Privacy Beyond k-Anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1): 24-35, 2007.
[6] N.H. Li, T.C. Li. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In Proceedings of the 23rd International Conference on Data Engineering, pages 106-115, 2007.
[7] X.K. Xiao, Y.F. Tao. m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets. In Proceedings of the ACM Conference on Management of Data (SIGMOD), pages 689-700, 2007.
[8] X.K. Xiao, Y.F. Tao. Personalized Privacy Preservation. In Proceedings of the ACM Conference on Management of Data (SIGMOD), pages 229-240, 2006.
[9] X.J. Ye, Y.W. Zhang, M. Liu. A Personalized (a,k)-Anonymity Model. In Proceedings of the 9th International Conference on Web-Age Information Management, pages 341-348, 2008.
[10] J.W. Byun, A. Kamra, E. Bertino, et al. Efficient k-Anonymization Using Clustering Techniques. In Proceedings of DASFAA 2007, pages 188-200, 2007.
[11] G. Loukides, J.H. Shao. An Efficient Clustering Algorithm for k-Anonymisation. Journal of Computer Science and Technology, 23(2): 188-202, 2008.
[12] J.L. Lin, M.C. Wei. Genetic Algorithm-Based Clustering Approach for k-Anonymization. Expert Systems with Applications, 36(6): 9784-9792, 2009.
[13] L.J. Lu, X.J. Ye. An Improved Weighted-Feature Clustering Algorithm for k-Anonymity. In Proceedings of the 5th International Conference on Information Assurance and Security, pages 415-419, 2009.
[14] Z.H. Wang, J. Xu, W. Wang, et al. Clustering-Based Approach for Data Anonymization. Chinese Journal of Software, 21(4): 680-693, 2010.
[15] Y.H. Yu, W.Y. Bai. Integrated Privacy Protection and Access Control over Outsourced Database Services. Journal of Computational Information Systems, 6(8): 2767-2777, 2010.