An Adaptable Perturbation Model of Privacy Preserving Data Mining

Li Liu, Bhavani Thuraisingham, Murat Kantarcioglu and Latifur Khan
Computer Science Department
University of Texas at Dallas
{liliu, bhavani.thuraisingham, muratk, lkhan}@utdallas.edu

Abstract

Several approaches to privacy preserving data mining have been developed in recent years. These approaches fall into two main categories: those based on perturbation and randomization techniques [1-4], and those based on secure multi-party computation (SMC) [5-9]. The approach proposed by Kargupta et al. in [10] poses a challenge to the perturbation and randomization-based approaches. It claims that introducing random noise into the data may lose information and still fail to provide privacy. By using random matrix properties, Kargupta et al. successfully separate the data from the random noise and subsequently disclose the original data. Several approaches [5-9] fall into the second category (i.e., multi-party computation), but they all incur very high computation costs. Furthermore, these multi-party computation based approaches assume that each party uses the same data scheme, and therefore work only in a homogeneous environment. Heterogeneity, where different parties use different schemes, is a major issue that we need to tackle in the future.

Randomization and perturbation are two very important techniques in privacy preserving data mining. There is always a trade-off between loss of information and preservation of privacy. Furthermore, an approach that uses random matrix properties has recently posed a challenge to the perturbation-based techniques. The question is: can perturbation-based techniques still protect privacy? To answer this question, we scrutinize two different approaches: one proposed by Agrawal et al. using Bayes density functions, and the other proposed by Kargupta et al. using random matrices.
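As an illustration of the perturbation-and-reconstruction setting discussed above, the following is a minimal sketch, not the algorithm evaluated in this work. It assumes additive Gaussian noise and uses a commonly described iterative Bayes update over discretized intervals; the function names (`perturb`, `reconstruct`), the uniform test data, the noise level of 0.30, and the choice of 30 intervals are all hypothetical choices made for the example.

```python
import math
import random

def perturb(values, noise_std, rng):
    """Additive perturbation: release w = x + y with y ~ N(0, noise_std^2)."""
    return [x + rng.gauss(0.0, noise_std) for x in values]

def gaussian_pdf(d, std):
    """Density of the (assumed) Gaussian noise at offset d."""
    return math.exp(-0.5 * (d / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def reconstruct(perturbed, noise_std, lo, hi, bins, max_iter=100, tol=1e-4):
    """Estimate the probability mass of the original values in each interval."""
    width = (hi - lo) / bins
    mids = [lo + (j + 0.5) * width for j in range(bins)]
    # Noise likelihood of each released value given each interval midpoint.
    like = [[gaussian_pdf(w - m, noise_std) for m in mids] for w in perturbed]
    p = [1.0 / bins] * bins  # start from a uniform guess
    for _ in range(max_iter):
        new_p = [0.0] * bins
        for row in like:
            # Bayes rule: posterior over intervals for this released value.
            post = [l * pj for l, pj in zip(row, p)]
            total = sum(post)
            for j in range(bins):
                new_p[j] += post[j] / total
        new_p = [v / len(like) for v in new_p]
        change = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if change < tol:  # stop once the estimate barely moves
            break
    return mids, p

# Illustrative run on synthetic data (assumed: uniform originals on [0, 1]).
rng = random.Random(0)
original = [rng.uniform(0.0, 1.0) for _ in range(500)]
released = perturb(original, 0.30, rng)
mids, p = reconstruct(released, 0.30, lo=-1.0, hi=2.0, bins=30)
est_mean = sum(m * pj for m, pj in zip(mids, p))
```

The sketch recovers only an interval-level distribution estimate from the released values and the known noise distribution; it does not attempt to recover individual records, which is the concern raised by the random matrix approach.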
We set up simulation experiments to study these two approaches. The question is: besides the properties of the random noise, what else do we know that bears on reconstructing the original distribution? First, we compared the assumptions and preconditions of the two approaches. Then, by varying the conditions, we obtained some interesting results and made some observations. We propose a modified version of the algorithm of Agrawal et al., which reconstructs the original distribution from the perturbed distribution rather than from the perturbed data. Under the same conditions, the random matrix filter approach failed to recover the original distribution. We give a hypothesis to explain this observation. Based on this hypothesis, we propose an adaptable perturbation model, which accounts for the diversity of information sensitivity. The adaptable perturbation model presented here has a parameter that adjusts the perturbation level to best fit different privacy concerns.

[Figure: two panels, each plotting counts (0 to 1200) over values from -1 to 2. Panel titles as extracted: "Modified Agrawal et al. Algorithm on Construct Data in 1000 Intervals" and "Kargupta et al. Algorithm with Gaussian Distribution STD=0.30 with 40 interval with 0.25% stopping criterion". Legends: Original Data, Perturbed, Reconstructed using modified algorithm (first panel); Original Data, Perturbed, Reconstructed (second panel).]

References:
[1]. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, pages 439-450, Dallas, TX, May 14-19 2000. ACM.
[2]. D. Agrawal and C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 247-255, Santa Barbara, California, USA, May 21-23 2001.
[3]. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules.
In Proceedings of the 8th Conference on Knowledge Discovery and Data Mining (KDD'02), 2002.
[4]. S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Aug. 20-23 2002.
[5]. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools for privacy preserving distributed data mining. In SIGKDD Explorations, 4(2):28-34, December 2002.
[6]. M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, June 2002.
[7]. Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology, CRYPTO 2000, pages 36-54. Springer-Verlag, August 20-24 2000.
[8]. J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 639-644, Edmonton, Alberta, Canada, July 23-26 2002.
[9]. J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206-215, 2003.
[10]. H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In IEEE ICDM, 2003.
[11]. B. Thuraisingham. Privacy constraint processing in a privacy enhanced database management system. To appear in Data and Knowledge Engineering Journal, 2005.