An Adaptable Perturbation Model of Privacy Preserving Data Mining

Li Liu, Bhavani Thuraisingham, Murat Kantarcioglu and Latifur Khan
Computer Science Department
University of Texas at Dallas
{liliu, bhavani.thuraisingham, muratk, lkhan}@utdallas.edu
Abstract
Several approaches to privacy preserving data mining have been developed in recent years. These
approaches fall into two main categories: those based on perturbation and randomization
techniques [1-4], and those based on secure multi-party computation (SMC) [5-9].
The approach proposed by Kargupta et al. in [10] poses a challenge to the perturbation- and
randomization-based approaches. It claims that introducing random noise to the data may lose
information and still fail to provide privacy. By using random matrix properties,
Kargupta et al. successfully separate the data from the random noise and subsequently disclose
the original data. Several approaches [5-9] fall into the second category (i.e., multi-party
computation), but they all require very high computation costs. Furthermore, these multi-party
computation based approaches assume that each party uses the same data scheme, and therefore work
only in a homogeneous environment. Heterogeneity, where different parties use different schemes,
is a major issue that we need to tackle in the future.
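As a small illustration of the perturbation setting discussed above, the following sketch adds independent Gaussian noise to each record and then checks that an aggregate (the variance) can still be estimated from the perturbed values. It is only a sketch under stated assumptions: the function name `perturb`, the uniformly distributed test data, and the sample size are hypothetical, and the noise level 0.30 is borrowed from the figure title rather than from a description of the experiments.

```python
import random
import statistics


def perturb(values, noise_std, rng):
    """Return a copy of values with independent zero-mean Gaussian noise added."""
    return [x + rng.gauss(0.0, noise_std) for x in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical sensitive attribute, uniformly distributed in [0, 1).
original = [rng.random() for _ in range(10000)]
noise_std = 0.30  # assumed noise level (value borrowed from the figure title)
perturbed = perturb(original, noise_std, rng)

# Individual records are masked, but aggregates survive: the noise is
# independent of the data, so Var(perturbed) is close to
# Var(original) + noise_std ** 2.
var_original = statistics.pvariance(original)
var_perturbed = statistics.pvariance(perturbed)
estimated_var = var_perturbed - noise_std ** 2
print(round(var_original, 3), round(estimated_var, 3))
```

The miner is assumed to know the noise distribution but not the individual noise values, which is the same knowledge that both reconstruction approaches studied here start from.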
Randomization and perturbation are two very important techniques in privacy preserving
data mining. There is always a trade-off between loss of information and preservation of privacy.
Furthermore, an approach that uses random matrix properties has recently posed a challenge to the
perturbation-based techniques. The question is, can perturbation-based techniques still protect
privacy? To answer this question, we scrutinize two different approaches: one proposed by
Agrawal et al. using Bayes density functions and the other proposed by Kargupta et al. using
random matrix properties. We set up simulation experiments to study these two approaches and to
ask what, besides the properties of the random noise, must be known in order to reconstruct the
original distribution. First, we compared the assumptions and preconditions of the two approaches.
Then, by using different conditions, we obtained some interesting results and made some
observations. We propose a modified version of the algorithm of Agrawal et al., which reconstructs
the original distribution from the perturbed distribution rather than from the perturbed data.
Furthermore, under the same conditions, we failed to obtain the original distribution using the
random matrix filter approach. We give a hypothesis to explain this observation. Based on this
hypothesis, we propose an adaptable perturbation model, which accounts for the diversity of
information sensitivity. The adaptable perturbation model presented here has a parameter that
adjusts the perturbation level to best fit different privacy concerns.
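To make the idea of reconstructing a distribution from a perturbed distribution more concrete, here is a minimal sketch of an iterative Bayesian update in the style of Agrawal and Srikant [1], applied to a histogram of the perturbed values rather than to individual records. It is not the modified algorithm proposed here, whose details are not given in this abstract. The function names, the bin counts, the beta-distributed test data, and the fixed iteration count (used in place of a stopping criterion) are all assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(z, std):
    """Density of a zero-mean Gaussian with standard deviation std at z."""
    return math.exp(-0.5 * (z / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def bin_midpoints(lo, hi, bins):
    """Midpoints of `bins` equal-width intervals covering [lo, hi]."""
    width = (hi - lo) / bins
    return [lo + (k + 0.5) * width for k in range(bins)]


def histogram(values, lo, hi, bins):
    """Count values per interval; out-of-range values go to the edge bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        k = min(max(int((v - lo) / width), 0), bins - 1)
        counts[k] += 1
    return counts


def reconstruct(w_counts, w_mids, x_mids, noise_std, iterations=100):
    """Estimate P(X in interval t) from a histogram of W = X + Y.

    Each pass applies Bayes' rule per perturbed interval, using the known
    noise density and the current estimate as the prior, then averages the
    posteriors weighted by the interval counts.
    """
    kernel = [[gaussian_pdf(w - x, noise_std) for x in x_mids] for w in w_mids]
    p = [1.0 / len(x_mids)] * len(x_mids)  # start from a uniform estimate
    for _ in range(iterations):
        new_p = [0.0] * len(x_mids)
        for s, count in enumerate(w_counts):
            if count == 0:
                continue
            denom = sum(k * q for k, q in zip(kernel[s], p))
            if denom == 0.0:
                continue  # this interval is unreachable under the estimate
            for t, (k, q) in enumerate(zip(kernel[s], p)):
                new_p[t] += count * k * q / denom
        total = sum(new_p)
        p = [v / total for v in new_p]
    return p


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical original data on [0, 1]; noise level assumed to be 0.30.
    original = [rng.betavariate(2, 5) for _ in range(5000)]
    noise_std = 0.30
    perturbed = [x + rng.gauss(0.0, noise_std) for x in original]
    w_counts = histogram(perturbed, -1.0, 2.0, 60)
    estimate = reconstruct(w_counts, bin_midpoints(-1.0, 2.0, 60),
                           bin_midpoints(0.0, 1.0, 20), noise_std)
    print([round(v, 3) for v in estimate])
```

In this sketch the `noise_std` argument plays the role of a perturbation level: larger values mask individual records more strongly and make the reconstruction less sharp. How the adaptable model actually parameterizes that level is not specified in this abstract.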
[Figure: two panels, each plotting counts (0 to 1200) over the value range -1 to 2 for the original data, the perturbed data, and the reconstructed data. In the first panel the reconstruction uses the modified algorithm. Title text as extracted: "Modified Agrawal et al. Algorithm on Construct Data in 1000 Intervals"; "Kargupta et al. Algorithm with Gaussian Distribution STD=0.30"; "with 40 interval with 0.25% stopping criterion".]
References:
[1]. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM
SIGMOD Conference on Management of Data, pages 439-450, Dallas, TX, May 14-19 2000.
ACM.
[2]. D. Agrawal and C. Aggarwal. On the design and quantification of privacy preserving data
mining algorithms. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems, pages 247-255, Santa Barbara, California,
USA, May 21-23 2001.
[3]. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of
association rules. In Proceedings of the 8th Conference on Knowledge Discovery and Data
Mining (KDD'02), 2002.
[4]. S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In
Proceedings of 28th International Conference on Very Large Data Bases. VLDB, Aug. 20-23 2002.
[5]. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin and M. Y. Zhu. Tools for Privacy Preserving
Distributed Data Mining, In SIGKDD Explorations, 4(2): 28-34 December 2002.
[6]. M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on
horizontally partitioned data. In ACM SIGMOD Workshop on Research Issues on Data
Mining and Knowledge Discovery, June 2002.
[7]. Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology -
CRYPTO 2000, pages 36-54. Springer-Verlag, August 20-24 2000.
[8]. J. Vaidya and Chris Clifton. Privacy preserving association rule mining in vertically
partitioned data. In Proceedings of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 639-644, Edmonton, Alberta, Canada, July
23-26 2002.
[9]. J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically
partitioned data. In Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 206-215, 2003.
[10]. H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of
random data perturbation techniques. In IEEE ICDM, 2003.
[11]. B. Thuraisingham, Privacy Constraint Processing in a Privacy Enhanced Database
Management System, To appear in Data and Knowledge Engineering Journal, 2005