A Framework for Association Rule Generation Using Privacy
Enhancing Methodology for Vertically Partitioned Data Mining
Praveen Baskar, J. Gitanjali, K. Satheesh Kumar1, J. Indumathi2 and G.V. Uma3
Department of Computer Science and Engineering,
Anna University, Chennai-600 025.
Tamil Nadu, India
E-mail: 1Sathishkumar248@gmail.com, 2indumathi.j@gmail.com, 3gvuma@annauniv.edu
“Sufficiently advanced technology is indistinguishable from magic.”
- Arthur C Clarke
ABSTRACT: At its core, the value of privacy-preserving data mining derives not only from its ability to extract crucial knowledge, but also from its resilience to attack. It must perform at the required level in both crisis and normal operation. The central thrust of this work is towards a world with robust data security, where knowledge users continue to profit from data without compromising data privacy. The goal of privacy-preserving data mining is to release a dataset that researchers can study without being able to identify sensitive information about any individual in the data (with high probability). The main existing methods (i.e., the data obfuscation methods and the secure computation methods) are each limited in their own ways. In this paper, we therefore present a new archetype that performs enhanced privacy preservation for distributed data mining (i.e., on vertically partitioned data) without using the conventional techniques of perturbation or cryptography. We have implemented the new technique and evaluated its efficiency on our own conceptual framework. The framework was used to compare and contrast the techniques on a common platform, which serves as the basis for ascertaining the suitable technique for a given type of application of privacy-preserving shared filtering. We hope the proposed solution will open the way to new techniques and further research, and will perform well on evaluation metrics including hiding effects, data utility, and time performance.
Keywords—Estimator, Excavator, Privacy, Privacy Preserving Data Mining (PPDM), Vertical Partitioned Privacy
Preserving Data Mining (VPPPDM).
INTRODUCTION
The explosion of new data mining techniques has amplified privacy risks, because it is now feasible to combine and cross-examine massive data stores available on the web in search of previously unknown hidden patterns. To make a publicly accessible system safe, we must guarantee not only that private sensitive data have been trimmed out, but also that certain inference channels have been blocked. Both the data and the knowledge concealed in the data should be made secure. Furthermore, the requirement to make the system as open as possible, to the extent that data sensitivity is not jeopardized, calls for diverse techniques that account for the disclosure control of sensitive data.
Currently, databases are distributed either horizontally or vertically among several organizations that would like to collaborate in order to extract global knowledge, but privacy concerns may prevent these parties from directly sharing the data among themselves. Privacy of databases is of foremost concern when data is shared for collaborative data mining in knowledge discovery systems, and it is addressed by Privacy Preserving Data Mining (PPDM). PPDM is a discipline that aims to enable the release of such data while preserving privacy. There are several mechanisms by which data privacy can be achieved. An important open issue is therefore to decide which of these privacy-preserving techniques best protects sensitive information, and by which criteria one technique can be judged better than another.
LITERATURE SURVEY
The motivation for the architecture proposed here emerges from a consideration of the role that privacy plays in people's lives; of privacy legislation, together with individual citizens' privacy preferences and any additional privacy constraints required by organisations or imposed by regulatory bodies; and of a brief survey of both the research and the state of practice in Privacy-Enhancing Technologies.
R. Agrawal et al. (2000) [11] used the perturbation technique to preserve privacy: noise is added so that the original data remain secret, while the noise averages out in the output. Despite the simplicity of the method, it lacks a formal framework for proving how much privacy is guaranteed. Although other models [2, 4, 6] exist for studying the privacy achievable through perturbation, there is no formal way to model and quantify the privacy threat from mining perturbed data. Recently, some results [13, 5] have shown that for some data and some kinds of noise, perturbation provides almost no privacy at all.
Mobile and Pervasive Computing (CoMPC–2008)
Table 1: Comparison between the contemporary methods

Perturbation Method
  Advantages:
    - Simplicity
    - Easy to use
  Disadvantages:
    - Reduced accuracy
    - It lacks a formal framework for proving how much privacy is guaranteed.
    - There is no formal way to model and quantify the privacy threat from mining of perturbed data.

Secure Computation Method
  Advantages:
    - Offers a well-defined model for privacy, which includes methodologies for proving and quantifying it.
    - There is a vast toolset of cryptographic algorithms and constructs which can be used for implementing privacy-preserving data mining algorithms.
    - It provides accurate results and not approximations.
  Disadvantages:
    - Increased overhead
    - It is a much slower method.
    - Requires considerable computation and communication overhead for each SC step.
Problem Statement
To conceive, design, and implement an archetype and algorithms that enhance the privacy of health care databases.
Problem Description
We propose a new model for privacy-preserving distributed data mining that does not use the contemporary methods, and formulate an enhanced algorithm for vertically partitioned databases. Association rule mining under this model is the scope of our work. The main idea lies in the architecture, which separates the entity that computes the results from the entity that finally receives the results and knows what they mean. We also discuss the privacy and performance characteristics of the approach.
ARCHITECTURE OF THE PROPOSED WORK
Figure 2 shows a block diagram of the proposed architecture. Its labelled elements include the Data Analyst and the path that returns results to the data owners.
The other approach to privacy-preserving data mining uses cryptographic techniques [10, 3, 7], most often secure computation [12, 9, 8, 1]. Taking into consideration the issues tabulated in Table 1 and the taxonomy in Figure 1, we present a new archetype that performs enhanced privacy preservation for distributed data mining without using the conventional techniques of perturbation or cryptography. We present an algorithm for association rule mining on vertically partitioned data that uses this new paradigm, and we support it by discussing its privacy and performance characteristics.
[Fig. 1: Taxonomy of PPDM techniques. Recoverable labels: perturbation; cryptographic technique; secure computation; without perturbation and secure computation.]
[Fig. 2: Archetype for VPPDDM. Recoverable labels: local databases Db1, Db2, ..., Dbw (Db = database); Encryption; Encrypted Data; Privacy Preserving Data Mining; Estimator; Excavator.]
Table 1, Perturbation Method, advantages: simplicity; easy to use.
The main idea is an architectural one. N parties collaborate with each other, and the work of the algorithm is divided between an Excavator and an Estimator. The Excavator decides which computation is to be done, and the Estimator carries out the computation without learning anything about the item-sets involved.
IMPLEMENTATION
The Goal: To compute the frequent itemsets in a vertically partitioned database without compromising the participants’ privacy.
Privacy Definition: Privacy is compromised if any participant can compute some specific value of the database with high probability. By a specific value we mean the value in a database cell that belongs to some specific transaction.
ALGORITHM USED
Step 1: The Excavator sets the variable i (the size of the item-sets currently being checked) to 1.
Step 2: The Excavator chooses a random ordering of the item-sets of i elements and then iterates over them one by one.
Step 3: For each item-set from Step 2, if all the subsets of the current item-set are frequent (apriori principle), the Excavator orders all partakers to encrypt the transaction-ids of their transactions using the same key.
Step 4: The Excavator then asks every partaker about the frequency of the current item-set.
Step 5: The partakers send the encrypted transaction-ids of all relevant transactions to the “Estimator”.
Step 6: The “Estimator” finds the intersection of the encrypted transaction-ids.
Step 7: The “Estimator” informs the Excavator whether the current item-set is frequent.
Step 8: While i is smaller than the number of database attributes, the value of i is incremented by 1 and the algorithm returns to Step 2.
Step 9: The Excavator sends the results to the participants.
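The nine steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the names `Partaker`, `estimator` and `excavator` are hypothetical, a keyed hash (HMAC-SHA256) stands in for the unspecified encryption with a key shared by the partakers, and an item-set is taken to be frequent when the number of common transactions is at least `min_count`.

```python
import hashlib
import hmac
import random
from itertools import combinations


class Partaker:
    """A party holding some binary attributes (columns) for every transaction."""

    def __init__(self, columns, key):
        self.columns = columns  # attribute name -> list of 0/1, one per transaction
        self.key = key          # secret shared by the partakers, unknown to the Estimator

    def encrypted_tids(self, items):
        """Steps 3-5: encrypted ids of the transactions that contain all local items."""
        n = len(next(iter(self.columns.values())))
        matching = (t for t in range(n) if all(self.columns[i][t] == 1 for i in items))
        # Assumption: HMAC-SHA256 stands in for the unspecified encryption.
        return {hmac.new(self.key, str(t).encode(), hashlib.sha256).hexdigest()
                for t in matching}


def estimator(encrypted_sets, min_count):
    """Steps 6-7: intersect the encrypted ids and report only a yes/no answer."""
    return len(set.intersection(*encrypted_sets)) >= min_count


def excavator(partakers, min_count):
    """Steps 1-2 and 8-9: level-wise search over item-sets of growing size i."""
    all_items = sorted(item for p in partakers for item in p.columns)
    frequent = set()
    for i in range(1, len(all_items) + 1):
        candidates = list(combinations(all_items, i))
        random.shuffle(candidates)  # Step 2: random order of the candidates
        for itemset in candidates:
            # Apriori principle: every (i-1)-subset must already be frequent.
            if i > 1 and any(frozenset(s) not in frequent
                             for s in combinations(itemset, i - 1)):
                continue
            encrypted_sets = []
            for p in partakers:
                local = [item for item in itemset if item in p.columns]
                if local:
                    encrypted_sets.append(p.encrypted_tids(local))
            if estimator(encrypted_sets, min_count):
                frequent.add(frozenset(itemset))
    return frequent
```

In this sketch the Estimator sees only opaque strings and returns a single boolean, while the Excavator sees item-set names and booleans but never transaction ids, which mirrors the separation of roles described in the text.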
RESULT AND ANALYSIS
In vertically partitioned data, different sites collect information about the same set of entities, but each collects a different feature set; that is, the attributes of each record (entity) are split across parties. The relations at the individual sites must therefore be joined to obtain the relation to be mined.
Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of attributes, usually called items, and let $D$ be a set of transactions. Each transaction in $D$ is a set $T \subseteq I$ of items. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ has support $s$ in $D$ if at least $s\%$ of the transactions in $D$ contain $X \cup Y$. The rule $X \Rightarrow Y$ has confidence $c$ in $D$ if at least $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The problem is to find all association rules with support and confidence above certain thresholds (usually referred to as minsup and minconf).
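As a small worked illustration of these definitions, the following Python sketch computes support and confidence as percentages over a toy transaction set. The function names and the item names (`bread`, `milk`, `butter`) are hypothetical and chosen only for the example.

```python
def support(transactions, itemset):
    """Percentage of transactions in D that contain every item of the item-set."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return 100.0 * hits / len(transactions)


def confidence(transactions, x, y):
    """Percentage of the transactions containing X that also contain Y."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # convention for this sketch: no transaction contains X
    return 100.0 * support(transactions, set(x) | set(y)) / support_x


# Hypothetical toy database D with four transactions.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(D, {"bread", "milk"})       # 2 of 4 transactions contain both items
c = confidence(D, {"bread"}, {"milk"})  # 2 of the 3 bread transactions contain milk
```

Here the rule bread ⇒ milk has 50% support and about 66.7% confidence, so it would pass thresholds of, say, minsup = 40 and minconf = 60.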
An association rule expresses the dependence of a set of attributes on other attributes. No site should be able to learn the contents of a transaction at any other site, which rules are supported by any other site, or the specific value of support or confidence for any rule at any other site.
Two issues cause a disparity between local and global results. First, the values for a single entity may be split across sources, so data mining at individual sites cannot detect cross-site correlations. Second, the same item may be duplicated at different sites and will then be over-weighted in the results.
Database Creation and Mining Boolean
Association Rules
The absence of an attribute is represented by 0 and its presence by 1. Determining whether an item set is frequent therefore amounts to counting the rows in which every attribute of the item set has the value 1. Suppose X and Y are attributes in the database, and let $x_i$ denote the value of attribute X in row i (and $y_i$ that of Y). The scalar product is $X \cdot Y = \sum_{i=1}^{n} x_i y_i$, where n is the total number of transactions. If k is the support threshold, the item set {X, Y} is frequent when $X \cdot Y > k$. This module stores the item sets in binary form: if a particular attribute occurs in a particular transaction it stores 1, otherwise it stores 0. Each party has the same number of transactions. Figures 3(a) to 3(c) show the database tables stored in binary format.
Fig. 3(a): Database maintained by first party
Fig. 3(b): Database maintained by second party
Fig. 3(c): Database maintained by third party
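The relationship between the scalar product of binary columns and the intersection of transaction ids can be checked with a short Python sketch. The column values and the threshold k below are hypothetical, and `scalar_product` is generalised to any number of columns as an illustrative convenience.

```python
import math


def scalar_product(*columns):
    """Sum over rows of the product of the 0/1 values: counts rows where all are 1."""
    return sum(math.prod(row) for row in zip(*columns))


def tids(column):
    """Set of row indices (transaction ids) in which the attribute is present."""
    return {i for i, value in enumerate(column) if value == 1}


x = [1, 0, 1, 1, 0]  # hypothetical attribute X held by one party
y = [1, 1, 1, 0, 0]  # hypothetical attribute Y held by another party

count = scalar_product(x, y)
# Both expressions count the rows in which X and Y are simultaneously 1.
assert count == len(tids(x) & tids(y))

k = 1  # hypothetical support threshold
is_frequent = count > k  # strict inequality, following the text
```

Because the two quantities coincide for 0/1 data, a party that can intersect sets of transaction ids can obtain the same count without ever seeing the binary columns themselves.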
The rule to be mined, the minimum support and the minimum confidence are given as input to the data Excavator, as in Figure 4. The data Excavator then splits the rule and sends the items to the appropriate parties, which perform local data mining and send the resulting transaction ids in encrypted form to the third party. The third party calculates the scalar product, or intersection, of all the encrypted transaction ids and sends the result to the data Excavator. Finally, the data Excavator sends the results to the parties. Since the transaction ids are sent in encrypted form, the third party does not know which items are present in a particular transaction at a site. The screen shots are as shown:
Fig. 6: Comparison of archetyped and non archetyped data
in terms of cost
Fig. 4: Association Rule to be mined is given as input
If the rule and the threshold values are given, the system reports the actual support for that global rule, as shown in Figure 5(a) and Figure 5(b). For example, if the rule to be mined is a1a2b1c1c3, it reports the support for that rule.
Fig. 7: Comparison of archetyped and non archetyped data
in terms of efficiency
Fig. 5(a): Screen shot showing output for the given rule
CONCLUSION
Privacy-preserving data mining techniques extract relevant knowledge from massive amounts of data while at the same time shielding sensitive information. A number of data mining techniques that integrate privacy protection mechanisms have been developed; they allow one to hide sensitive item sets or patterns before the data mining process is executed. An important issue is to decide which of these privacy-preserving techniques best protects sensitive information. We have implemented the new technique and evaluated its efficiency on our own conceptual framework. The framework was used to compare and contrast the techniques on a common platform, which serves as the basis for ascertaining the suitable technique for a given type of application of privacy-preserving shared filtering.
Fig. 5(b): Screen shot showing output for the given rule
FUTURE WORK
As seen from Figures 6 and 7, our algorithms produce accurate results (better than perturbation), but usually with much less computation and communication overhead than secure computation.
We intend to extend this work to horizontally partitioned data as well. We hope the proposed solution will open the way to new techniques and further research, and will perform well on evaluation metrics including hiding effects, data utility, and time performance.
REFERENCES
[1] Yao, A.C., How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162–167, 1986.
[2] Evfimievski, A., Srikant, R., Agrawal, R. and Gehrke, J., Privacy preserving mining of association rules. In Proc. of ACM SIGKDD’02, pages 217–228, Canada, July 2002.
[3] Gilburd, B., Schuster, A. and Wolff, R., k-ttp: a new privacy model for large-scale distributed environments. In Proc. of ACM SIGKDD’04, pages 563–568, 2004.
[4] Dwork, C. and Nissim, K., Privacy-preserving data mining on vertically partitioned databases. In Proc. of CRYPTO’04, August 2004.
[5] Kargupta, H., Datta, S., Wang, Q. and Sivakumar, K., On the privacy preserving properties of random data perturbation techniques. In Proc. of ICDM’03, page 99, Washington, DC, USA, 2003. IEEE Computer Society.
[6] Dinur, I. and Nissim, K., Revealing information while preserving privacy. In Proc. of PODS’03, pages 202–210, June 2003.
[7] Vaidya, J. and Clifton, C., Secure set intersection cardinality with application to association rule mining. Journal of Computer Security 13(4): 593–622, 2005.
[8] Vaidya, J. and Clifton, C., Privacy preserving association rule mining in vertically partitioned data. In Proceedings of SIGKDD 2002, Edmonton, Alberta, Canada, 2002.
[9] Goldreich, O., Micali, S. and Wigderson, A., How to play any mental game – a completeness theorem for protocols with honest majority. In 19th ACM Symposium on the Theory of Computing, pages 218–229, 1987.
[10] Chan, P., An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning. PhD thesis, Department of Computer Science, Columbia University, New York, NY, 1996. (Technical Report CUCS-044-96).
[11] Agrawal, R. and Srikant, R., Privacy-preserving data mining. In Proc. of the ACM SIGMOD’00, pages 439–450, Dallas, Texas, USA, May 2000.
[12] Du, W. and Atallah, M.J., Secure multi-party computation problems and their applications: A review and open problems. In Proceedings of the 2001 New Security Paradigms Workshop, Cloudcroft, New Mexico, Sept. 11–13, 2001.
[13] Huang, Z., Du, W. and Chen, B., Deriving private information from randomized data. In Proc. of ACM SIGMOD’05, 2005.