A Framework for Association Rule Generation Using Privacy Enhancing Methodology for Vertically Partitioned Data Mining

Praveen Baskar, J. Gitanjali, K. Satheesh Kumar(1), J. Indumathi(2) and G.V. Uma(3)
Department of Computer Science and Engineering, Anna University, Chennai-600 025, Tamil Nadu, India
E-mail: (1) Sathishkumar248@gmail.com, (2) indumathi.j@gmail.com, (3) gvuma@annauniv.edu

“Any sufficiently advanced technology is indistinguishable from magic.” - Arthur C Clarke

ABSTRACT: At its core, the value of privacy-preserving data mining derives not only from its ability to extract crucial knowledge, but also from its resilience to attack. It performs well at the needed levels during both crisis and normal operations. The central thrust of this work is toward a world with robust data security, in which knowledge users continue to profit from data without compromising data privacy. The goal of privacy-preserving data mining is to release a dataset that researchers can study without being able to identify sensitive information about any individual in the data (with high probability). The main existing methods (i.e., the data obfuscation methods and the secure computation methods) are each limited in their own ways. In this paper we therefore present a new archetype that performs enhanced privacy preservation for distributed data mining (i.e., on vertically partitioned data) without using the conventional techniques of perturbation or cryptography. We have implemented the new technique and evaluated its efficiency on our own conceptual framework. This framework was used to compare and contrast all of the techniques on a common platform, which will serve as the basis for determining the suitable technique for a given type of application of privacy preserving shared filtering.
We hope the proposed solution will lead to new techniques, pave the way for further research, and perform well on the evaluation metrics, including hiding effects, data utility, and time performance.

Keywords: Estimator, Excavator, Privacy, Privacy Preserving Data Mining (PPDM), Vertical Partitioned Privacy Preserving Data Mining (VPPPDM).

INTRODUCTION

The explosion of new data mining techniques has amplified privacy risks, because it is now possible to effectively combine and interrogate massive data stores available on the web in search of previously unknown hidden patterns. To make a publicly accessible system secure, we must guarantee not only that private sensitive data have been trimmed out, but also that certain inference channels have been blocked. Both the data and the knowledge concealed in the data should be made secure. Furthermore, the need to make our system as open as possible, to the extent that data sensitivity is not jeopardized, calls for diverse techniques that account for the disclosure control of sensitive data.

Currently, databases are distributed either horizontally or vertically among several organizations that would like to collaborate in order to extract global knowledge; at the same time, privacy concerns may prevent these parties from directly sharing the data among themselves. The privacy of databases is of foremost concern when data is shared for collaborative data mining in knowledge discovery systems, and it is addressed by Privacy Preserving Data Mining (PPDM). PPDM is a discipline that aims to enable the release of such data while preserving privacy. There are several mechanisms by which data privacy can be achieved. A pressing issue is therefore to determine which of these privacy-preserving techniques are good enough to protect sensitive information.
Nevertheless, this is one of the decisive factors in judging which of these techniques is the best.

LITERATURE SURVEY

The motivation for the architecture proposed here emerges from a consideration of the role that privacy plays in individual people’s lives; of privacy legislation, in addition to an acknowledgement of individual citizens’ privacy preferences and any supplementary privacy constraints required by organisations or imposed by regulatory bodies; and of a brief survey of both the research and the state of practice with regard to Privacy-Enhancing Technologies.

R. Agrawal et al. (2000) [11] used the perturbation technique (the original data remain secret, while the added noise averages out in the output) to preserve privacy. In spite of the simplicity of the method, it lacked a formal framework for proving how much privacy is guaranteed. Regardless of the existence of other models [2, 4 and 6] for studying the privacy achievable through perturbation, there is no formal way to model and quantify the privacy threat from mining of perturbed data. Recently there have been some proofs [13, 5] which state that for some data, and some kinds of noise, perturbation provides almost no privacy at all.

Mobile and Pervasive Computing (CoMPC–2008)

The alternative approach to privacy preserving data mining was developed using cryptographic techniques [10, 3, 7], most often the secure computation technique [12, 9, 8, 1]. Taking into consideration the issues tabulated in Table 1 and the taxonomy in Figure 1, we present a new archetype that performs enhanced privacy preservation for distributed data mining without using the conventional techniques of perturbation or cryptography. We present an algorithm for association rule mining on vertically partitioned data that uses this new paradigm, and we support the algorithm by discussing its privacy and performance characteristics.

Table 1: Comparison between the contemporary methods

Perturbation Method
  Advantages:
    - Simplicity.
    - Easy to use.
  Disadvantages:
    - Reduced accuracy.
    - It lacks a formal framework for proving how much privacy is guaranteed.
    - There is no formal way to model and quantify the privacy threat from mining of perturbed data.

Secure Computation Method
  Advantages:
    - Offers a well-defined model for privacy, which includes methodologies for proving and quantifying it.
    - There is a vast toolset of cryptographic algorithms and constructs which can be used for implementing privacy-preserving data mining algorithms.
    - It provides accurate results and not approximations.
  Disadvantages:
    - Increased overhead.
    - It is a much slower method and requires considerable computation and communication overhead for each SC step.

Fig. 1: Taxonomy of PPDM techniques (categories shown: perturbation; cryptographic technique; secure computation; without perturbation and secure computation)

PROBLEM DESCRIPTION

Problem Statement

Conceiving, designing and implementing an archetype and algorithms for enhancing the privacy of health care databases.

Problem Description

The scope of our work is to propose a new model for performing privacy-preserving distributed data mining without using the contemporary methods, to formulate an enhanced algorithm for vertically partitioned databases, and to mine association rules under this model. The main idea lies in the architecture, which separates the entity that computes the results from the entity that finally receives the results and knows what they mean. We also discuss the privacy and performance characteristics of the approach.

ARCHITECTURE OF THE PROPOSED WORK

Figure 2 gives a schematic representation of the blocks involved in the proposed architecture.

Fig. 2: Archetype for VPPDDM (blocks shown: local databases Db1, Db2, Dbw, where Db denotes a database; encryption; encrypted data; Excavator; Estimator; privacy preserving data mining; results to data owners; data analyst)

The main idea is an architectural one. We have N involved parties collaborating with each other. The work of the algorithm is divided between an Excavator and an Estimator. The Excavator decides the type of computation to be done, and the Estimator carries out the computation for each item-set without any information about the item-sets themselves.

IMPLEMENTATION

The Goal: Computation of the frequent item-sets in a vertically partitioned database without compromising the participants’ privacy.

Privacy Definition: Privacy is compromised if any participant is able to compute some specific value of the database with high probability. By a specific value we mean a value in a database cell that belongs to some specific transaction.

ALGORITHM USED

Step 1: The Excavator sets the variable i (the size of the item-sets currently being checked) to 1.
Step 2: The Excavator chooses a random ordering of the item-sets of i elements and then iteratively picks each of them.
Step 3: For each item-set from Step 2, if all the subsets of the current item-set are frequent (apriori principle), the Excavator instructs all participants to encrypt the transaction IDs of the transactions using the same key.
Step 4: The Excavator then asks every participant about the frequency of the current item-set.
Step 5: The participants send the encrypted transaction IDs of all relevant transactions to the Estimator.
Step 6: The Estimator finds the intersection of the encrypted transaction IDs.
Step 7: The Estimator informs the Excavator whether the current item-set is frequent.
Step 8: While i is smaller than the number of database attributes, the value of i is incremented by 1 and the algorithm returns to Step 2.
Step 9: The Excavator sends the results to the participants.
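The nine steps above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation evaluated in the paper: the names `Party`, `Estimator`, `excavate` and `encrypt_tid` are hypothetical, the toy data are invented, and the "encryption" of transaction IDs is assumed here to be a deterministic keyed HMAC-SHA256 tag (the text does not say which cipher is used). The frequency test is taken to be a strict count threshold k.

```python
import hashlib
import hmac
import random
from itertools import combinations


def encrypt_tid(key, tid):
    """Deterministic keyed tag for a transaction ID (assumed stand-in for 'encryption')."""
    return hmac.new(key, str(tid).encode(), hashlib.sha256).hexdigest()


class Party:
    """One data owner holding binary attribute columns: {item: {tid: 0 or 1}}."""

    def __init__(self, columns):
        self.columns = columns

    def encrypted_tids(self, itemset, key):
        """Steps 4-5: encrypted IDs of transactions containing all locally held items."""
        local = [item for item in itemset if item in self.columns]
        if not local:
            return None  # this party holds none of the requested items
        tids = set.intersection(
            *[{t for t, v in self.columns[item].items() if v == 1} for item in local]
        )
        return {encrypt_tid(key, t) for t in tids}


class Estimator:
    """Sees only encrypted IDs, never item names or plain transaction IDs."""

    def is_frequent(self, encrypted_sets, k):
        common = set.intersection(*encrypted_sets)  # Step 6
        return len(common) > k                      # Step 7


def excavate(parties, k, key, seed=None):
    """Excavator loop (Steps 1-3, 8 and 9); returns the set of frequent item-sets."""
    rng = random.Random(seed)
    # Assumption: item names are unique across parties (vertical partitioning).
    items = sorted(item for p in parties for item in p.columns)
    estimator = Estimator()
    frequent = set()
    i = 1                                           # Step 1
    while i <= len(items):                          # Step 8
        candidates = [frozenset(c) for c in combinations(items, i)]
        rng.shuffle(candidates)                     # Step 2
        for cand in candidates:
            # Step 3: apriori check on all (i-1)-element subsets.
            if i > 1 and any(
                frozenset(s) not in frequent for s in combinations(cand, i - 1)
            ):
                continue
            replies = [p.encrypted_tids(cand, key) for p in parties]
            replies = [r for r in replies if r is not None]
            if estimator.is_frequent(replies, k):
                frequent.add(cand)
        i += 1
    return frequent                                 # Step 9


# Hypothetical toy data: three parties, five transactions (IDs 1..5).
party_a = Party({"a1": {1: 1, 2: 1, 3: 1, 4: 0, 5: 1},
                 "a2": {1: 1, 2: 0, 3: 1, 4: 1, 5: 1}})
party_b = Party({"b1": {1: 1, 2: 1, 3: 1, 4: 1, 5: 0}})
party_c = Party({"c1": {1: 1, 2: 0, 3: 1, 4: 0, 5: 0}})
shared_key = b"key known to the parties only"
result = excavate([party_a, party_b, party_c], k=2, key=shared_key, seed=0)
print(sorted(sorted(s) for s in result))
```

In this sketch the key is passed to the parties but never to the `Estimator`, which mirrors the separation between the entity that computes the result and the entity that knows what it means. A deterministic tag is chosen only because the intersection in Step 6 requires equal IDs to map to equal ciphertexts.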
RESULT AND ANALYSIS

In vertically partitioned data, different sites collect information about the same set of entities, but each collects a different feature set; that is, the records (entities) are split across parties. The relations at the individual sites must therefore be joined to obtain the relation to be mined.

Let I = {I1, I2, ..., Im} be a set of attributes, usually called items, and let D be a set of transactions. Each transaction in D is a set T ⊆ I of items. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = Ø. The rule X ⇒ Y has support s in D if s% of the transactions in D contain X ∪ Y. The rule X ⇒ Y has confidence c in D if c% of the transactions in D that contain X also contain Y. The problem is to find all association rules with support and confidence above certain thresholds (usually referred to as minsup and minconf). An association rule expresses the dependence of a set of attributes on other attributes.

No site should be able to learn the contents of a transaction at any other site, which rules are supported by any other site, or the specific value of support or confidence for any rule at any other site. Issues that cause a disparity between local and global results include the following. First, values for a single entity may be split across sources, so data mining at individual sites will be unable to detect cross-site correlations. Second, the same item may be duplicated at different sites and will then be overweighted in the results.

Database Creation and Mining Boolean Association Rules

The absence of an attribute is represented by 0 and its presence by 1. Determining the frequent item-sets amounts to determining how many rows have the value 1 for all attributes in the item-set. Suppose X and Y represent attributes in the database, and xi represents the value of attribute X in row i. The scalar product is X · Y = ∑ xi · yi, for i = 1 to n, where n is the total number of transactions. If k is the support threshold, then the item-set {X, Y} is frequent when X · Y > k.
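As a small worked illustration of the definitions above, the following Python sketch computes the scalar product, support and confidence on binary columns. The function names and the two example columns are hypothetical and chosen only for illustration; the columns are held in the clear here, so the sketch shows the arithmetic only and not the privacy-preserving protocol.

```python
def scalar_product(x, y):
    """X . Y = sum of x_i * y_i over all n transactions (binary columns)."""
    if len(x) != len(y):
        raise ValueError("both columns must cover the same transactions")
    return sum(xi * yi for xi, yi in zip(x, y))


def itemset_count(columns):
    """Number of rows in which every attribute of the item-set is 1."""
    return sum(1 for row in zip(*columns) if all(row))


def support(columns):
    """Support of an item-set as a percentage of all transactions."""
    n = len(columns[0])
    return 100.0 * itemset_count(columns) / n


def confidence(x_columns, y_columns):
    """Confidence of X => Y: percentage of rows containing X that also contain Y."""
    return 100.0 * itemset_count(x_columns + y_columns) / itemset_count(x_columns)


# Hypothetical example: two attributes over n = 5 transactions.
X = [1, 1, 0, 1, 1]
Y = [1, 0, 1, 1, 1]
k = 2  # support threshold expressed as a row count

print(scalar_product(X, Y), scalar_product(X, Y) > k)
print(support([X, Y]), confidence([X], [Y]))
```

Here X and Y are both 1 in three of the five rows, so X · Y = 3 exceeds k = 2; the support of {X, Y} is 60% and the confidence of X ⇒ Y is 75%, since X is present in four rows. For two attributes `itemset_count` coincides with the scalar product, and it extends the same row-wise test to item-sets of more than two attributes.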
This module stores the item-sets in binary form: if a particular attribute occurs in a particular transaction it stores 1, otherwise it stores 0. Here each party has the same number of transactions. Figures 3(a) to 3(c) show the database tables stored in binary format.

Fig. 3(a): Database maintained by first party
Fig. 3(b): Database maintained by second party
Fig. 3(c): Database maintained by third party

The rule to be mined, the minimum support and the minimum confidence are given as input to the data Excavator, as in Figure 4. The data Excavator then splits the rule and sends the items to the appropriate parties, which perform local data mining and send the transaction IDs in encrypted form to the third party. The third party calculates the scalar product, or intersection, of all the encrypted transaction IDs and sends the result to the data Excavator. Finally, the data Excavator sends the results to the parties. Since the transaction IDs are sent in encrypted form, the third party does not know which items are present in a particular transaction at a site. The screen shots are as shown:

Fig. 4: Association Rule to be mined is given as input

If the rule and the threshold values are given, the system returns the actual support for that global rule, as shown in Figure 5(a) and Figure 5(b). For example, if the rule to be mined is a1a2b1c1c3, it returns the support for that rule.

Fig. 5(a): Screen shot showing output for the given rule
Fig. 5(b): Screen shot showing output for the given rule

As seen from Figures 6 and 7, our algorithms produce accurate results (better than perturbation), but usually with much less computation and communication overhead than secure computation.

Fig. 6: Comparison of archetyped and non archetyped data in terms of cost
Fig. 7: Comparison of archetyped and non archetyped data in terms of efficiency

CONCLUSION

Privacy preserving data mining techniques extract relevant knowledge from large amounts of data while at the same time shielding sensitive information. A number of data mining techniques that integrate privacy protection mechanisms have been developed, which allow one to hide sensitive item-sets or patterns before the data mining process is executed. An important issue is to determine which of these privacy-preserving techniques are good enough to protect sensitive information. We have implemented the new technique and evaluated its efficiency on our own conceptual framework. This framework was used to compare and contrast all of the techniques on a common platform, which will serve as the basis for determining the suitable technique for a given type of application of privacy preserving shared filtering.

FUTURE WORK

We intend to extend this work to horizontally partitioned data as well. We hope the proposed solution will lead to new techniques, pave the way for further research, and perform well on the evaluation metrics, including hiding effects, data utility, and time performance.

REFERENCES

[1] Yao, A.C., How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162–167, 1986.
[2] Evfimievski, A., Srikant, R., Agrawal, R. and Gehrke, J., Privacy preserving mining of association rules. In Proc. of ACM SIGKDD’02, pages 217–228, Canada, July 2002.
[3] Gilburd, B., Schuster, A. and Wolff, R., k-TTP: a new privacy model for large-scale distributed environments. In Proc. of ACM SIGKDD’04, pages 563–568, 2004.
[4] Dwork, C. and Nissim, K., Privacy-preserving data mining on vertically partitioned databases. In Proc. of CRYPTO’04, August 2004.
[5] Kargupta, H., Datta, S., Wang, Q. and Sivakumar, K., On the privacy preserving properties of random data perturbation techniques. In Proc. of ICDM’03, page 99, Washington, DC, USA, 2003. IEEE Computer Society.
[6] Dinur, I. and Nissim, K., Revealing information while preserving privacy. In Proc. of PODS’03, pages 202–210, June 2003.
[7] Vaidya, J. and Clifton, C., Secure set intersection cardinality with application to association rule mining. Journal of Computer Security 13(4): 593–622, 2005.
[8] Vaidya, J. and Clifton, C., Privacy preserving association rule mining in vertically partitioned data. In Proceedings of SIGKDD 2002, Edmonton, Alberta, Canada, 2002.
[9] Goldreich, O., Micali, S. and Wigderson, A., How to play any mental game: a completeness theorem for protocols with honest majority. In 19th ACM Symposium on the Theory of Computing, pages 218–229, 1987.
[10] Chan, P., An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning. PhD thesis, Department of Computer Science, Columbia University, New York, NY, 1996. (Technical Report CUCS044-96).
[11] Agrawal, R. and Srikant, R., Privacy-preserving data mining. In Proc. of ACM SIGMOD’00, pages 439–450, Dallas, Texas, USA, May 2000.
[12] Du, W. and Atallah, M.J., Secure multi-party computation problems and their applications: a review and open problems. In Proceedings of the 2001 New Security Paradigms Workshop, Cloudcroft, New Mexico, Sept. 11–13, 2001.
[13] Huang, Z., Du, W. and Chen, B., Deriving private information from randomized data. In Proc. of ACM SIGMOD’05, 2005.