International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 9- Sep 2013

A Novel Privacy Preserving Supervised Learning Approach in Data Mining

K Raghaveswara Rao, V Sangeeta
M.Tech Scholar, Associate Professor
Computer Science and Engineering, Pydah College of Engineering and Technology, Visakhapatnam

Abstract:- In this paper we propose an efficient privacy-preserving supervised learning approach that combines ID3 with the Advanced Encryption Standard (AES). The main objective is to secure data while it is mined over a network, so that the confidentiality and sensitivity of the data are protected by our architecture throughout the mining process. After receiving the mined results from the analyst, the data owner can securely obtain the classified results without losing the integrity of the data.

I. INTRODUCTION

Classification is the process of grouping together documents or data that have similar properties or are related. Our understanding of data and documents becomes greater and easier once they are classified. We can also infer logic based on the classification. Most of all, it makes new data easier to sort and retrieval faster, with better results.

Dewey Decimal Classification is the system most used in libraries. It is hierarchical: there are ten parent classes, each divided into ten divisions, which are in turn divided into ten sections. Each book is assigned a number according to its class, division and section alphabetically. Dewey Decimal Classification is very successful in libraries, but unfortunately it cannot be implemented in Information Retrieval. Someone would need to maintain a central catalogue of all the documents on the web, and whenever a new document was added, a central committee would have to look at it, classify it, assign it a number and publish it on the web. This strongly violates the way the internet works.
An authority controlling the contents of the web would restrict the amount of data that can be added to it. We need a web that allows everyone to upload their content, together with a machine learning technique that finds new data and classifies it as it arrives.

Confidentiality issues in data mining. A key problem that arises in any en masse collection of data is that of confidentiality. The need for privacy is sometimes due to law (e.g., for medical databases) or can be motivated by business interests. However, there are situations where the sharing of data can lead to mutual gain. A key utility of large databases today is research, whether scientific or economic and market oriented. Thus, for example, the medical field has much to gain by pooling data for research, as do competing businesses with mutual interests. Despite the potential gain, this is often not possible because of the confidentiality issues that arise. We address this question and show that highly efficient solutions are possible.

ISSN: 2231-5381

Our scenario is the following. Let P1 and P2 be parties owning (large) private databases D1 and D2. The parties wish to apply a data-mining algorithm to the joint databases D1, D2 without revealing any unnecessary information about their individual databases. That is, the only information learned by P1 about D2 is that which can be learned from the output of the data mining algorithm, and vice versa. We do not assume any “trusted” third party who computes the joint output.

II. RELATED WORK

Previous work in privacy-preserving data mining has addressed two issues [1]. In one, the aim is to preserve customer privacy by perturbing the data values [4]. In this scheme random noise is introduced to distort sensitive values, and the distribution of the random data is used to generate a new data distribution that is close to the original data distribution without revealing the original data values.
The estimated original data distribution is used to reconstruct the data, and data mining techniques such as classifiers and association rules are applied to the reconstructed data set. Later refinement of this approach has tightened the estimation of original values based on the distorted data [3]. The data distortion approach has also been applied to Boolean values. Perturbation methods and their privacy protection have been criticized because private information may be derived from the reconstruction step [10].

http://www.ijettjournal.org Page 4158

In contrast to the original additive-noise method in [4], many distinct perturbation methods have been proposed. One important category is the multiplicative perturbation method. From the viewpoint of the geometric properties of the data, multiplying the original data values by a random noise matrix rotates the original data matrix, so it is also called rotation-based perturbation. In [5], the authors give a sound proof of “Rotation invariant Classifiers” to show that some data mining tools can be applied directly to rotation-based perturbed data. In later work [11], Liu et al. proposed multiplicative random projection, which provides stronger privacy protection. There are other interesting techniques, such as the condensation-based approach [2] and matrix decomposition [21]. As pointed out in [13], this recent research on perturbation-based approaches applies data mining techniques directly to the perturbed data, skipping the reconstruction step. The choice of a suitable data mining technique is determined by the method by which the noise has been introduced. To our knowledge, very few works focus on mapping or modifying data mining techniques to meet the needs of perturbed data.

The other approach uses cryptographic tools to build data mining models.
For example, in [11], the goal is to securely build an ID3 decision tree where the training set is distributed between two parties. Different solutions have been given to address different data mining problems using cryptographic techniques (e.g., [6, 9, 19]). This approach treats privacy-preserving data mining as a special case of secure multi-party computation; it aims not only to preserve individual privacy but also to prevent leakage of any information other than the final result.

In this paper we introduce an efficient privacy-preserving cryptographic approach for classifying datasets without exposing sensitive user information to the external world.

III. PROPOSED WORK

In this paper we propose a cryptographic classification approach. The architecture works as follows.

Abstract view of proposed work:
Step 1: Read the input synthetic training and testing datasets.
Step 2: Forward the datasets to the encoder of the AES algorithm.
Step 3: Forward the unrealized datasets (encrypted datasets) to the analyst.
Step 4: The analyst applies ID3 classification based on information gain in terms of entropy.
Step 5: ID3 classifies the data by analyzing the testing data against the training data and returns the classified (tested) data.
Step 6: The analyst forwards the cipher classification rules to the data owner.
Step 7: After receiving the cipher classified data, the data owner forwards it to the AES decoder.
Step 8: After decryption, the data owner receives the plain classified data.

A) Initialize Training Datasets for Machine Learning
A dataset is a collection of tuples, each giving one of the possible values for every attribute together with a class label. It is supplied to the classification process so that the behaviour of the testing set can be analyzed with a machine learning approach. A synthetic dataset can be gathered for the classification experiments. Initially the dataset is forwarded to the encoder, which returns the cipher dataset.
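The data flow of Steps 1 to 3 and Steps 6 to 8 above can be illustrated with a minimal Python sketch. AES is not part of the Python standard library, so the sketch substitutes a keyed HMAC token plus an owner-side lookup table for the AES encoder and decoder. This stand-in, the toy rows, the key and all function names are assumptions made for illustration and are not the implementation used in this paper. The sketch also assumes that each attribute value is encoded deterministically and that attribute names stay readable.

```python
import hashlib
import hmac

# Hypothetical toy training set: attribute values plus a class label per tuple.
ROWS = [
    {"age": "young", "income": "high", "eligible": "no"},
    {"age": "young", "income": "low", "eligible": "no"},
    {"age": "senior", "income": "high", "eligible": "yes"},
    {"age": "senior", "income": "low", "eligible": "yes"},
]


def encode_value(key: bytes, attribute: str, value: str) -> str:
    """Stand-in for the AES encoder (assumption): a deterministic keyed token."""
    message = f"{attribute}={value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]


def encode_dataset(key: bytes, rows):
    """Steps 2-3: build the cipher rows and the owner's private codebook."""
    codebook = {}  # token -> plain value, kept by the data owner only
    cipher_rows = []
    for row in rows:
        cipher_row = {}
        for attribute, value in row.items():
            token = encode_value(key, attribute, value)
            codebook[token] = value
            cipher_row[attribute] = token
        cipher_rows.append(cipher_row)
    return cipher_rows, codebook


def decode_row(codebook, cipher_row):
    """Steps 7-8: the owner maps every token back to its plain value."""
    return {attribute: codebook[token] for attribute, token in cipher_row.items()}


key = b"owner-secret-key"  # hypothetical key, never shared with the analyst
cipher_rows, codebook = encode_dataset(key, ROWS)
decoded_rows = [decode_row(codebook, row) for row in cipher_rows]
```

Because equal plain values map to equal tokens in this sketch, the analyst can still group tuples by attribute value without seeing the plain data, while only the owner, who holds the key and the codebook, can map tokens back.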
B) Unrealized Dataset Creation
Usually data is passed to analysts for machine learning purposes, but this raises a privacy issue regarding confidential information. In this paper we therefore use the AES algorithm to address the privacy issue. After this mechanism is applied, the dataset becomes an unrealized dataset; i.e., the cipher dataset, rather than the plain sensitive or confidential information, is passed to the analyst for classification.

Figure 1. Privacy preserving Architecture

C) AES Algorithm
Our approach uses an advanced cryptographic algorithm for secure data transmission. The system mainly works on substitution and affine transformation techniques:
1. Key Expansion: round keys are derived from the cipher key using the key schedule.
2. Initial Round
   1. Add Round Key: each byte of the state is combined with the round key using bitwise xor.
3. Rounds
   1. Sub Bytes: a non-linear substitution step where each byte is replaced with another according to a lookup table.
   2. Shift Rows: a transposition step where each row of the state is shifted cyclically a certain number of steps.
   3. Mix Columns: a mixing operation which operates on the columns of the state, combining the four bytes in each column.
   4. Add Round Key
4. Final Round (no Mix Columns)
   1. Sub Bytes
   2. Shift Rows
   3. Add Round Key
The sub-bytes, shift-rows, mix-columns and add-round-key steps follow the description in [21]; for the implementation we used the built-in algorithm from the .NET namespaces.

D) Classification with ID3
ID3 is an efficient machine learning approach for building decision trees, which are used for classification. The tree is constructed from the entropy or information gain of each attribute. The classification rules can then be analyzed efficiently by passing the testing data through the tree built from the training dataset.
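As a rough illustration of how ID3 chooses the attribute for a decision node, the following Python sketch computes entropy and information gain over opaque tokens. The token strings, the attribute names and the helper functions are hypothetical; the sketch only assumes that the analyst can test attribute values for equality, which is all these calculations need.

```python
import math
from collections import Counter

# Hypothetical cipher dataset: every cell is an opaque token, not a plain value.
CIPHER_ROWS = [
    {"a1": "9f3c", "a2": "71be", "label": "c0d1"},
    {"a1": "9f3c", "a2": "e204", "label": "c0d1"},
    {"a1": "5ab7", "a2": "71be", "label": "88fa"},
    {"a1": "5ab7", "a2": "e204", "label": "88fa"},
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Entropy of the current set minus the weighted entropy of its child sets."""
    parent = entropy([row[target] for row in rows])
    children = {}
    for row in rows:
        # Group the class labels by the value this row has for the attribute.
        children.setdefault(row[attribute], []).append(row[target])
    weighted = sum(
        len(child) / len(rows) * entropy(child) for child in children.values()
    )
    return parent - weighted


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


gain_a1 = information_gain(CIPHER_ROWS, "a1", "label")
gain_a2 = information_gain(CIPHER_ROWS, "a2", "label")
root = best_attribute(CIPHER_ROWS, ["a1", "a2"], "label")
```

On this toy data, `a1` separates the two class tokens perfectly and `a2` does not separate them at all, so `a1` would be placed at the root. A full ID3 implementation would repeat the selection recursively on each child set.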
E) Retrieval of Original Classified Results
After the classification results are generated, they are passed to the data owner, where the administrator performs attribute-oriented decryption on the result set. The decoder reconstructs the original dataset, and the classification rules are finally obtained at the data owner's end.

a. Methodology
Initially the dataset is forwarded to the encoder, which returns the cipher dataset. Data is normally passed to analysts for machine learning, but the confidential information it contains needs privacy protection. In this paper we therefore implemented the AES algorithm to preserve privacy. After the algorithm is applied, the dataset is converted into an unrealized dataset (cipher dataset). The cipher dataset, instead of the plain data, is passed to the analyst for classification. The analyst finds class labels based on information gain, which is then used to construct the decision tree. Rules in cipher text format are formed from the decision tree and sent back to the data owner. The data owner decrypts the rules and extracts the original dataset.

b. Experimental Analysis
ID3 builds a decision tree from a fixed set of examples. The resulting tree is used to classify future samples. Each example has several attributes and belongs to a class (such as yes or no). The leaf nodes of the decision tree contain the class name, whereas a non-leaf node is a decision node. A decision node is an attribute test, with each branch (to another decision tree) being a possible value of the attribute. ID3 uses information gain to decide which attribute goes into a decision node. The advantage of learning a decision tree is that a program, rather than a knowledge engineer, elicits knowledge from an expert. Gain measures how well a given attribute separates training examples into targeted classes.
The attribute with the highest information gain (information being the most useful for classification) is selected. In order to define gain, we first borrow an idea from information theory called entropy. Entropy measures the amount of information in an attribute. For a sample $S$ whose classes occur with proportions $p_i$, the homogeneity of the sample is calculated as

$E(S) = -\sum_i p_i \log_2 p_i$

It helps to measure the information gain with respect to the attributes:

$\mathrm{Gain}(A) = E(\text{current set}) - \sum E(\text{all child sets})$

where each child set's entropy is weighted by its share of the current set.

The input synthetic training data set is shown in Figure 2. The unrealized data set forwarded to the analyst is shown in Figure 3. ID3 classifies the data by analyzing the testing data against the training data with a decision tree; the tree and the eligible data are shown in Figure 4. The analyst forwards the cipher classification rules to the data owner. After receiving the cipher classified data, the data owner forwards it to the AES decoder. The plain classified data that the data owner receives after decryption is shown in Figure 5.

Figure 2. Original Data at Owner

Figure 3. Unrealized Dataset

Figure 4. Tree and Eligible Data

The final eligible data after decryption at the data owner's end is shown below.

Figure 5. Eligible Data after Classification and Decryption

Comparative Analysis:
Classification is the process of grouping together documents or data that have similar properties or are related. Our understanding of data and documents becomes greater and easier once they are classified. We can also infer logic based on the classification. Most of all, it makes new data easier to sort and retrieval faster, with better results.

Recent proposals for privacy preservation during classification in data mining mostly follow one of two approaches: perturbation and randomization-based approaches, and cryptographic approaches. In the first approach, fake values are injected into the real dataset, converting it into an unrealized dataset. In the cryptographic approach, plain data is converted to cipher text using a cryptographic algorithm. The main drawback of the first approach is data retrievability: after the classified data is retrieved from the analyst, the resulting rules may not be optimal because of the fake values imputed into the real dataset, and maintaining the details of the fake imputation rules for the entire dataset (both training and testing datasets) is a time-consuming process. Our proposed approach provides more security against third parties, although its computational complexity depends on the number of records in the datasets. For optimal security we adopt our cryptographic approach with AES.

IV. CONCLUSION & FUTURE WORK

In this paper we proposed an efficient privacy preservation technique for the classification of unreal datasets. It protects the data owner from unauthorized access and privacy issues. Our proposed approach works efficiently without violating the classification properties. Meanwhile, an accurate decision tree can be built directly from those unreal data sets. Finally, the approach yields accurate results even though classification is applied to the cipher dataset.

One shortcoming of the ID3 algorithm is its inability to handle noisy data, which leads to overfitting. The second drawback of ID3 is that the attributes in the training and testing datasets must match; a missing attribute causes classification to fail and may lead to incorrect predictions. Computational complexity is another main drawback when the data is large. We can improve our research on privacy preservation and optimality with validation set pruning and some fuzziness for the initial problem. We can minimize the computational complexity of AES by changing the traditional GPU.

REFERENCES
[1] P.K. Fong and J.H. Weber-Jahnke, “Privacy Preserving Decision Tree Learning Using Unrealized Data Sets,” IEEE Computer Society, 2012.
[2] S. Ajmani, R. Morris, and B. Liskov, “A Trusted Third-Party Computation Service,” Technical Report MIT-LCS-TR-847, MIT, 2001.
[3] S.L. Wang and A. Jafari, “Hiding Sensitive Predictive Association Rules,” Proc. IEEE Int’l Conf. Systems, Man and Cybernetics, pp. 164-169, 2005.
[4] R. Agrawal and R. Srikant, “Privacy Preserving Data Mining,” Proc. ACM SIGMOD Conf. Management of Data (SIGMOD ’00), pp. 439-450, May 2000.
[5] Q. Ma and P. Deng, “Secure Multi-Party Protocols for Privacy Preserving Data Mining,” Proc. Third Int’l Conf. Wireless Algorithms, Systems, and Applications (WASA ’08), pp. 526-537, 2008.
[6] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman, “A Pristine Clean Cabalistic Foruity Strategize Based Approach for Incremental Data Stream Privacy Preserving Data Mining,” Proc. IEEE Second Int’l Advance Computing Conf. (IACC), pp. 410-415, 2010.
[7] N. Lomas, “Data on 84,000 United Kingdom Prisoners is Lost,” Retrieved Sept. 12, 2008, http://news.cnet.com/83011009_3-10024550-83.html, Aug. 2008.
[8] BBC News, “Brown Apologises for Records Loss,” Retrieved Sept. 12, 2008, http://news.bbc.co.uk/2/hi/uk_news/politics/7104945.stm, Nov. 2007.
[9] D. Kaplan, “Hackers Steal 22,000 Social Security Numbers from Univ. of Missouri Database,” Retrieved Sept. 2008, http://www.scmagazineus.com/Hackers-steal22000-Social-Security-numbers-from-Univ.-of-Missouridatabase/article/34964/, May 2007.
[10] D. Goodin, “Hackers Infiltrate TD Ameritrade Client Database,” Retrieved Sept. 2008, http://www.channelregister.co.uk/2007/09/15/ameritrade_database_burgled/, Sept. 2007.
[11] L. Liu, M.
Kantarcioglu, and B. Thuraisingham, “Privacy Preserving Decision Tree Mining from Perturbed Data,” Proc. 42nd Hawaii Int’l Conf. System Sciences (HICSS ’09), 2009.
[12] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F. Dong, “Three New Approaches to Privacy-Preserving Add to Multiply Protocol and Its Application,” Proc. Second Int’l Workshop Knowledge Discovery and Data Mining (WKDD ’09), pp. 554-558, 2009.
[13] J. Vaidya and C. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data,” Proc. Eighth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’02), pp. 23-26, July 2002.
[14] M. Shaneck and Y. Kim, “Efficient Cryptographic Primitives for Private Data Mining,” Proc. 43rd Hawaii Int’l Conf. System Sciences (HICSS), pp. 1-9, 2010.
[15] C. Aggarwal and P. Yu, Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.
[16] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Int’l J. Uncertainty, Fuzziness and Knowledge-based Systems, vol. 10, pp. 557-570, May 2002.
[17] J. Dowd, S. Xu, and W. Zhang, “Privacy-Preserving Decision Tree Mining Based on Random Substitutions,” Proc. Int’l Conf. Emerging Trends in Information and Comm. Security (ETRICS ’06), pp. 145-159, 2006.
[18] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh, “Preservation of Patterns and Input-Output Privacy,” Proc. IEEE 23rd Int’l Conf. Data Eng., pp. 696-705, Apr. 2007.
[19] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. Prentice-Hall, 2002.
[20] P.K. Fong, “Privacy Preservation for Training Data Sets in Database: Application to Decision Tree Learning,” master’s thesis, Dept. of Computer Science, Univ. of Victoria, 2008.
[21] http://en.wikipedia.org/wiki/Advanced_Encryption_Standard

V Sangeeta completed her M.Tech at Andhra University, Visakhapatnam, in 2006. She is currently working as an Associate Professor and Head of the Department of Computer Science and Engineering at Pydah College of Engineering and Technology, JNTUK University. She is pursuing her Ph.D. degree in computer science at Andhra University. Her research focuses on Data Mining and Warehousing. Her areas of interest include Computer Networks, Network Security, Operating Systems and Computer Organization.

K Raghaveswara Rao completed his MCA at Aditya Institute of Technology and Management in 2007. He is pursuing an M.Tech in Computer Science and Engineering at Pydah College of Engineering and Technology, Visakhapatnam Dist, AP. His areas of interest include Data Mining and Warehousing, Computer Networks and Computer Organization.