International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 4 - Nov 2013
ISSN: 2231-5381, http://www.ijettjournal.org

Secure Classification Approach Over Outsourced Datasets

Ramkishor Pondreti*, Jayanthi Rao Madina#
* M.Tech scholar, Department of Software Engineering, SISTAM College, Srikakulam, Andhra Pradesh
# Assistant Professor, Dept. of CSE, SISTAM College, Srikakulam, Andhra Pradesh

Abstract: In this paper we propose an efficient, secure classification approach over outsourced datasets in data mining. Our approach is an empirical model of a privacy-preserving technique applied during the classification of data. We introduce a cryptography-based approach that protects centralized sample data sets utilized for decision tree data mining. Privacy preservation is applied to sanitize the samples prior to their release to third parties in order to mitigate the threat of their inadvertent disclosure or theft. In contrast to other sanitization methods, our approach does not affect the accuracy of data mining results. The decision tree can be built directly from the sanitized data sets, such that the originals do not need to be reconstructed. Moreover, this approach can be applied at any time during the data collection process, so that privacy protection can be in effect even while samples are still being collected.

I. INTRODUCTION

Data mining is widely used by researchers for science and business purposes. Data collected (referred to as “sample data sets” or “samples” in this paper) from individuals (referred to in this paper as “information providers”) are important for decision making or pattern recognition. Therefore, privacy-preserving processes have been developed to sanitize private information from the samples while keeping their utility [1].
Due to increasing concerns related to privacy, various privacy-preserving data mining techniques have been developed to address different privacy issues. These techniques usually operate under various assumptions and employ different methods. Among them, the perturbation method is extensively used in privacy-preserving data mining [2].

Data modification techniques maintain privacy by modifying attribute values of the sample data sets. Essentially, data sets are modified by eliminating or unifying uncommon elements among all data sets. These similar data sets act as masks for the others within the group because they cannot be distinguished from one another; every data set is loosely linked with a certain number of information providers. k-anonymity [15] is a data modification approach that aims to protect private information of the samples by generalizing attributes. k-anonymity trades off privacy against utility. Further, this approach can be applied only after the entire data collection process has been completed [3][4].

Privacy issues are further exacerbated now that the World Wide Web makes it easy for new data to be automatically collected and added to databases [4] [5] [6] [17]. The concerns over massive collection of data are naturally extending to the analytic tools applied to data. Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse [9] [10] [18]. A fruitful direction for future research in data mining will be the development of techniques that incorporate privacy concerns.

We introduce a cryptography-based approach that protects centralized sample data sets utilized for decision tree data mining. Privacy preservation is applied to sanitize the samples prior to their release to third parties in order to mitigate the threat of their inadvertent disclosure or theft.
In contrast to other sanitization methods, our approach does not affect the accuracy of data mining results. The decision tree can be built directly from the sanitized data sets, such that the originals do not need to be reconstructed. Moreover, this approach can be applied at any time during the data collection process, so that privacy protection can be in effect even while samples are still being collected.

II. RELATED WORK

Researchers have introduced various mechanisms for privacy-preserving data mining. These approaches are not confined to one specific domain of data mining; they cover clustering, classification, association rule mining, and others. Even though many approaches have been published, none is optimal; each has its own advantages and disadvantages in its proposed architecture.

Our approach concentrates on a privacy-preserving technique for classification and involves unrealized datasets. The data owner cannot have the data analyzed without sending it to an analyst, so privacy is also an important issue while data is transmitted between the data owner and the analyst. Traditional randomization and perturbation approaches provide security in the form of privacy preservation, whereas traditional classification mechanisms analyze the testing data or sample against the training data with no concept of privacy preservation. Privacy-preserving techniques such as randomization and perturbation may not be optimal because there is a chance of data loss (a data integrity problem).

Perturbation-based approaches attempt to achieve privacy protection by distorting information from the original data sets. The perturbed data sets still retain features of the originals so that they can be used to perform data mining directly or indirectly via data reconstruction.
Random substitutions [16] is a perturbation approach that randomly substitutes the values of selected attributes to achieve privacy protection for those attributes, and then applies data reconstruction when these data sets are needed for data mining. Even though privacy of the selected attributes can be protected, the utility is not recoverable because the reconstructed data sets are random estimations of the originals.

In this paper we introduce an efficient privacy-preserving cryptographic approach for the classification of datasets without exposing the users' sensitive information to the external world.

III. PROPOSED WORK

In this paper we propose a cryptographic classification approach.

[Figure 1. Privacy preserving architecture. Labels: Data Owner, Plain Dataset, Encoder, Cipher Dataset, Analyst, Classifier, Cipher Classification Rules, Decoder, Original Rules.]

The above architecture is described as follows. Our empirical model of a privacy-preserving technique integrates cryptography, using the AES algorithm, with classification, using the ID3 algorithm. The data owner maintains the training datasets, converts the realized datasets into unrealized datasets by encoding them with the AES algorithm, and forwards the unrealized datasets to the analyst for classification. The main advantage of the unrealized datasets is that classification does not require the semantics of the testing and training datasets.

Initialize Training Datasets for Machine Learning

A dataset is a collection of tuples over different attributes, with the possible values for each attribute and with class labels. It is given to the classification process so that the behaviour of the testing set can be analysed with a machine learning approach. A synthetic dataset can be gathered for the classification of results. Initially the dataset is forwarded to the encoder, and the encoder returns the cipher dataset.
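The encoder step can be illustrated with a minimal sketch. The paper specifies AES, but the Python standard library does not include AES, so this sketch swaps in a keyed HMAC-SHA256 token as a stand-in. It also assumes the encoder is deterministic (equal plain values of an attribute always map to the same cipher value), which the paper does not state but which the analyst needs in order to count value frequencies. The names `SECRET_KEY`, `encode_value`, `encode_dataset`, and `decode_row`, the toy rows, and the owner-side codebook are all hypothetical. Attribute names are left in the clear here.

```python
import hashlib
import hmac

# Hypothetical owner-side secret; in the paper's scheme this would be the AES key.
SECRET_KEY = b"owner-secret-key"


def encode_value(attribute, value, key=SECRET_KEY):
    """Deterministic keyed token for one attribute value (stand-in for AES)."""
    message = f"{attribute}={value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]


def encode_dataset(rows, key=SECRET_KEY):
    """Return (cipher_rows, codebook).

    cipher_rows mirrors rows with every value replaced by its token.
    codebook maps (attribute, token) back to the plain value and never
    leaves the data owner.
    """
    cipher_rows, codebook = [], {}
    for row in rows:
        cipher_row = {}
        for attribute, value in row.items():
            token = encode_value(attribute, value, key)
            cipher_row[attribute] = token
            codebook[(attribute, token)] = value
        cipher_rows.append(cipher_row)
    return cipher_rows, codebook


def decode_row(cipher_row, codebook):
    """Owner-side decoding of one cipher row through the codebook."""
    return {a: codebook[(a, t)] for a, t in cipher_row.items()}


# Toy plain dataset (hypothetical values, not the paper's data).
plain_rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
]

cipher_rows, codebook = encode_dataset(plain_rows)
```

Only `cipher_rows` would be sent to the analyst. Because an HMAC token cannot be inverted, the owner decodes through the codebook; with a real AES encoder the owner would decrypt with the key instead.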
Unrealized Dataset Creation

Usually data is passed to the analysts for machine learning purposes, but this raises a privacy issue regarding confidential information. In this paper we therefore use the AES algorithm to address the privacy issue. After applying this mechanism, the dataset becomes an unrealized dataset; i.e., the cipher dataset is passed to the analyst for classification instead of the plain sensitive or confidential information.

Classification with ID3

ID3 is an efficient machine learning approach for building decision trees, which are used for classification. The tree is constructed based on the entropy or information gain values of the attributes. The classification rules can be analyzed efficiently by running the testing data against the tree built from the training datasets.

Retrieval of Original Classified Results

After the classification results are generated, they are passed to the data owner, where the administrator performs attribute-oriented decryption on the result set. The original data set can be reconstructed by the decoder, and the classification rules are finally obtained at the data owner end.

IV. EXPERIMENTAL ANALYSIS

ID3 builds a decision tree from a fixed set of examples. The resulting tree is used to classify future samples. Each example has several attributes and belongs to a class (like yes or no). The leaf nodes of the decision tree contain the class name, whereas a non-leaf node is a decision node. A decision node is an attribute test, with each branch (to another decision tree) being a possible value of the attribute. ID3 uses information gain to help it decide which attribute goes into a decision node. The advantage of learning a decision tree is that a program, rather than a knowledge engineer, elicits knowledge from an expert. Gain measures how well a given attribute separates training examples into targeted classes.
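The gain computation that ID3 uses at a decision node can be sketched as follows, using the standard ID3 definitions of entropy and information gain. The toy rows and the `relabel` helper are hypothetical; `relabel` replaces every value with an opaque per-attribute token and stands in for the cipher dataset, under the assumption that the encoder maps equal plain values to equal cipher values. Under that assumption the gains on the plain and the relabeled data coincide, because gain depends only on how values partition the rows, not on what the values mean.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Entropy of the current set minus the weighted entropy of its child sets."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attribute]].append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[target] for row in rows]) - remainder


def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain, as ID3 picks for a node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


def relabel(rows):
    """Replace each value by an opaque per-attribute token (toy stand-in for encryption)."""
    tokens = {}
    return [
        {a: tokens.setdefault((a, v), f"{a}#{len(tokens)}") for a, v in row.items()}
        for row in rows
    ]


# Toy training set (hypothetical values, not the paper's data).
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
cipher_like_rows = relabel(rows)

for attribute in ("outlook", "windy"):
    plain_gain = information_gain(rows, attribute, "play")
    cipher_gain = information_gain(cipher_like_rows, attribute, "play")
    print(attribute, round(plain_gain, 4), round(cipher_gain, 4))
```

The same attribute is chosen on both versions of the data, which is why the analyst can build the tree without seeing the plain values.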
The attribute with the highest information gain (information being the most useful for classification) is selected for the decision node. In order to define gain, we first borrow an idea from information theory called entropy. Entropy measures the amount of information in an attribute. This is the formula for calculating the homogeneity of a sample S:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where c is the number of classes and p_i is the proportion of examples in S that belong to class i. Entropy is then used to measure the information gain with respect to an attribute A:

$$\mathrm{Gain}(A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where S is the current set and each child set S_v holds the examples of S whose value for A is v. Each child entropy is weighted by the fraction |S_v|/|S|, as in the standard ID3 definition.

For our experimental results we use a synthetic dataset. Figure 2 shows the dataset at the data owner side before it is converted to an unrealized dataset. After converting the dataset to an unrealized dataset, the data owner forwards it to the analyst.

Figure 2. Original Data at Owner

At the analyst end, the analyst constructs the decision tree for the encrypted unrealized dataset based on information gain and analyzes the testing data with the training or unrealized dataset.

Figure 3. Unrealized Dataset

The decision tree is constructed with the class labels based on information gain, in terms of entropy; the tree is shown in Figure 4.

Figure 4. Tree and Eligible Data

The final eligible data after classification and decryption at the data owner end is shown in Figure 5.

Figure 5. Eligible Data

V. CONCLUSION AND FUTURE WORK

In this paper we proposed an efficient privacy preservation technique during classification of unreal datasets. It protects the data owner from unauthorized access and privacy issues. Our proposed approach works efficiently without violating the classification properties. Meanwhile, an accurate decision tree can be built directly from those unreal data sets.
Finally, the approach yields accurate results even though classification is applied to the cipher dataset. It classifies the testing data with the training data without losing data integrity. The cryptographic mechanism provides secure classification during data transmission between the data owner and the analyst. AES is already established as a more secure cryptographic approach than the traditional approaches.

REFERENCES

[1] P. K. Fong and J. H. Weber-Jahnke, “Privacy Preserving Decision Tree Learning Using Unrealized Data Sets.”
[2] L. Liu, “Privacy Preserving Decision Tree Mining from Perturbed Data.”
[3] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Int’l J. Uncertainty, Fuzziness and Knowledge-based Systems, vol. 10, pp. 557-570, May 2002.
[4] Q. Ma and P. Deng, “Secure Multi-Party Protocols for Privacy Preserving Data Mining,” Proc. Third Int’l Conf. Wireless Algorithms, Systems, and Applications (WASA ’08), pp. 526-537, 2008.
[5] N. Lomas, “Data on 84,000 United Kingdom Prisoners is Lost,” Retrieved Sept. 12, 2008, http://news.cnet.com/83011009_3-10024550-83.html, Aug. 2008.
[6] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman, “A Pristine Clean Cabalistic Foruity Strategize Based Approach for Incremental Data Stream Privacy Preserving Data Mining,” Proc. IEEE Second Int’l Advance Computing Conf. (IACC), pp. 410-415, 2010.
[7] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh, “Preservation of Patterns and Input-Output Privacy,” Proc. IEEE 23rd Int’l Conf. Data Eng., pp. 696-705, Apr. 2007.
[8] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. Prentice-Hall, 2002.
[9] D. Goodin, “Hackers Infiltrate TD Ameritrade client Database,” Retrieved Sept. 2008, http://www.channelregister.co.uk/2007/09/15/ameritrade_database_burgled/, Sept. 2007.
[10] L. Liu, M. Kantarcioglu, and B. Thuraisingham, “Privacy Preserving Decision Tree Mining from Perturbed Data,” Proc. 42nd Hawaii Int’l Conf. System Sciences (HICSS ’09), 2009.
[11] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F. Dong, “Three New Approaches to Privacy-Preserving Add to Multiply Protocol and Its Application,” Proc. Second Int’l Workshop Knowledge Discovery and Data Mining (WKDD ’09), pp. 554-558, 2009.
[12] J. Vaidya and C. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data,” Proc. Eighth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’02), pp. 23-26, July 2002.
[13] J. Dowd, S. Xu, and W. Zhang, “Privacy-Preserving Decision Tree Mining Based on Random Substitutions,” Proc. Int’l Conf. Emerging Trends in Information and Comm. Security (ETRICS ’06), pp. 145-159, 2006.
[14] C. Aggarwal and P. Yu, Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.
[15] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Int’l J. Uncertainty, Fuzziness and Knowledge-based Systems, vol. 10, pp. 557-570, May 2002.

BIOGRAPHIES

Ramkishor Pondreti is working as an Assistant Professor in Aditya Institute of Technology And Management, Tekkali. He received his B.Tech from Aditya Institute of Technology And Management, Tekkali, and his M.Tech from Avanthi Institute of Engineering & Technology, Visakhapatnam. He is pursuing an M.Tech in Sarada Institute of Science, Technology and Management, Srikakulam, Andhra Pradesh. His areas of interest are Data Structures, Java and Oracle database.

Jayanthi Rao Madina is working as HOD in Sarada Institute of Science, Technology And Management, Srikakulam, Andhra Pradesh. He received his M.Tech (CSE) from Aditya Institute of Technology And Management, Tekkali, Andhra Pradesh. His research areas include Image Processing, Computer Networks, Data Mining, and Distributed Systems.