International Journal of Engineering Trends and Technology (IJETT) – Volume 16 Number 1 – Oct 2014

A Novel and Privacy Preserving Unsupervised Learning Approach between Data Holders

Potnuru Srilatha1, M. VenkataBalaji Chadrashekar2
1Final Year M.Tech Student, 2Assoc. Professor, Dept. of CSE, Aditya Institute of Technology and Management (AITAM), Tekkali, Srikakulam, Andhra Pradesh

Abstract: Privacy preservation in data mining over distributed networks remains an important research issue in knowledge and data engineering and in community-based clustering, because privacy must be protected whenever datasets from different data holders (players) are integrated for mining. Secure mining of data is required in open networks. In this paper we propose an efficient privacy-preserving data clustering technique for distributed networks. Existing approaches focus mainly on protecting the multiparty computation, which exposes the limitations of privacy-preserving clustering in the multi-user case: multiple data owners hold their own databases and exchange information with one another to compute a central clustering output, and the result is obtained only through several iterations of heavy communication between the data owners.

I. INTRODUCTION

Technology is growing rapidly, and knowledge discovery from information is becoming ever more prevalent. Clustering is a basic technique for data mining and analysis: it groups items or objects with similar properties and is used to reduce processing time when segmenting data. The best-known clustering method is the k-means algorithm, which is applied in pattern recognition, customer behaviour analysis, and many other real-life settings. The data used in clustering is often highly sensitive information, such as the financial records held by a bank.
There are privacy issues to consider for this confidential information. A data player such as a bank must not leak information about others: its customers' records must not be observable by outsiders. Moreover, further issues arise when other third parties take part in the clustering analysis [1][2]: the databases of financial organizations must not disclose their information to others. Another application of clustering arises in cloud computing, which introduced Software as a Service. Here a data owner sends the data to be clustered to a service provider in order to gain benefits such as lower cost and pay-per-use pricing. The service provider executes the clustering process and returns the formatted results to the data owner. In this setting, privacy is needed: the provider must cluster without accessing the actual data sent by the data owner.

ISSN: 2231-5381

Another line of research is the perturbation method, in which the data owner creates a perturbed database from the actual database by adding noise. The perturbed database can then be browsed by any other party; whoever collects it must perform the clustering methods on the perturbed data. Other research proposes a solution to the privacy-preserving clustering problem in which the data is partitioned by attributes and decentralized across many parties. There the k-means algorithm is used, with complexity O(nrk), where n is the number of data items, r the number of users, and k the number of clusters [5,6]. Many clustering algorithms compute distances between data points and compare them with the distances to the dividing points. The more accurately these distances are computed, the more accurate the mining output; otherwise the results are degraded by the various distortions introduced. Our proposed method is therefore designed to ensure correct results.
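The perturbation method described above can be sketched in a few lines. This is a minimal illustration, not the paper's own implementation; the function name, noise scale, and toy records are our assumptions:

```python
import random

def perturb(records, noise_scale=0.5, seed=7):
    """Create a publishable copy of a numeric database by adding
    zero-mean Gaussian noise to every attribute of every record."""
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, noise_scale) for value in record]
            for record in records]

# Toy database: each inner list is one record of numeric attributes.
original = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
published = perturb(original)
# The published copy has the same shape, but no attribute matches exactly,
# so anyone clustering it works only with distorted values.
```

Note that additive noise of this kind does not preserve the pair-wise distances between records, which is exactly why clustering results computed on the published copy can be distorted.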
Many traditional methods use additive noise or multiplicative noise to strengthen privacy. Additive noise distorts the clustering result more, because it does not preserve the pair-wise distances of the actual data. The disturbance to clustering can be measured by the difference between the clustering outputs on the actual and perturbed data; measures such as variation of information have also been studied for comparing clusterings [8].

http://www.ijettjournal.org Page 32

II. RELATED WORK

An example of the privacy problem in clustering methods is found in business organizations. Many companies keep large records of their customers and their buying behaviour. These companies may decide to cooperate and apply data clustering to their combined datasets for mutual benefit, since doing so gives them an advantage over other companies. The aim is to divide a market into partitions of customers, where any partition may be selected as a target market to be reached with a particular marketing mix. The companies would like to transform their data in a way that ensures the privacy of their customers cannot be violated. Some researchers have introduced methods for ensuring only moderate disclosure when exploring detailed data. One such approach first builds a decision tree from the training data and then discloses the confidential data only for the confidential class labels; it is accurate near the root of the tree but has low precision deeper down. Another approach also builds a decision tree from the training data, but the values of each record are perturbed by adding random noise drawn from a probability distribution. The resulting records differ from the actual records and are not accurate compared with the original data. Yet another approach focuses on information loss.
That approach is based on expectation maximization, which converges to estimates close to the actual data even when the original database is large, and it also shares features with the Bayesian classification method. Other traditional work introduced an association rule mining scheme that is efficient for categorical items while preserving the privacy of every transaction. The idea is that some items in each transaction are replaced by new items not present in the original transaction: some information is removed and some is added in order to reach the desired level of privacy. The scheme is quite flexible, in that association rules can still be recovered while the data is protected by uniform randomization.

Requirements of privacy (for example in the case of medical records) or corporate secrecy can prevent such collaborative data mining. Privacy-preserving data mining [3, 4] concerns itself with algorithms and methods that allow data owners to pursue such endeavors without needing to reveal any items from their own databases. The two goals in this setting are privacy, i.e. no "leakage" of actual data, and communication efficiency, i.e. minimizing the amount of data passed between the users.

Large databases and data streams: Algorithms that work on large databases and data streams must be efficient in terms of I/O access. Since secondary-storage access can be significantly more expensive than main-memory access, an algorithm should make a small number of passes over the data, ideally no more than one pass. Also, since main-memory capacity typically lags significantly behind secondary storage, these algorithms should have sub-linear space complexity. Although the notion of data streams has been formalized relatively recently, a surprisingly large body of results already exists for stream data mining [10, 1, 2, 8].
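The uniform-randomization idea above (replace some items in each transaction with random items, so no single transaction can be trusted) might look like the following sketch. The keep-probability, item universe, and function name are illustrative assumptions, not taken from the cited scheme:

```python
import random

def randomize_transaction(transaction, item_universe, p_keep=0.8, rng=None):
    """Keep each item with probability p_keep; otherwise replace it with
    an item drawn uniformly from the universe. Aggregate association
    rules can still be estimated from many randomized transactions."""
    rng = rng or random.Random()
    return [item if rng.random() < p_keep else rng.choice(item_universe)
            for item in transaction]

universe = ["bread", "milk", "eggs", "beer", "diapers"]
t = ["bread", "milk", "beer"]
published = randomize_transaction(t, universe, rng=random.Random(3))
# Same length as the original transaction, but individual items
# may have been swapped for random ones.
```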
Most privacy-preserving protocols in the literature convert existing data mining algorithms or distributed data mining algorithms into privacy-preserving protocols. The resulting protocols can sometimes leak additional information. For example, in the privacy-preserving clustering protocols of [4,1], the two collaborating parties learn the candidate cluster centers at the end of each iteration. As explained in [2], this intermediate information can sometimes reveal some of the parties' original data, thus breaching privacy. While it is possible to modify these protocols to avoid revealing the candidate cluster centers [7], the resulting protocols have very high communication complexity. In this paper, we describe a k-clustering algorithm that we specifically designed to be converted into a communication-efficient protocol for private clustering.

Private decentralized data mining: The data is distributed among different databases owned by various data players. Distributed data mining acquires knowledge from these databases, each of which is owned by a different organization. The threat to privacy is real, since data mining techniques are able to derive highly sensitive knowledge from unclassified data, knowledge that is not even known to the database holders.
Worse still is the privacy invasion occasioned by the secondary use of data, when individuals are unaware of the "behind the scenes" use of data mining techniques [10]. Distributed data mining makes it possible to obtain more knowledge, or knowledge of wider applicability, than mining the data owned by a single party.

The error-sum-of-squares (ESS) objective function is defined as the sum of the squares of the distances between points in the database and their nearest cluster centers. The k-clustering problem requires partitioning the data into k clusters with the objective of minimizing the ESS. Lloyd's algorithm (more popularly known as the k-means algorithm) is a popular clustering tool for this problem; it converges to a local minimum of the ESS in a finite number of iterations. Ward's algorithm for hierarchical agglomerative clustering also makes use of the notion of ESS. Although Ward's algorithm has been observed to work well in practice, it is rather slow (O(n^3)) and does not scale well to large databases.

After preprocessing, a feature set is extracted from the data objects, and a document weight or file relevance score can be computed to convert each categorical object to a numerical value. The document weight can be computed from keyword frequencies, as term frequency (the number of occurrences of a keyword in a document) and inverse document frequency (derived from the number of documents in which the keyword occurs). A cosine similarity matrix can then be constructed over the document weights with the similarity measure below; precomputing the cosine similarity matrix reduces the time complexity and improves performance. We use the most widely used similarity measure, cosine similarity.

III. PROPOSED SYSTEM

We propose an empirical approach for secure clustering over distributed networks with an enhanced k-means algorithm and a cryptographic mechanism. The clustering stage groups similar data objects out of a large set of data points, and the cryptographic algorithm maintains privacy (data confidentiality) while data is transmitted over the network. Every individual node in the distributed network is treated as a data holder or player. Each data holder first preprocesses its documents by removing inconsistent features. The similarity measure is

Cos(dm, dn) = (dm · dn) / (||dm|| ||dn||)

where dm is the centroid (document weight) and dn is the document weight or file weight of a document. The cosine similarity of two documents can then be retrieved efficiently by a simple intersection of the row and column of the respective document (object) ids. A sample cosine similarity matrix is as follows:

      d1    d2    d3    d4    d5
d1   1.0   0.48  0.66  0.89  0.45
d2   0.77  1.0   0.88  0.67  0.88
d3   0.45  0.9   1.0   0.67  0.34
d4   0.32  0.47  0.77  1.0   0.34
d5   0.67  0.55  0.79  0.89  1.0

In the above cosine similarity matrix, D(d1, d2, ..., dn) is the set of documents at a data holder (player) and the entries are the cosine similarities between the data objects. Transmitting the preprocessed documents instead of the raw documents minimizes time and space complexity when computing the similarity between the centroids and the documents during clustering.
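The similarity matrix just described, together with the enhanced k-means clustering of this section (cosine-similarity assignment, intra-cluster average re-centroiding), can be sketched as follows. This is a toy illustration with our own function names and two-dimensional weight vectors, not the system's actual code:

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two document weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(docs):
    """Precompute pairwise cosine similarities so a later lookup is a
    simple row/column intersection, as in the sample matrix above."""
    return [[cosine(a, b) for b in docs] for a in docs]

def kmeans_cosine(docs, k, max_iters=20, seed=1):
    """k-means using cosine similarity for assignment and intra-cluster
    averages (rather than random picks) for re-centroiding."""
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)            # Step 1: initial centroids
    for _ in range(max_iters):                 # Step 2: bounded iterations
        clusters = [[] for _ in range(k)]
        for d in docs:                         # Steps 3-4: assign by similarity
            best = max(range(k), key=lambda i: cosine(d, centroids[i]))
            clusters[best].append(d)
        new_centroids = []
        for i, members in enumerate(clusters): # Step 5: average re-centroid
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[j] for m in members) / len(members)
                     for j in range(dim)])
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:         # converged: stop early
            break
        centroids = new_centroids
    return clusters, centroids

# Four toy document weight vectors forming two natural groups.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters, centroids = kmeans_cosine(docs, 2)
```

On this toy input the algorithm separates the two natural groups; in the real system each vector would hold the file relevance scores computed during preprocessing.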
In our approach we enhance the k-means clustering algorithm with intra-cluster-average-based re-centroid computation, instead of selecting a random centroid from the new cluster at every iteration. The following shows the optimized k-means algorithm:

Algorithm:
Step 1: Select any k data objects (points) from the data set as centroids.
Step 2: While (the number of clustering iterations <= the user-specified maximum number of iterations):
Step 3: Compute CosineSimilarity(di, dj), where di is the file relevance score of data object i and dj is the file relevance score of data object j.
Step 4: Assign each data object to its most similar (closest) centroid.
Step 5: Re-compute each new centroid from its intra-cluster data points, i.e. the average of the k data points (file relevance scores) in the individual cluster, e.g. (P11 + P12 + ... + P1k) / k, using all data points from the same cluster for each new centroid.
Step 6: Continue steps 2 to 5.

In the traditional k-means approach, data objects are grouped around randomly selected centroids; in this paper we compute the centroid as the intra-cluster average at every iteration. Preprocessing the documents improves performance and reduces the time and space cost over the network. Privacy is maintained with the Triple DES algorithm: a random session key is generated to encrypt the preprocessed data objects, which are then forwarded to the requesting data holder; at the receiver end the cipher data objects are decrypted with the same key.

IV. CONCLUSION

We conclude our current research work with an efficient privacy-preserving data clustering scheme for distributed networks. The quality of the clustering mechanism is enhanced by the preprocessing, the relevance matrix, and the new centroid computation in the k-means algorithm, while the cryptographic technique secures the transmission of data between data holders (players) and reduces the privacy-preserving cost by forwarding the relevant features of the dataset instead of the raw datasets.

REFERENCES

[1] F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and H. (Wendy) Wang, "Privacy-Preserving Mining of Association Rules From Outsourced Transaction Databases."
[2] P. K. Fong and J. H. Weber-Jahnke, "Privacy Preserving Decision Tree Learning Using Unrealized Data Sets."
[3] T. Tassa and D. J. Cohen, "Anonymization of Centralized and Distributed Social Networks by Sequential Clustering."
[4] S. Jha, L. Kruger, and P. McDaniel, "Privacy Preserving Clustering."
[5] C. Clifton, M. Kantarcioglu, X. Lin, and M. Y. Zhu, "Tools for Privacy Preserving Distributed Data Mining."
[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed K-Means clustering over a peer-to-peer network," IEEE TKDE, vol. 21, no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents by distributed P2P clustering," in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P overlays," in ICPP, 2003.
[10] S. R. M. Oliveira and O. R. Zaïane, "Privacy Preserving Clustering By Data Transformation."

BIOGRAPHIES

She completed her B.Tech in the Dept. of CSIT, SARADA Institute of Technology and Management (SITAM), Ampolo Road, Srikakulam, Andhra Pradesh. She is pursuing her M.Tech in the Dept. of CSE, Aditya Institute of Technology and Management (AITAM), Tekkali, Srikakulam, Andhra Pradesh. Her areas of interest are data mining, network security and cloud computing.
In future work, security can be further enhanced by establishing an efficient key-exchange protocol and additional cryptographic techniques for the transmission of data between data holders or players.