International Journal of Engineering Trends and Technology (IJETT) – Volume 16 Number 1 – Oct 2014 (ISSN: 2231-5381)

An Empirical Model of a Secure Clustering Technique over a Decentralized Architecture

Y.V.Siva Prasad1, P.Rajasekhar2
Final M.Tech Student1, Assistant Professor2
1,2 Dept. of CSE, Avanthi Institute of Engineering & Technology, Makavarapallem, A.P.

Abstract: Privacy-preserving data mining in distributed networks is still an important research issue in the field of knowledge and data engineering. In community-based clustering approaches, privacy is an important factor when datasets are integrated from different data holders or players for mining, and secure mining of data is required in open networks. In this paper we propose an efficient privacy-preserving data clustering technique for distributed networks.

I. INTRODUCTION

In data mining, many databases are connected to each other in large communication networks, and a key problem is the privacy of these databases. Hence the concept of privacy-preserving data mining arises, and much research has addressed securing data from intruders in the network. Many datasets are also shared in the network. Previous research has introduced several solutions to this problem, including decision-tree-based classification and clustering-based classification. All these solutions depend on knowledge to secure the private datasets, and data owners can take different approaches to integrating with distributed databases. Classification work has traditionally relied on tree-based methods. From a privacy point of view, idle databases are deeply verified. In the initial phases, the construction of databases in distributed networks is very hard to manage. In earlier approaches, transformations were applied to the dataset before the algorithms were run; in another line of work, secure multiparty computation is used. Introducing data mining concepts brings many advantages to distributed networks.
In such environments it is very hard to gather all information at a central site, yet users have great flexibility to download data from a centralized location to carry out data mining operations. Distributed mining provides various methods that address the limitations of centralized data mining: it distributes computing and communication, and it increases the gain obtainable from data in many application domains. Data analysis plays a main role in the next generation of data-driven problem-solving methods.

In peer-to-peer (P2P) or distributed networks there are several challenges: scalability, efficiency, frequency of users, and decentralization of data. Regarding scalability, as the number of users grows, the data stored in the network increases, and maintaining the resources to be shared and exchanged in the network becomes difficult. Efficiency is a main challenge because, as the number of users and resources scales up, the load on the algorithms and their running time increase. As for decentralization, although some peer-to-peer systems still use central servers for various purposes, a centralized process is considered unsuitable for most tasks. This is especially true for tasks involving large amounts of data: the algorithms need to be designed to transfer minimal data and to use no centralized coordination.

For a real-world P2P system, data at the peers often change, so the algorithms have to work incrementally. This is especially true for sensor networks, where the data is generally considered to be streaming. Any algorithm that needs to start from scratch whenever the data changes is thus inappropriate for our goals, so incremental methods are required. Since the rate of data change may be higher than the rate of computation in some applications, the methods must be able to report an incomplete solution at any time; these are called anytime algorithms. Distributed systems can be very large.
Therefore any attempt to synchronize the entire network is likely to fail due to connection latency or, in the case of sensor networks, limited bandwidth. Thus, any algorithm developed for a peer-to-peer system should not take the route of global synchronization.

II. RELATED WORK

Clustering is the method of grouping a set of items in such a way that items in the same group are more similar to each other than to those in other groups. It is a key task of data mining and a general method for statistical data processing, utilized in many fields including machine learning and pattern recognition. Clustering is not a single particular algorithm but the basic task to be solved. It can be solved by different algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. The proper clustering algorithm and parameter configuration depend on the individual data set and the intended use of the results. Clustering as such is not an automatic task but a cyclic process of interactive multi-objective optimization involving trial and failure, so careful data preprocessing is necessary.

Let us assume that the data is one-dimensional (a single attribute y) and partitioned across s sites. Each site l has n_l data items (so n = n_1 + ... + n_s). Let z_ij^(t) denote the membership of the j-th data point in the i-th cluster at the t-th EM round. In the E step, mu_i (the mean for cluster i), sigma_i^2 (the variance for cluster i) and pi_i (the estimate of the proportion of items in cluster i) are computed using sums of the following form (reconstructed here in the standard EM notation):

  mu_i = [ SUM_{l=1..s} ( SUM_{j=1..n_l} z_ij y_j ) ] / [ SUM_{l=1..s} ( SUM_{j=1..n_l} z_ij ) ]
  sigma_i^2 = [ SUM_{l=1..s} ( SUM_{j=1..n_l} z_ij (y_j - mu_i)^2 ) ] / [ SUM_{l=1..s} ( SUM_{j=1..n_l} z_ij ) ]
  pi_i = [ SUM_{l=1..s} ( SUM_{j=1..n_l} z_ij ) ] / n

The second (inner) summation in all these cases is local to each site, and it is easy to see that sharing this local value does not reveal the individual y_j to the other sites. It is also not necessary to share n_l or the inner summation values themselves: it suffices to compute n and the global summations above using the secure sum technique described earlier. In the M step, the z values can be partitioned and computed locally given the global mu_i, sigma_i^2 and pi_i, so this step also involves no data sharing across sites.

For the term "clustering" there are a number of terms with similar meanings, including automatic classification and numerical taxonomy. The subtle differences are often in the usage of the results: in data mining the resulting groups are the matter of interest, while in automatic classification the resulting discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.

The scenario described next is one in which two parties with databases D1 and D2 wish to apply the decision tree algorithm on the joint database D1 U D2 without revealing any unnecessary information about their databases. The technique uses secure multiparty computation under the semi-honest adversary model and attempts to reduce the number of bits communicated between the two parties. The traditional ID3 algorithm computes a decision tree by choosing, at each tree level, the best attribute to split on at that level and thus partition the data. Tree building is complete when the data is uniquely partitioned into a single class value or there are no attributes left to split on. The selection of the best attribute uses information gain theory: it selects the attribute that minimizes the entropy of the partitions and thus maximizes the information gain. In the PPDM scenario, the information gain for every attribute has to be computed jointly over all the database instances without divulging individual site data.
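The secure sum primitive referred to above can be sketched as a simple ring protocol: the initiating site masks its value with a random offset, every other site adds its local value to the masked running total, and the initiator removes the mask at the end. This is a minimal illustration for the honest-but-curious model; the function name and modulus are our own, not from the paper.

```python
import random

def secure_sum(local_values, modulus=1 << 32):
    """Ring-based secure sum sketch: each site sees only a masked total."""
    # Initiating site masks its own value with a random offset R.
    R = random.randrange(modulus)
    running = (R + local_values[0]) % modulus
    # Each remaining site adds its local value to the masked running total;
    # the true partial sums stay hidden behind R.
    for v in local_values[1:]:
        running = (running + v) % modulus
    # The initiator removes the mask to recover the true global sum.
    return (running - R) % modulus

print(secure_sum([10, 20, 30]))  # prints 60
```

In the EM setting above, each site's "local value" would be its inner summation (e.g. the site-local sum of z_ij * y_j), so only the global numerator and denominator are ever revealed.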
We can show that this problem reduces to privately computing x ln x in a protocol which receives x1 and x2 as input, where x1 + x2 = x.

The basic frequency mining problem can be described as follows. There are n customers and each customer i has a Boolean value d_i. The problem is to find the total number of 1s and 0s without learning the individual customer values, i.e. to compute the sum d = d_1 + ... + d_n without revealing each d_i. We cannot use the secure sum protocol because of the following restrictions:

• Each customer can send only one flow of communication to the miner, and then there is no further interaction.
• The customers never communicate between themselves.

The technique presented in [8] uses the additively homomorphic property of a variant of ElGamal encryption, described below (the symbols are reconstructed here from the standard form of the protocol). Let G be a group in which the discrete logarithm is hard, and let g be a generator of G. Each customer i has two private/public key pairs (x_i, X_i = g^x_i) and (y_i, Y_i = g^y_i). The products X = PROD_i X_i and Y = PROD_i Y_i, along with G and the generator g, are known to everyone. Each customer sends to the miner the two values m_i = g^d_i * X^y_i and h_i = Y^x_i. The miner computes r = (PROD_i m_i) / (PROD_i h_i); the key contributions cancel, leaving r = g^d. For the value of d for which g^d = r, we can show that d represents the sum of the d_i. Since 0 <= d <= n, this value is easy to find by encrypt-and-compare. We can also show that, assuming all the keys are distributed properly when the protocol starts, this frequency mining protocol protects each honest customer's privacy against the miner and up to (n - 2) corrupted customers.

III. PROPOSED SYSTEM

We propose an empirical approach for secure clustering over distributed networks with an enhanced k-means algorithm and a cryptographic mechanism. The clustering approach groups similar data objects from a large set of data points, and the cryptographic algorithm maintains privacy and data confidentiality during the transmission of data over the network. Every individual node can be treated as a data holder or player in the distributed network.
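Before turning to the clustering details, the frequency-mining protocol sketched in Section II can be illustrated as follows. This is a toy sketch only: the prime modulus and generator are illustrative choices of ours and are far too small to be secure, and in a real run each customer would compute its own (m_i, h_i) pair locally rather than in one function.

```python
# Toy sketch of the additively homomorphic frequency-mining protocol [8].
# All parameters are illustrative, not cryptographically secure.
import random

p = 2_147_483_647   # the Mersenne prime 2^31 - 1; stands in for the group G
g = 7               # illustrative group element standing in for the generator

def mine_frequency(bits):
    n = len(bits)
    xs = [random.randrange(1, p - 1) for _ in range(n)]  # private keys x_i
    ys = [random.randrange(1, p - 1) for _ in range(n)]  # private keys y_i
    X = 1
    for x in xs:
        X = X * pow(g, x, p) % p      # X = prod g^{x_i}, public
    Y = 1
    for y in ys:
        Y = Y * pow(g, y, p) % p      # Y = prod g^{y_i}, public
    # Each customer i sends exactly two values to the miner:
    ms = [pow(g, d, p) * pow(X, y, p) % p for d, y in zip(bits, ys)]
    hs = [pow(Y, x, p) for x in xs]
    # Miner: r = (prod m_i) / (prod h_i) = g^d, the key terms cancelling.
    r = 1
    for m in ms:
        r = r * m % p
    for h in hs:
        r = r * pow(h, p - 2, p) % p  # modular inverse via Fermat's little theorem
    # Encrypt-and-compare: try each candidate sum 0..n.
    for d in range(n + 1):
        if pow(g, d, p) == r:
            return d
```

Note that the miner never sees any individual d_i, only the products of the masked values.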
The data holder initially preprocesses each document by removing the inconsistent feature set from the data objects. After preprocessing, a document weight (file relevance score) is computed to convert the categorical object to a numerical value. The document weight is computed in terms of keyword frequency, combining term frequency (the number of occurrences of a keyword in a document) with inverse document frequency (based on how many documents in the collection contain the keyword).

A cosine similarity matrix is then constructed over the document weights. We use the most widely used similarity measure, cosine similarity:

  Cos(dm, dn) = (dm . dn) / (||dm|| * ||dn||)

where dm is the centroid (document weight) and dn is the document weight or file weight. Pre-construction of the cosine similarity matrix reduces the time complexity and improves performance: the cosine similarity of a document can be retrieved efficiently by a simple intersection of the row and column of the respective document or object ids. A sample cosine similarity matrix is as follows:

         d1      d2      d3      d4      d5
  d1    1.0     0.77    0.45    0.32    0.67
  d2    0.48    1.0     0.90    0.47    0.55
  d3    0.66    0.88    1.0     0.77    0.79
  d4    0.89    0.67    0.67    1.0     0.89
  d5    0.45    0.88    0.34    0.34    1.0

Fig 1: Cosine Similarity Matrix

In the above cosine similarity matrix, D = (d1, d2, ..., dn) represents the set of documents at a data holder or player, and the entries are the respective cosine similarities between the data objects. Transmitting the preprocessed documents instead of raw documents minimizes time and space complexity when computing the similarity between the centroids and documents during clustering.
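The weighting and similarity-matrix construction described above can be sketched as follows. This is a minimal illustration; the helper names and the exact TF-IDF variant (plain log(N/df), under which a term occurring in every document gets weight 0) are our own assumptions, not the paper's.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cos(u, v) = (u . v) / (||u|| * ||v||) over sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(docs):
    """Precompute all pairwise cosine similarities, as in Fig 1."""
    vecs = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vecs] for a in vecs]
```

Once the matrix is built, the similarity of documents i and j is a constant-time lookup at row i, column j, which is the "intersection of row and column" retrieval described above.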
In our approach we enhance the k-means clustering algorithm with intra-cluster-average re-computation of centroids, instead of randomly selecting a centroid from each new cluster in every iteration. The optimized k-means algorithm is as follows:

Algorithm:
Step 1: Select any K data objects (points) from the data set as initial centroids.
Step 2: While (number of clustering iterations <= user-specified maximum number of iterations), repeat steps 3 to 5.
Step 3: Compute the cosine similarity (di, dj), where di is the file relevance score of data object i and dj is the file relevance score of data object j.
Step 4: Assign each data object to its most similar (closest) centroid.
Step 5: Re-compute each new centroid from the intra-cluster data points, i.e. the average of the k data points (file relevance scores) in the individual cluster, e.g. (P11 + P12 + ... + P1k) / K, where all data points come from the same cluster.
Step 6: Continue steps 2 to 5.

In the traditional k-means approach, data objects are grouped around randomly selected centroids; in this paper we compute the centroid from intra-cluster averages in every iteration. Preprocessing the documents improves performance and avoids the time and space complexity issues that otherwise arise over networks.

Privacy is maintained by the Triple DES algorithm: a random session key is generated to encrypt the preprocessed data objects, which are then forwarded to the requesting data holder; at the receiver end the cipher data objects are decrypted with the same key.

IV. CONCLUSION

We conclude our current research work with an efficient privacy-preserving data clustering technique over distributed networks. The quality of the clustering mechanism is enhanced with preprocessing, the relevance (similarity) matrix and the intra-cluster centroid computation in the k-means algorithm, while the cryptographic technique secures the transmission of data between data holders or players and reduces the privacy-preserving cost by forwarding the relevant features of the dataset instead of raw datasets. In future work, security can be further enhanced by establishing an efficient key exchange protocol and stronger cryptographic techniques for data transmission between data holders or players.

REFERENCES
[1] F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and H. (Wendy) Wang, "Privacy-preserving mining of association rules from outsourced transaction databases."
[2] P. K. Fong and J. H. Weber-Jahnke, "Privacy preserving decision tree learning using unrealized data sets."
[3] T. Tassa and D. J. Cohen, "Anonymization of centralized and distributed social networks by sequential clustering."
[4] S. Jha, L. Kruger, and P. McDaniel, "Privacy preserving clustering."
[5] C. Clifton, M. Kantarcioglu, X. Lin, and M. Y. Zhu, "Tools for privacy preserving distributed data mining."
[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed K-means clustering over a peer-to-peer network," IEEE TKDE, vol. 21, no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents by distributed P2P clustering," in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P overlays," in ICPP, 2003.
[10] S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving clustering by data transformation."

BIOGRAPHIES

Y.V.Siva Prasad completed his B.Tech at Avanthi Institute of Engineering & Technology, Makavarapallem, Narsipatnam, where he is pursuing his M.Tech. His areas of interest are data mining, network security and cloud computing.

P.Rajasekhar is working as an Assistant Professor at Avanthi Institute of Engineering & Technology, Makavarapallem, Narsipatnam. His areas of interest are data mining, network security and cloud computing.