International Journal of Engineering Trends and Technology (IJETT) – Volume 17 Number 5 – Nov 2014

A Novel and Secure Mining of Data in Distributed Architecture

M L V A Priya1, S Venkata Suryanarayana2, Dr. K S N Prasad3
1 M.Tech Student, 2 Assistant Professor, 3 Associate Professor
Dept of CSE, GVIT, Bhimavaram, A.P, India.

Abstract: Data confidentiality in distributed data mining remains an important and interesting research issue in the field of knowledge and data engineering. In community-based clustering approaches, privacy is a basic requirement when datasets are integrated from different data holders (players) for mining, and secure mining of data is required in open networks. In this paper we propose an efficient privacy-preserving data clustering technique for distributed networks with a decentralized architecture.

I. INTRODUCTION

In distributed networks or open environments, nodes, organized in either a centralized or a decentralized architecture, communicate with each other openly for data transmission, and there is active research on secure mining of such data. Various researchers have worked on privacy-preserving techniques for data mining, whether in classification, association rule mining, or clustering. Randomization and perturbation approaches are available for privacy preservation, and privacy can be maintained in two ways: a cryptographic approach, in which real datasets are converted to unrealized datasets by encoding them, and imputation methods, in which fake values are inserted among the real data and extracted during mining according to predefined rules [1][2]. Clustering is the process of grouping similar objects based on distance (for numerical data) or similarity (for categorical data) between data objects. In a distributed environment, data holders or players maintain individual datasets, and every node or vertex is connected to the others by edges, along with their quasi-identifiers [3].
The graphical representation of the nodes, together with attributes drawn from demographic information such as age, mobile number, address, and profile, enriches the structure of the network. Researchers from many disciplines, including market research, sociology, psychology, and epidemiology, are interested in social networks; due to the sensitivity of the data in a social network, little of it is released openly. There is therefore a need to anonymize the data, so that the sensitive information of a particular user is protected and the user's privacy is secured before the anonymized data is released.

ISSN: 2231-5381

Identifying and removing attributes such as names or social security numbers from the released data is insufficient, because information about individuals can still be obtained from the structure of the released graph. Finally, prior work describes the data accompanying the nodes in a social network, suggests a unique anonymization technique, and categorizes the data by clustering on names. That algorithm uses the graph to quantify the information loss caused by anonymization and to assess how well the privacy of the data is preserved across the different users of the network.

II. RELATED WORK

In a social network, nodes can be represented as vertices V (v1, v2, …, vn) connected through a set of edges E in an undirected graph G(V, E); a non-identifying attribute used to describe a node is known as a quasi-identifier. Clustering can be performed on quasi-identifiers such as age and gender, and distributed clustering groups similar objects based on the minimum distance between nodes. Since privacy preservation is a problem in social networks, we follow the distributed setting, in which the data is split among several users over the network.
The main aim is to protect the information about a user's links over the network, without revealing to other users anything beyond the anonymized view of the data, using a unified method to provide privacy. A centralized setting implements an anonymization algorithm based on sequential clustering, denoted Sq. This algorithm performs more efficiently than the SaNGreeA algorithm of Campan and Truta, which is also based on clustering, in achieving anonymity over the network. To the best of our knowledge regarding privacy, this is one of the best ways to provide privacy preservation in a distributed social network.

A set of entities with relations between them is known as a network, and a network that is open to all of its users is known as a social network. Consider some population in which information is provided for each individual, and relations such as friendship, correspondence, collaboration, and so on are created among them. Similarly, consider an information network that describes scientific journals and the citation links among them. These are denoted by a graphical notation in which nodes correspond to entities and edges represent relations between users. In a real network the data is much more complicated and contains additional information, such as asymmetric interactions over financial transactions in which more than one user is involved; in a social network, co-membership can be modeled as a hypergraph.

http://www.ijettjournal.org

Fig 1: Proposed architecture (data holder nodes Node 1, Node 2, …, Node i, …, Node n exchange requests ("Req") and pre-processed documents with a clustering component; a new document arrives at one of the nodes)
III. PROPOSED SYSTEM

In this paper we propose an efficient and secure data mining technique that combines an optimized k-means algorithm with a cryptographic approach to cluster similar information; initially, the data points held at the individual data holders or players need to share information. In the proposed architecture, every data holder or player preprocesses and clusters its documents itself, and computes the local and global frequencies of the documents in order to calculate the file weights (document weights). The distributed k-means algorithm is one of the efficient distributed clustering algorithms, and our work aims at optimized clustering in distributed networks by improving this traditional clustering algorithm. In our approach, when a new dataset is placed at a data holder, that holder requests the other data holders to forward only the relevant features, instead of entire datasets, in order to cluster the documents. The individual peers at the data holders first preprocess the raw data by eliminating unnecessary features from the datasets; after preprocessing, each peer computes the file weight of the preprocessed feature set in terms of term frequency and inverse document frequency, and computes a file relevance matrix to reduce the time complexity of clustering. We use the most widely used similarity measure, cosine similarity.

In this paper we emphasize the mining approach rather than the cryptographic technique; for secure transmission of data, various cryptographic algorithms and key exchange protocols are available.
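The file-weight computation described above (term frequency times inverse document frequency per document) can be sketched as follows. This is a minimal illustration of plain TF-IDF weighting; the function and variable names are ours, not the paper's.

```python
import math
from collections import Counter

def file_weights(documents):
    """Compute a TF-IDF weight vector for each document.

    A sketch of the 'file weight' described above: term frequency
    (local frequency) times inverse document frequency (derived from
    the global frequency). Names are illustrative.
    """
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (tf[t] / len(tokens)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = ["secure mining of data",
        "distributed mining of data",
        "privacy clustering"]
w = file_weights(docs)
# A rare term ('secure', df = 1) outweighs a common one ('data', df = 2).
assert w[0]["secure"] > w[0]["data"]
```

Rare terms get larger weights, so the preprocessing step that drops unnecessary features can simply discard terms whose weight falls below a threshold.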
The preceding diagram shows the architecture of the proposed work. The cosine similarity between a centroid dm (document weight) and a document dn (document weight or file weight) is computed as

cos(dm, dn) = (dm . dn) / (||dm|| x ||dn||)

The following example shows a simple way to retrieve the similarity between documents at an individual data holder by computing cosine similarity prior to clustering:

      d1    d2    d3    d4    d5
d1   1.00  0.48  0.66  0.89  0.45
d2   0.77  1.00  0.88  0.67  0.88
d3   0.45  0.90  1.00  0.67  0.34
d4   0.32  0.47  0.77  1.00  0.34
d5   0.67  0.55  0.79  0.89  1.00

Fig 2: Similarity Matrix

In the above table, D = (d1, d2, …, dn) represents the set of documents at a data holder or player, together with their respective cosine similarities; precomputing this matrix reduces the time complexity of computing the similarity between centroids and documents during clustering. In our approach we enhance the k-means algorithm with re-centroid computation instead of a single random selection at every iteration. The optimized k-means algorithm is as follows:

Algorithm:
1: Select K points as initial centroids for the initial iteration.
2: Repeat until the termination condition is met (a user-specified maximum number of iterations):
3:   get_relevance(dm, dn), where dm is the file weight of document M and dn is the file weight of document N, both taken from the relevance matrix.
4:   Assign each point to its closest centroid to form K clusters.
5:   Recompute each centroid from its intra-cluster data points, i.e. the average of k data points in the individual cluster: (P11 + P12 + … + P1k) / k, all points drawn from the same cluster.
6: Compute the new centroid for any merged cluster.

In the traditional k-means algorithm a new centroid is selected randomly; we enhance this by the prior construction of the relevance matrix and by averaging k randomly selected document weights for the new centroid calculation.
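The cosine formula, the relevance matrix, and the algorithm steps above can be sketched together as follows. This is a minimal sketch assuming documents are already reduced to numeric weight vectors; the function names (`cosine`, `relevance_matrix`, `optimized_kmeans`) are ours, not the paper's.

```python
import math
import random

def cosine(dm, dn):
    """cos(dm, dn) = (dm . dn) / (||dm|| * ||dn||)."""
    dot = sum(a * b for a, b in zip(dm, dn))
    norm = math.sqrt(sum(a * a for a in dm)) * math.sqrt(sum(b * b for b in dn))
    return dot / norm if norm else 0.0

def relevance_matrix(docs):
    """Precompute pairwise cosine similarities (the get_relevance lookup
    in step 3 of the algorithm)."""
    return [[cosine(a, b) for b in docs] for a in docs]

def optimized_kmeans(docs, k, max_iters=10, seed=0):
    """Steps 1-6 above: assign each weight vector to its most similar
    centroid, then recompute each centroid as the average of its
    intra-cluster points instead of a fresh random selection."""
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)          # step 1: initial centroids
    for _ in range(max_iters):               # step 2: termination condition
        clusters = [[] for _ in range(k)]
        for d in docs:                       # step 4: closest centroid
            best = max(range(k), key=lambda i: cosine(d, centroids[i]))
            clusters[best].append(d)
        # Step 5: recompute centroids as (P11 + ... + P1k) / k
        # over the points of the same cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(vals) / len(members)
                                for vals in zip(*members)]
    return clusters, centroids

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters, _ = optimized_kmeans(docs, 2)
```

In this toy run the two near-parallel pairs of vectors end up in the same cluster regardless of which pair of points is sampled as the initial centroids.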
Triple DES:
Triple DES is the common name for the Triple Data Encryption Algorithm (TDEA) block cipher, so named because it applies the Data Encryption Standard (DES) cipher algorithm three times to each data block. Triple DES provides a relatively simple method of increasing the key size of DES to protect against brute-force attacks, without requiring a completely new block cipher algorithm. The standards define three keying options:

Keying option 1: all three keys are independent.
Keying option 2: K1 and K2 are independent, and K3 = K1.
Keying option 3: all three keys are identical, i.e. K1 = K2 = K3.

Keying option 1 is the strongest, with 3 x 56 = 168 independent key bits. Keying option 2 provides less security, with 2 x 56 = 112 key bits; this option is stronger than simply DES-encrypting twice, e.g. with K1 and K2, because it protects against meet-in-the-middle attacks. Keying option 3 is no better than DES, with only 56 key bits; it provides backward compatibility with DES, because the first and second DES operations simply cancel out. It is no longer recommended by the National Institute of Standards and Technology (NIST) and is not supported by ISO/IEC 18033-3. In general, Triple DES with three independent keys (keying option 1) has a key length of 168 bits (three 56-bit DES keys), but due to the meet-in-the-middle attack the effective security it provides is only 112 bits. Keying option 2 reduces the key size to 112 bits; however, it is susceptible to certain chosen-plaintext and known-plaintext attacks and is thus designated by NIST to have only 80 bits of security.

IV. CONCLUSION

We conclude our current research work with an efficient privacy-preserving data clustering scheme over distributed networks.
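The three keying options above can be illustrated by how the 24-byte K1||K2||K3 key bundle is assembled. This is a sketch of the key layout only, not of the cipher itself; the function name and parameters are illustrative, not from the paper or any standard API.

```python
def tdea_key_bundle(k1: bytes, k2: bytes = None, k3: bytes = None) -> bytes:
    """Build the 24-byte K1||K2||K3 bundle used by Triple DES (a sketch).

    Keying option 1: three independent 8-byte keys (3 x 56 = 168 key bits).
    Keying option 2: k3 omitted, so K3 = K1 (2 x 56 = 112 key bits).
    Keying option 3: only k1 given, so K1 = K2 = K3 (equivalent to single DES).
    """
    if k2 is None:
        k2 = k1              # option 3: all three keys identical
    if k3 is None:
        k3 = k1              # option 2: K3 = K1
    for k in (k1, k2, k3):
        # Each DES key is 8 bytes: 56 key bits plus 8 parity bits.
        assert len(k) == 8, "each DES key must be 8 bytes"
    return k1 + k2 + k3

# Option 2: 16 bytes of independent key material, K3 reuses K1.
bundle = tdea_key_bundle(b"\x01" * 8, b"\x02" * 8)
assert bundle[16:] == bundle[:8]
```

A real deployment would hand this bundle to an established library cipher (e.g. a TDEA implementation) rather than implementing DES by hand; the sketch only makes the relationship between the three options concrete.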
The quality of the clustering mechanism is enhanced through preprocessing, the relevance matrix, and the centroid computation in the k-means algorithm; the cryptographic technique secures the transmission of data between data holders or players and reduces the privacy-preservation cost by forwarding only the relevant features of a dataset instead of the raw datasets. Security can be further enhanced by establishing an efficient key exchange protocol and cryptographic techniques for data transmission between data holders or players.

REFERENCES
[1] F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and H. (W.) Wang, "Privacy-Preserving Mining of Association Rules From Outsourced Transaction Databases."
[2] P. K. Fong and J. H. Weber-Jahnke, "Privacy Preserving Decision Tree Learning Using Unrealized Data Sets," IEEE Computer Society.
[3] T. Tassa and D. J. Cohen, "Anonymization of Centralized and Distributed Social Networks by Sequential Clustering."
[4] S. Jha, L. Kruger, and P. McDaniel, "Privacy Preserving Clustering."
[5] C. Clifton, M. Kantarcioglu, X. Lin, and M. Y. Zhu, "Tools for Privacy Preserving Distributed Data Mining."
[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed K-Means clustering over a peer-to-peer network," IEEE TKDE, vol. 21, no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents by distributed P2P clustering," in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P overlays," in ICPP, 2003.
[10] S. R. M. Oliveira and O. R. Zaïane, "Privacy Preserving Clustering by Data Transformation."

BIOGRAPHIES
Ms. M L V A Priya is pursuing her M.Tech in the CSE Department at GVIT, Bhimavaram, A.P, India. Her areas of interest are Data Mining and Network Security.
Mr. S Venkata Suryanarayana is working as Assistant Professor in the CSE Department at GVIT, Bhimavaram, A.P, India. His areas of interest are Data Mining and Network Security.
Dr. K S N Prasad is working as Associate Professor in the CSE Department at GVIT, Bhimavaram, A.P, India. His areas of interest are Data Mining and Network Security.