International Journal of Engineering Trends and Technology (IJETT) – Volume 16 Number 7 – Oct 2014
An Efficient High Dimensional Data Clustering over
Decentralized Architecture
S. Gayatri1, S. Kalyani2
1 Final M.Tech Student, Dept. of CSE, Vignan's Institute of Engineering for Women, Andhra Pradesh
2 Assistant Professor, Information Technology, Vignan's Institute of Engineering for Women, Andhra Pradesh
Abstract: Clustering is one of the most traditional data mining techniques for knowledge extraction from mass data storages and high dimensional datasets. Over the years of research it has been observed that unpreprocessed data (articles, prepositions, conjunctions, adverbs, etc.) may lead to irrelevant clusters. In this paper we propose feature set extraction through preprocessing of the input documents; document weights are computed based on term frequency and inverse document frequency, and a mutation-based centroid is computed at each iteration, except the initial iteration, during the clustering of documents.
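As one concrete reading of the weighting scheme above, the following minimal Python sketch computes per-document term weights from term frequency and inverse document frequency after removing stop words. The tokenizer, the stop-word list, and the function names are our own illustrative assumptions, not the paper's implementation.

```python
import math

# Hypothetical stop-word list: the "unpreprocessed" features the paper
# mentions (articles, prepositions, conjunctions, etc.)
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on"}

def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_weights(documents):
    """Return one {term: tf * idf} map per document."""
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = []
    for tokens in tokenized:
        tf = {}
        for term in tokens:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
w = tf_idf_weights(docs)
```

Terms that occur in every document get weight zero (log of 1), so only discriminative features carry weight into clustering.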
I. INTRODUCTION
In recent years, personalized information services have come to play an important role in people's lives. There are two important problems worth researching in this field. One is how to obtain and describe personal user information, i.e., building the user model; the other is how to organize the information resources, i.e., document clustering. Personal information is described exactly only if user behavior and the resources users look for or search have been accurately analyzed. The effectiveness of a personalized service depends on the completeness and accuracy of the user model. The basic operation is organizing the information resources, so in this paper we focus on document clustering.
At present, millions of scientific documents are available on the Web. Indexing or searching millions of documents and retrieving the desired information has become an increasing challenge, and an opportunity, with the rapid growth of scientific documents. Clustering plays an important role in the analysis of user interests in the user model, so high-quality scientific document clustering plays an increasingly important role in real world applications such as personalized services and recommendation systems.
ISSN: 2231-5381                http://www.ijettjournal.org
Clustering is a classical method in data
mining research. Clustering algorithms group a set of
documents into subsets or clusters. The algorithms’
goal is to create clusters that are coherent internally,
but clearly different from each other. In other words,
documents within a cluster should be as similar as
possible; and documents in one cluster should be as
dissimilar as possible from documents in other
clusters.
II. RELATED WORK
Flat clustering creates a flat set of clusters
without any explicit structure that would relate
clusters to each other. Hierarchical clustering creates
a hierarchy of clusters.
A second important distinction can be made between hard and soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft: a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters.
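The hard/soft distinction above can be made concrete with a small sketch (the document names, cluster count, and membership numbers are invented for illustration):

```python
# Hard assignment: each document belongs to exactly one cluster index.
hard_assignment = {"doc1": 0, "doc2": 2, "doc3": 0}

# Soft assignment: each document has fractional membership in several
# clusters; the memberships form a distribution over all clusters.
soft_assignment = {
    "doc1": [0.7, 0.2, 0.1],
    "doc2": [0.1, 0.1, 0.8],
    "doc3": [0.5, 0.4, 0.1],
}

# A valid soft assignment is a distribution: memberships sum to 1.
for doc, dist in soft_assignment.items():
    assert abs(sum(dist) - 1.0) < 1e-9

# A hard assignment can be recovered from a soft one by taking the
# most probable cluster for each document.
hardened = {doc: dist.index(max(dist)) for doc, dist in soft_assignment.items()}
```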
The cluster hypothesis states the fundamental
assumption we make when using clustering in
information retrieval. Documents in the same cluster
behave similarly with respect to relevance to
information needs. The hypothesis states that if there
is a document from a cluster that is relevant to a
search request, then it is likely that other documents
from the same cluster are also relevant. This is
because clustering puts together documents that share
many terms. The cluster hypothesis essentially is the contiguity hypothesis.
In search result clustering, by search results we mean the documents that were returned in response to a query. The default
presentation of search results in information retrieval
is a simple list. Users scan the list from top to bottom
until they have found the information they are looking
for. Instead, search result clustering clusters the
search results, so that similar documents appear
together. It is often easier to scan a few coherent
groups than many individual documents. This is
particularly useful if a search term has different word
senses.
III. PROPOSED SYSTEM
In this paper we propose an efficient and secure data mining technique, with an optimized k-means and a cryptographic approach, for clustering similar types of information. Initially the data points, which reside at the individual data holders or players, need to be shared. In this paper we emphasize the mining approach rather than the cryptographic technique; for the secure transmission of data, various cryptographic algorithms and key exchange protocols are available.
Individual peers at the data holders initially preprocess the raw data by eliminating the unnecessary features from the datasets. After preprocessing, each peer computes the file weight of the dataset (the preprocessed feature set) in terms of term frequency and inverse document frequency, and computes the file relevance matrix to reduce the time complexity while clustering the datasets. We use the most widely used similarity measurement, i.e., cosine similarity:
Cos(dm, dn) = (dm . dn) / (||dm|| * ||dn||)
where dm is the centroid (document weight) and dn is the document weight, or file weight, from the relevance matrix.
The following example shows a simple way to retrieve the similarity between documents at the individual data holders by computing cosine similarity prior to clustering. In the table below, D (d1, d2, ..., dn) represents the set of documents at a data holder or player and their respective cosine similarities; precomputing these reduces the time complexity of computing the similarity between the centroids and the documents while clustering.
        d1      d2      d3      d4      d5
d1      1.0     0.48    0.66    0.89    0.45
d2      0.77    1.0     0.88    0.67    0.88
d3      0.45    0.9     1.0     0.67    0.34
d4      0.32    0.47    0.77    1.0     0.34
d5      0.67    0.55    0.79    0.89    1.0
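A table such as the one above can be produced by precomputing all pairwise cosine similarities once, before clustering begins, so that each later lookup is constant time. A minimal Python sketch with invented term-weight vectors (the function names are ours, not the paper's):

```python
import math

def cosine(dm, dn):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(dm, dn))
    norm_m = math.sqrt(sum(a * a for a in dm))
    norm_n = math.sqrt(sum(b * b for b in dn))
    return dot / (norm_m * norm_n)

def relevance_matrix(vectors):
    """Precompute all pairwise cosine similarities before clustering."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Invented document weight vectors for illustration.
docs = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # parallel to the first document
    [0.0, 0.0, 3.0],   # orthogonal to the first two
]
R = relevance_matrix(docs)
# During clustering, similarity lookups are now O(1): R[i][j]
```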
In our approach we enhance the K-Means algorithm with re-centroid computation instead of a single random selection at every iteration. The following shows the optimized k-means algorithm.
Algorithm:
1: Select K points as initial centroids for the initial iteration
2: Until the termination condition is met (a user-specified maximum number of iterations):
3: get_relevance(dm, dn), where dm is the document M file weight from the relevance matrix and dn is the document N file weight from the relevance matrix
4: Assign each point to its closest centroid to form K clusters
5: Recompute the centroid from the intra-cluster data points, i.e., the average of any k data points in the individual cluster, e.g. (P11 + P12 + ... + P1k) / K, where all points are from the same cluster
6: Compute the new centroid for the merged cluster
The traditional k-means algorithm randomly selects a new centroid; in our approach we enhance this by the prior construction of the relevance matrix and by taking the average of k randomly selected document weights for the new centroid calculation.
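The enhanced loop described above can be sketched as follows: the new centroid is recomputed as the average of the points currently assigned to the cluster rather than a fresh random pick. This is a simplified one-dimensional version with invented data; the function name and the fixed-iteration termination condition are our own assumptions:

```python
import random

def optimized_kmeans(weights, k, max_iterations=10, seed=0):
    """Cluster scalar document weights; recompute each centroid as the
    average of its cluster's points instead of a new random selection."""
    rng = random.Random(seed)
    centroids = rng.sample(weights, k)          # step 1: initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):             # step 2: termination condition
        clusters = [[] for _ in range(k)]
        for w in weights:                       # steps 3-4: assign to closest
            closest = min(range(k), key=lambda i: abs(w - centroids[i]))
            clusters[closest].append(w)
        for i in range(k):                      # step 5: centroid = average
            if clusters[i]:
                centroids[i] = sum(clusters[i]) / len(clusters[i])
    return centroids, clusters

weights = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9]
centroids, clusters = optimized_kmeans(weights, k=2)
```

Averaging makes each centroid a summary of its whole cluster, which is what makes the recomputation stable compared with picking a single random member.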
IV. CONCLUSION
We conclude our current research work with efficient privacy-preserving data clustering over distributed networks. The quality of the clustering mechanism is enhanced with preprocessing, the relevance matrix, and the centroid computation in the k-means algorithm, while the cryptographic technique handles the secure transmission of data between data holders or players and reduces the privacy-preserving cost by forwarding the relevant features of the dataset instead of the raw datasets. Security can be further enhanced by establishing an efficient key exchange protocol and cryptographic techniques for the transmission of data between data holders or players.
REFERENCES
[1] F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and H. (Wendy) Wang, "Privacy-preserving mining of association rules from outsourced transaction databases."
[2] P. K. Fong and J. H. Weber-Jahnke, "Privacy preserving decision tree learning using unrealized data sets," IEEE Computer Society.
[3] T. Tassa and D. J. Cohen, "Anonymization of centralized and distributed social networks by sequential clustering."
[4] S. Jha, L. Kruger, and P. McDaniel, "Privacy preserving clustering."
[5] C. Clifton, M. Kantarcioglu, X. Lin, and M. Y. Zhu, "Tools for privacy preserving distributed data mining."
[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed K-Means clustering over a peer-to-peer network," IEEE TKDE, vol. 21, no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents by distributed P2P clustering," in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P overlays," in ICPP, 2003.
[10] S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving clustering by data transformation."
BIOGRAPHIES
S. Gayatri completed her B.Tech at Gayatri Engg. for Women. She is pursuing her M.Tech at Vignan Engg. College for Women, AP, India. Her areas of interest are data mining and network security.
S. Kalyani is working as an Assistant Professor in Information Technology, Vignan's Institute of Engineering for Women, Andhra Pradesh. Her areas of interest are data mining and network security.