International Journal of Engineering Trends and Technology (IJETT) – Volume 6, Number 7, Dec 2013

A Novel Process to Cluster Data in Distributed Resources

Chiranjeevi Jami*, Chanti Suragala#
*Final M.Tech Student, #Assistant Professor
Dept. of CSE, SISTAM College, Srikakulam, Andhra Pradesh

Abstract: Searching is one of the most frequent tasks for gathering or browsing information on the web, and users search across many different resources. Because of this scale, the search process takes a long time. Grouping similar data together can reduce the processing time, and clustering makes such grouping possible. We therefore introduce a novel approach for grouping data held in distributed resources, so that similar data can be grouped in less time and, ultimately, searching time is reduced.

I. INTRODUCTION

Information retrieval is the main task in data exchange and searching. For it to work well, data has to be concise and classified, and for fast searching similar data has to be grouped into fine-grained clusters. Clustering is well suited to browsing and searching, and it is classified into two types: keyword clustering and document clustering. Document clustering is the process of grouping text documents: keywords are extracted from each document and the similarities between documents and clusters are computed. For better results, the text documents can be pre-processed by eliminating unnecessary keywords.

Document clustering is generally performed on a centralized system only, but grouping and the associated calculations become a problem when they all run on a single centralized server. Some researchers first applied the clustering process to web pages and then tested it on text documents; it is mainly used in systems and organizations that maintain large amounts of data. Clustering generally works with tokens, that is, the keywords of the documents, together with document weights. For fast processing and to reduce the complexity of the calculations, only the unique keywords present in a document are considered, excluding the grammar (stop) words of the language.
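To make this keyword-extraction step concrete, the following is a minimal sketch; the tokenization rule and the small stop-word list are illustrative assumptions, not taken from the paper.

```python
import re
from collections import Counter

# Illustrative stop-word ("grammar word") list; the paper does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "or", "to", "for"}

def extract_keywords(text):
    """Tokenize a document, drop grammar (stop) words, and count keyword frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())            # simple word tokenization
    keywords = [t for t in tokens if t not in STOP_WORDS]   # keep only content words
    return Counter(keywords)                                 # keyword -> frequency (the "gist")

if __name__ == "__main__":
    doc = "Clustering is the process of grouping the text documents in a distributed system."
    print(extract_keywords(doc))
    # -> Counter({'clustering': 1, 'process': 1, 'grouping': 1, 'text': 1, ...})
```

The resulting keyword-to-frequency map is the per-document representation that the rest of the discussion works with.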
In our work we introduce a new framework for fast clustering in distributed systems. Section II briefly reviews existing methodologies and Section III explains our proposed work.

II. RELATED WORK

Mining data over distributed networks has gained importance in recent years because of the various features involved in mining distributed information, whether by clustering, classification, association rule mining, or any other mining mechanism. Here we propose an empirical model of a distributed clustering approach for efficient document clustering in distributed networks: we cluster the documents based on document similarity and group documents that are semantically equivalent.

One approach to data partitioning takes a conceptual point of view that identifies each cluster with a certain model whose unknown parameters have to be found. Specifically, probabilistic models assume that the data comes from a mixture of several populations whose distributions and priors we want to find; the corresponding algorithms are usually described under the heading of probabilistic clustering. The advantage of probabilistic methods is the interpretability of the constructed clusters; their concise cluster representation also allows inexpensive computation of intra-cluster measures of fit that give rise to a global objective function.

Similarity calculation is the central part of our proposed work; we use cosine similarity, which is explained in Section III. In the clustering algorithm, the keywords or tokens are clustered until some stopping criterion is reached.

A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable, so that the mean value converges towards the cluster center, and the clusters are expected to be of similar size, so that assignment to the nearest cluster center is the correct assignment. When, for example, k-means with k = 3 is applied to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data. With k = 2 the two visible clusters (one containing two species) are discovered, whereas with k = 3 one of the two clusters is split into two even parts; in fact k = 2 is more appropriate for this data set, despite it containing three classes. As with any other clustering algorithm, the k-means result relies on the data set satisfying the assumptions made by the algorithm: it works well on some data sets and fails on others.

III. PROPOSED WORK

In our work we initially construct a distributed hash table over the documents; it contains the terms (keywords or tokens) and the location of each term in its document. The purpose of this distributed hash table is to refer to the summary of a document during the clustering process. We then calculate the weight of each individual text document. We also introduce two new concepts, the cluster holder and the cluster summary. The cluster holder contains everything required about the documents. The cluster summary maintains the total gist of the documents: the keywords, the keyword count in each document, which document on which node of the distributed system contains each keyword, the document weights, and the similarity between documents and clusters.

Gist of the document: the keywords and their frequencies in the document.

Document weight: the total number of terms present in the document, calculated in the form of a probability.

Cosine similarity is used as the distance measure between documents and clusters.
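To illustrate how these quantities might be organized at a single node, the following is a minimal sketch. The interpretation of the document weight as the document's share of all term occurrences on the node is our assumption, since the paper only states that it is calculated in the form of a probability; the function and field names are illustrative.

```python
from collections import Counter

def build_node_summary(documents):
    """Build a per-node summary: the gist (term frequencies) and weight of each document.

    documents: {doc_id: Counter(term -> frequency)}, e.g. as produced by a keyword
    extractor like the sketch above.
    The document weight is interpreted here as the fraction of all term occurrences on
    this node that belong to the document (an assumption; the paper only calls it a
    probability).
    """
    total_terms = sum(sum(tf.values()) for tf in documents.values())
    summary = {}
    for doc_id, tf in documents.items():
        summary[doc_id] = {
            "gist": tf,                                                   # keywords and frequencies
            "weight": sum(tf.values()) / total_terms if total_terms else 0.0,
        }
    return summary

if __name__ == "__main__":
    docs = {
        "d1": Counter({"cluster": 3, "document": 2}),
        "d2": Counter({"network": 4, "peer": 1}),
    }
    for doc_id, entry in build_node_summary(docs).items():
        print(doc_id, entry["weight"], dict(entry["gist"]))
    # d1 0.5 {'cluster': 3, 'document': 2}
    # d2 0.5 {'network': 4, 'peer': 1}
```

A summary of this shape is what a cluster holder would exchange with other nodes instead of the full documents.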
Next, the clustering itself. Consider two nodes that each hold some documents. On these documents we perform an agglomerative hierarchical clustering algorithm, as follows:

1. For each node, read the input documents.
2. Find the frequencies of the terms in each document, represented as td.
3. Find the document weight at each node, represented as Dw. Every node maintains a cluster holder, the node that contains the overall gist of the documents; each node can request the cluster holder of another node in order to cluster documents across the distributed system.
4. Construct the distributed hash table, represented as DHn.
5. Using the hash table, the term frequencies, and the document weights, perform the clustering process. Form the cluster holders in each node as P = {p1, p2, ...}, where P is the set of cluster holders in the network.
6. Maintain a cluster summary at each node in the network; this is done through the following steps.
7. Order the clusters in ascending order of document weight.
8. If any new document arrives in the distributed system, perform the comparison using the similarity between its terms and the existing documents, based on the cluster summaries and document weights, following the steps above.

Using this clustering algorithm we can cluster data in the distributed system easily, and searching performance also improves. The method is designed mainly for distributed networks, to reduce the load on any single system, and it achieved the best processing time when tested in simulation.

Experimental Analysis: The experimental results are summarized below. At each node, after the input documents are read, the documents are clustered and a cluster summary is maintained, holding the total number of keywords and their frequencies in the clusters present at that node. In our case the actual score is the cosine similarity between a document d and a cluster centroid c, defined as

Cos(d, c) = ( Σ_t tf(t, d) · tf(t, c) ) / ( |d| · |c| ),

where the sum runs over the terms t, tf(t, ·) is the term frequency, and |d| and |c| are the norms of the term-frequency vectors of the document and the centroid.

The cluster holder described above is also maintained at each node. After the documents are clustered, the system reports which cluster, at which node, a new document belongs to; for a new document, the calculation and the assignment follow the steps above.
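As an illustration of this assignment step for a new document, the following is a minimal sketch, assuming each node's cluster summary exposes a term-frequency centroid per cluster; the data layout and function names are our assumptions, not taken from the paper.

```python
import math
from collections import Counter

def cosine(d, c):
    """Cosine similarity between two term-frequency mappings (Counter objects)."""
    shared = set(d) & set(c)
    dot = sum(d[t] * c[t] for t in shared)
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    if norm_d == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_d * norm_c)

def assign_new_document(doc_tf, cluster_summaries):
    """Assign a new document to the most similar cluster across all nodes.

    cluster_summaries: {node_id: {cluster_id: Counter(term -> frequency)}}
    Returns (node_id, cluster_id, score) of the best match.
    """
    best = (None, None, -1.0)
    for node_id, clusters in cluster_summaries.items():
        for cluster_id, centroid in clusters.items():
            score = cosine(doc_tf, centroid)
            if score > best[2]:
                best = (node_id, cluster_id, score)
    return best

if __name__ == "__main__":
    summaries = {
        "node1": {"c1": Counter({"cluster": 4, "document": 3}),
                  "c2": Counter({"network": 5, "peer": 2})},
        "node2": {"c3": Counter({"search": 4, "index": 3})},
    }
    new_doc = Counter({"document": 2, "cluster": 1})
    print(assign_new_document(new_doc, summaries))   # -> ('node1', 'c1', 0.894...)
```

Only the compact cluster summaries are compared against the new document, so no node needs to ship its full document collection to any other node.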
IV. CONCLUSION

We designed a method for clustering text in distributed systems. By reducing the complexity of the calculations, our work is very useful, and it is also helpful in searching applications. It works efficiently to reduce the workload and processing on individual resources, and compared to the traditional process on a centralized system, the distributed text clustering process is faster.

BIOGRAPHIES

Chiranjeevi Jami is an M.Tech (SE) student at Sarada Institute of Science, Technology and Management, Srikakulam. He received his B.Tech (IT) from Aditya Institute of Technology and Management (AITAM), Tekkali. His areas of interest are data warehousing, Java, and Oracle databases.

Chanti Suragala is working as an Assistant Professor at Sarada Institute of Science, Technology and Management, Srikakulam, Andhra Pradesh. He received his M.Tech (CSE) from Aditya Institute of Technology and Management, Tekkali, JNTU Kakinada, Andhra Pradesh. His research areas include image processing.