International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 1 – Dec 2013

A Fast and Simple Text Clustering in Distributed Systems

1 Birlangi Usha Rani, 2 U. D. Prasanna
1 M.Tech Scholar, 2 Associate Professor
Department of Computer Science and Engineering, Aditya Institute of Technology and Management, Tekkali, Andhra Pradesh

Abstract: Exchanging data over a network has become an everyday task. Exchanging data with a single source is not difficult, but gathering data from multiple sources is, so clustering is needed to group large collections of documents. Traditional clustering operates on plain documents only; clustering plain documents while they are gathered from the network is not secure, because the data is exposed in transit. We therefore introduce a method that protects the exchanged documents and clustering information with a cryptographic algorithm.

I. INTRODUCTION

Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. In other words, the goal of a good document clustering scheme is to minimize intra-cluster distances between documents while maximizing inter-cluster distances (using an appropriate distance measure between documents) [1][2]. A distance measure (or, dually, a similarity measure) thus lies at the heart of document clustering. Clustering is the most common form of unsupervised learning, and this is the major difference between clustering and classification [3]. No supervision means that there is no human expert who has assigned documents to classes; it is the distribution and makeup of the data that determine cluster membership. Clustering is sometimes erroneously referred to as automatic classification, but this is inaccurate, since the clusters found are not known prior to processing, whereas in classification the classes are predefined. In classification, a classifier learns the association between objects and classes from a so-called training set, i.e. a set of data correctly labelled by hand, and then replicates the learnt behaviour on unlabelled data.

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially it was investigated for improving precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbours of a document. More recently, clustering has been proposed for browsing a collection of documents or for organizing the results returned by a search engine in response to a user's query. Document clustering has also been used to automatically generate hierarchical clusters of documents [4][5][6].

We were developing an application that recommends news articles to the readers of a news portal. The following challenges motivated us to cluster the news articles:
1. The number of available articles was large.
2. A large number of articles were added each day.
3. Articles covering the same news item were added from different sources.
4. The recommendations had to be generated and updated in real time [7][8].
By clustering the articles we could reduce the search domain for recommendations, since most users were interested in the news belonging to only a few clusters. This greatly improved time efficiency, and it also allowed us to identify articles covering the same news item from different sources. The main motivation of this work has been to investigate how the effectiveness of document clustering can be improved, by identifying the main reasons why existing algorithms are ineffective and addressing them [9]. Initially we applied the K-Means and agglomerative hierarchical clustering methods to the data and found that the results were not very satisfactory; the main reason was noise in the graph built from the data. We therefore pre-processed the graph to remove the extra edges: a heuristic removes the inter-cluster edges, after which the standard graph clustering methods give much better results [10][11][12]. We also tried a completely different approach: first clustering the words of the documents with a standard clustering method, thereby reducing the noise, and then using these word clusters to cluster the documents. We found that this also gave better results than the classical K-Means and agglomerative hierarchical clustering methods.

We first study the effectiveness of pre-processing the data representation before applying the classical clustering methods. We then examine the effectiveness of a new clustering algorithm in which the noise is reduced by first clustering the features of the data and then clustering the data on the basis of their features' clusters.

Clustering is the most common form of unsupervised learning and is a major tool in many applications across business and science. Below we summarize the main ways in which clustering is used.

• Finding Similar Documents. This feature is often used when the user has spotted one "good" document in a search result and wants more like it. The interesting property here is that clustering is able to discover documents that are conceptually alike, in contrast to search-based approaches, which can only discover whether documents share many of the same words.

• Organizing Large Document Collections. Document retrieval focuses on finding documents relevant to a particular query, but it does not solve the problem of making sense of a large number of uncategorized documents. The challenge here is to organize these documents in a taxonomy similar to the one humans would create given enough time, and to use it as a browsing interface to the original collection of documents [13].

• Duplicate Content Detection. In many applications there is a need to find duplicates or near-duplicates in a large number of documents. Clustering is employed for plagiarism detection, grouping of related news stories and reordering of search result rankings (to assure higher diversity among the topmost documents). Note that in such applications the description of the clusters is rarely needed [14].

• Recommendation System. In this application a user is recommended articles based on the articles the user has already read. Clustering the articles makes this possible in real time and considerably improves the quality [15].
• Search Optimization. Clustering greatly improves the quality and efficiency of search engines, since a user query can first be compared to the clusters instead of directly to the documents, and the search results can also be arranged more easily.

The goal of a document clustering scheme is to minimize intra-cluster distances between documents while maximizing inter-cluster distances (using an appropriate distance measure between documents). A distance measure (or, dually, a similarity measure) thus lies at the heart of document clustering. The large variety of documents makes it almost impossible to create a general algorithm that works best on all kinds of datasets [16][17]. Document clustering has been studied for many decades, but it is still far from a trivial, solved problem. The challenges are:
1. Selecting appropriate features of the documents that should be used for clustering.
2. Selecting an appropriate similarity measure between documents.
3. Selecting an appropriate clustering method utilising the above similarity measure.
4. Implementing the clustering algorithm in an efficient way that makes it feasible in terms of required memory and CPU resources.
5. Finding ways of assessing the quality of the performed clustering.

Furthermore, with medium to large document collections (10,000+ documents), the number of term–document relations is fairly high (millions+), and the computational complexity of the algorithm applied is thus a central factor in whether it is feasible for real-life applications. If a dense matrix is constructed to represent the term–document relations, this matrix can easily become too large to keep in memory: for example, 100,000 documents × 100,000 terms = 10^10 entries ≈ 40 GB using 32-bit floating-point values. If the vector model is applied, the dimensionality of the resulting vector space will likewise be quite high (10,000+). This means that even simple operations, such as finding the Euclidean distance between two documents in the vector space, become time-consuming tasks.

II. RELATED WORK

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (say k clusters) fixed a priori. The main idea is to define k centres, one for each cluster. These centres should be placed carefully, because different locations lead to different results; the better choice is to place them as far away from each other as possible. The next step is to take each point of the data set and associate it with the nearest centre. When no point is pending, the first step is complete and an early grouping is done. At this point we re-calculate k new centroids as the barycentres of the clusters resulting from the previous step. With these k new centroids, a new binding is made between the same data set points and the nearest new centre. A loop has thus been generated; as a result of this loop, the k centres change their location step by step until no further changes occur, in other words until the centres no longer move. Finally, the algorithm aims at minimizing an objective function known as the squared-error function, given by

J(V) = Σ_{i=1}^{c} Σ_{j=1}^{c_i} ‖x_j − v_i‖²

where ‖x_j − v_i‖ is the Euclidean distance between data point x_j and cluster centre v_i, c_i is the number of data points in the i-th cluster, and c is the number of cluster centres.
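As a concrete illustration, a minimal sketch of this loop is given below, assuming the documents have already been mapped to numeric feature vectors. The function name, the NumPy-based implementation and the toy data are our own illustrative choices, not part of the work being summarised.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Standard k-means: assign points to the nearest centre, then recompute centres."""
    rng = np.random.default_rng(seed)
    # Place the k initial centres on randomly chosen data points.
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: associate every point with its nearest centre.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centre as the barycentre of its cluster.
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Stop when the centres no longer move, i.e. the objective J(V) has converged.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Tiny usage example with two obvious groups of points.
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
labels, centres = kmeans(data, k=2)
```

The loop terminates at a local minimum of the squared-error objective J(V) given above; different initial centre placements can therefore lead to different clusterings.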
Distributed hash tables are a way to organize the storage and network resources of many Internet hosts to create a single storage system. The promise of DHTs lies in their potential to create a distributed storage infrastructure that, in aggregate, is more robust and offers higher performance than any individual host. Because DHTs are likely to be deployed on potentially unreliable volunteer nodes spread around the globe, meeting this goal is challenging: nodes in the system may join or leave at any time, and latencies between nodes can be large. In a distributed implementation, known as a distributed hash table, or DHT, the hash table is distributed among a set of nodes, and all nodes use the same hash function. Looking up a key yields the ID of the node that holds the data. The entire goal of a DHT is to allow anyone to find the node that corresponds to a given key; that node is responsible for holding the information associated with the key. A key difference between the DHT approach and the centralized or flooding approaches is that a specific node is responsible for holding the information relating to the key (even if it just stores a link to the content) [18]–[20].

To get a clear understanding of distributed hash tables, it helps to recall the concept of a hash table. Basically, a hash table is an array used to store a set of items. Every item V is mapped to a hash value h(V) and then stored in slot h(V) of the array. The hash function is a function

h : U → {0, 1, …, m − 1}

that maps each possible item in U to a position in the hash table, where the parameter m is the size of the hash table. This technique cannot be applied directly to store data in peers: the number of active peers changes constantly, which would require continuously adjusting the table's indexing, and a new allocation of data to peers would be needed with each peer departure, arrival or failure, which is very inefficient. These difficulties and performance constraints related to the direct use of hash tables in peer-to-peer networks were the incentive to develop the concept of the Distributed Hash Table (DHT), which progressively became a standard method in peer-to-peer networks. This structure is based on the following main concepts:

1. Mapping data values to keys: a data value V is mapped to a key K using a hash function, h(V) = K. The hash function needs to meet a quite demanding set of properties. First, it should be easy to compute, in order to ensure high efficiency of the mapping process. In addition, the hash function should be one-way, i.e. hard to invert, so that for any K it is computationally infeasible to find a V such that h(V) = K. Another property is that h should be collision-free, i.e. for any V it should be infeasible to find another V′ such that h(V) = h(V′). These properties are hard to satisfy simultaneously, since they can be contradictory: for example, to obtain a function that is hard to invert, the difficulty of computing the value of the function necessarily increases. This makes designing such functions a very challenging task.

2. Dynamic partitioning of the key set among nodes: the interval of keys is divided into different parts and each part is associated with an active peer in the network.
This partitioning is dynamic and can be efficiently adjusted to any change in the set of participants. When a node newly joins the network, an active node is contacted and half of its key subset is handed over to the new node; the routing structure must then be updated: the nodes neighbouring the contacted node are informed about the new node and their routing information is updated accordingly. If a node leaves the network, its key subset is allocated to its neighbours and the stored data is moved to the newly responsible nodes. The key-set partitioning can also be adapted to node failures: the corresponding key subset is allocated to other active nodes, but the stored data cannot be recovered. Until the key partitioning has been updated, the network can continue to function by using redundant routing paths and nodes.

3. Data storage: once the key K is calculated, the data can be stored at the location associated with the obtained key. There are two ways of storing the data: directly, where data values are stored by the node responsible for their associated keys, or indirectly, where pointers are stored to where the data values are actually kept.

4. Data lookup: any node in the network can retrieve any stored data. The requesting node contacts a random active node. If the data is stored under a key in the subset associated with the contacted node, there is no need to route the request through the network structure; otherwise the request is forwarded until it reaches the node responsible for storing the requested data. Several routing algorithms have been developed in this context, and a particular one can be adopted based on the desired features of the network (minimum latency, high security, and so on).

III. PROPOSED WORK

Algorithm:
Initialize the peers {p1, p2, …, pn} and take a peer pi.
Collect the documents D = {d1, d2, d3, …, dn} and all terms in the documents T = {t1, t2, …, tn}.
For d = 1 to n: construct the distributed hash table for document d.
Calculate the document weight
W(t, d) = TF(t, d) × log2(n / df(t)),
where TF(t, d) is the number of occurrences of term t in document d and df(t) is the frequency of the term in the collection of documents.
Calculate the similarity between the documents as well as between the clusters.
Generate the cluster summary (keyword, frequency): CS = {TF(t, d), Ft}.
Form the cluster holders, named P = {p1, p2, …, pn}.
Encrypt the cluster summaries {ep1, ep2, …, epn}: for p = 1 to pn, Encrypt(CS).
If d is a new document: calculate its document weight as shown above; calculate the frequency of its keywords TF(t, d); calculate the similarity between each cluster and the new document using the cluster summary CS; and assign the new document to a cluster.

(Figure: peers P1–P5 organized in a DHT; each peer maintains a cluster holder and a cluster summary, and the cluster holder and cluster summary data are exchanged in encrypted form and decrypted at the receiving peer.)

The methods we used are explained in detail below. For encrypting the cluster summary we adopted the Advanced Encryption Standard (AES) algorithm. AES is based on a design principle known as a substitution–permutation network and is fast in both software and hardware. Unlike its predecessor DES, AES does not use a Feistel network. AES is a variant of Rijndael with a fixed block size of 128 bits and a key size of 128, 192, or 256 bits. By contrast, the Rijndael specification per se allows block and key sizes that may be any multiple of 32 bits, both with a minimum of 128 and a maximum of 256 bits [20].
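To make the steps above concrete, the following sketch shows one way they could be realised: per-document term tables (standing in for the per-document DHT entries), TF–IDF document weights, a keyword/frequency cluster summary, cosine similarity against that summary, and encryption of the summary before it is exchanged. The helper names, the whitespace tokenisation, and the use of Python's cryptography library (whose Fernet construction encrypts with AES internally) are our own illustrative assumptions, not part of the paper's specification.

```python
import json
import math
from collections import Counter

from cryptography.fernet import Fernet  # Fernet encrypts with AES-128 internally


def term_table(doc):
    """Count term occurrences in one document (stand-in for its DHT term/key entries)."""
    return Counter(doc.lower().split())


def tfidf_weight(table, df, n_docs):
    """Document weight W(t, d) = TF(t, d) * log2(n / df(t)) for every term in the document."""
    return {t: tf * math.log2(n_docs / df[t]) for t, tf in table.items()}


def cosine(wa, wb):
    """Cosine similarity between two sparse weight vectors (dicts mapping term -> weight)."""
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    na = math.sqrt(sum(w * w for w in wa.values()))
    nb = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_summary(member_tables, top_k=10):
    """Cluster summary CS: the most frequent (keyword, frequency) pairs over cluster members."""
    total = Counter()
    for table in member_tables:
        total.update(table)
    return dict(total.most_common(top_k))


def encrypt_summary(summary, key):
    """Encrypt the cluster summary before it is published to the cluster holder."""
    return Fernet(key).encrypt(json.dumps(summary).encode())


# Example: two documents held by one peer.
docs = ["distributed text clustering with peers",
        "text clustering using encrypted cluster summaries"]
tables = [term_table(d) for d in docs]

# Document frequency of each term across the collection.
df = Counter()
for table in tables:
    df.update(table.keys())

weights = [tfidf_weight(t, df, len(docs)) for t in tables]
summary = cluster_summary(tables)

key = Fernet.generate_key()
token = encrypt_summary(summary, key)   # exchanged between peers in encrypted form

# A new document is weighted the same way and compared against the cluster summary,
# whose raw keyword frequencies are used here as a simple centroid vector.
new_table = term_table("clustering a new text document")
df.update(new_table.keys())
new_weights = tfidf_weight(new_table, df, len(docs) + 1)
similarity = cosine(new_weights, {t: float(f) for t, f in summary.items()})
print(f"similarity of new document to cluster: {similarity:.3f}")
```

In a deployment the encrypted summary would be decrypted at the receiving peer before the similarity step; the sketch keeps everything on one peer purely for brevity.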
For a given number of documents we construct a distributed hash table; for every document it contains the document's tokens (terms) and keys, and these tables are referenced in the subsequent clustering process.

Second, we need a similarity measure between the nodes in the network, for which we use cosine similarity; in this calculation we consider only the shared properties between the edges. The reason for choosing the cosine similarity measure is explained below. Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of −1, independent of their magnitude. Cosine similarity is particularly useful in positive space, where the outcome is neatly bounded in [0, 1]. Note that these bounds apply for any number of dimensions, and cosine similarity is most commonly used in high-dimensional positive spaces. In information retrieval and text mining, each term is notionally assigned a different dimension and a document is characterized by a vector in which the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. The technique is also used to measure cohesion within clusters in the field of data mining. Cosine distance is a term often used for the complement in positive space, that is, D_C(A, B) = 1 − S_C(A, B). It is important to note that this is not a proper distance metric, as it does not have the triangle inequality property and it violates the coincidence axiom; to repair the triangle inequality while maintaining the same ordering, it is necessary to convert to the angular distance. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, since only the non-zero dimensions need to be considered.

In our case the actual score is the cosine similarity between a document d and a cluster centroid c, defined as

cos(d, c) = ( Σ_t d_t · c_t ) / ( ‖d‖ · ‖c‖ ),

where d_t and c_t are the weights of term t in the document and the centroid, and ‖d‖ and ‖c‖ are their Euclidean norms. Note that for a new document the similarity is computed between the new document and the cluster centroids, and the cluster summary generated above is used in this similarity calculation.

Experimental Analysis: We tested on two systems, considering 2 peers, with the documents distributed across the two peers. The graph below shows the efficiency of our work: we compare the performance complexity, security, and efficiency of the existing and proposed methods. (Figure: comparison of the proposed work against the traditional approach.) In this scheme every node maintains the cluster summary, and the cluster holder described above is also maintained at every node; for a new document, the analysis described above is performed.

IV. CONCLUSION

In our proposed work we designed a method for clustering text in distributed systems. The algorithm incorporates a cryptographic algorithm that encrypts the documents and thereby secures them. It is also very helpful in real-time applications, works efficiently by reducing resource usage and processing, and makes text clustering in distributed systems faster than the traditional clustering process.

REFERENCES
[1] Y. Ioannidis, D. Maier, S. Abiteboul, P. Buneman, S. Davidson, E. Fox, A. Halevy, C. Knoblock, F. Rabitti, H. Schek, and G. Weikum, "Digital library information-technology infrastructures," Int. J. Digit. Libr., vol. 5, no. 4, pp. 266–274, 2005.
[2] P. Cudré-Mauroux, S. Agarwal, and K. Aberer, "GridVine: An infrastructure for peer information management," IEEE Internet Computing, vol. 11, no. 5, 2007.
[3] J. Lu and J. Callan, "Content-based retrieval in hybrid peer-to-peer networks," in CIKM, 2003.
[4] J. Xu and W. B. Croft, "Cluster-based language models for distributed retrieval," in SIGIR, 1999.
[5] O. Papapetrou, W. Siberski, and W. Nejdl, "PCIR: Combining DHTs and peer clusters for efficient full-text P2P indexing," Computer Networks, vol. 54, no. 12, pp. 2019–2040, 2010.
[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed K-Means clustering over a peer-to-peer network," IEEE TKDE, vol. 21, no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents by distributed P2P clustering," in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P overlays," in ICPP, 2003.
[10] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications," in SIGCOMM, 2001.
[11] K. Aberer, P. Cudré-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, and R. Schmidt, "P-Grid: A self-organizing structured P2P system," SIGMOD Record, vol. 32, no. 3, pp. 29–33, 2003.
[12] A. I. T. Rowstron and P. Druschel, "Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems," in IFIP/ACM Middleware, Germany, 2001.
[13] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[14] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[15] G. Forman and B. Zhang, "Distributed data clustering can be efficient and exact," SIGKDD Explor. Newsl., vol. 2, no. 2, pp. 34–38, 2000.
[16] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta, "Distributed data mining in peer-to-peer networks," IEEE Internet Computing, vol. 10, no. 4, pp. 18–26, 2006.
[17] S. Datta, C. Giannella, and H. Kargupta, "K-Means clustering over a large, dynamic network," in SDM, 2006.
[18] G. Koloniari and E. Pitoura, "A recall-based cluster formation game in P2P systems," PVLDB, vol. 2, no. 1, pp. 455–466, 2009.
[19] K. M. Hammouda and M. S. Kamel, "Distributed collaborative web document clustering using cluster keyphrase summaries," Information Fusion, vol. 9, no. 4, pp. 465–480, 2008.
[20] http://www.facweb.iitkgp.ernet.in/~sourav/AES.pdf
[21] ES-MPICH2: A Message Passing Interface with Enhanced Security.