International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 9 – Sep 2013
ISSN: 2231-5381, http://www.ijettjournal.org

A Novel Framework for Text Clustering in Distributed Networks

S.K.A. Manoj #1, B.V.V.V. Satya Kameswari *2
1 Assistant Professor, 2 Final M.Tech Student
Dept. of CSE, Pydah College of Engineering and Technology, Boyapalem, Visakhapatnam, AP, India

Abstract: Text clustering is an important task in search techniques such as information retrieval. Most widely used information retrieval systems take a centralized approach, but this approach has flaws: processing time and retrieval time increase as the number of users grows. Existing text clustering algorithms also use this centralized method, which increases the load on the central system. We introduce a distributed approach that performs information retrieval based on clustering carried out individually by the peers, and that assigns documents to clusters based on probabilistic analysis.

I. INTRODUCTION

Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants, and they are said to form a peer-to-peer network of nodes.
Peers make a portion of their resources, such as processing power, disk storage, or network bandwidth, directly available to other network participants without the need for central coordination by servers. Peers are both suppliers and consumers of resources, in contrast to the traditional client–server model, where the consumption and supply of resources are divided. Emerging collaborative P2P systems are going beyond the era of peers doing similar things while sharing resources. They look for diverse peers that can bring unique resources and capabilities to a virtual community, thereby empowering it to engage in tasks greater than those individual peers can accomplish, to the benefit of all the peers.

Peer-to-peer systems often implement an abstract overlay network, built at the application layer on top of the native or physical network topology. Such overlays are used for indexing and peer discovery, and they make the P2P system independent of the physical network architecture. Content is typically exchanged directly over the underlying IP network. Systems designed for anonymity are an exception: they implement extra routing layers to obscure the identity of the source or destination node.

A pure P2P network does not have the notion of clients or servers, but only equal peer nodes that simultaneously function as both "clients" and "servers" to the other nodes on the network. This model of network arrangement differs from the client–server model, where communication is usually to and from a centralized resource. A typical example of a file transfer that does not use the P2P model is the File Transfer Protocol (FTP) service, in which the client and server programs are distinct: the clients initiate the transfer and the servers satisfy these requests.
The P2P overlay network consists of all the participating peers as network nodes. There are links between any two nodes that know each other: if a participating peer knows the location of another peer in the P2P network, then there is a directed edge from the former node to the latter in the overlay network. Based on how the nodes in the overlay network are linked to each other, P2P networks can be classified as structured or unstructured.

A) Structured systems

In structured P2P networks, peers are organized following specific criteria and algorithms, which leads to overlays with specific topologies and properties. They typically use distributed hash table (DHT) based indexing, as in the Chord system. Structured P2P systems are appropriate for large-scale implementations because of their high scalability and some guarantees on performance (typically the lookup complexity is O(log N), where N is the number of nodes in the P2P system). Structured P2P networks employ a globally consistent protocol to ensure that any node can efficiently route a search to some peer that has the desired file or resource, even if the resource is rare. Such a guarantee necessitates a more structured pattern of overlay links.

The most common type of structured P2P network implements a distributed hash table, in which a variant of consistent hashing is used to assign ownership of each file to a particular peer, in a way analogous to a traditional hash table's assignment of each key to a particular array slot. The term DHT is commonly used to refer to the structured overlay, but in practice the DHT is a data structure implemented on top of a structured overlay.

B) Unstructured systems

Unstructured P2P networks do not impose any structure on the overlay network; the nodes in these networks connect in an ad hoc fashion based on a loose set of rules.
Ideally, unstructured P2P systems would have absolutely no centralized elements or nodes, but in practice there are several types of unstructured systems with various degrees of centralization. Three categories can be distinguished:

In pure peer-to-peer systems, the entire network consists solely of equipotent peers. There is only one routing layer, as there are no preferred nodes with any special infrastructure function.

In centralized peer-to-peer systems, a central server is used for indexing functions and to bootstrap the entire system. Although this has similarities with a structured architecture, the connections between peers are not determined by any algorithm.

Hybrid peer-to-peer systems allow nodes with a special infrastructure function to exist; these are often called super nodes.

An unstructured P2P network is formed when the overlay links are established arbitrarily. Such networks can be easily constructed, as a new peer that wants to join the network can copy the existing links of another node and then form its own links over time. In an unstructured P2P network, if a peer wants to find a desired piece of data, the query has to be flooded through the network to find as many peers as possible that share the data. The main disadvantage of such networks is that queries may not always be resolved. Popular content is likely to be available at several peers, and any peer searching for it is likely to find it. If a peer is looking for rare data shared by only a few other peers, however, it is highly unlikely that the search will be successful. Since there is no correlation between a peer and the content it manages, there is no guarantee that flooding will find a peer that has the desired data.
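The flooding search described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the hop limit `ttl`, the dictionary-based overlay, and the function name `flood_search` are assumptions made for illustration only. The sketch also counts forwarded messages, which shows why flooding is costly and why a rare item may be missed when the hop limit is small.

```python
from collections import deque


def flood_search(overlay, holders, start, item, ttl):
    """Flood a query for `item` from `start`, at most `ttl` hops away.

    overlay: dict mapping peer -> list of neighbour peers (overlay links)
    holders: dict mapping peer -> set of items that peer shares
    Returns (set of peers found holding the item, number of messages sent).
    """
    visited = {start}
    queue = deque([(start, 0)])
    found = set()
    messages = 0
    while queue:
        peer, hops = queue.popleft()
        if item in holders.get(peer, set()):
            found.add(peer)
        if hops == ttl:
            continue  # hop limit reached, do not forward further
        for neighbour in overlay.get(peer, []):
            messages += 1  # every forwarded copy costs one message
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, hops + 1))
    return found, messages


if __name__ == "__main__":
    overlay = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
               "d": ["b", "c", "e"], "e": ["d"]}
    holders = {"b": {"popular"}, "c": {"popular"}, "e": {"rare"}}
    print(flood_search(overlay, holders, "a", "popular", ttl=1))
    print(flood_search(overlay, holders, "a", "rare", ttl=2))
```

In this toy overlay the popular item is found after one hop, while the rare item held only by peer `e` is not reached with a hop limit of 2, even though more messages are spent.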
Flooding also causes a high amount of signalling traffic in the network, and hence such networks typically have very poor search efficiency. Gossip protocols are an example of this concept, and many popular P2P networks are unstructured.

C) Distributed hash tables

Distributed hash tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash table: (key, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes in such a way that a change in the set of participants causes a minimal amount of disruption. This allows DHTs to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures. DHTs form an infrastructure that can be used to build P2P networks.

DHT-based networks have been widely used for efficient resource discovery in grid computing systems, where they aid resource management and the scheduling of applications. Recent advances in decentralized resource discovery have been based on extending existing DHTs with the capability of multi-dimensional data organization and query routing. These efforts have looked at embedding spatial database indices, such as space-filling curves (SFCs), the MX-CIF quadtree, and the R*-tree, for managing complex grid resource query objects over DHT networks. Spatial indices are well suited to handling the complexity of grid resource queries. Although some spatial indices can have issues with routing load balance in the case of a skewed data set, spatial indices are more scalable in terms of the number of hops traversed and messages generated while searching and routing grid resource queries. Design choices include overlay rings.
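To make the key-to-node mapping concrete, the following is a minimal in-memory sketch of a ring-based DHT in Python. It is an illustration under stated assumptions, not the system used in this paper: the class name `ToyDHT`, the 32-bit ring, and the use of SHA-1 are choices made here, and the sketch has global knowledge of all nodes, whereas a real DHT such as Chord routes each lookup over the overlay.

```python
import bisect
import hashlib


def ring_position(name, bits=32):
    """Hash a node name or a key onto a ring of size 2**bits."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDHT:
    """In-memory stand-in for a DHT: each key is owned by its successor node."""

    def __init__(self, node_names):
        self.ring = sorted((ring_position(n), n) for n in node_names)
        self.store = {n: {} for n in node_names}

    def owner(self, key):
        positions = [pos for pos, _ in self.ring]
        idx = bisect.bisect_left(positions, ring_position(key))
        return self.ring[idx % len(self.ring)][1]  # wrap around the ring

    def put(self, key, value):
        self.store[self.owner(key)].setdefault(key, []).append(value)

    def get(self, key):
        return self.store[self.owner(key)].get(key, [])


if __name__ == "__main__":
    dht = ToyDHT(["n1", "n2", "n3", "n4"])
    dht.put("cluster", "summary-of-c7")
    print(dht.owner("cluster"), dht.get("cluster"))
```

Because a key is always assigned to the first node at or after its ring position, adding a node only moves the keys that fall between the new node and its predecessor, which is the "minimal disruption" property mentioned above.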
More recent evaluations of P2P resource discovery solutions under real workloads have pointed out several issues in DHT-based solutions, such as the high cost of advertising and discovering resources, and static and dynamic load imbalance.

A key factor in reducing network traffic in these systems is to reduce the number of required comparisons between documents and clusters. Our approach achieves this by applying probabilistic pruning: instead of considering all clusters for comparison with each document, only a few most relevant ones are taken into consideration. We apply this core idea to K-Means, one of the most frequently used text clustering algorithms. Our proposed algorithm, called Probabilistic Clustering for P2P systems (PCP2P), reduces the number of required comparisons by an order of magnitude, with negligible influence on clustering quality.

II. RELATED WORK

A) P2P Basic Clustering Algorithm

Let N1, N2, ..., Nn denote the nodes in the system, each with a local data set Xi. The global data set is denoted X and equals X1 ∪ X2 ∪ ... ∪ Xn. Let Neigh(i) denote the set of nodes to which Ni is directly connected at a given time, that is, the immediate neighbours of Ni. We assume that each node can reliably compute Neigh(i) at any given time. As a consequence, each node can determine whether the link to any of its immediate neighbours from a previous time has gone down.

First, node Ni carries out one round of K-means on its local data Xi using its current centroids. The result is a new set of centroids and their associated cluster counts. This collection is stored in the history table, whose purpose is explained below. Ni then polls its immediate neighbours for their results of the same iteration. Once all Nk have responded or ceased to be neighbours, Ni continues.
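The local step of this algorithm, one K-means round that yields new centroids together with their cluster counts, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `local_kmeans_round`, the use of squared Euclidean distance, and the rule that an empty cluster keeps its previous centroid are assumptions made here.

```python
def local_kmeans_round(points, centroids):
    """One K-means round on local data: assign points, then recompute centroids.

    points, centroids: lists of equal-length tuples of floats.
    Returns (new_centroids, counts); an empty cluster keeps its old centroid.
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # index of the nearest centroid by squared Euclidean distance
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    new_centroids = [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(centroids[j])
        for j in range(k)
    ]
    return new_centroids, counts


if __name__ == "__main__":
    history = {}  # iteration number -> (local centroids, cluster counts)
    local_data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
    history[1] = local_kmeans_round(local_data, [(1.0, 1.0), (9.0, 9.0)])
    print(history[1])
```

Storing each iteration's result in a table keyed by iteration number, as in the usage lines, is what later allows a node to answer polls about earlier iterations.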
Let Resp(i) denote the set of nodes that did respond. Every response message from a node Nk contains the locally updated centroids and cluster counts at Nk during iteration ℓ. Node Ni then updates its jth centroid, producing the jth centroid at the beginning of iteration ℓ + 1, from its own local result and the centroids and cluster counts received from the nodes in Resp(i).

A) Initialization: The initial centroids are chosen randomly and must be the same for every peer. The node that initiates the algorithm chooses the initial centroids and propagates them to its neighbours. When a node receives the initial centroids, it propagates them to its own neighbours and then begins iteration one.

Poll response: Suppose that node Ni, during its iteration ℓ, receives a polling message (h, ℓ’), meaning that the message came from node Nh during its iteration ℓ’.

If ℓ’ < ℓ, then Ni's history table contains its local centroids and their cluster counts from iteration ℓ’. Ni sends these immediately in a response message to Nh.

If ℓ’ > ℓ, then Ni's history table does not yet contain local centroids for iteration ℓ’, and the poll message (h, ℓ’) is placed in the poll table.

If ℓ’ = ℓ, then Ni checks whether its history table contains local centroids and their cluster counts for iteration ℓ’. If so, it sends them to Nh; if not, (h, ℓ) is placed in the poll table.

Lastly, Ni must also check its poll table during iteration ℓ. This is done immediately after producing the local centroids and their cluster counts. For each poll message (hj, ℓ) in the table, Ni sends its local centroids and their cluster counts in a response message to the polling node. These poll messages are then removed from the table.

B) Termination: A node Ni can enter a terminated state, say at the end of iteration ℓ. When this happens, Ni no longer updates its centroids or sends polling messages. It still responds to polling messages (h, ℓ’), as follows.
If ℓ’ ≤ ℓ, then Ni looks up its local centroids and their cluster counts for iteration ℓ’ in its history table and sends them immediately in a response message to Nh. If ℓ’ > ℓ, then Ni looks up its local centroids and their counts for iteration ℓ and sends these in a response message to Nh.

C) Analysis: Consider P2P K-means at some fixed moment in time. Let I denote the maximum number of iterations carried out by any node, and let L denote max{|Neigh(i)| : 1 ≤ i ≤ n}. We provide a worst-case space and communication analysis of P2P K-means with respect to I, L, K, and n. For a node Ni, the space required is proportional to the size of Ni's history and poll tables. The history table is of size O(IK), since K local centroids and their cluster counts are added for each iteration. The poll table is of size O(IL), since each of Ni's neighbours sends one poll message per iteration, thus a maximum of I per neighbour. So the total space is O(I(K + L)). The number of messages (4-byte numbers) transmitted by Ni is O(ILK): Ni sends O(IL) poll messages, and on top of this it sends a response of size O(K) for each of the O(IL) entries in its poll table (O(ILK) in total). Hence the total space and communication over all nodes are O(nI(K + L)) and O(nILK), respectively.

D) Document Frequency-based Selection

The simplest possible method for feature selection in document clustering is the use of document frequency to filter out irrelevant features. While the use of inverse document frequencies reduces the importance of very frequent words, this may not alone be sufficient to reduce their noise effects. In other words, words which are too frequent in the corpus can be removed, because they are typically common words such as “a”, “an”, “the”, or “of”, which are not discriminative from a clustering perspective.
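A document-frequency filter of this kind can be sketched in a few lines of Python. The function name and both thresholds (`min_df`, `max_df_ratio`) are hypothetical values chosen for illustration; the paper does not prescribe specific cut-offs. The lower bound is included because very rare terms are often pruned in the same pass.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.5):
    """Keep terms whose document frequency lies in [min_df, max_df_ratio * n_docs].

    docs: list of token lists.
    Returns (filtered token lists, set of kept terms).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    keep = {t for t, c in df.items()
            if c >= min_df and c <= max_df_ratio * n_docs}
    return [[t for t in tokens if t in keep] for tokens in docs], keep


if __name__ == "__main__":
    corpus = [
        ["the", "peer", "network"],
        ["the", "cluster", "peer"],
        ["the", "cluster", "typo"],
        ["the", "hash", "table"],
    ]
    filtered, vocabulary = filter_by_document_frequency(corpus)
    print(sorted(vocabulary))
    print(filtered)
```

In the small corpus of the usage lines, "the" occurs in every document and is dropped as too frequent, while terms that occur in a single document are dropped as too rare.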
Such words are also referred to as stop words. Several methods are available in the literature for stop-word removal, and stop-word lists of about 300 to 400 words are typically used in the retrieval process. Words which occur extremely infrequently can also be removed from the collection, because such words do not add anything to the similarity computations used in most clustering methods. In some cases, such words may be misspellings or typographical errors in documents. Noisy text collections derived from the web and social networks are more likely to contain such terms. Some lines of research define document frequency based selection purely on the basis of very infrequent terms, because these terms contribute the least to the similarity calculations. However, very frequent words should also be removed, especially if they are not discriminative. The TF-IDF weighting method can also naturally filter out very common words in a simple way. Clearly, the standard set of stop words provides a valid set of words to prune.

E) Term Contribution

The concept of term contribution is based on the fact that the results of text clustering are highly dependent on document similarity. The contribution of a term can therefore be viewed as its contribution to document similarity. In the case of dot-product based similarity, the similarity between two documents is defined as the dot product of their normalized frequencies, so the contribution of a term to the similarity of two documents is the product of its normalized frequencies in the two documents. This product is summed over all pairs of documents to determine the term contribution. The process requires O(n^2) time for each term, where n here is the number of documents, and therefore sampling methods may be required to speed up the computation. A drawback is that the method tends to favour highly frequent words without regard to their specific discriminative power within a clustering process.
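The term contribution computation described above can be sketched as follows. This is an illustrative version under assumptions made here: term frequencies are normalized to unit Euclidean length, and each unordered pair of documents is counted once (counting ordered pairs would simply double every value). The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def normalized_tf(tokens):
    """Term frequencies of one document scaled to unit Euclidean length."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in tf.values()))
    return {t: c / norm for t, c in tf.items()}


def term_contribution(docs):
    """Sum, over all unordered document pairs, of the per-term similarity products."""
    vectors = [normalized_tf(tokens) for tokens in docs]
    contribution = Counter()
    for a, b in combinations(vectors, 2):
        for t in a.keys() & b.keys():  # only shared terms contribute
            contribution[t] += a[t] * b[t]
    return dict(contribution)


if __name__ == "__main__":
    corpus = [["peer", "cluster"], ["peer", "peer"], ["cluster"]]
    print(term_contribution(corpus))
```

The double loop over document pairs is the quadratic cost mentioned above; sampling pairs instead of enumerating all of them is one way to reduce it.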
The term selection is based on some pre-assumed similarity function. While this strategy makes these methods unsupervised, there is a concern that the term selection might be biased by the assumed similarity function. That is, if a different similarity function is assumed, we may end up with different results for term selection. Thus the choice of an appropriate similarity function may be important for these methods.

III. MODEL FRAMEWORK

PCP2P consists of two parallel activities: cluster indexing and document assignment. Cluster indexing is performed by the cluster holders. At regular intervals, these peers create compact cluster summaries and index them in the underlying DHT, using the most frequent cluster terms as keys. The second activity, document assignment, consists of two steps: pre-selection and full comparison. In the pre-selection step, the peer holding a document d retrieves selected cluster summaries from the DHT index to identify the most relevant clusters. In the comparison step, the peer computes similarity score estimates for d using the retrieved cluster summaries, and clusters with low similarity estimates are filtered out. Finally, d is assigned to the cluster with the highest similarity. The filtering drastically reduces the number of full comparisons. Cluster indexing and document assignment are repeated periodically to compensate for churn and to maintain an up-to-date clustering solution.

Our proposed algorithm first constructs the distributed hash table and then retrieves all relevant clusters, following the process below.
A) Preselection step

The peer p holding document d gathers the terms (tokens) of the document together with their frequencies, denoted TF(t,d), and finds the minimum term frequency of the document, denoted Dmin(d). For each selected term t, the peer retrieves all cluster summaries published using t as a key. To avoid retrieving the same summary more than once, p executes these requests sequentially and includes in each request the cluster ids of all summaries already retrieved. All results are then merged into a list of retrieved cluster summaries, denoted Cpre. The term frequency details are published.

B) Filtering

Before filtering, the documents have to be clustered based on the terms they contain. For this grouping we use a clustering algorithm such as k-medoids, which is efficient and lets us cluster the terms in the documents as well as the documents themselves. In token-based clustering we consider all terms in the documents. We first gather all the terms and their frequencies from the documents, then remove grammar words such as auxiliary verbs and conjunctions, and then apply the clustering algorithm to the remaining tokens. After clustering, we maintain the clustering details as a cluster summary, which includes the top terms, the frequencies of those terms, the sum of all term frequencies, and the cluster length.

To avoid overloading, each cluster holder selects random peers to serve as helper cluster holders and replicates the cluster centroid to them. Their IP addresses are also included in the cluster summaries, so that peers can randomly choose a helper for comparing their documents with the cluster centroid without going through the cluster holder. Communication between the master and helper cluster holders takes place only for updating the centroids, so load balancing does not impede system scalability.
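The cluster summary and the pre-selection lookup described in this section can be sketched together in Python. This is a rough illustration, not the authors' code: a plain dictionary stands in for the DHT index, the names `ClusterSummary`, `publish`, and `preselect` are hypothetical, and looking up the `n_terms` most frequent document terms is an assumption made here because the text does not spell out the exact term selection rule.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClusterSummary:
    cluster_id: str
    top_terms: dict   # most frequent cluster terms -> frequency
    total_tf: int     # sum of all term frequencies in the cluster
    length: float     # cluster length, as named in the text


def publish(index, summary):
    """Cluster holder side: index the summary under each of its top terms."""
    for term in summary.top_terms:
        index.setdefault(term, []).append(summary)


def preselect(index, doc_tokens, n_terms=3):
    """Peer side: look up the document's most frequent terms, one request at a time."""
    retrieved = {}  # cluster_id -> summary, plays the role of Cpre
    for term, _ in Counter(doc_tokens).most_common(n_terms):
        # summaries whose ids were already retrieved are skipped, not fetched again
        for summary in index.get(term, []):
            if summary.cluster_id not in retrieved:
                retrieved[summary.cluster_id] = summary
    return list(retrieved.values())


if __name__ == "__main__":
    index = {}  # stand-in for the DHT: term -> list of summaries
    publish(index, ClusterSummary("c1", {"peer": 5, "network": 3}, 8, 2.0))
    publish(index, ClusterSummary("c2", {"peer": 2, "hash": 4}, 6, 1.5))
    publish(index, ClusterSummary("c3", {"music": 9}, 9, 3.0))
    doc = ["peer", "peer", "hash", "hash", "table"]
    print([s.cluster_id for s in preselect(index, doc)])
```

In the usage lines, cluster `c3` shares no indexed term with the document, so it never enters Cpre and is never compared with the document at all.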
In the filtering step we compute the similarity between the document and the cluster centroids. Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].

These bounds apply for any number of dimensions, and cosine similarity is most commonly used in high-dimensional positive spaces. For example, in information retrieval each token or term is notionally assigned a different dimension, and a document is characterised by a vector in which the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. The technique is also used to compare documents in text mining and to measure cohesion within clusters in the field of data mining.

Cosine distance is a term often used for the complement in positive space, that is, Dc(A,B) = 1 − Sc(A,B). It is important to note that this is not a proper distance metric, as it does not have the triangle inequality property; to obtain a proper metric while maintaining the same ordering, it is necessary to convert to angular distance. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, as only the non-zero dimensions need to be considered. To compute this similarity, we take the term frequencies of the document and of the cluster centroid.
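The efficiency argument for sparse vectors can be made concrete with a short Python sketch. It is an illustration only: documents and centroids are represented as dictionaries from term to frequency, the function names are chosen here, and returning 0.0 for an empty vector is a convention adopted in this sketch rather than something stated in the text.

```python
import math
from collections import Counter


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors (dicts)."""
    if len(tf_a) > len(tf_b):
        tf_a, tf_b = tf_b, tf_a  # iterate over the smaller vector
    # only terms present in both vectors contribute to the dot product
    dot = sum(w * tf_b[t] for t, w in tf_a.items() if t in tf_b)
    norm_a = math.sqrt(sum(w * w for w in tf_a.values()))
    norm_b = math.sqrt(sum(w * w for w in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


def cosine_distance(tf_a, tf_b):
    """Complement of the cosine similarity; not a proper metric."""
    return 1.0 - cosine_similarity(tf_a, tf_b)


if __name__ == "__main__":
    document = Counter(["peer", "peer", "cluster"])
    centroid = {"peer": 4.0, "cluster": 2.0, "hash": 1.0}
    print(cosine_similarity(document, centroid))
    print(cosine_distance(document, centroid))
```

Because term frequencies are non-negative, the similarity returned for documents and centroids stays within [0,1], in line with the bounds stated above.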
$$\mathrm{Cos}(d,c) = \frac{\sum_{t \in d} TF(t,d) \cdot TF(t,c)}{|d| \cdot |c|}$$

Here TF(t,d) and TF(t,c) are the frequencies of term t in the document d and in the centroid c, and |d| and |c| are the lengths of the document and of the centroid. If the similarity between two documents is high, the two documents belong to the same cluster. Since the peer holds only the cluster summaries at this stage, we estimate the similarity between the document and the centroid; the estimate is denoted ECos(d,c). To keep this step fast, we concentrate on the top-rated terms of the documents.

C) Comparison Step

In this step we compare the document with the clusters that passed the filter. The document is assigned to the cluster to which it is most similar; the similarity values used in the comparison are computed as in the filtering step.

[Figure: peers P1, P2, and P3, each running the clustering process, connected to the DHT, the cluster summaries, and the document holders.]

P1, P2, and P3 are peers that perform document clustering individually; a search uses the data of all cluster holders.

IV. CONCLUSION

In our proposed work we achieved text clustering based on probabilistic analysis that works efficiently on distributed networks when sharing data. The search process is performed in a distributed manner over the network and also covers the data of neighbouring peers. Compared to other clustering approaches, this increases the performance of the system, and the summary of every peer's data is stored for easy searching.

BIOGRAPHIES

B.V.V.V. Satya Kameswari completed her post-graduate M.Sc (CS) at Gayatri Vidya Parishad for P.G. Courses, Andhra University, Visakhapatnam. At present she is studying for an M.Tech (CSE) at Pydah College of Engineering & Technology, JNTU Kakinada, Visakhapatnam. Her areas of interest are Data Mining and Data Warehousing, and Operating Systems.

S.K.A. Manoj received his B.Sc degree from A.V.N College, Andhra University, Visakhapatnam. He completed his MCA at Bullayya College, Andhra University, Visakhapatnam, and his M.Tech (CSE) at Pydah College of Engineering & Technology, JNTU Kakinada, Visakhapatnam. His areas of interest include Data Mining and Data Warehousing, Artificial Intelligence, Software Engineering, and Computer Networks. He is now an Assistant Professor in the Department of CSE at Pydah College of Engineering & Technology, Visakhapatnam.