A Novel Framework for Text Clustering In Distributed Networks S.K.A.Manoj

International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 9- Sep 2013
A Novel Framework for Text Clustering In
Distributed Networks
S.K.A. Manoj #1, B.V.V.V. Satya Kameswari *2
1 Assistant Professor, 2 Final M.Tech Student
Dept. of CSE, Pydah College of Engineering and Technology , Boyapalem,Visakhapatnam, AP, India
Abstract: Text clustering is an important task in search techniques such as information retrieval. Most widely used information retrieval systems follow a centralized approach, which has drawbacks: processing time and retrieval time increase as the number of users grows. Existing text clustering algorithms also rely on this centralized method, which increases the load on the central system. We introduce a distributed approach in which the peers perform clustering individually for information retrieval, and documents are assigned to clusters based on probabilistic analysis.
I. INTRODUCTION
Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the application, and they are said to form a peer-to-peer network of nodes. Peers make a portion of their resources, such as processing power, disk storage or network bandwidth, directly available to other network participants without the need for central coordination by servers. Peers are both suppliers and consumers of resources, in contrast to the traditional client–server model, where the consumption and supply of resources is divided. Emerging collaborative P2P systems are going beyond the era of peers doing similar things while sharing resources: they look for diverse peers that can bring unique resources and capabilities to a virtual community, thereby empowering it to engage in greater tasks, beyond those that can be accomplished by individual peers, that are beneficial to all the peers.
Peer-to-peer systems often implement an abstract overlay network, built at the application layer on top of the native or physical network topology. Such overlays are used for indexing and peer discovery, and they make the P2P system independent of the physical network architecture. Content is typically exchanged directly over the underlying IP network. Systems that provide anonymity are an exception: they implement extra routing layers to obscure the identity of the source or destination node. A pure P2P network does not have the notion of clients or servers, only equal peer nodes that simultaneously function as both "clients" and "servers" to the other nodes on the network. This model of network arrangement differs from the client–server model, where communication is usually to and from a centralized resource. A typical example of a file transfer that does not use the P2P model is the File Transfer Protocol (FTP) service, in which the client and server programs are distinct: the clients initiate the transfer and the servers satisfy these requests.
ISSN: 2231-5381
The P2P overlay network consists of all the participating peers as network nodes. There are links between any two nodes that know each other: if a participating peer knows the location of another peer in the P2P network, then there is a directed edge from the former node to the latter in the overlay network. Based on how the nodes in the overlay network are linked to each other, we can classify P2P networks as structured or unstructured.
A) Structured systems
In structured P2P networks, peers are organized following specific criteria and algorithms, which leads to overlays with specific topologies and properties. They typically use distributed hash table (DHT) based indexing, as in the Chord system. Structured P2P systems are appropriate for large-scale implementations because of their high scalability and some guarantees on performance (typically, lookup complexity is O(log N), where N is the number of nodes in the P2P system).
Structured P2P networks employ a globally consistent protocol to ensure that any node can efficiently route a search to some peer that has the desired file or resource, even if the resource is rare. Such a guarantee necessitates a more structured pattern of overlay links. The most common type of structured P2P network implements a distributed hash table, in which a variant of consistent hashing is used to assign ownership of each file to a particular peer, in a way analogous to a traditional hash table's assignment of each key to a particular array slot. The term DHT is commonly used to refer to the structured overlay, but in practice the DHT is a data structure implemented on top of a structured overlay.
B) Unstructured systems
Unstructured P2P networks do not impose any structure on the overlay network: the nodes connect in an ad hoc fashion based on a loose set of rules. Ideally, unstructured P2P systems would have absolutely no centralized elements or nodes, but in practice there are several types of unstructured systems with various degrees of centralization. There are three categories, as follows:
http://www.ijettjournal.org
Page 3736
In pure peer-to-peer systems the entire network consists solely of equipotent peers. There is only one routing layer, as there are no preferred nodes with any special infrastructure function.
In centralized peer-to-peer systems a central server is used for indexing functions and to bootstrap the entire system. Although this has similarities with a structured architecture, the connections between peers are not determined by any algorithm.
Hybrid peer-to-peer systems allow infrastructure nodes with special functions to exist; these are often called super nodes.
An unstructured P2P network is formed when the overlay links are established arbitrarily. Such networks can be easily constructed, as a new peer that wants to join the network can copy the existing links of another node and then form its own links over time. In an unstructured P2P network, if a peer wants to find a desired piece of data, the query has to be flooded through the network to find as many peers as possible that share the data. The main disadvantage of such networks is that queries may not always be resolved. Popular content is likely to be available at several peers, so any peer searching for it is likely to find it; but if a peer is looking for rare data shared by only a few other peers, it is highly unlikely that the search will be successful. Since there is no correlation between a peer and the content managed by it, there is no guarantee that flooding will find a peer that has the desired data. Flooding also causes a high amount of signalling traffic in the network, and hence such networks typically have very poor search efficiency. Gossip protocols are an example of this concept, and many popular P2P networks are unstructured.
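The flooding search described above can be sketched as a breadth-first walk with a hop limit. This is only a minimal illustration and not any particular protocol: the `overlay` adjacency map, the `holders` set and the `ttl` parameter are hypothetical names introduced for the example.

```python
from collections import deque

def flood_search(overlay, start, holders, ttl):
    """Flood a query from `start`; return (peers found holding the data, messages sent).

    overlay: dict mapping peer -> list of neighbour peers (directed overlay links)
    holders: set of peers that share the requested data
    ttl:     maximum number of hops the query may travel (an assumed parameter)
    """
    seen = {start}
    found = set()
    messages = 0
    queue = deque([(start, 0)])
    while queue:
        peer, hops = queue.popleft()
        if peer in holders:
            found.add(peer)
        if hops == ttl:
            continue  # the query is not forwarded past its hop limit
        for nb in overlay.get(peer, []):
            messages += 1  # every forwarded copy costs one message
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, hops + 1))
    return found, messages
```

With a small hop limit a rarely shared item can be missed entirely, while raising the limit increases the number of messages, which is the trade-off the paragraph describes.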
C) Distributed hash tables
Distributed hash tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash table: (key, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes in such a way that a change in the set of participants causes a minimal amount of disruption. This allows DHTs to scale to extremely large numbers of nodes and to handle continual node arrivals, departures and failures. DHTs form an infrastructure that can be used to build P2P networks.
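The key-to-node mapping and its limited disruption under membership changes can be illustrated with a toy consistent-hashing ring. The sketch below runs in a single process and is not the routing protocol of Chord or any other real DHT; `ToyDHT`, `ring_position` and `moved_keys` are hypothetical names, and SHA-1 is an assumed choice of hash function.

```python
import bisect
import hashlib

def ring_position(name):
    """Map a string to a point on the hash ring (SHA-1 digest as an integer)."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)

class ToyDHT:
    """Single-process stand-in for a DHT: a key is owned by the first node clockwise of its hash."""

    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.store = {n: {} for n in nodes}

    def owner(self, key):
        positions = [p for p, _ in self.ring]
        idx = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key, value):
        self.store[self.owner(key)][key] = value

    def get(self, key):
        return self.store[self.owner(key)].get(key)

def moved_keys(keys, nodes_before, nodes_after):
    """Return the keys whose owner changes when the node set changes."""
    before, after = ToyDHT(nodes_before), ToyDHT(nodes_after)
    return [k for k in keys if before.owner(k) != after.owner(k)]
```

When a node joins, the only keys that change owner are those taken over by the new node; all other assignments stay as they were.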
DHT-based networks have been widely used for efficient resource discovery in grid computing systems, where they aid resource management and the scheduling of applications. Recent advances in decentralized resource discovery have been based on extending existing DHTs with the capability of multi-dimensional data organization and query routing. These efforts have looked at embedding spatial database indices, such as Space Filling Curves (SFCs), the MX-CIF Quad tree and the R*-tree, for managing complex Grid resource query objects over DHT networks. Spatial indices are well suited to handling the complexity of Grid resource queries. Although some spatial indices can have issues with routing load balance when the data set is skewed, all the spatial indices are more scalable in terms of the number of hops traversed and messages generated while searching and routing Grid resource queries. Design choices include overlay rings. More recent evaluations of P2P resource discovery solutions under real workloads have pointed out several issues in DHT-based solutions, such as the high cost of advertising and discovering resources and static and dynamic load imbalance.
A key factor in reducing network traffic in these systems is to reduce the number of required comparisons between documents and clusters. Our approach achieves this by applying probabilistic pruning: instead of considering all clusters for comparison with each document, only a few of the most relevant ones are taken into consideration. We apply this core idea to K-Means, one of the most frequently used text clustering algorithms. Our proposed algorithm, called Probabilistic Clustering for peer-to-peer systems (PCP2P), reduces the number of required comparisons by an order of magnitude, with negligible influence on clustering quality.
II. RELATED WORK
A) P2P Basic Clustering Algorithm
Let N1, N2, ..., Nn denote the nodes in the system, each with a local data set Xi. The global data set is denoted X and equals the union X1 ∪ X2 ∪ ... ∪ Xn. Let Neigh(i) denote the set of nodes that Ni is directly connected to at a given time, that is, the immediate neighbours of Ni. It is assumed that each node can reliably compute Neigh(i) at any given time. As a consequence, each node can, for example, determine whether the link to any of its immediate neighbours from a previous time has gone down.
In iteration ℓ, node Ni first carries out one round of K-means on its local data Xi using its current centroids. The result is a new set of local centroids and their associated cluster counts. This collection is stored in the history table, whose purpose is explained below. Ni then polls its immediate neighbours and continues once every neighbour Nk has responded or has ceased to be a neighbour. Let Resp(i) denote the set of nodes that did respond. Every response message from a node Nk contains the locally updated centroids and cluster counts at node Nk during iteration ℓ. Node Ni then updates its jth centroid from its own local result and the responses, producing the jth centroid at the beginning of iteration ℓ + 1.
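The update rule itself is not spelled out in the text. One plausible reading, assumed here and not confirmed by the source, is a count-weighted average of the jth local centroid of Ni and the jth centroids reported by the nodes in Resp(i). The function name `merge_centroid` and the data layout are hypothetical.

```python
def merge_centroid(local, responses):
    """Count-weighted average of one cluster's centroid across nodes (assumed rule).

    local and each item of responses are (centroid_vector, cluster_count) pairs
    for the same cluster index j, from node Ni and from each node in Resp(i).
    """
    contributions = [local] + list(responses)
    total = sum(count for _, count in contributions)
    if total == 0:
        return list(local[0])  # no points anywhere: keep the local centroid
    dim = len(local[0])
    return [
        sum(vec[d] * count for vec, count in contributions) / total
        for d in range(dim)
    ]
```

Weighting by cluster counts means that a neighbour whose cluster holds many points pulls the merged centroid further than one whose cluster is nearly empty.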
A) Initialization: The initial centroids are chosen randomly but are the same for every peer. The node that initiates the algorithm chooses the initial centroids and then propagates them to its neighbours. When a node receives the initial centroids, it propagates them to its neighbours and then begins iteration one.
Poll response: Suppose that node Ni, currently in iteration ℓ, receives the polling message (h, ℓ’), meaning that the message came from node Nh during its iteration ℓ’. If ℓ’ < ℓ, then Ni’s history table contains its local centroids and their cluster counts from iteration ℓ’, and Ni sends these immediately in a response message to Nh. If ℓ’ > ℓ, then Ni’s history table does not contain local centroids for iteration ℓ’, and the poll message (h, ℓ’) is placed in the poll table. If ℓ’ = ℓ, then Ni checks whether its history table contains local centroids and their cluster counts for iteration ℓ’; if so, they are sent in a response message to Nh, and if not, (h, ℓ) is placed in the poll table. Lastly, Ni must also check its poll table during iteration ℓ. This is done immediately after producing the local centroids and their cluster counts: for each poll message (hj, ℓ) in the poll table, Ni sends its local centroids and their cluster counts in a response message to node hj. These poll messages are then removed from the table.
B) Termination: A node Ni can enter a terminated state, say at the end of iteration ℓ. Once this happens, Ni no longer updates its centroids or sends polling messages. It still responds to a polling message (h, ℓ’) as follows. If ℓ’ ≤ ℓ, then Ni looks up its local centroids and their cluster counts for iteration ℓ’ in its history table and immediately sends them in a response message to Nh. If ℓ’ > ℓ, then Ni looks up its local centroids and their counts for iteration ℓ and sends these in a response message to Nh.
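The poll-response and termination rules can be sketched as bookkeeping on a history table and a poll table. This is a simplified illustration under stated assumptions: message transport is replaced by an `outbox` list, and the class and method names (`Node`, `finish_local_step`, `on_poll`) are hypothetical.

```python
class Node:
    """Bookkeeping for poll handling at one node Ni (message transport omitted)."""

    def __init__(self):
        self.iteration = 0        # current iteration l
        self.history = {}         # iteration -> (local centroids, cluster counts)
        self.poll_table = []      # deferred polls as (sender h, iteration l')
        self.terminated = False
        self.outbox = []          # (receiver, iteration answered, payload)

    def finish_local_step(self, centroids, counts):
        """Record the local K-means result for the current iteration, then serve deferred polls."""
        self.history[self.iteration] = (centroids, counts)
        ready = [(h, it) for h, it in self.poll_table if it == self.iteration]
        self.poll_table = [(h, it) for h, it in self.poll_table if it != self.iteration]
        for h, it in ready:
            self.outbox.append((h, it, self.history[it]))

    def on_poll(self, h, it):
        """Handle poll message (h, l') from node Nh."""
        if self.terminated and it > self.iteration:
            it = self.iteration   # a terminated node answers with its final iteration
        if it in self.history:
            self.outbox.append((h, it, self.history[it]))
        else:
            self.poll_table.append((h, it))  # answer later, once that iteration is reached
```

A poll for an iteration the node has not yet completed is parked in the poll table and answered as soon as the corresponding local result exists.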
C) Analysis: Consider P2P K-means at some fixed moment in time. Let I denote the maximum number of iterations carried out by any node, and let L denote max{|Neigh(i)| : 1 ≤ i ≤ n}. We provide a worst-case space and communication analysis of P2P K-means with respect to I, L, K, and n. For a node Ni, the space required is proportional to the size of Ni’s history and poll tables. The history table is of size O(IK), since K local centroids and their cluster counts are added in each iteration. The poll table is of size O(IL), since each of Ni’s neighbours sends one poll message per iteration, and thus a maximum of I per neighbour. So the total space at Ni is O(I(K + L)). The number of messages (4-byte numbers) transmitted by Ni is O(ILK): Ni sends a poll message to each neighbour in each iteration, and on top of this it sends a response of size O(K) for each of the O(IL) entries in its poll table (O(ILK) in total). Hence the total amounts of space and communication over all nodes are O(nI(K+L)) and O(nILK), respectively.
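To give a feel for these bounds, the following worked example plugs in purely hypothetical values (n = 100 nodes, I = 10 iterations, K = 5 clusters, L = 4 neighbours) and ignores constant factors.

```latex
\begin{aligned}
\text{space per node} &\propto I(K+L) = 10\,(5+4) = 90,\\
\text{communication per node} &\propto ILK = 10 \cdot 4 \cdot 5 = 200,\\
\text{space over all nodes} &\propto nI(K+L) = 100 \cdot 90 = 9000,\\
\text{communication over all nodes} &\propto nILK = 100 \cdot 200 = 20000.
\end{aligned}
```

Both totals grow linearly in the number of nodes, while the per-node cost depends only on I, K and L.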
D) Document Frequency-based Selection
The simplest possible method for feature selection in document clustering is the use of document frequency to filter out irrelevant features. The use of inverse document frequencies reduces the importance of very frequent words, but this alone may not be sufficient to reduce their noise effects. In other words, words which are too frequent in the corpus can be removed because they are typically common words such as “a”, “an”, “the”, or “of”, which are not discriminative from a clustering perspective. Such words are also referred to as stop words. Several methods for stop-word removal are available in the literature, and stop-word lists of about 300 to 400 words are typically used in the retrieval process. Words that occur extremely infrequently can also be removed from the collection, because they do not add anything to the similarity computations used in most clustering methods. In some cases, such words may be misspellings or typographical errors in documents. Noisy text collections derived from the web and social networks are more likely to contain such terms. Some lines of research define document frequency based selection purely on the basis of very infrequent terms, because these terms contribute the least to the similarity calculations. However, it should be emphasized that very frequent words should also be removed, especially if they are not discriminative. The TF-IDF weighting method also naturally filters out very common words in an easy way, by giving them low weights. Clearly, the standard set of stop words provides a valid set of words to prune.
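Document-frequency pruning and TF-IDF weighting can be sketched as follows. The thresholds `min_df` and `max_df_ratio` and the function names are assumptions made for this illustration; documents are taken to be already tokenized lists of terms.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def select_features(docs, min_df=2, max_df_ratio=0.8, stop_words=frozenset()):
    """Keep terms that are neither too rare, too frequent, nor stop words."""
    df = document_frequency(docs)
    n = len(docs)
    return {
        t for t, c in df.items()
        if c >= min_df and c / n <= max_df_ratio and t not in stop_words
    }

def tf_idf(tokens, df, n_docs):
    """TF-IDF weights for one document; a term present in every document gets weight 0."""
    tf = Counter(tokens)
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
```

In this sketch a word that appears in every document is dropped by the upper threshold and, independently, receives a TF-IDF weight of zero, while a one-off misspelling is dropped by the lower threshold.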
E) Term Contribution
The concept of term contribution is based on the fact that the results of text clustering are highly dependent on document similarity. The contribution of a term can therefore be viewed as its contribution to document similarity. For example, in the case of dot-product based similarity, the similarity between two documents is defined as the dot product of their normalized term frequencies, so the contribution of a term to the similarity of two documents is the product of its normalized frequencies in the two documents. This product needs to be summed over all pairs of documents in order to determine the term contribution. This process requires O(n^2) time for each term, where n is the number of documents, and therefore sampling methods may be required to speed up the computation. A criticism of this measure is that it tends to favour highly frequent words without regard to their specific discriminative power within a clustering process. In addition, the term selection is based on some pre-assumed similarity function. Although this strategy keeps these methods unsupervised, there is a concern that the term selection might be biased due to the potential bias of the assumed similarity function. That is, if a different similarity function is assumed, we may end up with different results for term selection. Thus the choice of an appropriate similarity function may be important for these methods.
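The pairwise summation behind term contribution can be sketched as below. Normalizing each document to unit Euclidean length is an assumption of this example (the text only says "normalized frequencies"), documents are assumed to be non-empty token lists, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def normalized_tf(tokens):
    """Term frequencies scaled so the document vector has unit Euclidean length."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(f * f for f in tf.values()))
    return {t: f / norm for t, f in tf.items()}

def term_contribution(docs):
    """Sum, over all document pairs, of each term's share of the dot-product similarity."""
    vectors = [normalized_tf(tokens) for tokens in docs]
    contribution = Counter()
    for a, b in combinations(vectors, 2):      # O(n^2) document pairs
        for t in a.keys() & b.keys():          # only shared terms contribute
            contribution[t] += a[t] * b[t]
    return dict(contribution)
```

The loop over all pairs is the quadratic cost mentioned in the text; a sampled variant would draw a subset of pairs instead of enumerating them all.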
III. MODEL FRAMEWORK
PCP2P consists of two parallel activities: cluster indexing and document assignment. Cluster indexing is performed by the cluster holders. At regular intervals, these peers create compact cluster summaries and index them in the underlying DHT, using the most frequent cluster terms as keys. The second activity, document assignment, consists of two steps: pre-selection and full comparison. In the pre-selection step, the peer holding a document d retrieves selected cluster summaries from the DHT index in order to identify the most relevant clusters.
In the comparison step, the peer computes similarity score estimates for d using the retrieved cluster summaries, and clusters with low similarity estimates are filtered out. The document is then fully compared with the remaining clusters, and lastly d is assigned to the cluster with the highest similarity. The filtering drastically reduces the number of full comparisons. Cluster indexing and document assignment are repeated periodically to compensate for churn and to maintain an up-to-date clustering solution.
In our proposed work, the algorithm first constructs the distributed hash table and then retrieves all relevant clusters. It follows the process below.
A) Preselection step
In this step, the peer p holding a document d gathers the terms (tokens) of d together with their frequencies, where the frequency of term t in d is denoted TF(t,d), and finds the minimum term frequency of the document, denoted Dmin(d). For each selected term t, p queries the DHT to retrieve all summaries published using t as a key. To avoid retrieving the same summary more than once, p executes these requests sequentially and includes in each request the cluster ids of all summaries already retrieved. All results are then merged into a list of the retrieved cluster summaries, denoted Cpre. The term frequency details are published.
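The preselection step can be sketched with an ordinary dictionary standing in for the DHT index. Everything named here is an assumption for illustration: `dht_index` maps a term to the summaries published under it, each summary is a dict with a `cluster_id` field, and `num_top_terms` controls how many of the document's most frequent terms are used as lookup keys.

```python
from collections import Counter

def preselect(doc_tokens, dht_index, num_top_terms=3):
    """Collect candidate cluster summaries (Cpre) for one document.

    dht_index: dict term -> list of summaries, each a dict with a "cluster_id" key;
               it stands in for the DHT lookup of summaries published under a term.
    """
    tf = Counter(doc_tokens)
    top_terms = [t for t, _ in tf.most_common(num_top_terms)]
    c_pre = {}                                   # cluster_id -> summary
    for term in top_terms:                       # requests are issued one after another
        already = set(c_pre)                     # ids sent along so they are not returned again
        for summary in dht_index.get(term, []):
            if summary["cluster_id"] not in already:
                c_pre[summary["cluster_id"]] = summary
    return list(c_pre.values())
```

Because the requests run sequentially, each lookup can exclude the clusters found so far, so a summary indexed under several of the document's terms is transferred only once.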
B) Filtering
Before filtering, we have to cluster the documents based on the terms they contain. For grouping we use a clustering algorithm such as the k-medoids algorithm, which is efficient and lets us cluster the documents as well as the terms in the documents. For token-based clustering we consider all terms in the documents, and then filter out grammatical words such as auxiliary verbs and conjunctions.
In this process we first gather all the terms and their frequencies from the documents, then remove the grammatical words from the extracted terms, and then apply the clustering algorithm to the remaining tokens. After clustering, we maintain the clustering details as a cluster summary, which includes the top terms, the frequency of each of these terms, the sum of all term frequencies, and the cluster length.
To avoid overloading, each cluster holder selects random peers to serve as helper cluster holders and replicates the cluster centroid to them. Their IP addresses are also included in the cluster summaries, so that peers can randomly choose a helper for comparing their documents with the cluster centroid without going through the cluster holder. Communication between the master and helper cluster holders takes place only for updating the centroids, so load balancing does not impede the scalability of the system.
In filtering, we find the similarity between the document and the centroids of the clusters. Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. These bounds apply for any number of dimensions, and cosine similarity is most commonly used in high-dimensional positive spaces. For instance, in information retrieval each token or term is notionally assigned a different dimension, and a document is characterised by a vector where the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.
The technique is also used to compare documents in text mining and to measure cohesion within clusters in the field of data mining. Cosine distance is a term often used for the complement in positive space, that is: Dc(A,B) = 1 − Sc(A,B), where Sc(A,B) is the cosine similarity. It is important to note that this is not a proper distance metric, as it does not have the triangle inequality property; to obtain this property whilst maintaining the same ordering, it is necessary to convert to angular distance. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, as only the non-zero dimensions need to be considered.
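A minimal sketch of cosine similarity on sparse vectors, together with the cosine and angular distances, may make these points concrete. Vectors are represented as dicts from term to weight, which is an assumption of this example, as is the convention of returning 0 for an empty vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts term -> weight."""
    if len(a) > len(b):
        a, b = b, a                              # iterate over the shorter vector
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                               # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Complement of cosine similarity; not a true metric."""
    return 1.0 - cosine_similarity(a, b)

def angular_distance(a, b):
    """Angle between the vectors, normalised to [0, 1] for non-negative vectors."""
    s = max(-1.0, min(1.0, cosine_similarity(a, b)))   # guard against rounding
    return 2.0 * math.acos(s) / math.pi
```

For the vectors x = (1, 0), y = (1, 1) and z = (0, 1), the cosine distance from x to z is 1 while the detour through y costs only about 0.59, which violates the triangle inequality; the angular distances are 1 and 0.5 + 0.5, which satisfy it.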
For this similarity we take the term frequencies of the document d and of the centroid c of a cluster:
$$\mathrm{Cos}(d,c) = \frac{\sum_{t \in d} TF(t,d) \ast TF(t,c)}{|d| \ast |c|}$$
Here |d| and |c| denote the lengths of the document and centroid vectors. When the similarity between documents is high, those documents belong to the same cluster. In practice the peer has only the cluster summary at hand, so we estimate the similarity between the document and the centroid; the estimate is denoted ECos(d,c). For fast sorting, we concentrate on the top-rated terms of the documents.
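One way to picture the estimate is to evaluate the cosine formula using only the terms listed in the cluster summary. This is a sketch under stated assumptions rather than the exact estimator: the summary layout (`top_terms`, `length`, `cluster_id`), the reading of "length" as the Euclidean length of the full centroid, and the `keep_ratio` filtering rule are all hypothetical.

```python
import math

def estimated_cosine(doc_tf, summary):
    """Estimate Cos(d, c) from a cluster summary that lists only the cluster's top terms.

    summary: {"top_terms": dict term -> frequency in the cluster,
              "length":    assumed Euclidean length of the full centroid vector}
    """
    doc_len = math.sqrt(sum(f * f for f in doc_tf.values()))
    if doc_len == 0.0 or summary["length"] == 0.0:
        return 0.0
    dot = sum(f * summary["top_terms"].get(t, 0.0) for t, f in doc_tf.items())
    return dot / (doc_len * summary["length"])

def filter_candidates(doc_tf, summaries, keep_ratio=0.5):
    """Keep clusters whose estimate reaches keep_ratio times the best estimate."""
    scored = [(estimated_cosine(doc_tf, s), s["cluster_id"]) for s in summaries]
    best = max((score for score, _ in scored), default=0.0)
    return [cid for score, cid in scored if score >= keep_ratio * best and score > 0.0]
```

Only the clusters that survive this filter would then be compared in full against the document, which is where the saving in comparisons comes from.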
C) Comparison Step
In this step we compare the document with the clusters that passed filtering. The document is placed in the cluster to which it is most similar, and for this comparison we use the similarity computed in the filtering step.
[Figure: peers P1, P2 and P3 each run the clustering process and are connected through the DHT, the cluster summaries and the document holders.]
P1, P2 and P3 are peers that perform document clustering individually; a search uses the data of all cluster holders.
IV. CONCLUSION
In our proposed work we achieve text clustering based on probabilistic analysis, which works efficiently on distributed networks when sharing data. Searching is performed as a distributed search over the network, which also covers the data of neighbouring peers. Compared with other clustering approaches, this is expected to improve the performance of the system, and the summary of the data of every peer is stored for easy searching.
BIOGRAPHIES
B.V.V.V. Satya Kameswari completed her post-graduate M.Sc (CS) at Gayatri Vidya Parishad for P.G. Courses, Andhra University, Visakhapatnam. At present she is studying for an M.Tech (CSE) at Pydah College of Engineering & Technology, JNTU Kakinada, Visakhapatnam. Her areas of interest are Data Mining, Data Warehousing and Operating Systems.
S.K.A. Manoj received his B.Sc degree from A.V.N College, Andhra University, Visakhapatnam. He completed his MCA at Bullayya College, Andhra University, Visakhapatnam, and his M.Tech (CSE) at Pydah College of Engineering & Technology, JNTU Kakinada, Visakhapatnam. His areas of interest include Data Mining and Data Warehousing, Artificial Intelligence, Software Engineering and Computer Networks. He is now an Assistant Professor in the Department of CSE at Pydah College of Engineering & Technology, Visakhapatnam.