International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 1 – Dec 2013

A Fast and Simple Text Clustering in Distributed Systems

1Birlangi Usha Rani, 2U. D. Prasanna
1M.Tech Scholar, 2Associate Professor
1,2Dept. of Computer Science and Engineering,
Aditya Institute of Technology and Management, Tekkali, Andhra Pradesh
Abstract: Data exchange has become a routine activity in networks. Exchanging data from a single source is not difficult, but exchanging data from multiple sources is; clustering is therefore needed to group large collections of documents. Traditional clustering operates on plain documents only, and clustering plain documents while they are being gathered over the network is not secure, because the data travels across the network in the clear. We therefore introduce a method that protects these documents using a cryptographic technique.
I. INTRODUCTION
Clustering is the division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. In other words, the goal of a good document clustering scheme is to minimize intra-cluster distances between documents while maximizing inter-cluster distances (using an appropriate distance measure between documents) [1][2].
A distance measure (or, dually, a similarity measure) thus lies at the heart of document clustering. Clustering is the most common form of unsupervised learning, and this is the major difference between clustering and classification [3]. No supervision means that no human expert has assigned documents to classes; it is the distribution and makeup of the data that determine cluster membership.

Clustering is sometimes erroneously referred to as automatic classification. This is inaccurate, since the clusters found are not known prior to processing, whereas in classification the classes are predefined: the classifier learns the association between objects and classes from a so-called training set, i.e. a set of data correctly labeled by hand, and then replicates the learnt behaviour on unlabeled data.

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbors of a document. More recently, clustering has been proposed for browsing a collection of documents or for organizing the results returned by a search engine in response to a user's query. Document clustering has also been used to automatically generate hierarchical clusters of documents [4][5][6].
We were developing an application for recommending news articles to the readers of a news portal. The following challenges motivated us to cluster the news articles:
1. The number of available articles was large.
2. A large number of articles were added each day.
3. Articles covering the same news item were added from different sources.
4. The recommendations had to be generated and updated in real time [7][8].
By clustering the articles we could reduce the search domain for recommendations, since most users were interested in news belonging to only a few clusters. This improved time efficiency to a great extent, and it also allowed us to identify articles covering the same news item from different sources. The main motivation of this work has been to investigate possibilities for improving the effectiveness of document clustering by identifying the main causes of ineffectiveness in existing algorithms and addressing them [9].
Initially we applied the K-Means and Agglomerative Hierarchical clustering methods to the data and found that the results were not very satisfactory; the main reason for this was noise in the graph constructed for the data. We therefore pre-processed the graph to remove the extra edges: we applied a heuristic for removing inter-cluster edges (a sketch of one such heuristic is given below) and then applied the standard graph clustering methods, obtaining much better results [10][11][12].
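The paper does not spell out this heuristic, so the following is only a hypothetical illustration of the idea: treat documents as nodes of a similarity graph and drop edges whose weight falls below a threshold, on the assumption that weak edges are mostly inter-cluster noise.

```python
# Hypothetical illustration: the paper does not define its edge-pruning
# heuristic, so this sketch simply drops edges whose similarity weight
# falls below a threshold before any graph clustering is run.

def prune_weak_edges(edges, threshold=0.2):
    """edges: dict mapping (doc_i, doc_j) -> similarity weight in [0, 1].
    Returns a new dict containing only edges at or above the threshold."""
    return {pair: w for pair, w in edges.items() if w >= threshold}

# Toy document-similarity graph: two tight groups joined by one weak edge.
edges = {
    ("d1", "d2"): 0.85, ("d2", "d3"): 0.80,   # likely intra-cluster
    ("d4", "d5"): 0.90,                        # likely intra-cluster
    ("d3", "d4"): 0.05,                        # likely inter-cluster (noise)
}
print(prune_weak_edges(edges))                 # the weak d3-d4 edge is removed
```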
We also tried a completely different approach: first clustering the words of the documents using a standard clustering approach, thereby reducing the noise, and then using these word clusters to cluster the documents. We found that this also gave better results than the classical K-Means and Agglomerative Hierarchical clustering methods.
We first study the effectiveness of pre-processing the data representation before applying the classical clustering methods. We then examine the effectiveness of a new clustering algorithm in which the noise is reduced by first clustering the features of the data and then clustering the data on the basis of their feature clusters.
Clustering is the most common form of unsupervised learning and is a major tool in a number of applications in many fields of business and science. Below we summarize the main directions in which clustering is used.
• Finding Similar Documents: This feature is often used
when the user has spotted one “good” document in a search
result and wants more-like-this. The interesting property
here is that clustering is able to discover documents that are
conceptually alike in contrast to search-based approaches
that are only able to discover whether the documents share
many of the same words.
• Organizing Large Document Collections: Document
retrieval focuses on finding documents relevant to a
particular query, but it fails to solve the problem of making
sense of a large number of uncategorized documents. The
challenge here is to organize these documents in a
taxonomy identical to the one humans would create given
enough time and use it as a browsing interface to the
original collection of documents[13].
• Duplicate Content Detection: In many applications
there is a need to find duplicates or near-duplicates in a
large number of documents. Clustering is employed for
plagiarism detection, grouping of related news stories and
to reorder search results rankings (to assure higher
diversity among the topmost documents). Note that in such
applications the description of clusters is rarely needed[14].
• Recommendation System: In this application a user is recommended articles based on the articles the user has already read. Clustering the articles makes this possible in real time and significantly improves quality [15].
• Search Optimization: Clustering greatly improves the quality and efficiency of search engines, since the user query can first be compared to the clusters instead of directly to the documents, and the search results can also be arranged more easily.
The goal of a document clustering scheme is to
minimize intra-cluster distances between documents, while
maximizing inter-cluster distances (using an appropriate
distance measure between documents). A distance measure
(or, dually, similarity measure) thus lies at the heart of
document clustering. The large variety of documents
makes it almost impossible to create a general algorithm that works best on all kinds of datasets [16][17].
Document clustering has been studied for many decades, but it is still far from a trivial, solved problem. The challenges are:
1. Selecting appropriate features of the documents that
should be used for clustering.
2. Selecting an appropriate similarity measure between
documents.
3. Selecting an appropriate clustering method utilising
the above similarity measure.
4. Implementing the clustering algorithm in an
efficient way that makes it feasible in terms of required
memory and CPU resources.
5. Finding ways of assessing the quality of the performed clustering.

Furthermore, with medium to large document collections (10,000+ documents), the number of term-document relations is fairly high (millions+), and the computational complexity of the applied algorithm is thus a central factor in whether it is feasible for real-life applications. If a dense matrix is constructed to represent term-document relations, this matrix could easily become too large to keep in memory: e.g. 100,000 documents × 100,000 terms = 10^10 entries ≈ 40 GB using 32-bit floating-point values. If the vector model is applied, the dimensionality of the resulting vector space will likewise be quite high (10,000+). This means that simple operations, such as finding the Euclidean distance between two documents in the vector space, become time-consuming tasks.
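Sparse representations are the usual way around this: a document keeps only its non-zero term weights, so both memory use and distance computations scale with the number of distinct terms per document rather than with the full vocabulary. The following is a small illustrative sketch (not from the paper) of a sparse Euclidean distance between two such document vectors.

```python
# Illustrative sketch (not from the paper): documents as sparse {term: weight}
# dictionaries, so only non-zero entries are stored and compared.
import math

def sparse_euclidean(a, b):
    """Euclidean distance between two sparse term-weight vectors."""
    terms = set(a) | set(b)          # terms non-zero in at least one document
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

d1 = {"cluster": 0.7, "document": 0.5, "peer": 0.1}
d2 = {"cluster": 0.6, "summary": 0.4, "peer": 0.3}
print(round(sparse_euclidean(d1, d2), 3))
```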
II. RELATED WORK
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations cause different results, so the better choice is to place them as far away from each other as possible. The next step is to take each point of the data set and associate it with the nearest center. When no point is pending, the first step is completed and an initial grouping is done. At this point we re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. Once we have these k new centroids, a new binding is made between the data set points and the nearest new center, and a loop is thereby generated. As a result of this loop we may notice that the k centers change their location step by step until no more changes occur, i.e. the centers no longer move. Finally, this algorithm aims at minimizing an objective function known as the squared error function:

J(V) = Σ_{i=1..c} Σ_{j=1..c_i} ( ||x_ij − v_i|| )²

where ||x_ij − v_i|| is the Euclidean distance between data point x_ij and center v_i, c_i is the number of data points in the i-th cluster, and c is the number of cluster centers.
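As an illustration of the procedure just described (a sketch, not code from the paper), the following implements the assignment and re-centering loop over small numeric vectors; in document clustering the points would be term-weight vectors.

```python
# Minimal k-means sketch (illustrative only, not from the paper).
import random

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)                     # initial centers
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: squared_dist(p, centers[j]))
            clusters[i].append(p)
        # update step: move each center to the barycenter of its cluster
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:                         # centers stopped moving
            break
        centers = new_centers
    return centers, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
centers, clusters = kmeans(pts, k=2)
print(centers)
print(clusters)
```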
Distributed hash tables are a way to organize the
storage and network resources of many Internet hosts to
create a single storage system. The promise of DHTs lies in
their potential to create a distributed storage infrastructure
that, in aggregate, is more robust and offers higher
performance than any individual host. Because DHTs are
likely to be deployed on potentially unreliable volunteer
nodes spread around the globe, meeting this goal is
challenging: nodes in the system may join or leave at any
time and latencies between nodes can be large.
In a distributed implementation, known as
a distributed hash table, or DHT, the hash table is
distributed among a set of nodes. Nodes all use the same
hash function. Looking up a key gives you a node ID that
holds the data. The entire goal of a DHT is to allow anyone
to find the node that corresponds to a given key. That node
will be responsible for holding the information associated
with that key. A key difference between the DHT approach and the centralized or flooding approaches is that a specific node is responsible for holding the information relating to the key (even if it only stores a link to the content) [18]–[20].
To get a clear understanding of distributed hash tables, it is necessary to first recall the concept of a hash table. Basically, a hash table is an array used to store a set of items. Every item V is mapped to a hash value h(V) and then stored in slot h(V) of the array. The hash function is a function

h : U → {0, 1, …, m−1}

that maps each possible item in U to a position in the hash table; the parameter m is the size of the hash table. This technique cannot be applied directly to store data in peers: the number of active peers changes constantly, which would require continuously adjusting the table's indexing. Furthermore, a new allocation of data to peers would be required with each peer departure, arrival, or failure, which is very inefficient. These difficulties and performance constraints related to the direct use of hash tables in peer-to-peer networks were the incentive to develop the concept of the Distributed Hash Table (DHT), which progressively became a standard method in peer-to-peer networks (a small sketch of the plain hash-table mapping and its churn problem follows).
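As a small illustration (not from the paper), the mapping above can be realized as h(V) = hash(V) mod m; the sketch below also shows why this plain scheme breaks down when the number of peers m changes, which is exactly what motivates the DHT design.

```python
# Illustrative sketch: a plain "hash mod m" placement and why peer churn hurts.
import hashlib

def slot(item: str, m: int) -> int:
    """Map an item to one of m slots using a stable hash (SHA-1 here)."""
    digest = hashlib.sha1(item.encode("utf-8")).hexdigest()
    return int(digest, 16) % m

items = ["doc-a", "doc-b", "doc-c", "doc-d"]
before = {x: slot(x, 5) for x in items}   # 5 peers
after = {x: slot(x, 6) for x in items}    # one peer joins -> m changes
moved = [x for x in items if before[x] != after[x]]
print(before, after, moved)               # most items change slots
```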
This structure is based on the following main concepts:
1. Mapping data values to keys:
Data value V is mapped to a key K using a hash function as follows:

h(V) = K.

The hash function needs to meet a quite demanding set of properties. First, the hash function should be easy to compute, in order to ensure high efficiency of the mapping process. In addition to this requirement, the hash function should be one-way, i.e. hard to invert, so that for any K it is computationally infeasible to find V such that h(V) = K. Another property of h is that it should be collision-free, i.e. for any V it should be infeasible to find another V' such that h(V) = h(V').

These targeted properties of the hash function are hard to satisfy simultaneously, since they may be contradictory: for example, to obtain a function that is hard to invert, the cost of computing the value of such a function will necessarily increase. This fact makes designing such functions a very challenging task.
2. Dynamic partitioning of the key set among nodes:
The interval of keys is divided into parts, and each part is associated with an active peer in the network. This partitioning is dynamic and can be efficiently adjusted on any change in the set of participants.

When a node newly joins the network, it contacts an active node, and half of that node's key subset is given to the new node. The routing structure then has to be updated: the nodes neighboring the contacted node are informed about the new node and their routing information is updated accordingly. If a node leaves the network, its key subset is allocated to its neighbors and the stored data is moved to the newly responsible nodes. The key-set partitioning can also adapt to node failures: the corresponding key subset is allocated to other active nodes, but the stored data cannot be recovered. Until the key partitioning has been updated, the network can continue to function by using redundant routing paths and nodes.
3. Data storage:
Once the key K has been calculated, the data can be stored at the location associated with the obtained key. There are two ways of storing the data: data values can be stored directly by the node responsible for their associated keys, or, alternatively, pointers to where the data values are actually stored can be kept.
4. Data lookup:
Any node in the network can retrieve any stored data. The requesting node contacts a random active node. If the data is stored under a key in the subset associated with the contacted node, there is no need to route the request through the network structure; otherwise the request is forwarded until it reaches the node responsible for storing the requested data. Several routing algorithms have been developed in this context; based on the desired features of the network (minimum latency, high security, etc.), a particular routing algorithm can be adopted. A minimal sketch of these DHT concepts is given below.
III. PROPOSED WORK
Algorithm:

Initialize peers P = {p1, p2, …, pn}.
Take a peer pi.
Collect documents D = {d1, d2, d3, …, dn} and all terms in the documents T = {t1, t2, …, tn}.
For (d = 1 to n)
{
    Construct the DH table for document d.
}
Calculate the document weight:

W(t, d) = TF(t,d) × log2(n / df(t))

where TF(t,d) is the number of occurrences of term t in document d and df(t) is the number of documents in the collection containing term t (its document frequency).

Calculate the similarity between the documents, as well as between the clusters.
Generate the cluster summary (keyword, frequency): CS = {TF(t,d), F(t)}.
Form the cluster holders, named P = {p1, p2, …, pn}.
If d is a new document, calculate its document weight as shown above.
Encrypt the cluster summaries {ep1, ep2, …, epn}:
For (p = 1 to pn)
{
    Encrypt(CS)
}
Calculate the frequency of the keywords TF(t,d).
Calculate the similarity between each cluster and the new document using the cluster summary CS.
Assign the new document to a cluster.
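As a concrete illustration of the weighting step above (a sketch only, not the authors' implementation; the peer, DHT, and encryption machinery are omitted), the following computes W(t, d) = TF(t,d) · log2(n / df(t)) for a toy collection.

```python
# Sketch of the TF-IDF style document weighting used above (illustrative only).
import math
from collections import Counter

docs = {
    "d1": "cluster the text documents in distributed systems".split(),
    "d2": "distributed hash table stores the cluster summary".split(),
    "d3": "encrypt the cluster summary before exchange".split(),
}

n = len(docs)
# df(t): number of documents in the collection that contain term t
df = Counter(t for terms in docs.values() for t in set(terms))

def weights(terms):
    tf = Counter(terms)                                    # TF(t, d)
    return {t: tf[t] * math.log2(n / df[t]) for t in tf}   # W(t, d)

for name, terms in docs.items():
    print(name, {t: round(w, 3) for t, w in weights(terms).items()})
```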
[Figure: Peers P1–P5 connected through a DHT; the cluster holder and cluster summary are exchanged as encrypted data and decrypted at the receiving peer.]
The methods we used are explained in detail below. For encrypting the cluster summary, we adopted the Advanced Encryption Standard (AES) algorithm.
AES is based on a design principle known as a
substitution-permutation network, and is fast in both
software and hardware. Unlike its predecessor DES, AES
does not use a Feistel network. AES is a variant of Rijndael
which has a fixed block size of 128 bits, and a key size of
128, 192, or 256 bits. By contrast, the Rijndael
specification per se is specified with block and key sizes
that may be any multiple of 32 bits, both with a minimum
of 128 and a maximum of 256 bits[20].
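The paper does not give implementation details for this step; as one possible illustration (an assumption on our part, not the authors' code), the sketch below encrypts a serialized cluster summary with AES-GCM using the third-party Python `cryptography` package before it is shared between peers.

```python
# Illustrative sketch (not the authors' implementation): encrypting a
# cluster summary with AES before it is exchanged between peers.
# Assumes the third-party "cryptography" package is installed.
import json
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)    # 128-bit AES key shared by peers
aesgcm = AESGCM(key)

cluster_summary = {"cluster": "c1", "keywords": {"text": 12, "dht": 7}}
plaintext = json.dumps(cluster_summary).encode("utf-8")

nonce = os.urandom(12)                       # unique nonce per message
ciphertext = aesgcm.encrypt(nonce, plaintext, None)

# the receiving peer, holding the same key, recovers the summary
recovered = json.loads(aesgcm.decrypt(nonce, ciphertext, None))
print(recovered == cluster_summary)          # True
```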
For a given set of documents we construct a distributed hash table: for every document, the DHT entries contain its tokens (terms) and their keys. These tables are then referenced during the subsequent clustering process.
The second step is measuring the similarity between the nodes in the network, for which we use cosine similarity. In this calculation we consider only the shared properties between the edges. The reason for choosing the cosine similarity measure is explained below.
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of −1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].

Note that these bounds apply for any number of dimensions, and cosine similarity is most commonly used in high-dimensional positive spaces. In information retrieval and text mining, each term is notionally assigned a different dimension, and a document is characterized by a vector in which the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. The technique is also used to measure cohesion within clusters in the field of data mining.
Cosine distance is a term often used for the complement in positive space, that is, Dc(A, B) = 1 − Sc(A, B). It is important to note that this is not a proper distance metric, since it does not satisfy the triangle inequality and it violates the coincidence axiom; to repair the triangle inequality while maintaining the same ordering, it is necessary to convert to angular distance. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, since only the non-zero dimensions need to be considered.
In our case the actual score is the cosine similarity between a document and a cluster centroid, defined as

cos(d, c) = ( Σ_t W(t, d) · W(t, c) ) / ( |d| · |c| )

where |d| and |c| are the Euclidean norms of the document and centroid vectors. Note that the similarity for a new document is computed against the cluster centroids, and the cluster summary generated above is used in calculating this similarity measure.
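As an illustrative sketch (not the authors' code), the following assigns a new document to the most similar cluster centroid using cosine similarity over sparse term-weight vectors; here simple averaged term-weight dictionaries stand in for the cluster summaries.

```python
# Illustrative sketch: assign a new document to the nearest cluster centroid
# by cosine similarity over sparse {term: weight} vectors.
import math

def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

centroids = {
    "c1": {"cluster": 0.9, "summary": 0.4, "peer": 0.3},
    "c2": {"encrypt": 0.8, "aes": 0.7, "key": 0.5},
}
new_doc = {"cluster": 0.7, "peer": 0.2, "document": 0.4}

best = max(centroids, key=lambda c: cosine(new_doc, centroids[c]))
print(best)   # the new document is assigned to cluster "c1"
```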
Experimental Analysis:
We tested the method on two systems with 2 peers, with documents available at both peers. The results are shown below.
The graph below shows the efficiency of our work; we compare the performance complexity, security, and efficiency of the existing and proposed methods.

[Figure: Bar chart (scale 0–1.2) comparing the proposed work against the traditional method.]
In this scheme, every node maintains the cluster summary, and the cluster holder is likewise maintained at every node. For a new document, the analysis described above is performed.
IV. CONCLUSION
In our proposed work we designed a method for clustering text in distributed systems. In this algorithm we introduced a cryptographic algorithm to encrypt the documents, thereby providing security for them. The method is also helpful in real-time applications, and it works efficiently by reducing the resources and processing required. Compared to the traditional process, text clustering in distributed systems becomes faster.
REFERENCES
[1] Y. Ioannidis, D. Maier, S. Abiteboul, P. Buneman, S. Davidson, E. Fox, A. Halevy, C. Knoblock, F. Rabitti, H. Schek, and G. Weikum, "Digital library information-technology infrastructures," Int. J. Digit. Libr., vol. 5, no. 4, pp. 266–274, 2005.
[2] P. Cudré-Mauroux, S. Agarwal, and K. Aberer, "GridVine: An infrastructure for peer information management," IEEE Internet Computing, vol. 11, no. 5, 2007.
[3] J. Lu and J. Callan, "Content-based retrieval in hybrid peer-to-peer networks," in CIKM, 2003.
[4] J. Xu and W. B. Croft, "Cluster-based language models for distributed retrieval," in SIGIR, 1999.
[5] O. Papapetrou, W. Siberski, and W. Nejdl, "PCIR: Combining DHTs and peer clusters for efficient full-text P2P indexing," Computer Networks, vol. 54, no. 12, pp. 2019–2040, 2010.
[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed K-Means clustering over a peer-to-peer network," IEEE TKDE, vol. 21, no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents by distributed P2P clustering," in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P overlays," in ICPP, 2003.
[10] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications," in SIGCOMM, 2001.
[11] K. Aberer, P. Cudré-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, and R. Schmidt, "P-Grid: a self-organizing structured P2P system," SIGMOD Record, vol. 32, no. 3, pp. 29–33, 2003.
[12] A. I. T. Rowstron and P. Druschel, "Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems," in IFIP/ACM Middleware, Germany, 2001.
[13] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[14] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[15] G. Forman and B. Zhang, "Distributed data clustering can be efficient and exact," SIGKDD Explor. Newsl., vol. 2, no. 2, pp. 34–38, 2000.
[16] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta, "Distributed data mining in peer-to-peer networks," IEEE Internet Computing, vol. 10, no. 4, pp. 18–26, 2006.
[17] S. Datta, C. Giannella, and H. Kargupta, "K-Means clustering over a large, dynamic network," in SDM, 2006.
[18] G. Koloniari and E. Pitoura, "A recall-based cluster formation game in P2P systems," PVLDB, vol. 2, no. 1, pp. 455–466, 2009.
[19] K. M. Hammouda and M. S. Kamel, "Distributed collaborative web document clustering using cluster keyphrase summaries," Information Fusion, vol. 9, no. 4, pp. 465–480, 2008.
[20] http://www.facweb.iitkgp.ernet.in/~sourav/AES.pdf
[21] ES-MPICH2: A Message Passing Interface with Enhanced Security.