International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 7- Dec 2013
A Novel Process to Cluster Data in Distributed Resources

Chiranjeevi Jami*, Chanti Suragala#
*Final M.Tech Student, #Assistant Professor
Dept. of CSE, SISTAM College, Srikakulam, Andhra Pradesh
Abstract:- Searching is one of the most frequently performed tasks for gathering or browsing information on the web, and users search across many different resources. Because of the scale of these resources, the search process takes considerable time. Grouping similar data reduces this processing time, and clustering makes such grouping possible. We therefore introduce a novel approach for grouping the data held in distributed resources, so that similar data can be grouped quickly and, ultimately, search time is reduced.
I. INTRODUCTION
Information retrieval is the central task in data exchange and search, and for it the data must be concise and classified. For fast searching, similar data has to be grouped into fine-grained clusters. Clustering is well suited to browsing and searching, and comes in two broad types: keyword clustering and document clustering.
Document clustering is the process of grouping text documents. Keywords are extracted from the documents and the similarities between documents and clusters are computed; for better results, the text documents can be pre-processed by removing unnecessary keywords.
Traditionally, document clustering is performed on a centralized system only, but carrying out the grouping and the associated calculations on a single centralized server becomes problematic as the data grows. Some researchers first applied this clustering process to web pages and later tested it on text documents. It is mainly used in systems and organizations that maintain large amounts of data.
Clustering generally works on tokens, i.e. the keywords of the documents, and on document weights. For fast processing, and to reduce the complexity of the computation, only keywords are considered: the unique words present in the document, excluding the grammar (stop) words of the language.
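As an illustration of this kind of keyword extraction, the sketch below (our own simplification, not part of the paper's system) keeps only the unique tokens of a document after removing a small stop-word list; both the tokenizer and the stop-word list are assumptions.

import re

# A deliberately small stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "or", "to"}

def extract_keywords(text):
    """Return the unique non-stop-word tokens of a document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

# Example: grammar words such as "is", "the" and "of" are dropped.
print(extract_keywords("Clustering is the grouping of similar text documents."))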
In our work, we introduce a new framework for fast clustering in distributed systems. Section II briefly reviews existing methodologies and Section III explains our proposed work.
II. RELATED WORK
Mining data over distributed networks has gained importance in recent years because of the various issues that arise when mining distributed information, whether by clustering, classification, association rule mining or any other mining mechanism. Here we propose an empirical model of distributed clustering for efficient document clustering in distributed networks: we cluster the documents based on document similarity and group together documents that are semantically equivalent.
One approach to data partitioning takes a conceptual point of view. Specifically, probabilistic models assume that the data comes from a mixture of several populations whose distributions and priors we want to find; the corresponding algorithms are described in the literature on probabilistic clustering. The advantage of probabilistic methods is the interpretability of the constructed clusters, and their concise cluster representation allows inexpensive computation of intra-cluster measures of fit that give rise to a global objective function, each cluster being associated with a model whose unknown parameters have to be estimated.
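To make the mixture-model view concrete, here is a minimal sketch using scikit-learn's GaussianMixture; the choice of Gaussian components and of scikit-learn is our own assumption, since the paper does not prescribe a particular probabilistic method.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from a mixture of two populations.
data = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                  rng.normal(5.0, 1.0, (100, 2))])

# EM estimates the distribution parameters and the mixing priors.
gm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gm.weights_)          # estimated priors of the two populations
print(gm.means_)            # estimated population means
labels = gm.predict(data)   # hard assignment of each point to a population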
Similarity calculation is the main part of our proposed work; we use cosine similarity, explained in Section III, and the keywords or tokens are clustered until a stopping criterion is reached. A key limitation of k-means is its cluster model: the concept is based on spherical clusters that are separable in such a way that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so
that the assignment to the nearest cluster center is the correct assignment. When, for example, k-means with k=3 is applied to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k=2 the two visible clusters (one containing two species) are discovered, whereas with k=3 one of the two clusters is split into two even parts; in fact, k=2 is more appropriate for this data set, despite it containing three classes. As with any other clustering algorithm, the k-means result depends on the data set satisfying the assumptions the algorithm makes: it works well on some data sets and fails on others.
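This behaviour can be checked with a short scikit-learn sketch (ours, purely for illustration; the paper itself does not use scikit-learn):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # The silhouette score tends to favour k=2 on this data set, reflecting the
    # fact that two of the three species are not well separated.
    print(k, silhouette_score(X, labels))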
III. PROPOSED WORK
In our work we initially construct a distributed hash table for the documents, holding the terms (keywords or tokens) and the location of each term within its document. The purpose of this distributed hash table is to act as a summary of the documents that can be consulted during the clustering process.
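As a rough illustration of such a table, the sketch below (a local, single-node simplification; the helper name build_term_table is ours) records, for every term, the documents it occurs in and its positions within them. In a real deployment the map would be partitioned across nodes by hashing the term.

from collections import defaultdict

def build_term_table(documents):
    """Map each term to {document id: [positions of the term in that document]}."""
    table = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, term in enumerate(text.lower().split()):
            table[term][doc_id].append(pos)
    return table

docs = {"d1": "grid data clustering", "d2": "distributed data mining"}
table = build_term_table(docs)
print(dict(table["data"]))   # {'d1': [1], 'd2': [1]}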
Then we calculate the weight of each individual text document. Here we introduce two new concepts, the cluster holder and the cluster summary. The cluster holder contains all the requirements of the documents. The cluster summary maintains the total gist of the documents: the keywords, the keyword count in each document, the document (and the node of the distributed system) in which each keyword appears, the document weights, and the similarity between documents and clusters.
Gist of the document: the keywords and their frequencies in the document.
Document weight: the total of the terms present in the document, calculated in the form of a probability.
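The paper does not give an explicit formula for the document weight, so the sketch below takes the simplest reading: the gist is the term-frequency map of a document, and the weight of each term is that frequency normalised into a probability.

from collections import Counter

def gist(tokens):
    """Gist of a document: each keyword with its frequency."""
    return Counter(tokens)

def document_weight(tokens):
    """Term frequencies normalised into probabilities (assumed interpretation)."""
    counts = gist(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

print(document_weight(["data", "clustering", "data"]))
# e.g. {'data': 0.666..., 'clustering': 0.333...}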
Cosine similarity is the distance measure we use. In our case the actual score is the cosine similarity between a document d and a cluster centroid c, defined as

Cos(d, c) = \frac{\sum_{i} d_i \, c_i}{\lVert d \rVert \, \lVert c \rVert}
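A direct sketch of this measure, with the document and the centroid represented as sparse term-weight dictionaries (our own representation, not mandated by the paper):

import math

def cosine(d, c):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * c.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_c = math.sqrt(sum(w * w for w in c.values()))
    if norm_d == 0.0 or norm_c == 0.0:
        return 0.0
    return dot / (norm_d * norm_c)

doc = {"data": 0.67, "clustering": 0.33}
centroid = {"data": 0.5, "mining": 0.5}
print(cosine(doc, centroid))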
Next comes the clustering itself. Consider two nodes that hold some documents; on these documents we perform an agglomerative hierarchical clustering algorithm, shown as follows:

1. For each node, read the input documents.
2. Find the frequencies of the terms in each document, represented as td.
3. Find the document weight in each node, represented as Dw. Every node maintains a cluster holder; the cluster holder is the node that contains the overall gist of the documents, and each node can request the cluster holder of another node in order to cluster the documents in the distributed system.
4. Construct the distributed hash table, represented as DHn.
5. Using the hash table, the term frequencies, and the document weights, perform the clustering process and form the cluster holders in each node as P = {p1, p2, ...}, where P is the set of cluster holders in the network.
6. Maintain the cluster summary in each node of the network.
7. Order the clusters in ascending order with respect to the document weights.
8. If a new document arrives in the distributed system, compare it against the existing clusters using the similarity between the terms and the documents, the cluster summaries, and the document weights, following the steps above.

By using this clustering algorithm we can cluster data in the distributed system very easily, and the performance of the searching process also increases. The method is designed mainly for distributed networks, to reduce the load on any single system, and it gives good processing time when tested in simulation.

Experimental Analysis:
In each node, after the input documents have been read, the documents are clustered and a cluster summary is maintained, including the total number of keywords with their frequencies in the clusters present in each node. The cluster holder is also maintained in the node. After clustering, the method reports which cluster, in which node, a new document belongs to; for a new document, the calculations and the assignment follow the steps above.
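A minimal sketch of that assignment (step 8), assuming each cluster summary carries the node holding the cluster and a centroid of term weights; the field names are our own:

import math

def cosine(d, c):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * c.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    nc = math.sqrt(sum(w * w for w in c.values()))
    return dot / (nd * nc) if nd and nc else 0.0

def assign_new_document(doc_weights, cluster_summaries):
    """Pick the cluster whose centroid is most similar to the new document."""
    cluster_id, summary = max(
        cluster_summaries.items(),
        key=lambda item: cosine(doc_weights, item[1]["centroid"]))
    return cluster_id, summary["node"]

summaries = {
    "c1": {"node": "node-1", "centroid": {"data": 0.6, "mining": 0.4}},
    "c2": {"node": "node-2", "centroid": {"image": 0.7, "pixel": 0.3}},
}
print(assign_new_document({"data": 0.5, "clustering": 0.5}, summaries))
# ('c1', 'node-1')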
IV. CONCLUSION
We have designed a method for clustering text in distributed systems. By reducing the complexity of the calculations, our work is useful for searching applications, reduces the workload on individual resources, and processes the data efficiently. Compared to the traditional process, text clustering in distributed systems is faster with our approach.
BIOGRAPHIES

Chiranjeevi Jami is a student of M.Tech (SE) at Sarada Institute of Science, Technology and Management, Srikakulam. He received his B.Tech (IT) from Aditya Institute of Technology and Management (AITAM), Tekkali. His areas of interest are data warehousing, Java, and the Oracle database.

Chanti Suragala is working as an Assistant Professor at Sarada Institute of Science, Technology and Management, Srikakulam, Andhra Pradesh. He received his M.Tech (CSE) from Aditya Institute of Technology and Management, Tekkali, JNTU Kakinada, Andhra Pradesh. His research areas include image processing.