International Journal of Engineering Trends and Technology (IJETT) – Volume 16 Number 7 – Oct 2014
An Efficient High Dimensional Data Clustering over
Decentralized Architecture
S. Gayatri1, S. Kalyani2
1 Final M.Tech Student, Dept. of CSE, Vignan's Institute of Engineering for Women, Andhra Pradesh
2 Assistant Professor, Information Technology, Vignan's Institute of Engineering for Women, Andhra Pradesh
Abstract: Clustering is one of the most traditional data mining techniques for knowledge extraction from mass data storages and high dimensional datasets. Over the years of research it has been observed that unpreprocessed data (articles, prepositions, conjunctions, adverbs, etc.) may lead to irrelevant clusters. In this paper we propose feature set extraction through preprocessing of the input documents; document weights are computed based on term frequency and inverse document frequency, and a mutation-based centroid is computed at each iteration, except the initial iteration, during the clustering of documents.
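As one concrete reading of the weighting scheme above, the following minimal Python sketch computes per-document term weights from term frequency and inverse document frequency after removing stop words. The tokenizer, the stop-word list, and the function names are our own illustrative assumptions, not the paper's implementation.

```python
import math

# Hypothetical stop-word list: the "unpreprocessed" features the paper
# mentions (articles, prepositions, conjunctions, etc.)
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on"}

def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_weights(documents):
    """Return one {term: tf * idf} map per document."""
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = []
    for tokens in tokenized:
        tf = {}
        for term in tokens:
            tf[term] = tf.get(term, 0) + 1
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
w = tf_idf_weights(docs)
```

Terms that occur in every document get weight zero (log of 1), so only discriminative features carry weight into clustering.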
I. INTRODUCTION
In recent years, personalized information services have come to play an important role in people's lives. There are two important problems worth researching in this field. One is how to obtain and describe personal user information, i.e., building the user model; the other is how to organize the information resources, i.e., document clustering. Personal information is described exactly only if user behavior and the resources users look for or search have been accurately analyzed. The effectiveness of a personalized service depends on the completeness and accuracy of the user model. The basic operation is organizing the information resources, so in this paper we focus on document clustering.
At present, millions of scientific documents are available on the Web. Indexing or searching millions of documents and retrieving the desired information has become an increasing challenge, and an opportunity, with the rapid growth of scientific documents. Clustering plays an important role in the analysis of user interests in the user model, so high-quality scientific document clustering plays an increasingly important role in real world applications such as personalized services and recommendation systems.
ISSN: 2231-5381                http://www.ijettjournal.org
Clustering is a classical method in data
mining research. Clustering algorithms group a set of
documents into subsets or clusters. The algorithms’
goal is to create clusters that are coherent internally,
but clearly different from each other. In other words,
documents within a cluster should be as similar as
possible; and documents in one cluster should be as
dissimilar as possible from documents in other
clusters.
II. RELATED WORK
Flat clustering creates a flat set of clusters
without any explicit structure that would relate
clusters to each other. Hierarchical clustering creates
a hierarchy of clusters.
A second important distinction can be made between hard and soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft: a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters.
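The hard/soft distinction above can be made concrete with a small sketch (the document names, cluster count, and membership numbers are invented for illustration):

```python
# Hard assignment: each document belongs to exactly one cluster index.
hard_assignment = {"doc1": 0, "doc2": 2, "doc3": 0}

# Soft assignment: each document has fractional membership in several
# clusters; the memberships form a distribution over all clusters.
soft_assignment = {
    "doc1": [0.7, 0.2, 0.1],
    "doc2": [0.1, 0.1, 0.8],
    "doc3": [0.5, 0.4, 0.1],
}

# A valid soft assignment is a distribution: memberships sum to 1.
for doc, dist in soft_assignment.items():
    assert abs(sum(dist) - 1.0) < 1e-9

# A hard assignment can be recovered from a soft one by taking the
# most probable cluster for each document.
hardened = {doc: dist.index(max(dist)) for doc, dist in soft_assignment.items()}
```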
The cluster hypothesis states the fundamental
assumption we make when using clustering in
information retrieval. Documents in the same cluster
behave similarly with respect to relevance to
information needs. The hypothesis states that if there
is a document from a cluster that is relevant to a
search request, then it is likely that other documents
from the same cluster are also relevant. This is
because clustering puts together documents that share
many terms. The cluster hypothesis essentially is the contiguity hypothesis.
In search result clustering, by search results we mean the documents that were returned in response to a query. The default
presentation of search results in information retrieval
is a simple list. Users scan the list from top to bottom
until they have found the information they are looking
for. Instead, search result clustering clusters the
search results, so that similar documents appear
together. It is often easier to scan a few coherent
groups than many individual documents. This is
particularly useful if a search term has different word
senses.
III. PROPOSED SYSTEM
In this paper we propose an efficient and secure data mining technique, with an optimized k-means and a cryptographic approach, for clustering similar types of information. Initially the data points, which reside at the individual data holders or players, need to be shared. In this paper we emphasize the mining approach rather than the cryptographic technique; for the secure transmission of data, various cryptographic algorithms and key exchange protocols are available.
Individual peers at the data holders initially preprocess the raw data by eliminating the unnecessary features from the datasets. After preprocessing, each peer computes the file weight of the dataset (the preprocessed feature set) in terms of term frequency and inverse document frequency, and computes the file relevance matrix to reduce the time complexity while clustering the datasets. We use the most widely used similarity measurement, i.e., cosine similarity:
Cos(dm, dn) = (dm . dn) / (||dm|| * ||dn||)
where dm is the centroid (document weight) and dn is the document weight, or file weight, from the relevance matrix.
The following example shows a simple way to retrieve the similarity between documents at the individual data holders by computing cosine similarity prior to clustering. In the table below, D (d1, d2, ..., dn) represents the set of documents at a data holder or player and their respective cosine similarities; precomputing these reduces the time complexity of computing the similarity between the centroids and the documents while clustering.
        d1      d2      d3      d4      d5
d1      1.0     0.48    0.66    0.89    0.45
d2      0.77    1.0     0.88    0.67    0.88
d3      0.45    0.9     1.0     0.67    0.34
d4      0.32    0.47    0.77    1.0     0.34
d5      0.67    0.55    0.79    0.89    1.0
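A table such as the one above can be produced by precomputing all pairwise cosine similarities once, before clustering begins, so that each later lookup is constant time. A minimal Python sketch with invented term-weight vectors (the function names are ours, not the paper's):

```python
import math

def cosine(dm, dn):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(dm, dn))
    norm_m = math.sqrt(sum(a * a for a in dm))
    norm_n = math.sqrt(sum(b * b for b in dn))
    return dot / (norm_m * norm_n)

def relevance_matrix(vectors):
    """Precompute all pairwise cosine similarities before clustering."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Invented document weight vectors for illustration.
docs = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # parallel to the first document
    [0.0, 0.0, 3.0],   # orthogonal to the first two
]
R = relevance_matrix(docs)
# During clustering, similarity lookups are now O(1): R[i][j]
```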
In our approach we enhance the K-Means algorithm with re-centroid computation instead of a single random selection at every iteration. The following shows the optimized k-means algorithm.
Algorithm:
1: Select K points as initial centroids for the initial iteration
2: Until the termination condition is met (a user-specified maximum number of iterations):
3: get_relevance(dm, dn), where dm is the document M file weight from the relevance matrix and dn is the document N file weight from the relevance matrix
4: Assign each point to its closest centroid to form K clusters
5: Recompute the centroid from the intra-cluster data points, i.e., the average of any k data points in the individual cluster, e.g. (P11 + P12 + ... + P1k) / K, where all points are from the same cluster
6: Compute the new centroid for the merged cluster
The traditional k-means algorithm randomly selects a new centroid; in our approach we enhance this by the prior construction of the relevance matrix and by taking the average of k randomly selected document weights for the new centroid calculation.
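The enhanced loop described above can be sketched as follows: the new centroid is recomputed as the average of the points currently assigned to the cluster rather than a fresh random pick. This is a simplified one-dimensional version with invented data; the function name and the fixed-iteration termination condition are our own assumptions:

```python
import random

def optimized_kmeans(weights, k, max_iterations=10, seed=0):
    """Cluster scalar document weights; recompute each centroid as the
    average of its cluster's points instead of a new random selection."""
    rng = random.Random(seed)
    centroids = rng.sample(weights, k)          # step 1: initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):             # step 2: termination condition
        clusters = [[] for _ in range(k)]
        for w in weights:                       # steps 3-4: assign to closest
            closest = min(range(k), key=lambda i: abs(w - centroids[i]))
            clusters[closest].append(w)
        for i in range(k):                      # step 5: centroid = average
            if clusters[i]:
                centroids[i] = sum(clusters[i]) / len(clusters[i])
    return centroids, clusters

weights = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9]
centroids, clusters = optimized_kmeans(weights, k=2)
```

Averaging makes each centroid a summary of its whole cluster, which is what makes the recomputation stable compared with picking a single random member.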
IV. CONCLUSION
We conclude our current research work with efficient privacy-preserving data clustering over distributed networks. The quality of the clustering mechanism is enhanced with preprocessing, the relevance matrix, and the centroid computation in the k-means algorithm, while the cryptographic technique handles the secure transmission of data between data holders or players and reduces the privacy-preserving cost by forwarding the relevant features of the dataset instead of the raw datasets. Security can be further enhanced by establishing an efficient key exchange protocol and cryptographic techniques for the transmission of data between data holders or players.
REFERENCES
[1] F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and H. (Wendy) Wang, "Privacy-preserving mining of association rules from outsourced transaction databases."
[2] P. K. Fong and J. H. Weber-Jahnke, "Privacy preserving decision tree learning using unrealized data sets," IEEE Computer Society.
[3] T. Tassa and D. J. Cohen, "Anonymization of centralized and distributed social networks by sequential clustering."
[4] S. Jha, L. Kruger, and P. McDaniel, "Privacy preserving clustering."
[5] C. Clifton, M. Kantarcioglu, X. Lin, and M. Y. Zhu, "Tools for privacy preserving distributed data mining."
[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed K-Means clustering over a peer-to-peer network," IEEE TKDE, vol. 21, no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents by distributed P2P clustering," in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P overlays," in ICPP, 2003.
[10] S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving clustering by data transformation."
BIOGRAPHIES
S. Gayatri completed her B.Tech at Gayatri Engg. for Women. She is pursuing her M.Tech at Vignan Engg. College for Women, AP, India. Her areas of interest are data mining and network security.
S. Kalyani is working as an Assistant Professor in Information Technology, Vignan's Institute of Engineering for Women, Andhra Pradesh. Her areas of interest are data mining and network security.