International Journal of Engineering Trends and Technology (IJETT) – Volume 16 Number 1 – Oct 2014
An Empirical Model of Secure Clustering Technique
over Decentralized Architecture
Y.V.Siva Prasad1, P.Rajasekhar2
Final M.Tech Student1, Assistant Professor2
1,2 Dept. of CSE, Avanthi Institute of Engineering & Technology, Makavarapallem, A.P.
Abstract: Privacy preservation in data mining over distributed
networks remains an important research issue in the field of
knowledge and data engineering. In community-based clustering
approaches, privacy is a key concern when datasets are integrated
from different data holders or players for mining, and secure
mining of data is required over open networks. In this paper
we propose an efficient privacy-preserving data clustering
technique for distributed networks.
I. INTRODUCTION
In data mining, many databases are connected to each
other over large communication networks. A key problem
here is the privacy of these databases, which gives rise to
the concept of privacy-preserving data mining. Much
research has addressed securing data from intruders in the
network, since many datasets are also shared across it.
Previous research has introduced many solutions to
this problem, including decision-tree-based and
clustering-based classification. All of these solutions
depend on knowledge to secure the private datasets, and
different approaches allow data owners to integrate with
distributed databases. In classification, tree-based methods
have been the most common in prior work.
From a privacy point of view, the idle databases are
deeply verified. In the initial phases, databases in
distributed networks are very hard to construct and manage.
Earlier approaches applied transformations to the dataset
before the algorithms were run; another approach uses
secure multiparty computation.
Introducing data mining concepts into distributed
networks brings many advantages. In such environments it
is very hard to gather all information at a central site, yet
users can flexibly download data from a centralized
location to carry out data mining operations. Distributed
mining provides various methods that address the
limitations of centralized data mining, and it provides the
computing and communication needed to increase the
gains from domain data in applications.
Data analysis plays a main role in the next generation
of data-driven problem solving. In peer-to-peer (P2P) or
distributed networks there are many challenges: scalability,
efficiency, frequency of users, and decentralization of data.
Regarding scalability, as the number of users grows, the
data stored in the network increases, and maintaining the
resources to be shared and exchanged becomes a threat.
Efficiency is a main challenge because increasing the
number of users and resources increases the load on, and
the work done by, the algorithms.
Decentralization: although some peer-to-peer
systems still use central servers for various purposes, this
approach is considered unsuitable for most tasks,
especially those involving large amounts of data. The
algorithms need to be designed to transfer minimal data
and use no centralized coordination.
For a real-world P2P system, data at the peers
often changes, so the algorithms have to work
incrementally. This is especially true for sensor networks,
where the data is generally considered to be streaming.
Any algorithm that needs to start from scratch whenever
the data changes is thus inappropriate for our goals, so
incremental methods are required. Since the rate of data
change may be higher than the rate of computation in some
applications, the methods must be able to report an
incomplete solution at any time. These are called anytime
algorithms.
Distributed systems can be very large, so any
attempt to synchronize across the entire network is likely
to fail due to connection latency or, in the case of sensor
networks, limited bandwidth. Thus, any algorithm
developed for a peer-to-peer system should not take the
route of global synchronization.
II. RELATED WORK
Clustering is the method of grouping a set of items in such
a way that items in the same group are more similar to each
other than to those in other groups. It is a key task of data
mining and a general method for statistical data processing,
utilized in many fields including machine learning and
pattern recognition.
Clustering is not a single particular algorithm but a basic
task to be solved. It can be solved by different algorithms
that differ significantly in their notion of what constitutes a
cluster and how to find clusters efficiently. The proper
clustering algorithm and parameter configuration depend
on the individual data set and the intended use of the
results. Clustering as such is not an automatic task but a
cyclic process of interactive multi-objective optimization
involving trial and failure, and careful data preprocessing
is essential.
Let us assume that the data is one-dimensional
(a single attribute y) and is partitioned across s sites. Site l
has n_l data items, with n = n_1 + ... + n_s. Let z_ij^(t)
denote the membership of the jth data point in the ith
cluster at the (t)th EM round. In the M step, mu_i (the mean
of cluster i), sigma_i^2 (the variance of cluster i) and pi_i
(the estimate of the proportion of items in cluster i) are
computed using sums of the form

mu_i^(t) = [ SUM_{l=1..s} SUM_{j=1..n_l} z_ij^(t) * y_lj ] / [ SUM_{l=1..s} SUM_{j=1..n_l} z_ij^(t) ],

with sigma_i^2 and pi_i obtained from analogous global
sums. The inner summation in all these cases is local to
every site, and it is easy to see that sharing this local value
does not reveal the individual y values to the other sites. It
is also not necessary to share n_l or the inner summation
values directly; n and the global summations above can
instead be computed using the secure sum technique
described earlier.
In the E step, the z values can be partitioned and
computed locally given the global mu_i, sigma_i^2 and
pi_i. This step therefore does not involve any data sharing
across sites.
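The site-local/global split above can be made concrete with a small sketch. All data, the SCALE constant, and the masking-based secure_sum helper below are illustrative assumptions, not the paper's implementation: each site contributes only its local sums, and a toy additive-mask secure sum combines them into the global weighted mean for one cluster.

```python
import random

# Hypothetical per-site data: each site holds local points y and
# soft memberships z for a single cluster.
sites = [
    {"y": [1.0, 2.0], "z": [0.9, 0.2]},
    {"y": [1.5, 3.0, 2.5], "z": [0.8, 0.1, 0.5]},
]

def secure_sum(values, modulus=10**9):
    """Toy additive-mask secure sum: a random mask hides the running
    total and is removed at the end. A real protocol would pass the
    masked total around the ring of sites."""
    mask = random.randrange(modulus)
    running = mask
    for v in values:
        running = (running + v) % modulus
    return (running - mask) % modulus

# Each site shares only its local sums, scaled to integers for the ring.
SCALE = 10**6
num = secure_sum(round(SCALE * sum(zj * yj for zj, yj in zip(s["z"], s["y"])))
                 for s in sites)
den = secure_sum(round(SCALE * sum(s["z"])) for s in sites)
mu = num / den  # global weighted mean for the cluster
print(round(mu, 4))
```

Only the two masked totals ever leave a site; the individual y values stay local, matching the observation above.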
Many terms with similar meanings are used for
clustering, including automatic classification and
numerical taxonomy. The subtle differences are often in
the use of the results: in data mining the resulting groups
themselves are the matter of interest, while in automatic
classification the resulting discriminative power is of
interest. This often leads to misunderstandings between
researchers coming from the fields of data mining and
machine learning, since they use the same terms and often
the same algorithms but have different goals.
The scenario described is one where two parties with
databases D1 and D2 wish to apply the decision tree
algorithm to the joint database D1 U D2 without revealing
any unnecessary information about their own databases.
The technique described uses secure multiparty
computation under the semi-honest adversary model and
attempts to reduce the number of bits communicated
between the two parties.
The traditional ID3 algorithm computes a decision
tree by choosing, at each tree level, the best attribute to
split on at that level and thus partition the data. Tree
building is complete when the data is uniquely partitioned
into a single class value or there are no attributes left to
split on. The selection of the best attribute uses information
gain: the attribute that minimizes the entropy of the
partitions, and thus maximizes the information gain, is
selected.
In the PPDM scenario, the information gain for
every attribute has to be computed jointly over all the
database instances without divulging individual site data.
It can be shown that this problem reduces to privately
computing x ln x in a protocol that receives x1 and x2 as
input, where x1 + x2 = x.
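To see why x ln x suffices, note that the entropy of a jointly held partition can be written entirely in terms of x ln x over the combined counts x = x1 + x2. The sketch below uses hypothetical class counts held by two sites; it computes the entropy in the clear, whereas the private protocol would evaluate each xlnx term on shares.

```python
from math import log

# Hypothetical class counts for one candidate partition, held by two
# sites: x1[c] and x2[c] are site-local counts of class c.
x1 = {"yes": 30, "no": 10}
x2 = {"yes": 20, "no": 40}

def xlnx(x):
    return x * log(x) if x > 0 else 0.0

# H = ln N - (1/N) * sum_c (x_c ln x_c), with x_c = x1[c] + x2[c],
# so only x ln x over combined counts is ever needed.
total = sum(x1.values()) + sum(x2.values())
entropy = log(total) - sum(xlnx(x1[c] + x2[c]) for c in x1) / total
print(round(entropy, 4))
```

Here the combined counts are 50/50, so the entropy is ln 2.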
The basic frequency mining problem can be
described as follows. There are n customers, and each
customer i has a Boolean value d_i. The problem is to find
out the total number of 1s and 0s, i.e. to compute the sum
d = d_1 + ... + d_n, without learning the individual
customer values. We cannot use the secure sum protocol
because of the following restrictions:
• Each customer can send only one flow of
communication to the miner, with no further interaction.
• The customers never communicate between
themselves.
The technique presented in [8] uses the additively
homomorphic property of a variant of ElGamal
encryption. It can be described as follows.
Let G be a group in which the discrete logarithm
problem is hard, and let g be a generator of G. Each
customer i has two private/public key pairs, (x_i, g^x_i)
and (y_i, g^y_i). The aggregate keys X = g^(x_1 + ... + x_n)
and Y = g^(y_1 + ... + y_n), along with G and the generator
g, are known to everyone. Each customer sends the miner
the two values m_i = g^d_i * X^y_i and h_i = Y^x_i. The
miner computes r = (PROD_i m_i) / (PROD_i h_i) = g^d.
The value of d for which g^d = r represents the sum
d = SUM_i d_i; since 0 <= d <= n, it is easy to find by
encrypt-and-compare. It can also be shown that, assuming
all keys are distributed properly before the protocol starts,
this frequency mining protocol protects each honest
customer's privacy against the miner and up to (n - 2)
corrupted customers.
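A toy sketch of this one-round protocol follows. The tiny prime p and base g are illustrative only (a real deployment needs a cryptographically large group), and the key generation is simulated in one process; the structure matches the description above: one message (m_i, h_i) per customer, then encrypt-and-compare at the miner.

```python
import random

p = 1000003  # small prime for illustration only
g = 2        # base used as the group generator in this sketch

n = 5
d = [1, 0, 1, 1, 0]                      # each customer's private Boolean value
x = [random.randrange(1, p - 1) for _ in range(n)]
y = [random.randrange(1, p - 1) for _ in range(n)]

X = pow(g, sum(x), p)                    # aggregate key g^(sum x_i)
Y = pow(g, sum(y), p)                    # aggregate key g^(sum y_i)

# Each customer i sends a single message (m_i, h_i) to the miner.
m = [pow(g, d[i], p) * pow(X, y[i], p) % p for i in range(n)]
h = [pow(Y, x[i], p) for i in range(n)]

# Miner: r = (prod m_i) / (prod h_i) = g^d, since X^(sum y) = Y^(sum x).
num, den = 1, 1
for i in range(n):
    num = num * m[i] % p
    den = den * h[i] % p
r = num * pow(den, -1, p) % p

# Encrypt-and-compare: try d = 0..n until g^d matches r.
total = next(s for s in range(n + 1) if pow(g, s, p) == r)
print(total)  # 3
```

The masking terms cancel in aggregate because X^(y_1+...+y_n) = g^(sum x_i)(sum y_i) = Y^(x_1+...+x_n), which is exactly the additive homomorphism being exploited.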
III. PROPOSED SYSTEM
We propose an empirical approach for secure clustering
over distributed networks with an enhanced k-means
algorithm and a cryptographic mechanism. The clustering
approach groups similar data objects from a large set of
data points, while the cryptographic algorithm maintains
privacy and data confidentiality during transmission over
the network.
Every individual node can be treated as a data holder or
player in the distributed network. The data holder initially
preprocesses each document by removing inconsistent
features from the data objects; after preprocessing, a
document weight or file relevance score is computed to
convert the categorical object into a numerical value. The
document weight is computed in terms of term frequency
(the number of occurrences of a keyword in a document)
and inverse document frequency (based on the number of
documents in which the keyword occurs).
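A toy sketch of this document-weight computation follows; the docs corpus is hypothetical sample data, and the classic tf-idf form shown is one reasonable reading of the description above.

```python
from math import log

# Hypothetical corpus: each document reduced to its keyword list
# after preprocessing (inconsistent features removed).
docs = {
    "d1": ["cluster", "privacy", "cluster"],
    "d2": ["privacy", "network"],
    "d3": ["cluster", "network", "mining"],
}

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)                       # occurrences in this document
    df = sum(term in words for words in docs.values())  # documents containing the term
    return tf * log(len(docs) / df)                     # numeric document weight

print(round(tf_idf("cluster", "d1"), 4))
```

The resulting scores turn each categorical document into the numeric vector used for similarity computation below.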
The cosine similarity matrix is constructed over document
weights with the following similarity measure;
pre-construction of the matrix reduces time complexity and
improves performance. We use the most widely used
similarity measurement, cosine similarity:

Cos(dm, dn) = (dm . dn) / (||dm|| * ||dn||)

where dm is a centroid (document weight) and dn is the
document weight or file weight of a data object. The cosine
similarity of a pair of documents can then be retrieved
efficiently by a simple intersection of the row and column
of the respective document or object ids. A sample cosine
similarity matrix is as follows:

       d1     d2     d3     d4     d5
d1    1.0    0.77   0.45   0.32   0.67
d2    0.48   1.0    0.9    0.47   0.55
d3    0.66   0.88   1.0    0.77   0.79
d4    0.89   0.67   0.67   1.0    0.89
d5    0.45   0.88   0.34   0.34   1.0

Fig1: Cosine Similarity Matrix
In the above cosine similarity matrix, D = (d1, d2, ..., dn)
represents the set of documents at a data holder or player,
together with the respective cosine similarities between the
data objects. Transmitting preprocessed documents instead
of raw documents minimizes the time and space
complexity of computing the similarity between centroids
and documents during clustering.
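The matrix construction can be sketched as follows; the vectors dictionary is hypothetical sample data (e.g. per-keyword document weights), and a lookup is then a plain row/column access.

```python
from math import sqrt

# Hypothetical document weight vectors.
vectors = {
    "d1": [0.8, 0.1, 0.0],
    "d2": [0.6, 0.4, 0.2],
    "d3": [0.0, 0.9, 0.7],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Precompute the full matrix once; similarity retrieval becomes a
# simple intersection of row and column by document id.
ids = sorted(vectors)
matrix = {i: {j: cosine(vectors[i], vectors[j]) for j in ids} for i in ids}
print(round(matrix["d1"]["d2"], 2))  # 0.86
```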
In our approach we enhance the k-means clustering
algorithm with intra-cluster-average-based re-centroid
computation, instead of randomly selecting a centroid from
the new cluster at every iteration. The optimized k-means
algorithm is as follows.

Algorithm:
Step 1: Select any K data objects from the data set as
initial centroids.
Step 2: While (iterations of clustering <= user-specified
maximum number of iterations):
Step 3: Compute the cosine similarity (di, dj), where di is
the file relevance score of data object i and dj is the file
relevance score of data object j.
Step 4: Assign each data object to its most similar, or
closest, centroid.
Step 5: Re-compute each new centroid from the
intra-cluster data points, i.e. the average of the k data
points or file relevance scores in the individual cluster,
e.g. (P11 + P12 + ... + P1k) / K, taken over all data points
in the same cluster.
Step 6: Continue steps 2 to 5.

In the traditional approach to k-means clustering, data
objects are grouped based on a randomly selected centroid;
in this paper we compute the centroid from intra-cluster
averages at every iteration. Preprocessing the documents
improves performance and clearly reduces the time and
space complexity over the network. Privacy is maintained
with the Triple DES algorithm: a random session key is
generated to encrypt the preprocessed data objects, which
are forwarded to the requesting data holder; at the receiver
end, the cipher data objects are decrypted with the same
key.

BIOGRAPHIES

Y.V.Siva Prasad completed his B.Tech at Avanthi Institute
of Engineering & Technology, Makavarapallem,
Narsipatnam. He is pursuing his M.Tech at Avanthi
Institute of Engineering & Technology, Makavarapallem,
Narsipatnam. His areas of interest are data mining,
network security and cloud computing.
P.Rajasekhar is working as an Assistant Professor at
Avanthi Institute of Engineering & Technology,
Makavarapallem, Narsipatnam. His areas of interest are
data mining, network security and cloud computing.
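As a sketch, the enhanced k-means of Section III can be run over hypothetical one-dimensional file relevance scores; the data, function name, and iteration bound below are illustrative. Centroids are re-computed as intra-cluster averages rather than re-sampled at random each iteration.

```python
import random

def enhanced_kmeans(scores, k, max_iter=20):
    centroids = random.sample(scores, k)          # Step 1: initial centroids
    for _ in range(max_iter):                     # Step 2: iteration bound
        clusters = [[] for _ in range(k)]
        for s in scores:                          # Steps 3-4: nearest centroid
            idx = min(range(k), key=lambda i: abs(s - centroids[i]))
            clusters[idx].append(s)
        # Step 5: re-centroid with intra-cluster averages
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters                    # Step 6: loop repeats above

random.seed(1)
scores = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9]
centroids, clusters = enhanced_kmeans(scores, k=2)
print(sorted(round(c, 3) for c in centroids))  # [0.15, 0.85]
```

On this toy data the averaging step pulls the two centroids to the means of the two natural groups regardless of which initial objects are chosen.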
IV. CONCLUSION
We conclude our current research work with an efficient
privacy-preserving data clustering technique over
distributed networks. The quality of the clustering
mechanism is enhanced with preprocessing, a relevance
matrix and improved centroid computation in the k-means
algorithm, while the cryptographic technique secures the
transmission of data between data holders or players and
reduces the privacy-preserving cost by forwarding the
relevant features of the dataset instead of raw datasets.
Security can be further enhanced by establishing an
efficient key exchange protocol and cryptographic
techniques for data transmission between data holders or
players.
REFERENCES
[1] F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and
H. (W.) Wang, "Privacy-preserving mining of association rules from
outsourced transaction databases."
[2] P. K. Fong and J. H. Weber-Jahnke, "Privacy preserving decision tree
learning using unrealized data sets."
[3] T. Tassa and D. J. Cohen, "Anonymization of centralized and
distributed social networks by sequential clustering."
[4] S. Jha, L. Kruger, and P. McDaniel, "Privacy preserving clustering."
[5] C. Clifton, M. Kantarcioglu, X. Lin, and M. Y. Zhu, "Tools for
privacy preserving distributed data mining."
[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed
K-means clustering over a peer-to-peer network," IEEE TKDE, vol. 21,
no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents
by distributed P2P clustering," in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed
peer-to-peer document clustering and cluster summarization," IEEE
Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P
overlays," in ICPP, 2003.
[10] S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving clustering
by data transformation."