A Novel and Privacy Preserving Unsupervised Learning Approach between Data Holders

International Journal of Engineering Trends and Technology (IJETT) – Volume 16 Number 1 – Oct 2014
Potnuru Srilatha1, M.VenkataBalaji Chadrashekar2
1Final Year M.Tech Student, 2Assoc. Professor
1,2Dept. of CSE, Aditya Institute of Technology and Management (AITAM), Tekkali, Srikakulam, Andhra Pradesh
Abstract: Privacy preservation in data mining over distributed
networks remains an important research issue in knowledge and
data engineering and in community-based clustering approaches.
Privacy becomes a key factor when datasets from different data
holders (players) are integrated for mining, and secure mining of
data is required in an open network. In this paper we propose an
efficient privacy-preserving data clustering technique for
distributed networks.
Existing approaches have mainly focused on protecting
multiparty computation, which exposes the limitations of
privacy-preserving clustering in the multi-user case. There,
multiple data owners hold their own databases and exchange
information with each other to calculate the central
clustering output. The result is obtained only after several
iterations of heavy communication between the data owners.
I. INTRODUCTION
Technology for knowledge discovery from information is
growing rapidly and becoming more prevalent. Clustering is a
basic approach in data mining for analyzing data: it groups
items or objects with similar properties, and it is used to
reduce the processing time needed to segregate data. A
well-known basic clustering method is the k-means algorithm.
It is applied in many areas, such as pattern recognition,
real-life applications and customer behavior analysis. The
data used in clustering is often very sensitive, for example
the financial records held by a bank. Privacy issues must be
considered for such confidential information. The data
player, here the bank, should not leak any information about
its customers, so that it cannot be observed by others.
Moreover, many issues arise when third parties are included
in the clustering analysis. [1][2]
The databases of financial organizations should not
reveal information to others. Another application of
clustering is in cloud computing, which introduced software
as a service. Here a data owner sends the data to be
clustered to a service provider in order to gain benefits
such as lower cost and paying only for the services used.
The service provider executes the clustering process and
sends the resulting data back to the data owner. This setting
requires privacy: the clustering must be performed without
access to the actual data sent by the data owner.
ISSN: 2231-5381
Other research uses the perturbation method, in which
the data owner creates a perturbed database from the actual
database by adding noise. The perturbed database can be
browsed by any other member, so anyone who gathers the
perturbed databases must perform the clustering on the
collected databases. Other researchers propose a solution to
the privacy-preserving clustering problem in which the data
is partitioned by attribute and distributed among many
parties. This solution uses the k-means algorithm and results
in O(nrk) complexity, where n is the number of data items, r
is the number of users and k is the number of clusters. [5,6]
Many clustering algorithms compute the distances
between data points and compare them with the distances to
the points that define the partition. The more accurately
these distances are calculated, the more accurate the mining
output; otherwise the results are degraded by the distortions
applied to the databases. Our proposed methods are therefore
intended to ensure correct results.
Traditional approaches include many methods that use
additive noise and multiplicative noise to increase the
level of privacy. Additive noise distorts the clustering
result more because it does not take the pairwise distances
of the actual data into account. The distortion of the
clustering result is measured by the difference between the
clustering output on the actual data and the output on the
perturbed data, and measures such as variation of information
have also been studied for comparing clusterings. [8]
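As a rough illustration of how two clustering outputs can be compared, the sketch below computes the variation of information between two label assignments using its standard definition, VI = H(A|B) + H(B|A). The function name, the use of natural logarithms and the toy labels are assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two clusterings.

    labels_a[i] and labels_b[i] are the cluster ids that the two
    clusterings assign to the same data item i.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both labelings must be non-empty and equally long")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Each term contributes to H(A|B) + H(B|A).
        vi -= p_ab * (math.log(p_ab / p_a) + math.log(p_ab / p_b))
    return vi


# Hypothetical labels: clustering of the actual data vs. a perturbed copy.
original = [0, 0, 1, 1]
perturbed = [0, 1, 0, 1]
print(variation_of_information(original, original))
print(variation_of_information(original, perturbed))
```

A value of zero means the two clusterings are identical up to renaming of the clusters; larger values indicate that the perturbation changed the clustering more.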
http://www.ijettjournal.org
II. RELATED WORK
An example of the privacy problem in clustering
arises in business organizations. Many companies hold large
sets of records about their customers and their buying
behavior. These companies may decide to cooperate and apply
a data clustering method to their combined datasets, since
this gives them an advantage over other companies. The aim
is to divide a market into segments of customers, where any
segment may be selected as a target market to be reached with
a particular marketing mix. The companies want to transform
their data in such a way that the privacy of their customers
is not violated.
Some researchers introduced methods that ensure only
moderate disclosure when detailed data is explored. One
approach first builds a decision tree from the training data
and then exchanges confidential values only among records
with the same class label. This approach gives stronger
protection nearer to the root of the tree, but with low
precision. Another approach also builds a decision tree from
the training data, but the values of each record are
perturbed, that is, random values drawn from a probability
distribution are added to them as noise. The resulting
records differ from the actual records and are less accurate
than the original data records.
Another approach focuses on information loss. It is
based on expectation maximization, which converges to the
maximum likelihood estimate of the actual data when a large
amount of data is available in the original database. It also
shares some features of the Bayesian classification method.
Other traditional approaches introduced association rule
mining, which is more efficient for categorical items and
preserves the privacy of every transaction. The idea is that
some items in every transaction are replaced by new items
that are not present in the original transaction. In this way
some information is removed and some other information is
introduced to achieve the desired privacy in the data. The
scheme is flexible enough to recover the association rules
while protecting the data through uniform randomization.
requirements of privacy (such as in the case of medical records)
or corporate secrecy can prevent such collaborative data
mining. Privacy-preserving data mining [3, 4] concerns itself
with algorithms and methods that allow data owners to
collaborate in such endeavors without any collaborator being
required to reveal any items from its own database. The two
aims in this setting are privacy, that is, no “leakage” of
actual data, and communication efficiency, that is,
minimizing the amount of data passed between the users.
Large databases and data streams: Algorithms that
work on large databases and data streams must be efficient
in terms of I/O access. Since secondary-storage access can
be significantly more expensive than main-memory access,
an algorithm should make a small number of passes over
the data, ideally no more than one pass. Also, since main
memory capacity typically lags significantly behind
secondary storage, these algorithms should have sub-linear
space complexity. Although the notion of data streams has
been formalized relatively recently, a surprisingly large
body of results already exists for stream data mining [10, 1,
2, 8].
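To make the one-pass, small-memory requirement concrete, here is a minimal sketch of a streaming computation that visits each value once and keeps only three numbers in memory. It uses Welford's online algorithm for the running mean and variance; this is a generic illustration chosen for this example, not a technique taken from the paper, and the class and attribute names are hypothetical.

```python
class RunningStats:
    """Single-pass mean and variance with constant memory (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new stream value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0]:   # stands in for a data stream
    stats.update(value)
print(stats.mean, stats.variance)
```

The stream is never stored, so the memory used does not grow with the number of items processed.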
Most of the privacy-preserving protocols available
in the literature convert existing data mining algorithms or
distributed data mining algorithms into privacy-preserving
protocols. The resulting protocols can sometimes leak
additional information. For example, in the privacy-preserving clustering protocols of [4,1], the two
collaborating parties learn the candidate cluster centers at
the end of each iteration. As explained in [2], this
intermediate information can sometimes reveal some of the
parties’ original data, thus breaching privacy. While it is
possible to modify these protocols to avoid revealing the
candidate cluster centers [7], the resulting protocols have
very high communication complexity. In this paper, we
describe a k-clustering algorithm that we specifically
designed to be converted into a communication-efficient
protocol for private clustering.
Private decentralized data mining: The data is
distributed among different databases owned by various
data players. Distributed data mining is the acquisition of
knowledge from multiple databases, each of which is owned
by a different organization.
The threat to privacy becomes real since data
mining techniques are able to derive highly sensitive
knowledge from unclassified data that is not even known to
database holders. Worse is the privacy invasion occasioned
by secondary usage of data when individuals are unaware
of “behind the scenes” use of data mining techniques [10].
With distributed data mining it is possible to obtain
more knowledge, or knowledge with wider applicability,
than by using data owned by a single property of
The error-sum-of-squares (ESS) objective
function is defined as the sum of the squares of the
distances between points in the database and their
nearest cluster centers. The k-clustering problem
requires partitioning the data into k clusters
with the objective of minimizing the ESS. Lloyd’s
algorithm (more popularly known as the k-means
algorithm) is a popular clustering tool for the
problem. This algorithm converges to a local
minimum of the ESS in a finite number of iterations.
Ward’s algorithm for hierarchical agglomerative
clustering also makes use of the notion of ESS.
Although Ward’s algorithm has been observed to
work well in practice, it is rather slow (O(n^3)) and
does not scale well to large databases.
III. PROPOSED SYSTEM
We propose an empirical approach for secure
clustering over distributed networks that combines an
enhanced k-means algorithm with a cryptographic mechanism.
The clustering approach groups similar data objects out of a
large number of data points, and the cryptographic algorithm
maintains privacy, or data confidentiality, while data is
transmitted over the network.
Every individual node can be treated as a data holder or
player in the distributed network. The data holder initially
preprocesses the documents by removing the inconsistent
feature set from the data objects. After preprocessing, a
document weight or file relevance score can be computed to
convert each categorical object to a numerical value. The
document weight can be computed from the frequency of a
keyword as term frequency (i.e. the number of occurrences of
the keyword in a document) and inverse document frequency
(i.e. a measure based on the occurrences of the keyword across
all documents); here frequency indicates the occurrences of a
keyword in a document.
We use the most widely used similarity measurement, cosine
similarity. A cosine similarity matrix can be constructed over
the document weights with the following similarity measure:

$$\cos(d_m, d_n) = \frac{d_m \cdot d_n}{\lVert d_m \rVert \, \lVert d_n \rVert}$$

Where
d_m is the centroid (document weight)
d_n is the document weight or file weight from the

Pre-construction of the cosine similarity matrix reduces the
time complexity and improves performance. The cosine
similarity of two documents can be retrieved efficiently by a
simple intersection of the row and column of the respective
document or object ids. A sample cosine similarity matrix is
as follows:

      d1    d2    d3    d4    d5
d1    1.0   0.48  0.66  0.89  0.45
d2    0.77  1.0   0.88  0.67  0.88
d3    0.45  0.9   1.0   0.67  0.34
d4    0.32  0.47  0.77  1.0   0.34
d5    0.67  0.55  0.79  0.89  1.0

In the above cosine similarity matrix,
D(d1, d2, ..., dn) represents the set of documents at the data
holder or player, and the entries are the cosine similarities
between the respective data objects. Transmitting the
preprocessed documents instead of the raw documents minimizes
time and space complexity when the similarity between
centroids and documents is computed during clustering.
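The document weighting and similarity-matrix construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are already tokenised into keyword lists, the weight of a keyword is its term frequency multiplied by a smoothed inverse document frequency (log(N/df) + 1, chosen here only to keep weights positive), and the function names and sample documents are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, one weight vector per tokenised document)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each keyword.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        # Assumed weighting: tf * (log(N / df) + 1).
        vectors.append(
            [tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in vocab]
        )
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pre-compute all pairwise cosine similarities once."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical preprocessed documents held by one data holder.
docs = [
    ["privacy", "clustering", "data"],
    ["privacy", "data", "mining"],
    ["network", "key"],
]
vocab, vectors = tf_idf_vectors(docs)
matrix = similarity_matrix(vectors)

# Looking up a similarity is a row/column intersection by document index.
print(matrix[0][1])
```

Once the matrix is built, every later similarity query is a constant-time lookup, which is the saving that pre-construction is meant to provide.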
In our approach we enhance the k-means clustering
algorithm with re-centroid computation based on intra-cluster
averages, instead of random selection of a centroid from the
new cluster in every iteration. The optimized k-means
algorithm is as follows.
Algorithm:
Step 1: Select any K data objects (points) from the
data set as centroids.
Step 2: While (number of clustering iterations <= user-specified
maximum number of iterations)
Step 3: Compute Cosine similarity(di, dj)
Where di is the file relevance score of data object i
dj is the file relevance score of data object j
Step 4: Assign each data object to its most similar or
closest centroid.
Step 5: Re-compute each new centroid from the intra-cluster data
points (i.e. the average of the k data points or file relevance
scores in the individual cluster), using all data points from
the same cluster for each new centroid.
Ex: (P11 + P12 + ... + P1k) / k
Step 6: Continue Step 2 to Step 5.
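The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes each data object is represented as a vector of document weights (cosine similarity is not informative on single scalar scores), it picks the initial centroids with a seeded random sample, and it stops early once the assignments no longer change. The names kmeans_cosine, max_iter and seed are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans_cosine(points, k, max_iter=20, seed=0):
    """Cluster weight vectors by cosine similarity with averaged centroids."""
    rng = random.Random(seed)
    # Step 1: select any k data objects as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    # Step 2: iterate up to the user-specified maximum.
    for _ in range(max_iter):
        # Steps 3-4: assign each object to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: cosine(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assumed early stop: nothing changed
            break
        labels = new_labels
        # Step 5: new centroid = average of all points in the cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-dimensional weight vectors forming two obvious groups.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans_cosine(points, k=2)
print(labels)
```

Averaging every member of a cluster makes each new centroid depend only on the current assignment, so only the initial selection in Step 1 involves randomness.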
In the traditional k-means approach, data objects are
grouped based on random selection of centroids; in this
paper we compute the centroids from intra-cluster averages
in every iteration. Preprocessing the documents improves
performance and reduces the time and space costs over the
network. Privacy is maintained with the triple DES
algorithm: a random session key is generated to encrypt the
preprocessed data objects, which are forwarded to the
requesting data holder, and at the receiver end these cipher
data objects are decrypted with the same key.
REFERENCES
[1] Fosca Giannotti, Laks V. S. Lakshmanan, Anna Monreale, Dino
Pedreschi, and Hui (Wendy) Wang, “Privacy-Preserving Mining of
Association Rules From Outsourced Transaction Databases.”
[2] Pui K. Fong and Jens H. Weber-Jahnke, “Privacy Preserving Decision
Tree Learning Using Unrealized Data Sets.”
[3] Tamir Tassa and Dror J. Cohen, “Anonymization of Centralized and
Distributed Social Networks by Sequential Clustering.”
[4] S. Jha, L. Kruger, and P. McDaniel, “Privacy Preserving Clustering.”
[5] Chris Clifton, Murat Kantarcioglu, Xiaodong Lin, and Michael Y. Zhu,
“Tools for Privacy Preserving Distributed Data Mining.”
[6] S. Datta, C. R. Giannella, and H. Kargupta, “Approximate
distributed K-Means clustering over a peer-to-peer network,” IEEE TKDE,
vol. 21, no. 10, pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich, “Classifying
documents by distributed P2P clustering,” in INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, “Hierarchically distributed
peer-to-peer document clustering and cluster summarization,” IEEE Trans.
Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, “Similarity discovery in structured P2P
overlays,” in ICPP, 2003.
[10] Stanley R. M. Oliveira and Osmar R. Zaïane, “Privacy Preserving
Clustering By Data Transformation.”
BIOGRAPHIES
She completed her B.Tech in the Dept. of CSIT, SARADA Institute of
Technology And Management (SITAM), Ampolo Road, Srikakulam, Andhra
Pradesh. She is pursuing her M.Tech in the Dept. of CSE, Aditya
Institute of Technology and Management (AITAM), Tekkali, Srikakulam,
Andhra Pradesh. Her areas of interest are data mining, network
security and cloud computing.
IV. CONCLUSION
We conclude our current research work with an efficient
privacy-preserving data clustering approach over distributed
networks. The quality of the clustering mechanism is enhanced
by preprocessing, the relevance matrix and the centroid
computation in the k-means algorithm. The cryptographic
technique provides secure transmission of data between data
holders or players, and the cost of privacy preservation is
reduced by forwarding the relevant features of the dataset
instead of the raw datasets. Security can be further enhanced
by establishing an efficient key exchange protocol and
cryptographic techniques for the transmission of data between
data holders or players.