A Novel and Secure Mining of Data in Distributed Architecture

International Journal of Engineering Trends and Technology (IJETT) – Volume 17 Number 5 – Nov 2014
M L V A Priya1, S Venkata Suryanarayana2, Dr. K S N Prasad3
1M.Tech Student, 2Assistant Professor, 3Associate Professor
Dept of CSE, GVIT, Bhimavaram, A.P, India.
Abstract: Data confidentiality in data mining over
distributed networks remains an important research issue
in knowledge and data engineering and in community-based
clustering approaches. Privacy is a basic requirement when
datasets from different data holders or players are
integrated for mining, and secure mining of data is
required in an open network. In this paper we propose an
efficient privacy-preserving data clustering technique for
distributed networks with a decentralized architecture.
I. INTRODUCTION
In distributed networks or open environments, nodes
communicate openly with each other to transmit data,
whether the architecture is centralized or decentralized,
and secure mining of such data is an active research area.
Various researchers have worked on privacy-preserving
techniques for classification, association rule mining, and
clustering.
Randomization and perturbation approaches are
available for privacy preservation, and privacy can be
maintained in two ways. The first is the cryptographic
approach, in which real datasets are converted to
unrealized datasets by encoding them. The second is
imputation, in which fake values are inserted into the real
dataset and removed during mining according to a set of
rules [1][2].
Clustering is the process of grouping similar
objects based on the distance (for numerical data) or
similarity (for categorical data) between data objects. In a
distributed environment, data holders or players maintain
individual datasets, and every node or vertex is connected
to other nodes by edges and described by its quasi-identifiers [3].
Representing the network as a graph whose nodes carry
demographic attributes such as age, mobile number, address,
and profile enriches the structure of the network.
Researchers in many disciplines, including market research,
sociology, psychology, and epidemiology, are interested in
social networks. Because social network data is sensitive,
little of it is made available over the network, so the data
must be anonymized to protect the sensitive information and
the privacy of each individual user.
ISSN: 2231-5381
Identifying and removing attributes such as names
or social security numbers from the data is insufficient,
because information about individuals can still be obtained
from the structure of the released graph. Data in a social
network is therefore described together with its nodes, and
an anonymization technique that groups the data by
clustering has been suggested. This algorithm uses the graph
to measure the information lost through anonymization and
to observe how well the privacy of the different users of
the network is preserved.
II. RELATED WORK
In a social network, the nodes are represented as
vertices V = (v1, v2, …, vn) connected through a set of
edges E in an undirected graph G(V, E). A non-identifying
attribute that describes a node is known as a
quasi-identifier. Clustering can be performed on
quasi-identifiers such as age and gender; distributed
clustering groups similar objects based on the minimum
distance between the nodes.
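The graph model and the distance between quasi-identifiers can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `Node`, `Graph`, and `qi_distance`, the choice of age and gender as the only quasi-identifiers, and the equal weighting of the two attributes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    age: int      # numerical quasi-identifier
    gender: str   # categorical quasi-identifier


@dataclass
class Graph:
    # Undirected graph G(V, E): nodes keyed by id, edges stored as unordered pairs.
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, u, v):
        # frozenset makes (u, v) and (v, u) the same edge.
        self.edges.add(frozenset((u, v)))


def qi_distance(a, b, age_range):
    """Hypothetical mixed distance in [0, 1] over the quasi-identifiers.

    Numerical attributes use a range-normalized absolute difference;
    categorical attributes use a 0/1 mismatch.
    """
    age_part = abs(a.age - b.age) / age_range
    gender_part = 0.0 if a.gender == b.gender else 1.0
    return (age_part + gender_part) / 2


g = Graph()
v1, v2, v3 = Node("v1", 30, "F"), Node("v2", 40, "F"), Node("v3", 30, "M")
for v in (v1, v2, v3):
    g.add_node(v)
g.add_edge("v1", "v2")
g.add_edge("v2", "v1")  # same undirected edge, stored once
```

A clustering step would then place nodes with a small `qi_distance` in the same group; how the paper weights the individual attributes is not stated, so the equal weighting here is only a placeholder.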
Privacy preservation is a problem in social
networks, so we follow the distributed setting, in which
the data is split between several users of the network.
The main aim is to protect each user's information about
the links in the network from the other users while
producing an anonymized view of the unified network.
For the centralized setting, variants of an anonymization
algorithm based on sequential clustering, denoted Sq, are
used. This algorithm outperforms the SaNGreeA algorithm of
Campan and Truta, which achieves anonymity over the network
by means of clustering. To the best of our knowledge, this
is one of the best ways to provide privacy preservation in
a distributed social network.
A set of entities with relations between them is
known as a network, and a network that is openly available
to all of its users is known as a social network. Consider a
population in which information is provided for each
individual, together with relations between individuals
such as friendship, correspondence, and collaboration.
Similarly, an information network may describe scientific
journals and the links between them. Such networks are
represented as graphs in which the nodes correspond to the
entities and the edges to the relations between them. Real
networks are much more complicated and contain additional
information, such as asymmetric interactions in financial
transactions that involve more than one user; co-membership
in a social network is modeled as a hypergraph.
http://www.ijettjournal.org
Fig1: Proposed Architecture (nodes Node 1, Node 2, …, Node i, …, Node n; labels: Req, Pre-processed Documents, Clustering, New Document)
In this paper we propose an efficient and
secure data mining technique that combines an optimized
k-means algorithm with a cryptographic approach. To cluster
similar information, the data points held by the individual
data holders or players must first be shared.
In the proposed architecture, every data holder or player
clusters its own documents after preprocessing and computes
the local and global frequencies of the documents in order
to calculate the file weight (document weight). Distributed
k-means is one of the efficient distributed clustering
algorithms, and our work improves this traditional
algorithm to optimize clustering in distributed networks.
In our approach, when a new dataset is placed at a data
holder, that data holder requests the other data holders to
forward only the relevant features, instead of their entire
datasets, in order to cluster the documents.
Individual peers at the data holders first
preprocess the raw data by eliminating unnecessary
features from the datasets. After preprocessing, each peer
computes the file weight of the preprocessed feature set in
terms of term frequency and inverse document frequency,
and then computes the file relevance matrix to reduce the
time complexity of clustering. We use one of the most
widely used similarity measures, cosine similarity.
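The file weight computation described above can be sketched as a plain TF-IDF calculation. The paper does not give its exact weighting formula, so the variant below (term frequency normalized by document length, unsmoothed logarithmic inverse document frequency) and the function name `tf_idf_weights` are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists that have already been preprocessed.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [["data", "mining"], ["data", "privacy"]]
w = tf_idf_weights(docs)
```

With this unsmoothed variant, a term that occurs in every document (here `data`) receives weight zero, so only the discriminating terms contribute to the later similarity computation.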
III. PROPOSED SYSTEM
In this paper we emphasize the mining
approach rather than the cryptographic technique, since
various cryptographic algorithms and key exchange
protocols are available for secure transmission of data.
The architecture of the proposed work is shown in Fig1.
The cosine similarity between two weight vectors is computed as
\[ \cos(d_m, d_n) = \frac{d_m \cdot d_n}{\lVert d_m \rVert \, \lVert d_n \rVert} \]
where
d_m is the centroid (document weight) and
d_n is a document weight (file weight).
The following example shows a simple way to retrieve the
similarity between documents at an individual data holder
by computing the cosine similarity prior to clustering.
       d1     d2     d3     d4     d5
d1     1.0    0.48   0.66   0.89   0.45
d2     0.77   1.0    0.88   0.67   0.88
d3     0.45   0.9    1.0    0.67   0.34
d4     0.32   0.47   0.77   1.0    0.34
d5     0.67   0.55   0.79   0.89   1.0
Fig2: Similarity Matrix
In the above table, D = (d1, d2, …, dn) represents the set
of documents at a data holder or player, and each entry is
the cosine similarity between the corresponding pair of
documents. Precomputing these values reduces the time
complexity of computing the similarity between centroids
and documents during clustering.
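A relevance matrix of this kind can be precomputed with a few lines of code. The sketch below assumes each document is a sparse {term: weight} vector; the function names `cosine_similarity` and `relevance_matrix` and the convention of returning 0.0 for an all-zero vector are choices made for this example, not details taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse weight vectors ({term: weight})."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def relevance_matrix(vectors):
    """Pairwise cosine similarities, computed once before clustering."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = cosine_similarity(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = s
    return matrix


vecs = [{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"b": 2.0}]
m = relevance_matrix(vecs)
```

Cosine similarity is symmetric, so the sketch fills both halves of the matrix from a single computation per pair, which roughly halves the work of the precomputation step.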
In our approach we enhance the k-means algorithm
with centroid recomputation instead of a single random
selection at every iteration. The optimized k-means
algorithm is as follows.
Algorithm:
1: Select K points as the initial centroids for the first iteration
2: Repeat steps 3 to 6 until the termination condition is met
(user-specified maximum number of iterations)
3: get_relevance(dm, dn)
where dm is the file weight of document M from the relevance
matrix and dn is the file weight of document N from the
relevance matrix
4: Assign each point to its closest centroid to form K
clusters
5: Recompute the centroid from intra-cluster data points
(i.e. the average of any k data points in the individual cluster).
Ex: (P11 + P12 + … + P1k) / k, with all points taken from the
same cluster
6: Compute the new centroid for the merged cluster
In the traditional k-means approach, a new
centroid is selected randomly. Our approach enhances this
by constructing the relevance matrix in advance and by
computing each new centroid as the average of k randomly
selected document weights.
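Steps 1 to 5 of the algorithm above can be sketched as follows. This is one possible reading rather than the authors' implementation: the names `sampled_kmeans`, `sample_size`, and `seed` are hypothetical, documents are assumed to be dense weight vectors of equal length, similarities are computed directly instead of being looked up through get_relevance, and step 6 is omitted because the source does not describe how clusters are merged.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(points):
    # Component-wise average of a non-empty list of vectors.
    return [sum(column) / len(points) for column in zip(*points)]


def sampled_kmeans(points, k, sample_size, max_iter=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: K initial centroids
    assignment = [0] * len(points)
    for _ in range(max_iter):                    # step 2: fixed iteration budget
        # steps 3-4: assign each point to its most similar centroid
        assignment = [
            max(range(k), key=lambda c: cosine(p, centroids[c]))
            for p in points
        ]
        # step 5: new centroid = average of up to sample_size random members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                chosen = rng.sample(members, min(sample_size, len(members)))
                centroids[c] = mean_vector(chosen)
    return assignment, centroids


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment, centroids = sampled_kmeans(points, k=2, sample_size=2, max_iter=5)
```

Averaging only a random sample of cluster members makes each centroid update cheaper than averaging the whole cluster, at the cost of some variance between runs; fixing `seed` keeps a run reproducible.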
Triple DES :
Triple DES is the common name for the Triple Data
Encryption Algorithm (TDEA) block cipher. It is so named
because it applies the Data Encryption Standard (DES)
cipher algorithm three times to each data block. Triple DES
provides a relatively simple method of increasing the key
size of DES to protect against brute force attacks, without
requiring a completely new block cipher algorithm.
The standards define three keying options:
• Keying option 1: All three keys are independent.
• Keying option 2: K1 and K2 are independent, and K3 = K1.
• Keying option 3: All three keys are identical, i.e. K1 = K2 = K3.
Keying option 1 is the strongest, with 3 x 56 = 168
independent key bits.
Keying option 2 provides less security, with 2 x 56 = 112
key bits. This option is stronger than simply DES
encrypting twice, e.g. with K1 and K2, because
it protects against meet-in-the-middle attacks.
Keying option 3 is no better than DES, with only 56 key
bits. This option provides backward compatibility with
DES, because the first and second DES operations simply
cancel out. It is no longer recommended by the National
Institute of Standards and Technology (NIST) and not
supported by ISO/IEC 18033-3.
In general, Triple DES with three independent keys (keying
option 1) has a key length of 168 bits (three 56-bit DES
keys), but due to the meet-in-the-middle attack the
effective security it provides is only 112 bits. Keying
option 2 reduces the key size to 112 bits. However, this
option is susceptible to certain chosen-plaintext or
known-plaintext attacks, and thus it is designated by NIST
to have only 80 bits of security.
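The way the three keying options interact with the encrypt-decrypt-encrypt (EDE) composition used by TDEA can be illustrated with a toy cipher. The 8-bit `toy_encrypt` and `toy_decrypt` functions below are NOT DES and offer no security; they are an invertible stand-in, assumed only for this sketch, so that the structure of the composition can be run and checked.

```python
def toy_encrypt(block, key):
    """Toy invertible 8-bit 'cipher' (not DES): add the key, rotate left by 3."""
    x = (block + key) % 256
    return ((x << 3) | (x >> 5)) & 0xFF


def toy_decrypt(block, key):
    """Inverse of toy_encrypt: rotate right by 3, subtract the key."""
    x = ((block >> 3) | (block << 5)) & 0xFF
    return (x - key) % 256


def ede_encrypt(block, k1, k2, k3):
    # Encrypt with K1, decrypt with K2, encrypt with K3.
    return toy_encrypt(toy_decrypt(toy_encrypt(block, k1), k2), k3)


def ede_decrypt(block, k1, k2, k3):
    # Reverse order: decrypt with K3, encrypt with K2, decrypt with K1.
    return toy_decrypt(toy_encrypt(toy_decrypt(block, k3), k2), k1)


# Keying option 3 (K1 = K2 = K3): the first two operations cancel out,
# so the result equals a single encryption.
collapses = all(
    ede_encrypt(b, 42, 42, 42) == toy_encrypt(b, 42) for b in range(256)
)
```

The same cancellation is what gives keying option 3 its backward compatibility with single DES, and it is also why that option adds no key strength.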
IV. CONCLUSION
We conclude our current research work with an
efficient privacy-preserving data clustering technique over
distributed networks. The quality of the clustering
mechanism is enhanced by preprocessing, the relevance
matrix, and centroid computation in the k-means algorithm.
The cryptographic technique secures the transmission of
data between data holders or players, and forwarding only
the relevant features of a dataset instead of the raw
dataset reduces the cost of preserving privacy. Security can
be enhanced further by establishing an efficient key
exchange protocol and cryptographic techniques for the
transmission of data between data holders or players.
Dr. K S N Prasad is working as Associate
Professor in the CSE Department at GVIT,
Bhimavaram, A.P, India. His areas of interest
are Data Mining and Network Security.
REFERENCES
[1] Fosca Giannotti, Laks V. S. Lakshmanan, Anna Monreale,
Dino Pedreschi, and Hui (Wendy) Wang, “Privacy-Preserving
Mining of Association Rules From Outsourced Transaction
Databases.”
[2] Pui K. Fong and Jens H. Weber-Jahnke, “Privacy
Preserving Decision Tree Learning Using Unrealized Data
Sets.”
[3] Tamir Tassa and Dror J. Cohen, “Anonymization of
Centralized and Distributed Social Networks by Sequential
Clustering.”
[4] S. Jha, L. Kruger, and P. McDaniel, “Privacy Preserving
Clustering.”
[5] Chris Clifton, Murat Kantarcioglu, Xiaodong Lin, and
Michael Y. Zhu, “Tools for Privacy Preserving Distributed
Data Mining.”
[6] S. Datta, C. R. Giannella, and H. Kargupta,
“Approximate distributed K-Means clustering over a
peer-to-peer network,” IEEE TKDE, vol. 21, no. 10,
pp. 1372–1388, 2009.
[7] M. Eisenhardt, W. Müller, and A. Henrich,
“Classifying documents by distributed P2P clustering,” in
INFORMATIK, 2003.
[8] K. M. Hammouda and M. S. Kamel, “Hierarchically
distributed peer-to-peer document clustering and cluster
summarization,” IEEE Trans. Knowl. Data Eng., vol. 21,
no. 5, pp. 681–698, 2009.
[9] H.-C. Hsiao and C.-T. King, “Similarity discovery in
structured P2P overlays,” in ICPP, 2003.
[10] Stanley R. M. Oliveira and Osmar R. Zaïane, “Privacy
Preserving Clustering By Data Transformation.”
BIOGRAPHIES
M L V A Priya is pursuing M.Tech in the
CSE Department at GVIT, Bhimavaram,
A.P, India. Her areas of interest are Data
Mining and Network Security.
Mr. S Venkata Suryanarayana is working
as Assistant Professor in the CSE
Department at GVIT, Bhimavaram,
A.P, India. His areas of interest are Data
Mining and Network Security.