Nested Clustering Based Rotation using Radial Basis Function for PPDM Hariharan.R ,Durairaj.K

International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 9 - Mar 2014
Nested Clustering Based Rotation using Radial Basis
Function for PPDM
Hariharan.R1,Durairaj.K2
Student, Department of Information Technology,
Sathyabama University, Chennai, India.
ABSTRACT- Privacy preservation in data mining is one of the
most needed techniques today. People do not want to share
their sensitive information with unauthorized users, so that
information must be hidden from them. Many approaches are
used to protect data in data mining, such as cryptography,
k-anonymity, and perturbation, but the existing methods have
drawbacks such as loss of data quality or limited
preservation. In this paper we show how a neural network,
specifically a radial basis function (RBF) network, can be
used to preserve the data. We also report the performance
metrics of our method, its limitations, and future work.
KEYWORDS- Privacy preservation in data mining, Radial
basis function, Randomization
I. INTRODUCTION
Data mining is the extraction of interesting (non-trivial,
implicit, previously unknown and potentially useful)
patterns or knowledge from huge amounts of data. It is
also known as knowledge discovery in databases, knowledge
extraction, pattern analysis, data archaeology, information
harvesting, and business intelligence. Simple search and
query processing are sometimes also counted as data mining.
Data mining is mainly applied in market analysis and risk
analysis.
Data mining software is one of a number of analytical
tools for analyzing data. It allows users to analyze data
from many different dimensions or angles, categorize it,
and summarize the relationships identified. Although data
mining is a relatively new term, the technology is not.
Companies have used powerful computers to sift through
volumes of supermarket scanner data and analyze market
research reports for years. However, continuous
innovations in computer processing power, disk storage,
and statistical software are dramatically increasing the
accuracy of analysis while driving down the cost.
Text mining, web mining, and stream data mining are other
applications of data mining. Huge amounts of data are stored
in databases, and the volume grows rapidly day by day. These
data are passed to data mining techniques, which extract
knowledge from the large data set. Some attributes in a
database are very sensitive. For example, shopping-card and
medical records are sensitive attributes of a particular
person, so they should be hidden from others; otherwise they
may cause problems or loss of prestige for that person.
Protecting such sensitive data is called privacy preservation
in data mining (PPDM): no individual's data should be
identifiable by others.
ISSN: 2231-5381
Consider hospital data released for research. The patient
database is highly confidential within a hospital management
system, yet it must be given to researchers for analysis. At
the same time, the hospital management wants to hide each
individual's disease, so it removes the unique identifiers,
such as name, address, and phone number. Even then, some of
the remaining attributes can reveal a particular patient: a
person may be identified by the combination of area code,
age, and salary, and this may be misused by criminals. The
hospital management therefore changes such values to nearby
values, slightly higher or lower but never by a large
amount. Only then does the analysis remain reasonably
accurate while the patient stays protected.
A health insurer claims money directly from the company
using a common or unique identifier. In this case the
company needs to know the amount to be claimed, but not the
disease or other sensitive information. The medical-claim
(mediclaim) provider also communicates with the company
database, but the salary and other key attributes must be
hidden from it. So the data are protected using a privacy
preservation technique.
An online transaction involves two databases, the bank's
and the selling company's. The bank knows only the purchase
amount, not the product purchased; the selling company knows
only that the amount was paid by card, not the card holder's
balance. This sensitive information is hidden using PPDM.
PPDM is a technique for protecting data from unauthorized
persons [7]. When a database is handed over for mining, the
data may be misused to the detriment of the user, so the
unique attributes are removed in pre-processing and the
common attributes are changed to nearby values. Of the many
available technologies, this paper uses perturbation.
Perturbation modifies the data: the number of output records
equals the number of input records, and each output value is
close to, but not the same as, the original value. Data
quality, preservation rate, and the other metrics depend on
the particular technique applied.
http://www.ijettjournal.org
II. RELATED WORK
In [3], the database is clustered and each cluster is again
divided into two sub-clusters. Each sub-cluster is first
checked for whether it holds enough data to anonymize; if
not, it is merged with its adjacent sub-cluster. After this
adjustment every sub-cluster has enough data to modify. The
amount of data in each sub-cluster is computed to find the
larger clusters, and a range value is set for each
sub-cluster; a sub-cluster that exceeds the range is divided
again. After these checks, the attribute values are replaced
by the centroid value.
In [8], different approaches to privacy-preserving data
publishing are discussed, in which data are partitioned both
horizontally and vertically. The two main existing
approaches are generalization and bucketization. The
technique also handles high-dimensional data. Three phases
are discussed: attribute partition, column generalization,
and tuple partition. Four techniques are implemented:
generalization, bucketization, multiset-based
generalization, and one-attribute-per-column slicing.
K-anonymity is proposed for privacy preservation in [1], and
the KACTUS algorithm is proposed for PPDM in [5]. A decision
tree is built from the original data set, and each node in
the tree carries a weight. The k-anonymity [10] value is
fixed for the preservation. The weights of the nodes are
compared; when they match, the node is merged with its
parent, and the process is repeated until the preserved data
set is obtained.
In [9], a tree-based data perturbation method is proposed,
based on a kd-tree. The data set is partitioned into
subsets, which are divided again into smaller, more
homogeneous subsets (the final partitions are described as
holding three values), and each value is replaced by the
average of its final partition. The method is mainly
applicable to numerical attributes. The output data are
perturbed versions of the input data, and a
divide-and-conquer approach is used to build the structure.
Rotation-based transformation (RBT) for privacy preservation
in data mining is proposed in [6]. It focuses on numerical
attributes and uses an isometric transformation, so that the
transformed values stay close to the original values. The
transformation is applied after the database has been
clustered, and the metrics are then checked: the results
show that the data remain within their clusters after
perturbation. This makes it one of the better methods for
privacy preservation.
In [4], the authors show how neural network learning can be
secured in a data mining setting. Two algorithms are
discussed: back-propagation and the extreme learning machine
(ELM). The back-propagation algorithm preserves privacy
using two building blocks, secure multi-party addition and
secure multi-party multiplication. The ELM algorithm is
efficient for single-hidden-layer learning systems.
III. TYPES OF ATTRIBUTE
A database contains many attributes of a person, which fall
into two main types: numerical attributes and categorical
attributes. Attributes that hold only numeric values
(integers in this work) are called numerical attributes, and
attributes with non-numeric values are categorical
attributes.
IV. TYPES OF PPDM
Many techniques are used for privacy preservation in data
mining. Some of them are discussed here.
a. Additive Perturbation
Additive perturbation adds random noise (b) to each value
of the original dataset (a), so the released value is a + b.
After the perturbation, the probability distribution of the
original numeric values can still be estimated from the
perturbed data. If an original value can be estimated, with
confidence c, to lie within some interval, then the width of
that interval is the amount of privacy offered at
confidence level c.
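The additive scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `additive_perturb`, the choice of zero-mean Gaussian noise, and the sample ages are all assumptions made for the example.

```python
import random

def additive_perturb(values, noise_std, seed=None):
    """Return a new list in which zero-mean Gaussian noise b is added to each value a."""
    rng = random.Random(seed)  # local generator so results are reproducible with a seed
    return [a + rng.gauss(0.0, noise_std) for a in values]

# Illustrative data only (not taken from the paper).
ages = [23, 35, 41, 52, 67]
perturbed_ages = additive_perturb(ages, noise_std=2.0, seed=42)
```

A larger `noise_std` widens the interval in which an attacker can place the original value, which raises privacy but lowers data quality.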
b. Multiplicative Perturbation
This is similar to additive perturbation, but here the
noise (b) is multiplied with the original dataset (a). The
noise is a floating-point value in the range zero to one,
which gives good output: when b is close to one, the
perturbed data (c) stays close to the original data, i.e.
c = a*b.
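A minimal Python sketch of c = a*b follows. The function name `multiplicative_perturb` and the default factor range of 0.9 to 1.0 are assumptions chosen for illustration; the range lies inside the zero-to-one interval mentioned above and keeps c near a.

```python
import random

def multiplicative_perturb(values, low=0.9, high=1.0, seed=None):
    """Multiply each value a by a random factor b drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    return [a * rng.uniform(low, high) for a in values]

# Illustrative data only (not taken from the paper).
hours_per_week = [20, 38, 40, 45, 60]
perturbed_hours = multiplicative_perturb(hours_per_week, seed=3)
```

With these defaults every perturbed value is at most 10% below its original; a factor range that reaches toward zero would distort the data far more.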
c. K-anonymity
The concept of k-anonymization [7] was introduced by
Samarati and Sweeney. K-anonymity preserves privacy through
repetition in the quasi-identifiers: every combination of
quasi-identifier values must be shared by at least k records.
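The k-anonymity condition can be checked mechanically. The sketch below is only an illustration: the function name `is_k_anonymous`, the record layout, and the generalized sample values are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical, already generalized records (not taken from the paper).
records = [
    {"age": "30-39", "zip": "600**", "disease": "flu"},
    {"age": "30-39", "zip": "600**", "disease": "asthma"},
    {"age": "40-49", "zip": "601**", "disease": "flu"},
    {"age": "40-49", "zip": "601**", "disease": "diabetes"},
]
two_anonymous = is_k_anonymous(records, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(records, ["age", "zip"], k=3)
```

Here each (age, zip) combination appears twice, so the table satisfies the condition for k = 2 but not for k = 3.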
d. Cryptography-Based Techniques
Cryptography is also used for privacy preservation in data
mining. The sensitive attributes are selected from the
database and a cryptographic technique is applied to them,
for example generating a key and combining the key value
with the data, to obtain a new, altered database.
V. PROBLEM DEFINITION
Many technologies are available for PPDM, such as
anonymization, randomization, and slicing, but the existing
methodologies have drawbacks. No existing methodology
preserves privacy effectively while also retaining utility,
keeping execution time low, and giving control over
anonymization. We therefore designed an approach based on
artificial intelligence, specifically a neural network, that
preserves the data using a perturbation technique. It is
named Nested Clustering Based Rotation using Radial Basis
Function (NCBRBF) for PPDM.
VI. ARTIFICIAL INTELLIGENCE
Artificial intelligence is a leading technology today.
Scientists in nearly every field try to apply it because it
yields effective solutions to the problems they research,
and a well-designed system can deliver results close to what
is expected. In medicine, science, and several technologies,
for example, researchers have applied it and obtained
efficient results. The artificial intelligence used in this
work is based on neural network technology: a neuron behaves
according to the experience, or training, given to it.
Fig 1: Block Diagram for NCBRBF
VII. NEURAL NETWORK
Neural networks are widely used for learning systems and
knowledge representation. They are applied in fields
including medical diagnosis, pattern recognition, security,
fraud detection, and other knowledge discovery systems. A
neural network learning system processes information in the
same way as a biological nervous system, using an enormous
number of highly interconnected processing elements.
The term “neural network” derives from the neurons of the
human brain, because the human body acts according to the
brain's system of neurons. When something happens in front
of a person, the eyes capture it and send a message to the
brain. The brain contains a large number of neurons; it acts
according to how the neurons function, and the person's
reaction depends on the neurons' commands. An artificial
system is built in the same way: it is constructed from a
number of neurons, and its output, which is more or less the
expected output, depends on the performance of those
neurons. Such a system is called a neural network.
i. RADIAL BASIS FUNCTION (RBF)
The radial basis function is one of the methods used in
neural networks. The name comes from the shape of the
function: its surface is bell-shaped, and any horizontal cut
through the bell gives a circle, on which every point is at
the same radius from the centre. This is why it is called a
radial basis function.
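The bell shape can be made concrete with the Gaussian transfer function exp(-n^2), which is the `radbas` function used by MATLAB's radial basis layers. The Python sketch below is illustrative only: the helper name `rbf_neuron` and the scaling that makes the response 0.5 at a distance of one `spread` are assumptions for this example.

```python
import math

def radbas(n):
    """Gaussian radial basis transfer function: 1 at n = 0, falling toward 0 as |n| grows."""
    return math.exp(-n * n)

def rbf_neuron(x, center, spread):
    """Response of one neuron; depends only on the distance |x - center| (assumed scaling)."""
    b = math.sqrt(math.log(2)) / spread  # about 0.8326 / spread, so the response is 0.5 at distance == spread
    return radbas(abs(x - center) * b)

peak = rbf_neuron(5.0, center=5.0, spread=2.0)         # at the centre
half = rbf_neuron(7.0, center=5.0, spread=2.0)         # one spread away from the centre
mirror = rbf_neuron(3.0, center=5.0, spread=2.0)       # same distance on the other side
```

Because the response depends only on distance from the centre, points at equal distance on either side give the same value, which is the "same radius" property described above.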
This paper tries a new concept: PPDM combined with the RBF.
First, a network with a set of neurons is designed using the
newrb function. The function takes P and T, where P is the
input sample data and T is the target value; goal, spread,
MN, and DF are left at their default values for convenience,
although they can be changed. Each neuron handles multiple
tasks. Some sample data are given to the network as input,
and the network is trained to produce the corresponding
output for each input. If the network output for P equals T,
the network is well built. If not, different values for the
newrb parameters are tried until the output matches the
input. After this, the data set is given as input and, for
demonstration purposes, the same data are obtained as
output.
Our main aim, however, is to obtain perturbed data, so noise
is added in the network to produce perturbed values. How
much the data are modified matters: the variation must be
small, not large, so the error must be added with care. Here
the data from the database are split into intervals under a
special condition [2], and random numbers are generated per
interval. For each interval a set of random numbers is
generated and multiplied with the original data to obtain
the perturbed output. Because the values are changed only
slightly, data quality remains good. The main benefit of
building the network is that the generated output is based
on the input.
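The interval-based noise step can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the paper does not state its interval condition, so equal-width intervals are assumed here, and the function name `interval_perturb`, the set size, and the 5% change limit are all hypothetical.

```python
import random

def interval_perturb(values, n_intervals=4, set_size=3, max_change=0.05, seed=None):
    """Split the value range into intervals, draw a set of random factors per
    interval, and multiply each value by a factor from its own interval's set."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1.0  # fall back to 1.0 when all values are equal
    # One set of multipliers per interval, each within max_change of 1 (assumed range).
    factor_sets = [
        [rng.uniform(1 - max_change, 1 + max_change) for _ in range(set_size)]
        for _ in range(n_intervals)
    ]
    perturbed = []
    for v in values:
        idx = min(int((v - lo) / width), n_intervals - 1)  # clamp the maximum value into the last interval
        perturbed.append(v * rng.choice(factor_sets[idx]))
    return perturbed

# Illustrative salaries only (not taken from the paper).
salaries = [18000, 24000, 31000, 45000, 52000, 78000]
perturbed_salaries = interval_perturb(salaries, seed=7)
```

Keeping `max_change` small is what keeps each output "very near" its original; the number of outputs always equals the number of inputs.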
VIII. EXPERIMENTAL SETUP
The method is implemented on the Adult dataset. The Adult
dataset here refers to a survey taken in a particular area
of the US that describes the adults of that area. It
contains 32561 records, of which 30722 are complete. There
are 14 attributes in the database, of which we use {Age,
Work-class, Education, Hours/week, Sex, Race,
Marital-status, Salary}. {Salary} is considered the
sensitive attribute.
IX. ALGORITHM
INPUT: FILTERED DATABASE
OUTPUT: ANONYMIZED DATABASE
METHOD:
STEP 1: TAKE 'N' SAMPLE DATA RECORDS.
STEP 2: BUILD THE NETWORK
>> net = newrb(P, T, goal, spread, K, Ki);
P: INPUT
T: TARGET OUTPUT
K: MAXIMUM NUMBER OF NEURONS
Ki: NUMBER OF NEURONS TO ADD BETWEEN DISPLAYS
STEP 3: GIVE THE FILTERED DATABASE AS INPUT.
STEP 4: ADD THE ERROR SIGNAL USING THE RANDOMIZATION TECHNIQUE.
Fig 2: Neural Network Building
Fig 3.1: Original Data Flow Using RBF
Fig 3.2: Flow of Sampled Data
Fig 3.3: Flow of Perturbed Data
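The four steps can be sketched end to end in plain Python. This is an approximation under stated assumptions, not the authors' MATLAB code: instead of newrb's incremental neuron addition it places one Gaussian neuron on every sample and solves for the output weights exactly, it omits the output bias, it handles a single numeric attribute, and the names `build_network` and `perturb` and the 5% noise limit are hypothetical.

```python
import math
import random

def radbas(n):
    """Gaussian transfer function exp(-n^2)."""
    return math.exp(-n * n)

def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [A[i][:] + [y[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def build_network(P, T, spread):
    """Step 2 stand-in: one neuron per sample in P, weights fitted so net(P[i]) == T[i]."""
    b = math.sqrt(math.log(2)) / spread  # assumed scaling: response 0.5 at distance == spread
    H = [[radbas(abs(p - c) * b) for c in P] for p in P]
    weights = solve(H, T)
    def net(x):
        return sum(w * radbas(abs(x - c) * b) for w, c in zip(weights, P))
    return net

def perturb(net, database, max_change=0.05, seed=None):
    """Steps 3 and 4: pass each value through the network, then apply random noise."""
    rng = random.Random(seed)
    return [net(x) * rng.uniform(1 - max_change, 1 + max_change) for x in database]

samples = [20.0, 30.0, 40.0, 50.0, 60.0]             # Step 1: sample data (illustrative)
net = build_network(samples, samples, spread=10.0)   # Step 2: P equals T
outputs = [net(x) for x in samples]                  # network reproduces the samples
perturbed = perturb(net, samples, seed=1)            # Steps 3 and 4
```

In this sketch the network returns the training samples almost exactly, which corresponds to the "P equals T" check; values between the samples are only interpolated, so the choice of `spread` matters in practice.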
Fig 4: Sample RBF Networks with Different Values
TABLE 1: RBF COMPARISON WITH EXISTING METHODOLOGY
Parameters                    Units  Original data  AP      MP      IR      RBF
Loss of Data                  Cont.  20.399         50.525  16.481  17.857  15.481
Bias in Mean                  Cont.  39.68          39.68   41.76   39.51   25.81
Bias in Standard deviation    Cont.  15.408         18.947  33.555  15.332  10.345
Rate of Classification Error  %      189.47         13.2    15.7    0       9.2
Computational time            Sec    Nil            O(2n)   O(2n)   O(n)    O(n)+1
Regression                    Cont.  Nil            0       0.009   0.001   0.001
Privacy preservation rate     %      Nil            0.041   0.038   0       0.442
Rand Index                    %      Nil            11.3    16.4    9.1     12.4
Measure of Privacy            Cont.  Nil            35.37   15.194  2.43    40.52
The graph (Fig 4) shows the network being built: the test
samples were supplied and the expected output was obtained.
Further processing therefore uses this balanced network: the
whole database is given to the network, and the modified
output retains data quality.
X. PERFORMANCE ANALYSIS
Table 1 gives the performance of RBF against the existing
methodologies Additive Perturbation (AP), Multiplicative
Perturbation (MP), and Isometric Transformation (IR). RBF
gives the best privacy results: its privacy preservation
rate (0.442) and measure of privacy (40.52) are the highest
in the table. It also has the lowest loss of data (15.481),
and its classification error (9.2%) is below that of AP
(13.2%) and MP (15.7%), although IR reports 0. Its
computational time, O(n)+1, is lower than the O(2n) of AP
and MP. Overall, the RBF technique performs very well.
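The paper does not define how its metrics are computed. As a hedged illustration only, two of them could be read as the absolute difference between the original and perturbed data in mean and in (population) standard deviation; the function names and the sample values below are assumptions, and the numbers are unrelated to Table 1.

```python
import statistics

def bias_in_mean(original, perturbed):
    """Assumed definition: absolute difference between the two means."""
    return abs(statistics.mean(original) - statistics.mean(perturbed))

def bias_in_std(original, perturbed):
    """Assumed definition: absolute difference between the two population standard deviations."""
    return abs(statistics.pstdev(original) - statistics.pstdev(perturbed))

# Illustrative values only (not taken from the paper).
original = [10, 20, 30, 40]
perturbed = [11, 19, 31, 39]
mean_bias = bias_in_mean(original, perturbed)
std_bias = bias_in_std(original, perturbed)
```

Under these assumed definitions, smaller values mean the perturbed data keep the statistical shape of the original data more closely.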
XI. CONCLUSION
In this project, the proposed technique is the Radial Basis
Function for privacy-preserving data mining. It uses a
neural network to build a well-trained network that
generates accurate, high-quality output. The RBF metrics
were compared with those of several existing methods and
show good quality. We conclude that, among the methods
compared, the Radial Basis Function (RBF) is the best
technique for secure PPDM.
In future, the method can be enhanced by using a rotation
parameter that depends on the distance between the
sub-clusters, to improve both the quality of the data and
the preservation rate.
ACKNOWLEDGMENT
We would like to thank Sathyabama University for giving us
a platform to enhance our knowledge. We express our special
gratitude to our guide, Mrs. V. Rajalakshmi, M.Tech,
Assistant Professor at Sathyabama University, who motivated
us to prepare this paper and guided us throughout. We also
express our deepest thanks to all those who made it possible
to complete this work.
REFERENCES
[1] Chuang-Cheng Chiu and Chieh-Yuan Tsai, “A k-Anonymity
Clustering Method for Effective Data Privacy Preservation”,
ADMA 2007, LNAI 4632, pp. 89–99, Springer-Verlag Berlin
Heidelberg, 2007.
[2] Li Liu, Murat Kantarcioglu, Bhavani Thuraisingham, “The
applicability of the perturbation based privacy preserving
data mining for real-world data”, Data & Knowledge
Engineering 65 (2008) 5–21.
[3] V. Rajalakshmi, G. S. Anandha Mala, “Anonymization Based
on Nested Clustering for Privacy Preservation in Data
Mining”, Indian Journal of Computer Science and Engineering
(IJCSE), Vol. 4, No. 3, Jun-Jul 2013.
[4] Saeed Samet, Ali Miri, “Privacy-preserving
back-propagation and extreme learning machine algorithms”,
Data & Knowledge Engineering 79–80 (2012) 40–61.
[5] Slava Kisilevich, Lior Rokach, Yuval Elovici, and Bracha
Shapira, “Efficient Multidimensional Suppression for
K-Anonymity”, IEEE Transactions on Knowledge and Data
Engineering, Vol. 22, No. 3, March 2010.
[6] Stanley R. M. Oliveira, Osmar R. Zaiane, “Data
Perturbation by Rotation for Privacy-Preserving Clustering”,
Technical Report TR 04-17, August 2004.
[7] L. Sweeney, “Achieving k-anonymity privacy protection
using generalization and suppression”, 2002.
[8] Tiancheng Li, Ninghui Li, Jian Zhang, and Ian Molloy,
“Slicing: A New Approach for Privacy Preserving Data
Publishing”, IEEE Transactions on Knowledge and Data
Engineering, Vol. 24, No. 3, March 2012.
[9] Xiao-Bai Li and Sumit Sarkar, “A Tree-Based Data
Perturbation Approach for Privacy-Preserving Data Mining”,
IEEE Transactions on Knowledge and Data Engineering,
Vol. 18, No. 9, September 2006.
[10] Yingjie Wu, Zhihui Sun, Xiaodong Wang, “Privacy
Preserving k-Anonymity for Re-publication of Incremental
Datasets”, 2009 World Congress on Computer Science and
Information Engineering.