International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 9- Sep 2013
A Novel Privacy Preserving Supervised Learning
Approach in Data Mining
K Raghaveswara Rao, V Sangeeta
M.Tech Scholar, Associate Professor
Computer Science and Engineering at Pydah College of Engineering and Technology, Visakhapatnam
Abstract:- In this paper we propose an efficient
privacy-preserving supervised learning approach based
on ID3 and the Advanced Encryption Standard (AES).
The main objective is to provide security while data is
mined over a network: our architecture preserves the
confidentiality of sensitive data throughout the mining
process. After receiving the mined results from the
analyst, the data owner can securely obtain the
classified results without loss of data integrity.
I. INTRODUCTION
Classification is the process of grouping together
documents or data that have similar properties or are
related. Once data and documents are classified, they
become easier to understand, and we can infer logic
from the classification. Most importantly,
classification makes new data easier to sort and makes
retrieval faster, with better results.
Dewey Decimal Classification is the system most widely
used in libraries. It is hierarchical: there are ten
parent classes, each divided into ten divisions, which
are in turn divided into ten sections. Each book is
assigned a number according to its class, division and
section, alphabetically. Dewey Decimal Classification is
very successful in libraries, but unfortunately it cannot
be applied to Information Retrieval on the web. Somebody
would need to maintain a central catalogue of all the
documents on the web, and whenever a new document was
added a central committee would have to look at it,
classify it, assign it a number and publish it on the
web. This strongly violates the way the internet works:
an authority controlling the contents of the web would
restrict the amount of data that can be added to it. We
need a web that allows everyone to upload their content,
together with a machine learning technique that finds
new data and classifies it as it arrives.
Confidentiality issues in data mining. A key problem
that arises in any en masse collection of data is that of
confidentiality. The need for privacy is sometimes due
to law (e.g., for medical databases) or can be motivated
by business interests. However, there are situations
ISSN: 2231-5381
where the sharing of data can lead to mutual gain. A
key utility of large databases today is research, whether
it be scientific, or economic and market oriented. Thus,
for example, the medical field has much to gain by
pooling data for research; as can even competing
businesses with mutual interests. Despite the potential
gain, this is often not possible due to the confidentiality
issues which arise. We address this question and show
that highly efficient solutions are possible. Our scenario
is the following: Let P1 and P2 be parties owning
(large) private databases D1 and D2. The parties wish to
apply a data-mining algorithm to the joint databases D1,
D2 without revealing any unnecessary information
about their individual databases. That is, the only
information learned by P1 about D2 is that which can
be learned from the output of the data mining algorithm,
and vice versa. We do not assume any “trusted” third
party who computes the joint output.
II. RELATED WORK
Previous work in privacy-preserving data mining has
addressed two issues [1]. In one, the aim is to preserve
customer privacy by perturbing the data values [4]. In
this scheme, random noise is introduced to distort
sensitive values, and the distribution of the random
noise is used to generate a new data distribution that
is close to the original data distribution without
revealing the original data values. The estimated
original distribution is used to reconstruct the data,
and data mining techniques such as classifiers and
association rules are applied to the reconstructed data
set. Later refinements of this approach tightened the
estimation of original values based on the distorted
data [3]. The data distortion approach has also been
applied to Boolean values.
Perturbation methods and their privacy protection have
been criticized because, for some methods, private
information may be derived from the reconstruction step
[10]. In contrast to the original additive-noise method
in [4], many distinctive perturbation methods have been
proposed. One important category is multiplicative
perturbation. From a geometric point of view,
multiplying the original data values by a random noise
matrix amounts to rotating the original data
http://www.ijettjournal.org
matrix, so this method is also called rotation-based
perturbation. In [5], the authors give a sound proof of
“rotation-invariant classifiers” to show that some data
mining tools can be applied directly to rotation-based
perturbed data. In later work [11], Liu et al. proposed
multiplicative random projection, which provides
stronger privacy protection. There are other interesting
techniques, such as the condensation-based approach [2]
and matrix decomposition [21]. As pointed out in [13],
recent research on perturbation-based approaches applies
data mining techniques directly to the perturbed data,
skipping the reconstruction step. The choice of a
suitable data mining technique is determined by the way
in which the noise has been introduced. To our
knowledge, very few works focus on mapping or modifying
data mining techniques to meet the needs of perturbed
data.
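To make the geometric argument concrete, the following minimal Python sketch rotates a few two-dimensional records by a secret angle and checks that pairwise Euclidean distances are unchanged, which is the property that allows distance-based classifiers to run directly on rotated data. The sample points, the angle `theta` and the helper name `rotate` are illustrative assumptions and are not taken from [5] or [11].

```python
import math

def rotate(points, theta):
    """Rotate 2-D points about the origin by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Hypothetical records with two numeric attributes each.
original = [(1.0, 2.0), (4.0, 6.0), (0.0, -1.0)]
theta = 0.7  # secret rotation angle held by the data owner (assumed value)
perturbed = rotate(original, theta)

# Pairwise distances before and after the rotation.
pairs = [(0, 1), (0, 2), (1, 2)]
before = [math.dist(original[i], original[j]) for i, j in pairs]
after = [math.dist(perturbed[i], perturbed[j]) for i, j in pairs]

for b, a in zip(before, after):
    print(f"{b:.6f} -> {a:.6f}")
```

The individual coordinates change, but every printed pair of distances agrees up to floating-point error.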
The other approach uses cryptographic tools to build
data mining models. For example, in [11], the goal is to
securely build an ID3 decision tree where the training
set is distributed between two parties. Different
solutions have been given to address different data
mining problems using cryptographic techniques (e.g.,
[6, 9, 19]). This approach treats privacy-preserving
data mining as a special case of secure multi-party
computation; it aims not only to preserve individual
privacy but also to prevent leakage of any information
other than the final result. In this paper we introduce
an efficient privacy-preserving cryptographic approach
for classifying datasets without exposing the user's
sensitive information to the external world.
III. PROPOSED WORK
In this paper we propose a cryptographic classification
approach. The architecture shown in Figure 1 works as
follows.
Abstract view of the proposed work:
Step1: Read the input synthetic training and testing
datasets.
Step2: Forward the datasets to the AES encoder.
Step3: Forward the unrealized (encrypted) datasets to
the analyst.
Step4: The analyst applies ID3 classification based on
information gain, computed in terms of entropy.
Step5: ID3 classifies the data by analyzing the testing
data against the training data and returns the
classified (tested) data.
Step6: The analyst forwards the cipher classification
rules to the data owner.
Step7: After receiving the cipher classified data, the
data owner forwards it to the AES decoder.
Step8: After decryption, the data owner receives the
plain classified data.
A) Initialize Training Datasets for Machine Learning
A dataset is a collection of tuples, each described by a
set of attributes with their possible values and a class
label. It is supplied to the classification process so
that the behaviour of the testing set can be analyzed
with a machine learning approach. A synthetic dataset
can be gathered for the classification. Initially the
dataset is forwarded to the encoder, and the encoder
returns the cipher dataset.
B) Unrealized Dataset Creation
Usually data is passed to analysts for machine learning,
but this raises a privacy issue regarding confidential
information. In this paper we therefore use the AES
algorithm to address the privacy issue. After applying
this mechanism, the dataset becomes an unrealized
dataset, i.e. a cipher dataset that is passed to the
analyst for classification instead of the plain
sensitive or confidential information.
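The property the analyst relies on is that equal plain values map to equal cipher values, so that attribute tests still work on the unrealized dataset. The sketch below illustrates only that property. It is not AES: the paper uses the built-in .NET AES implementation, whereas the Python standard library has no AES, so a keyed HMAC-SHA256 token plus an owner-side codebook is swapped in here as a stand-in. The function names, the key, the choice to leave attribute names in plain text, and the sample rows are all hypothetical.

```python
import hashlib
import hmac

def encode_value(key, attribute, value):
    """Deterministic keyed token for one cell (stand-in for AES, not AES itself)."""
    message = f"{attribute}={value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]

def encode_dataset(key, rows):
    """Return (cipher_rows, codebook).

    The codebook maps each token back to its plain value and never leaves
    the data owner; only cipher_rows are sent to the analyst.
    """
    codebook = {}
    cipher_rows = []
    for row in rows:
        cipher_row = {}
        for attribute, value in row.items():
            token = encode_value(key, attribute, value)
            codebook[token] = value
            cipher_row[attribute] = token
        cipher_rows.append(cipher_row)
    return cipher_rows, codebook

# Hypothetical plain dataset held by the data owner.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]
cipher_rows, codebook = encode_dataset(b"owner-secret-key", rows)
```

Rows 0 and 1 share the same `outlook` token while row 2 gets a different one, so an equality-based learner sees the same partition of the data as it would on the plain values.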
Figure 1. Privacy preserving Architecture
C) AES Algorithm
Our approach uses an advanced cryptographic algorithm
for secure data transmission. The algorithm works mainly
through substitution and affine transformation
techniques:
1. Key Expansion: round keys are derived from the
cipher key using the key schedule.
2. Initial Round
   1. Add Round Key: each byte of the state is
      combined with the round key using bitwise XOR.
3. Rounds
   1. Sub Bytes: a non-linear substitution step
      where each byte is replaced with another
      according to a lookup table.
   2. Shift Rows: a transposition step where each
      row of the state is shifted cyclically a
      certain number of steps.
   3. Mix Columns: a mixing operation which
      operates on the columns of the state,
      combining the four bytes in each column.
   4. Add Round Key
4. Final Round (no Mix Columns)
   1. Sub Bytes
   2. Shift Rows
   3. Add Round Key
The sub-bytes, shift-rows, mix-columns and add-round-key
operations are implemented as described in [21]; for the
implementation we used the built-in algorithm from the
.NET namespaces.
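The four round operations listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the .NET implementation the paper relies on. It omits the key schedule and is not a complete cipher. Storing the state as four rows of four bytes, and all function names, are assumptions made for readability.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def sub_byte(byte):
    """S-box entry: multiplicative inverse in GF(2^8), then the affine transform."""
    inv = 1
    for _ in range(254):  # byte**254 is the inverse; 0 stays 0
        inv = gf_mul(inv, byte)
    out = 0x63
    for shift in range(5):  # XOR of inv rotated left by 0..4 bits
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out

def sub_bytes(state):
    """Apply the S-box to every byte of the state."""
    return [[sub_byte(b) for b in row] for row in state]

def shift_rows(state):
    """Rotate row r of the state left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def mix_column(col):
    """Multiply one 4-byte column by the fixed MixColumns matrix."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

def mix_columns(state):
    """Apply mix_column to each column of a row-major state."""
    cols = [mix_column(list(col)) for col in zip(*state)]
    return [list(row) for row in zip(*cols)]

def add_round_key(state, round_key):
    """XOR every state byte with the matching round-key byte."""
    return [[s ^ k for s, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]

def middle_round(state, round_key):
    """One full round: SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    return add_round_key(mix_columns(shift_rows(sub_bytes(state))), round_key)
```

Because AddRoundKey is a plain XOR, applying it twice with the same round key returns the original state, which is why the same step also appears in decryption.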
D) Classification with ID3
ID3 is an efficient machine learning approach for
building decision trees, which are used for
classification. The tree is constructed from
attribute-based entropy or information gain values. The
classification rules can then be analyzed efficiently by
running the testing data against the tree built from the
training datasets.
E) Retrieval of Original Classified Results
After the classification results are generated, they are
passed to the data owner, where the administrator can
perform attribute-oriented decryption on the result set.
The original dataset can be reconstructed by the
decoder, and the classified rules are finally obtained
at the data owner's end.
a. Methodology
Initially the dataset is forwarded to the encoder, and
the encoder returns the cipher dataset. Data must be
passed to analysts for machine learning, but the
confidential information it contains needs privacy
protection, so in this paper we implemented the AES
algorithm to preserve privacy. After applying the
algorithm, the dataset is converted into an unrealized
(cipher) dataset. The cipher dataset is passed to the
analyst for classification instead of the plain data.
The analyst determines class labels based on information
gain, which is then used to construct the decision tree.
Rules in cipher text format are formed from the decision
tree. The rules are then sent back to the data owner,
who decrypts them and extracts the original data set.
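The last two steps, forming rules in cipher form and decrypting them at the data owner, can be sketched as follows. The nested-dict tree layout, the short tokens, and the dictionary lookup standing in for the AES decoder are all hypothetical; in the paper's setting `decrypt` would be the owner's AES decryption routine.

```python
def tree_to_rules(tree, conditions=()):
    """Flatten a nested-dict decision tree into (conditions, label) rules."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]  # leaf: emit one rule
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(
                tree_to_rules(subtree, conditions + ((attribute, value),))
            )
    return rules

def decode_rules(rules, decrypt):
    """Decrypt every attribute value and class label in a rule list."""
    return [
        ([(attribute, decrypt(value)) for attribute, value in conditions],
         decrypt(label))
        for conditions, label in rules
    ]

# Hypothetical cipher tree returned by the analyst; tokens are made up.
cipher_tree = {"windy": {"a1f3": "77b2", "9c0e": "e4d8"}}

# Owner-side mapping standing in for the AES decoder (assumption).
codebook = {"a1f3": "no", "9c0e": "yes", "77b2": "yes", "e4d8": "no"}

cipher_rules = tree_to_rules(cipher_tree)
plain_rules = decode_rules(cipher_rules, codebook.__getitem__)
for conditions, label in plain_rules:
    print(conditions, "=>", label)
```

The analyst only ever handles `cipher_tree` and `cipher_rules`; the readable rules exist only after the owner applies the decoder.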
b. Experimental Analysis
ID3 builds a decision tree from a fixed set of examples.
The resulting tree is used to classify future samples.
Each example has several attributes and belongs to a
class (such as yes or no). The leaf nodes of the decision
tree contain the class name, whereas a non-leaf node is a
decision node. A decision node is an attribute test,
with each branch (leading to another decision tree)
being a possible value of the attribute. ID3 uses
information gain to decide which attribute goes into a
decision node. The advantage of learning a decision tree
is that a program, rather than a knowledge engineer,
elicits knowledge from an expert.
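A minimal Python sketch of this procedure is given below. It assumes the standard ID3 definition in which each child set's entropy is weighted by its share of the examples, represents the tree as nested dictionaries, and uses a small hypothetical dataset. Because the learner only compares values for equality, the same code runs unchanged on cipher tokens.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, attribute, target):
    """Information gain of splitting rows on attribute."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - remainder

def id3(rows, attributes, target):
    """Build a decision tree as nested dicts; leaves are class labels."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:
        return labels[0]  # pure node
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority vote
    best = max(attributes, key=lambda a: gain(rows, a, target))
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, rest, target)
    return tree

def classify(tree, row):
    """Follow attribute tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]  # KeyError on unseen values
    return tree

# Hypothetical training examples.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
tree = id3(rows, ["outlook", "windy"], "play")
prediction = classify(tree, {"outlook": "rain", "windy": "no"})
```

In this toy dataset `windy` separates the classes perfectly, so it is chosen as the root and `outlook` never appears in the tree. `classify` raises a `KeyError` for an attribute value that was absent from training, which mirrors the attribute-matching limitation of ID3 noted in the conclusion.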
Gain measures how well a given attribute separates
training examples into the targeted classes. The
attribute with the highest information (information
being the most useful for classification) is selected.
In order to define gain, we first borrow an idea from
information theory called entropy. Entropy measures the
amount of information in an attribute. The entropy of a
sample $S$, which measures its homogeneity, is
$$E(S) = -\sum_{i} p_i \log_2 p_i$$
where $p_i$ is the proportion of examples in $S$ that
belong to class $i$. Entropy is used to measure the
information gain with respect to an attribute $A$:
$$Gain(A) = E(\text{current set}) - \sum E(\text{all child sets})$$
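As a worked illustration, consider a hypothetical sample $S$ of 14 examples with 9 labelled yes and 5 labelled no, and a two-valued attribute $A$ that splits it into $S_1$ (6 yes, 2 no) and $S_2$ (3 yes, 3 no). These counts are invented for the example and do not come from the paper's dataset. The calculation assumes the standard ID3 reading in which each child set's entropy is weighted by its share of the examples; the formula as printed in the paper does not spell out the weights.

```latex
\begin{align*}
E(S)   &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
E(S_1) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
E(S_2) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.000 \\
Gain(A) &= E(S) - \sum_{v} \frac{|S_v|}{|S|}\,E(S_v)
         = 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.000) \approx 0.048
\end{align*}
```

A gain this close to zero means the attribute barely separates the classes, so ID3 would prefer any attribute with a larger value.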
The input synthetic training data set is shown in
Figure 2. The unrealized data set forwarded to the
analyst is shown in Figure 3. ID3 classifies the data by
analyzing the testing data against the training data
with a decision tree; the tree and the eligible data are
shown in Figure 4. The analyst forwards the cipher
classification rules to the data owner. After receiving
the cipher classified data, the data owner forwards it
to the AES decoder. The plain classified data that the
data owner receives after decryption is shown in
Figure 5.
Figure 2. Original Data at Owner
Figure 3. Unrealized Dataset
Figure 4. Tree and Eligible Data
The final eligible data after decryption at the data owner's end is shown below.
Figure 5. Eligible Data after Classification and Decryption
Comparative Analysis:
Recent proposals for privacy preservation during classification
in data mining mostly follow one of two approaches:
perturbation and randomization-based approaches, and
cryptographic approaches. In the first approach, fake values
are injected into the real dataset to convert it into an
unrealized dataset. In the cryptographic approach, the plain
data is converted to cipher data using a cryptographic
algorithm.
The main drawback of the previous (perturbation) approach is
data retrievability: the classified data and rules retrieved
from the analyst may not be optimal because of the fake values
imputed into the real dataset, and maintaining the details of
the fake imputation rules for the entire dataset (both training
and testing datasets) is a time-consuming process. Our proposed
approach provides more security against third parties, although
its computational complexity depends on the number of records
in the datasets. For optimal security we therefore adopt the
cryptographic approach with AES.
IV. CONCLUSION & FUTURE WORK
In this paper we proposed an efficient privacy preservation
technique for the classification of unreal datasets. It
protects the data owner from unauthorized access and privacy
issues. Our proposed approach works efficiently without
violating the classification properties. Meanwhile, an accurate
decision tree can be built directly from the unreal data sets,
and the results remain accurate even though classification is
applied to the cipher dataset.
One shortcoming of the ID3 algorithm is its inability to handle
noisy data, which leads to overfitting. A second drawback of
ID3 is that the attributes in the training and testing datasets
must match: if any attribute is missing, classification fails
or may produce incorrect predictions. Computational complexity
is another main drawback when the dataset is large. Our
research on privacy preservation and optimality can be improved
with validation-set pruning and some fuzziness for the initial
problem. The computational complexity of AES can be reduced by
changing the traditional GPU.
REFERENCES
[1] Pui K. Fong and Jens H. Weber-Jahnke, “Privacy
Preserving Decision Tree Learning Using Unrealized Data
Sets,” Senior Member, IEEE Computer Society, 2012.
[2] S. Ajmani, R. Morris, and B. Liskov, “A Trusted
Third-Party Computation Service,” Technical Report
MIT-LCS-TR-847, MIT, 2001.
[3] S.L. Wang and A. Jafari, “Hiding Sensitive Predictive
Association Rules,” Proc. IEEE Int’l Conf. Systems, Man
and Cybernetics, pp. 164-169, 2005.
[4] R. Agrawal and R. Srikant, “Privacy Preserving Data
Mining,” Proc. ACM SIGMOD Conf. Management of Data
(SIGMOD ’00), pp. 439-450, May 2000.
[5] Q. Ma and P. Deng, “Secure Multi-Party Protocols for
Privacy Preserving Data Mining,” Proc. Third Int’l Conf.
Wireless Algorithms, Systems, and Applications (WASA
’08), pp. 526-537, 2008.
[6] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman,
“A Pristine Clean Cabalistic Foruity Strategize Based
Approach for Incremental Data Stream Privacy Preserving
Data Mining,” Proc. IEEE Second Int’l Advance
Computing Conf. (IACC), pp. 410-415, 2010.
[7] N. Lomas, “Data on 84,000 United Kingdom Prisoners
is Lost,” Retrieved Sept. 12, 2008,
http://news.cnet.com/83011009_3-10024550-83.html, Aug. 2008.
[8] BBC News, “Brown Apologises for Records Loss,”
Retrieved Sept. 12, 2008,
http://news.bbc.co.uk/2/hi/uk_news/politics/7104945.stm,
Nov. 2007.
[9] D. Kaplan, “Hackers Steal 22,000 Social Security
Numbers from Univ. of Missouri Database,” Retrieved Sept.
2008,
http://www.scmagazineus.com/Hackers-steal22000-Social-Security-numbers-from-Univ.-of-Missouridatabase/article/34964/,
May 2007.
[10] D. Goodin, “Hackers Infiltrate TD Ameritrade Client
Database,” Retrieved Sept. 2008,
http://www.channelregister.co.uk/2007/09/15/ameritrade_database_burgled/,
Sept. 2007.
[11] L. Liu, M. Kantarcioglu, and B. Thuraisingham,
“Privacy Preserving Decision Tree Mining from Perturbed
Data,” Proc. 42nd Hawaii Int’l Conf. System Sciences
(HICSS ’09), 2009.
[12] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F.
Dong, “Three New Approaches to Privacy-Preserving Add
to Multiply Protocol and Its Application,” Proc. Second
Int’l Workshop Knowledge Discovery and Data Mining,
(WKDD ’09), pp. 554-558, 2009.
[13] J. Vaidya and C. Clifton, “Privacy Preserving
Association Rule Mining in Vertically Partitioned Data,”
Proc. Eighth ACM SIGKDD Int’l Conf. Knowledge
Discovery and Data Mining (KDD ’02), pp. 23-26, July
2002.
[14] M. Shaneck and Y. Kim, “Efficient Cryptographic
Primitives for Private Data Mining,” Proc. 43rd Hawaii
Int’l Conf. System Sciences (HICSS), pp. 1-9, 2010.
[15] C. Aggarwal and P. Yu, Privacy-Preserving Data
Mining: Models and Algorithms. Springer, 2008.
[16] L. Sweeney, “k-Anonymity: A Model for Protecting
Privacy,” Int’l J. Uncertainty, Fuzziness and
Knowledge-based Systems, vol. 10, pp. 557-570, May 2002.
[17] J. Dowd, S. Xu, and W. Zhang, “Privacy-Preserving
Decision Tree Mining Based on Random Substitutions,”
Proc. Int’l Conf. Emerging Trends in Information and
Comm. Security (ETRICS ’06), pp. 145-159, 2006.
[18] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh,
“Preservation of Patterns and Input-Output Privacy,” Proc.
IEEE 23rd Int’l Conf. Data Eng., pp. 696-705, Apr. 2007.
[19] S. Russell and P. Norvig, Artificial Intelligence: A
Modern Approach, 2nd ed. Prentice-Hall, 2002.
[20] P.K. Fong, “Privacy Preservation for Training Data
Sets in Database: Application to Decision Tree Learning,”
master’s thesis, Dept. of Computer Science, Univ. of
Victoria, 2008.
[21] http://en.wikipedia.org/wiki/Advanced_Encryption_Standard
V Sangeeta completed her M.Tech at Andhra University,
Visakhapatnam, in 2006. She is currently working as an
Associate Professor and Head of the Department of
Computer Science and Engineering at Pydah College of
Engineering and Technology, JNTUK University. She is
pursuing her Ph.D. degree in computer science at Andhra
University. Her research focuses on Data Mining and
Warehousing. Her areas of interest include Computer
Networks, Network Security, Operating Systems and
Computer Organization.
K Raghaveswara Rao completed his MCA at Aditya Institute
of Technology and Management in 2007. He is pursuing an
M.Tech in Computer Science and Engineering at Pydah
College of Engineering and Technology, Visakhapatnam
Dist, AP. His areas of interest include Data Mining and
Warehousing, Computer Networks and Computer Organization.