International Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue7- July 2013
A Cryptographic Privacy Preserving Approach over Classification
Sivasankar Vakkalagadda*, Satyanarayana Mummana#
*Final M.Tech Student, #Assistant Professor
Department of CSE, Avanthi Institute of Engineering & Technology, Visakhapatnam, Andhra Pradesh
Abstract: We propose an efficient privacy-preserving technique
for data classification. We introduce a cryptography-based
approach that protects centralized sample data sets used for
decision tree mining. Privacy preservation is applied to
sanitize the samples before their release to third parties, in
order to mitigate the threat of inadvertent disclosure. In
contrast to other sanitization approaches, our approach does
not affect the accuracy of the data mining results. The
decision tree can be built directly from the pre-processed
data sets, so the originals do not need to be reconstructed.
Moreover, the approach can be applied at any time during the
data collection process, so that privacy protection is in
effect even while samples are still being collected.
I. INTRODUCTION
Explosive progress in networking, storage, and processor
technologies has led to the creation of ultra-large databases
that record an unprecedented amount of transactional
information. In tandem with this dramatic growth in digital
data, concerns about informational privacy have emerged
globally [1] [2] [3].
Privacy issues are further exacerbated now that the
World Wide Web makes it easy for new data to be
automatically collected and added to databases [4] [5] [6]
[17]. The concerns over massive collection of data
naturally extend to the analytic tools applied to that
data. Data mining, with its promise to efficiently
discover valuable, non-obvious information from large
databases, is particularly vulnerable to misuse [9]
[10] [18]. A fruitful direction for future research in data
mining will be the development of techniques that
incorporate privacy concerns. Specifically, we address the
following question: since the primary task in data
mining is the development of models about aggregated data,
can we develop accurate models without access to precise
information in individual data records? The underlying
assumption is that a person will be willing to selectively
divulge information in exchange for the value such models can
provide. Examples of the value provided include filtering to
weed out unwanted information, better search results with
less effort, and automatic triggers [11]. A recent survey of
web users
[12] classified 17% of respondents as privacy
fundamentalists who will not provide data to a web site even if
privacy protection measures are in place. However, the
concerns of 56% of respondents, constituting the
pragmatic majority, were significantly reduced by the
presence of privacy protection measures. The remaining
27% were marginally concerned and generally willing to
provide data to web sites, although they often expressed a
mild general concern about privacy. Another recent survey
of web users [Wes99] found that 86% of respondents
believe that participation in information-for-benefits
programs is a matter of individual privacy choice. A
resounding 82% said that having a privacy policy would
matter; only 14% said that it was not important as long as
they got a benefit. Furthermore, people are not equally
protective of every field in their data records [14] [16].
Specifically, a person may not divulge the values of
certain fields at all, may not mind giving true values of
certain fields, and may be willing to give modified rather
than true values of certain other fields. Given a population
that satisfies the above assumptions, we address the concrete
problem of building decision-tree classifiers [14] [15] and
show that it is possible to develop accurate models while
respecting users' privacy concerns. Classification is one of
the most used tasks in data mining. Decision-tree classifiers
are relatively fast and efficient, yield comprehensible
models, and obtain similar and sometimes better accuracy than other
classification methods [13]. We introduce a new
cryptography-based approach that
protects centralized sample data sets used for decision
tree mining.
II. RELATED WORK
Previous work in privacy-preserving data mining has
addressed two issues. In one, the aim is to preserve
customer privacy by perturbing the data values [1]. In this
scheme, random noise is introduced to distort sensitive
values, and the distribution of the noise is used to
estimate a data distribution that is close to the
original data distribution without revealing the original
data values. The estimated original data distribution is used
to reconstruct the data, and data mining techniques, such as
classifiers and association rules, are applied to the
reconstructed data set. Later refinements of this approach
have tightened the estimation of original values based on the
distorted data [2]. The data distortion approach has also
been applied to Boolean values.
Perturbation methods and their privacy protection
have been criticized because, for some methods, private
information may be derived from the reconstruction step [9].
In contrast to the original additive-noise method in [1], many
distinct perturbation methods have been proposed. One
important category is multiplicative perturbation.
From a geometric point of view, multiplying
the original data values by a random noise matrix
rotates the matrix representation of the original data, so it is
also called rotation-based perturbation. In [4], the authors
give a sound proof of “rotation-invariant classifiers” to
show that some data mining tools can be applied directly to
rotation-perturbed data. In later work [11], Liu et
al. proposed multiplicative random projection, which
provides enhanced privacy protection. There are
other interesting techniques, such as the condensation-based
approach [10], matrix decomposition, and so on. As
pointed out in [12], this recent research on perturbation-based
approaches applies data mining techniques directly
to the perturbed data, skipping the reconstruction step.
The choice of a suitable data mining technique is determined
by the way in which the noise has been introduced. To our
knowledge, very few works focus on mapping or
modifying the data mining techniques to meet the
needs of perturbed data. The other approach uses
cryptographic tools to build data mining models. For
example, in [10], the goal is to securely build an ID3
decision tree where the training set is distributed between
two parties. Different solutions have been given to address
different data mining problems using cryptographic
techniques (e.g., [6, 8, 18]). This approach treats
privacy-preserving data mining as a special case of secure
multi-party computation; it not only aims to preserve
individual privacy but also tries to prevent leakage of any
information other than the final result.
In this paper we introduce an efficient privacy-preserving
cryptographic approach for classifying datasets
without exposing users' sensitive information
to the external world.
III. PROPOSED WORK
In this paper we propose a cryptographic
classification approach.
Figure 1. Privacy-preserving architecture. Diagram labels: Data Owner, Plain Dataset, Encoder, Cipher Dataset, Analyst, Classifier, Cipher Classification Rules (CC rules), Decoder, Original Rules.
The architecture works as follows.
Initialize Training Datasets for Machine Learning:
A dataset is a collection of tuples over a set of
attributes, each with a set of possible values,
together with class labels. It is supplied to the classification
process so that the behaviour of the testing set can be analyzed
with a machine learning approach. A synthetic dataset can be used
for the classification experiments. Initially the dataset is forwarded
to the encoder, and the encoder returns the cipher dataset.
Unrealized Dataset Creation
Usually data is passed to analysts for
machine learning purposes, but this raises a privacy
issue regarding confidential information. In this
paper we therefore use the AES algorithm to address the privacy issue.
After applying this mechanism, the dataset becomes an
unrealized dataset, i.e., the cipher dataset is passed to the
analyst for classification instead of the plain sensitive or
confidential information.
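The attribute-wise encoding step can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: the Python standard library has no AES, so a keyed HMAC-SHA256 token stands in for the AES ciphertext, and the data owner keeps a token-to-value table for later decoding. The stand-in shares the one property the classifier depends on, namely that equal plain values of an attribute always map to equal cipher values (as AES does when used deterministically under a fixed key; the paper does not state the mode). The names `SECRET_KEY`, `encode_value`, `encode_dataset` and the sample rows are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"owner-secret-key"  # hypothetical key, held only by the data owner


def encode_value(attribute, value, key=SECRET_KEY):
    """Deterministic keyed token for one attribute value (stand-in for AES)."""
    message = f"{attribute}={value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]


def encode_dataset(rows, key=SECRET_KEY):
    """Return (cipher_rows, decode_table).

    cipher_rows keeps the attribute names but replaces every value by its token.
    decode_table maps (attribute, token) -> plain value and stays with the owner.
    """
    cipher_rows, decode_table = [], {}
    for row in rows:
        cipher_row = {}
        for attribute, value in row.items():
            token = encode_value(attribute, value, key)
            cipher_row[attribute] = token
            decode_table[(attribute, token)] = value
        cipher_rows.append(cipher_row)
    return cipher_rows, decode_table


# Hypothetical plain dataset at the data owner.
plain_rows = [
    {"age": "young", "income": "high", "eligible": "no"},
    {"age": "young", "income": "low", "eligible": "no"},
    {"age": "senior", "income": "high", "eligible": "yes"},
]
cipher_rows, decode_table = encode_dataset(plain_rows)
```

Only `cipher_rows` would be handed to the analyst; `decode_table` and the key remain at the data owner.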
Classification with ID3:
ID3 is an efficient machine learning
approach for building decision trees, which
are used for classification. The tree is
constructed from the entropy-based
information gain of the attributes. The
classification rules are then evaluated by running the
testing data through the tree built from the training dataset.
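The ID3 construction described above can be sketched as follows. This is a textbook-style ID3 under the assumption that every attribute is categorical and compared only for equality, which is why it can run on opaque cipher tokens as well as on plain values. The function names and the four sample rows (with made-up tokens such as `a1`) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the current set minus the size-weighted entropy of its child sets."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def id3(rows, attributes, target):
    """Build a decision tree as nested dicts: {attribute: {value: subtree_or_label}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:      # pure node: return the class label as a leaf
        return labels[0]
    if not attributes:             # nothing left to split on: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree


# Hypothetical cipher dataset: every value is an opaque token.
rows = [
    {"age": "a1", "income": "b1", "eligible": "c1"},
    {"age": "a1", "income": "b2", "eligible": "c1"},
    {"age": "a2", "income": "b1", "eligible": "c2"},
    {"age": "a2", "income": "b2", "eligible": "c2"},
]
tree = id3(rows, ["age", "income"], "eligible")
```

In this toy dataset `age` separates the classes completely, so the tree has a single decision node on `age` with two leaves.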
Retrieval of Original Classified Results
After the classification results are generated, they are
passed to the data owner, where the administrator performs
attribute-oriented decryption of the result set. The original
data set is reconstructed by the decoder, and the classification
rules are finally obtained at the data owner end.
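A minimal sketch of the attribute-oriented decoding step is given below. It assumes, hypothetically, that the analyst returns the cipher tree as nested dictionaries and that the data owner holds a table mapping each (attribute, cipher value) pair back to its plain value; with AES the table lookup would be replaced by decryption under the owner's key. The tokens and attribute names are made up for illustration.

```python
def decode_tree(tree, decode_table, target):
    """Recursively replace cipher tokens in a nested-dict tree by plain values.

    tree:         {attribute: {cipher_value: subtree_or_cipher_label}}
    decode_table: {(attribute, cipher_value): plain_value}, held by the data owner
    target:       name of the class attribute, used to decode leaf labels
    """
    if not isinstance(tree, dict):  # leaf: a cipher class label
        return decode_table[(target, tree)]
    attribute, branches = next(iter(tree.items()))
    return {
        attribute: {
            decode_table[(attribute, token)]: decode_tree(subtree, decode_table, target)
            for token, subtree in branches.items()
        }
    }


# Hypothetical cipher rules received from the analyst.
cipher_tree = {"age": {"a1": "c1", "a2": {"income": {"b1": "c2", "b2": "c1"}}}}

# Hypothetical table kept by the data owner.
decode_table = {
    ("age", "a1"): "young",
    ("age", "a2"): "senior",
    ("income", "b1"): "high",
    ("income", "b2"): "low",
    ("eligible", "c1"): "no",
    ("eligible", "c2"): "yes",
}

plain_tree = decode_tree(cipher_tree, decode_table, "eligible")
```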
Gain measures how well a given attribute separates the
training examples into the target classes. The attribute with the
highest information gain is selected. To define gain,
we first borrow an idea from information theory called
entropy, which measures the amount of information in an
attribute.
Entropy is used to calculate the homogeneity of a
sample.
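For reference, the standard Shannon entropy used by ID3 can be written as below. This is the usual textbook form and is assumed here to be the one intended; the symbols S (the sample), c (the number of classes) and p_i (the fraction of examples in S that belong to class i) are notation introduced for this illustration.

```latex
E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
% Worked example: a sample with 9 "yes" and 5 "no" examples
E(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940
```

A completely homogeneous sample has entropy 0, and a two-class sample split evenly has entropy 1.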
Experimental Analysis
ID3 builds a decision tree from a fixed set of sanitized
examples, and the resulting tree is used to classify future
samples. Each example has several attributes and belongs to
a class (such as a yes or no decision label). The leaf nodes of
the decision tree contain the class name, whereas a non-leaf
node is a decision node: an
attribute test in which each branch (to another decision tree)
is a possible value of the attribute. ID3 uses
information gain to decide which attribute goes into
a decision node. The advantage of learning a decision tree
is that a program, rather than a knowledge engineer, elicits
knowledge from an expert.
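The tree structure described above (leaves hold class names, decision nodes hold an attribute test with one branch per value) can be sketched as a nested dictionary that is walked from the root. The representation, the `classify` helper, and the example tree are hypothetical choices made for illustration.

```python
def classify(tree, sample, default=None):
    """Walk a nested-dict decision tree until a leaf (class name) is reached.

    A decision node is {attribute: {value: subtree_or_class_name}}.
    Returns `default` if the sample has a value with no matching branch.
    """
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        value = sample.get(attribute)
        if value not in branches:
            return default
        node = branches[value]
    return node


# Hypothetical tree: test "age" first, then "income" for senior applicants.
tree = {"age": {"young": "no", "senior": {"income": {"high": "yes", "low": "no"}}}}

label_a = classify(tree, {"age": "senior", "income": "high"})
label_b = classify(tree, {"age": "young", "income": "high"})
label_c = classify(tree, {"age": "middle", "income": "high"})
```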
Figure2.
Information gain is measured with respect to the attributes:
$Gain(A) = E(\text{Current set}) - \sum E(\text{all child sets})$
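In the usual ID3 formulation, the child-set entropies are weighted by the fraction of examples that reach each child. Assuming that is what the sum over child sets denotes, the gain can be written more explicitly as below, where S is the current set and S_v is the child set of examples whose attribute A takes the value v (notation introduced for this illustration).

```latex
Gain(A) = E(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, E(S_v)
```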
For our experiments we use a synthetic
dataset. The following dataset resides at the data owner before
conversion to the unrealized dataset; after converting the
dataset to the unrealized dataset, the data owner forwards it to the
analyst.
Original Data at Owner
At the analyst end, the decision tree is constructed for the
unrealized (encrypted) dataset based on
information gain, and the testing data are analyzed against the
training or unrealized dataset.
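Because ID3 compares attribute values only for equality, any one-to-one relabeling of the values leaves the information gain, and hence the shape of the tree, unchanged. The sketch below checks this on a hypothetical four-row dataset. The `mapping` dictionary is a stand-in for encryption (it is not AES), and all names and values are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Information gain of `attribute` with size-weighted child entropies."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    return base - sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def relabel(rows, mapping):
    """Apply a one-to-one value mapping per attribute (stand-in for encryption)."""
    return [{a: mapping[a][v] for a, v in r.items()} for r in rows]


plain = [
    {"age": "young", "income": "high", "eligible": "no"},
    {"age": "young", "income": "low", "eligible": "no"},
    {"age": "senior", "income": "high", "eligible": "yes"},
    {"age": "senior", "income": "low", "eligible": "yes"},
]
mapping = {
    "age": {"young": "x9", "senior": "q2"},
    "income": {"high": "k7", "low": "m4"},
    "eligible": {"yes": "t1", "no": "z8"},
}
cipher = relabel(plain, mapping)

plain_gains = {a: gain(plain, a, "eligible") for a in ("age", "income")}
cipher_gains = {a: gain(cipher, a, "eligible") for a in ("age", "income")}
```

The two dictionaries of gains agree attribute by attribute, so the analyst would pick the same splitting attribute on the cipher dataset as the owner would on the plain one.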
Figure 3. Unrealized Dataset
The decision tree is constructed with the class labels based on
information gain, computed in terms of entropy. The tree is
shown below.
Figure 4. Tree and Eligible Data
The final eligible data after decryption at the data owner end are shown below.
Figure 5. Eligible Data after Classification and Decryption
IV. CONCLUSION
In this paper we proposed an efficient privacy
preservation technique for the classification of unreal
datasets. It protects the data owner from unauthorized
access and privacy issues. Our proposed approach works
efficiently without violating the classification properties.
Meanwhile, an accurate decision tree can be built directly
from those unreal data sets. Finally, the classification yields
accurate results even though it is applied to the
cipher dataset.
REFERENCES
[1] R. Agrawal and R. Srikant, “Privacy Preserving Data
Mining,” Proc. ACM SIGMOD Conf. Management of Data
(SIGMOD ’00), pp. 439-450, May 2000.
[2] S.L. Wang and A. Jafari, “Hiding Sensitive Predictive
Association Rules,” Proc. IEEE Int’l Conf. Systems, Man
and Cybernetics, pp. 164-169, 2005.
[3] S. Ajmani, R. Morris, and B. Liskov, “A Trusted
Third-Party Computation Service,” Technical Report
MIT-LCS-TR-847, MIT, 2001.
[4] Q. Ma and P. Deng, “Secure Multi-Party Protocols for
Privacy Preserving Data Mining,” Proc. Third Int’l Conf.
Wireless Algorithms, Systems, and Applications (WASA
’08), pp. 526-537, 2008.
[5] N. Lomas, “Data on 84,000 United Kingdom Prisoners
is Lost,” Retrieved Sept. 12, 2008,
http://news.cnet.com/83011009_3-10024550-83.html, Aug. 2008.
[6] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman,
“A Pristine Clean Cabalistic Foruity Strategize Based
Approach for Incremental Data Stream Privacy Preserving
Data Mining,” Proc. IEEE Second Int’l Advance
Computing Conf. (IACC), pp. 410-415, 2010.
[7] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh,
“Preservation of Patterns and Input-Output Privacy,” Proc.
IEEE 23rd Int’l Conf. Data Eng., pp. 696-705, Apr. 2007.
[8] S. Russell and P. Norvig, Artificial Intelligence: A
Modern Approach, 2/E. Prentice-Hall, 2002.
[9] D. Goodin, “Hackers Infiltrate TD Ameritrade Client
Database,” Retrieved Sept. 2008,
http://www.channelregister.co.uk/2007/09/15/ameritrade_database_burgled/,
Sept. 2007.
[10] L. Liu, M. Kantarcioglu, and B. Thuraisingham,
“Privacy Preserving Decision Tree Mining from Perturbed
Data,” Proc. 42nd Hawaii Int’l Conf. System Sciences
(HICSS ’09), 2009.
[11] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F.
Dong, “Three New Approaches to Privacy-Preserving Add
to Multiply Protocol and Its Application,” Proc. Second
Int’l Workshop Knowledge Discovery and Data Mining
(WKDD ’09), pp. 554-558, 2009.
[12] J. Vaidya and C. Clifton, “Privacy Preserving
Association Rule Mining in Vertically Partitioned Data,”
Proc. Eighth ACM SIGKDD Int’l Conf. Knowledge
Discovery and Data Mining (KDD ’02), pp. 23-26, July
2002.
[13] J. Dowd, S. Xu, and W. Zhang, “Privacy-Preserving
Decision Tree Mining Based on Random Substitutions,”
Proc. Int’l Conf. Emerging Trends in Information and
Comm. Security (ETRICS ’06), pp. 145-159, 2006.
[14] C. Aggarwal and P. Yu, Privacy-Preserving Data
Mining: Models and Algorithms. Springer, 2008.
[15] L. Sweeney, “k-Anonymity: A Model for Protecting
Privacy,” Int’l J. Uncertainty, Fuzziness and
Knowledge-Based Systems, vol. 10, pp. 557-570, May 2002.
[16] M. Shaneck and Y. Kim, “Efficient Cryptographic
Primitives for Private Data Mining,” Proc. 43rd Hawaii
Int’l Conf. System Sciences (HICSS), pp. 1-9, 2010.
[17] BBC News, “Brown Apologises for Records Loss,”
Retrieved Sept. 12, 2008,
http://news.bbc.co.uk/2/hi/uk_news/politics/7104945.stm,
Nov. 2007.
[18] D. Kaplan, “Hackers Steal 22,000 Social Security
Numbers from Univ. of Missouri Database,” Retrieved Sept.
2008,
http://www.scmagazineus.com/Hackers-steal22000-Social-Security-numbers-from-Univ.-of-Missouridatabase/article/34964/,
May 2007.
[19] P.K. Fong, “Privacy Preservation for Training Data
Sets in Database: Application to Decision Tree Learning,”
master’s thesis, Dept. of Computer Science, Univ. of
Victoria, 2008.
[20] R. Buyya, C. S. Yeo, and S. Venugopal,
“Market-oriented cloud computing: Vision, hype, and reality for
delivering IT services as computing utilities,” in Proc. IEEE
Conf. High Performance Comput. Commun.,
Sep. 2008, pp. 5–13.
[21] W. K. Wong, D. W. Cheung, E. Hung, B. Kao, and N.
Mamoulis, “Security in outsourcing of association rule
mining,” in Proc. Int. Conf. Very Large Data Bases, 2007,
pp. 111–122.
[22] F. Giannotti, L. V. Lakshmanan, A. Monreale, D.
Pedreschi, and H. Wang, “Privacy-preserving data mining
from outsourced databases,” in Proc. SPCC2010
Conjunction with CPDP, 2010, pp. 411–426.
[23] S. J. Rizvi and J. R. Haritsa, “Maintaining data
privacy in association rule mining,” in Proc. Int. Conf.
Very Large Data Bases, 2002, pp. 682–693.
BIOGRAPHIES
Satyanarayana Mummana is working as an Asst. Professor
at Avanthi Institute of Engineering & Technology,
Visakhapatnam, Andhra Pradesh. He received his Master's
degree (MCA) from Gandhi Institute of Technology and
Management (GITAM), Visakhapatnam, and his M.Tech (CSE)
from Avanthi Institute of Engineering & Technology,
Visakhapatnam, Andhra Pradesh. His research areas include
Image Processing, Computer Networks, Data Mining,
Distributed Systems, and Cloud Computing.
Sivasankar Vakkalagadda completed his B.Tech and is
pursuing his M.Tech at Avanthi Institute of Engineering &
Technology, Visakhapatnam, Andhra Pradesh. His areas of
interest are Java, data mining, web technologies, and
Oracle database.
ISSN: 2231-5381
http://www.ijettjournal.org