International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 4 - Nov 2013
Secure Classification Approach over Outsourced Datasets
Ramkishor Pondreti*, Jayanthi Rao Madina#
* M.Tech scholar, Department of Software Engineering, SISTAM College, Srikakulam, Andhra Pradesh
# Assistant Professor, Dept. of CSE, SISTAM College, Srikakulam, Andhra Pradesh
Abstract: In this paper we propose an efficient, secure
classification approach over outsourced datasets in data
mining. Our approach is an empirical model of a
privacy-preserving technique applied during the classification
of data. We introduce a cryptography-based approach that protects
centralized sample data sets utilized for decision tree data
mining. Privacy preservation is applied to sanitize the samples
prior to their release to third parties in order to mitigate the
threat of their inadvertent disclosure or theft. In contrast to
other sanitization methods, our approach does not affect the
accuracy of data mining results. The decision tree can be built
directly from the sanitized data sets, such that the originals do
not need to be reconstructed. Moreover, this approach can be
applied at any time during the data collection process so that
privacy protection can be in effect even while samples are still
being collected.
I. INTRODUCTION
Data mining is widely used by researchers for
science and business purposes. Data collected (referred to as
“sample data sets” or “samples” in this paper) from
individuals (referred to in this paper as “information
providers”) are important for decision making or pattern
recognition. Therefore, privacy-preserving processes have
been developed to sanitize private information from the
samples while keeping their utility [1].
Due to increasing concerns related to privacy,
various privacy-preserving data mining techniques have
been developed to address different privacy issues. These
techniques usually operate under various assumptions and
employ different methods. In this paper, we will focus on
the perturbation method that is extensively used in
privacy-preserving data mining [2].
Data modification techniques maintain privacy by
modifying attribute values of the sample data sets. Essentially,
data sets are modified by eliminating or unifying uncommon
elements among all data sets. These similar data sets act as
masks for the others within the group because they cannot be
distinguished from the others; every data set is loosely linked
with a certain number of information providers. k-anonymity
[15] is a data modification approach that aims to protect
private information of the samples by generalizing attributes.
k-anonymity trades privacy for utility. Further, this approach
can be applied only after the entire data collection process has
been completed[3][4].
Privacy issues are further exacerbated now that
the World Wide Web makes it easy for new data to be
automatically collected and added to databases [4] [5] [6]
[17]. The concerns over massive collection of data are
naturally extending to the analytic tools applied to data. Data
mining, with its promise to efficiently discover valuable,
non-obvious information from large databases, is
particularly vulnerable to misuse [9] [10] [18]. A fruitful
direction for future research in data mining will be the
development of techniques that incorporate privacy concerns.
We introduce a new perturbation and
randomization based approach that protects centralized
sample data sets utilized for decision tree data mining.
Privacy preservation is applied to sanitize the samples prior
to their release to third parties in order to mitigate the threat
of their inadvertent disclosure or theft. In contrast to other
sanitization methods, our approach does not affect the
accuracy of data mining results. The decision tree can be
built directly from the sanitized data sets, such that the
originals do not need to be reconstructed. Moreover, this
approach can be applied at any time during the data
collection process so that privacy protection can be in
effect even while samples are still being collected.
II. RELATED WORK
Researchers have introduced various mechanisms for
privacy-preserving data mining. These approaches are not
confined to one specific domain of data mining; they cover
clustering, classification, association rule mining and
others. Although many approaches have been published, none
is optimal, and each has its own advantages and
disadvantages in its proposed architecture.
Our approach concentrates on a privacy-preserving
technique for classification that involves unrealized
datasets. The data owner cannot process the data directly
without sending it to the analyst, so privacy is also an
important issue while data is transmitted between the data
owner and the analyst. Traditional randomization and
perturbation approaches provide security in the form of
privacy preservation, whereas the traditional classification
mechanism analyzes the testing data (or sample) against the
training data with no concept of privacy preservation.
Privacy-preserving techniques such as randomization and
perturbation may not be optimal because there is a chance
of data loss (a data integrity problem).
Perturbation-based approaches attempt to achieve
privacy protection by distorting information from the original
data sets. The perturbed data sets still retain features of the
originals so that they can be used to perform data mining
directly or indirectly via data reconstruction. Random
substitution [16] is a perturbation approach that randomly
substitutes the values of selected attributes to achieve privacy
protection for those attributes, and then applies data
reconstruction when these data sets are needed for data
mining. Even though the privacy of the selected attributes can
be protected, the utility is not recoverable because the
reconstructed data sets are random estimations of the originals.
In this paper we introduce an efficient
privacy-preserving cryptographic approach for the
classification of datasets without exposing the users'
sensitive information to the external world.
III. PROPOSED WORK
In this paper we propose a cryptographic
classification approach.
Figure 1. Privacy-preserving architecture. [Diagram components: Data Owner, Plain Dataset, Encoder, Cipher Dataset, Analyst, Classifier, Cipher Classification Rules (CC rules), Decoder, Original Rules.]
The architecture above works as follows. Our empirical
model of the privacy-preserving technique integrates
cryptography, using the AES algorithm, with classification,
using the ID3 algorithm.
The data owner maintains the training datasets. The
realized (plain) datasets are converted to unrealized
datasets through AES encoding, and the data owner forwards
the unrealized datasets to the analyst for classification.
The main advantage of unrealized datasets is that
classification does not require the semantics of the testing
and training datasets.
Initialize Training Datasets for Machine Learning:
A dataset is a collection of tuples over a set of
attributes, with the possible values for each attribute and
a class label for each tuple. It is given to the
classification process so that the behaviour of the testing
set can be analysed with a machine learning approach. A
synthetic dataset can be used to obtain classification
results. Initially the dataset is forwarded to the encoder,
which returns the cipher dataset.
Unrealized Dataset Creation
Usually data is passed to analysts in plain form for
machine learning purposes, but this raises a privacy issue
regarding confidential information. In this paper we
therefore use the AES algorithm to address the privacy
issue. After applying this mechanism, the dataset becomes
an unrealized dataset; i.e., the cipher dataset is passed to
the analyst for classification instead of the plain
sensitive or confidential information.
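The encoding step can be sketched as follows. The paper does not state the AES mode or key handling, so this is only an illustration under assumptions: a decision tree can be built on cipher values only if equal plain values always map to equal cipher values, so the sketch uses a deterministic keyed encoding. To stay within the Python standard library it swaps in HMAC-SHA256 as a stand-in for AES; HMAC is not AES and is not reversible, so the owner keeps a codebook for later decoding. The names `encode_value`, `encode_dataset`, the key and the sample rows are all hypothetical.

```python
import hashlib
import hmac


def encode_value(key: bytes, attribute: str, value: str) -> str:
    """Deterministically map one attribute value to an opaque token.

    Stand-in for per-value AES encryption (an assumption of this sketch):
    HMAC-SHA256 is NOT AES and cannot be decrypted, so decoding relies on
    the codebook returned by encode_dataset.
    """
    message = f"{attribute}={value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]


def encode_dataset(key: bytes, rows):
    """Return (cipher_rows, codebook).

    codebook maps each token back to its (attribute, plain value) pair
    and never leaves the data owner.
    """
    codebook = {}
    cipher_rows = []
    for row in rows:
        cipher_row = {}
        for attribute, value in row.items():
            token = encode_value(key, attribute, value)
            codebook[token] = (attribute, value)
            cipher_row[attribute] = token
        cipher_rows.append(cipher_row)
    return cipher_rows, codebook


# Hypothetical sample rows; attribute names are left in the clear here.
plain_rows = [
    {"age": "young", "income": "high", "eligible": "no"},
    {"age": "young", "income": "low", "eligible": "no"},
    {"age": "senior", "income": "high", "eligible": "yes"},
]
cipher_rows, codebook = encode_dataset(b"owner-secret-key", plain_rows)
```

Only `cipher_rows` would be sent to the analyst in this sketch. Equal plain values produce equal tokens, which is what lets counting-based classification proceed on the cipher dataset; the same determinism also reveals value frequencies, a trade-off the sketch does not address.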
Classification with ID3
ID3 is an efficient machine learning approach for
building decision trees, which are used for classification.
The tree is constructed from attribute-based entropy, or
information gain, values. The classification rules can then
be analyzed by running the testing data against the tree
built from the training datasets.
Retrieval of Original classified results
After the classification results are generated, they are
passed to the data owner, where the administrator performs
attribute-oriented decryption on the result set. The decoder
reconstructs the original values, and the classified rules
are finally obtained at the data owner's end.
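The decoding step at the data owner's end might look like the sketch below. It assumes the analyst returns each cipher rule as a list of (attribute, cipher token) conditions plus a cipher class label, and that the owner holds a lookup from token to plain value; the rule format, the tokens and the function names are hypothetical and are not taken from the paper.

```python
def decode_rule(rule, codebook):
    """Translate one cipher rule back to plain attribute values.

    rule: (conditions, label), where conditions is a list of
          (attribute, cipher_token) pairs and label is a cipher token.
    codebook: maps cipher_token -> plain value; held only by the owner.
    """
    conditions, label = rule
    plain_conditions = [(attr, codebook[token]) for attr, token in conditions]
    return plain_conditions, codebook[label]


def format_rule(plain_rule):
    """Render a decoded rule as readable IF ... THEN ... text."""
    conditions, label = plain_rule
    if not conditions:
        return f"THEN {label}"
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {tests} THEN {label}"


# Hypothetical codebook and cipher rules, for illustration only.
codebook = {"a91f": "young", "7c2e": "senior", "e03b": "yes", "58d4": "no"}
cipher_rules = [
    ([("age", "a91f")], "58d4"),
    ([("age", "7c2e")], "e03b"),
]
plain_rules = [format_rule(decode_rule(r, codebook)) for r in cipher_rules]
```

With real AES the lookup would be replaced by decrypting each token with the owner's key; the structure of the rule stays the same either way.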
IV. EXPERIMENTAL ANALYSIS
ID3 builds a decision tree from a fixed set of
examples. The resulting tree is used to classify future
samples. Each example has several attributes and belongs to
a class (like yes or no). The leaf nodes of the decision tree
contain the class name whereas a non-leaf node is a
decision node. The decision node is an attribute test with
each branch (to another decision tree) being a possible
value of the attribute. ID3 uses information gain to help it
decide which attribute goes into a decision node. The
advantage of learning a decision tree is that a program,
rather than a knowledge engineer, elicits knowledge from
an expert.
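The tree construction described above can be sketched in a few lines. This is a minimal illustration of ID3 on a hypothetical six-row sample, not the implementation or the dataset used in the experiments; the weighted form of information gain is the standard ID3 definition and is assumed here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the size-weighted entropy of each child set."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def id3(rows, attributes, target):
    """Return a class label (leaf) or {attribute: {value: subtree}} (decision node)."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:          # pure node: stop
        return labels[0]
    if not attributes:                 # nothing left to test: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return {best: branches}


def classify(tree, sample):
    """Follow the branches for a sample; raises KeyError on an unseen value."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][sample[attribute]]
    return tree


# Hypothetical training rows, for illustration only.
rows = [
    {"age": "young", "income": "high", "eligible": "no"},
    {"age": "young", "income": "low", "eligible": "no"},
    {"age": "senior", "income": "high", "eligible": "yes"},
    {"age": "senior", "income": "low", "eligible": "yes"},
    {"age": "middle", "income": "high", "eligible": "yes"},
    {"age": "middle", "income": "low", "eligible": "no"},
]
tree = id3(rows, ["age", "income"], "eligible")
prediction = classify(tree, {"age": "middle", "income": "low"})
```

The sketch only compares values for equality and counts them, so it never needs to know what a value means. That is the property that allows the same procedure to run on one-to-one encoded (cipher) values.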
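Gain measures how well a given attribute separates
training examples into targeted classes. The attribute with
the highest information gain (information being the most
useful for classification) is selected. In order to define
gain, we first borrow an idea from information theory called
entropy. Entropy measures the amount of information in an
attribute. This is the formula for calculating the
homogeneity of a sample S:
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
where c is the number of classes and p_i is the proportion of
S that belongs to class i. It helps to measure the information
gain with respect to an attribute A:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
where S is the current set and each S_v is the child set of S
in which A takes the value v.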
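As a worked instance of the entropy and gain calculation, consider a hypothetical sample S (not the synthetic dataset used in this paper) of 14 examples, 9 labelled yes and 5 labelled no, and a hypothetical attribute A that splits S into child sets S_1, S_2 and S_3 with class counts (2 yes, 3 no), (4 yes, 0 no) and (3 yes, 2 no). The calculation assumes the standard ID3 definition, in which the entropy of each child set is weighted by its share of the examples.

```latex
\begin{align*}
\mathrm{Entropy}(S)   &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.9403 \\
\mathrm{Entropy}(S_1) &= -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.9710 \\
\mathrm{Entropy}(S_2) &= 0 \quad \text{(all examples in one class)} \\
\mathrm{Entropy}(S_3) &= -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} \approx 0.9710 \\
\mathrm{Gain}(S, A)   &\approx 0.9403 - \tfrac{5}{14}(0.9710) - \tfrac{4}{14}(0) - \tfrac{5}{14}(0.9710)
                       \approx 0.9403 - 0.6936 = 0.2467
\end{align*}
```

An attribute whose child sets were all pure would have a gain equal to Entropy(S), and one whose child sets kept the 9:5 class ratio would have a gain of zero.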
For our experiments we use a synthetic dataset. Figure 2
shows the dataset at the data owner's side before it is
converted to an unrealized dataset. After converting the
dataset to an unrealized dataset, the data owner forwards it
to the analyst.
Figure 2. Original Data at Owner
At the analyst's end, the decision tree is constructed for
the unrealized (encrypted) dataset based on information
gain, and the testing data is analyzed against the training,
or unrealized, dataset.
Figure 3. Unrealized Dataset
The decision tree is constructed with the class labels based
on information gain, in terms of entropy. The tree is shown
in Figure 4.
Figure 4. Tree and Eligible Data
The final eligible data after decryption at the data owner's
end is shown in Figure 5.
Figure 5. Eligible Data after Classification and Decryption
V. CONCLUSION AND FUTURE WORK
In this paper we proposed an efficient privacy preservation
technique for the classification of unreal datasets. It
protects the data owner from unauthorized access and
privacy issues. Our proposed approach works efficiently
without violating the classification properties. Meanwhile,
an accurate decision tree can be built directly from those
unreal data sets. Finally, the results remain accurate
even though classification is applied on the cipher dataset.
The approach classifies the testing data with the training
data without losing data integrity. The cryptographic
mechanism provides secure classification during data
transmission between the data owner and the analyst. AES is
widely regarded as a more secure cryptographic approach than
the traditional approaches.
REFERENCES
[1] P.K. Fong and J.H. Weber-Jahnke, “Privacy Preserving
Decision Tree Learning Using Unrealized Data Sets.”
[2] L. Liu, “Privacy Preserving Decision Tree Mining from
Perturbed Data.”
[3] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,”
Int’l J. Uncertainty, Fuzziness and Knowledge-based Systems,
vol. 10, pp. 557-570, May 2002.
[4] Q. Ma and P. Deng, “Secure Multi-Party Protocols for Privacy
Preserving Data Mining,” Proc. Third Int’l Conf. Wireless
Algorithms, Systems, and Applications (WASA ’08), pp. 526-537, 2008.
[5] N. Lomas, “Data on 84,000 United Kingdom Prisoners is
Lost,” Retrieved Sept. 12, 2008, http://news.cnet.com/83011009_3-10024550-83.html, Aug. 2008.
[6] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman, “A
Pristine Clean Cabalistic Foruity Strategize Based Approach for
Incremental Data Stream Privacy Preserving Data Mining,” Proc.
IEEE Second Int’l Advance Computing Conf. (IACC), pp. 410-415, 2010.
[7] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh, “Preservation
of Patterns and Input-Output Privacy,” Proc. IEEE 23rd Int’l
Conf. Data Eng., pp. 696-705, Apr. 2007.
[8] S. Russell and P. Norvig, Artificial Intelligence: A Modern
Approach, 2nd ed. Prentice-Hall, 2002.
[9] D. Goodin, “Hackers Infiltrate TD Ameritrade Client
Database,” Retrieved Sept. 2008,
http://www.channelregister.co.uk/2007/09/15/ameritrade_database_burgled/, Sept. 2007.
[10] L. Liu, M. Kantarcioglu, and B. Thuraisingham, “Privacy
Preserving Decision Tree Mining from Perturbed Data,” Proc. 42nd
Hawaii Int’l Conf. System Sciences (HICSS ’09), 2009.
[11] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F. Dong,
“Three New Approaches to Privacy-Preserving Add to Multiply
Protocol and Its Application,” Proc. Second Int’l Workshop
Knowledge Discovery and Data Mining (WKDD ’09), pp. 554-558, 2009.
[12] J. Vaidya and C. Clifton, “Privacy Preserving Association
Rule Mining in Vertically Partitioned Data,” Proc. Eighth ACM
SIGKDD Int’l Conf. Knowledge Discovery and Data Mining
(KDD ’02), pp. 23-26, July 2002.
[13] J. Dowd, S. Xu, and W. Zhang, “Privacy-Preserving
Decision Tree Mining Based on Random Substitutions,” Proc. Int’l
Conf. Emerging Trends in Information and Comm. Security
(ETRICS ’06), pp. 145-159, 2006.
[14] C. Aggarwal and P. Yu, Privacy-Preserving Data Mining:
Models and Algorithms. Springer, 2008.
[15] L. Sweeney, “k-Anonymity: A Model for Protecting
Privacy,” Int’l J. Uncertainty, Fuzziness and Knowledge-based
Systems, vol. 10, pp. 557-570, May 2002.
BIOGRAPHIES
Ramkishor Pondreti is working as an Assistant Professor at
Aditya Institute of Technology And Management, Tekkali. He
received his B.Tech from Aditya Institute of Technology And
Management, Tekkali, and his M.Tech from Avanthi Institute of
Engineering & Technology, Visakhapatnam. He is pursuing an
M.Tech at Sarada Institute of Science, Technology and
Management, Srikakulam, Andhra Pradesh. His areas of interest
are Data Structures, Java and Oracle database.
Jayanthi Rao Madina is working as HOD at Sarada Institute of
Science, Technology And Management, Srikakulam, Andhra Pradesh.
He received his M.Tech (CSE) from Aditya Institute of
Technology And Management, Tekkali, Andhra Pradesh. His
research areas include Image Processing, Computer Networks,
Data Mining and Distributed Systems.
ISSN: 2231-5381
http://www.ijettjournal.org