Privacy Preserving Heuristic Approach for Intermediate Data Sets in Cloud

advertisement
International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014
Privacy Preserving Heuristic Approach for
Intermediate Data Sets in Cloud
Ms. C. Celcia#1 , Mrs. T. Kavitha*2
#
Student,*Assistant Professor
Department of Computer Science and Engineering
Regional Centre of Anna University
Tirunelveli (T.N) India
Abstract— Cloud computing is the sharing of computing
resources which lessen the upfront investment cost of IT
infrastructure. So many organizations are moving their business
into cloud. In data intensive applications, while processing
original data set many intermediate data sets will be generated.
The intermediate data sets are often stored in cloud in order to
reduce the cost of recomputing them. Intermediate data sets may
contain sensitive information. Preserving the privacy of the
intermediate data sets is a challenging problem because
adversaries may recover sensitive information by analyzing
multiple intermediate data sets. Encrypting all intermediate data
sets is neither efficient nor cost effective. It may be very time
consuming to encrypt and decrypt all the intermediate data sets.
Privacy preserving heuristic approach identifies which
intermediate data set needs to be encrypted and which do not
based on the privacy requirements of the data holders. In this,
encryption is integrated with data anonymization for costeffective privacy preserving.
Keywords— Privacy preserving, Cloud computing, Intermediate
data set, Encryption, Privacy requirement, Privacy Leakage
I. INTRODUCTION
Cloud computing can be defined as the set of hardware,
networks, storage, services, and interfaces that combine to
deliver aspects of computing as a service. Cloud services
include the delivery of software, infrastructure, and storage
over the internet either as separate components or a complete
platform based on user demand. Cloud computing does not
require a user to be in a specific place to gain access to it.
Companies may find that cloud computing allows them to
reduce the cost of information management, since they are not
required to own their own servers and can use capacity leased
from third parties. Additionally, the cloud-like structure
allows companies to upgrade software more quickly.
Cloud computing enables businesses of all sizes to quickly
procure and use a wide range of enterprise-class IT systems on
a pay-per use basis from anywhere at any time.Servers,
storage and network resources are better utilized as the Cloud
is shared by multiple users, thus cutting down on waste at a
global level. Cloud computing is more environment-friendly
and energy efficient. Applications provided through the Cloud
can be accessed from any device – a computer, a smartphone,
an iPad, etc. Any device that has access to the Internet can
leverage the power of the Cloud. Cloud customers can save
ISSN: 2231-5381
huge capital investment of IT infrastructure, and concentrate
on their own core business. Therefore, many companies or
organizations have been migrating or building their business
into cloud.
A. Privacy Preserving
Cloud computing offers many benefits to its users. However,
Numerous potential customers are still hesitant to take the
advantage of cloud due to security and privacy concerns.
Privacy preserving is an important research topic in cloud
computing.Since the data itself contains sensitive
information .When it is outsourced to a third party, privacy
breach of the data can occur. Privacy refers to the right to selfdetermination, that is, the right of individuals to ‘know what is
known about them’, be aware of stored information about
them, control how that information is communicated and
prevent its abuse. Without the knowledge of the physical
location of the server or of how the processing of personal
data is configured, end-users consume cloud services without
any information about the processes involved. Data in the
cloud are easier to manipulate, but also easier to lose control
of. For instance, storing personal data on a server somewhere
in cyberspace could pose a major threat to individual privacy.
When processing the original data set in cloud intermediate data
sets are generaeted. Users have to pay for storage and
computation. So they may store some valuable data sets in cloud
in order to reduce the cost of regenerating the data sets again.
Storing the intermediate data in cloud elaborates the attack
surfaces so that privacy requirements of data holders are at
risk of being violated. Usually, multiple parties access and
process the intermediate data, but it is rarely controlled by
original data set holders. In this situation an adversary can
collect the intermediate data sets together and get the privacysensitive information from them. This leads to considerable
economic loss or severe social reputation impairment to data
owners. The most straight forward approach for preserving the
privacy of data sets is encryption. Usually, the volume of
intermediate data sets is huge. Hence, encrypting all
intermediate data sets will lead to high overhead and low
efficiency when they are frequently accessed or processed.
http://www.ijettjournal.org
Page 235
International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014
Cost of encrypting all intermediate data sets are high. Hence it
is planned to reduce the cost by encrypting only the selective
intermediate data sets rather than all. For this a novel upper
bound privacy leakage constraint-based approach is used to
identify which intermediate data sets need to be encrypted and
which do not, so that privacy-preserving cost can be saved
while the privacy requirements of data holders can still be
satisfied. The major part of this paper is threefold. They are
prevent information leakage beyond the data provider’s policy.
But Airavat cannot confine every computation performed by
untrusted code .
Silverline [7] is a set of tools that automatically identifies all
functionally encryptable data in a cloud application, assigns
encryption keys to specific data subsets to minimize key
management complexity while ensuring robustness to key
compromise, and provides transparent data access at the user
device while preventing key compromise even from malicious
 It demonstrates the possibility of ensuring privacy clouds . Silverline provides a substantial first step towards
leakage requirements without encrypting all intermediate simplifying the complex process of incorporating data
data sets when encryption is incorporated with confidentiality into these storage-intensive cloud applications.
anonymization to preserve privacy.
Its aim is to improve the confidentiality of application data
 It identifies which data sets need to be encrypted for stored on third-party computing clouds. But there are several
preserving privacy while the rest of them do not.
disadvantages of Silverline. Not all data on the cloud is

It reduces privacy preserving cost over existing encrypted. Cloud can learn some metadata. Executing
approaches, which provides benefits to cloud users who inequality comparisons on encrypted cells fail. Data encrypted
utilize cloud sevices in a pay-as-you-go fashion.
with a single key that is shared with all the registered users in
an application are vulnerable to a variety of attacks by the
II. RELATED WORK
cloud.
Sedic [8] provides a solution to the privacy threat that is to
A. Cost effective privacy preserving
split a task, keeping the computation on the private data
within an organization’s private cloud while moving the rest
This work provides the various approaches for privacy to the public commercial cloud. Sedic leverages the special
preserving in cloud computing. Encryption is the technique to features of MapReduce to automatically partition a computing
preserve the privacy of data. Storing data in a third party’s job according to the security levels of the data it works on,
cloud system causes serious concern on data confidentiality. and arrange the computation across a hybrid cloud.
In order to provide strong confidentiality for messages in MapReduce’s distributed file system is modified to
storage servers, user can encrypt data by cryptographic strategically replicate data, moving sanitized data blocks to
method. Encrypting all the data sets, a straight forward and the public cloud. Over this data placement, map tasks are
effective approach is widely adopted in [1], [2], [3]. However, carefully scheduled to outsource as much workload to the
processing on encrypted data sets efficiently is quite a public cloud as possible, given sensitive data always stay on
challenging task, because most existing applications only run the private cloud.
on unencrypted data sets. Although recent progress has been Sedic is designed to protect data privacy during map-reduce
made in homomorphic encryption which theoretically allows operations, when the data involved contains both public and
performing computation on encrypted data sets, applying private records. This protection is achieved by ensuring that
current algorithms are rather expensive due to their the sensitive information within the input data, intermediate
inefficiency [4].
outputs and final results will never be exposed to untrusted
Anonymization based approach [5] proposes the anonymity nodes during the computation. It involves overhead of
algorithm that processes the data and anonymises all or some transferring data between private and public cloud.
information before releasing it in the cloud. When required, Encryption and fragmentation approach [9] couples
the cloud service provider makes use of the background encryption together with data fragmentation. Encryption will
knowledge it has and incorporates the details with the be applied only when explicitly demanded by the privacy
anonymous data to mine the needed knowledge. This requirements. Privacy requirements are enforced by splitting
approach differs from the traditional cryptography technology information over two independent database servers in order to
for preserving user’s privacy as it gets rid of key management break associations of sensitive information and by encrypting
and thus it stands simple and flexible. While anonymising is information whenever necessary. The information to be
easier, the attributes that has to be made anonymous varies protected is first split into different fragments in such a way to
and it depends on the cloud service provider. This approach break the sensitive associations represented through
will be suitable only for limited number of services. Thus, the confidentiality constraints and to minimize the amount of
method has to be bettered by automating the anonymisation.
information represented only in encrypted format. The
Airavat [6] is a MapReduce-based system which provides resulting fragments may be stored at the same server or at
strong security and privacy guarantees for distributed different servers. Finally, the encryption key is given to the
computations on sensitive data. Airavat is a novel integration authorized users needing to access the information. Users that
of mandatory access control and differential privacy. It do not know the encryption key as well as the storing server(s)
enables many privacy-preserving MapReduce computations are able neither to access sensitive information nor to
without the need to audit untrusted code. Its objective is to reconstruct the sensitive associations. But the protection of
ISSN: 2231-5381
http://www.ijettjournal.org
Page 236
International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014
fragmented data when the information stored in the fragments
may change over time is difficult.
III. PROPOSED SYSTEM
when processing original data sets in data intensive
applications, in order to curtail the overall expenses by
avoiding frequent recomputation to obtain these data sets.
2) Privacy leakage quantification:
On processing the original data set intermediate data sets will
be generated. Sensitive intermediate data set tree is used to
capture the propagation of privacy-sensitive information
among the data sets.Privacy leakage of the intermediate data
sets are quantified. By considering the privacy requirement of
the data holder privacy leakage upper bound constraint is
calculated. Intermediate data sets whose privacy leakage is
lesser than the upperbound will be unencrypted and whose
privacy leakage is greater than the upperbound will be
encrypted. By doing so the cost effective global privacy
preserving solution is obtained.
The privacy sensitive information is generally regarded as the
association between sensitive data and individuals. Privacy
leakage of the intermediate data set is quantified. And a
threshold value is given by the data holder. Threshold value
should not exceed the maximum privacy leakage of the single
data set. If the privacy leakage threshold is minimum more
data sets need to be encrypted. If it is maximum more data
sets may remain unencrypted. The sum of the privacy leakage
of the unencrypted data sets should not exceed the threshold
value given by the data holder.
PLs(d*) ≜ H(S,Q) – H*(S,Q)
Where H(S,Q) = log(|QI| . |SD|) and
H∗ (S, Q) = −
p(s, q). log(p(s, q))
є
, є
3) Privacy Leakage Constraint Decomposition:
The privacy leakage constraint is decomposed into different
layers. So there is different threshold value for each layer. The
privacy leakage incurred by the unencrypted data set in the
layer can never be larger than the threshold value in that layer.
A local encryption solution in the layer is feasible if it satisfies
the privacy leakage constraint. A set of feasible solutions
exists in a layer which constitutes global solution. A
compressed tree is created from layer 1 to H where H is the
height of the tree.
The construction is achieved via three steps.
1. The data sets in EDi are compressed into one
encrypted node.
2. All offspring data sets of the data sets in UDi are
Fig.1 System Architecture
omitted.
3. The data sets in UDi are compressed into one node.
1) Process Original data set:
The data holder will store the data into cloud after
encryption. The original dataset is encrypted for
confidentiality. The data users have to register themselves by
giving the username and password. Then only they can able to
decrypt the data that the data holder has stored in cloud. DES
algorithm is used for encryption. Only the authenticated users
can process the dataset. Storage and computation services in
cloud are equivalent from an economical perspective because
they are charged in proportion to their usage. Thus cloud
customers can store valuable intermediate data sets selectively
ISSN: 2231-5381
( )≤
,1 ≤ ≤
∈
The threshold ԑi,1 ≤ i ≤ H, is calculated by
є =є
http://www.ijettjournal.org
( )
−
∈
є = є
Page 237
International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014
4) Cost Calculation:
Cost of storing the intermediate data set is calculated by the
size of the intermediate data set, frequency of accessing that
data set and the price set up by cloud service vendors. If the
frequency of accessing the intermediate data set is larger then
more cost will be incurred if the intermediate data set is
encrypted.
The privacy preserving cost rate is denoted as
.
≜
.
є
Where Si is the size of the intermediate data set, fi is the
frequency of accessing the stored intermediate data set, and
PR is the price for encryption and decryption. The cost of
privacy preserving should be minimum in order to get the
optimal result. Data holder will give privacy requirements that
is the privacy leakage threshold allowed by a data holder, the
privacy leakage caused by the unencrypted data sets should be
under a given threshold.
Fig. 2 Privacy Leakage Quantification
(
)≤ ,
∈ .
(
) is the privacy leakage of the multiple data
Where
) is the unencrypted data sets.
sets and (
5) Cost Effective Solution:
Usually, more than one feasible global encryption solution
exists under the PLC, because there are many alternative
local solutions in each layer. Further, each intermediate
data set has various size and frequency of usage, leading to
different overall cost with different solutions. Therefore it
is desired to find a feasible solution with minimum
privacy-preserving cost under privacy leakage constraints.
Heuristic approach is used to reduce privacy-preserving
cost. It prefers to encrypt the data sets which incur less
cost but disclose more privacy sensitive information. Data
sets with higher privacy-preserving cost and lower privacy
leakage are expected to remain unencrypted. Thus cost is
reduced in this technique instead of encrypting all data sets.
IV. SIMULATION AND RESULTS
Data holders store their data into cloud. Only the authenticated
users can decrypt and download the data. While processing
the data intermediated data sets are generated. Privacy leakage
of the intermediate data sets are calculated. Based on the
privacy requirement of the data holder intermediate data sets
are encrypted selectively. Cost of encrypting the data sets are
also calculated. The data set which incurs less cost for
encryption and leaks more privacy is selected for encryption
and others remain unencrypted. The privacy leakage of the
unencrypted data set is lesser than the threshold value given
by the data holder. When adversary sees the data set he cannot
infer any information from them.
ISSN: 2231-5381
Fig. 3 Encryption Based On Threshold Value
Fig. 4 Privacy Preseving Cost
http://www.ijettjournal.org
Page 238
International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014
Fig. 5 Minimum Privacy Preserving Cost
Fig. 8 Adversary View
Prvacy Preserving Cost
Cost for Selective Encryption decreases dramatically when
the threshold value increases.Whereas cost in all encryption
approach remains the same for all threshold value.
Fig. 6 Heuristic Value
80
70
60
50
40
30
20
10
0
All
Encryption
Cost
Selective
Encryption
Cost
0.05 0.1 0.2 0.3 0.4 0.5
Privacy Leakage
Fig. 9 Cost Comparison
Fig. 7 Data sets to be encryted
ISSN: 2231-5381
V. CONCLUSIONS
This paper has proposed an approach that identifies which part
of intermediate data sets needs to be encrypted while the rest
does not, in order to save the privacy preserving cost. A tree
structure has been modeled from the generation relationships
of intermediate data sets to analyze privacy propagation
among data sets. The problem of saving privacy-preserving
cost as a constrained optimization problem which is addressed
by decomposing the privacy leakage constraints have been
modeled. A practical heuristic algorithm has been designed
accordingly. Evaluation results on real-world data sets and
larger extensive data sets have demonstrated the cost of
preserving privacy in cloud can be reduced significantly with
this approach over existing ones where all data sets are
encrypted.
http://www.ijettjournal.org
Page 239
International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014
REFEENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
H. Lin and W. Tzeng, “A Secure Erasure Code-Based Cloud Storage
System with Secure Data Forwarding,” IEEE Trans. Parallel and
Distributed Systems, vol. 23, no. 6, pp. 995-1003, June 2012.
N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-Preserving
Multi-Keyword Ranked Search over Encrypted Cloud Data,” Proc.
IEEE INFOCOM ’11, pp. 829-837, 2011.
M. Li, S. Yu, N. Cao, and W. Lou, “Authorized Private Keyword
Search over Encrypted Data in Cloud Computing,” Proc. 31st Int’l
Conf. Distributed Computing Systems (ICDCS ’11), pp. 383-392, 2011.
C. Gentry, “Fully Homomorphic Encryption Using Ideal Lattices,”
Proc. 41st Ann. ACM Symp. Theory of Computing (STOC ’09), pp.
169-178, 2009.
Wang J, Zhao Y et al. (2009). Providing Privacy Preserving 1. in
cloud computing, International Conference on Test and Measurement,
vol 2, 213–216.
B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, “Privacy-Preserving
Data Publishing: A Survey of Recent Developments,” ACM
Computing Survey, vol. 42, no. 4, pp. 1-53, 2010.
I. Roy, S.T.V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, “Airavat:
Security and Privacy for Mapreduce,” Proc. Seventh USENIX Conf.
Networked Systems Design and Implementation (NSDI ’10), p. 20,
2010.
K.P.N. Puttaswamy, C. Kruegel, and B.Y. Zhao, “Silverline: Toward
Data Confidentiality in Storage-Intensive Cloud Applications,” Proc.
Second ACM Symp. Cloud Computing (SoCC ’11), 2011.
K. Zhang, X. Zhou, Y. Chen, X. Wang, and Y. Ruan, “Sedic: PrivacyAware Data Intensive Computing on Hybrid Clouds,” Proc. 18th ACM
Conf. Computer and Comm. Security (CCS ’11), pp. 515-526, 2011.
V. Ciriani, S.D.C.D. Vimercati, S. Foresti, S. Jajodia, S. Paraboschi,
and P. Samarati, “Combining Fragmentation and Encryption to Protect
Privacy in Data Storage,” ACM Trans. Information and System
Security, vol. 13, no. 3, pp. 1-33, 2010.
Xuyun Zhang, Chang Liu, Surya Nepal, Suraj Pandey, and Jinjun Chen,
Member, IEEE , “A Privacy Leakage Upper Bound Constraint-Based
Approach for Cost-Effective Privacy Preserving of Intermediate Data
Sets in Cloud,” IEEE transactions on parallel and distributed systems,
vol. 24, no. 6, june 2013.
ISSN: 2231-5381
http://www.ijettjournal.org
Page 240
Download