International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014 Privacy Preserving Heuristic Approach for Intermediate Data Sets in Cloud Ms. C. Celcia#1 , Mrs. T. Kavitha*2 # Student,*Assistant Professor Department of Computer Science and Engineering Regional Centre of Anna University Tirunelveli (T.N) India Abstract— Cloud computing is the sharing of computing resources which lessen the upfront investment cost of IT infrastructure. So many organizations are moving their business into cloud. In data intensive applications, while processing original data set many intermediate data sets will be generated. The intermediate data sets are often stored in cloud in order to reduce the cost of recomputing them. Intermediate data sets may contain sensitive information. Preserving the privacy of the intermediate data sets is a challenging problem because adversaries may recover sensitive information by analyzing multiple intermediate data sets. Encrypting all intermediate data sets is neither efficient nor cost effective. It may be very time consuming to encrypt and decrypt all the intermediate data sets. Privacy preserving heuristic approach identifies which intermediate data set needs to be encrypted and which do not based on the privacy requirements of the data holders. In this, encryption is integrated with data anonymization for costeffective privacy preserving. Keywords— Privacy preserving, Cloud computing, Intermediate data set, Encryption, Privacy requirement, Privacy Leakage I. INTRODUCTION Cloud computing can be defined as the set of hardware, networks, storage, services, and interfaces that combine to deliver aspects of computing as a service. Cloud services include the delivery of software, infrastructure, and storage over the internet either as separate components or a complete platform based on user demand. Cloud computing does not require a user to be in a specific place to gain access to it. Companies may find that cloud computing allows them to reduce the cost of information management, since they are not required to own their own servers and can use capacity leased from third parties. Additionally, the cloud-like structure allows companies to upgrade software more quickly. Cloud computing enables businesses of all sizes to quickly procure and use a wide range of enterprise-class IT systems on a pay-per use basis from anywhere at any time.Servers, storage and network resources are better utilized as the Cloud is shared by multiple users, thus cutting down on waste at a global level. Cloud computing is more environment-friendly and energy efficient. Applications provided through the Cloud can be accessed from any device – a computer, a smartphone, an iPad, etc. Any device that has access to the Internet can leverage the power of the Cloud. Cloud customers can save ISSN: 2231-5381 huge capital investment of IT infrastructure, and concentrate on their own core business. Therefore, many companies or organizations have been migrating or building their business into cloud. A. Privacy Preserving Cloud computing offers many benefits to its users. However, Numerous potential customers are still hesitant to take the advantage of cloud due to security and privacy concerns. Privacy preserving is an important research topic in cloud computing.Since the data itself contains sensitive information .When it is outsourced to a third party, privacy breach of the data can occur. Privacy refers to the right to selfdetermination, that is, the right of individuals to ‘know what is known about them’, be aware of stored information about them, control how that information is communicated and prevent its abuse. Without the knowledge of the physical location of the server or of how the processing of personal data is configured, end-users consume cloud services without any information about the processes involved. Data in the cloud are easier to manipulate, but also easier to lose control of. For instance, storing personal data on a server somewhere in cyberspace could pose a major threat to individual privacy. When processing the original data set in cloud intermediate data sets are generaeted. Users have to pay for storage and computation. So they may store some valuable data sets in cloud in order to reduce the cost of regenerating the data sets again. Storing the intermediate data in cloud elaborates the attack surfaces so that privacy requirements of data holders are at risk of being violated. Usually, multiple parties access and process the intermediate data, but it is rarely controlled by original data set holders. In this situation an adversary can collect the intermediate data sets together and get the privacysensitive information from them. This leads to considerable economic loss or severe social reputation impairment to data owners. The most straight forward approach for preserving the privacy of data sets is encryption. Usually, the volume of intermediate data sets is huge. Hence, encrypting all intermediate data sets will lead to high overhead and low efficiency when they are frequently accessed or processed. http://www.ijettjournal.org Page 235 International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014 Cost of encrypting all intermediate data sets are high. Hence it is planned to reduce the cost by encrypting only the selective intermediate data sets rather than all. For this a novel upper bound privacy leakage constraint-based approach is used to identify which intermediate data sets need to be encrypted and which do not, so that privacy-preserving cost can be saved while the privacy requirements of data holders can still be satisfied. The major part of this paper is threefold. They are prevent information leakage beyond the data provider’s policy. But Airavat cannot confine every computation performed by untrusted code . Silverline [7] is a set of tools that automatically identifies all functionally encryptable data in a cloud application, assigns encryption keys to specific data subsets to minimize key management complexity while ensuring robustness to key compromise, and provides transparent data access at the user device while preventing key compromise even from malicious It demonstrates the possibility of ensuring privacy clouds . Silverline provides a substantial first step towards leakage requirements without encrypting all intermediate simplifying the complex process of incorporating data data sets when encryption is incorporated with confidentiality into these storage-intensive cloud applications. anonymization to preserve privacy. Its aim is to improve the confidentiality of application data It identifies which data sets need to be encrypted for stored on third-party computing clouds. But there are several preserving privacy while the rest of them do not. disadvantages of Silverline. Not all data on the cloud is It reduces privacy preserving cost over existing encrypted. Cloud can learn some metadata. Executing approaches, which provides benefits to cloud users who inequality comparisons on encrypted cells fail. Data encrypted utilize cloud sevices in a pay-as-you-go fashion. with a single key that is shared with all the registered users in an application are vulnerable to a variety of attacks by the II. RELATED WORK cloud. Sedic [8] provides a solution to the privacy threat that is to A. Cost effective privacy preserving split a task, keeping the computation on the private data within an organization’s private cloud while moving the rest This work provides the various approaches for privacy to the public commercial cloud. Sedic leverages the special preserving in cloud computing. Encryption is the technique to features of MapReduce to automatically partition a computing preserve the privacy of data. Storing data in a third party’s job according to the security levels of the data it works on, cloud system causes serious concern on data confidentiality. and arrange the computation across a hybrid cloud. In order to provide strong confidentiality for messages in MapReduce’s distributed file system is modified to storage servers, user can encrypt data by cryptographic strategically replicate data, moving sanitized data blocks to method. Encrypting all the data sets, a straight forward and the public cloud. Over this data placement, map tasks are effective approach is widely adopted in [1], [2], [3]. However, carefully scheduled to outsource as much workload to the processing on encrypted data sets efficiently is quite a public cloud as possible, given sensitive data always stay on challenging task, because most existing applications only run the private cloud. on unencrypted data sets. Although recent progress has been Sedic is designed to protect data privacy during map-reduce made in homomorphic encryption which theoretically allows operations, when the data involved contains both public and performing computation on encrypted data sets, applying private records. This protection is achieved by ensuring that current algorithms are rather expensive due to their the sensitive information within the input data, intermediate inefficiency [4]. outputs and final results will never be exposed to untrusted Anonymization based approach [5] proposes the anonymity nodes during the computation. It involves overhead of algorithm that processes the data and anonymises all or some transferring data between private and public cloud. information before releasing it in the cloud. When required, Encryption and fragmentation approach [9] couples the cloud service provider makes use of the background encryption together with data fragmentation. Encryption will knowledge it has and incorporates the details with the be applied only when explicitly demanded by the privacy anonymous data to mine the needed knowledge. This requirements. Privacy requirements are enforced by splitting approach differs from the traditional cryptography technology information over two independent database servers in order to for preserving user’s privacy as it gets rid of key management break associations of sensitive information and by encrypting and thus it stands simple and flexible. While anonymising is information whenever necessary. The information to be easier, the attributes that has to be made anonymous varies protected is first split into different fragments in such a way to and it depends on the cloud service provider. This approach break the sensitive associations represented through will be suitable only for limited number of services. Thus, the confidentiality constraints and to minimize the amount of method has to be bettered by automating the anonymisation. information represented only in encrypted format. The Airavat [6] is a MapReduce-based system which provides resulting fragments may be stored at the same server or at strong security and privacy guarantees for distributed different servers. Finally, the encryption key is given to the computations on sensitive data. Airavat is a novel integration authorized users needing to access the information. Users that of mandatory access control and differential privacy. It do not know the encryption key as well as the storing server(s) enables many privacy-preserving MapReduce computations are able neither to access sensitive information nor to without the need to audit untrusted code. Its objective is to reconstruct the sensitive associations. But the protection of ISSN: 2231-5381 http://www.ijettjournal.org Page 236 International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014 fragmented data when the information stored in the fragments may change over time is difficult. III. PROPOSED SYSTEM when processing original data sets in data intensive applications, in order to curtail the overall expenses by avoiding frequent recomputation to obtain these data sets. 2) Privacy leakage quantification: On processing the original data set intermediate data sets will be generated. Sensitive intermediate data set tree is used to capture the propagation of privacy-sensitive information among the data sets.Privacy leakage of the intermediate data sets are quantified. By considering the privacy requirement of the data holder privacy leakage upper bound constraint is calculated. Intermediate data sets whose privacy leakage is lesser than the upperbound will be unencrypted and whose privacy leakage is greater than the upperbound will be encrypted. By doing so the cost effective global privacy preserving solution is obtained. The privacy sensitive information is generally regarded as the association between sensitive data and individuals. Privacy leakage of the intermediate data set is quantified. And a threshold value is given by the data holder. Threshold value should not exceed the maximum privacy leakage of the single data set. If the privacy leakage threshold is minimum more data sets need to be encrypted. If it is maximum more data sets may remain unencrypted. The sum of the privacy leakage of the unencrypted data sets should not exceed the threshold value given by the data holder. PLs(d*) ≜ H(S,Q) – H*(S,Q) Where H(S,Q) = log(|QI| . |SD|) and H∗ (S, Q) = − p(s, q). log(p(s, q)) є , є 3) Privacy Leakage Constraint Decomposition: The privacy leakage constraint is decomposed into different layers. So there is different threshold value for each layer. The privacy leakage incurred by the unencrypted data set in the layer can never be larger than the threshold value in that layer. A local encryption solution in the layer is feasible if it satisfies the privacy leakage constraint. A set of feasible solutions exists in a layer which constitutes global solution. A compressed tree is created from layer 1 to H where H is the height of the tree. The construction is achieved via three steps. 1. The data sets in EDi are compressed into one encrypted node. 2. All offspring data sets of the data sets in UDi are Fig.1 System Architecture omitted. 3. The data sets in UDi are compressed into one node. 1) Process Original data set: The data holder will store the data into cloud after encryption. The original dataset is encrypted for confidentiality. The data users have to register themselves by giving the username and password. Then only they can able to decrypt the data that the data holder has stored in cloud. DES algorithm is used for encryption. Only the authenticated users can process the dataset. Storage and computation services in cloud are equivalent from an economical perspective because they are charged in proportion to their usage. Thus cloud customers can store valuable intermediate data sets selectively ISSN: 2231-5381 ( )≤ ,1 ≤ ≤ ∈ The threshold ԑi,1 ≤ i ≤ H, is calculated by є =є http://www.ijettjournal.org ( ) − ∈ є = є Page 237 International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014 4) Cost Calculation: Cost of storing the intermediate data set is calculated by the size of the intermediate data set, frequency of accessing that data set and the price set up by cloud service vendors. If the frequency of accessing the intermediate data set is larger then more cost will be incurred if the intermediate data set is encrypted. The privacy preserving cost rate is denoted as . ≜ . є Where Si is the size of the intermediate data set, fi is the frequency of accessing the stored intermediate data set, and PR is the price for encryption and decryption. The cost of privacy preserving should be minimum in order to get the optimal result. Data holder will give privacy requirements that is the privacy leakage threshold allowed by a data holder, the privacy leakage caused by the unencrypted data sets should be under a given threshold. Fig. 2 Privacy Leakage Quantification ( )≤ , ∈ . ( ) is the privacy leakage of the multiple data Where ) is the unencrypted data sets. sets and ( 5) Cost Effective Solution: Usually, more than one feasible global encryption solution exists under the PLC, because there are many alternative local solutions in each layer. Further, each intermediate data set has various size and frequency of usage, leading to different overall cost with different solutions. Therefore it is desired to find a feasible solution with minimum privacy-preserving cost under privacy leakage constraints. Heuristic approach is used to reduce privacy-preserving cost. It prefers to encrypt the data sets which incur less cost but disclose more privacy sensitive information. Data sets with higher privacy-preserving cost and lower privacy leakage are expected to remain unencrypted. Thus cost is reduced in this technique instead of encrypting all data sets. IV. SIMULATION AND RESULTS Data holders store their data into cloud. Only the authenticated users can decrypt and download the data. While processing the data intermediated data sets are generated. Privacy leakage of the intermediate data sets are calculated. Based on the privacy requirement of the data holder intermediate data sets are encrypted selectively. Cost of encrypting the data sets are also calculated. The data set which incurs less cost for encryption and leaks more privacy is selected for encryption and others remain unencrypted. The privacy leakage of the unencrypted data set is lesser than the threshold value given by the data holder. When adversary sees the data set he cannot infer any information from them. ISSN: 2231-5381 Fig. 3 Encryption Based On Threshold Value Fig. 4 Privacy Preseving Cost http://www.ijettjournal.org Page 238 International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014 Fig. 5 Minimum Privacy Preserving Cost Fig. 8 Adversary View Prvacy Preserving Cost Cost for Selective Encryption decreases dramatically when the threshold value increases.Whereas cost in all encryption approach remains the same for all threshold value. Fig. 6 Heuristic Value 80 70 60 50 40 30 20 10 0 All Encryption Cost Selective Encryption Cost 0.05 0.1 0.2 0.3 0.4 0.5 Privacy Leakage Fig. 9 Cost Comparison Fig. 7 Data sets to be encryted ISSN: 2231-5381 V. CONCLUSIONS This paper has proposed an approach that identifies which part of intermediate data sets needs to be encrypted while the rest does not, in order to save the privacy preserving cost. A tree structure has been modeled from the generation relationships of intermediate data sets to analyze privacy propagation among data sets. The problem of saving privacy-preserving cost as a constrained optimization problem which is addressed by decomposing the privacy leakage constraints have been modeled. A practical heuristic algorithm has been designed accordingly. Evaluation results on real-world data sets and larger extensive data sets have demonstrated the cost of preserving privacy in cloud can be reduced significantly with this approach over existing ones where all data sets are encrypted. http://www.ijettjournal.org Page 239 International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 5 - Mar 2014 REFEENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] H. Lin and W. Tzeng, “A Secure Erasure Code-Based Cloud Storage System with Secure Data Forwarding,” IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 6, pp. 995-1003, June 2012. N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-Preserving Multi-Keyword Ranked Search over Encrypted Cloud Data,” Proc. IEEE INFOCOM ’11, pp. 829-837, 2011. M. Li, S. Yu, N. Cao, and W. Lou, “Authorized Private Keyword Search over Encrypted Data in Cloud Computing,” Proc. 31st Int’l Conf. Distributed Computing Systems (ICDCS ’11), pp. 383-392, 2011. C. Gentry, “Fully Homomorphic Encryption Using Ideal Lattices,” Proc. 41st Ann. ACM Symp. Theory of Computing (STOC ’09), pp. 169-178, 2009. Wang J, Zhao Y et al. (2009). Providing Privacy Preserving 1. in cloud computing, International Conference on Test and Measurement, vol 2, 213–216. B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, “Privacy-Preserving Data Publishing: A Survey of Recent Developments,” ACM Computing Survey, vol. 42, no. 4, pp. 1-53, 2010. I. Roy, S.T.V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, “Airavat: Security and Privacy for Mapreduce,” Proc. Seventh USENIX Conf. Networked Systems Design and Implementation (NSDI ’10), p. 20, 2010. K.P.N. Puttaswamy, C. Kruegel, and B.Y. Zhao, “Silverline: Toward Data Confidentiality in Storage-Intensive Cloud Applications,” Proc. Second ACM Symp. Cloud Computing (SoCC ’11), 2011. K. Zhang, X. Zhou, Y. Chen, X. Wang, and Y. Ruan, “Sedic: PrivacyAware Data Intensive Computing on Hybrid Clouds,” Proc. 18th ACM Conf. Computer and Comm. Security (CCS ’11), pp. 515-526, 2011. V. Ciriani, S.D.C.D. Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati, “Combining Fragmentation and Encryption to Protect Privacy in Data Storage,” ACM Trans. Information and System Security, vol. 13, no. 3, pp. 1-33, 2010. Xuyun Zhang, Chang Liu, Surya Nepal, Suraj Pandey, and Jinjun Chen, Member, IEEE , “A Privacy Leakage Upper Bound Constraint-Based Approach for Cost-Effective Privacy Preserving of Intermediate Data Sets in Cloud,” IEEE transactions on parallel and distributed systems, vol. 24, no. 6, june 2013. ISSN: 2231-5381 http://www.ijettjournal.org Page 240