International Conference on Global Trends in Engineering, Technology and Management (ICGTETM-2016)
ISSN: 2231-5381, http://www.ijettjournal.org

Big Data Privacy and Security: A Review

Tejashree B. Patil1 and Ashish T. Bhole2
1PG Student, 2Associate Professor
Department of Computer Engineering, SSBT's College of Engineering and Technology, North Maharashtra University, Jalgaon, Maharashtra, India

Abstract- Big data is a term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Because of this scale, privacy and security are among the critical challenges for big data today, since the sensitive information of individuals is under serious threat. Many existing techniques for protecting an individual's sensitive information, such as data anonymization, work by hiding the sensitive data. Data anonymization, however, can fail to preserve the privacy of sensitive data. Mylar, a web application framework that protects confidential information, can be used to protect sensitive information from attackers.

Keywords- Big Data, Security and Privacy, Data Anonymization, Mylar, Confidentiality.

I. INTRODUCTION

Big data is a catchword used to describe a great volume of structured as well as unstructured data. Because the data is so large, it is difficult to process with traditional database and software techniques. In many organizations the data is too large, moves at extremely high speed, or exceeds existing processing capability. Big data [1] is likely to help businesses improve their operations and make faster, more intelligent decisions.

The term big data is believed to have originated with companies that handle web search applications and run queries over large distributed collections of data. Big data may span petabytes or exabytes and comprise a huge number of records about millions of people, relating to sales, health care systems, mobile information, etc. Such data is generally unstructured and is commonly incomplete and inaccessible [2]. Big data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data.

Information sharing is an ultimate goal for all systems involving multiple parties, so data privacy is an important factor in big data. In general, privacy means preventing the disclosure of sensitive information. With big data, a large volume of data of different types is collected, which may contain more personal information about individuals. Preventing the disclosure of this personal and sensitive information is termed big data privacy [3, 4]. A practical and widely adopted technique for data privacy preservation is to anonymize data [5]. Data anonymization refers to hiding identity and sensitive data so that the privacy of an individual is effectively preserved while certain aggregate information can still be exposed to data users for diverse analysis and mining tasks. Considerable attention is therefore given to technologies that handle huge data and keep it secure.

Big data privacy and security is one of the hottest research topics in big data computing and service applications, because research results and privacy-preserving technologies that provide adequate big data privacy are still lacking. Big data privacy requires that security policies be enforced effectively to protect sensitive data. Securing such a huge data set from the inside as well as the outside is also one of the major challenges of big data [6]. Preventing data leakage during processing and protecting against outside attacks requires a trusted, data-centric security model. Mylar is a system that protects data confidentiality in a wide range of web applications against arbitrary server compromises.

A. User Role-Based Methodology

Figure I: Application Scenario with Big Data Mining at the Core.

Based on the stage division in the process of knowledge discovery from data, four different types of users can be identified [1], namely Data Provider, Data Collector, Data Miner, and Decision Maker [7], as shown in Figure I. By differentiating the four user roles, we can explore the privacy issues in data mining in a principled way. All users care about the security of sensitive information, but each user role views the security issue from its own perspective.

1. Data Provider
The data provider is the user who owns data that are used in the mining task. The major concern of a data provider is whether he or she can control the sensitivity of the data provided to others.

2. Data Collector
The data collector collects data from data providers and then publishes the data to the data miner. The collected data may contain individuals' sensitive information, so directly releasing the data to the data miner would violate the data providers' privacy.

3. Data Miner
The data miner applies mining algorithms to the data provided by the data collector.

4. Decision Maker
As shown in Figure I, a decision maker can get the big data mining results directly from the data miner or from some information transmitter. The information transmitter may change the mining results, intentionally or unintentionally, which may cause serious loss to the decision maker.

Each user role has its own privacy concern. The data collector phase needs particular attention.
If the data collector does not take enough precautions before delivering data to the data miner or the public, sensitive information may be disclosed.

II. RELATED WORK

Many techniques have been suggested and implemented for preserving the privacy of large data sets and protecting confidential data, as described next. Unlike Mylar, none of them can support a wide range of complex web applications, compute over encrypted data at the server, or address the problem of securely managing access to shared data [8].

k-anonymity [9] is the property that each record is indistinguishable from at least k-1 other records. With this method, privacy cannot be achieved if all records in an equivalence class share the same sensitive value. Under ℓ-diversity [10], an equivalence class has ℓ-diversity if it contains at least ℓ well-represented values for the sensitive attribute, and a table has ℓ-diversity if every equivalence class of the table does. Wang [11] presented the (α, k)-anonymity model: a view of a table is said to be an (α, k)-anonymization if the modification of the table satisfies both the k-anonymity and α-deassociation properties with respect to the quasi-identifier. Under t-closeness, an equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t [12]. It preserves privacy against homogeneity and background knowledge attacks.

Before Mylar, even web applications designed with security in mind faced a limitation with keyword search. Keyword search is a common operation in web applications, but it is often impractical to run on the client because it would require downloading large amounts of data to the user's machine. While practical cryptographic schemes for keyword search exist, they require that data be encrypted with a single key [13]. This restriction makes it difficult to apply these schemes to web applications that have many users and hence have data encrypted with many different keys [14].
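The single-key restriction described above can be illustrated with a minimal sketch. It assumes a deliberately simplified scheme in which the client derives a deterministic token for each keyword with HMAC-SHA256 and the server only compares opaque tokens. This is not the construction used in [13] or in Mylar, and deterministic tokens reveal which documents share a keyword, so it is an illustration only. All function names and keys are hypothetical.

```python
import hashlib
import hmac


def keyword_token(key: bytes, word: str) -> str:
    """Derive a deterministic search token for a keyword under a secret key."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, documents: dict[str, str]) -> dict[str, set[str]]:
    """Client side: map each document id to the tokens of the words it contains."""
    return {
        doc_id: {keyword_token(key, word) for word in text.split()}
        for doc_id, text in documents.items()
    }


def server_search(index: dict[str, set[str]], token: str) -> list[str]:
    """Server side: compare opaque tokens; the server never sees keys or keywords."""
    return sorted(doc_id for doc_id, tokens in index.items() if token in tokens)


# Hypothetical per-user keys, for illustration only.
alice_key = b"alice-secret-key"
bob_key = b"bob-secret-key"

# Alice indexes her documents under her own key before uploading the tokens.
index = build_index(alice_key, {
    "doc1": "big data privacy review",
    "doc2": "hadoop cluster notes",
})

# A token derived under the same key matches; one derived under another key does not.
hits_same_key = server_search(index, keyword_token(alice_key, "privacy"))
hits_other_key = server_search(index, keyword_token(bob_key, "privacy"))
print(hits_same_key)   # ['doc1']
print(hits_other_key)  # []
```

In this sketch a query token only matches documents indexed under the same key, so a user whose data is encrypted under a different key cannot search Alice's documents without obtaining her key. This is the limitation that multi-key keyword search is meant to remove.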
Several data sharing sites encrypt data in the browser before uploading it to the server, and decrypt it in the browser when a user wants to download the data [15]. The key is either stored in the URL's hash fragment or typed in by the user, and both the key and the data are accessible to any JavaScript code from the page [16]. As a result, an active adversary could serve JavaScript code to a client that leaks the key.

SUNDR [17] uses a special protocol that helps authorized users detect modifications attempted on files by unauthorized users in the network. It protects file system integrity, providing fork consistency in the face of a malicious server. SPORC [18] and Depot [19] extend SUNDR's design to build applications on top of an encrypted server. These systems do not allow an application to perform server-side computation, such as Mylar's server-side keyword search. Furthermore, with SPORC, the application logic is determined at runtime, based on the URL that the user visits.

CryptDB [20] aims to protect data confidentiality by executing SQL queries over encrypted data on the DBMS server. While CryptDB therefore protects against attacks on the database server, it provides no guarantees for users logged in during an attack on the application server. CryptDB also cannot compute over data encrypted with different keys, as Mylar's multi-key keyword search does.

ShadowCrypt [21] allows users to transparently switch to encrypted input/output for text-based web applications. ShadowCrypt is designed to be secure against potentially malicious or compromised web applications. It aims to ensure that any data entered into a secure input widget that encrypts the data with key k is visible only to principals with knowledge of the key k. ShadowCrypt runs as a browser extension, replacing input elements in a page with secure, isolated shadow inputs and encrypted text with secure, isolated cleartext. ShadowCrypt does not aim to protect against denial-of-service attacks by the application.

Table I: Comparing Privacy Preservation Methods

Sr. No. | Author Name | Proposed Concept | Limitation
1 | Yun Pan et al. [9] | k-anonymity | Does not prevent attribute leakage attacks.
2 | Ninghui Li et al. [12] | t-closeness | Does not preserve privacy against identity disclosure attacks.
3 | Benjamin et al. [10] | ℓ-diversity | Fails to preserve privacy against skewness and similarity attacks.
4 | Qiang Wang et al. [11] | (α, k)-anonymity model | Does not address identity disclosure attacks.
5 | A. J. Feldman et al. [18] | SPORC | Does not allow server-side computation.
6 | Raluca Ada Popa et al. [20] | CryptDB | Cannot handle a request if the data is encrypted with different keys.
7 | Warren He et al. [21] | ShadowCrypt | Does not aim to protect against denial-of-service attacks by the application.

III. PROPOSED WORK

In big data, security and privacy are major challenges when sharing data and maintaining ever-growing public databases. To prevent the leakage of sensitive data, Mylar is a framework that protects confidentiality against attackers.

A. Problem Statement

Big data refers to massive amounts of digital information. The big data phenomenon arises from the increasing amount of data collected from various sources, including the internet. Because of its large scale, privacy and security are among the critical challenges for big data today, since the sensitive information of individuals is under serious threat. Existing anonymization methods aim to protect the privacy of an individual's sensitive information, but data anonymization fails to fully account for the privacy of sensitive data. The privacy problem can be addressed by computing with encrypted data using Mylar.
Mylar protects data confidentiality against attackers, which prevents the loss of confidential information.

B. Objective

The objectives are:

1. Authentication: The process of uniquely identifying the clients of applications and services. These might be end users, other services, processes, or computers.

2. Authorization: The process that governs the resources and operations that the authenticated client is permitted to access. Resources include files, databases, tables, rows, and so on, together with system-level resources such as registry keys and configuration data.

3. Auditing: Effective auditing and logging is the key to non-repudiation. Non-repudiation guarantees that a user cannot deny performing an operation or initiating a transaction.

4. Confidentiality: Confidentiality, also referred to as privacy, is the process of making sure that data remains private and confidential, and that it cannot be viewed by unauthorized users or by eavesdroppers who monitor the flow of traffic across a network. Encryption is frequently used to enforce confidentiality.

5. Integrity: Integrity is the guarantee that data is protected from accidental or deliberate (malicious) modification.

6. Availability: From a security perspective, availability means that systems remain available to legitimate users, while other users cannot access the application.

C. Motivation

Hackers try to access users' sensitive data. Huge amounts of information are collected on servers, and this information contains users' sensitive data. How can users feel that their data is safe there? One approach is to try to prevent attackers from breaking into servers. However, web applications depend on servers to store and process confidential information, and anyone who gains access to the server can obtain all of the data stored there. Mylar is a framework that protects data confidentiality against such attackers.

IV. ARCHITECTURE OF MYLAR

The architecture of Mylar is shown in Figure II. Mylar embraces the trend towards client-side web applications. Mylar's design is suitable for platforms that:

1. Enable client-side computation on data received from the server.
2. Allow the client to intercept data going to the server and data coming from the server.
3. Separate application code from data, so that the HTML pages supplied by the server are static [8].

Figure II: Mylar Architecture

The Mylar architecture consists of the following five components:

1. Browser Extension
It is responsible for verifying that the client-side code of a web application that is loaded from the server has not been tampered with.

2. Client-Side Library
It intercepts data sent to and from the server, and encrypts or decrypts that data. Each user has a private-public key pair. The client-side library stores the private key of the user at the server, encrypted with the user's password. When the user logs in, the client-side library fetches and decrypts the user's private key. For shared data, Mylar's client creates separate keys that are also stored at the server in encrypted form.

3. Server-Side Library
It performs computation over encrypted data at the server. Specifically, Mylar supports keyword search over encrypted data, because many applications use keyword search.

4. Identity Provider (IDP)
For some applications, Mylar needs a trusted identity provider service (IDP) to verify that a given public key belongs to a particular username. An application needs the IDP if it has no trusted way of verifying the users who create accounts and it allows users to choose whom to share data with. The IDP helps Mylar perform this verification by signing the user's public key and username. The IDP does not store per-application state, and Mylar contacts the IDP only when a user first creates an account in an application; afterwards, the application server stores the certificate from the IDP [8].

5. Hadoop
Hadoop mainly consists of two components, HDFS (Hadoop Distributed File System) and MapReduce. HDFS is used for storing structured (relational) data and unstructured data (files, multimedia). HDFS has two components: the NameNode, which stores the metadata, and the DataNode, which stores the actual data. HDFS thus stores file system metadata and application data separately. MapReduce is a parallel processing framework that processes large volumes of data in parallel and provides high performance for processing the data stored in HDFS. It operates through two main components, the JobTracker and the TaskTracker, which together control jobs and give high performance in data processing.

V. CONCLUSION

Computing with encrypted data, using systems like Mylar, will become one of the primary strategies for protecting confidential information. Mylar stores sensitive data encrypted on the server and decrypts that data only in the user's browser. Mylar increases the security of data in the database while that data is being searched in big data settings, and it ensures that client-side application code is authentic even if the server is malicious. Mylar introduces a cryptographic scheme to perform keyword search at the server over data encrypted with different keys.

REFERENCES

[1] R. Agrawal and R. Srikant, "Privacy Preserving Data Mining", In Proceedings of the ACM SIGMOD Conference, 2000.
[2] P. Kamakshi, "Survey On Big Data and Related Privacy Issues", IJRET, 2014.
[3] Dennis D. Hirsch, "The Glass House Effect: Big Data, the New Oil, and the Power of Analogy", Maine Law Review 66 (2014).
[4] Avita Katal, Mohammad Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and Good practices", In Contemporary Computing (IC3), 2013 Sixth International Conference on, pp. 404-409, IEEE, 2013.
[5] Salini S., Sreetha V. Kumar, Neevan R., "Survey on Data Privacy in Big Data with K-Anonymity", International Journal of Innovative Research in Computer and Communication Engineering, Volume 2, Issue 5, May 2015.
[6] Krishna Mohan Pd Shrivastva, M. A. Rizvi, Shailendra Singh, "Big Data Privacy Based On Differential Privacy a Hope for Big Data", IEEE, 2014.
[7] Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, and Yong Ren, "Information Security in Big Data: Privacy and Data Mining", Volume 2, IEEE, October 20, 2014.
[8] Raluca Ada Popa, Emily Stark, Jonas Helfer, Steven Valdez, Nickolai Zeldovich, M. Frans Kaashoek, and Hari Balakrishnan (MIT CSAIL and Meteor Development Group), "Building web applications on top of encrypted data using Mylar".
[9] Yun Pan, Xiao-ling Zhu, Ting-gui Chen, "Research on Privacy Preserving on K-anonymity", Journal of Software, 2012.
[10] Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu and Philip S. Yu, "Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques", ISBN: 978-1-4200-9148-9, 2010.
[11] Qiang Wang, Zhiwei Xu and Shengzhi Qu, "An Enhanced K-Anonymity Model against Homogeneity Attack", Journal of Software, Vol. 6, No. 10, October 2011, pp. 1945-1952.
[12] Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity", International Conference on Data Engineering, 2007, pp. 106-115.
[13] A. Arasu, S. Blanas, K. Eguro, R. Kaushik, D. Kossmann, R. Ramamurthy, and R. Venkatesan, "Orthogonal security with Cipherbase", In Proceedings of the 6th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, Jan. 2013.
[14] S. Bajaj and R. Sion, "TrustedDB: a trusted hardware based database with privacy and data confidentiality", In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 205-216, Athens, Greece, June 2011.
[15] G. Ateniese, K. Fu, M. Green, and S. Hohenberger, "Improved proxy re-encryption schemes with applications to secure distributed storage", In Proceedings of the 13th Annual Network and Distributed System Security Symposium, San Diego, CA, Feb. 2006.
[16] D. Akhawe, P. Saxena, and D. Song, "Privilege separation in HTML5 applications", In Proceedings of the 21st Usenix Security Symposium, Bellevue, WA, Aug. 2012.
[17] J. Li, M. Krohn, D. Mazieres, and D. Shasha, "Secure untrusted data repository (SUNDR)", In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), pages 91-106, San Francisco, CA, Dec. 2004.
[18] A. J. Feldman, W. P. Zeller, M. J. Freedman, and E. W. Felten, "SPORC: Group collaboration using untrusted cloud resources", In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), Vancouver, Canada, Oct. 2010.
[19] P. Mahajan, S. Setty, S. Lee, A. Clement, L. Alvisi, M. Dahlin, and M. Walfish, "Depot: Cloud storage with minimal trust", In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), Vancouver, Canada, Oct. 2010.
[20] Raluca Ada Popa, Catherine M. S. Redfield, Nickolai Zeldovich, and Hari Balakrishnan, "CryptDB: Protecting Confidentiality with Encrypted Query Processing", ACM, 2011.
[21] Warren He, Devdatta Akhawe, Sumeet Jain, "ShadowCrypt: Encrypted Web Applications for Everyone", ACM, November 2014.