International Conference on Global Trends in Engineering, Technology and Management (ICGTETM-2016)
Big Data Privacy and Security: A Review
Tejashree B. Patil1 and Ashish T. Bhole2
1PG Student, 2Associate Professor
Department of Computer Engineering, SSBTs College of Engineering and Technology,
North Maharashtra University, Jalgaon, Maharashtra, India
Abstract- Big data is a term for any collection of data sets so large
and complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications. Because of this large scale, privacy and security are
among the critical challenges for big data today, posing a serious
threat to the protection of individuals' sensitive information. Many
existing techniques for protecting the privacy of individuals'
sensitive information, such as anonymization, work by hiding the
sensitive data. However, data anonymization fails to fully preserve
the privacy of sensitive data. By using Mylar, sensitive information
can be protected from attackers. Mylar is a web application framework
that protects confidential information.
Keywords- Big Data, Security and Privacy, Data Anonymization, Mylar,
Confidentiality.
I. INTRODUCTION
Big data is a catchword used to describe a great volume of structured
as well as unstructured data. Because the data is so large, it is
difficult to process with traditional database and software
techniques. In many organizations the data is too large, moves at
extremely high speed, or goes beyond existing processing capability.
Big data [1] is likely to help businesses improve their operations
and make faster and more intelligent decisions.
The term big data is believed to have originated with companies
handling web search applications, which had to run queries over large
distributed collections of data. Big data may range into petabytes or
exabytes, consisting of a huge number of records about millions of
people related to sales, health care systems, mobile information,
etc. Such data is generally unstructured and is commonly incomplete
and inaccessible [2].
Big data starts with large-volume, heterogeneous, autonomous sources
with distributed and decentralized control, and seeks to explore
complex and evolving relationships among data. Information sharing is
an ultimate goal for all systems involving multiple parties, so data
privacy is an important factor in big data. In general, privacy means
preventing the disclosure of sensitive information. In the case of
big data, a large volume of data of different types is collected,
which may contain more personal information about individuals.
Preventing the disclosure of all this personal and sensitive
information is termed big data privacy [3, 4]. A practical and
widely adopted technique for data privacy preservation is to
anonymize data [5]. Data anonymization refers to hiding identity and
sensitive data so that the privacy of an individual is effectively
preserved while certain aggregate information can still be exposed to
data users for diverse analysis and mining tasks. Considerable
attention is therefore given to technologies that handle huge data
and keep it secure.
Big data privacy and security is one of the hottest research topics
in big data computing and service applications, because research
results and mature privacy-preserving technologies and solutions that
provide adequate big data privacy are still lacking. Big data privacy
requires security policies to be enforced effectively to protect
sensitive data. Securing such a huge data set against both inside and
outside threats is also one of the major challenges of big data [6].
Preventing data leakage during processing and protecting against
outside attacks requires a trusted, data-centric security model.
Mylar is a system that protects data confidentiality in a wide range
of web applications against arbitrary server compromises.
ISSN: 2231-5381 http://www.ijettjournal.org Page 201
A. User Role-Based Methodology
Figure I: Application Scenario with Big Data Mining at the Core.
Based on the stages of the knowledge discovery from data process,
four different types of users can be identified [1], namely Data
Provider, Data Collector, Data Miner and Decision Maker [7], as shown
in Figure I. By differentiating the four user roles, we can explore
the privacy issues in data mining in a principled way. All users care
about the security of sensitive information, but each user role views
the security issue from its own perspective.
1. Data Provider
The data provider is the user who owns data that the mining task
needs. The provider's major concern is whether he or she can control
the sensitivity of the data provided to others.
2. Data Collector
The data collector is the user who collects data from data providers
and then publishes the data to the data miner. The data collected
from data providers may contain individuals' sensitive information.
Directly releasing the data to the data miner would violate the data
providers' privacy.
3. Data Miner
The data miner applies mining algorithms to the data provided by the
data collector.
4. Decision Maker
As shown in Figure I, a decision maker can get the big data mining
results directly from the data miner, or from some information
transmitter. The information transmitter may change the mining
results intentionally or unintentionally, which may cause serious
loss to the decision maker.
Each user role has its own privacy concern. The data collector phase
needs particular attention: if the data collector does not take
enough precautions before delivering data to the data miner or the
public, sensitive information may be disclosed.
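As an illustration of the kind of precaution a data collector might
take before release, the following minimal Python sketch removes the
direct identifier and generalizes two quasi-identifiers. The field
names (name, age, zip, disease), the bucket width and the number of
retained ZIP digits are assumptions made for this example, not values
taken from the reviewed work.

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def mask_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the first `keep` characters of a postal code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def anonymize(record: dict) -> dict:
    """Drop the direct identifier and generalize the quasi-identifiers."""
    return {
        "age": generalize_age(record["age"]),
        "zip": mask_zip(record["zip"]),
        "disease": record["disease"],  # sensitive attribute kept for mining
    }


# Hypothetical record as received from a data provider.
raw = {"name": "Patient A", "age": 34, "zip": "425001", "disease": "flu"}
released = anonymize(raw)
print(released)  # {'age': '30-39', 'zip': '425***', 'disease': 'flu'}
```

Generalization alone does not guarantee privacy; whether the released
table is safe depends on how many records end up sharing the same
generalized values.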
II. RELATED WORK
Many techniques have been suggested and implemented for privacy
preservation of large data sets to protect confidential data, as
described next. Unlike Mylar, none of them can support a wide range
of complex web applications, compute over encrypted data at the
server, or address the problem of securely managing access to shared
data [8].
k-anonymity [9] is the property that each record is
indistinguishable from at least k-1 other records. With this method,
privacy cannot be achieved if all records in an equivalence class
share the same sensitive value. Under ℓ-diversity [10], an
equivalence class has ℓ-diversity if it contains at least ℓ
well-represented values for the sensitive attribute, and a table has
ℓ-diversity if every equivalence class does. Wang [11] presented the
(α,k)-anonymity model: a view of a table is said to be an
(α,k)-anonymization if the modification of the table satisfies both
the k-anonymity and α-deassociation properties with respect to the
quasi-identifier. Under t-closeness, an equivalence class is said to
have t-closeness if the distance between the distribution of a
sensitive attribute in this class and the distribution of the
attribute in the whole table is no more than a threshold t [12]. It
preserves privacy against homogeneity and background knowledge
attacks.
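The definitions above can be checked mechanically on a small table.
The sketch below is illustrative only: the table, the column names
and the helper function names are hypothetical. ℓ-diversity is
interpreted in its simplest "distinct values" form, and the
t-closeness check uses the variational distance, which assumes a
categorical attribute whose values are all equally far apart; [12]
defines the distance more generally.

```python
from collections import Counter, defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return classes


def is_k_anonymous(records, quasi_identifiers, k):
    """Every equivalence class must contain at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """Every class must contain at least l distinct sensitive values."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len({r[sensitive] for r in group}) >= l
               for group in classes.values())


def max_class_distance(records, quasi_identifiers, sensitive):
    """Largest variational distance between a class and the whole table."""
    overall = Counter(r[sensitive] for r in records)
    total = len(records)
    worst = 0.0
    for group in equivalence_classes(records, quasi_identifiers).values():
        local = Counter(r[sensitive] for r in group)
        distance = 0.5 * sum(abs(local[v] / len(group) - overall[v] / total)
                             for v in overall)
        worst = max(worst, distance)
    return worst


def is_t_close(records, quasi_identifiers, sensitive, t):
    """Every class distribution must be within distance t of the table's."""
    return max_class_distance(records, quasi_identifiers, sensitive) <= t


# Hypothetical, already generalized table.
table = [
    {"zip": "425**", "age": "20-29", "disease": "flu"},
    {"zip": "425**", "age": "20-29", "disease": "flu"},
    {"zip": "425**", "age": "30-39", "disease": "flu"},
    {"zip": "425**", "age": "30-39", "disease": "cancer"},
]
qi = ["zip", "age"]

print(is_k_anonymous(table, qi, k=2))                       # True
print(is_distinct_l_diverse(table, qi, "disease", l=2))     # False
print(max_class_distance(table, qi, "disease"))             # 0.25
```

The example table is 2-anonymous, yet the first equivalence class
contains only "flu", so anyone known to be in that class has their
diagnosis revealed. This is the homogeneity problem that ℓ-diversity
and t-closeness are meant to address.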
Even in securely designed web applications that predate Mylar,
keyword search poses a problem. It is a common operation in web
applications, but it is often impractical to run on the client
because it would require downloading large amounts of data to the
user’s machine. While practical cryptographic schemes for keyword
search exist, they require that data be encrypted with a single key
[13]. This restriction makes it difficult to apply these schemes to
web applications that have many users and hence have data encrypted
with many different keys [14]. Several data sharing sites encrypt
data in the browser before uploading it to the server, and decrypt it
in the browser when a user wants to download the data [15]. The key
is either stored in the URL’s hash fragment or typed in by the user,
and both the key and the data are accessible to any JavaScript code
from the page [16]. As a result, an active adversary could serve
JavaScript code to a client that leaks the key.
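The single-key restriction mentioned above can be illustrated with a
deliberately simplified sketch. It is not the scheme of [13] and not
Mylar's multi-key scheme: it only uses keyed hashes (HMAC) as search
tokens, so the server can match a token without seeing the keyword.
The keys, document contents and function names are hypothetical, and
deterministic tokens like these reveal which documents share a word,
which real schemes try to limit.

```python
import hashlib
import hmac


def search_token(key: bytes, word: str) -> bytes:
    """Deterministic token for a keyword under one secret key."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).digest()


def build_index(key: bytes, documents: dict) -> dict:
    """Client side: map each document id to the tokens of its words."""
    return {
        doc_id: {search_token(key, word) for word in text.split()}
        for doc_id, text in documents.items()
    }


def server_search(index: dict, token: bytes) -> list:
    """Server side: match tokens without knowing keys or plaintext words."""
    return sorted(doc_id for doc_id, tokens in index.items() if token in tokens)


alice_key = b"alice-secret-key"   # hypothetical per-user keys
bob_key = b"bob-secret-key"

alice_docs = {
    "doc1": "quarterly budget meeting",
    "doc2": "budget review notes",
    "doc3": "travel plans",
}
index = build_index(alice_key, alice_docs)  # uploaded to the server

# A token made with the same key finds the matching documents.
print(server_search(index, search_token(alice_key, "budget")))  # ['doc1', 'doc2']

# A token made with a different user's key matches nothing.
print(server_search(index, search_token(bob_key, "budget")))    # []
```

The second query shows why such schemes fit poorly when documents
belong to many users: a searcher would need one token per key, and
therefore access to every key involved.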
SUNDR [17] uses a special protocol that helps authorized users detect
modifications that unauthorized users have attempted on files in the
network. It protects file system integrity, providing fork
consistency in the face of a malicious server. SPORC [18] and Depot
[19] extend SUNDR’s design to build applications on top of an
encrypted server. These systems do not allow an application to
perform server-side computation, such as Mylar’s server-side keyword
search. Furthermore, with SPORC, the application logic is determined
at runtime, based on the URL that the user visits.
CryptDB [20] aims to protect data confidentiality by executing SQL
queries over encrypted data on the DBMS server. While CryptDB
protects against attacks on the database server, it provides no
guarantees for users logged in during an attack on the application
server. CryptDB also cannot compute over data encrypted with
different keys, as Mylar’s multi-key keyword search does.
ShadowCrypt [21] allows users to transparently switch to encrypted
input/output for text-based web applications. ShadowCrypt is designed
to be secure against potentially malicious or compromised web
applications. It aims to ensure that any data entered into a secure
input widget that encrypts the data with key k is visible only to
principals with knowledge of the key k. ShadowCrypt runs as a browser
extension, replacing input elements in a page with secure, isolated
shadow inputs and encrypted text with secure, isolated cleartext.
ShadowCrypt does not aim to protect against denial-of-service attacks
by the application.
Table I: Comparing Privacy Preservation Methods
Sr. No. | Author Name | Proposed Concept | Limitation
1 | Yun Pan et al. [9] | k-anonymity | Does not prevent attribute leakage attacks.
2 | Ninghui Li et al. [12] | t-closeness | Does not preserve privacy against identity disclosure attacks.
3 | Benjamin C. M. Fung et al. [10] | ℓ-diversity | Fails to preserve privacy against skewness and similarity attacks.
4 | Qiang Wang et al. [11] | (α,k)-anonymity model | Does not address identity disclosure attacks.
5 | A. J. Feldman et al. [18] | SPORC | Does not allow server-side computation.
6 | Raluca Ada Popa et al. [20] | CryptDB | Cannot handle requests over data encrypted with different keys.
7 | Warren He et al. [21] | ShadowCrypt | Does not aim to protect against denial-of-service attacks by the application.
III. PROPOSED WORK
In big data, a major challenge is the security and privacy issues
that arise when sharing data and from ever-growing public databases.
To prevent the leakage of sensitive data, Mylar is a framework that
protects confidentiality against attackers.
A. Problem Statement
Big data refers to massive amounts of digital information. The big
data phenomenon arises from the increasing amount of data collected
from various sources, including the internet. Due to its large scale,
privacy and security are among the critical challenges for big data
today, posing a serious threat to the protection of individuals'
sensitive information. Existing anonymization methods aim to protect
the privacy of individuals' sensitive information, but data
anonymization fails to fully preserve the privacy of sensitive data.
The privacy problem can be addressed by computing on encrypted data
using Mylar. Mylar protects data confidentiality against attackers,
which prevents the loss of confidential information.
B. Objective
Objectives are:
1. Authentication: It is the process of uniquely
identifying the clients of your applications
and services. These might be end users, other
services, processes, or computers.
2. Authorization: It is the process that governs
the resources and operations that the
authenticated client is permitted to access.
Resources include files, databases, tables,
rows, and so on, together with system-level
resources such as registry keys and
configuration data.
3. Auditing: Effective auditing and logging is
the key to non-repudiation. Non-repudiation
guarantees that a user cannot deny
performing an operation or initiating a
transaction.
4. Confidentiality: Confidentiality, also referred
to as privacy, is the process of making sure
that data remains private and confidential,
and that it cannot be viewed by unauthorized
users or eavesdroppers who monitor the flow
of traffic across a network. Encryption is
frequently used to enforce confidentiality.
5. Integrity: Integrity is the guarantee that data
is protected from accidental or deliberate
(malicious) modification.
6. Availability: From a security perspective,
availability means that systems remain
available for legitimate users, so that attackers
cannot make the application inaccessible to
other users.
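As a small illustration of the integrity objective, the following
sketch attaches a message authentication code (MAC) to a message so
that a receiver holding the same key can detect modification. It uses
Python's standard hmac module; the key and the messages are
hypothetical, and the sketch says nothing about how Mylar itself
protects integrity.

```python
import hashlib
import hmac


def make_tag(key: bytes, data: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the data."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify(key: bytes, data: bytes, received_tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, data), received_tag)


shared_key = b"hypothetical-shared-secret"
message = b"transfer 100 to account 42"
tag = make_tag(shared_key, message)

print(verify(shared_key, message, tag))                        # True
print(verify(shared_key, b"transfer 900 to account 42", tag))  # False
```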
C. Motivation
Attackers try to access users' sensitive data. Huge amounts of
information are being collected on servers, and this information
contains users' sensitive data. How can users feel confident that
their data is safe there? One approach is to try to prevent attackers
from breaking into servers. However, web applications depend on
servers to store and process confidential information, so anyone who
gains access to the server can obtain all of the data stored there.
Mylar is a framework that protects data confidentiality against such
attackers.
IV. ARCHITECTURE OF MYLAR
The architecture of Mylar is shown in Figure II. Mylar embraces the
trend towards client-side web applications. Mylar's design is
suitable for platforms that:
1. Enable client-side computation on data received from the server.
2. Allow the client to intercept data going to the server and data
coming from the server.
3. Separate application code from data, so that the HTML pages
supplied by the server are static [8].
Figure II: Mylar Architecture
The Mylar architecture consists of the following five components:
1. Browser Extension
It is responsible for verifying that the client-side code of a web
application that is loaded from the server has not been tampered
with.
2. Client-Side Library
It intercepts data sent to and from the server, and encrypts or
decrypts that data. Each user has a private-public key pair. The
client-side library stores the private key of the user at the server,
encrypted with the user's password. When the user logs in, the
client-side library fetches and decrypts the user's private key. For
shared data, Mylar's client creates separate keys that are also
stored at the server in encrypted form.
3. Server-Side Library
It performs computation over encrypted data at the server.
Specifically, Mylar supports keyword search over encrypted data,
because many applications use keyword search.
4. Identity Provider (IDP)
For some applications, Mylar needs a trusted identity provider
service (IDP) to verify that a given public key belongs to a
particular username. An application needs the IDP if the application
has no trusted way of verifying the users who create accounts, and
the application allows users to choose whom to share data with. The
IDP helps Mylar perform this verification by signing the user's
public key and username. The IDP does not store per-application
state, and Mylar contacts the IDP only when a user first creates an
account in an application; afterwards, the application server stores
the certificate from the IDP [8].
5. Hadoop
Hadoop mainly consists of two components: HDFS (Hadoop Distributed
File System) and MapReduce. HDFS is used for storing structured data
(relational data) and unstructured data (files, multimedia). HDFS has
two components: the NameNode, which stores the metadata, and the
DataNode, which stores the actual data. HDFS thus stores file system
metadata and application data separately. MapReduce is a parallel
processing framework that processes large volumes of data in parallel
and provides high performance when processing the data stored in
HDFS. It operates through two main components, the Job Tracker and
the Task Tracker, which together control the job and deliver high
performance in data processing.
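The MapReduce programming model described above can be sketched in a
few lines of plain Python. This is a single-process illustration of
the map, shuffle and reduce steps using the usual word-count example;
it does not use the Hadoop API, HDFS, the Job Tracker or the Task
Tracker, and the function names are chosen for this example only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["big data privacy", "big data security"])
print(counts)  # {'big': 2, 'data': 2, 'privacy': 1, 'security': 1}
```

In a real cluster the map and reduce calls run on different nodes
over blocks of data stored in HDFS. Because each map call and each
reduce call is independent of the others, the work can be distributed
in this way.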
V. CONCLUSION
Computing with encrypted data, using systems like Mylar, is expected
to become one of the primary strategies for protecting confidential
information. Mylar stores sensitive data encrypted on the server and
decrypts that data only in the user’s browser. Mylar increases the
security of the data in the database while big data is being
searched, and it ensures that client-side application code is
authentic even if the server is malicious. Mylar introduces a
cryptographic scheme to perform keyword search at the server over
data encrypted with different keys.
REFERENCES
[1] Agrawal R., Srikant R., "Privacy Preserving Data Mining", In
Proceedings of the ACM SIGMOD Conference, 2000.
[2] P. Kamakshi, "Survey On Big Data and Related Privacy Issues",
IJRET, 2014.
[3] Hirsch, Dennis D., "The Glass House Effect: Big Data, the New
Oil, and the Power of Analogy", Maine Law Review 66 (2014).
[4] Katal, Avita, Mohammad Wazid, and R. H. Goudar, "Big data:
Issues, challenges, tools and Good practices", In Contemporary
Computing (IC3), 2013 Sixth International Conference on, pp. 404-409,
IEEE, 2013.
[5] Salini S., Sreetha V. Kumar, Neevan R., "Survey on Data Privacy
in Big Data with K-Anonymity", International Journal of Innovative
Research in Computer and Communication Engineering, Volume 2, Issue
5, May 2015.
[6] Krishna Mohan Pd Shrivastva, M A Rizvi, Shailendra Singh, "Big
Data Privacy Based On Differential Privacy a Hope for Big Data",
IEEE, 2014.
[7] Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, and Yong Ren,
"Information Security in Big Data: Privacy and Data Mining", Volume
2, IEEE, October 20, 2014.
[8] Raluca Ada Popa, Emily Stark, Jonas Helfer, Steven Valdez,
Nickolai Zeldovich, M. Frans Kaashoek, and Hari Balakrishnan, MIT
CSAIL and Meteor Development Group, "Building web applications on top
of encrypted data using Mylar".
[9] Yun Pan, Xiao-ling Zhu, Ting-gui Chen, "Research on Privacy
Preserving on K-anonymity", Journal of Software, 2012.
[10] Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu and Philip S. Yu,
"Introduction to Privacy-Preserving Data Publishing: Concepts and
Techniques", ISBN: 978-1-4200-9148-9, 2010.
[11] Qiang Wang, Zhiwei Xu and Shengzhi Qu, "An Enhanced K-Anonymity
Model against Homogeneity Attack", Journal of Software, Vol. 6, No.
10, October 2011, pp. 1945-1952.
[12] Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian,
"t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity",
International Conference on Data Engineering, 2007, pp. 106-115.
[13] A. Arasu, S. Blanas, K. Eguro, R. Kaushik, D. Kossmann, R.
Ramamurthy, and R. Venkatesan, "Orthogonal security with Cipherbase",
In Proceedings of the 6th Biennial Conference on Innovative Data
Systems Research (CIDR), Asilomar, CA, Jan. 2013.
[14] S. Bajaj and R. Sion, "TrustedDB: a trusted hardware based
database with privacy and data confidentiality", In Proceedings of
the 2011 ACM SIGMOD International Conference on Management of Data,
pages 205–216, Athens, Greece, June 2011.
[15] G. Ateniese, K. Fu, M. Green, and S. Hohenberger, "Improved
proxy re-encryption schemes with applications to secure distributed
storage", In Proceedings of the 13th Annual Network and Distributed
System Security Symposium, San Diego, CA, Feb. 2006.
[16] D. Akhawe, P. Saxena, and D. Song, "Privilege separation in
HTML5 applications", In Proceedings of the 21st Usenix Security
Symposium, Bellevue, WA, Aug. 2012.
[17] J. Li, M. Krohn, D. Mazieres, and D. Shasha, "Secure untrusted
data repository (SUNDR)", In Proceedings of the 6th Symposium on
Operating Systems Design and Implementation (OSDI), pages 91–106, San
Francisco, CA, Dec. 2004.
[18] A. J. Feldman, W. P. Zeller, M. J. Freedman, and E. W. Felten,
"SPORC: Group collaboration using untrusted cloud resources", In
Proceedings of the 9th Symposium on Operating Systems Design and
Implementation (OSDI), Vancouver, Canada, Oct. 2010.
[19] P. Mahajan, S. Setty, S. Lee, A. Clement, L. Alvisi, M. Dahlin,
and M. Walfish, "Depot: Cloud storage with minimal trust", In
Proceedings of the 9th Symposium on Operating Systems Design and
Implementation (OSDI), Vancouver, Canada, Oct. 2010.
[20] Raluca Ada Popa, Catherine M. S. Redfield, Nickolai Zeldovich,
and Hari Balakrishnan, "CryptDB: Protecting Confidentiality with
Encrypted Query Processing", ACM, 2011.
[21] Warren He, Devdatta Akhawe, Sumeet Jain, "ShadowCrypt: Encrypted
Web Applications for Everyone", ACM, November 2014.